Remove taxonomies from Hugo sitemap

Most blogging systems offer a way to create user-defined groupings of content called taxonomies. While I like the feature, I do not want them showing up in search engines.

Update

2017-09-10: After migrating to GitLab CI for building and publishing, I now use XMLStarlet in the CI pipeline to achieve the same outcome.
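I have not reproduced the CI job here, but the equivalent cleanup fits in a single XMLStarlet command. This is a minimal sketch, assuming XMLStarlet is available in the CI image and the sitemap uses the standard sitemap namespace:

xmlstarlet ed -L \
  -N s=http://www.sitemaps.org/schemas/sitemap/0.9 \
  -d "//s:url[contains(s:loc, '/categories/') or contains(s:loc, '/tags/')]" \
  public/sitemap.xml

Here ed -L edits the file in place, -N binds a prefix for the sitemap namespace, and -d deletes every node matching the XPath expression.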

Motivation

When I was using WordPress, it was easy to find a plugin that could omit taxonomies from the sitemap. With Hugo, I could not find a built-in way to keep them out of search engines. I also found that, in some cases, the taxonomy links ranked higher than the actual content pages. Since my posts and pages are already indexed, there is no value in having the taxonomies indexed as well, so I had to remove them. One way is to add a post-build step.

Before adding post-build step

My batch file for generating Hugo's public output folder was:

rd /s /q public
md public
hugo.exe

The above cleans the output folder, recreates it, and finally publishes into it. If you do not like the idea of deleting and recreating the public folder, the first two lines are optional.

Adding post-build step

I wrote a simple PowerShell script, clean-sitemap.ps1, that strips the taxonomy entries from the sitemap.xml file.

# Load the generated sitemap as an XML document
[xml]$xml = Get-Content public\sitemap.xml

# The sitemap uses a default namespace, so register it for XPath queries
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace("ns", $xml.DocumentElement.NamespaceURI)

# Match any <url> entry whose <loc> points at a category or tag page
$xpathSelectCriterion = "//ns:url[contains(ns:loc, '/categories/') or contains(ns:loc, '/tags/')]"

# Remove matching entries one at a time until none are left
$node = $xml.SelectSingleNode($xpathSelectCriterion, $ns)
while ($null -ne $node) {
    $node.ParentNode.RemoveChild($node) | Out-Null
    $node = $xml.SelectSingleNode($xpathSelectCriterion, $ns)
}

# Write the cleaned sitemap back to disk
$xml.Save("public\sitemap.xml")
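For reference, the XPath above matches sitemap entries like the following (the URL and date are illustrative):

<url>
  <loc>https://www.leowkahman.com/tags/hugo/</loc>
  <lastmod>2017-01-01T00:00:00+00:00</lastmod>
</url>

Any <url> element whose <loc> contains /categories/ or /tags/ is removed; entries for regular posts and pages are untouched.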

Then I added a line at the end of my batch file to call the script above:

rd /s /q public
md public
hugo.exe
powershell -NoProfile -ExecutionPolicy Bypass -File .\clean-sitemap.ps1

Why PowerShell

I tried Windows batch commands first and found them too cumbersome for XML manipulation. I wanted something I could write in the shortest possible time, and an interpreted language meant I would not have to commit binary files to source control.

Robots.txt

You need to tell search bots that crawling the categories and tags taxonomies is disallowed.

User-agent: *
Disallow: /categories/
Disallow: /tags/
Sitemap: https://www.leowkahman.com/sitemap.xml

Remember to point the Sitemap line at your own sitemap.xml.

Results

I had to wait a few days for Google to recrawl the site and purge the taxonomy pages. To verify that Google has removed them from its search results, search for site:www.leowkahman.com (replacing the domain with your own); the taxonomy URLs should no longer appear.