The Hugo static site generator produces a sitemap.xml that includes priority zero URLs. These are the taxonomies, such as tags and categories. To prevent them from appearing in Google Search results, GitLab CI can be configured to remove the unwanted XML nodes (priority 0) from sitemap.xml.
I needed to remove the taxonomies from sitemap.xml so that they do not appear in Google Search results. Previously, I built and published my blog from my laptop, and I had a PowerShell script for removing Hugo taxonomies. After migrating to GitLab CI, I needed a new workaround.
You should already have some experience in setting up GitLab CI.
Otherwise, please read my earlier post Hugo Static Site Generator with CI Deployment using GitLab.
The high-level idea is simple: let Hugo generate sitemap.xml as-is, then remove the priority 0 URLs from it.
There are several ways to do this, for example using XMLStarlet to remove XML nodes selected by an XPath expression, or using regular expressions. I decided to go with XMLStarlet, as I feel an XML-aware tool is easier to work with for XML.
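To make the idea concrete, here is a minimal Python sketch of the same operation: parse the sitemap, drop every url entry whose priority is exactly 0, and serialise the result. It is only an illustration and not what my pipeline runs. The function name remove_priority_zero and the example.com sample data are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (the same URI appears in the CI command).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def remove_priority_zero(xml_text: str) -> str:
    """Return the sitemap with every <url> whose <priority> text is '0' removed."""
    # Register the namespace as the default so the output has no ns0: prefixes.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(xml_text)
    ns = {"x": SITEMAP_NS}
    # Copy the matches into a list first: removing children while
    # iterating over the live tree can skip elements.
    for url in list(root.findall("x:url", ns)):
        priority = url.find("x:priority", ns)
        if priority is not None and priority.text == "0":
            root.remove(url)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data, not taken from a real sitemap.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/post/hello/</loc></url>"
    "<url><loc>https://example.com/tags/</loc><priority>0</priority></url>"
    "<url><loc>https://example.com/categories/</loc><priority>0</priority></url>"
    "</urlset>"
)

if __name__ == "__main__":
    print(remove_priority_zero(SAMPLE))
```

A script like this would need a Python runtime in the build image, which is one reason a small command-line tool such as XMLStarlet fits a CI job better.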
Previously, I was using the NodeJS 6.11.x Alpine Docker image, which was based on the older Alpine 3.4. According to the Alpine Linux Packages list, XMLStarlet is available for Alpine 3.5 or higher. As such, I had to change to the Node 8.4.0 Alpine Docker image, which uses Alpine 3.6.
Below is a snippet of my new GitLab CI YAML:

image: node:8.4.0-alpine

before_script:
  - apk update && apk add openssl ca-certificates xmlstarlet git
  - npm install
  - PATH=$(npm bin):$PATH
  - npm run version

pages:
  script:
    - npm run build
    - xml ed --inplace -N x="http://www.sitemaps.org/schemas/sitemap/0.9" -d "//x:url[x:priority='0']" ./public/sitemap.xml
  artifacts:
    paths:
      - public
  only:
    - master
The key bits to point out are:
node:8.4.0-alpine: to utilise a NodeJS Docker image that uses Alpine 3.6.
apk add xmlstarlet: to install the XMLStarlet package for editing XML.
xml ed ... //x:url[x:priority='0']: the command that removes priority 0 URL nodes.
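The -N x="..." part of the command binds the prefix x to the sitemap namespace. The sitemap declares that namespace as its default, so a path without a prefix matches nothing. The short Python sketch below shows the effect using the standard library's limited XPath support. The sample document and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI binding, the equivalent of -N x="..." in the xml ed command.
NS = {"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical two-entry sitemap, not taken from a real site.
DOC = ET.fromstring(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/tags/</loc><priority>0</priority></url>"
    "<url><loc>https://example.com/post/hello/</loc><priority>0.5</priority></url>"
    "</urlset>"
)

# Without a prefix, the path looks for <url> in no namespace and finds nothing.
unqualified = DOC.findall(".//url")

# With the prefix bound, the same path finds both entries.
qualified = DOC.findall(".//x:url", NS)

# The predicate keeps only entries whose <priority> text is exactly '0'.
zero_priority = DOC.findall(".//x:url[x:priority='0']", NS)

print(len(unqualified), len(qualified), len(zero_priority))
```

The last lookup uses the same expression shape as the CI command, so it selects exactly the nodes that xml ed -d would delete.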
For the results, you can inspect my blog’s sitemap.xml.