GitLab CI remove priority zero from Hugo sitemap

The Hugo static site generator produces a sitemap.xml that includes priority zero URLs. These are the taxonomy pages, such as tags and categories. To keep them out of Google Search results, GitLab CI can be configured to remove the unwanted XML nodes (priority 0) from sitemap.xml.

Motivation

I needed to get rid of the taxonomies in sitemap.xml so that they do not appear in Google Search results. Previously, I built and published my blog from my laptop, and I had a PowerShell script for removing the Hugo taxonomies. After migrating to GitLab CI, I needed a new workaround.

Pre-requisites

You need to have experience in setting up GitLab CI.

Otherwise, please read my earlier post Hugo Static Site Generator with CI Deployment using GitLab.

The solution

The high-level idea is simple: let Hugo generate sitemap.xml as-is, then remove the priority 0 URLs from it.

There are several ways to do this, e.g. using XMLStarlet to remove XML nodes selected by an XPath expression, or using a regular expression. I decided to go with XMLStarlet as I feel it is the easier option for XML.
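To make the XPath approach concrete, here is a minimal shell sketch that can be tried locally on a tiny sitemap. The sample URLs (example.com), the sample file contents, and the function names are assumptions for illustration only; the `ed` call mirrors the command used in the CI configuration in this post. The wrapper exists because, depending on the distribution, the XMLStarlet binary may be installed as `xmlstarlet` or as `xml`.

```shell
#!/bin/sh
# Namespace used by the sitemap protocol; XPath needs a prefix bound to it.
NS='http://www.sitemaps.org/schemas/sitemap/0.9'

# Call XMLStarlet under whichever binary name is installed.
run_xmlstarlet() {
  if command -v xmlstarlet >/dev/null 2>&1; then
    xmlstarlet "$@"
  else
    xml "$@"
  fi
}

# Write a small, hypothetical sitemap to the file given as $1.
write_sample_sitemap() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/posts/hello/</loc>
    <lastmod>2018-01-01T00:00:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/tags/</loc>
    <priority>0</priority>
  </url>
</urlset>
EOF
}

# Delete every <url> whose <priority> is exactly "0", editing $1 in place.
strip_zero_priority() {
  run_xmlstarlet ed --inplace -N x="$NS" -d "//x:url[x:priority='0']" "$1"
}

# Print the number of priority 0 <url> entries found in $1.
count_zero_priority() {
  run_xmlstarlet sel -N x="$NS" -t -v "count(//x:url[x:priority='0'])" "$1"
}
```

Calling `write_sample_sitemap /tmp/sitemap.xml`, then `strip_zero_priority /tmp/sitemap.xml`, should leave only the post URL in the file. Note that the predicate compares against the literal string `0`, so a value written as `0.0` would not be matched.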

Previously, I was using the NodeJS 6.11.x Alpine Docker image, which was based on the older Alpine 3.4. According to the Alpine Linux packages list, XMLStarlet is available for Alpine 3.5 or higher. As such, I had to change to the Node 8.4.0 Alpine Docker image, which uses Alpine 3.6.

Below is a snippet of my new GitLab CI configuration file, .gitlab-ci.yml:

image: node:8.4.0-alpine
before_script:
  - apk update && apk add openssl ca-certificates xmlstarlet git
  - npm install
  - PATH=$(npm bin):$PATH
  - npm run version
pages:
  script:
  - npm run build
  - xml ed --inplace -N x="http://www.sitemaps.org/schemas/sitemap/0.9" -d "//x:url[x:priority='0']" ./public/sitemap.xml
  artifacts:
    paths:
    - public
  only:
  - master

The key bits to point out are:

  1. node:8.4.0-alpine - to utilise a NodeJS Docker image that uses Alpine 3.6.
  2. apk add xmlstarlet - to install XMLStarlet package for editing XML.
  3. xml ed ... //x:url[x:priority='0'] - command to remove priority 0 URL nodes.
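As an optional safeguard, the job could fail whenever a priority 0 entry survives the edit. The following is a minimal sketch, not part of my actual pipeline: the function name `assert_no_zero_priority` is hypothetical, and it assumes the element is written exactly as `<priority>0</priority>` with no inner whitespace, which is the same assumption the XPath literal `'0'` makes.

```shell
#!/bin/sh
# Succeeds (exit status 0) when no priority 0 entry is left in the sitemap
# passed as $1; prints a message to stderr and returns 1 otherwise.
assert_no_zero_priority() {
  sitemap=$1
  if grep -q '<priority>0</priority>' "$sitemap"; then
    echo "priority 0 entries still present in $sitemap" >&2
    return 1
  fi
}
```

Run after the `xml ed` step, for example as `assert_no_zero_priority ./public/sitemap.xml`, a non-zero exit status would stop the job before the artifacts are published.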

For the results, you can inspect my blog’s sitemap.xml.

Alternative solution

If you are using a device that is not supported by Alpine Linux, you may consider this solution instead. As of January 2018, Alpine is still unavailable for ARMv7.

Use the regular NodeJS image:

image: node:8

Below is a snippet of my package.json file making use of pretty-xml:

"scripts": {
  ...
  "build-minify-xml": "cat public/sitemap.xml | pretty-xml --minify | sed -r 's/<url><loc>[^<]+<\\/loc>(<lastmod>[^<]+<\\/lastmod>)?<priority>0<\\/priority><\\/url>//g' | cat > public/sitemap.min.xml && mv public/sitemap.min.xml public/sitemap.xml"
  ...
},
"dependencies": {
  "pretty-xml": "^1.2.1"
}
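The regular expression only works because the sitemap is minified first: with everything on one line and no whitespace between tags, each `<url>` element can be matched as a single string. In package.json the slashes appear as `\\/` because of JSON escaping; in the shell they are plain `\/`. Below is a minimal sketch of just the sed step so it can be tried in isolation. The function name and the sample input are assumptions for illustration, and it uses `-E` rather than `-r` (GNU sed treats the two the same, and BSD sed also accepts `-E`).

```shell
#!/bin/sh
# Read a minified sitemap on stdin and drop every <url> element whose
# <priority> is 0, with or without a <lastmod> element before it.
strip_zero_priority_sed() {
  sed -E 's/<url><loc>[^<]+<\/loc>(<lastmod>[^<]+<\/lastmod>)?<priority>0<\/priority><\/url>//g'
}

# Hypothetical minified input: one tag page (priority 0) and one post.
sample='<urlset><url><loc>https://example.com/tags/</loc><priority>0</priority></url><url><loc>https://example.com/posts/hello/</loc></url></urlset>'
```

Running `printf '%s' "$sample" | strip_zero_priority_sed` should print the sample with only the post URL left. The pattern is tied to this exact element order, so an entry that carries another element such as `<changefreq>` would not be removed.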