Skip to content

Testing a sitemap with bash, sed and curl

During the migration from WordPress to Jekyll I wanted to confirm ensure I didn't break any existing permalinks and leave 404s hanging around. The process is trivial but something I may want to repeat.

First, download the sitemap.xml.gz output by the All In One SEO WordPress pluging. Crack this XML in a simplistic way and write out all of the known URLs to a simple text file.

gunzip sitemap.xml.gz
grep \<loc\> sitemap.xml | sed 's/.*<loc>//' | sed 's|</loc>||' > sitemap.txt

Next, create a script testsitemap.sh that reads from stdin and makes an HTTP request for each line and records the HTTP response code.

#!/bin/bash
while read line
do
    response=$(curl --write-out %{http_code} --silent --output /dev/null $line)
    echo $line '\t' $response
    sleep 1
done

Finally, pull the whole thing together and list and 404s.

cat sitemap.txt | bash testsitemap.sh | grep 404