
Testing a sitemap with bash, sed and curl

During the migration from WordPress to Jekyll I wanted to make sure I didn't break any existing permalinks and leave 404s hanging around. The process is simple, but it's something I may want to repeat.

First, download the sitemap.xml.gz output by the All In One SEO WordPress plugin. Crack this XML in a simplistic way and write out all of the known URLs to a plain text file, one URL per line.

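{% highlight bash %}
gunzip sitemap.xml.gz
# Keep only the lines holding a <loc> element, then strip the tags around each URL
grep '<loc>' sitemap.xml | sed 's/.*<loc>//' | sed 's|</loc>.*||' > sitemap.txt
{% endhighlight %}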
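To make the extraction step concrete, here is a minimal sketch run against a tiny, made-up sitemap. The `extract_urls` function name and the example.com URLs are my own placeholders, and the sketch assumes each `<loc>...</loc>` pair sits on a single line, which is what the grep and sed approach relies on.

```bash
#!/usr/bin/env bash
# Sketch: print every <loc> URL in a sitemap file, one per line.
# Assumes each <loc>...</loc> pair is on a single line.
extract_urls() {
  grep '<loc>' "$1" | sed -e 's/.*<loc>//' -e 's|</loc>.*||'
}

# Hypothetical two-entry sitemap, used only for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://example.com/2012/01/hello-world/</loc>
  </url>
</urlset>
EOF

extract_urls "$sample"
# prints:
#   http://example.com/
#   http://example.com/2012/01/hello-world/
rm -f "$sample"
```

If the sitemap puts several `<loc>` elements on one line, this approach will miss all but the last one, and a real XML parser would be the safer choice.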

Next, create a script that reads URLs from stdin, makes an HTTP request for each line, and records the HTTP response code.

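{% highlight bash %}
# Read URLs from stdin, one per line, and print each URL with its HTTP status code
while read -r line
do
  # Discard the response body; keep only the status code
  response=$(curl --write-out '%{http_code}' --silent --output /dev/null "$line")
  printf '%s\t%s\n' "$line" "$response"
  # Pause between requests to go easy on the server
  sleep 1
done
{% endhighlight %}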

Finally, pull the whole thing together and list any 404s.

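{% highlight bash %}
# check-sitemap.sh is a placeholder; use whatever name you saved the script under
cat sitemap.txt | bash check-sitemap.sh | grep 404
{% endhighlight %}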
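One caveat: a plain `grep 404` also matches any URL that happens to contain "404". A stricter filter could compare the status column exactly. The sketch below assumes the script prints the URL and the status code separated by a tab character; the `only_404s` name and the sample results are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print only the URLs whose status field (column 2) is exactly 404.
# Assumes input lines of the form "URL<TAB>status".
only_404s() {
  awk -F '\t' '$2 == "404" { print $1 }'
}

# Hypothetical results, used only for illustration.
printf '%s\t%s\n' \
  'http://example.com/' 200 \
  'http://example.com/error-404-explained/' 200 \
  'http://example.com/old-post/' 404 | only_404s
# prints:
#   http://example.com/old-post/
```

In the pipeline above, `only_404s` would take the place of `grep 404`.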