Classic Logic

XKCD Post Counter

It was a lazy Sunday afternoon and I was going through XKCD. I wondered how many posts XKCD has in total. The neat thing about XKCD is that their URLs are clean, and all their new posts get a number that’s one higher than the previous post. The site itself is very minimal and straightforward; apparently no weird frameworks doing magic.

I randomly typed https://xkcd.com/5000/, and it gave me a 404.

I didn’t want to randomly try a bunch of URLs to find out how many posts there are. So I decided to write a quick script that would do a binary search to find out how many posts there are. Binary search because a brute-force linear search of – who knows, 4000? – URLs would be inefficient both for me and whoever owns XKCD servers

The basic idea is to:

  1. Start with the interval [0, 10000]
  2. Find out the midpoint of this interval
  3. If the midpoint is a valid URL, the final XKCD post is in [midpoint, upper limit].
  4. If the midpoint is not a valid URL, the ifnal XKCD post is in [lower limit, midpoint].
  5. Repeat until you reach an interval of one element.

We divide the search space into two each time we recurse, so the algorithm should terminate in log(n) steps, where n is the initial inteval size; ie., this is an efficient O(log n) algorithm.

Didn’t take long to code this up in Python. Here’s the GitHub Gist of the complete program:

Here’s the output when I ran it with an initial interval of 0 and 10000:

$ python main.py
checking https://xkcd.com/5000/ ...got 404
checking https://xkcd.com/2500/ ...got 404
checking https://xkcd.com/1250/ ...got 200
checking https://xkcd.com/1875/ ...got 200
checking https://xkcd.com/2187/ ...got 200
checking https://xkcd.com/2343/ ...got 200
checking https://xkcd.com/2421/ ...got 200
checking https://xkcd.com/2460/ ...got 404
checking https://xkcd.com/2440/ ...got 200
checking https://xkcd.com/2450/ ...got 404
checking https://xkcd.com/2445/ ...got 404
checking https://xkcd.com/2442/ ...got 200
checking https://xkcd.com/2443/ ...got 404
Count: 2442

It returned the count as 2442; which seems to be correct at the time of writing this post (28 March 2021).


Update (21 May 2022)

Kevin Cox from Canada pointed out this could have been done with the following simple one-liner:

curl https://xkcd.com | sed -nE 's_.* property="og:url" content="https://xkcd.com/([0-9]+)/.*_\1_p'

It works.

It turns out the XKCD homepage (which always lists the latest post) has a meta-tag og:url containing the full URL to that post. Cox’s approach apparently grabs the post count from there. In addition (as Cox pointed out), there’s a link to this (latest) comic on the homepage which I could have scraped. Either of those approaches would have done the job by making just a single request to XKCD, whereas my approach was logarithmic in the number of XKCD posts (O(1) vs O(log n)).

What I took away from that was I shouldn’t forget to take a step back and really understand what I’m trying to solve, and see if there’s a better way than naive methods.

Admittedly, I have a lot to learn; it was a fun exercise nevertheless :)