
The Hacker Way to Download an Entire Tree of Webpages

("Hacker" as defined in The Jargon File).

Have you ever wanted to download an entire tree of webpages onto your computer so that you can check it out later, offline?

Why might you want to do this? Let’s say you’re traveling on an airplane without an internet connection, and you want to work on that Android app you were developing back when you were on the ground.

If you’re a rather proficient programmer, you might as well just get to work.

However, if you are like me, you might want to refer to the official Android documentation occasionally while you’re working on that app. And on an airplane without an internet connection, there’s nothing you can really do about it.

Except if you saved the webpages you might want to refer to beforehand, while you were still on the ground.

And saving webpages using the traditional Ctrl + S is a naive approach. There are only so many webpages you can practically save that way. And then again, the links within a saved page that point to other parts of the page (or to other webpages) won’t work.

(And if you’re the sort of person who makes use of the mouse to save webpages, at least start using Ctrl + S).

Fortunately though, GNU (and its derivatives and cousins) provides a clean, professional utility for saving webpages: wget.

wget is a command-line utility for pulling things off the web. Through clever use of its command-line options, you get real flexibility. Say you want to pull in only the PDF files linked from a webpage. You can do that with a single command using wget.
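
For example, a command along these lines grabs just the PDFs linked from a page (the URL is only a placeholder; -A restricts the download to the listed extensions, -l 1 stops wget from following links deeper than the page itself, and -nd drops everything into the current directory instead of recreating the remote directory structure):

wget -r -l 1 -nd -A pdf https://example.com/papers/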

wget comes installed by default on most Linux distros, although you might have to install it manually if you’re on a Mac. I leave it to you to figure out how to install wget on your particular OS.
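
(If you happen to use Homebrew on your Mac, for instance, it’s typically just the following; other package managers have their own equivalents.)

brew install wget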

Let’s start things off by downloading a simple webpage onto your system. I suggest you create and move into a separate directory to keep all the downloaded pages in a single location.
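
Something like this does the trick (the directory name is arbitrary):

mkdir offline-pages
cd offline-pages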

The syntax for wget is as follows:

wget [OPTIONS] URL

Let’s take this for a test-drive; download the Wikipedia entry for Linux:

wget https://en.wikipedia.org/wiki/Linux

That would download the page into the current directory.

As I mentioned before, wget offers a lot more than a simple save of a webpage. You can check out the full documentation for wget by opening your terminal and typing in man wget.

Saving a Tree of Webpages Using wget

To give you an example of the types of things that wget could do, let’s see how we can save an entire tree of webpages while preserving all the links within each of those pages.

For this example, I’ll show you how to download the documentation pages for the Legion* programming model, so that you can refer to its documentation even when you’re not connected to the internet (on an airplane, for instance).

So, go ahead and type into your terminal:

wget --no-clobber --convert-links -r -p -E http://legion.stanford.edu

I’m assuming that you have a reasonably fast internet connection. Depending on the size of the site you’re downloading, the process could take some time.

So let me explain the various options in the above command:

Option            Description
--no-clobber      Prevents repeated downloads of the same file
--convert-links   Converts the links within the downloaded pages to point to the local copies rather than back to the remote web server
-r                Recursive: retrieves files recursively, following links within the site
-p                Downloads all the files necessary to properly display a given webpage, such as images and sounds
-E                Appends the .html extension to downloaded HTML files that lack one, so the saved pages open properly in a browser

(Check out the complete documentation of wget by typing man wget on your terminal).

Here’s a second example. One of my professors posts all the lab files on his webpage for students to download. More often than not, it’s a bunch of directories containing other directories, files, and so on. To download them all in one go, I simply run:

wget --recursive --no-parent --cut-dirs=2 -nH -R "index.html*" http://www.cs.unm.edu/~crandall/secprivspring18/lab2stuff/

The URL at the end points to the directory where all the relevant files are located. Check out the manpage to see exactly what the --cut-dirs and -nH arguments do; roughly, -nH stops wget from creating a directory named after the host, and --cut-dirs=2 strips the first two components of the remote path, so the files land directly under lab2stuff/ instead of under www.cs.unm.edu/~crandall/secprivspring18/lab2stuff/.

Some websites have mechanisms in place to prevent tools like wget from automatically downloading all their content; wget comes with options to work around that. Take, for instance, the following version of the above command:

wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://legion.stanford.edu

I encourage you to go check out the documentation for wget to figure out exactly what those options mean.

There’s another upside to saving things this way: because the whole webpage tree has been stored locally, the pages load quicker, which is useful if you’re in a part of the world with limited bandwidth.
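
By default, wget puts the mirror under a directory named after the host, so you can point your browser straight at the local copy. On most Linux desktops, something like this opens the Legion mirror from the earlier command:

xdg-open legion.stanford.edu/index.html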

Now, there are a couple of things you should keep in mind while saving webpages offline:

  • Webpages tend to evolve over time; you don’t want to keep referring to pages saved 3 years ago.
  • Some webpages carry copyright restrictions. You might want to keep that in mind.
  • Some webpages (like the Android documentation, for instance) are quite large, and it might not really be practical to save all of the Android documentation onto your system (see the sketch right after this list for one way to limit the size of the download).
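
If you still want a partial local copy of a large site, wget can bound the download for you. Here’s a sketch, reusing the Legion URL from earlier as a stand-in: --level caps the recursion depth, and --quota caps the total amount downloaded.

wget --no-clobber --convert-links -r -p -E --level=2 --quota=100m http://legion.stanford.edu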

*(Legion, the example I used in the demo above, is a really cool parallel programming framework developed by Stanford University and NVIDIA. You might want to check it out if you’re interested in parallel computing.)