Jameser's Tech Tips

Tuesday, July 04, 2006

Tip #7: Duplicating a Website Using GNU's wget

Occasionally you come across a utility that was originally developed for Linux/Unix, and you are extremely thankful that someone took the time to port the application to the Windows platform... GNU's wget is an example of just such an application... (Thank you, Heiko Herold!)

Linux users are probably already aware of the power of wget, but to briefly summarize, wget is a command line tool for retrieving data from the internet... Even though the concept of the application is simple, the capabilities and features are seemingly endless... Basically, any URL (a web page, an FTP file, etc.) can be easily captured and saved as a local copy... It is indispensable for non-interactive internet access in scripts and batch files...
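
As a quick illustration of that basic usage (the URLs below are just placeholders), grabbing a single web page or FTP file looks like this:
wget http://www.example.com/index.html
wget ftp://ftp.example.com/pub/somefile.zip
Each command simply saves a local copy of the requested file in the current directory...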

The Win32 port of wget can be downloaded here... The application is distributed as a zip file, and can be unzipped into C:\wget for this example... For Linux versions, you should be able to apt-get, emerge, yum, etc. the most recent version...
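
For example, on a Debian-based or Red Hat-based distribution (assuming wget is available in your standard repositories, which it almost always is), one of the following should do the trick:
apt-get install wget
yum install wget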

For today's tip, we'll be using wget to download an entire website for offline viewing, for distribution or demonstration to clients, or to save as a browsable archive... I think it's important to mention that this process should not be used on extremely large sites... Okay, with that out of the way, let's proceed...

Start by opening a command prompt (Start/Run/cmd) and changing directories to the location where you installed wget...
cd \wget (or wherever you installed to)

There are a few parameters we'll need to pass to allow us to get all of the data we're trying to mirror, as well as modify the links within the documents to point to the local copy instead of the original internet location... I'll explain each after the command below:
wget -r -H -k -Dwhateversite.com,www.whateversite.com www.whateversite.com

The -r switch tells wget that we'd like to recursively go through the site gathering files... This is the main switch that will provide the mirroring effect we are looking for...
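
One note here: if you're concerned about the warning above regarding extremely large sites, wget's -l (--level) switch can be added to cap the recursion depth (the default is 5 levels)... The depth of 2 below is just an illustration:
wget -r -l 2 -H -k -Dwhateversite.com,www.whateversite.com www.whateversite.com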

The -H switch tells wget to "span hosts", meaning to grab files which are not located on exactly the same domain we specify... This is necessary to grab files from domain.com as well as www.domain.com... It is very important to use this switch in combination with the -D switch, which specifically defines the domains we will be gathering files from... Using the -H without the -D could result in downloading content until your drive fills to capacity... For our -D we'll be using the www and non-www domains...

The -k switch causes the downloaded files to have their links replaced with relative links which point to the local copy you've made... Without this switch the files would be exact replicas and try to link to their original internet locations...

The final parameter to be passed is the domain you are seeking to copy... It should be one of the domains listed in your -D switch...
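
For reference, the exact same command can also be written using wget's long-form options, which some people find easier to read when it ends up in a script or batch file:
wget --recursive --span-hosts --convert-links --domains=whateversite.com,www.whateversite.com www.whateversite.com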

Upon completion, you should have a new directory called www.whateversite.com, containing all files from the site, with links corrected to point to the relative locations within your archive copy...
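
To browse your new archive (assuming the site has a top-level index.html, which most do), just open the copied front page from the command prompt:
start www.whateversite.com\index.html
Linux users can simply point their browser at the same file inside the new directory...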

If you have any questions, please leave a comment...

2 Comments:

  • At 4/20/2007 10:26 PM, Anonymous Anonymous said…

    You know, I always used to hate Windows. I am a big Linux fan.

    Lately, however, I've been seeing lots of things on the net that could make Windows usable for me (e.g., Cygwin). With the help of your site, I may finally be able to put my copy of Windows XP to good use.

  • At 2/27/2008 7:39 AM, Anonymous Anonymous said…

    I want to use wget as a static library with headers... and I want to use it from a C++ program to download things... meaning I would control wget through a C++ application... Can I do this? If yes, then how?
