

Copying Sites

To download a page, do:

$ wget -O output-file.html url

To download a page with all its dependencies (images etc.), do:

$ wget -p url

To copy a whole site recursively, do:

$ wget -p -r <website>

To copy a whole site recursively over HTTPS (SSL), ignoring certificate errors, do:

$ wget -p -r --no-check-certificate <website>

To copy a file through HTTP authentication, use:

$ wget --http-user=<user> --http-password=<password> <url>

Combine the HTTPS and HTTP authentication options as necessary.
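
As a rough sketch of combining these options, the function below only prints the resulting command instead of running it, so it can be checked first. The function name mirror_cmd and the user, password and URL values are made up for illustration; the wget options are the ones listed above.

```shell
# Sketch: print (do not run) a wget command that mirrors an HTTPS site
# protected by HTTP authentication. All argument values are hypothetical.
mirror_cmd() {
    user=$1
    pass=$2
    url=$3
    # -p: page dependencies, -r: recursive,
    # --no-check-certificate: ignore certificate errors
    printf 'wget -p -r --no-check-certificate --http-user=%s --http-password=%s %s\n' \
        "$user" "$pass" "$url"
}

mirror_cmd alice secret https://www.example.com/private/
```

If the printed line looks right, paste it into the terminal to run it.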

Some sites require a referer, which means that you can't reach a page directly without following a link from another page. In that case, use the --referer=www.foo.com option.

To ignore robots.txt (it sometimes blocks access to an important directory, such as /img), use the -e robots=off option.
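
The referer and robots.txt options can be used together with a recursive copy. The sketch below, again, only prints the command; fetch_cmd and the example URLs are assumptions for illustration, not something a particular site requires.

```shell
# Sketch: print (do not run) a recursive wget command that sends a referer
# and ignores robots.txt. The referer and URL values are hypothetical.
fetch_cmd() {
    referer=$1
    url=$2
    # -e robots=off: ignore robots.txt
    # --referer: pretend the request came from a link on that page
    printf 'wget -p -r -e robots=off --referer=%s %s\n' "$referer" "$url"
}

fetch_cmd http://www.foo.com/ http://www.foo.com/img/
```

Ignoring robots.txt goes against what the site owner asked for, so use it only when a needed directory is blocked.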