Here are some observations from writing web-crawling robots.
Intro
At some point, many of us (in the IT security industry) will need to write a robot which scrapes lots of web sites. By "lots", I mean a very large number, run by arbitrary parties. Not just a few run by well-behaved, cooperative entities.
Most owners of web servers try to make them compatible - but this is not guaranteed. Even with the best of intentions, we'll probably find things which go wrong.
Behaviour observed
Faulty DNS
* Returns oversized responses
* Returns private addresses in "A" records (a filtering sketch follows this list)
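A minimal sketch (standard library only, with illustrative rejection rules): resolve names ourselves and refuse to crawl anything which only resolves to private or otherwise unroutable addresses.

    import ipaddress
    import socket

    def resolve_public_addresses(hostname):
        """Resolve a hostname, discarding private, loopback, link-local and reserved addresses."""
        try:
            infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            return []
        addresses = set()
        for _family, _type, _proto, _canon, sockaddr in infos:
            ip = ipaddress.ip_address(sockaddr[0].split("%")[0])  # strip any IPv6 scope id
            if not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved):
                addresses.add(str(ip))
        return sorted(addresses)

    # An empty list means the name only resolved to addresses we refuse to crawl.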
Server hangs / timeout
* Connection timeout
* Timeout waiting for response
* Connection hang during headers or response
Bad responses
* Connection closed after request
* Connection closed while transmitting headers
* Connection closed while transmitting bodies
HTTP
* Garbled response
* Bad status code
* Too many headers
* A single very long header
HTTP Redirects
* Redirect to relative URI
* Redirect loop
* Redirect to private sites, e.g. unqualified names, private IPs
* 301 / 302 status with no Location: header (a screening sketch follows this list)
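A minimal sketch of screening a Location header before following it. The rejection rules (unqualified names, private IP literals) are assumptions about what we want to avoid, and callers should also cap the number of hops to break redirect loops.

    import ipaddress
    from urllib.parse import urljoin, urlsplit

    def next_url(current_url, location):
        """Resolve a (possibly relative) Location header and reject suspicious targets."""
        if not location:
            return None                          # 301/302 with no Location header
        target = urljoin(current_url, location)  # tolerate relative redirects
        host = urlsplit(target).hostname or ""
        if "." not in host:
            return None                          # unqualified name, e.g. http://intranet/
        try:
            if ipaddress.ip_address(host).is_private:
                return None                      # literal private IP
        except ValueError:
            pass                                 # not an IP literal, which is fine
        return target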
Content
* High-bit bytes in HTTP headers which are not valid UTF-8
* No declared encoding, but content which is not ASCII or Latin-1
* Wrong declared encoding
* Unknown declared encoding (e.g. sjis variations)
* Inconsistent encodings declared in the Content-Type header and in the HTML itself
* Byte sequences which are invalid for the declared or detected encoding
* Non-HTML content served with an HTML content-type, e.g. images or PDFs (a sniffing sketch follows this list)
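A minimal sketch of sniffing the first bytes of a body which claims to be HTML; the signature list here is illustrative, not exhaustive.

    MAGIC_SIGNATURES = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xff\xd8\xff", "image/jpeg"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
    ]

    def sniff_not_html(body):
        """Return a guessed real type if 'HTML' content is actually something else."""
        head = body[:16].lstrip()
        for signature, real_type in MAGIC_SIGNATURES:
            if head.startswith(signature):
                return real_type
        return None  # plausible enough to hand to the HTML parser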
SSL
* Bad certificates. If we don't care about certificate validity, it might be better not to attempt verification at all (see the sketch below).
* Things which break our SSL library
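A minimal sketch of the "don't verify" option using the standard library, assuming we have decided that certificate errors are not interesting for our purposes.

    import ssl
    import urllib.request

    # An SSL context which neither checks the hostname nor verifies the chain.
    insecure_ctx = ssl.create_default_context()
    insecure_ctx.check_hostname = False
    insecure_ctx.verify_mode = ssl.CERT_NONE

    def fetch_ignoring_certs(url, timeout=30):
        with urllib.request.urlopen(url, timeout=timeout, context=insecure_ctx) as resp:
            return resp.read()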
Misc
* 200 status even for pages which do not exist (a detection probe is sketched below)
* 301 / 302 status for pages which do not exist (expecting 404)
* robots.txt served as html
* Unexpectedly large content
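One way to detect hosts which answer 200 for everything is to probe a path which almost certainly does not exist. A rough sketch, assuming we are willing to spend one extra request per host:

    import urllib.error
    import urllib.request
    import uuid

    def serves_soft_404(base_url, timeout=30):
        """Return True if the host answers 200 for a path which should not exist."""
        probe = base_url.rstrip("/") + "/" + uuid.uuid4().hex  # random, surely-missing path
        try:
            with urllib.request.urlopen(probe, timeout=timeout) as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            return False   # 404 (or any other error status) is the honest answer
        except urllib.error.URLError:
            return False   # network trouble; treat as "don't know"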
Framers, ad-injectors
* Frame somebody else's content
* Use javascript to display someone else's content with other (advert) elements layered or obstructing
Spam
* Some web sites exist to spam search engines
* These often contain large numbers of host names, linking to each other - "Link farms"
* Spam will cause us to waste resource and "dilute" good content (for statistical analysis, etc)
Advice
Robustness
If the process crashes, we have a problem. A web crawler needs to be able to recover from unexpected errors.
- Set timeouts to a reasonable value. Defaults are typically too high.
- Check that timeouts work at every stage.
- Expect large responses; limit the size we read if possible (a sketch follows this list).
- Don't assume that anything is valid UTF-8, even if some spec requires it to be.
- Take metadata with a pinch of salt, e.g. Content-Length does not imply anything about the size of the content!
- Be aware of race conditions. If you look again, something might disappear, appear or change. (Example: a HEAD request shows one Content-Type; by the time we do the GET, it has changed.)
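A minimal sketch of a defensive fetch along these lines: an explicit timeout and a hard size cap, reading the body in chunks rather than trusting Content-Length. The limits are arbitrary choices.

    import urllib.request

    MAX_BYTES = 1 * 1024 * 1024   # hard cap on how much body we keep
    CHUNK = 64 * 1024

    def fetch_capped(url, timeout=20):
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = b""
            while len(body) < MAX_BYTES:
                chunk = resp.read(CHUNK)
                if not chunk:
                    break
                body += chunk
            return resp.status, dict(resp.headers.items()), body[:MAX_BYTES]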
Making sense
We all hope that everything makes sense. However, it's not that simple. What encoding should we interpret things as? What content-type is really present?
Some sites serve data with incorrect metadata, but missing metadata is far more common.
A large proportion of Russian web sites are encoded in Windows-1251 without any metadata. A significant proportion of Japanese sites use Shift_JIS (or its many variants) without metadata.
Sometimes we just have to try to guess. There are definitely cases where we're going to see garbage and need to be able to identify it so we can ignore it.
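A minimal sketch of "best effort" decoding: try the declared charset first, then UTF-8, then a couple of encodings which are common in practice. The candidate list is an assumption based on what we happen to crawl.

    FALLBACK_ENCODINGS = ["utf-8", "windows-1251", "shift_jis"]

    def best_effort_decode(body, declared=None):
        """Return (text, encoding used); never raises."""
        candidates = ([declared] if declared else []) + FALLBACK_ENCODINGS
        for enc in candidates:
            try:
                return body.decode(enc), enc
            except (LookupError, UnicodeDecodeError):
                continue   # unknown or wrong encoding; try the next one
        return body.decode("latin-1", errors="replace"), "latin-1"   # last resort, never fails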
Performance
If we've got a lot of work to do, we want to get through it as quickly as possible. Or at least fast enough.
Ideas:
- Parallel fetching. Any serious robot is going to need to do lots of this, so consider multiprocessing or an asynchronous framework. At large scale the work might need to be split across several hosts.
- HTTP HEAD method. If we only want the headers, use HEAD. This can save a lot of bandwidth, and virtually all servers support it.
- HTTP/1.1 Range requests. We can ask for, say, the first 10k of a page using a "Range" header. Not all servers support it, but we can fail gracefully (see the sketch after this list).
- Gzip-compressed content, if our client supports it and there are no interoperability problems.
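A minimal sketch of both techniques with the standard library: a HEAD request, and a Range request for the first 10k which degrades gracefully when the server ignores the Range header.

    import urllib.request

    def head(url, timeout=20):
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, dict(resp.headers.items())

    def fetch_first_bytes(url, limit=10 * 1024, timeout=20):
        req = urllib.request.Request(url, headers={"Range": "bytes=0-%d" % (limit - 1)})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read(limit)       # a server which ignores Range just sends everything
            return resp.status, body      # 206 means the Range was honoured, 200 means it was not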
Bad ideas:
- Keep-alive or pipelining. These can cause interoperability problems and are usually unnecessary; they are latency optimisations for web browsers. (Possibly desirable when fetching lots of pages from the same site over SSL.)
- Caching, proxies. It would be better for the application to behave intelligently and avoid requesting the same data more than once.
Depending on your use-case, it might be a good idea to "back off" a site which returns errors (particularly 5xx or network-layer) and try again later.
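A minimal sketch of per-host back-off: after an error, leave the host alone for a while and double the delay on repeated failures. The numbers are arbitrary.

    import time

    class HostBackoff:
        def __init__(self, initial=60, maximum=3600):
            self.initial = initial
            self.maximum = maximum
            self.delay = {}      # host -> current back-off in seconds
            self.retry_at = {}   # host -> earliest time we may try again

        def ready(self, host):
            return time.time() >= self.retry_at.get(host, 0)

        def record_failure(self, host):
            new_delay = min(self.delay.get(host, self.initial // 2) * 2, self.maximum)
            self.delay[host] = new_delay
            self.retry_at[host] = time.time() + new_delay

        def record_success(self, host):
            self.delay.pop(host, None)
            self.retry_at.pop(host, None)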
Decoupling
To maximise throughput, it's probably a good idea to decouple the different stages with queues of work between them. It can also make our code cleaner, easier to test and possibly more robust (we can retry any stage which fails). For example, decouple:
* Fetching robots.txt (if you use it)
* Fetching other entities
* Parsing and processing
* Scheduling / prioritisation
It might also be worthwhile to decouple DNS requests from actual fetching.
One of the reasons to decouple is that parsing takes lots of memory, but fetching requires a lot of waiting for the network. We don't want to wait for the network a lot while simultaneously using a lot of memory. Doing fetching and parsing in different processes means we can let the parser make a mess of our heaps (i.e. heap fragmentation, possibly leaks) and occasionally call _exit to clean it all up without impacting the fetch latency.
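A minimal sketch of the fetch/parse split: fetchers put raw pages on a queue, a parser process consumes them, and the parser is recycled after a fixed number of documents so any heap mess dies with it. Names and numbers are illustrative.

    import os
    from multiprocessing import Process, Queue

    PAGES_PER_PARSER = 500   # recycle the parser process after this many documents

    def parser_worker(pages, results):
        handled = 0
        while handled < PAGES_PER_PARSER:
            url, body = pages.get()
            results.put((url, len(body)))   # stand-in for real parsing
            handled += 1
        os._exit(0)                         # throw the whole heap away

    def run_parsers(pages, results):
        while True:                         # start a fresh parser each time one exits
            p = Process(target=parser_worker, args=(pages, results))
            p.start()
            p.join()

    if __name__ == "__main__":
        pages, results = Queue(), Queue()   # fetchers (not shown) feed (url, body) tuples into `pages`
        run_parsers(pages, results)         # blocks; in real life this runs alongside the fetchers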
Last resorts - nuclear options
Ask a human
We could add "alarm" conditions so that the crawler asks a human when it encounters something unexpected. This may be useful, for example, for deciphering a page served in the wrong encoding.
Blacklisting
If we persistently see bad sites which are spam, causing robustness problems or just plain nonsense, we can blacklist them.
- Blacklisting host names (or domain names) is ok
- Blacklisting IPv4 addresses is better (the supply is much more limited!)
If carrying out a very large-scale activity, automating blacklisting is desirable.
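A minimal sketch of automating it: count problems per IPv4 address and stop scheduling anything which resolves there once a threshold is crossed. The threshold is an arbitrary choice.

    from collections import Counter

    class Blacklist:
        def __init__(self, threshold=25):
            self.threshold = threshold
            self.problems = Counter()   # ip -> number of bad fetches / spam hits
            self.blocked = set()

        def record_problem(self, ip):
            self.problems[ip] += 1
            if self.problems[ip] >= self.threshold:
                self.blocked.add(ip)

        def allowed(self, ip):
            return ip not in self.blocked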