Tuesday 15 September 2015

Headless web browsers: PhantomJS and SlimerJS

What is headless web browsing?


It's using a web-browser like application to do automated fetching and analysis of web pages, without a human user present.This is different from simply fetching HTML content via HTTP; headless web browsers typically also load images, process Javascript code, CSS and layout the page content (albeit in an invisible way). 

The developer can then use scripting (usually Javascript) to examine the page as it is laid out in memory, as if in a "real" web browser, to look at the style of text, etc.

We could even use OCR to look for text within images shown in the page.

Why?

  • More effectively analyse the content of pages. Lots of pages nowadays contain a huge amount of "boiler plate" uninteresting text, often in HTML elements without semantic meaning (e.g. DIV). Only by using CSS (and sometimes Javascript) are we able to have a computer see the page as a human would
  • Generation of screenshots
  • Getting metadata which are dynamically written by scripts, etc, such as Javascript-created links.
  • Automated testing of web applications

What tools are available?

Several. Traditionally some users have hacked their own solutions using either a web browser extension, or embedding a web browser in a C++ program (often Webkit).

Here I'm looking at PhantomJS and SlimerJS.

PhantomJS and SlimerJS essentially perform the same task - to run developer-specified Javascript code in the context of an automated web browser, without using a real web browser.

PhantomJS is based on Webkit; SlimerJS is based on Mozilla / Firefox.

PhantomJS

Two versions of PhantomJS are available - the 1.9 series and the 2.0 series. The main difference is that the 2.0 series uses a more recent version of Webkit.

Unfortunately, last time I tested them, neither is very good for browsing lots of real web pages "in the wild". NB: This may be fixed when you read this, test it yourself!

  • Lots of memory usage
  • Slow
  • Prone to crashing; diagnosing crashes is very difficult
  • v1.9 has an out-of-date Webkit which has less feature support
  • v2.0 seems to leak memory very badly.

So probably PhantomJS is ok for some automated testing scenarios, particularly if you have a "single page application", or only a small number of pages tested.

But accessing large numbers of "real" web pages quickly breaks it, and it's not easy to fix.

Essentially the problem is that Webkit is now an abandoned fork (Apple and Google have both forked off from it) and bugs don't get fixed upstream. PhantomJS does not usually apply bugfixes to Webkit itself.

PhantomJS is a C++ executable that includes most of Webkit inside its binary. This is OK, as it's almost completely standalone, but it means that compiling it is VERY time consuming, particularly on limited resources. For example, on a Raspberry Pi I was able to run PhantomJS, but building it will take days (a more powerful system is really required). On a modern x86 system compiling is much quicker, but can still take 1 hour; the link step uses several Gb of memory (not really a problem on a server, but careful if building in a memory-limited VM).

Linux binaries are also available from the web site, which is handy :)

SlimerJS

SlimerJS is a completely different beast from PhantomJS. It is not a C++ binary and doesn't attempt to embed the engine directly in its own application. Instead, it uses an obscure feature of Firefox to run an alternative "user interface application" which provides an environment which is almost identical to PhantomJS.

This has benefits and drawbacks

  • It is not completely headless. It doesn't require user input, but it won't work without an X server on Linux (this is easily fixed using Xvfb). Under Windows, visible windows may be shown unless running an an alternate desktop, or as a service.
  • The web browser used is really identical to the Firefox version you're using - all the same features are available.
  • If you update Firefox, SlimerJS updates too (pro: good for security; con: it might break)

SlimerJS is under moderately active development, but has a much smaller user community than PhantomJS.

  • Performance of SlimerJS (using Firefox 40) seems MUCH BETTER than PhantomJS in general
  • Stability seems much better too (although I have had a few crashes)
  • The same APIs are supported, but doucmentation is mostly worse (example: the filesystem objects are barely documented)

Wrap up

So there you are - headless web browsing IS a niche application, but it is very useful in its place. I like SlimerJS because its overall design approach seems to work better in the general case.

It would be interesting to have a SlimerJS / PhantomJS type application which uses GoogleChrome as its web browser. I imagine one may appear, if it does not already exist.