Tuesday 15 September 2015

Headless web browsers: PhantomJS and SlimerJS

What is headless web browsing?


It's using a web-browser-like application to do automated fetching and analysis of web pages, without a human user present. This is different from simply fetching HTML content via HTTP; headless web browsers typically also load images, process Javascript code and CSS, and lay out the page content (albeit in an invisible way).

The developer can then use scripting (usually Javascript) to examine the page as it is laid out in memory, as if in a "real" web browser, to look at the style of text, etc.

We could even use OCR to look for text within images shown in the page.
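
As a rough sketch of what "examining the page as it is laid out" can look like, the script below loads a page and reports the headings it finds, along with their computed font size and visibility. It uses the PhantomJS-style API (`require('webpage')`, `page.open`, `page.evaluate`, `phantom.exit`), which SlimerJS also provides. The URL is a placeholder and the function name `describeHeadings` is made up for this example.

```javascript
// Runs inside the loaded page (via page.evaluate), so it may only use
// browser globals such as document and window, not variables from this script.
function describeHeadings() {
    var nodes = document.querySelectorAll('h1, h2, h3');
    return Array.prototype.map.call(nodes, function (el) {
        var style = window.getComputedStyle(el);
        return {
            tag: el.tagName.toLowerCase(),
            text: el.textContent.trim(),
            fontSize: style.fontSize,
            visible: style.display !== 'none' && style.visibility !== 'hidden'
        };
    });
}

// Only drive a browser when running under PhantomJS or SlimerJS,
// both of which define the global "phantom" object.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL (assumption)

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        // The return value is serialised and copied back out of the page.
        var headings = page.evaluate(describeHeadings);
        console.log(JSON.stringify(headings, null, 2));
        phantom.exit(0);
    });
}
```

Saved as, say, headings.js, this would be run with "phantomjs headings.js" or "slimerjs headings.js". The point is that the sizes and visibility come from the browser's own layout and style engine, which a plain HTTP fetch cannot give you.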

Why?

  • More effectively analyse the content of pages. Lots of pages nowadays contain a huge amount of uninteresting "boilerplate" text, often in HTML elements without semantic meaning (e.g. DIV). Only by applying the CSS (and sometimes running the Javascript) can we have a computer see the page as a human would.
  • Generation of screenshots
  • Getting metadata which is written dynamically by scripts, such as Javascript-created links
  • Automated testing of web applications
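
Screenshot generation is probably the easiest of these to show. The following is a minimal sketch using the PhantomJS-style API (`page.viewportSize`, `page.open`, `page.render`), which SlimerJS also supports. The URL and viewport size are arbitrary, and `screenshotName` is a hypothetical helper invented for this example.

```javascript
// Turn a URL into a safe PNG file name (hypothetical helper).
function screenshotName(url) {
    return url.replace(/^[a-z]+:\/\//i, '')   // drop the scheme
              .replace(/[^a-z0-9]+/gi, '_')   // collapse anything unsafe
              .replace(/^_+|_+$/g, '') + '.png';
}

// Only drive a browser when running under PhantomJS or SlimerJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL (assumption)

    // Size of the invisible "window" the page is laid out in.
    page.viewportSize = { width: 1280, height: 800 };

    page.open(url, function (status) {
        if (status === 'success') {
            page.render(screenshotName(url)); // writes the PNG to disk
        } else {
            console.log('Failed to load ' + url);
        }
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```

Pages that build their content with scripts after loading may need a short delay before rendering; that is left out here to keep the sketch small.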

What tools are available?

Several. Traditionally some users have hacked together their own solutions, using either a web browser extension or a browser engine (often Webkit) embedded in a C++ program.

Here I'm looking at PhantomJS and SlimerJS.

PhantomJS and SlimerJS essentially perform the same task - to run developer-specified Javascript code in the context of an automated web browser, without using a real web browser.

PhantomJS is based on Webkit; SlimerJS is based on Mozilla / Firefox.

PhantomJS

Two versions of PhantomJS are available - the 1.9 series and the 2.0 series. The main difference is that the 2.0 series uses a more recent version of Webkit.

Unfortunately, last time I tested them, neither was very good for browsing lots of real web pages "in the wild". NB: This may have been fixed by the time you read this, so test it yourself!

  • Lots of memory usage
  • Slow
  • Prone to crashing; diagnosing crashes is very difficult
  • v1.9 has an out-of-date Webkit which has less feature support
  • v2.0 seems to leak memory very badly.

So PhantomJS is probably OK for some automated testing scenarios, particularly if you have a "single page application" or only a small number of pages to test.

But accessing large numbers of "real" web pages quickly breaks it, and it's not easy to fix.

Essentially the problem is that the Webkit which PhantomJS embeds (the Qt port) is now effectively an abandoned fork: Google has forked off to Blink, Apple's own development does not flow into it, and bugs don't get fixed upstream. PhantomJS does not usually apply bugfixes to Webkit itself.

PhantomJS is a C++ executable that includes most of Webkit inside its binary. This is OK, as it's almost completely standalone, but it means that compiling it is VERY time consuming, particularly on limited resources. For example, on a Raspberry Pi I was able to run PhantomJS, but building it there would take days (a more powerful system is really required). On a modern x86 system compiling is much quicker, but can still take an hour; the link step uses several GB of memory (not really a problem on a server, but be careful if building in a memory-limited VM).

Linux binaries are also available from the web site, which is handy :)

SlimerJS

SlimerJS is a completely different beast from PhantomJS. It is not a C++ binary and doesn't attempt to embed the engine directly in its own application. Instead, it uses an obscure feature of Firefox to run an alternative "user interface application" which provides an environment which is almost identical to PhantomJS.

This has benefits and drawbacks:

  • It is not completely headless. It doesn't require user input, but it won't work without an X server on Linux (this is easily fixed using Xvfb). Under Windows, visible windows may be shown unless running on an alternate desktop, or as a service.
  • The web browser used is really identical to the Firefox version you're using - all the same features are available.
  • If you update Firefox, SlimerJS updates too (pro: good for security; con: it might break)

SlimerJS is under moderately active development, but has a much smaller user community than PhantomJS.

  • Performance of SlimerJS (using Firefox 40) seems MUCH BETTER than PhantomJS in general
  • Stability seems much better too (although I have had a few crashes)
  • The same APIs are supported, but documentation is mostly worse (example: the filesystem objects are barely documented)

Wrap up

So there you are - headless web browsing IS a niche application, but it is very useful in its place. I like SlimerJS because its overall design approach seems to work better in the general case.

It would be interesting to have a SlimerJS / PhantomJS type application which uses Google Chrome as its web browser. I imagine one may appear, if it does not already exist.
 

3 comments:

Juan Renoldi said...

I couldn't agree with you more, I've been working on something with PhantomJS for over 2 weeks and got tired of applying workarounds, hacks and endless testing to get what I wanted, the online resources and docs were so many that I thought "this is it"... 2 hrs ago I downloaded SlimerJS and I'm already getting better results and stability on the tests I'm running.
+performance +flash rendering and with Xvfb headless usage is not an issue.

Mark Robson said...

Since I wrote this post, both PhantomJS and SlimerJS have improved, but I think the core issues are still the same.

Hopefully there is a new release of SlimerJS soon (which contains a little bit of my code)

Anonymous said...

Hi Mark,

Your observations are interesting, as I have been testing various platforms and open-source screenshot utilities over the past couple of weeks. I am considering adding a failover or fallback to the existing, proprietary capture code that powers my automated screenshot service.

In my testing, PhantomJS performs very quickly and captures almost twice as fast as my own code (very impressive!). However, I've just run a few captures back-to-back and haven't really stress tested it yet. So, even though I'm testing with v2.1.1, I'm wondering if I'll see the memory leak and crash issues you encountered (and I've read about from others as well). My service currently captures about 100 million screenshots a month and needs something extremely stable. And the fact that PhantomJS removed support for Flash, back in v1.5, means that I cannot consider it anyway. Workarounds for Flash support are too troublesome to bother with.

So, I've considered adding SlimerJS to my short list, but as you mentioned, it's not completely headless. The most surprising part of your post is that you found SlimerJS to perform "much better" than PhantomJS. Now, I really want to try it out!

Thanks for sharing your results!

Cheers,

Brandon