Tuesday 26 April 2016

How to use "cron" to run periodic scheduled jobs

"cron" is the Unix (Linux, etc) scheduler which runs regularly scheduled jobs. This post is not meant to be the "man" page (it has one of those already), but ideas how to use "cron" in a robust way.

Setting up cron jobs

There are at least *three* ways of configuring cron jobs on a modern Linux system; technically these are extensions, but they're so quasi-standard, they're even (possibly) available on FreeBSD :)

  • Per-user "crontab" file. This can be edited using crontab -e, or replaced by crontab . If you are installing system-level software, you probably don't want to use this. Each user can have only one crontab file.
  • System-wide "crontab" file, usually /etc/crontab. This is usually managed by the distribution / package manager, and you probably don't want to change this; there is only one.
  • Per-package "crontab" files - usually kept in /etc/cron.d. There are multiple files, usually one per package (or per app). This is the best solution if you're distributing software as a package, on multiple machines, and want the installation to be robust and repeatable.

The system-wide crontab files, give the option of running cron jobs as any OS user (who must exist, obviously!)

More tips on configuration:

  • A few environment variables can be set, typically  things like PATH, SHELL and importantly MAILTO
  • If you don't set MAILTO=, then the stdout and stderr will be sent by email to the user who owns the cron job.  This is seldom what they want nowadays.

When to schedule jobs

Don't schedule daily jobs between midnight and 2am. Cron jobs are scheduled in local time. The sysadmin might not have configured the machine for UTC, therefore there is a possibility that some cron jobs are repeated, missed etc, during changes of time zone or daylight saving time.

In general, I don't like to schedule daily jobs at all, unless they aren't very critical.

For important stuff, it's probably better to have an hourly job just check the (UTC) hour, so it can avoid time-zone dependency.

A feature of "cron" which is little-known, but available on most systems, is the "@reboot" jobs, which are run shortly after the system boots. Such jobs are useful to perform cleanup work that otherwise might not get done at all (e.g. a 6am job, when the user seldom has the machine powered on at 6am).

Anacron

Some (i.e. most Linux) systems have a small script called "anacron", which runs jobs hourly, daily, weekly etc, with no particularly fixed schedule (because they run in sequence with other jobs which may take some time). 

This could be used as an alternative to "cron", however it's got more limitations and, in particular, runs everything as root.

Multiple instances

"cron" mostly does not care about multiple instances, and will execute more than one copy of your job. This is almost never desirable, so if there is any probability whatsoever of this happening, you should prevent multiple instances.

The "flock" shell program might be handy (recipes are in the man page), or creating a file and exclusively locking it (e.g. in Python, C or your favourite language).

If you have a slow job (say, a backup, or something which relies on the network), and a second instance starts, there is a good chance that the 2nd instance will slow down the first instance, then a third instance starts, until the whole system comes down with too many instances of a cron job.

Stampeding herds

If your software will be installed on many machines sharing infrastructure (e.g. a network, a server, a VM host), then it may be useful to try to avoid a stampeding herd.

"cron" has the unfortunately property that it usually runs jobs at the exact same moment (usually to the same second), if identically configured on several machines. If your cron job depends on something, it can cause problems or failures.

The obvious solution is "random sleep"; sleeping a random amount of time (usually only seconds/ minutes) before doing any work which may require infrastructure access. In some cases though, a random sleep can increase resource usage, imagine this sequence:
  • Start up program
  • Load loads of huge libraries
  • connect to database server
  • random sleep (0... 300 seconds)
  • perform work which takes, maybe 5 seconds
  • exit
 The random sleep is doing more harm than good. Be sure to place "random sleep" before taking too many resources!

Another option that I've seen occasionally, is to dynamically generate the "crontab" at install-time with a random value for the minute field.

Error handling

One of the nice things about running in a "cron" job is that "log & exit" is often sufficient error handling. This is particularly true if it's an hourly job, where the next hour would be an ideal time to retry whatever failed.

Some types of errors (e.g. network problems) might just go away if we try again 1 hour later. Other types (e.g. out of disc space) might need someone to fix them.

The stdout / stderr from a cron job is often lost, so it usually important to log to a file or system log. 

Alternatives

Maybe you don't need "cron" at all?

Historically, many popular web applications have just done periodic clean-up work in response to a user request. Sometimes random requests are chosen, or on particular user activity.

Some third party services exist, to simply "hit" a PHP script (such apps are usually written in PHP, which I'm not necessarily advocating) on a regular basis.


Writing a permanently-running "daemon" process has some advantages over "cron", although it does incur more initial work, more "boiler plate" code. If work is required very frequently (say, several times per hour) it might be more convenient or have better performance to run it in a daemon.


No comments: