Mark's stream of verbiage: robustness

Showing posts with label robustness. Show all posts

Tuesday, 26 April 2016

How to use "cron" to run periodic scheduled jobs

"cron" is the Unix (Linux, etc) scheduler which runs regularly scheduled jobs. This post is not meant to be the "man" page (it has one of those already), but ideas how to use "cron" in a robust way.

Setting up cron jobs

There are at least *three* ways of configuring cron jobs on a modern Linux system; technically these are extensions, but they're so quasi-standard, they're even (possibly) available on FreeBSD :)

Per-user "crontab" file. This can be edited using crontab -e, or replaced by crontab . If you are installing system-level software, you probably don't want to use this. Each user can have only one crontab file.
System-wide "crontab" file, usually /etc/crontab. This is usually managed by the distribution / package manager, and you probably don't want to change this; there is only one.
Per-package "crontab" files - usually kept in /etc/cron.d. There are multiple files, usually one per package (or per app). This is the best solution if you're distributing software as a package, on multiple machines, and want the installation to be robust and repeatable.

The system-wide crontab files, give the option of running cron jobs as any OS user (who must exist, obviously!)

More tips on configuration:

A few environment variables can be set, typically things like PATH, SHELL and importantly MAILTO
If you don't set MAILTO=, then the stdout and stderr will be sent by email to the user who owns the cron job. This is seldom what they want nowadays.

When to schedule jobs

Don't schedule daily jobs between midnight and 2am. Cron jobs are scheduled in local time. The sysadmin might not have configured the machine for UTC, therefore there is a possibility that some cron jobs are repeated, missed etc, during changes of time zone or daylight saving time.

In general, I don't like to schedule daily jobs at all, unless they aren't very critical.

For important stuff, it's probably better to have an hourly job just check the (UTC) hour, so it can avoid time-zone dependency.

A feature of "cron" which is little-known, but available on most systems, is the "@reboot" jobs, which are run shortly after the system boots. Such jobs are useful to perform cleanup work that otherwise might not get done at all (e.g. a 6am job, when the user seldom has the machine powered on at 6am).

Anacron

Some (i.e. most Linux) systems have a small script called "anacron", which runs jobs hourly, daily, weekly etc, with no particularly fixed schedule (because they run in sequence with other jobs which may take some time).

This could be used as an alternative to "cron", however it's got more limitations and, in particular, runs everything as root.

Multiple instances

"cron" mostly does not care about multiple instances, and will execute more than one copy of your job. This is almost never desirable, so if there is any probability whatsoever of this happening, you should prevent multiple instances.

The "flock" shell program might be handy (recipes are in the man page), or creating a file and exclusively locking it (e.g. in Python, C or your favourite language).

If you have a slow job (say, a backup, or something which relies on the network), and a second instance starts, there is a good chance that the 2nd instance will slow down the first instance, then a third instance starts, until the whole system comes down with too many instances of a cron job.

Stampeding herds

If your software will be installed on many machines sharing infrastructure (e.g. a network, a server, a VM host), then it may be useful to try to avoid a stampeding herd.

"cron" has the unfortunately property that it usually runs jobs at the exact same moment (usually to the same second), if identically configured on several machines. If your cron job depends on something, it can cause problems or failures.

The obvious solution is "random sleep"; sleeping a random amount of time (usually only seconds/ minutes) before doing any work which may require infrastructure access. In some cases though, a random sleep can increase resource usage, imagine this sequence:

Start up program
Load loads of huge libraries
connect to database server
random sleep (0... 300 seconds)
perform work which takes, maybe 5 seconds
exit

The random sleep is doing more harm than good. Be sure to place "random sleep" before taking too many resources!

Another option that I've seen occasionally, is to dynamically generate the "crontab" at install-time with a random value for the minute field.

Error handling

One of the nice things about running in a "cron" job is that "log & exit" is often sufficient error handling. This is particularly true if it's an hourly job, where the next hour would be an ideal time to retry whatever failed.

Some types of errors (e.g. network problems) might just go away if we try again 1 hour later. Other types (e.g. out of disc space) might need someone to fix them.

The stdout / stderr from a cron job is often lost, so it usually important to log to a file or system log.

Alternatives

Maybe you don't need "cron" at all?

Historically, many popular web applications have just done periodic clean-up work in response to a user request. Sometimes random requests are chosen, or on particular user activity.

Some third party services exist, to simply "hit" a PHP script (such apps are usually written in PHP, which I'm not necessarily advocating) on a regular basis.

Writing a permanently-running "daemon" process has some advantages over "cron", although it does incur more initial work, more "boiler plate" code. If work is required very frequently (say, several times per hour) it might be more convenient or have better performance to run it in a daemon.

Friday, 15 April 2016

How to correctly make "latest" symlinks

"Latest symlink"

A "latest" symlink, is a symbolic link (on Linux, Unix etc) which links to the "latest" version of a file.

Suppose we have a file which takes some effort to create, which is generated periodically or in response to some stimulus (e.g. user activity). Then we want to create a "latest version" symlink.

Ideally the properties should be

latest symlink always points at the latest version (duuh!)
latest symlink always exists
latest symlink never points at a partially completed, broken, missing or otherwise bad file

Sometimes people do this in a way which won't work.

How to create a symlink

Dead easy, right? Just call the "symlink" function.


 int symlink(const char *oldpath, const char *newpath);



 DESCRIPTION

       symlink()  creates  a  symbolic  link  named newpath which contains the

       string oldpath.

But if we call "symlink" and the newpath param points to an already existing file (including a symlink) then it will return EEXIST.

The wrong (obvious) way

try:
unlink(dest)
except OSError:
pass
symlink(src, dest)

The correct (not so obvious) way


 templink = dest + '.temp'

symlink(src, templink)
rename(templink, dest)

Why?

Because we want to avoid a race condition where the destination symbolic link does not exist. Renaming files is atomic and will instantly replace the existing link with a new one; no other program can possibly see a non-existing file.

Other wrong ways

Some possibly common, but wrong (or even wronger) ways to do this

Just create the "latest" file using our "do lots of work" process directly. This is really bad, as during file creation, another process can see a partially completed file. If your code looks like this: write file header; do lots of work to create file body; write file footer, then there is a really good chance that another process sees an incomplete file.
Create a different file, then copy the file (file copying isn't atomic)

Friday, 3 January 2014

The most common cause of unavailability

Hi, Happy new year.

I've done a lot of work on high-availability systems. There is a lot of writing on high-availability systems - how to implement failover, hot-spare systems, load-balancers etc.

However, most of these seem to make an assumption: humans are infallible.

In practice, this is not always the case.

In fact, I'd say that probably about 75% of downtime is caused by human errors, cock-ups, mistakes. I'm not an expert, but I suspect that it's about the same proportion as air crashes caused by pilot (or someone else's) error.

So, it's the human, stupid. PBKAC (problem between keyboard and chair).

Here are some possible fixes:

Give human less work to do

We can avoid SOME human errors by having systems automatically configure themselves, setup, or perform sanity checks before accepting settings.

"Blindly accepting" instructions given by an idiot human, is what computers are generally very good at, and often results in chaos.

Automatic configuration is less likely to make a mistake (when was the last time you saw a DHCP server give out a duplicate IP address? Never?)
Testing a configuration provided by a human, BEFORE applying it, might prevent a mistake.

HOWEVER

Automatic configuration is a "clever" piece of software which can be difficult to test. Sometimes autoconfiguration tests are mind-blowingly difficult to set up. This means it's not likely to be very well tested.
If autoconfig gets it wrong, it's likely to do it on a large scale.
Automatic configuration probably means that the humans have no idea how it works, which may be bad.
Autoconfig might not correctly handle "very unusual" circumstances.

Give human more work to do

Perversely, we could give the humans more work. Something that someone does every day, they're unlikely to get wrong. The procedures which are done infrequently are more likely to be a problem.

Less code to test. Manual configuration.
Humans (ops engineer, support engineer) become used to getting manual configuration correct
Humans might be able to set up unusual configurations required for unusual circumstances. (Corollary: such unusual configurations probably won't have been tested!)

My view

This is a completely opinionated rant...

Humans cannot be trusted to do anything
Make everything either automatic, or hard-coded
Give users as few options as possible. Perhaps let them make some cosmetic changes (colours, window layout?). Don't let them change the TCP port number your application uses.
There is usually far less value in allowing a parameter to be changed than forcing it to a reasonable hard-coded value.
Sales engineers want to see screens full of knobs and dials? Give them placebo ones which do little or nothing (i.e. don't break it).
If you put a big red button in the reactor control room, marked "Do not press.", someone will press it sooner or later.

On testing...

If your program fails to correctly find the server when running on Chinese-language Windows Vista connected to two VPNs at once, it's probably not really important.
If your software can't find the server for anyone who typoes the address, it's more important.

Of course users sometimes want to configure "policy". This is understandable, but try not to enable them to create stupid policies (or at least, make sure it's strongly discouraged!). A policy of "delete all emails instantly and irretrievably" sounds more like a bug to me.

Tuesday, 27 October 2009

When commit appears to fail

So you're using explicit transactions. Everything appears to work (every query gives the expected result) until you get to COMMIT.

Then you get an exception thrown from COMMIT. What happened?

Usually this would be because the server has been shut down, or you've lost the connection.

The problem is, that you can't assume that the commit failed, but you also can't assume it succeeded.

A robust application must make NO ASSUMPTION about whether a failed commit did, indeed, commit the transaction or not. It can safely assume that either all or none of it was committed, but can't easily tell which.

So the only way to really know is to have your application somehow remember that the transaction MIGHT have failed, and check later.

Possible solutions:

Ignore it and deal with any inconsistencies manually, or decide that you don't care :)
Write your entire transaction such that if it is repeated having been committed previously, it is a no-op or has no harmful side effects (e.g. change INSERT to INSERT IGNORE to avoid unique index violations). This is rather difficult for complex transactions, but works for some.
Check, in your code, even if the commit apparently succeeded that it really did. If you discover that it didn't, then retry or report failure to the caller / user.
Perform another transaction to "undo" or "cancel" whatever the original transaction did if commit failed (Problem 1: this may itself fail to commit; problem 2: for a short time, an inconsistency exists)

None of these is ideal, and I'd like to think that this never happens. But if you do enough transactions, it's going to happen sooner or later (networks and servers do fail, unfortunately)

In testing, we've found that this happens so infrequently that even deliberate malice (kill -9 of mysql while loads of transactions are processing etc) fails to reproduce it reliably. In the end, a set of iptables rules which blocks the response from a commit was constructed.

Sunday, 21 December 2008

MySQL running out of disc space

Running out of disc space is not a good situation. However, if it does happen, it would be nice to have some control over what happens.

We use MyISAM. When you run out of disc space, MyISAM just sits there and waits. And waits, and waits, apparently forever, for some space to become available.

This is not good, because an auditing/logging application (which ours is) may have lots of available servers which it could send its data to - getting an error from one would simply mean that the data could be audited elsewhere.

But if the server just hangs, and waits, the application isn't (currently) smart enough to give up and try another server, so it hangs the audit process too. Which means that audit data starts to back up, and customers wonder why they can't see recent data in their reports etc.

There has to be a better way. I propose

A background thread monitors the disc space level every few seconds
When it falls below a critical level (still more than can reasonably be filled up in a few seconds), force the server to become read-only
When in this mode, modifications to the data fail, quickly, with an error code which tells the process (or develope) exactly what the problem is (Out of disc space)
When the disc space falls below some threshold, the read-only mode is turned back off.

That way, clients get what they expect - either quick service for inserts, or a fast error telling them what's wrong (Go away, I'm full, audit your data somewhere else)

Drizzle etc, should do this.

Or perhaps, it's a job for the storage engine?

Happy Christmas.

Mark's stream of verbiage