- Software only, using shared-nothing (must)
- Stores arbitrarily large (actually 2G would be enough) items of binary data accessed by a key (a short-ish string would do) specified at store-time. Items would be stored across multiple storage nodes.
- No single point of failure (preferable, a single point of failure which does not immediately impact the service would be acceptable)
- Keeps N copies of each item in different nodes, specifyable either in config or at store-time
- Automatic repairer to re-duplicate items following a storage node's demise (or administrative removal)
- Automatic expiry of old data after a time specified at store-time
- Managability: all nodes would share the same config; nodes can be administratively added and removed without any explicit config changes to other nodes.
- Storage management: nodes should be able to be configured to use a given amount of maximum space; nodes should be able to be put into "readonly" mode where new data are not accepted
- Automatic balancing of load for storage of new items
- Monitoring: some cluster-aware monitoring tools which could report on the number of available nodes, total space available, which nodes were almost full, how much data is broken and being repaired, etc.
Tahoe seems to be the closest so far.
Of course things like Amazon S3 must do at least most of the above internally, but they aren't open source, indeed you can't even buy it except as a service.