Wednesday, September 12, 2007

ZODB vs Durus

Soon after I posted my last article about ZODB performance against SQLite3, I got a polite comment from Michael Watkins reminding me of Durus. Durus is a simpler object database inspired by ZODB. Although it lacks many of ZODB's features, such as multi-threaded storage access, multiple storage backends, asynchronous IO, versions, undo and conflict resolution (according to Durus's own FAQ), it is a very capable database. So I decided to adapt my benchmark script and pit Durus against ZODB. Please note that my benchmark code is very simple and does not explore the differences between Durus and ZODB very well. A better comparison is left as an exercise to the reader. ;-)

Despite the simplicity of my test code, there was one surprising result: both databases used files as storage, but the file size for Durus was 3.7MB for a million records, while the ZODB file was 23.7MB!
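For reference, checking the resulting file sizes after a run is just a matter of calling os.path.getsize on the two storage files created by the benchmark script further down (testdb.fs and test.durus):

import os

# report the size of each storage file produced by the benchmark
for fname in ('testdb.fs', 'test.durus'):
    if os.path.exists(fname):
        print "%s: %.1f MB" % (fname, os.path.getsize(fname) / (1024.0 * 1024.0))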

Both database systems offer the option of packing their storages to reduce size, but this feature was not used here. Besides, packing a ZODB storage file requires as much free disk space as the file itself occupies, which only makes matters worse for ZODB. Please also check Michael's blog for a very interesting benchmark of Durus vs cPickle.
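For completeness, here is a minimal sketch of what packing would look like for each database. db.pack() is ZODB's standard API; the pack() method on the Durus connection is my assumption of the usual Durus API. Neither call is used in the benchmark below.

from ZODB import FileStorage, DB
from durus.file_storage import FileStorage as FS
from durus.connection import Connection

# pack the ZODB file storage, dropping old object revisions
db = DB(FileStorage.FileStorage('testdb.fs'))
db.pack()
db.close()

# pack the Durus file storage (pack() on the connection is my assumption
# of the usual Durus API; it was not exercised in this benchmark)
conndurus = Connection(FS('test.durus'))
conndurus.pack()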

Here is the code:

import time, os, glob
import ZODB
from ZODB import FileStorage, DB
import pylab as P

from durus.file_storage import FileStorage as FS
from durus.connection import Connection


def zInserts(n):
    print "Inserting %s records into ZODB" % n
    for i in xrange(n):
        dbroot[i] = {'name': 'John Doe', 'sex': 1, 'age': 35}
    connection.transaction_manager.commit()

def DurusInserts(n):
    print "Inserting %s records into Durus" % n
    for i in xrange(n):
        Droot[i] = {'name': 'John Doe', 'sex': 1, 'age': 35}
    conndurus.commit()

recsize = [1000, 5000, 10000, 50000, 100000, 200000, 400000, 600000, 800000, 1000000]
zperf = []
durusperf = []
for n in recsize:
    # remove old databases
    if os.path.exists('testdb.fs'):
        [os.remove(i) for i in glob.glob('testdb.fs*')]
    if os.path.exists('test.durus'):
        os.remove('test.durus')
    # set up ZODB storage
    dbpath = 'testdb.fs'
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    # set up the Durus database
    conndurus = Connection(FS("test.durus"))
    Droot = conndurus.get_root()
    # begin tests: time the ZODB inserts
    t0 = time.clock()
    zInserts(n)
    t1 = time.clock()
    # close and reopen ZODB's database to make sure
    # we are reading from file and not from some memory cache
    connection.close()
    db.close()
    storage = FileStorage.FileStorage(dbpath)
    db = DB(storage)
    connection = db.open()
    dbroot = connection.root()
    # time the ZODB reads
    t2 = time.clock()
    print "Number of records read from ZODB: %s" % len(dbroot.items())
    t3 = time.clock()
    ztime = (t1 - t0) + (t3 - t2)
    zperf.append(ztime)
    print 'Time for ZODB: %s seconds\n' % ztime
    # time the Durus inserts
    t4 = time.clock()
    DurusInserts(n)
    t5 = time.clock()
    # reopen the Durus database before reading it back
    conndurus = Connection(FS("test.durus"))
    Droot = conndurus.get_root()
    # time the Durus reads
    t6 = time.clock()
    print "Number of records read from Durus: %s" % len(Droot.items())
    t7 = time.clock()
    Dtime = (t5 - t4) + (t7 - t6)
    durusperf.append(Dtime)
    print 'Time for Durus with db on Disk: %s seconds\n' % Dtime

# plot insert+read time versus number of records for both databases
P.plot(recsize, zperf, '-v', recsize, durusperf, '-^')
P.legend(['ZODB', 'Durus'])
P.xlabel('inserts')
P.ylabel('time(s)')
P.show()

4 comments:

Anonymous said...

If what you say about Durus not supporting undo (aka history of object versions) is true, it probably accounts for the difference in file sizes.
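That is consistent with how ZODB's FileStorage behaves: every commit appends new revisions of the changed objects, and that history is what makes undo possible. A rough sketch of peeking at that history, assuming the testdb.fs file produced by the benchmark (undoInfo() is, as far as I know, the relevant ZODB call):

from ZODB import FileStorage, DB

db = DB(FileStorage.FileStorage('testdb.fs'))
# each entry is a committed transaction whose old object revisions
# remain in the file until the storage is packed
for txn in db.undoInfo(first=0, last=-5):
    print txn['time'], txn.get('description', '')
db.close()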

Anonymous said...

Flávio - I'm not too surprised that the curves are very similar; the design principles underlying Durus are intentionally the same as ZODB's - the authors of Durus had been using ZODB in their applications and wanted a fairly similar-looking approach. I can't speak to why they reimplemented ZODB, but perhaps there is a FAQ on that lying about somewhere.

Durus was intentionally designed to be simpler. It hits the key functionality of ZODB - easy Python centric persistence; offers a StorageServer that supports multiple clients over the network (or on the same machine), similar to ZEO; but it doesn't offer the "undo" features or MVCC and other things that ZODB offers.

Being simpler, one can sit down and read the Durus code in a single sitting and come away with a pretty good feeling for how it works (although you don't have to in order to use it).

Recently I wrote a Postgresql and generic DBAPI (sqlite coming soon) back end for Durus - the bulk of the work and test code I got done in a single sitting. I don't think I could tackle that with ZODB with the same level of success.

I would use either tool without thinking too hard about it, but by philosophy I tend to pick the simplest tool that is appropriate for the job at hand which is more or less why I use Durus.

Either would be a fine addition to anyone's toolkit. There is an awful lot of Python solution code (web oriented and otherwise) being written that could benefit from using a Python object database.

It's fun, fast, and natural feeling.

Mind you... coming from SQL there is sometimes a period of mind-set adjustment one has to go through before "aha" sets in.

Anonymous said...

Durus has a switch that causes the object records to be compressed before they are written to disk. By default, compression is on. This costs some CPU time but tends to save much space on the disk. I think that accounts almost entirely for the difference in file size between Durus and ZODB.
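The effect described here is easy to reproduce outside of either database: a pickled record like the one in the benchmark compresses very well once there is repetition to exploit. The snippet below only illustrates the principle with the standard library; it is not Durus's actual code path, which compresses each object record individually.

import zlib, cPickle

record = {'name': 'John Doe', 'sex': 1, 'age': 35}
# pickle a batch of identical records, similar to what ends up in the storage file
raw = ''.join(cPickle.dumps(record, 2) for i in xrange(1000))
packed = zlib.compress(raw)
print "raw pickles: %d bytes, compressed: %d bytes" % (len(raw), len(packed))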

Wheat said...

It would be interesting to run these benchmarks against PyTables. You don't get features such as concurrent writes with PyTables, but you do produce self-describing binary data in HDF5, so it's possible to share data with non-Python programmers. HDF5 also does interesting things with data compression, so it should generate smaller data files:

http://www.pytables.org

And the ZODB does rock :)
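For anyone curious, a rough sketch of storing the same record in a compressed HDF5 table with PyTables might look like the following; this uses the PyTables API as I recall it, and the file name, column sizes and compression settings are just illustrative choices, not something from the benchmark above.

import tables

class Record(tables.IsDescription):
    name = tables.StringCol(16)
    sex = tables.Int32Col()
    age = tables.Int32Col()

h5 = tables.openFile('test.h5', mode='w')
# zlib-compress the table's data chunks on disk
filters = tables.Filters(complevel=5, complib='zlib')
table = h5.createTable('/', 'records', Record, filters=filters)

row = table.row
for i in xrange(1000):
    row['name'] = 'John Doe'
    row['sex'] = 1
    row['age'] = 35
    row.append()
table.flush()
h5.close()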
