POSKeyError During Commit

by Ross Patterson last modified 2012-11-21T20:08:12+02:00
A terrifying tale of ZODB errors that only showed during the hardest time to debug.

Believe it or not, I'm a POSKeyError virgin, I've never run into one... Until now!

I was working on a very small Plone 3.1.1 to 4.2.2 upgrade. Not too many add-ons, simple skin/theme customizations, 1,810 objects in ~800MB. No problem, right?

Wrong. I got a POSKeyError after the upgrade step that converts files and images to BLOBs. The wierd part is that it only happened during the transaction.commit(), not during ZODB object access:

  File "/opt/src/eggs/Zope2-2.13.18-py2.7.egg/Zope2/App/startup.py", line 301, in commit
  File "/opt/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_manager.py", line 89, in commit
    return self.get().commit()
  File "/opt/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 329, in commit
  File "/opt/src/eggs/transaction-1.1.1-py2.7.egg/transaction/_transaction.py", line 443, in _commitResources
  File "/opt/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZODB/Connection.py", line 572, in commit
    oid, serial, transaction)
  File "/opt/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZODB/BaseStorage.py", line 416, in checkCurrentSerialInTransaction
    committed_tid = self.getTid(oid)
  File "/opt/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 770, in getTid
    pos = self._lookup_pos(oid)
  File "/opt/src/eggs/ZODB3-3.10.5-py2.7-linux-x86_64.egg/ZODB/FileStorage/FileStorage.py", line 403, in _lookup_pos
    raise POSKeyError(oid)
POSKeyError: 0x055350

I tried all the tools in ZODB.scripts and zc.zodbdgc but they couldn't tell me anything about the oid in question. That's probably because it was a different oid every time I attempted the upgrade step on the same ZODB.

Without being able to use the typical tools to find out what ZODB objects were referring to those oids I was really stuck. Then David Glick came to my rescue. Actually, he was staying at my apartment so I drafted him. :-) He suggested that I might have to create a mapping of oids to stack frames at the time when objects were added to the transaction so that I could figure out what had been going on once I knew which oid would be the problem. This worked very well and much thanks to David.

I added the following to the place where the objects were being added to the transaction, ZODB.Connection.Connection.readCurrent():

if not hasattr(self, 'XXX'):
    self.XXX = {}
self.XXX[ob._p_oid] = sys._getframe()

Then once I had my POSKeyError in pdb.post_mortem() I could invoke pdb at the place where the offending oid had been added:

import pdb; pdbr = pdb.Pdb() ; pdbr.reset() ; pdbr.interaction(self.XXX[oid], None)

Using that I was able to find out what objects were involved. They were all portal_catalog/path index Trees.IIBTree.IITreeSet objects. What's odd is that the error happens even if I do a catalog clear and rebuild first.

After much more debugging and discussion, we determined that the ZODB.Connection.Connection.readCurrent() support was involved somehow. Also, as I deleted the images that were involved I noticed another migration error earlier on went away for each image I removed. That's the key, when the BLOB migration fails for one file or image, it uses a savepoint to rollback just that image's migration and proceed with the rest. It sure seems like there's a bug in the ZODB where when a savepoint is rolled back the objects registered with the transaction are cleaned up but the objects registered for readCurrent are not. Gee, someone should write a test case. :-)

In the meantime, it turns out the reason the image migrations were failing was something else all together but I thought I should share this debugging technique for capturing stack information when you can't know which stack you need to inspect until later. I also hope this ZODB bug gets squashed.