Import

For merge we need four indexes: three book identifiers, ISBN, OCLC and LCCN, plus the normalised title truncated to 25 characters as a fallback. These indexes are implemented using dbm hash files. The problem with this is that only one process can read and write the indexes at a time. The solution is to build an import server that handles all reading and writing of the indexes. The most obvious way of implementing the interprocess communication is HTTP. The import server will be written using web.py.
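As a rough sketch of what one of these indexes might look like (the file path, the normalisation rule, and the '_'-joined value encoding here are assumptions for illustration, not the actual implementation):

```python
import dbm.dumb
import os
import re
import tempfile

def normalise_title(title):
    # Hypothetical normalisation: lower-case, drop everything except
    # letters, digits and spaces, then truncate to 25 characters to
    # form the fallback merge key.
    return re.sub('[^a-z0-9 ]', '', title.lower())[:25]

# One dbm hash per index; the filename is illustrative. Each value
# maps a key (an ISBN, OCLC, LCCN or truncated title) to a
# '_'-joined list of database IDs.
path = os.path.join(tempfile.mkdtemp(), 'title_index')
idx = dbm.dumb.open(path, 'c')

def add_to_index(index, key, db_id):
    ids = index.get(key, b'').decode()
    ids = set(filter(None, ids.split('_')))
    ids.add(str(db_id))
    index[key] = '_'.join(sorted(ids, key=int))

def lookup(index, key):
    ids = index.get(key, b'').decode()
    return [int(i) for i in ids.split('_') if i]

add_to_index(idx, normalise_title('Phenomenology of Perception'), 1366447)
add_to_index(idx, normalise_title('Phenomenology of Perception'), 8071041)
```

Because a dbm file can only safely be open for writing in one process at a time, all of this has to sit behind the single import server process.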

The HTTP method GET is used for searching the indexes, and POST for adding a new record to the database and updating the indexes.

The import server handles updating the database to avoid conflicts when generating new keys for authors and editions.
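Centralising writes in one process makes conflict-free key allocation straightforward; a minimal sketch of the idea (the key patterns and the counter seeding are assumptions):

```python
import itertools

# Hypothetical per-type counters; in practice these would be seeded
# from the highest key already present in the database.
counters = {'edition': itertools.count(1), 'author': itertools.count(1)}
patterns = {'edition': 'OL%dM', 'author': 'OL%dA'}

def new_key(kind):
    # Because only the single import server process allocates keys,
    # two concurrent imports can never be handed the same key.
    return patterns[kind] % next(counters[kind])
```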

Index fields are passed as GET parameters; lists are joined with '_'. The response is JSON. For example:

For the fields: {'isbn': ['0415045568', '0391025511'], 'title': ['phenomenology of percepti']}

The URL is: http://wiki-beta.us.archive.org:9020/?isbn=0415045568_0391025511&title=phenomenology+of+percepti

And the response is:

{"fields": {"isbn": ["0415045568"], "title": ["phenomenology of percepti"]}, "pool": {"isbn": [1366447, 10187591], "title": [1366447, 8071041, 10146455, 10187591, 10198188, 10198270, 10568028, 13557619, 13620735, 17343673]}}

The numbers in the response are database IDs.
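A client query could be assembled and the reply parsed along these lines (the host and port come from the example above; the helper names are made up, and the reply is the example response rather than a live request):

```python
import json
from urllib.parse import urlencode

def build_query_url(fields, base='http://wiki-beta.us.archive.org:9020/'):
    # Lists of index values are joined with '_' before URL-encoding.
    params = {k: '_'.join(v) for k, v in fields.items()}
    return base + '?' + urlencode(params)

fields = {'isbn': ['0415045568', '0391025511'],
          'title': ['phenomenology of percepti']}
url = build_query_url(fields)

# Parsing a response, shown on a shortened copy of the example reply:
reply = ('{"fields": {"isbn": ["0415045568"],'
         ' "title": ["phenomenology of percepti"]},'
         ' "pool": {"isbn": [1366447, 10187591], "title": [1366447]}}')
pool = json.loads(reply)['pool']
# The pool holds candidate database IDs per index; merging them gives
# the set of records worth comparing against the incoming record.
candidates = sorted(set(pool['isbn']) | set(pool['title']))
```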

Robert's wishlist

  • Lending waiting list
  • Multiple copies to loan, pull books from shelves in libraries
  • Search inside filtered by collection
  • All archive.org books on Open Library

Bugs

Todo

Notes

clear out ivm29

onix_wiley_crawl 
elsevier_covers_crawl
onix_princeton