programming journal

2015-04-11 Saturday

Halifax CSV export works, but the description is collapsed into a single field; the web version is better.
CSV export from Halifax doesn't include today's transactions. I bet it is the same for TSB and Lloyds. With a date range of 10 April to 11 April it has yesterday's transactions, but not today's. The export interface throws an error if you put in tomorrow's date.

2015-04-09 Thursday

The jQuery replacement for select boxes: https://select2.github.io/
http://flask-admin.readthedocs.org/en/v1.1.0/

2014-02-27 Thursday

Library for HTML in Python: https://github.com/Knio/dominate
Found via http://thechangelog.com/dominate-html-python/
http://amsul.ca/pickadate.js/ also from http://thechangelog.com/

2012-04-11 Wednesday

Thinking about date range searches, could use calendar icon from bootstrap. Found interesting existing date range dropdown code.
Petabox fixed truncated book cover bug, need to rerun scribe loader to add missing books

2011-12-27 Tuesday

Debugged problem connecting to Library of Congress FTP from ol-edward-dev: solved by running modprobe ip_conntrack_ftp. Thought this might solve the catalog-upload.archive.org problem, but it didn't.
Looked at OCLC WorldCat FAST - no download link. Requested a copy of data via feedback form, received two out-of-office replies.

2011-10-31 Monday

Fixed bookreader bug and sent pull request: https://github.com/openlibrary/bookreader/pull/15
Need to fix bad chars for solr rebuild

2011-01-15 Saturday

total: 153500
no files.xml: 393
no scandata: 556
no MARC binary: 401
not a book: 2505
bad MARC: 0
good: 150218

2011-01-12 Wednesday

trying to understand ebook_count and fix it so it updates

2010-12-16 Thursday

producing new database dump: ia331511:/0/edward/datadump

2010-12-08 Wednesday

Current data dump location: ia331504:/0/edward/datadump

Find works algorithm:

First we start with a list of books, for example books by the same author.
Build map of title redirects:
1. Find existing redirects for works by this author.
2. For each work find redirects that point to work
3. Map from original title of redirect to work title
Skip any without a title.
Combine title_prefix and title.
Strip any trailing dots or spaces
Remove phrases in parentheses: Edition, Press, Print, Plays, Collection, Publication, Novels, Mysteries, Book Series, Classics Library, Classics, Books
Normalize title:
1. Strip text at end in square brackets
2. Remove "and" if surrounded by spaces
3. Lowercase
4. Remove "the" or "a" followed by a space from the start
5. Remove non-word characters including spaces
Retrieve the work_title. Get it from the original source if the record is from MARC.
Don't use the work_title if it is one of these: Publications, Works. English, Missal, Works, Report, Letters, Calendar, Bulletin, Plays, Sermons, Correspondence, Bill, Bills, Selections, Selected works, Selected works. English, The Novels, "Laws, etc"
Normalize the work title using the same method as normalizing the title.
Build map of normalized book titles to normalized work titles, using:
- frequency list of pairs of normalized title and work_title, skip any book without a work title
- frequency list of normalized titles
Build a mapping from normalized work_title back to frequency list of work titles.
Add normalized titles and work_titles from map of existing works and redirects to reverse work_title table and title map.
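
A rough Python sketch of the title normalization and the two frequency lists described above. It is not the work finder code itself: the dict keys (title_prefix, title, work_title) and the plain concatenation of title_prefix and title are assumptions, the skip list is only a subset of the one in the notes, and the parenthesised-phrase step is left out.

```python
import re
from collections import Counter

# Subset of the work titles the notes say not to use.
SKIP_WORK_TITLES = {"Publications", "Works", "Report", "Letters", "Plays", "Selections"}


def normalize_title(title):
    """Apply the five 'Normalize title' steps in the order listed."""
    # 1. Strip text at the end in square brackets.
    title = re.sub(r"\s*\[[^\]]*\]\s*$", "", title)
    # 2. Remove "and" when it is surrounded by spaces.
    title = re.sub(r" and ", " ", title)
    # 3. Lowercase.
    title = title.lower()
    # 4. Remove a leading "the " or "a ".
    title = re.sub(r"^(the|a) ", "", title)
    # 5. Remove non-word characters, including spaces.
    return re.sub(r"\W", "", title)


def count_titles(books):
    """Return (title_counts, pair_counts) for an iterable of book dicts."""
    title_counts = Counter()
    pair_counts = Counter()
    for book in books:
        if not book.get("title"):
            continue  # skip any without a title
        # Assumption: title_prefix is simply prepended to title.
        full = (book.get("title_prefix") or "") + book["title"]
        # Strip trailing dots or spaces before normalizing.
        norm = normalize_title(full.rstrip(". "))
        title_counts[norm] += 1
        work_title = book.get("work_title")
        # Skip books without a usable work title.
        if work_title and work_title not in SKIP_WORK_TITLES:
            pair_counts[(norm, normalize_title(work_title))] += 1
    return title_counts, pair_counts
```

Because step 2 runs before lowercasing, an "And" with a capital letter is not removed here; that follows the order in the notes and may or may not be what the real code does.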

2010-11-22 Monday

Mike recommends gevent for Python event based programming. It uses greenlets.
Installed gevent on virtual machine iavm23
design for open library zigzag iterator

2010-10-22 Friday

Talked to SJ Klein, asked for feature requests, he says add age groups. Some children's libraries have catalogued books for reading level, we should have that information in Open Library.

We should add popularity, possibly how many times a book has been borrowed from libraries, help to filter out boring stuff.

I had the idea of sort by length of Wikipedia article.

More language information, English level 3, Spanish level 4, that kind of thing. Interesting to note that something is a picture book, language doesn't matter as much.

Include information about number of pictures.

Image search example: http://ia340940.us.archive.org/~edward/images.php?file=/1/items/americanstandard00ameriala/americanstandard00ameriala_abbyy.gz
New users: http://openlibrary.org/recentchanges/register#all

2010-10-19 Tuesday

~~http://openlibrary.org/search/inside/miscellaneouswo00howagoog?q=feveral gives an error~~ fixed
~~move to using /fulltext/inside.php on datanodes, latest inside.py is deployed.~~ done
handle fields in search inside queries
write test for search inside
make sure subjects don't get lost when merging works
try sharding solr on ia331509, ia331510 and ia331511
~~move solr host to openlibrary.yml~~ done

2010-10-14 Thursday

crawling Amazon, starting 2009-01-01, 10GB per ARC. dumping files in /0/amazon/

2010-10-13 Wednesday

should try termVectors=on, termPositions=on and termOffsets=on and re-index inside solr.

2010-09-29 Wednesday

bad MARC, looks like it came from OL: lostincitystorie00jone, daysofatonement00greg, sanfranciscobaya00pneu, metroland00barn
OL key is in metasource.xml

2010-09-28 Tuesday

sample solr in ~/scratch/sample_solr_search

2010-09-26 Saturday

These two look interesting: http://github.com/fizx/parsley http://scrapy.org/

2010-09-14 Tuesday

Problem with duplicates in search index. Same count, different case. Need to debug.
solr_author_merge.py throws AuthorRedirect. The authors are correct in the database, wrong in the log. Need to catch AuthorRedirect and use log.
scribe loader crashed on MARC record with blank 100a: lincolncentenary00horn

2010-08-30 Monday

Using the word vendor instead of bookseller
Thinking about identifiers

2010-08-27 Friday

Need to check abbyy: americanpsycholo00amer and americanpublicop00erik - answer: print disabled
We could use the word 'supplier' instead of 'bookseller'

2010-08-12 Thursday

Thinking about speeding up solr update using a cache of data from the Open Library database.

2010-07-29 Thursday

Search walmart for books: http://www.walmart.com/search/search-ng.do?search_constraint=3920&search_query=ISBN
http://www.booksamillion.com/product/0143038257
http://search.barnesandnoble.com/mobile/e/9780143038252

2010-06-16 Wednesday

Overdrive: 75810 books, 46139 with an original ISBN
Amazon lookup code is in: ia331504:/2/edward/20century/avail_check.py

2010-06-08 Tuesday

Fix search page titles
Rebuild solr index
Work subjects file for Anand
Maybe make /search?ftokens=pfbyekbzsflz redirect to /subjects/social_life_and_customs

2010-06-04 Friday

Invalid MARC XML: datafield tag="g050"

2010-06-01 Tuesday

Todo:

Fix covers, load missing ones for print disabled books
Make scribe handle works
Move bits off ia331504
Documentation
Read e-mail
Write biography
Read CVs for new developer.

2010-05-11 Tuesday

Sent fresh job description to George
Trying to load missing Google books without scandate
Stopped Unhandled Exception by disabling DataTable cookie

Breakdown of print disabled books not loaded:

2206 items that end with mbp and have no MARC XML
1983 duplicate scans
662 items with MARC that says they aren't books
288 items not included in the search engine because of a bug in infogami

Need to write mail explaining the situation with print-disabled and MARC records that say they're not books.

Need to update subject index with Protected DAISY and Accessible books.

2010-05-04 Tuesday

Found 53 editions that are made up of duplicates within the print disabled collection, sent a mail about it.
Pool can't support new longer book keys, uses varchar(16), added code to turn them into old style book keys
Import server runs from /1/edward/src/openlibrary/openlibrary/catalog/importer/import_server.py like this: "python import_server.py 9020"
Running a bunch of code to load print disabled books:
- adding print disabled identifiers to existing records: 6.79%: 3615 of 53591 (saves using save_many every 50 books)
- adding the 'Accessible book' subject to works with scans: 6.35% complete (using save_many, every 100 works)
- loading new print disabled books: 2.89%: 776/26828 (not using save many, saves each book individually)
- solr_update is watching the log for changes and updating the search engine in chunks of 100 works
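
The batching pattern used by the first two loaders (save every 50 or 100 records) can be sketched like this. The save_many argument here is a stand-in callable taking a list of documents and a comment; its real signature in Open Library is not shown in these notes, so treat it as an assumption.

```python
def in_chunks(items, size):
    """Yield successive lists of at most `size` items."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def save_in_batches(docs, save_many, size=50, comment="add print disabled identifier"):
    """Call save_many once per chunk and return the number of calls made."""
    calls = 0
    for chunk in in_chunks(docs, size):
        # save_many is a hypothetical callable: save_many(list_of_docs, comment)
        save_many(chunk, comment)
        calls += 1
    return calls
```

With 120 documents and size=50 this makes three calls (50, 50 and 20 documents) instead of 120 individual saves, which is the difference between the first two loaders and the third.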

2010-05-03 Monday

Waiting on Anand's database migration
Downloading all print disabled MARC, ready to analyse
Might need to rewrite code to use new prefixes for books and authors
Give priority to author merging
Avoid loading books with bad dates in the distant future, use 260c instead
Don't give spelling suggestions that have no results
2215 of the printdisabled books don't have MARC
639 print disabled books aren't monographs
77160 books in print disabled collection
74306 will be loaded
Brewster asked about SFPL MARC records on openlibrary.org, not part of ol_data collection
Alexis moved all data in the marcrecords collection to ol_data.
Found 53 scanned books without files.xml, Hank says it is a permissions problem
Redownloaded 53 scanned books with missing files.xml, now down to 43 books with missing files.xml
Hank says: "Nearly all of these are on ia301516:/1/items where a secondary-to-primary disk rescue is in progress."
Ralf says we need to replace a failing hard drive in ia331507
I'm rebuilding the author index from an old author dump.
No work record for 1984 by George Orwell.
can't run work finder or solr update while JSON API is broken, it can't handle queries that include something like title=null.
Jeff pointed out we are missing an edition for Cibola by Alice Walworth Graham, I think it is because of the database migration.
work Solr was missing textSpell fieldtype, always suggesting errors in searches not written in lowercase - fixed
MARC record says book is an electronic resource when it isn't.

2010-04-30 Friday

To load sfpl book descriptions I need to have mapping from source_record -> edition -> work

2010-04-29 Thursday

CD to delete

2010-04-28 Wednesday

Fix up subjects, dump commas:

World War, 1939-1945 --> World War (1939-1945)
Nigeria -- Civil War, 1967-1970 -> "Nigeria Civil War (1967-1970)" and "Nigeria"
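
A minimal sketch of the comma fix-up, covering only the two shapes shown above (a trailing ", YYYY-YYYY" date range, optionally after "--" subdivisions). The function name and the decision to leave every other subject untouched are assumptions.

```python
import re

# "Name, 1939-1945" at the end of the last subdivision.
DATE_RANGE = re.compile(r"^(.+), (\d{4}-\d{4})$")


def fix_subject(subject):
    """Return a list of comma-free subjects for one raw subject string."""
    parts = [part.strip() for part in subject.split("--")]
    match = DATE_RANGE.match(parts[-1])
    if not match:
        return [subject]  # anything else is left alone in this sketch
    name, years = match.groups()
    parts[-1] = f"{name} ({years})"
    # The combined subject first, then each leading subdivision on its own.
    return [" ".join(parts)] + parts[:-1]
```

For the two examples this gives ["World War (1939-1945)"] and ["Nigeria Civil War (1967-1970)", "Nigeria"].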

Need to rebuild subject list from latest dump.

Get to use: from openlibrary.solr.work_subject import get_marc_subjects

Building fresh file of work subjects

Building file of changes for Anand

Work finder is running, but sometimes crashes, recording some logs, might help debugging

Solr update is running, it runs work finder for author merges

Downloading MARC and META XML for printdisabled books

2010-04-26 Monday

Mary pointed out missing books:

Made HTML dump of Henrik Ibsen works to assist debugging of work finder.

Challenge with the work finder is splitting up these works:

OL52404W: From Ibsen's workshop
OL52412W: The collected works of Henrik Ibsen

2010-04-22 Thursday

Time to make an edition search index. 24012766 editions with authors in /3/edward/db_dump/edition_dump2

Need to add work description to search engine. Maybe add created and last modified.

After work finder runs make work updater a two step process

2010-04-21 Wednesday

Work finder has difficulty splitting existing works.

OL52397W should be split into three works:

Brev 1845-1905
Brev veksling med Christiania Theater, 1878-1899
Henrik Ibsens brevveksling med Christiania Theater 1878-1899
The correspondence of Henrik Ibsen

Add logging to work finder code for updating works.

Known problems

Add a pager to subject and author searches.

We don't have ISBN 0001982370 in Open Library yet.

Author merge crash, on "assert cur['type'] == '/type/author_role'" - fixed.

Usergroup and /type/volume edit pages don't work on upstream.

Karen wants title and subtitle as two input boxes in librarian mode.

WorkBot messed up title of work

Need to avoid having commas in subjects, they get broken when saving.

Need to try and include place names in time subjects about wars.

we now have subtitle support on the work edit page

I just changed the title of 'Flatland' to 'Flatland: a romance of many dimensions'. When I hit save it gets split on the ': ' into title and subtitle, here is my edit:

http://upstream.openlibrary.org/works/OL118420W/Flatland?b=5&m=diff
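
The split on save presumably looks something like the sketch below. Splitting on the first ': ' only is an assumption, and split_title is a made-up name, not the function in the edit page code.

```python
def split_title(full_title):
    """Split 'Title: subtitle' on the first ': ' into (title, subtitle)."""
    title, sep, subtitle = full_title.partition(": ")
    # No separator means there is no subtitle.
    return title, (subtitle if sep else None)
```

So 'Flatland: a romance of many dimensions' becomes the title 'Flatland' and the subtitle 'a romance of many dimensions', while a plain 'Flatland' keeps a subtitle of None.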

Can't delete work with editions.

    if delete:
        # Delete the edition first, if there is one.
        if self.edition:
            self.delete(self.edition.key, comment=comment)

        # The work is only deleted when it has no editions.
        if self.work and self.work.edition_count == 0:
            self.delete(self.work.key, comment=comment)
        return

multiple work titles: epistolaeadattic03ciceuoft
multiple work titles: epistolaeadattic00ciceuoft
multiple work titles: epistolaeadattic02ciceuoft
multiple work titles: epistolaeadattic04ciceuoft
multiple titles: resoflegkentucky00kent
multiple titles: thermodynamicsh05woodgoog
multiple titles: ueberdiesprache00burmgoog
multiple titles: volume00goog
multiple titles: worksofrightrevb00strauoft
multiple titles: yesterdaytodaya08bickgoog

Mad MARC

Big TOC
LC classification: serial has 050, but only 050d

MARC code breakdown:

{'`': 1, 'a': 188926, ' ': 5, 'c': 921, 'b': 153, 'e': 112, 'g': 6, 'i': 1, 'k': 24, 'j': 17, 'm': 1, 'p': 488, 's': 8, 't': 967, 'v': 226, 'y': 179, 'x': 408}
{'a': 598, ' ': 830, 'c': 284, 'b': 13, 'd': 63, 'm': 110279, 's': 80352, 'p': 15, 'S': 9}

18920 works_with_bad_subjects in /3/edward/db_dump

http://upstream.openlibrary.org/search?q=%22thesis+for+BA%22

archive.org metadata table is sometimes out of date. Search engine is more current.

There are 8402 scanned books that have 'MARC Source' in the format field, but don't have ';MARC;'.

I'm looking at MARC records of scanned books that haven't been loaded into Open Library. The first one I've come to is:

Afganistan by Angus Hamilton (1874-1913) http://www.archive.org/details/00hamigoog http://upstream.openlibrary.org/show-marc/00hamigoog/00hamigoog_meta.mrc:0:2387

The title field, tag 245 appears twice in the MARC record:

245 $6 01 $a Afganistan $c A. Gamilʹton ; perevod s angliĭskago S.P. Golubinova.
245 $6 01 $a Афганистанъ $c А. Гамильтонъ ; переводъ съ английскаго С.П. Голубинова.

I think the best thing to do with a record like this is use the first title.

Next year we can work out multilingual records and load the second title as well.
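
A sketch of the "use the first title" rule. The record layout here (a list of (tag, subfields) pairs, with subfields as (code, value) pairs) is an assumption made for illustration, not the format the MARC parser actually returns.

```python
def first_title(fields):
    """Return subfield $a of the first 245 field, or None if there is none."""
    for tag, subfields in fields:
        if tag != "245":
            continue
        # Only the first 245 is considered; later ones are ignored.
        for code, value in subfields:
            if code == "a":
                return value.strip()
        return None
    return None


record = [
    ("245", [("6", "01"), ("a", "Afganistan"), ("c", "A. Gamilʹton")]),
    ("245", [("6", "01"), ("a", "Афганистанъ"), ("c", "А. Гамильтонъ")]),
]
print(first_title(record))
```

For the record above this picks the romanized title and ignores the Cyrillic one, which would be the field to pick up later for multilingual records.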

2010-04-20 Tuesday

Spotted that solr_update.py was ignoring save_many. Extracted a list of 46357 works updated/created by the WorkBot, passing to search engine.

Todo

Before release

change solr_update.py to handle save_many. Not difficult.
make work finder run for author merge. Maybe from solr_update.py when it sees an author merge?
update MARC import to search for an existing work, or create a new work, maybe just run the work finder?
add resume after crash to work finder
support user created subjects in search engine
add missing subjects to work pages
remove bad subjects like /subject/History from work pages
updates to subject search index

Maybe before release

load books with MARC that have the scan_date set to null
load Google books without MARC records
load missing books from Amazon using sitemap

After release

currently there are a handful of multi-volume works in Open Library, load the rest
research serials and load into Open Library. Find number of serials missing from Open Library
add subjects from amazon to work subject field
add ability to search for encrypted DAISY files.

More scanned books

~/scans/goog_no_marc_not_ol contains a list of 317,697 Google books without a MARC record that are not in OL.

Can't transfer the file by e-mail or scp.

2010-04-19 Monday

873025 editions with scans

1566659 scans possible to load, matching these criteria:

scanner is not null
noindex is null
mediatype = 'texts'
curatestate = 'approved' or curatestate is null
scandate is not null
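
The same criteria written as a Python predicate, assuming each row of the archive.org metadata table is available as a dict with these column names and with SQL NULL mapped to None.

```python
def is_loadable(item):
    """True if a metadata row matches all five criteria listed above."""
    return (
        item.get("scanner") is not None
        and item.get("noindex") is None
        and item.get("mediatype") == "texts"
        # curatestate = 'approved' or curatestate is null
        and item.get("curatestate") in ("approved", None)
        and item.get("scandate") is not None
    )
```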

3133 census items skipped.

Planning to download Meta XML, MARC XML and MARC binary for all 1566659 scanned text items.

First building list of machines where items live. Total run time looks to be 4.5 hours.

Of these 903,127 have 'MARC Binary' in the format field.

1,010,205 records have MARC in the format field.

2010-04-16 Friday

Search for OLID

Redirects to correct page. Done

Scanned books without authors

Scanned books without authors don't have works, so they don't make it into the search engine.

This book has an author on archive.org, but not on Open Library:

Guidance manual for landfill sites receiving municipal waste

The scan was added to an existing book record that didn't have an author.

Need to try adding author data to scanned books if it is available. Need to find some numbers.

Confused work finder

Pride and prejudice

Contains Emma, Pride and prejudice, Sense and sensibility and Mansfield Park because of books like this:

title: The novels work title: Pride and Prejudice
title: The novels. work title: Mansfield Park
title: The novels. work title: Mansfield Park
title: The novels. work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Sense and sensibility
title: The novels work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Pride and prejudice
title: The novels. work title: Persuasion
title: The novels. work title: Pride and prejudice
title: The novels. work title: Northanger Abbey

2010-04-15 Thursday

Auto-merge authors

Extracted list of authors from data dump. Now trying to build a file of authors grouped by birth and death dates.

Found 21640 (birth, death) pairs with possible matching authors and 261394 authors.
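
Grouping by dates could be sketched as below. The field names (key, birth_date, death_date) are assumed for illustration, and only exact string matches on both dates count as a pair.

```python
from collections import defaultdict


def group_by_dates(authors):
    """Map (birth, death) to author keys, keeping only pairs shared by 2+ authors."""
    groups = defaultdict(list)
    for author in authors:
        birth = author.get("birth_date")
        death = author.get("death_date")
        if birth and death:  # authors missing either date are ignored
            groups[(birth, death)].append(author["key"])
    return {dates: keys for dates, keys in groups.items() if len(keys) > 1}
```

Each surviving group is only a list of candidates; names still need comparing before any merge, since unrelated authors can share both dates.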

Scanned books in work search

Extracted list of authors touched by ImportBot since 2010-03-01, total is 106934. Includes some non-scanned books. Running work finder for each author.

Author merge

Live, but not complete, still need to debug solr updates.

Example book

2010-04-14 Wednesday

Amazon

Downloaded sitemap_dp_*.xml.gz sitemaps. 2504 files, total size is 1.6G

40,000 URLs per sitemap file.

Generating a list of ISBN from database dump, to compare with Amazon sitemap data.

Spell suggest

Bogus answers from spell suggest if search terms aren't in lowercase.

ImportBot removes work link

This shouldn't happen

Not sure why this happened, I can't reproduce it.

Scanned books without works

Here is a list: ia331504:/2/edward/solr/2010-04-07/missing_work

Made a list of editions that once had a work, but no longer do, 253 of them. Most are edits by the ImportBot in December.

Rebuilding list of missing_works to include previous versions, sorted by version (done).

There are 203,701 scanned editions without works attached.

Looking at how many editions have a title that the work finder skips.

Scanned books that are skipped by the work finder by title:

24 Correspondence
59 Letters
472 Publications
556 Report
69 Plays
25 Calendar
676 Works
315 Sermons
662 Bulletin

no author: 125653
one author: 41861
multiple authors: 33305

Merge authors

Should find some way to auto merge, for example Sándor Ferenczi.

Search pages

Need to add paging to subject and author search pages. Should probably tokenize date fields so I can search for the term 'century' and the year within a full date.

2010-04-13 Tuesday

Amazon

Trying to download Amazon best seller pages by crawling the subject facets in the search engine. Turns out I was only grabbing top level facets, need to go deeper.

Use sitemap instead of search.

Missing works for scanned books

I'm missing works for some scanned books because they don't have an author. Need to add works without an author.

show-marc

Should add link to generate MARC XML for imported MARC records and include nicer display of MARC records for IA MARC records.

Two sources have identical MARC records, the IA record was retrieved from OL.

merge authors

pass action="merge-authors" to save_many