2015-04-11 Saturday

  • Halifax CSV works, but description is collapsed into a single field, web is better
  • CSV export from Halifax doesn't include today's transactions. I bet it is the same for TSB and Lloyds. With the date range set to 10 April to 11 April, the export has yesterday's transactions, but not today's. The export interface throws an error if you put in tomorrow's date.

2015-04-09 Thursday

2014-02-27 Thursday

2012-04-11 Wednesday

  • Thinking about date range searches, could use calendar icon from bootstrap. Found interesting existing date range dropdown code.
  • Petabox fixed truncated book cover bug, need to rerun scribe loader to add missing books

2011-12-27 Tuesday

  • Debugged problem connecting to Library of Congress FTP from ol-edward-dev: solved by running modprobe ip_conntrack_ftp. Thought this might solve the catalog-upload.archive.org problem, but it didn't.
  • Looked at OCLC WorldCat FAST - no download link. Requested a copy of data via feedback form, received two out-of-office replies.

2011-10-31 Monday

2011-01-15 Saturday

  • total: 153500
  • no files.xml: 393
  • no scandata: 556
  • no MARC binary: 401
  • not a book: 2505
  • bad MARC: 0
  • good: 150218

2011-01-12 Wednesday

  • trying to understand ebook_count and fix it so it updates

2010-12-16 Thursday

  • producing new database dump: ia331511:/0/edward/datadump

2010-12-08 Wednesday

  • Current data dump location: ia331504:/0/edward/datadump

Find works algorithm:

  1. First we start with a list of books, for example books by the same author.
  2. Build map of title redirects:
    1. Find existing redirects for works by this author.
    2. For each work find redirects that point to work
    3. Map from original title of redirect to work title
  3. Skip any without a title.
  4. Combine title_prefix and title.
  5. Strip any trailing dots or spaces
  6. Remove phrases in parentheses: Edition, Press, Print, Plays, Collection, Publication, Novels, Mysteries, Book Series, Classics Library, Classics, Books
  7. Normalize title:
    1. Strip text at end in square brackets
    2. Remove "and" if surrounded by spaces
    3. Lowercase
    4. Remove "the" or "a" followed by a space from the start
    5. Remove non-word characters including spaces
  8. Retrieve the work_title. Get it from the original source if the record is from MARC.
  9. Don't use the work_title if it is one of these: Publications, Works. English, Missal, Works, Report, Letters, Calendar, Bulletin, Plays, Sermons, Correspondence, Bill, Bills, Selections, Selected works, Selected works. English, The Novels, "Laws, etc"
  10. Normalize the work title using the same method as normalizing the title.
  11. Build map of normalized book titles to normalized work titles, using:
    • frequency list of pairs of normalized title and work_title, skip any book without a work title
    • frequency list of normalized titles
  12. Build a mapping from normalized work_title back to frequency list of work titles.
  13. Add normalized titles and work_titles from map of existing works and redirects to reverse work_title table and title map.
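
A minimal sketch of steps 5, 7, 9 and 10 in Python, to make the normalization concrete. This is not the actual work finder code: the function names are made up for illustration, the steps are applied in the order listed above, and step 6 (phrases in parentheses) is left out.

```python
import re

# Work titles too generic to identify a work (step 9 of the list above).
GENERIC_WORK_TITLES = {
    "Publications", "Works. English", "Missal", "Works", "Report", "Letters",
    "Calendar", "Bulletin", "Plays", "Sermons", "Correspondence", "Bill",
    "Bills", "Selections", "Selected works", "Selected works. English",
    "The Novels", "Laws, etc",
}


def normalize_title(title):
    """Apply steps 5 and 7 to one title string (hypothetical helper)."""
    t = title.rstrip(". ")                  # step 5: trailing dots or spaces
    t = re.sub(r"\s*\[[^\]]*\]$", "", t)     # 7.1: text at end in square brackets
    t = t.replace(" and ", " ")             # 7.2: "and" surrounded by spaces
    t = t.lower()                           # 7.3: lowercase
    t = re.sub(r"^(the|a) ", "", t)         # 7.4: leading "the " or "a "
    return re.sub(r"\W", "", t)             # 7.5: non-word characters and spaces


def usable_work_title(work_title):
    """Step 9: reject work titles that are on the generic list."""
    return work_title not in GENERIC_WORK_TITLES


if __name__ == "__main__":
    print(normalize_title("The Pride and Prejudice."))
    print(normalize_title("Emma [microform]. "))
    print(usable_work_title("Works"))
```

Step 10 is then just normalize_title() applied to any work_title that passes usable_work_title().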

2010-11-22 Monday

2010-10-22 Friday

Talked to SJ Klein, asked for feature requests, he says add age groups. Some children's libraries have catalogued books for reading level, we should have that information in Open Library.

We should add popularity, possibly how many times a book has been borrowed from libraries, to help filter out boring stuff.

I had the idea of sort by length of Wikipedia article.

More language information, English level 3, Spanish level 4, that kind of thing. Interesting to note that something is a picture book, language doesn't matter as much.

Include information about number of pictures.

2010-10-19 Tuesday

  • http://openlibrary.org/search/inside/miscellaneouswo00howagoog?q=feveral gives an error (fixed)
  • move to using /fulltext/inside.php on datanodes, latest inside.py is deployed (done)
  • handle fields in search inside queries
  • write test for search inside
  • make sure subjects don't get lost when merging works
  • try sharding solr on ia331509, ia331510 and ia331511
  • move solr host to openlibrary.yml (done)

2010-10-14 Thursday

  • crawling Amazon, starting 2009-01-01, 10GB per ARC. dumping files in /0/amazon/

2010-10-13 Wednesday

  • should try termVectors=on, termPositions=on and termOffsets=on and re-index inside solr.

2010-09-29 Wednesday

  • bad MARC, looks like it came from OL: lostincitystorie00jone, daysofatonement00greg, sanfranciscobaya00pneu, metroland00barn
  • OL key is in metasource.xml

2010-09-28 Tuesday

  • sample solr in ~/scratch/sample_solr_search

2010-09-26 Saturday

2010-09-14 Tuesday

  • Problem with duplicates in search index. Same count, different case. Need to debug.
  • solr_author_merge.py throws AuthorRedirect. The authors are correct in the database, wrong in the log. Need to catch AuthorRedirect and use log.
  • scribe loader crashed on MARC record with blank 100a: lincolncentenary00horn

2010-08-30 Monday

  • Using the word vendor instead of bookseller
  • Thinking about identifiers

2010-08-27 Friday

  • Need to check abbyy: americanpsycholo00amer and americanpublicop00erik - answer: print disabled
  • We could use the word 'supplier' instead of 'bookseller'

2010-08-12 Thursday

  • Thinking about speeding up solr update using a cache of data from the Open Library database.

2010-07-29 Thursday

  • Search walmart for books: http://www.walmart.com/search/search-ng.do?search_constraint=3920&search_query=ISBN
  • http://www.booksamillion.com/product/0143038257
  • http://search.barnesandnoble.com/mobile/e/9780143038252

2010-06-16 Wednesday

  • Overdrive: 75810 books, 46139 with an original ISBN
  • Amazon lookup code is in: ia331504:/2/edward/20century/avail_check.py

2010-06-08 Tuesday

2010-06-04 Friday

  • Invalid MARC XML: datafield tag="g050"

2010-06-01 Tuesday

Todo:

  • Fix covers, load missing ones for print disabled books
  • Make scribe handle works
  • Move bits off ia331504
  • Documentation
  • Read e-mail
  • Write biography
  • Read CVs for new developer.

2010-05-11 Tuesday

  • Sent fresh job description to George
  • Trying to load missing Google books without scandate
  • Stopped Unhandled Exception by disabling DataTable cookie

Breakdown of print disabled books not loaded:

2206 items that end with mbp and have no MARC XML
1983 duplicate scans
662 items with MARC that says they aren't books
288 items not included in the search engine because of a bug in infogami

Need to write mail explaining the situation with print-disabled and MARC records that say they're not books.

2010-05-04 Tuesday

  • Found 53 editions that are made up of duplicates within the print disabled collection, sent a mail about it.
  • Pool can't support new longer book keys, uses varchar(16), added code to turn them into old style book keys
  • Import server runs from /1/edward/src/openlibrary/openlibrary/catalog/importer/import_server.py like this: "python import_server.py 9020"

  • Running a bunch of code to load print disabled books:

    • adding print disabled identifiers to existing records: 6.79%: 3615 of 53591 (saves using save_many every 50 books)
    • adding the 'Accessible book' subject to works with scans: 6.35% complete (using save_many, every 100 works)
    • loading new print disabled books: 2.89%: 776/26828 (not using save many, saves each book individually)
    • solr_update is watching the log for changes and updating the search engine in chunks of 100 works
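
A rough sketch of the batching pattern used by the loaders above (save every 50 books, every 100 works). The save_many argument here is just a callable passed in for illustration; it does not reproduce the real Open Library save_many signature, and both function names are hypothetical.

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final partial batch, e.g. the last 20 of 120 records.
        yield batch


def save_in_batches(records, save_many, size=50):
    """Call save_many once per batch and return the number of calls made."""
    calls = 0
    for batch in chunks(records, size):
        save_many(batch)
        calls += 1
    return calls


if __name__ == "__main__":
    saved = []
    print(save_in_batches(range(120), saved.append, size=50))
    print([len(batch) for batch in saved])
```

Saving each book individually (as the third loader does) is the same loop with size=1, which is why it is so much slower.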

2010-05-03 Monday

  • Waiting on Anand's database migration
  • Downloading all print disabled MARC, ready to analyse
  • Might need to rewrite code to use new prefixes for books and authors
  • Give priority to author merging
  • Avoid loading books with bad dates in the distant future, use 260c instead
  • Don't give spelling suggestions that have no results
  • 2215 of the printdisabled books don't have MARC
  • 639 print disabled books aren't monographs
  • 77160 books in print disabled collection
  • 74306 will be loaded
  • Brewster asked about SFPL MARC records on openlibrary.org, not part of ol_data collection
  • Alexis moved all data in the marcrecords collection to ol_data.
  • Found 53 scanned books without files.xml, Hank says it is a permissions problem
  • Redownloaded 53 scanned books with missing files.xml, now down to 43 books with missing files.xml
  • Hank says: "Nearly all of these are on ia301516:/1/items where a secondary-to-primary disk rescue is in progress."
  • Ralf says we need to replace a failing hard drive in ia331507
  • I'm rebuilding the author index from an old author dump.
  • No work record for 1984 by George Orwell.
  • can't run work finder or solr update while JSON API is broken, it can't handle queries that include something like title=null.
  • Jeff pointed out we are missing an edition for Cibola by Alice Walworth Graham, I think it is because of the database migration.
  • work Solr was missing textSpell fieldtype, always suggesting errors in searches not written in lowercase - fixed
  • MARC record says book is an electronic resource when it isn't.

2010-04-30 Friday

  • To load sfpl book descriptions I need to have mapping from source_record -> edition -> work

2010-04-29 Thursday

CD to delete

2010-04-28 Wednesday

Fix up subjects, dump commas:

  • World War, 1939-1945 --> World War (1939-1945)
  • Nigeria -- Civil War, 1967-1970 -> "Nigeria Civil War (1967-1970)" and "Nigeria"
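
A possible way to do the two rewrites above in Python. This only covers the patterns in these two examples (a trailing ", YYYY-YYYY" date range and " -- " subdivisions); fix_subject is a made-up name, not something from the openlibrary code.

```python
import re

# Matches "<anything>, 1939-1945" at the end of a subject string.
DATE_SUFFIX = re.compile(r"^(.*), (\d{4}-\d{4})$")


def fix_subject(subject):
    """Return a list of comma-free subjects derived from one MARC-style subject."""
    parts = [p.strip() for p in subject.split(" -- ")]
    joined = " ".join(parts)
    match = DATE_SUFFIX.match(joined)
    if match:
        # Move the date range into parentheses, dropping the comma.
        joined = "%s (%s)" % (match.group(1), match.group(2))
    result = [joined]
    if len(parts) > 1:
        # Keep the leading subdivision (e.g. the place) as its own subject.
        result.append(parts[0])
    return result


if __name__ == "__main__":
    print(fix_subject("World War, 1939-1945"))
    print(fix_subject("Nigeria -- Civil War, 1967-1970"))
```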

Need to rebuild subject list from latest dump.

Get to use: from openlibrary.solr.work_subject import get_marc_subjects

Building fresh file of work subjects

Building file of changes for Anand

Work finder is running, but sometimes crashes, recording some logs, might help debugging

Solr update is running, it runs work finder for author merges

Downloading MARC and META XML for printdisabled books

2010-04-26 Monday

Mary pointed out missing books:

Made HTML dump of Henrik Ibsen works to assist debugging of work finder.

Challenge with the work finder is splitting up these works:

2010-04-22 Thursday

Time to make an edition search index. 24012766 editions with authors in /3/edward/db_dump/edition_dump2

Need to add work description to search engine. Maybe add created and last modified.

After work finder runs make work updater a two step process

2010-04-21 Wednesday

Work finder has difficulty splitting existing works.

OL52397W should be split into three works:

  • Brev 1845-1905
  • Brev veksling med Christiania Theater, 1878-1899
  • Henrik Ibsens brevveksling med Christiania Theater 1878-1899
  • The correspondence of Henrik Ibsen

Add logging to work finder code for updating works.

Known problems

Add a pager to subject and author searches.

We don't have ISBN 0001982370 in Open Library yet.

Author merge crash, on "assert cur['type'] == '/type/author_role'" - fixed.

Usergroup and /type/volume edit pages don't work on upstream.

Karen wants title and subtitle as two input boxes in librarian mode.

WorkBot messed up title of work

Need to avoid having commas in subjects, they get broken when saving.

Need to try and include place names in time subjects about wars.

we now have subtitle support on the work edit page

I just changed the title of 'Flatland' to 'Flatland: a romance of many dimensions'. When I hit save it gets split on the ': ' into title and subtitle, here is my edit:

http://upstream.openlibrary.org/works/OL118420W/Flatland?b=5&m=diff
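
The split described above can be sketched like this. It assumes the split happens on the first ': ' only; split_title is a hypothetical name and this is not the code behind the work edit page.

```python
def split_title(full_title):
    """Split 'Title: subtitle' on the first ': ' into (title, subtitle)."""
    title, sep, subtitle = full_title.partition(": ")
    # partition() returns an empty separator when ': ' is not present.
    return title, (subtitle if sep else None)


if __name__ == "__main__":
    print(split_title("Flatland: a romance of many dimensions"))
    print(split_title("Emma"))
```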

Can't delete work with editions.

    if delete:
        if self.edition:
            self.delete(self.edition.key, comment=comment)

        if self.work and self.work.edition_count == 0:
            self.delete(self.work.key, comment=comment)
        return

multiple work titles: epistolaeadattic03ciceuoft
multiple work titles: epistolaeadattic00ciceuoft
multiple work titles: epistolaeadattic02ciceuoft
multiple work titles: epistolaeadattic04ciceuoft
multiple titles: resoflegkentucky00kent
multiple titles: thermodynamicsh05woodgoog
multiple titles: ueberdiesprache00burmgoog
multiple titles: volume00goog
multiple titles: worksofrightrevb00strauoft
multiple titles: yesterdaytodaya08bickgoog

Mad MARC

MARC code breakdown:

{'`': 1, 'a': 188926, ' ': 5, 'c': 921, 'b': 153, 'e': 112, 'g': 6, 'i': 1, 'k': 24, 'j': 17, 'm': 1, 'p': 488, 's': 8, 't': 967, 'v': 226, 'y': 179, 'x': 408}
{'a': 598, ' ': 830, 'c': 284, 'b': 13, 'd': 63, 'm': 110279, 's': 80352, 'p': 15, 'S': 9}

18920 works_with_bad_subjects in /3/edward/db_dump

http://upstream.openlibrary.org/search?q=%22thesis+for+BA%22

archive.org metadata table is sometimes out of date. Search engine is more current.

There are 8402 scanned books that have 'MARC Source' in the format field, but don't have ';MARC;'.

I'm looking at MARC records of scanned books that haven't been loaded into Open Library. The first one I've come to is:

Afganistan by Angus Hamilton (1874-1913) http://www.archive.org/details/00hamigoog http://upstream.openlibrary.org/show-marc/00hamigoog/00hamigoog_meta.mrc:0:2387

The title field, tag 245 appears twice in the MARC record:

245 $6 01 $a Afganistan $c A. Gamilʹton ; perevod s angliĭskago S.P. Golubinova.
245 $6 01 $a Афганистанъ $c А. Гамильтонъ ; переводъ съ английскаго С.П. Голубинова.

I think the best thing to do with a record like this is use the first title.

Next year we can work out multilingual records and load the second title as well.
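
Sketch of "use the first title" in Python. The record layout here, a list of (tag, subfields) pairs with subfields as (code, value) pairs, is an assumption made for the example and not the structure used by the catalog MARC parser; first_title is a made-up name.

```python
def first_title(fields):
    """Return subfield $a of the first 245 field, or None if there is none."""
    for tag, subfields in fields:
        if tag != "245":
            continue
        for code, value in subfields:
            if code == "a":
                return value.strip()
    return None


if __name__ == "__main__":
    record = [
        ("100", [("a", "Hamilton, Angus,"), ("d", "1874-1913.")]),
        ("245", [("6", "01"), ("a", "Afganistan ")]),
        ("245", [("6", "01"), ("a", "Афганистанъ ")]),
    ]
    print(first_title(record))
```

Any later 245 fields are simply ignored, which matches the plan of leaving the second title until multilingual records are handled.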

2010-04-20 Tuesday

Spotted that solr_update.py was ignoring save_many. Extracted a list of 46357 works updated/created by the WorkBot, passing to search engine.

Todo

Before release

  • change solr_update.py to handle save_many. Not difficult.
  • make work finder run for author merge. Maybe from solr_update.py when it sees an author merge?
  • update MARC import to search for an existing work, or create a new work, maybe just run the work finder?
  • add resume after crash to work finder
  • support user created subjects in search engine
  • add missing subjects to work pages
  • remove bad subjects like /subject/History from work pages
  • updates to subject search index

Maybe before release

  • load books with MARC that have the scan_date set to null
  • load Google books without MARC records
  • load missing books from Amazon using sitemap

After release

  • currently there are a handful of multi-volume works in Open Library, load the rest
  • research serials and load into Open Library. Find number of serials missing from Open Library
  • add subjects from amazon to work subject field
  • add ability to search for encrypted DAISY files.

More scanned books

~/scans/goog_no_marc_not_ol contains a list of 317,697 Google books without a MARC record, not in OL.

Can't transfer the file by e-mail or scp.

2010-04-19 Monday

873025 editions with scans

1566659 scans possible to load, matching these criteria:

  • scanner is not null
  • noindex is null
  • mediatype = 'texts'
  • curatestate = 'approved' or curatestate is null
  • scandate is not null
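
The criteria above written as a Python predicate, as a sketch. It assumes each item is a dict keyed by the column names used in the list, with None standing in for SQL null; loadable is a made-up name and the real selection was presumably a database query.

```python
def loadable(item):
    """Return True if an archive.org item matches all five criteria above."""
    return (
        item.get("scanner") is not None
        and item.get("noindex") is None
        and item.get("mediatype") == "texts"
        and item.get("curatestate") in ("approved", None)
        and item.get("scandate") is not None
    )


if __name__ == "__main__":
    ok = {"scanner": "scribe1", "mediatype": "texts", "scandate": "20100419"}
    print(loadable(ok))
    print(loadable(dict(ok, noindex="true")))
```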

3133 census items skipped.

Planning to download Meta XML, MARC XML and MARC binary for all 1566659 scanned text items.

First building list of machines where items live. Total run time looks to be 4.5 hours.

Of these 903,127 have 'MARC Binary' in the format field.

1,010,205 records have MARC in the format field.

2010-04-16 Friday

Search for OLID

Redirects to correct page. Done

Scanned books without authors

Scanned books without authors don't have works, so they don't make it into the search engine.

This book has an author on archive.org, but not on Open Library:

Guidance manual for landfill sites receiving municipal waste

The scan was added to an existing book record that didn't have an author.

Need to try adding author data to scanned books if it is available. Need to find some numbers.

Confused work finder

Pride and prejudice

Contains Emma, Pride and prejudice, Sense and sensibility and Mansfield Park because of books like this:

title: The novels work title: Pride and Prejudice
title: The novels. work title: Mansfield Park
title: The novels. work title: Mansfield Park
title: The novels. work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Sense and sensibility
title: The novels work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Pride and prejudice
title: The novels. work title: Persuasion
title: The novels. work title: Pride and prejudice
title: The novels. work title: Northanger Abbey

2010-04-15 Thursday

Auto-merge authors

Extracted list of authors from data dump. Now trying to build a file of authors grouped by birth and death dates.

Found 21640 (birth, death) pairs with possible matching authors and 261394 authors.
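
Roughly how the grouping could look in Python. The dict keys 'birth_date' and 'death_date' are assumed field names for the example, and group_by_dates is a made-up function; a shared (birth, death) pair only marks authors as possible matches, names still need comparing.

```python
from collections import defaultdict


def group_by_dates(authors):
    """Group author dicts by (birth, death); keep only groups with 2+ authors."""
    groups = defaultdict(list)
    for author in authors:
        birth = author.get("birth_date")
        death = author.get("death_date")
        if birth and death:
            groups[(birth, death)].append(author)
    return {dates: group for dates, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    authors = [
        {"name": "Sándor Ferenczi", "birth_date": "1873", "death_date": "1933"},
        {"name": "Ferenczi, Sándor", "birth_date": "1873", "death_date": "1933"},
        {"name": "Jane Austen", "birth_date": "1775", "death_date": "1817"},
        {"name": "Anonymous"},
    ]
    for dates, group in group_by_dates(authors).items():
        print(dates, [a["name"] for a in group])
```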

Scanned books in work search

Extracted list of authors touched by ImportBot since 2010-03-01, total is 106934. Includes some non-scanned books. Running work finder for each author.

Author merge

Live, but not complete, still need to debug solr updates.

Example book

2010-04-14 Wednesday

Amazon

Downloaded sitemap_dp_*.xml.gz sitemaps. 2504 files, total size is 1.6G

40,000 URLs per sitemap file.

Generating a list of ISBN from database dump, to compare with Amazon sitemap data.

Spell suggest

Bogus answers from spell suggest if search terms aren't in lowercase.

ImportBot removes work link

This shouldn't happen

Not sure why this happened, I can't reproduce it.

Scanned books without works

Here is a list: ia331504:/2/edward/solr/2010-04-07/missing_work

Made a list of editions that once had a work, but no longer do, 253 of them. Most are edits by the ImportBot in December.

Rebuilding list of missing_works to include previous versions, sorted by version (done).

There are 203,701 scanned editions without works attached.

Looking at how many editions have a title that the work finder skips.

Scanned books that are skipped by the work finder by title:

24 Correspondence
59 Letters
472 Publications
556 Report
69 Plays
25 Calendar
676 Works
315 Sermons
662 Bulletin

no author: 125653
one author: 41861
multiple authors: 33305

Merge authors

Should find some way to auto merge. For example Sándor Ferenczi.

Search pages

Need to add paging to subject and author search pages. Should probably tokenize date fields so I can search for the term 'century' and the year within a full date.

2010-04-13 Tuesday

Amazon

Trying to download Amazon best seller pages by crawling the subject facets in the search engine. Turns out I was only grabbing top level facets, need to go deeper.

Use sitemap instead of search.

Missing works for scanned books

I'm missing works for some scanned books because they don't have an author. Need to add works without an author.

show-marc

Should add link to generate MARC XML for imported MARC records and include nicer display of MARC records for IA MARC records.

Two sources have identical MARC records, the IA record was retrieved from OL.

merge authors

pass action="merge-authors" to save_many