Friday, January 12, 2018

Japanese Press Translations

Not long ago, I wrote about food shortages described in the Japanese Press Translations.  At the time, I was trying to improve discoverability of the collection by linking to relevant topics in online encyclopedias.  Now my work continues on a more technical level.  My current project involves mining the collection for keywords and phrases.  To do this, I took headings extracted from the TEI-encoded text of the pieces and ran them through a program called Voyant Tools.

Each individual document within the Japanese Press Translations is divided into articles, and each article is given a separate item heading within the TEI text.  Mina Rakhra provided me with a list of item headings derived from the Japanese Press Translations TEI text, an encoding that makes the documents machine-readable.  I removed everything from these headings except the titles themselves, then used Voyant Tools to export a list of the most frequent terms.  Some results are shown below.

Some Keyword Mining results
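For readers who want to try something similar without Voyant Tools, here is a minimal sketch of the same idea in Python: count how often each term appears across a list of headings and report the most frequent ones.  This is not how Voyant Tools works internally.  The function name, the short stopword list, and the sample headings are all assumptions made up for illustration, not data from the collection.

```python
import re
from collections import Counter

# Hypothetical, very short stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "the", "to", "with"}

def top_terms(headings, n=10):
    """Return the n most frequent non-stopword terms across all headings."""
    counts = Counter()
    for heading in headings:
        # Lowercase the heading and split it into simple word tokens.
        words = re.findall(r"[a-z']+", heading.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Invented sample headings, for illustration only.
sample_headings = [
    "Food Shortage in Tokyo",
    "The General Election and the Food Problem",
    "Election Results",
]

for term, count in top_terms(sample_headings, 5):
    print(term, count)
```

A real run would read the cleaned heading titles from a file, one per line, and would likely need a fuller stopword list.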


Voyant Tools provides a useful and user-friendly interface for interacting with text.  It allows the user to view the text in many different ways, including lists of words and phrases, word clouds, and even line graphs.  I had some fun selecting different terms, seeing how frequently each appeared over the year covered, and looking for correlations between the changes.  Even in just a short time working with the data, I noticed some trends in the text.  The terminology and topics discussed changed over time, partially corresponding with the Japanese general election of 1946.  Although historical analysis is not the goal of this project, these tools could be useful for a scholar interested in exploring a text at a deeper level.  Voyant Tools may be worth exploring for both students and professional academics.


Voyant Tools UI


Graph of term frequency
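The kind of trend shown in a term frequency graph can be approximated with a few lines of Python.  The sketch below counts occurrences of one term per month.  It assumes each heading comes paired with an ISO date string (YYYY-MM-DD), which may not match how the actual heading list is structured.  The function name and the sample records are invented for illustration.

```python
import re
from collections import defaultdict

def term_counts_by_month(records, term):
    """Count occurrences of `term` in headings, grouped by 'YYYY-MM'."""
    term = term.lower()
    counts = defaultdict(int)
    for date, heading in records:
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        words = re.findall(r"[a-z']+", heading.lower())
        # Adding 0 still registers the month, so gaps show up in the result.
        counts[month] += words.count(term)
    return dict(sorted(counts.items()))

# Invented (date, heading) pairs, for illustration only.
sample_records = [
    ("1946-03-02", "Election Campaign Begins"),
    ("1946-04-11", "General Election Results"),
    ("1946-04-12", "Election Returns and the Food Problem"),
    ("1946-05-20", "Food Shortage"),
]

for month, count in term_counts_by_month(sample_records, "election").items():
    print(month, count)
```

Raw counts like these are only a starting point.  Comparing months fairly would mean dividing by the number of headings in each month.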


What is the purpose of this endeavor?  The primary benefit is that these keywords can aid in searching the collection.  As it stands, the pieces are all titled by topic and number alone.  From a browsing page, the individual documents are difficult to distinguish and potentially intimidating to the casual user.  The collection can be searched by term through the TEI text, which is excellent for a user with a specific topic in mind but less useful to someone who is just browsing.  The keywords collected through this project could be displayed on a browsing page or elsewhere, allowing for easier and faster movement through documents.

In addition to the potential UI benefits of this project, the keywords produced can be reconciled with FAST (Faceted Application of Subject Terminology).  FAST, developed by OCLC in cooperation with the Library of Congress, is derived from the Library of Congress Subject Headings (LCSH).  It attempts to make LCSH more accessible and usable, and reconciling our keywords with FAST could improve compatibility with other systems.

As the project stands, I have some raw data and a basic understanding of the work needed for LCSH reconciliation.  I will be meeting with Mina, Bill Ghezzi, and Shaun Akhtar over the coming months to discuss how this work might be implemented.  Hopefully, we'll be able to integrate it into the user interface and search functionality of the Japanese Press Translations as we develop the library's display platforms in the future.


Written by Kevin Warstadt


































