This part of the article describes the project background and purpose, along with some of the challenges of data collection. For the other parts of this article, follow these links: Summary, Part 2, Part 3.

A Big Data exploration of Wikipedia

The introduction of massive digitized and born-digital text archives, together with the emerging algorithms, computational methods, and computing platforms capable of exploring them, has revolutionized the Humanities, Arts, and Social Sciences (HASS) disciplines over the past decade.  These days, scholars are able to explore historical patterns of human society across billions of book pages dating back more than three centuries, or to watch the pulse of contemporary civilization moment by moment through hundreds of millions of microblog posts, with a click of a mouse.  The scale of these datasets and the methods used to analyze them have led to a new emphasis on interactive exploration, “test[ing] different assumptions, different datasets, and different algorithms … figur[ing] out whether you’re asking the right questions, and … pursuing intriguing possibilities that you’d otherwise have to drop for lack of time.”(1) Data scholars leverage off-the-shelf tools and plug-and-play data pipelines to rapidly and iteratively test new ideas and search for patterns, letting the data “speak for itself.”  They are also increasingly becoming cross-trained experts capable of rapid ad-hoc computing, analysis, and synthesis.  At Facebook, “on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of [those] analyses to other members of the organization.”(1)

The classic image of the solitary scholar spending a professional lifetime examining the most nuanced details of a small collection of works is slowly giving way to the collaborative researcher exploring large-scale patterns across millions or even billions of works.  A driving force of this new approach to scholarship is the concept of whole-corpus analysis, in which data mining tools are applied to every work in a collection.  This is in contrast to the historical model of a researcher searching for specific works and analyzing only the trends found in that small set of documents.  There are two reasons for this shift towards larger-scale analysis: the more complex topics now being explored and the need for baseline indicators.  Advances in computing power have made it possible to move beyond the simple keyword searches of early research to more complex topics, but these require more complex search mechanisms.  To study topical patterns in how books of the nineteenth century described “The West” using a traditional keyword search, one would have to compile a list of every city and landmark in the Western United States and construct a massive Boolean “OR” statement potentially including several million terms.  Geographic terms are often ambiguous (“Washington” can refer both to the state on the West coast and to the US capital on the East coast; 40% of US locations share their name with another location elsewhere in the US), so in addition to being impractical, the resulting queries would have a very high false-positive rate.  Instead, algorithms can be applied to identify and disambiguate each geographic location mentioned in each document, annotating the text with approximate coordinates and enabling native geographic search of the text.
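To make that disambiguation step concrete, the following is a minimal Python sketch of gazetteer-based geographic tagging. The tiny gazetteer, its population figures, and the context-clue words are illustrative stand-ins only; production pipelines rely on full gazetteers such as GeoNames and far richer statistical disambiguation.

```python
import re

# Tiny illustrative gazetteer: each ambiguous name maps to candidate places
# with approximate coordinates, rough populations, and context-clue words.
GAZETTEER = {
    "washington": [
        {"name": "Washington, D.C.", "lat": 38.90, "lon": -77.04,
         "population": 690_000, "clues": {"congress", "capitol", "d.c.", "potomac"}},
        {"name": "Washington (state)", "lat": 47.75, "lon": -120.74,
         "population": 7_700_000, "clues": {"seattle", "tacoma", "puget", "pacific"}},
    ],
    "portland": [
        {"name": "Portland, Oregon", "lat": 45.52, "lon": -122.68,
         "population": 650_000, "clues": {"oregon", "willamette"}},
        {"name": "Portland, Maine", "lat": 43.66, "lon": -70.26,
         "population": 68_000, "clues": {"maine", "casco"}},
    ],
}

def tag_locations(text, window=60):
    """Find gazetteer names in text and resolve each to its most likely place."""
    lowered = text.lower()
    results = []
    for name, candidates in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", lowered):
            # Score candidates by context clues in the surrounding window,
            # falling back to population when no clues are present.
            nearby = lowered[max(0, match.start() - window): match.end() + window]
            best = max(candidates,
                       key=lambda c: (sum(clue in nearby for clue in c["clues"]),
                                      c["population"]))
            results.append((match.start(), name, best["name"], best["lat"], best["lon"]))
    return sorted(results)

print(tag_locations("The treaty was debated in Congress and signed in Washington."))
```

Because “Congress” appears near the mention, the sketch resolves “Washington” to the capital rather than the state, which is the kind of context-driven decision a keyword query cannot make.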

The creation of baselines has also been a strong factor in driving whole-corpus analysis.  Search for the raw number of mentions by year of nearly any keyword in a digitized book collection spanning 1800-1900 and the resulting graph will likely show a strong increase in the use of the term over that century.  The problem with this measure is that the number of digitized books published in each year is not constant: it increases at a linear to exponential rate depending on the book collection.  This means that nearly any word will show a significant increase in the total number of raw mentions simply because the universe of text has grown.  To compensate for this, measurement tools like the Google Ngrams viewer (2) calculate a word’s popularity each year not as the absolute number of mentions, but rather as a percentage of all words published that year.  This effectively measures the “rate” at which a word is used, normalizing away the impact of the increasing number of books published each year.  Yet, to do this, Google had to compute the total list of all unique words in all books published in each year, creating a whole-corpus baseline.  Similarly, when calculating shifts in the “tone” towards a topic or its spatial association, corpus baselines are needed to determine whether the observed changes are specifically associated with that topic, or whether they merely reflect corpus-wide trends over that period.
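The arithmetic behind this normalization is simple, as the short Python sketch below shows; the yearly counts and corpus totals are invented for illustration, since a real baseline requires tallying every token in every work published in each year.

```python
# Invented yearly figures, for illustration only: raw mentions of one keyword
# and the total number of words published (and digitized) in each year.
raw_mentions = {1800: 120, 1850: 480, 1900: 2100}
total_words  = {1800: 9.5e6, 1850: 6.1e7, 1900: 4.3e8}

# Ngrams-style normalization: popularity as a share of all words that year.
rates = {year: raw_mentions[year] / total_words[year] for year in raw_mentions}

for year in sorted(rates):
    print(f"{year}: {raw_mentions[year]:>5} raw mentions, "
          f"{rates[year] * 1e6:6.2f} mentions per million words")
```

With these invented numbers the raw count rises more than seventeen-fold while the per-million rate actually falls, which is precisely the distortion the whole-corpus baseline is meant to correct.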

Into this emerging world of Big Data HASS scholarship came a collaboration with supercomputing company Silicon Graphics International (SGI), which leveraged its new 4,000-core, 64TB-shared-memory UV2 supercomputer to apply this interactive exploration approach to telling the story of Wikipedia’s chronicle of world history.  Launched a little over a decade ago, Wikipedia has become an almost indispensable part of daily life, housing 22 million articles across 285 languages that are accessed more than 2.7 billion times a month from the United States alone.  Today Alexa ranks it the 6th most popular site on the entire web, and it has become one of the largest general web-based reference works in existence (3).  It is also unique among encyclopedias in that, in addition to being a community product of millions of contributors, Wikipedia actively encourages the downloading of its complete contents for data mining.  In fact, it even has a dedicated download site containing the complete contents of the site in XML format, ready for computer processing (4).  This openness has made it one of the most widely-used data sources for data mining, with Google Scholar returning more than 400,000 articles either studying or referencing Wikipedia.
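As an illustration of how “ready for computer processing” those dumps are, the following minimal Python sketch streams article titles and raw wikitext out of a pages-articles XML dump without loading the multi-gigabyte file into memory. The dump filename is a placeholder for whichever file is downloaded, and a real workflow would go on to parse the wiki markup itself.

```python
import xml.etree.ElementTree as ET

def iter_articles(dump_path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump, streaming."""
    context = ET.iterparse(dump_path, events=("start", "end"))
    _, root = next(context)                  # grab the <mediawiki> root element
    title, text = None, None
    for event, elem in context:
        if event != "end":
            continue
        tag = elem.tag.rsplit("}", 1)[-1]    # strip the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            root.clear()                     # discard finished pages to save memory

if __name__ == "__main__":
    # Placeholder filename: substitute whichever dump file was downloaded.
    for title, text in iter_articles("enwiki-pages-articles.xml"):
        print(title, len(text), "characters of wikitext")
        break
```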

As an encyclopedia, Wikipedia is essentially a massive historical daybook cataloging global activity through history, arrayed by date and location.  Yet most of the literature on Wikipedia thus far has focused on its topical knowledge, either examining the linking structure of Wikipedia (which pages link to which other pages and what category tags are applied where) or studying a small number of entries intensively (5).  Few studies have explored the historical record captured on Wikipedia’s pages.  In fact, one of the few previous studies to explore Wikipedia as a historical record visualized just 14,000 events cross-linked from entries that had been manually tagged by human contributors with both date and geographic location information (6).  No study has delved into the contents of the pages themselves to look at every location and every date mentioned across all four million English-language entries and the picture of history they yield from the collective views of the millions of contributors who have built Wikipedia over the past decade.

The Big Data workflow: acquiring the data

The notion of exploring Wikipedia’s view of history is a classic Big Data application: an open-ended exploration of “what’s interesting” in a large data collection, leveraging massive computing resources.  While quite small in comparison to the hundreds-of-terabytes datasets that are becoming increasingly common in the Big Data realm of corporations and governments, the underlying question explored in this Wikipedia study is quite similar: finding overarching patterns in a large collection of unstructured text, learning new things about the world from those patterns, and doing all of this rapidly, interactively, and with minimal human investment.

As the name suggests, all Big Data projects begin with the selection and acquisition of data.  In the HASS disciplines the data acquisition process can involve months of searching, license negotiations with data vendors, and elaborate preparations for data transfer.  Data collections at these scales are often too large to simply download over the network (some collections can total hundreds of terabytes or even petabytes) and so historically have been shipped on USB drives.  While most collections fit onto just one or two drives, the largest collections can require tens, hundreds, or even thousands of high-capacity USB drives or tape cartridges.  Some collections are simply too large to move or may involve complex licensing restrictions that prevent them from being copied en masse.  To address this, some data vendors are beginning to offer small local clusters housed at their facilities, where researchers can apply for an allocation to run their data mining algorithms on the vendor’s own cluster and retrieve just the analytical results, avoiding the data movement concerns entirely.

In some cases it is possible to leverage the high-speed research networks that connect many academic institutions to download smaller collections via the network.  Some services require specialized file transfer software that may use network ports blocked by campus firewalls, or may require the receiving machine to install additional software or obtain security certificates, which can be difficult at many institutions.  Web-based APIs that allow files to be downloaded via standard authenticated web requests are more flexible and are supported on most academic computing resources.  Such APIs also allow for nearly unlimited data transfer parallelism: since most archives consist of massive numbers of small documents, transfers can be parallelized simply by requesting multiple documents at once.  Not all web-based APIs are well-suited for bulk transfers, however.  Some APIs only allow documents to be requested a page at a time, requiring 600 individual requests to separately download each page of a single 600-page book.  At the very minimum, APIs must allow an entire work to be retrieved as a single file, either as a plain ASCII file with page-break characters indicating page boundaries (where applicable) or in XML format.  Applications used to manage the downloading workflow must be capable of automatically restarting where they left off, since the downloading process can often take days or even weeks and can frequently be interrupted by network outages and hardware failures.  The most flexible APIs allow an application to query the master inventory of all works, select only those works matching certain criteria (or list all documents), and download a machine-friendly CSV or XML output that includes a direct link to each document.  Data mining tools are often developed for use on just one language, so a project might wish to download only English-language works, for example.
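The sketch below shows one way such a download workflow might look in Python. The inventory format (a CSV with “id”, “language”, and “url” columns) is a hypothetical stand-in rather than any particular archive’s schema; the sketch filters to English-language works, downloads several documents in parallel, and restarts cleanly by skipping files that are already on disk.

```python
import csv
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

INVENTORY = "inventory.csv"   # hypothetical master inventory exported by the archive
OUT_DIR = "works"

def fetch(row):
    """Download one work, skipping it if a previous run already saved it."""
    dest = os.path.join(OUT_DIR, f"{row['id']}.xml")
    if os.path.exists(dest):                       # restart support: already downloaded
        return dest
    tmp = dest + ".part"                           # write to a temp name, then rename,
    urllib.request.urlretrieve(row["url"], tmp)    # so interrupted downloads are retried
    os.rename(tmp, dest)
    return dest

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    with open(INVENTORY, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r["language"] == "eng"]
    # Most archives are collections of many small files, so requesting several
    # documents at once is the easy parallelism win described above.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for path in pool.map(fetch, rows):
            print("saved", path)

if __name__ == "__main__":
    main()
```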

Many emerging projects perform data mining on the full textual content of each work, and thus require access to the Optical Character Recognition (OCR) output (7).  However, handwritten works, works scanned from poor-quality originals (such as heavily-scratched service microform), or works that make use of Fraktur or other specialized fonts are highly resistant to OCR and thus normally do not yield usable OCR output.  Some archives OCR every document and include the output as-is, leading to 10MB files of random garbage characters, while others filter poor-quality OCR through automated or manual review processes.  Those that exclude poor-quality OCR should indicate, through a metadata flag or other means, that the OCR file has been deliberately excluded for that work.  Otherwise, it is difficult for automated downloading tools to distinguish between a work whose OCR file has been deliberately left out and a technical error that prevented the file from being downloaded (and thus should be requeued for another attempt).  For those documents that include OCR content, archives should include as much metadata as possible on the specific organization that scanned the work, the library it was scanned from, the scanning software and imaging system, and the specific OCR software and version used.  This information can often be used to incorporate domain knowledge about scanning practices or imaging and OCR pipeline nuances to optimize or enhance the processing of the resultant text.
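A minimal sketch of that bookkeeping follows, assuming a hypothetical item-level “ocr_excluded” metadata flag and record layout; the point is simply that the flag lets a downloader separate deliberate omissions from transient failures worth requeuing.

```python
import os

def classify_missing_ocr(record, ocr_dir):
    """Return 'ok', 'excluded', or 'requeue' for one work's OCR file."""
    path = os.path.join(ocr_dir, f"{record['id']}.txt")
    if os.path.exists(path):
        return "ok"
    if record.get("ocr_excluded") == "true":   # archive flagged poor-quality OCR
        return "excluded"
    return "requeue"                           # likely a transient download error

records = [
    {"id": "work001", "ocr_excluded": "false"},
    {"id": "work002", "ocr_excluded": "true"},   # e.g., Fraktur or damaged originals
]
requeue = [r["id"] for r in records if classify_missing_ocr(r, "ocr") == "requeue"]
print("retry these downloads:", requeue)
```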

Yet perhaps the greatest challenge in the data acquisition process is policy-based rather than technical.  Unlike copyright status, for which there are clear guidelines for determining whether a work has entered the public domain (at least in the United States), there are no national policies or recommendations on what content should be made available for data mining.  In some cases archives may have received data from a commercial vendor or other source under terms that permit browsing, but not computational analysis.  In others, funding sources or institutional policy may permit data mining only by researchers at the home institution, or grant them exclusive early access.  Some archives permit unrestricted data mining on some content and only “non-consumptive” analysis of other material.  Yet, despite this varied landscape of access, few archives have written policies regarding data mining or clear guidelines on what material is available for analysis.  Most critically, however, while many archives include a flag for each work indicating whether it has entered the public domain, no major archive today has a similar flag to indicate whether a work is available for data mining and under what restrictions.  This can cause long delays, as archives must evaluate which material can be data mined, in some cases having to create policies and manually review content first.  As data mining becomes more commonplace, it is hoped that new national and international guidelines will be formed to help standardize the determination process and that archives will begin to include item-level metadata indicating the availability of an item for data mining, vastly simplifying this process.

Summary
Part 2: Data processing and Analytical methodologies

In part 2 of this article, the author describes the data processing and analytical methodologies applied to the Wikipedia content.

Part 3: Data analytics and Visualization

In part 3 of this article, the author describes the data analytics and visualization of the knowledge extracted from the Wikipedia data.

References

1. Loukides, M.  (2010). “What is Data Science?” http://radar.oreilly.com/2010/06/what-is-data-science.html
2. Google Books Ngram Viewer. (online).  http://books.google.com/ngrams/
3. Wikipedia.  (online).  http://en.wikipedia.org/wiki/Wikipedia
4. Wikipedia: Database download. (online).  http://en.wikipedia.org/wiki/Wikipedia:Database_download
5. Giles, J. (2005).  “Special Report: Internet encyclopedias go head to head.” Nature.  http://www.nature.com/nature/journal/v438/n7070/full/438900a.html
6. Lloyd, G. (2011).  “A history of the world in 100 seconds.” Ragtag.info.  http://www.ragtag.info/2011/feb/2/history-world-100-seconds/
7. Leetaru, K. (2011). “Data Mining Methods for the Content Analyst: An Introduction to the Computational Analysis of Informational Content.” Routledge.