In this part of the article Kalev Leetaru describes the methods by which the Wikipedia data was stored, processed, and analyzed. For the other parts of this article, click on the links here: Summary, Part 1, Part 3.

Storing the data for processing

Once the data arrives, it must be processed into a format that can be read by the analysis tools.  Many collections are stored in proprietary or discipline-specific formats, requiring preparation and data-reformatting stages.  One large digital book archive arrives as two million ZIP files containing 750 million individual ASCII files, one for each page of each book in the archive.  Few computer file systems can handle that many tiny files, and most analysis software expects to see each book as a single file.  Thus, before any analysis can begin, each of these ZIP files must be uncompressed and the individual page files reformatted into a single ASCII or XML file per book.  Other common delivery formats include PDF, EPUB, and DjVu, which require similar preprocessing stages to extract their text layers.  While XML is a growing standard for the distribution of text content, the XML standard defines only how a file is structured, leaving individual vendors to decide on the specific encoding scheme they prefer.  Thus, even when an archive is distributed as a single XML file, preprocessing tools are still needed to extract the fields of interest.  In the case of Wikipedia, the complete four-million-entry archive is available as a single XML file for download directly from the Wikipedia website and uses a fairly simple XML schema, making it easy to extract the text of each entry.
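The repackaging step described above — turning an archive of per-page files into one document per book — can be sketched in a few lines. The project's tooling was written in PERL; the following is a minimal Python stand-in, and the assumption that page files are named so that lexicographic order equals page order is illustrative, not taken from the original archive.

```python
import io
import zipfile

def repack_book(zip_bytes: bytes) -> str:
    """Concatenate a book's per-page ASCII files into a single
    plain-text document.  Assumes zero-padded page names so that
    lexicographic order matches page order (an illustrative
    convention, not the real archive's naming scheme)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        pages = sorted(zf.namelist())
        return "\n".join(
            zf.read(name).decode("ascii", errors="replace")
            for name in pages
        )
```

In practice each of the two million ZIP files would be streamed through a function like this, writing one consolidated file per book before any analysis begins.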

As the fields of interest are extracted from the source data, they must be stored in a format amenable to analysis.  In cases where only one or two software packages will be used for the analysis, the data can simply be converted into a file format they support.  If multiple software packages will be used, it may make more sense to convert the data to an intermediate representation that can easily be converted to and from the other formats on demand.  Relational database servers offer a variety of features, such as indexes and specialized algorithms designed for datasets too large to fit into memory, that enable high-speed, efficient searching, browsing, and basic analysis of even very large collections, and many filters are available to convert to and from major file formats.  Some servers, like the free edition of MySQL (1), are highly scalable yet extremely lightweight and can run on any Linux or Windows server.  Alternatively, if it is not possible to run a database server, a simple XML format can be developed that includes only the fields of interest, or specialized formats such as packed data structures can be used that allow rapid random retrieval from the file.  In the case of the Wikipedia project, a MySQL database was used to store the data, which was then exported to a special packed XML format designed for maximum processing efficiency during the large computation phases.
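The load-into-a-database step can be illustrated with a toy fragment. The project used MySQL; this sketch substitutes SQLite (so it runs with no server), and the simplified `<page><title>…<text>…` schema below is a stand-in for the real Wikipedia dump format, which nests the text inside a `<revision>` element.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Toy dump fragment with a simplified, illustrative schema.
DUMP = """<mediawiki>
  <page><title>Cairo</title><text>Capital of Egypt.</text></page>
  <page><title>Golden Retriever</title><text>A dog breed.</text></page>
</mediawiki>"""

# Parse the XML and load title/body pairs into a relational table,
# where they can be indexed, searched, and exported on demand.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, body TEXT)")
for page in ET.fromstring(DUMP).iter("page"):
    db.execute("INSERT INTO articles VALUES (?, ?)",
               (page.findtext("title"), page.findtext("text")))
db.commit()
```

At full scale the same pattern applies: stream the dump, extract the fields of interest, and let the database handle indexing for datasets too large to fit in memory.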

From words to connections: transforming a text archive into a knowledge base

Documents are inherently large collections of words, but to a computer each word holds the same meaning and importance as every other word, limiting the patterns that can be explored in an archive to simple word frequencies.  The creation of higher-order representations capturing specific dimensions of that information, such as recognizing words indicating space, time, and emotion, allows automated analyses to move closer towards studying patterns in the actual meaning and focus of those documents.  The first generation of Big Data analysis focused largely on examining such indicators in isolation, plotting the tone of discussion of a topic over time or mapping locations and making lists of persons mentioned in that coverage.  Connections among indicators have largely been ignored, primarily because the incredible richness of human text leads to networks of interconnections that can easily reach hundreds of trillions of links even from relatively small collections.  Yet historical research tends to revolve around these very connections and the interplay they capture between people, places, and dates and the actions and events that relate them.  Thus, the grand challenge questions driving the second generation of Big Data research tend to revolve around weaving the myriad connections scattered across an archive into a single cohesive network capturing how every piece of information fits into the global picture.  This in turn is driving an increasing focus on connections and the enormous theoretical and computational challenges that accompany them.  In the case of Wikipedia, mapping mentions of locations and creating timelines of date mentions and tone in isolation can be enlightening, but the real insight comes from coupling those dimensions, exploring how tone diffuses over space through time.

Thus, once a data archive has been assembled, the first stage of the analytical pipeline usually begins with the construction of new metadata layers over the data.  This typically involves using various data mining algorithms to extract key pieces of information, such as names or locations, or to calculate various characteristics of the text, such as readability scores or emotion.  The results of these algorithms are then saved as metadata layers to be used for subsequent access and analysis of the text.  To explore Wikipedia’s view of world history, for example, data mining algorithms were needed to translate its large unstructured text corpus into a structured knowledge base.  Each study uses a different set of data mining algorithms aimed at its specific needs, but location in particular is an emerging class of metadata that is gaining traction as a way of understanding information in a new light.  Culturomics 2.0 (2) found that location was the single most prominent organizing dimension in a three-decade archive of more than 100 million print and broadcast news reports translated from vernacular languages across nearly every country in the world, with a location appearing on average every 200-300 words.  In the case of Wikipedia, previous studies of the linking structure of its pages have found that time and space form the two central dimensions around which the entire site is organized (3).  Thus, for the metadata construction stage of the Wikipedia project, a fulltext geocoding algorithm was applied to all of the articles to automatically identify, disambiguate, and convert all textual geographic references to approximate mappable coordinates (4).  This resulted in a new XML metadata layer recording every mention of a location in the text of each article and the corresponding latitude and longitude for mapping.  A similar algorithm was used to identify mentions of dates.
For example, a reference to “Georgian authorities” would utilize the surrounding document text to determine whether this referred to the country in Europe or the US state, while a mention of “Cairo” would be disambiguated to see whether it referred to the capital of Egypt or the small town in the state of Illinois in the US.  Each location was ultimately resolved to a centroid set of geographic coordinates that could be placed on a map, while each date was resolved to its corresponding year.
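The disambiguation idea can be reduced to a toy sketch: each candidate sense of an ambiguous place name carries context keywords and centroid coordinates, and the sense whose keywords best overlap the surrounding text wins. This is a heavily simplified Python stand-in for the project's fulltext geocoder; the gazetteer entries, keywords, and coordinates below are illustrative only.

```python
# Toy gazetteer: two candidate senses for the ambiguous name "Cairo".
GAZETTEER = {
    "Cairo": [
        {"place": "Cairo, Egypt", "coords": (30.04, 31.24),
         "context": {"egypt", "nile", "capital"}},
        {"place": "Cairo, Illinois", "coords": (37.01, -89.18),
         "context": {"illinois", "ohio", "mississippi"}},
    ],
}

def resolve(name, surrounding_text):
    """Pick the candidate sense whose context keywords overlap most
    with the words around the mention; return None if unknown."""
    words = set(surrounding_text.lower().split())
    candidates = GAZETTEER.get(name, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c["context"] & words))
```

A production geocoder weighs far richer evidence (nearby place names, population priors, document-level context), but the shape of the problem — candidate generation followed by context scoring — is the same.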

Wikipedia provides a facility for article contributors to manually annotate articles with mappable geographic coordinates.  In fact, content enriched with various forms of metadata, such as the Text Encoding Initiative (TEI) (5), is becoming more commonplace in many archives.  The US Department of State has annotated its historical Foreign Relations of the United States collection with inline TEI tags denoting mentions of person names, dates, and locations (6).  However, only selected mentions are annotated, such as pivotal political figures, rather than every person mentioned in each document.  This can lead to incomplete or even misleading results when relying on collection-provided metadata.  In the case of Wikipedia, the human-provided geographic tags primarily focus on Europe and the Eastern United States, leading to a long history of academic papers that have relied on this metadata to erroneously conclude that Wikipedia is US- and European-centric.  When switching to the content-based spatial data extracted by the fulltext geocoder, it becomes clear that Wikipedia’s coverage is actually quite even across the world, matching population centers (7).  As an example of the vast richness obtained by moving from metadata to fulltext, the four million English Wikipedia articles contain 80,674,980 locations and 42,443,169 dates.  An average article references 19 locations and 11 dates, and there is on average a location every 44 words and a date every 75 words.  As one example, the History section of the entry on the Golden Retriever dog breed (8) lists 21 locations and 18 dates in 605 words, an average of a location every 29 words and a date every 34 words.  This reflects the critical role of time and location in situating the narratives of encyclopedias.

Sentiment mining was also used to calculate the “tone” of each article on a 200-point scale from extremely negative to extremely positive.  There are thousands of dictionaries available today for calculating everything from positive-negative to anxious-calm and fearful-confident (9).  All dictionaries operate on a similar principle: a set of words representing the emotion in question is compiled into a list, and the document text is compared against this list to measure the prevalence of those words in the text.  A document with words such as “awful”, “horrific”, and “terrible” is likely to be perceived by a typical reader as more negative than one using words such as “wonderful”, “lovely”, and “fantastic”.  Thus, by measuring what percentage of the document’s words are found in the positive dictionary, what percentage are found in the negative dictionary, and then subtracting the two, a rough estimate of the tonality of the text can be achieved.  While quite primitive, such approaches can achieve fairly high accuracy at scale.
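The dictionary method above fits in a few lines. This Python sketch uses tiny illustrative word lists (real tone dictionaries contain thousands of entries) and scores on the -100..+100 range implied by the 200-point scale; it is a stand-in for, not a copy of, the dictionaries the project used.

```python
# Toy emotion dictionaries; real ones contain thousands of words.
POSITIVE = {"wonderful", "lovely", "fantastic", "good", "great"}
NEGATIVE = {"awful", "horrific", "terrible", "bad", "dreadful"}

def tone(text):
    """Percent of words in the positive dictionary minus percent in
    the negative dictionary: a score on a -100..+100 (200-point)
    scale from extremely negative to extremely positive."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 100.0 * (pos - neg) / len(words)
```

On a single short passage this is noisy, which is why such measures are most reliable when averaged over many documents at scale.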

Computational resources

All of these dimensions must be brought together into an interconnected network of knowledge.  To enable this research, SGI made available one of its UV2 supercomputers with 4,000 processing cores and scalable to 64 terabytes of cache-coherent shared memory. This machine runs a standard Linux operating system across all 4,000 cores, meaning it appears to an end user as essentially a single massive desktop computer and can run any off-the-shelf Linux application unmodified across the entire machine. This is very different from a traditional cluster, which might have 4,000 cores, but spread across hundreds of separate physical computers, each running their own operating system and unable to share memory and other resources. This allowed the project to make use of a rapid prototyping approach to software development to support near-realtime interactive ad-hoc exploration.

All of the metadata extraction, network compilation, workflows, and analysis were done using the PERL (10) programming language and the GraphViz (11) network visualization package.  PERL is one of the few programming languages designed from the ground up for the processing and manipulation of text, especially efficiently extracting information based on complex patterns.  One of the greatest benefits of PERL is that it offers many high-level primitives and constructs for working with text patterns, and as a scripting language it hides the memory management and other complexities of compiled languages.  Often the greatest cost of a research project is the human time it takes to write a new tool or run an analysis, and the ad-hoc exploratory nature of much Big Data analysis means that an analyst is often trying out a large number of ideas where the focus is simply on seeing what the results look like, not on computational efficiency.

For example, to generate the final network map visualizations, a set of PERL scripts was written to rapidly construct the networks using different parameters to find the best final results in terms of coloration, alpha blending, inclusion thresholds, and other criteria.  A script using regular expressions and a hash table was used to extract and store an 800 gigabyte graph entirely in memory, with the program taking less than 10 minutes to write and less than 20 minutes to run.  Thus, in less than half an hour, a wide array of parameter adjustments and algorithm tweaks could be tested, keeping the focus on the underlying research questions, not the programming implementation.  The shared memory model of the UV2 meant the standard Linux GraphViz package, designed for desktop use, could be used without any modifications to render the final networks, scaling to hundreds of gigabytes of memory as needed.  Finally, three terabytes of the machine’s memory were carved off to create a RAM disk, which is essentially a filesystem that exists entirely in system memory.  While such filesystems are temporary, in that they are lost if the machine is powered down, their read/write performance is limited only by the speed of computer memory and is over 1,000 times faster than even traditional solid-state disks.  In this project, the use of a RAM disk meant that all 4,000 processor cores could read and write the same set of common files in non-linear order with little to no delay, whereas a traditional magnetic disk system would support only a fraction of this storage load.
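A miniature version of the hash-table-to-GraphViz workflow looks like this. The original scripts were PERL operating on an 800 gigabyte in-memory graph; this Python sketch uses a `Counter` in the role of the PERL hash table and emits GraphViz DOT text, with the sample entities purely illustrative.

```python
from collections import Counter
from itertools import combinations

def build_dot(docs):
    """Count how often pairs of entities co-occur in the same
    document (the in-memory hash-table step) and emit a weighted
    undirected graph in GraphViz DOT format for rendering."""
    edges = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    lines = ["graph G {"]
    for (a, b), w in sorted(edges.items()):
        lines.append(f'  "{a}" -- "{b}" [weight={w}];')
    lines.append("}")
    return "\n".join(lines)
```

The DOT output can be piped straight into a GraphViz layout engine such as `sfdp`, which is what makes a shared-memory machine so convenient: the same unmodified desktop tool simply scales up with available memory.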

Part 1: Background

In part 1 of this article, the author describes the project background, purpose and some of the challenges of data collection.

Part 3: Data analytics and Visualization

In part 3 of this article, the author describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data.


2. Leetaru, K. (2011). “Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space.” First Monday, 16(9).
3. Bellomi, F. & Bonato, R. (2005). “Network Analysis for Wikipedia.” Proceedings of Wikimania.
4. Leetaru, K. (forthcoming). “Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.” D-Lib Magazine.
7. Leetaru, K. (forthcoming). “Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.” D-Lib Magazine.
9. Leetaru, K. (2011). “Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space.” First Monday, 16(9).