In this part of the article Kalev Leetaru describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data. For other parts of this article click on the links here: Summary, Part 1, Part 2.

The growth of world knowledge

Putting this all together, what can all of this data say about Wikipedia’s view of world history?  One of the greatest challenges facing historical research in the digital era is the so-called “copyright gap” in which the majority of available digital documents were published either in the last few decades (born digital) or prior to 1924 (copyright expiration).  The vast majority of the twentieth century has gone out of print, yet is still protected by copyright and thus cannot be digitized.  Computational approaches can only examine the digital record and as scholarship increasingly relies on digital search and analysis methods, this is creating a critical knowledge gap in which far more is known about the literature of the nineteenth century than of the twentieth.  In an illustration of how severe a problem this has become, one recent analysis of books in Amazon.com’s warehouses found there were twice as many books from 1850 available as digital reprints as there were from 1950 due to this effect (1). It seems logical that perhaps Wikipedia’s contributors might rely on digitized historical resources to edit its entries and thus this same effect might manifest itself in Wikipedia’s view of history.

Figure 1 shows the total number of mentions across Wikipedia of dates in each year 1001AD to 2011, visualizing its timeline of world history. The date extraction tool used to identify all date mentions works on any date range, but four-digit year mentions are more accurate since in Wikipedia four-digit numbers that are not dates have commas, reducing the false positive rate.  Immediately it becomes clear that the copyright gap seen in other collections has not impacted the knowledge contained in Wikipedia’s pages.  Instead, there is a steady exponential growth in Wikipedia’s coverage through time, matching intuition about the degree of surviving information about each decade.  For the purposes of this study, references to decades and centuries were coded as a reference to the year beginning that time period (“the 1500’s” is coded as the year 1500), which accounts for the majority of the spikes.  One can immediately see major events such as the American Civil War and World Wars I and II.  Figure 2 shows the same timeline, but using a log scale on the Y axis.  Instead of displaying the raw number of mentions each year, a log scale turns steady exponential growth into a straight line, making it easier to spot the large-scale patterns in how a dataset has expanded over time.  In this case, the log graph shows that Wikipedia’s historical knowledge 1001AD-2011 largely falls into four time periods: 1001-1500, 1501-1729, 1730-2003, 2004-2011. During the first period (roughly corresponding to the Middle Ages) the number of mentions of each year grows slowly and steadily from around 2,200 mentions to around 2,500. This rapidly accelerates to around 6,500 mentions during the second period (corresponding to the Early Modern Period, starting around the late Renaissance). The growth rate then increases once again in the third period (corresponding to the start of the Age of Enlightenment), rising to 650,000 mentions of each year.  Finally, the fourth period begins with the rise of Wikipedia itself (the “Wikipedia Era”), with a sudden massive growth rate far in excess of the previous periods.
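To make the year-counting step concrete, here is a minimal Python sketch of how bare four-digit years might be tallied. It is not the date extraction tool used in the study: the regular expression, the function name `count_year_mentions`, and the sample sentence are assumptions made for illustration. It relies on the observation above that non-date numbers carry commas, so a bare four-digit number is treated as a year, and a decade reference such as “the 1500’s” falls on its starting year.

```python
import re
from collections import Counter

# A bare run of exactly four digits, optionally followed by a decade suffix
# ("1500s", "1500's", "1500’s"). Numbers written with thousands separators,
# such as "1,500" or "12,000", never contain four adjacent digits.
YEAR_RE = re.compile(r"(?<![\d,.])(\d{4})(?!\d)(?:['’]?s\b)?")


def count_year_mentions(text, first=1001, last=2011):
    """Count mentions of each year in [first, last] found in text."""
    counts = Counter()
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if first <= year <= last:
            counts[year] += 1
    return counts


# Invented sample text, for illustration only.
sample = (
    "In the 1500’s trade grew. By 1865 some 12,000 soldiers and "
    "1,500 ships remained; 1865 ended the war."
)
counts = count_year_mentions(sample)
print(sorted(counts.items()))
```

A production extractor would also need to handle full dates, written-out centuries, and four-digit numbers that are not years, none of which this sketch attempts.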

Figure 1: Number of mentions of each year 1001AD-2011 in Wikipedia (Y axis is number of pages)

Figure 2: Number of mentions of each year 1001AD-2011 in Wikipedia (Y axis is log scale of page count to show growth rate)

Figure 3 shows a zoom-in of the period 1950-2011, showing that the initial spike of coverage leading into the Wikipedia Era begins in 2001, the year Wikipedia was first released, followed by three years of fairly level coverage, with the real acceleration beginning in 2004.  Equally interesting are the leveling-off that begins in 2008 and the fact that there are nearly equal numbers of mentions of the last three years: 2009, 2010, and 2011. Does this reflect that Wikipedia is stagnating or has it perhaps finally reached a threshold at which all human knowledge generated each year is now recorded on its pages and there is simply nothing more to record?  If the latter were true, this would mean that most edits to Wikipedia today focus on contemporary knowledge, adding in events as they happen, turning Wikipedia into a daybook of modern history.

Figure 3: Number of mentions of each year 1950-2011 in Wikipedia (Y axis is number of pages)

Figure 4 offers an intriguing alternative. It plots the total number of articles in the English-language Wikipedia by year 2001-2011 against the number of mentions of dates from that year.  There are nearly as many mentions of 2007 as there were pages in Wikipedia that year (this does not mean every page mentioned that year, since a single page mentioning a year multiple times will account for multiple entries in this graph).  Since 2007, Wikipedia has continued to grow substantially each year, while the number of mentions of each of those years has leveled off.  This suggests that Wikipedia’s growth is coming in the form of enhanced coverage of the past and that it has reached a point where there are only 1.7-1.9 million new mentions of the current year added, suggesting the number of items deemed worthy of inclusion each year has peaked.

Figure 4: Size of Wikipedia versus number of mentions of that year 2001-2011

Of course, the total number of mentions of each year tells only one part of the story.  What was the emotional context of those mentions?  Were the events being described discussed in a more negative or a more positive light?

Figure 5 visualizes how “positive” or “negative” each year was according to Wikipedia (to normalize the raw tonal scores, the Y axis shows the number of standard deviations from the mean, known as the Z-score).  Annual tone is calculated through a very simplistic measure, computing the average tone of every article in Wikipedia and then computing the average tone of all articles mentioning a given year (if a year is mentioned multiple times in an article, the article’s tone is counted multiple times towards this average).  This is a very coarse measure and doesn’t take into account that a year might be referenced in a positive light in an article that is otherwise highly negative.  Instead this measure captures the macro-level context of a year: on the scale of Wikipedia, if a year is mentioned primarily in negative articles, that suggests something important about that year.
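The tone measure described above can be sketched in a few lines of Python. This is an illustrative reading, not the study’s code: the input layout (one tone score plus a list of year mentions per article), the function name `annual_tone_zscores`, and the choice to standardize against the mean and standard deviation of the annual averages are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def annual_tone_zscores(articles):
    """articles: iterable of (tone, years) pairs, where years lists every
    year mention in the article, repeats included."""
    totals = defaultdict(float)
    mentions = defaultdict(int)
    for tone, years in articles:
        for year in years:
            # An article counts once per mention of the year.
            totals[year] += tone
            mentions[year] += 1
    annual = {year: totals[year] / mentions[year] for year in totals}
    values = list(annual.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {year: 0.0 for year in annual}
    return {year: (tone - mu) / sigma for year, tone in annual.items()}


# Invented tone scores, for illustration only.
sample_articles = [
    (-2.0, [1863, 1863, 1864]),
    (1.0, [1864, 1900]),
    (3.0, [1900]),
]
zscores = annual_tone_zscores(sample_articles)
```

In this toy input, 1863 appears only in the negative article and so receives the lowest Z-score, while 1864 sits between the two because it is mentioned in both a negative and a positive article.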

Figure 5: Average tone of all articles mentioning each year 1001AD-2011 (Y axis is Z-score)

One of the most striking features of Figure 5 is the dramatic shift towards greater negativity between 1499 and 1500.  Tone had been becoming steadily more negative from 1001AD to 1499, shifting an entire standard deviation over this period, but there is a sudden sharp shift of one full standard deviation between those two years, with tone remaining more negative until the most recent half-century.  The suddenness of this shift suggests this is likely due to an artifact in Wikipedia or the analysis process, rather than a genuine historical trend such as a reflection of increasing scholarly questioning of worldly norms during that period. Possibilities include a shift in authorship or writing style, or increased historical documentary record that covers a greater class of events.  Another striking plunge towards negativity occurs from 1861-1865, reflecting the American Civil War, with similar plunges around World Wars I and II.  World War II shows nearly double the negativity that World War I did, but just three quarters of that of the Civil War.

Visualizing Wikipedia over time and space

The Figures above show the power of visualizing Wikipedia temporally, but to really understand it as a global daybook of human activity, it is necessary to add the spatial dimension.  The primary geographic databases used for looking up location coordinates are limited to roughly the last 200 years, so here the analysis was limited to 1800-present (2).  Each location was associated with the closest date reference in the text and vice-versa, leading to a spatially and temporally referenced network capturing the locations and connections among those locations through time recorded in Wikipedia’s pages. For every pair of locations in an article with the same associated year, a link was recorded between them.  The average tone of all articles mentioning both locations with respect to the same year was used to compute the color of that link. A scale from bright green (high positivity) through bright red (high negativity) was used to render tone graphically. The importance of time and location in Wikipedia results in 3,851,063 nodes and 23,672,214 connections across all 212 maps from 1800-2012.  The massive number of connections meant most years simply became an unintelligible mess of crisscrossing links.  To reduce the visual clutter, the first sequence discarded links that appeared in fewer than 10 articles (see Figure 6). This preserves only the strongest links in the data. To focus only on the linking structure, the second sequence displayed all links, but discarded the tonal information and made each edge semi-transparent so they blended into one another (see Figure 7). The result is that an isolated link with no surrounding links will appear very faint, while lots of links overlapping on top of each other will result in a bright white flare. By focusing purely on the linking structure, this animation shows evolving connections across the world.
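The linking procedure described above can be illustrated with a short Python sketch. It is a simplified reading rather than the pipeline actually used: measuring “closest” by character offset, the dictionary layout of each article, and the names `nearest_year` and `build_links` are all assumptions, and the sample data is invented.

```python
from collections import defaultdict
from itertools import combinations


def nearest_year(offset, year_mentions):
    """Return the year whose mention sits closest to a location's offset."""
    return min(year_mentions, key=lambda m: abs(m[0] - offset))[1]


def build_links(articles, min_articles=10):
    """Link every pair of locations sharing a year within an article.

    Returns {(year, place_a, place_b): average article tone}, keeping only
    links seen in at least min_articles articles.
    """
    tones = defaultdict(list)
    for article in articles:
        if not article["years"]:
            continue  # no date reference to attach locations to
        places_by_year = defaultdict(set)
        for offset, name in article["locations"]:
            year = nearest_year(offset, article["years"])
            places_by_year[year].add(name)
        for year, names in places_by_year.items():
            for a, b in combinations(sorted(names), 2):
                tones[(year, a, b)].append(article["tone"])
    return {
        link: sum(values) / len(values)
        for link, values in tones.items()
        if len(values) >= min_articles
    }


# Invented offsets and tone scores, for illustration only.
sample_articles = [
    {"tone": -1.0,
     "locations": [(10, "Tripoli"), (40, "Murzuk"), (200, "Benghazi")],
     "years": [(20, 1846), (190, 1911)]},
    {"tone": -3.0,
     "locations": [(5, "Murzuk"), (60, "Tripoli")],
     "years": [(30, 1846)]},
]
links = build_links(sample_articles, min_articles=2)
```

With the threshold lowered to two articles for this toy input, only the Murzuk-Tripoli link for 1846 survives, with an average tone of -2.0. That average is the value that would be mapped onto the green-to-red color scale.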

Figure 6: Tone map (see video at  https://www.youtube.com/watch?v=KmCQVIVpzWg)

Figure 7: Intensity map (see video at https://www.youtube.com/watch?v=wzuOcP7oml0)

Interactively browsing Wikipedia through time and space

While animations are an extremely powerful tool for visualizing complex information, they do not allow users to interactively drill into the data to explore interesting trends.  Ultimately one would like to be able to convert those static images into an interactive interface that would enable browsing Wikipedia through time and space.  As an example, let’s say one was interested in everything Wikipedia said about a certain area of Southern Libya in the 1840’s and 1850’s.  Wikipedia’s own keyword search interface would not be useful here, as it does not support advanced Boolean searches, only searches for a specific entry. Since the Wikipedia search tool does not understand the geographic and date information contained on its pages, one would have to manually compile a list of the name of every city and location in the area of interest, download a copy of Wikipedia, and write a program to run a massive Boolean search along the lines of “(city1name OR city2name OR city3name OR … ) AND (1841 OR 1842 OR …)”.   Obviously such a task would be infeasible for a large area and highly labor-intensive and error-prone even for small queries.  This is a fundamental inconsistency of Wikipedia as it exists today: it contains one of the richest open archives of historical knowledge arrayed through time and space, but the only mechanism of interacting with it is through a keyword search box that cannot take any of this information into account.
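For a sense of what such a brute-force program would involve, the following Python sketch runs the Boolean query over a small in-memory set of pages. The `pages` dictionary, the function name `boolean_search`, and the sample texts are hypothetical. A real run would first require parsing a full Wikipedia download and compiling the list of place names by hand, which is where most of the labor lies.

```python
import re


def boolean_search(pages, place_names, years):
    """Return titles of pages matching (any place name) AND (any year).

    Both lists are assumed to be non-empty.
    """
    places = "|".join(re.escape(name) for name in place_names)
    place_re = re.compile(r"\b(?:" + places + r")\b", re.IGNORECASE)
    year_alternatives = "|".join(str(year) for year in years)
    year_re = re.compile(r"(?<!\d)(?:" + year_alternatives + r")(?!\d)")
    return [
        title
        for title, text in pages.items()
        if place_re.search(text) and year_re.search(text)
    ]


# Invented page texts, for illustration only.
pages = {
    "Tajarhi": "Slave trade traffic through Tajarhi increased in 1846.",
    "Tripoli": "Tripoli is the capital of Libya.",
    "Other": "Something happened in 1846.",
}
hits = boolean_search(pages, ["Tajarhi", "Murzuk"], range(1840, 1860))
```

Even this toy version shows the weakness of the approach: it finds a page only if one of the hand-listed names appears verbatim, and it cannot tell whether the year and the place are actually related within the text.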

To prototype what such an interface might look like, all of the information from the animation sequences for Libya 1800 to 2012 described above was extracted and used to create a Google Earth KML file. Figure 8 links to a Google Earth file (3) that offers interactive browsing of Wikipedia’s coverage of Libya over this period. Libya was chosen because it offered a large geographic area with a fair amount of change over time, while still having few enough points that it could easily load in Google Earth.   Unfortunately, most geographic mapping tools today support only a small number of points and Google Earth is one of the few systems that supports date-stamped records.  Each location is date-stamped in this demo to the year level so the Google Earth time slider feature can be used to move through time to see what locations of Libya have been mentioned with respect to different time periods over the last 212 years (note that Google Earth operates at the day level, so even though this data is at the year level, Google Earth will show individual days in the time slider).  The display can be narrowed to show only those locations mentioned with respect to a certain timeframe, or one can scroll through the entire 212 years as an animation to see which areas have attracted the attention of Wikipedia’s editors over time.  Imagine being able to load up the entire world in this fashion and browse all of Wikipedia’s coverage in time and space!
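A year-stamped KML file of this kind can be generated with very little code. The sketch below is not the script that produced the Libya file: the function names are made up, and the coordinates are rough illustrative placeholders rather than values taken from that file. It uses the standard KML `TimeStamp` element, which accepts a bare year and is what drives the Google Earth time slider.

```python
from xml.sax.saxutils import escape


def placemark(name, lon, lat, year):
    """One date-stamped point; KML coordinates are longitude,latitude,altitude."""
    return (
        "  <Placemark>\n"
        f"    <name>{escape(name)} ({year})</name>\n"
        f"    <TimeStamp><when>{year:04d}</when></TimeStamp>\n"
        f"    <Point><coordinates>{lon},{lat},0</coordinates></Point>\n"
        "  </Placemark>\n"
    )


def build_kml(mentions):
    """mentions: iterable of (name, lon, lat, year) tuples."""
    body = "".join(placemark(*mention) for mention in mentions)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "<Document>\n" + body + "</Document>\n</kml>\n"
    )


# Placeholder coordinates, for illustration only.
mentions = [
    ("Tajarhi", 14.4, 24.4, 1819),
    ("Tajarhi", 14.4, 24.4, 1846),
    ("Tajarhi", 14.4, 24.4, 1848),
]
kml = build_kml(mentions)
```

Writing `kml` to a file with a `.kml` extension and opening it in Google Earth should show one point per mention, each visible only when the time slider covers its year.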

Figure 8: Interactive Google Earth file for Libya (see  http://www.sgi.com/go/wikipedia/LIBYA-1800-2012.KML)

The one-way nature of Wikipedia

The Google Earth demonstration illustrates several limitations of Wikipedia’s reliance on human editors to provide links between articles. For example, the Google Earth display shows mentions of Tajarhi, Libya in 1846 and 1848, reflecting that the entry for that city says slave trade traffic increased through there after Tunisia and Algeria abolished the trade, and also shows a mention in 1819 to reflect a description of it that year by the British naval explorer George Lyon (4). The article mentions both Tunisia and Algeria with respect to the slave trade, but those mentions are not links to those articles.  The mention of George Lyon is also problematic, in that the actual Wikipedia page on his life is titled with his full name, “George Francis Lyon” (5), makes no mention of Tajarhi, only Tripoli and Murzuk, and is not linked from the Tajarhi page, requiring a visitor to manually keyword search on his name.  The fact that these mentions of Tunisia, Algeria, and George Lyon have not been made into hyperlinks to those respective pages may at first seem to be only a small inconvenience.  However, a data mining analysis of Wikipedia that looked only at which pages linked to which other pages (which is one of the most common ways Wikipedia is analyzed) would miss these connections.  This illustrates the limitations of using linking data or other metadata to explore a large text corpus and the importance of examining the content itself.

Along those same lines are Wikipedia’s “Infoboxes” in which human editors can create a table that appears in the sidebar of an article with important key facts about that article.  These are often used as metadata to assign dates and locations to articles in data mining applications.  For example, the American Civil War entry (6) has an Infobox with a rich assortment of details, including the locations and dates of the war.  However, many articles do not contain such Infoboxes, even when the article focuses on a specific event. For example, the entry for the Barasa-Ubaidat War (7), fought from 1860 to 1890 in North-Eastern Libya and starting a year prior to the American Civil War, does not have an Infobox, and the only information on the dates and locations of the conflict appears in the article text itself.  The limitations of Infoboxes are something to keep in mind, as many studies and datasets make use of them as a machine-friendly proxy for the factual contents of Wikipedia (8).

Another trend in Wikipedia apparent in this Google Earth display is the tendency for a connection between two people or places to be mentioned in one of their respective entries, but not in the other’s.  For example, the entry for Tazirbu, Libya (9) notes that Gerhard Rohlfs was the first European to visit the oasis, in 1879.  Rohlfs’ own entry (10), however, notes only that in 1874 he embarked upon a journey to the Kufra basin in the same Kufra district in which Tazirbu is located, but does not mention Tazirbu itself or his visit there in 1879. The Kufra basin entry (11) notes that Rohlfs reached it in 1879, but again mentions nothing of Tazirbu or other details. The entry for Kufra District (12) in which both are located, mentions only that the name Kufra is a derivation of the Arabic word for a non-Muslim and cites one of Rohlfs’ books, but does so only in the references list, and makes no mention of his travels in the text itself. Of course, Wikipedia entries must balance the desire to provide cross-links and updated information without turning each entry into a sea of links and repeated information.  This is one of the areas where Wikipedia’s openness really shines, in that it opens the door for computer scientists, interface designers, and others to apply data mining algorithms to develop new interfaces to Wikipedia and find new ways of finding and displaying these connections transparently.

The ability to display information from across Wikipedia temporally and spatially allows a reader to place a given event in the context of world events of the time period.  For example, the Google Earth display contains a reference to Tripoli with respect to 1878 (the year prior to Rohlfs’ visit to Tazirbu) that points to the entry for the Italo-Turkish War (13). At first glance this war appears to have no relation to 1878, having occurred 1911-1912.  Yet, the opening sentence of the introductory paragraph notes that the origins of this war, in which Italy was eventually awarded the region of modern-day Libya, began with the Congress of Berlin in 1878.  Thus, while likely entirely unrelated to Rohlfs’ journey, it provides an additional point of context that can be found simply by connecting all of Wikipedia’s articles together.

Thus, a tremendous amount of information in Wikipedia is one-way: one entry provides information about the connections between other entries, but those entries do not in turn mention this connection.  If one were interested in the travels of Gerhard Rohlfs, a natural start would be to pull up his Wikipedia entry.  Yet, his entry mentions only a brief synopsis of his African journey, with no details about the cities he visited. Even Paul Friedrich August Ascherson, who accompanied him on his journey, is not mentioned, while Ascherson’s entry (14) prominently mentions his accompanying Rohlfs on the journey.  One would have to keyword search all of Wikipedia for any mention of Rohlfs’ name and then manually read through all of the material and synthesize the information in time and space to fully map out his journey. Using computational analysis, machines can do most of this work, presenting just the final analysis. This is one of the basic applications of data mining unstructured text repositories: converting their masses of words into knowledge graphs that recover these connections. In fact, this is what historical research is about: weaving a web of connections among people, places, and activities based on the incomplete and one-way records scattered across a vast archive of material.
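As a rough illustration of how one-way connections might be surfaced automatically, the Python sketch below flags pairs of entries where one names the other but not the reverse. It is a deliberately naive assumption-laden example: the function name `one_way_mentions` is hypothetical, the entry texts are paraphrased stand-ins rather than actual Wikipedia content, and matching on exact titles would miss name variants such as “George Lyon” versus “George Francis Lyon”.

```python
def one_way_mentions(entries):
    """entries: dict mapping entry title to entry text.

    Returns (source, target) pairs where the source text names the target
    title but the target text never names the source title.
    """
    pairs = []
    for source, source_text in entries.items():
        for target, target_text in entries.items():
            if source == target:
                continue
            if target in source_text and source not in target_text:
                pairs.append((source, target))
    return pairs


# Paraphrased stand-in texts, for illustration only.
entries = {
    "Gerhard Rohlfs": "German explorer who journeyed to the Kufra basin.",
    "Paul Friedrich August Ascherson":
        "Botanist who accompanied Gerhard Rohlfs on his journey.",
}
asymmetric = one_way_mentions(entries)
```

Here the only pair returned points from Ascherson’s entry to Rohlfs’ entry, mirroring the example above. Comparing every entry against every other is quadratic, so a corpus the size of Wikipedia would need an index of name mentions rather than this pairwise scan.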

The networks of Wikipedia

As a final set of analyses, four network visualizations were constructed to look at the broader structure of connections captured in Wikipedia. Figure 9 shows how category tags are connected through co-occurrences in category-tagged articles. Wikipedia allows contributors to assign metadata tags to each article that describes the primary categories relevant to it.  In this case, each category tag applied to an article was cross-linked with each other category tag for that article, across the entirety of Wikipedia, resulting in a massive network capturing how categories co-occur.  This diagram illustrates a central core of categories around which other sub clusters of categories are tightly connected. Figure 10 shows the network of co-mentions of all person names across Wikipedia. In this case, a list of all person names appearing on each page was compiled and links formed to connect all person names appearing together in an article. This network shows a very different structure, which is far more diffuse with far greater clustering of small groups of people together. Figure 11 shows the same approach applied to names of organizations. In this case, it is more similar to category tags, but shows more complex structure at the core, of clusters of names to which other clusters are tightly connected. Finally, Figure 12 shows the network of co-mentions of years across Wikipedia. This network illustrates that the closer to the present, the more Wikipedia content revolves around that year. This captures the fact that entries across Wikipedia tend to be updated with new information and events from the current year, which draws a connection between those earlier years and the present.

Figure 9: Network of co-occurrences of category tags across Wikipedia

Figure 10: Network of co-occurrences of person names across Wikipedia

Figure 11: Network of co-occurrences of organization names across Wikipedia

Figure 12: Network of co-occurrences of years across Wikipedia

Conclusions

This study has surveyed the current landscape of the Big Data Humanities, Arts, and Social Sciences (HASS) disciplines and introduced the workflows, challenges, and opportunities of this emerging field.  As emerging HASS scholarship increasingly moves towards data-driven computationally-assisted exploration, new analytical mindsets are developing around whole-corpus data mining, data movement, and metadata construction.  Interactive exploration, visualization, and ad-hoc hypothesis testing play key roles in this new form of analysis, placing unique requirements on the underlying data storage and computation approaches. An exploration of Wikipedia illustrates all of these components operating together to visualize Wikipedia’s view of world history over the last two centuries through the lens of space, time, and emotion.

Acknowledgements

The author wishes to thank Silicon Graphics International (SGI) for providing access to one of their UV2000 supercomputers to support this project.

Summary
Part 1: Background

In part 1 of this article, the author describes the project background, purpose and some of the challenges of data collection.

Part 2: Data processing and Analytical methodologies

The methods by which the Wikipedia data was stored, processed, and analysed are presented in this part of the article.

References and Useful Links

1. http://www.theatlantic.com/technology/archive/2012/03/the-missing-20th-century-how-copyright-protection-makes-books-vanish/255282/
2. Leetaru, Kalev. (forthcoming).  Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.  D-Lib Magazine.
3. Requires a free download of Google Earth http://www.google.com/earth/index.html
4. http://en.wikipedia.org/wiki/Tajarhi
5. http://en.wikipedia.org/wiki/George_Francis_Lyon
6. http://en.wikipedia.org/wiki/American_Civil_War
7. http://en.wikipedia.org/wiki/Barasa%E2%80%93Ubaidat_War
8. http://www.infochimps.com/collections/wikipedia-infoboxes
9. http://en.wikipedia.org/wiki/Tazirbu
10. http://en.wikipedia.org/wiki/Friedrich_Gerhard_Rohlfs
11. http://en.wikipedia.org/wiki/Kufra
12. http://en.wikipedia.org/wiki/Kufra_District
13. http://en.wikipedia.org/wiki/Italo-Turkish_War
14. http://en.wikipedia.org/wiki/Paul_Friedrich_August_Ascherson