Articles

Research Trends is an online magazine providing objective insights into scientific trends based on bibliometric analyses.

Fixing authorship – towards a practical model of contributorship

In this contribution, Mike Taylor and Gudmundur A. Thorisson discuss the problems surrounding authorship in research today, and how these can be resolved in this digital age.



Introduction

As we near the completion of the metamorphosis of paper-based scholarly publishing into a medium based entirely on the Internet, there is an increasing need to enrich the environment with a connected network, unfettered by the legacy of putting ink onto paper. One of the more recent areas to come under consideration is the set of issues and concepts surrounding authorship, and how these can be represented in a wholly digital world. For legal and copyright reasons, the concept of ‘an author’ of a scholarly work is likely to persist for some time. However, the idea that a simple list of authors is the optimum way of recording scholarly achievement has reached the end of its shelf life. It’s time to move on.

Anyone connected with scholarly publishing knows that a variety of tasks are covered and obscured by the term “authorship”, and that there are vital research tasks that are not considered worthy of the term. Moreover, there are many grey areas: for example, ‘guest’ authorship - where the names of people who have had little or no involvement in the research appear in the author list - and ‘ghost’ authorship - where legitimate authors are left off the author list for reasons of expediency or politics.

Clearly, there cannot be just one resolution for authorship-related problems. However, the study of contributorship - and the development of a standard infrastructure to support more nuanced relationships between researcher and published output - promises to solve the logistical issues, and to illuminate those that have an ethical basis. A prominent example of work in this area is the recent International Workshop on Contributorship and Scholarly Attribution (IWCSA), in which we participated and which recently published its results (1).


Authorship broken, needs fixing

Current definitions of authorship cover only a very limited set of relationships that a person can have with a published article. Typical author lists tend to include only authors and/or editors, with other contributions and relationships indicated inconsistently via free text in an acknowledgements section.

This binary approach to recognizing contributions to a published scholarly work - essentially a relic from the print age - has many flaws. The Harvard Workshop recognized nine specific issues, which are listed in Table 1:

Problem identified by Workshop | Resolution approach
Varied authorship conventions across disciplines | -
Increasing number of authors on articles | -
Inadequate definitions of authorship | -
Inability to identify individual contributions | -
Damaging effect of authorship disputes | -
Current metrics are inadequate to capture and include new forms of scholarship and effort | Altmetrics (e.g., altmetric.com, altmetrics.org, impactstory.org)
Inability of funders to track the outputs of their funding | Fundref (http://www.crossref.org/fundref/index.html)
Name ambiguity leads to misattribution of credit and accountability | ORCID (www.orcid.org)
Aggregation of attribution information from a large number of sources | ORCID (www.orcid.org), etc.

Table 1: Problems caused by existing authorship practice (Harvard Workshop)

Many readers will be familiar with some or even all of these issues as authors or editors. Here we want to highlight and elaborate on what we consider the most prominent ones:


Varied authorship conventions across disciplines

It often comes as a surprise that the significance of author order, and the roles it implies, varies between disciplines. Take, for example, the diverse ways in which the same author order on a fictional paper written by Smith, Taylor and Thorisson might be interpreted depending on the discipline:

High Energy Physics | Author list is in alphabetical order; no precedence can be inferred. Names may include engineers as well as researchers.
Economics, some fields within Social Sciences | Author list is in alphabetical order; no precedence can be inferred.
Life Sciences | Smith, the postdoc, did most of the experimental work, but Thorisson was the principal investigator who led the scientific direction of the work. The alphabetical order is coincidental.
‘Standard’ order | Smith is the senior researcher who did most of the work; Taylor was subordinate to Smith, and Thorisson to Taylor. The alphabetical order is coincidental.

Table 2: Varied authorship conventions across disciplines


Increasing number of authors on articles

High Energy Physics (HEP) is well-known for long author lists on research papers, with over 3,000 authors credited in recent extreme cases. This is in part because of the complexity and scale of HEP research, but also because HEP publications tend to give equal weighting to researchers and engineers alike. Clearly, the traditional model of the author as the writer of the work is not being applied in this discipline (2).

Equally, having 1,000+ authors on a single paper presents novel logistical problems of managing a non-trivial amount of publication metadata - merely getting all the names and affiliations correct is a significant challenge. In fields other than HEP, there is also a clear trend towards an increased number of authors per published paper. For example, the Wellcome Trust reports that the number of authors on its genetics papers rose from around 10 to nearly 29 between 2004 and 2010. Furthermore, many standard ways of assessing scholarly impact divide the credit amongst the authors in an essentially arbitrary manner. This leads to the so-called “dilution effect”, whereby even a well-cited paper makes little or no contribution to the metrics of its individual authors because credit is “diluted” across the large number of authors.
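To illustrate the dilution effect, here is a minimal sketch (our own illustration, not a description of any particular metric provider's method) of what happens when citation credit is split evenly among co-authors:

    def fractional_credit(citations, n_authors):
        # Split citation credit evenly among co-authors - one simple (and arbitrary) convention.
        return citations / n_authors

    # A well-cited paper with a handful of authors versus a large consortium paper:
    print(fractional_credit(200, 5))     # 40.0 citations' worth of credit per author
    print(fractional_credit(200, 3000))  # roughly 0.07 per author - the credit is "diluted"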


Inadequate definitions of authorship

There is no universal definition of what is meant by research authorship: the closest that exists is a set of rules drawn up by the International Committee of Medical Journal Editors (ICMJE) (3). These rules have been adapted and used by a number of journals over the last several years, although even the ICMJE recognizes that they are outdated (as Christine Laine, Editor of the Annals of Internal Medicine, reported at IWCSA).


Inability to identify individual contributions

In any multi-author work, the tasks that the individuals listed as authors have contributed can be broken down further. Traditional author lists do not allow for any credit at this finer level. Many journals now allow (or even require) contributorship statements at the end of the article, but these are rarely in a standardized form that can be processed automatically to inform calculations of impact, expertise or standing. This lack of granularity can lead to the case where a senior researcher who has had little or no influence on a paper is credited with “proper” authorship, whereas a computer programmer who made a significant contribution by constructing key algorithms is perhaps not credited at all.


Damaging effect of authorship disputes

The lack of clarity around authorship claims and credit has led to a growth in authorship disputes and a number of scandals. A detailed and standardized method of declaring contributions is likely to put an end to all but the most egregious of such disputes. The problems revealed by an analysis of author/article relationships fall into two broad categories: logistical (in other words, technical) and ethical. However, these are not conveniently discrete categories: an inability to precisely define the relationships forces a research team to impose a crude classification on its members. Given that authorship is the principal means of recognizing academic achievement, this is not a trivial matter.


Contributorship

We hope that one of the major outcomes of this field of work will be an evidence-based system for classifying the relationships between a researcher and a published work. Moreover, we hope that this taxonomy will facilitate the codification of relationships that go beyond traditional authorship, thus removing the difficult decisions that can arise when compiling an author list. For example, by explicitly allowing “data collection” or “algorithm creation” as types of contribution, it would be possible to formally attribute credit to members of the team whom a strict adherence to authorship conventions (such as they are) would likely ignore, whilst not conflating the precise nature of their contribution with intellectual leadership. In the same vein, specifying “Head of research team” or “Principal investigator” would distinguish a senior member’s relationship with the work from that of others who also made intellectual contributions. A minimal sketch of how such a statement might be recorded is given below.
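The following sketch is purely illustrative: the field names and role labels are our own hypothetical examples, not an agreed taxonomy or any existing standard.

    # Hypothetical machine-readable contributorship statement (illustrative only).
    contributors = [
        {"name": "Smith",     "roles": ["experimental work", "drafting the manuscript"]},
        {"name": "Taylor",    "roles": ["algorithm creation", "data collection"]},
        {"name": "Thorisson", "roles": ["principal investigator"]},
    ]

    # Because roles are explicit, credit can be queried rather than inferred from author order.
    algorithm_authors = [c["name"] for c in contributors if "algorithm creation" in c["roles"]]
    print(algorithm_authors)  # ['Taylor']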

Clearly, the answer to this problem goes beyond the creation of a standard - there needs to be an infrastructure for storing these complex relationships, tools to create them and maintain them, and ways of displaying them. Most importantly, the benefits of fully recording these relationships must outweigh (and be seen to outweigh) the cost of the additional complexity and work required (i.e. beyond what is currently the norm).

Software can certainly help in this effort (although the idea of determining who-did-what with a list of 1000+ researchers is overwhelming!) and there have been some very good examples of simple, spreadsheet-based tools in recent proof-of-principle projects. However, the task of apportioning responsibilities (and rewards) can start earlier - perhaps within research tools such as Mendeley.


Help is coming

Many of the issues highlighted above are being tackled by a diverse community of agencies and approaches, many of which came together for the IWCSA workshop. Here we want to highlight a particularly important one: the Open Researcher & Contributor ID initiative (ORCID: http://about.orcid.org). Launched in mid-October 2012, the registry service operated by ORCID enables researchers to create a public identity and obtain a persistent personal identifier, and to maintain a centralized record of their scholarly activities (4), (5).

Whilst the basic idea of an online “author profile” is not unique or innovative in itself, several key attributes differentiate the new service from the myriad free and commercial services in this space. First, it is backed by a non-profit, community-based organization with participation from commercial publishers, academic institutions, research libraries, funding agencies and many others. Second, major stakeholders in the ORCID community are committed to developing software applications and platforms that build on and integrate with the central ORCID service, automatically linking scholars and their published works.

At the time of writing, the ORCID service is limited in functionality and is experiencing some early growing pains, but it is improving over time with the strong support of the community. Despite these initial teething troubles, several integrations built by ORCID’s launch partners are already operational and more will come online in the next several months.

So what is ORCID’s relevance to the attribution challenges outlined above? Although the first-generation service is functionally limited, the core system has been built to support future developments and definitions that go beyond basic author or editor roles. These can potentially include richer contributorship statements such as the examples already given above. It follows that ORCID can serve as a central index or discovery hub in which to look up not merely the base contributor-work relationship, but also the nuances of that relationship if more detailed information is available.
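As a rough sketch of how such a lookup might work, the snippet below queries ORCID's present-day public REST API (pub.orcid.org), which post-dates the first-generation service described here; the endpoint version and response fields are assumptions about the current interface, not something described in the original article.

    import requests  # third-party HTTP library

    def fetch_orcid_record(orcid_id):
        # Fetch the public record for an ORCID iD as JSON (assumes the v3.0 public API).
        url = "https://pub.orcid.org/v3.0/{}/record".format(orcid_id)
        response = requests.get(url, headers={"Accept": "application/json"})
        response.raise_for_status()
        return response.json()

    # Example: look up the public record behind one of the ORCID iDs listed in the author bios below.
    record = fetch_orcid_record("0000-0001-5635-1860")
    print(record["orcid-identifier"]["path"])  # the iD itself, as recorded in the registry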


Conclusions

Definitions are softening: in the new world of online digital publishing, “articles” are more than words on paper, metrics are more than citation counts, usage is more than subscriptions - and authors are more than just writers. The concept of authorship is rooted in our culture and in our minds, and that principle will not go away. But the idea of contributorship offers a richer set of definitions that enable our contributions to human knowledge to be recorded more precisely, if only we are willing to embrace it, and if the tools and infrastructure are developed that allow us to capture this information whilst not increasing administrative burden.

References

(1) Institute for Quantitative Social Science (2012) “Report on the International Workshop on Contributorship and Scholarly Attribution”, Available at: http://projects.iq.harvard.edu/attribution_workshop
(2) Aaij, R. et al. (enormous list of authors) (2012) “Measurement of the ratio of branching fractions B(B0 → K*0 γ)/B(Bs0 → φ γ) and the direct CP asymmetry in B0 → K*0 γ”, Nuclear Physics B, Vol. 867, No. 1, pp. 1-18.
(3) International Committee of Medical Journal Editors, “Uniform Requirements for Manuscripts submitted to Biomedical Journals: Ethical Considerations in the Conduct and Reporting of Research: Authorship and Contributorship”, Available at: http://www.icmje.org/ethical_1author.html
(4) Fenner, M., Gómez, C. G. & Thorisson, G. A. (2011) “Key Issue Collective Action for the Open Researcher & Contributor ID (ORCID)”, Serials: The Journal for the Serials Community Vol. 24, No. 3, pp. 277–279. http://dx.doi.org/10.1629/24277
(5) Fenner, M. (2011) “ORCID: Unique Identifiers for Authors and Contributors”, Information Standards Quarterly, Vol. 23, No. 3, pp.10-13. http://dx.doi.org/10.3789/isqv23n3.2011.03

 

Conflict of interest statement

The authors have both been active contributors to ORCID in the past three years. As of October 2012, one of them (GAT) is employed by ORCID part time to work on the EU-funded ODIN project (http://odin-project.eu).

Contributorship statement

The authors contributed equally to the drafting of this article.

About the authors

Gudmundur ‘Mummi’ Thorisson is an academic and consultant interested in scientific communication, in particular as this relates to open access to and use/reuse of research data in the life sciences. He has been involved in various projects relating to identity & unique identifiers in research and scholarly communication, most recently the ORCID initiative. Through his previous work in the GEN2PHEN project (http://www.gen2phen.org) he has also contributed to several database projects in the biomedical research domain, notably GWAS Central (http://www.gwascentral.org).

Gudmundur holds a PhD from the University of Leicester in the United Kingdom and worked there as a post-doctoral researcher after graduating in 2010. He currently works part time for ORCID on the ODIN project (http://odin-project.eu), whilst also working in a research support role at the Institute of Life and Environmental Sciences (http://luvs.hi.is), University of Iceland, Reykjavik where he is now based.

Personal website: http://gthorisson.name
ORCID profile: http://orcid.org/0000-0001-5635-1860

Mike Taylor is a research specialist in Elsevier Labs and the newest member of the Research Trends Editorial Board. His current areas of work include altmetrics, contributorship, research networks, the future of scholarly communications and other identity issues. He has worked in various capacities within the ORCID initiative. Before joining Elsevier Labs, Mike worked in various technology and publishing groups within Elsevier.

Website: http://labs.elsevier.com
ORCID profile: http://orcid.org/0000-0002-8534-5985


Australian Research Data — Policy and Practice

A presentation by Dr Ross Wilkinson, Australian National Data Service, at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012.


Link to presentation


Advancing Science through Local, Regional, and National Cyberinfrastructure

A presentation by Prof. Daniel Katz, University of Chicago, at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012.


Link to presentation

 

 


The use of large datasets in bibliometric research

A presentation by Dr Henk Moed, Senior Scientific Advisor, Elsevier, at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012.


Link to the presentation.

 

 


Big Data, E-Science and Science Policy: Managing and Measuring Research Outcome (part 1)

A presentation by Dr. Michiel Kolman, SVP Academic Relations, Elsevier, on day 1 of the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012.



Link to presentation


The use of Big Datasets in bibliometric research

This article illustrates how usage, citations, full text, indexing and other large bibliographic datasets can be combined and analyzed to follow scientific trends.



Introduction

Due to the increasing importance of scientific research for economic progress and competitiveness, and to new developments in information and communication technologies (ICT), the fields of bibliometrics and research assessment are rapidly developing. A few major trends can be identified:

  • An increase in actual use of bibliometric data and indicators in research assessment;
  • A strong proliferation of bibliometric databases and data-analytical tools, reflected, for instance, in the emergence of a range of journal subject classification systems and keyword mapping tools;
  • Indicators are becoming more and more sophisticated and fit-to-purpose; new approaches reveal that bibliometrics concerns much more than assessing individuals on the basis of journal impact factors;
  • There is an increasing interest in measuring the effects of the use of bibliometric indicators upon the behavior of researchers, journal editors and publishers;
  • Researchers, research evaluators and policy officials place an emphasis on the societal impact of research, such as its technological value or its contribution to the enlightenment of the general public;
  • Last but not least, more and more projects aim to create and analyze large datasets by combining multiple datasets.

This article deals with the last of these trends and focuses on demonstrating which datasets are currently being combined by research groups in the field. It also discusses the research questions that could be addressed using these large, combined datasets. An overview is given in Table 1 below.

 

Combined datasets | Studied phenomena | Typical research questions
Citation indexes and usage log files of full text publication archives | Downloads versus citations; distinct phases in the process of processing scientific information | What do downloads of full text articles measure? To what extent do downloads and citations correlate?
Citation indexes and patent databases | Linkages between science and technology (the science–technology interface) | What is the technological impact of a scientific research finding or field?
Citation indexes and scholarly book indexes | The role of books in scholarly communication; research productivity taking scholarly book output into account | How important are books in the various scientific disciplines, how do journals and books interrelate, and what are the most important book publishers?
Citation indexes (or publication databases) and OECD national statistics | Research input or capacity; evolution of the number of active researchers in a country and the phase of their career | How many researchers enter and/or move out of a national research system in a particular year?
Citation indexes and full text article databases | The context of citations; sentiment analysis of the scientific-scholarly literature | In what ways can one objectively characterize citation contexts? And identify implicit citations to documents or concepts?

Table 1: Compound Big Datasets and their objects of study

Examples

Downloads versus citations

For a definition of “usage” or “downloads” analysis and its context, the reader is referred to a previous Research Trends article on this topic (1). Figure 1 relates to journals included in ScienceDirect, Elsevier’s full text article database. For each journal, the average number of citations per article (counted in the third year after publication) was calculated, as well as the average number of full text downloads per article (counted in the year of publication). Journals were grouped into disciplines; the horizontal axis indicates the number of journals in a discipline. For each discipline, the Pearson correlation coefficient between journals’ downloads and citations was calculated and plotted on the vertical axis (a sketch of this kind of computation is given after the figure).

Figure 1: Downloads versus citations for journals in ScienceDirect
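The per-discipline correlation described above can be computed along the following lines; this is a minimal sketch assuming a hypothetical journals.csv file with per-journal averages already prepared, and the column names and the use of pandas/SciPy are our own choices, not a description of the original analysis pipeline.

    import pandas as pd
    from scipy.stats import pearsonr

    # Assumed input: one row per journal with columns
    # discipline, avg_downloads_per_article, avg_citations_per_article
    journals = pd.read_csv("journals.csv")

    for discipline, group in journals.groupby("discipline"):
        r, _ = pearsonr(group["avg_downloads_per_article"],
                        group["avg_citations_per_article"])
        # Number of journals in the discipline (x-axis of Figure 1) and the correlation (y-axis).
        print(discipline, len(group), round(r, 2))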

Figure 1 reveals large differences across disciplines in the degree of correlation between downloads and citations. For instance, in Biochemistry and Molecular Biology the correlation is above 0.9, whereas in Dentistry, Social Sciences, Health Professions, and Arts and Humanities it is equal to or less than 0.5.

The interpretation of these findings is somewhat unclear. One hypothesis is based on the distinction between authors and readers. In highly specialized subject fields these populations largely overlap, whereas in fields with a more direct societal impact the readership may consist mainly of professionals, or even the general public, who do not regularly publish articles. The hypothesis proposes that the correlation between downloads and citations is lower in the latter type of field than in the former. Additional research, also conducted at the level of individual articles, is needed to examine this hypothesis further.

Patents and scientific articles

Earlier this year, Research Trends also published an article analyzing patent citations to journal articles as a way of measuring the technological impact of research (2). The analysis focused on a subject field in the social sciences: it examined the characteristics of research articles published in Library Science journals and the manner in which they are cited in patents. Library Science articles were found to be well cited in patents. The cited articles deal with information retrieval and indexing, and with information and document management systems related to the development of electronic and digital libraries. The citing patents focus on electronic information administration, navigation, and the management of products and services in commercial systems. Interestingly, the time span between the scientific invention and its use in technology may be up to 10 years. This finding illustrates the time delays one has to take into account when trying to measure the technological or societal impact of scientific research. For an overview of this way of using patent citations, see (3).

Scopus author data versus OECD “input” statistics

Scopus, Elsevier’s scientific literature database, contains metadata on scientific publications from more than 5,000 publishers in 18,000 titles. It has implemented unique features that enable one to estimate the number of active – i.e., publishing – authors in a particular year, country and/or research domain, and also to track the “institutional” career of a researcher, providing information on the institutions in which a researcher has worked during his or her career. Research Trends issues 26 and 27 contained two articles by Andrew Plume presenting a first analysis of migration or brain-circulation patterns in the database (4), (5).

Data accuracy and validation is also a relevant issue in this case. One way to validate author data is to compare outcomes per country with statistics on the number of full-time equivalents (FTEs) spent on research in the various institutional sectors, obtained from questionnaires and published by the OECD.

 

Country | Germany | UK | Italy | The Netherlands
OECD number of FTE Research 2007 (all sectors) | 290,800 | 254,600 | 93,000 | 49,700
OECD number of FTE Research 2007 (Higher Education & Government sector) | 116,600 | 159,100 | 56,200 | 23,800
Number of publishing authors in Scopus | 150,400 | 154,600 | 113,100 | 46,300
Ratio number of authors / number of FTE Research (all sectors) | 0.52 | 0.61 | 1.22 | 0.93
Ratio number of authors / number of FTE Research (Higher Education & Government sector) | 1.29 | 0.97 | 2.01 | 1.95

Table 2: OECD and Scopus based “input” statistics for 4 European countries

Table 2 presents statistics for four countries. Rather than comparing absolute numbers, it is interesting to examine the ratios in the last two rows of the table (reproduced in the sketch below). It is striking that these ratios differ substantially between countries: they are much higher for the Netherlands and Italy than for Germany and the UK. This outcome points first of all towards the need to further validate the Scopus-based numbers of active researchers. On the other hand, it also raises the question of whether the various countries have applied the same definition of FTE research time in their surveys.
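The ratios in the last two rows of Table 2 follow directly from the counts in the first three rows; a minimal sketch (the figures are taken from Table 2 as published):

    # Rounded counts from Table 2.
    data = {
        "Germany":         {"fte_all": 290_800, "fte_he_gov": 116_600, "authors": 150_400},
        "UK":              {"fte_all": 254_600, "fte_he_gov": 159_100, "authors": 154_600},
        "Italy":           {"fte_all":  93_000, "fte_he_gov":  56_200, "authors": 113_100},
        "The Netherlands": {"fte_all":  49_700, "fte_he_gov":  23_800, "authors":  46_300},
    }

    for country, d in data.items():
        ratio_all = d["authors"] / d["fte_all"]        # e.g. Germany: 0.52
        ratio_he_gov = d["authors"] / d["fte_he_gov"]  # e.g. Germany: 1.29
        print(country, round(ratio_all, 2), round(ratio_he_gov, 2))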

Books and journals

Scientific-scholarly books are generally considered important written communication media, especially in the social sciences and humanities. There is an increasing interest in studies of the function and quality of books and book publishers in the various domains of science and human scholarship. Thomson Reuters has launched its Book Citation Index, and the Google Books project aims to digitize millions of books, including many scientific-scholarly ones. Expanding a primarily journal-based citation index with scholarly book sources has two advantages: not only is the set of source publications expanded with relevant sources, but the enormous reservoir of cited references given in journal articles to book items is also used more efficiently.

Citations and full texts

The availability of full text research articles in electronic format gives us the opportunity to conduct textual analyses of an article’s entire content - not just the metadata extracted by indexing databases. Citation contexts can be analyzed linguistically, and sentiment analyses can be conducted to reveal how the citing author appreciates a cited work. Henry Small and Richard Klavans used citation context analysis as an additional tool for the identification of scientific breakthroughs (6). In a forthcoming issue, Research Trends will publish an article on a detailed citation context analysis of one particular journal, focusing on cross-disciplinary citations.
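As a hedged illustration of the first step in such an analysis, the sketch below pulls out the sentences surrounding numeric in-text citation markers; the sentence splitter and marker pattern are our own simplifications, since real citation styles (author-year, superscripts, etc.) vary widely and would each need their own handling.

    import re

    def citation_contexts(full_text):
        # Return the sentences that contain a numeric citation marker such as [3] or (7).
        sentences = re.split(r"(?<=[.!?])\s+", full_text)
        marker = re.compile(r"[\[(]\d{1,3}[\])]")
        return [s for s in sentences if marker.search(s)]

    sample = "Earlier work disagreed [3]. We build on the method of (7). No citation here."
    print(citation_contexts(sample))  # the first two sentences, i.e. the citation contexts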

Concluding remarks

The overview above is not complete, and many important contributions to the analysis of big, compound bibliometric datasets were not mentioned in this paper. But the examples presented above illustrate the theoretical and practical relevance of combining bibliometric, or, more generally, statistical datasets, show how this can be done, and indicate which issues a big, compound, bibliometric dataset enables us to address.

References

1. Lendi, S. & Huggett, S. (2012) “Usage: an alternative way to evaluate research”, Research Trends, No. 28 (https://www.researchtrends.com/issue28-may-2012/usage-an-alternative-way-to-evaluate-research/).
2. Halevi, G. & Moed, H.F. (2012) “Patenting Library Science Research Assets”, Research Trends, No. 27 (https://www.researchtrends.com/issue-27-march-2012/patenting-library-science-research-assets/).
3. Breschi, S. & Lissoni, F. (2004) “Knowledge Networks from Patent Data”. In: Moed, H.F., Glänzel, W. & Schmoch, U. (eds.), Handbook of Quantitative Science and Technology Research: The Use of Publication and Patent Statistics in Studies of S&T Systems. Dordrecht (the Netherlands): Kluwer Academic Publishers, pp. 613-644.
4. Plume, A. (2012) “The evolution of brain drain and its measurement: Part I”, Research Trends, No. 26 (https://www.researchtrends.com/issue26-january-2012/the-evolution-of-brain-drain-and-its-measurement-part-i/).
5. Plume, A. (2012) “The evolution of brain drain and its measurement: Part II”, Research Trends, No. 27 (https://www.researchtrends.com/issue-27-march-2012/the-evolution-of-brain-drain-and-its-measurement-part-ii/).
6. Small, H. & Klavans, R. (2011) “Identifying Scientific Breakthroughs by Combining Co-citation Analysis and Citation Context”. In: Proceedings of the 13th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2011).

 


 

VN:F [1.9.22_1171]
Rating: 0.0/10 (0 votes cast)

Introduction

Due to the increasing importance of scientific research for economic progress and competitiveness, and to new developments in information and communication technologies (ICT), the fields of bibliometrics and research assessment are rapidly developing. A few major trends can be identified:

  • An increase in actual use of bibliometric data and indicators in research assessment;
  • A strong proliferation of bibliometric databases and data-analytical tools; for instance, in the emergence of a range of journal subject classification systems and key words mapping tools;
  • Indicators are becoming more and more sophisticated and fit-to-purpose; new approaches reveal that bibliometrics concerns much more than assessing individuals on the basis of journal impact factors;
  • There is an increasing interest in measuring the effects of the use of bibliometric indicators upon the behavior of researchers, journal editors and publishers;
  • Researchers, research evaluators and policy officials place an emphasis on the societal impact of research, such as its technological value or its contribution to the enlightenment of the general public;
  • Last but not least, more and more projects aim to create and analyze large datasets by combining multiple datasets.

This article deals with the last trend mentioned and focuses on demonstrating which datasets are currently being combined by research groups in the field. It also discusses the aspects and research questions that could be answered using these large datasets. An overview is given in Table 1 below.

 

Combined datasets Studied phenomena Typical research questions
Citation indexes and usage log files of full text publication archives Downloads versus citations; distinct phases in the process of processing scientific information What do downloads of full text articles measure? To what extent do downloads and citations correlate?
Citation indexes and patent databases Linkages between science and technology (the science–technology interface) What is the technological impact of a scientific research finding or field?
Citation indexes and scholarly book indexes The role of books in scholarly communication; research productivity taking scholarly book output into account How important are books in the various scientific disciplines, how do journals and books interrelate, and what are the most important books publishers?
Citation indexes (or publication databases) and OECD national statistics Research input or capacity; evolution of the number of active researchers in a country and the phase of their career How many researchers enter and/or move out of a national research system in a particular year?
Citation indexes and full text article databases The context of citations; sentiment analysis of the scientific-scholarly literature In what ways can one objectively characterize citation contexts? And identify implicit citations to documents or concepts?

Table 1: Compound Big Datasets and their objects of study

Examples

Downloads versus citations

For a definition of “usage” or “downloads” analysis and its context the reader is referred to a previous RT article on this topic (1). Figure 1 relates to journals included in ScienceDirect, Elsevier’s full text article database. For each journal the average citation impact per article was calculated (generated in the third year after publication date), as well as the average number of downloads in full text format per article (carried out in the year of publication of the articles). Journals were grouped into disciplines; the horizontal axis indicates the number of journals in a discipline. In each discipline the Pearson correlation coefficient between a journal’s downloads and its citations was calculated, and plotted on the vertical axis.

Figure 1: Downloads versus citations for journals in ScienceDirect

Figure 1 reveals large differences in the degree of correlation between downloads and citations between disciplines. For instance, in Biochemistry and Molecular Biology the correlation is above 0.9, whereas in Dentistry, Social sciences, Health Professions, Arts and Humanities it is equal to or less than 0.5.

The interpretation of these findings is somewhat unclear. One hypothesis is based on the distinction between authors and readers. In highly specialized subject fields these populations largely overlap, whereas in fields with a more direct societal impact, the readers’ population may consist mainly of professionals or even the general public who do not regularly publish articles. The hypothesis proposes that in the latter type of fields the correlation between downloads and citations is lower than in the first. Additional research, also conducted at the level of individual articles, is needed to further examine this hypothesis.

Patents and scientific articles

Earlier this year, Research Trends also published an article analyzing patent citations to journal articles, in order to measure the technological impact of research (2). The analysis focused on a subject field in the social sciences. It examined the characteristics of research articles published in Library Science journals and the manner by which they are cited in patents.  Library science articles were found to be well cited in patents. The articles cited feature information retrieval and indexing, and information and documents management systems which pertain to electronic and digital libraries development. The citing patents focus on electronic information administration, navigation, and products and services management in commercial systems. Interestingly, the time span between the scientific invention and its use in technology may be up to 10 years. This finding illustrates the time delays one has to take into account when trying to measure technological or societal impact of scientific research. For an overview of this way of using patent citations, see (3).

Scopus author data versus OECD “input” statistics

Scopus, Elsevier’s scientific literature database, containing meta-data of scientific publications published by more than 5,000 publishers in 18,000 titles, has implemented unique features that enable one to obtain an estimate of the number of active – i.e., publishing – authors in a particular year, country, and/or research domain, and also to track the “institutional” career of a researcher, providing information on the institutions in which a researcher has worked during his or her career. Research Trends issues 26 and 27 contained two articles by Andrew Plume presenting a first analysis of migration or brain circulation patterns in the database (4) (5).

Data accuracy and validation is also a relevant issue in this case. One way to validate author data is by comparing outcomes per country with statistics on the number of full time equivalents spent on research in the various institutional sectors, obtained from questionnaires and published by the OECD.

 

Country Germany UK Italy The Netherlands
OECD number of FTE Research 2007 (all sectors) 290,800 254,600 93,000 49,700
OECD number of FTE Research 2007 (Higher Education & Government sector) 116,600 159,100 56,200 23,800
Number of Publishing authors in Scopus 150,400 154,600 113,100 46,300
Ratio number of authors / Number of FTE Research (all Sectors) 0.52 0.61 1.22 0.93
Ratio number of authors / Number of FTE Research (Higher Education & Government sector) 1.29 0.97 2.01 1.95

Table 2: OECD and Scopus based “input” statistics for 4 European countries

Table 2 presents statistics for 4 countries. Rather than comparing absolute numbers, it is interesting to calculate the ratios in the last two rows of the table. It is striking that these ratios differ substantially between countries. They are much higher for the Netherlands and Italy than they are for Germany and UK. This outcome points first of all towards the need to further validate Scopus-based numbers of active researchers. On the other hand, it also raises the question whether the various countries have applied the same definition of FTE research time in their surveys.

Books and journals

Scientific-scholarly books are generally considered as important written communication media, especially in social sciences and humanities. There is an increasing interest in studies of the function and quality of books and book publishers in the various domains of science and human scholarship. Thomson Reuters has launched its Book Citation Index. The Google Books project aims to digitalize millions of books, including many scientific-scholarly ones. Expanding a primarily journal-based citation index with scholarly book sources has two advantages. Not only is the set of source publications expanded with relevant sources, but the enormous reservoir of cited references given in journal articles to book items is used more efficiently.

Citations and full texts

The availability of full text research articles in electronic format gives us the opportunity to conduct textual analyses of all of an article’s content – not just the meta-data extracted by indexing databases. The citation contexts can be analyzed linguistically, and sentiment analyses can be conducted to reveal how the citing author appreciates a cited work. Henry Small and Richard Klavans used citation context analysis as an additional tool for the identification of scientific breakthroughs (6). In one of its next issues Research Trends will publish an article on a detailed citation context analysis in one particular journal focusing on cross-disciplinary citations.

Concluding remarks

The overview above is not complete, and many important contributions to the analysis of big, compound bibliometric datasets were not mentioned in this paper. But the examples presented above illustrate the theoretical and practical relevance of combining bibliometric, or, more generally, statistical datasets, show how this can be done, and indicate which issues a big, compound, bibliometric dataset enables us to address.

References

1. Lendi, S. & Huggett, S. (2012) “Usage: an alternative way to evaluate research”, Research Trends, No. 28 (https://www.researchtrends.com/issue28-may-2012/usage-an-alternative-way-to-evaluate-research/).
2. Halevi, G. & Moed, H.F. (2012) “Patenting Library Science Research Assets”, Research Trends, No. 27 (https://www.researchtrends.com/issue-27-march-2012/patenting-library-science-research-assets/).
3. Breschi, S. & Lissoni, F. (2004) “Knowledge Networks from Patent Data. In: Moed, H.F., Glänzel, W., and Schmoch, U. (eds.). Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems. Dordrecht (the Netherlands): Kluwer Academic Publishers, 613-644.
4. Plume, A. (2012) “The evolution of brain drain and its measurement: Part I”, Research Trends, No. 26 (https://www.researchtrends.com/issue26-january-2012/the-evolution-of-brain-drain-and-its-measurement-part-i/).
5. Plume, A. (2012) “The evolution of brain drain and its measurement: Part II”, Research Trends, No. 27 (https://www.researchtrends.com/issue-27-march-2012/the-evolution-of-brain-drain-and-its-measurement-part-ii/).
6. Small, H. & Klavans, R. (2011) “Identifying Scientific Breakthroughs by Combining Co-citation Analysis and Citation Context”. In: Proceedings of the 13th International Conference of the International Society for Scientometrics and Informetrics (ISSI 2011).

 


 


Part 3: Data analytics & visualization

In this part of the article Kalev Leetaru describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data.

Read more >


In this part of the article Kalev Leetaru describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data. For other parts of this article click on the links here: Summary, Part 1, Part 2.

The growth of world knowledge

Putting this all together, what can all of this data say about Wikipedia’s view of world history? One of the greatest challenges facing historical research in the digital era is the so-called “copyright gap”, in which the majority of available digital documents were published either in the last few decades (born digital) or prior to 1924 (copyright expiration). The vast majority of twentieth-century material has gone out of print, yet is still protected by copyright and thus cannot be digitized. Computational approaches can only examine the digital record, and as scholarship increasingly relies on digital search and analysis methods, this is creating a critical knowledge gap in which far more is known about the literature of the nineteenth century than of the twentieth. As an illustration of how severe this problem has become, one recent analysis of books in Amazon.com’s warehouses found twice as many books from 1850 available as digital reprints as from 1950 (1). It seems plausible that Wikipedia’s contributors rely on digitized historical resources to edit its entries, and that this same effect might therefore manifest itself in Wikipedia’s view of history.

Figure 1 shows the total number of mentions across Wikipedia of dates in each year from 1001 AD to 2011, visualizing its timeline of world history. The date extraction tool used to identify all date mentions works on any date range, but four-digit year mentions are the most accurate: in Wikipedia, four-digit numbers that are not dates are written with commas, which reduces the false-positive rate. Immediately it becomes clear that the copyright gap seen in other collections has not impacted the knowledge contained in Wikipedia’s pages. Instead, there is a steady exponential growth in Wikipedia’s coverage through time, matching intuition about the amount of surviving information about each decade. For the purposes of this study, references to decades and centuries were coded as a reference to the year beginning that time period (“the 1500’s” is coded as the year 1500), which accounts for the majority of the spikes. One can immediately see major events such as the American Civil War and World Wars I and II. Figure 2 shows the same timeline, but using a log scale on the Y axis. Instead of displaying the raw number of mentions each year, a log scale displays exponential growth, making it easier to spot large-scale patterns in how a dataset has expanded over time. In this case, the log graph shows that Wikipedia’s historical knowledge from 1001 AD to 2011 largely falls into four time periods: 1001-1500, 1501-1729, 1730-2003, and 2004-2011. During the first period (roughly corresponding to the Middle Ages), the number of mentions of each year grows slowly and steadily from around 2,200 to around 2,500 mentions per year. This rapidly accelerates to around 6,500 mentions during the second period (corresponding to the Early Modern Period, starting around the late Renaissance), and the growth rate increases once again in the third period (corresponding to the start of the Age of Enlightenment), reaching around 650,000 mentions per year by its end. Finally, the fourth period begins with the rise of Wikipedia itself (the “Wikipedia Era”), with a sudden massive growth rate far in excess of the previous periods.
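
As a rough illustration of that comma heuristic (and of coding “the 1500’s” as the year 1500), the short Python sketch below tallies bare four-digit year mentions while skipping comma-grouped figures; the regular expression and function names are illustrative assumptions, not the project’s actual extraction tool.

import re
from collections import Counter

# Match bare four-digit years while skipping comma-grouped figures such as
# "1,100", which in Wikipedia are almost never dates.
YEAR_RE = re.compile(r"(?<![\d,])\b(1\d{3}|20[01]\d)\b(?!,\d)")

def count_year_mentions(articles):
    # Tally mentions of each year from 1001 to 2011 across article texts.
    counts = Counter()
    for text in articles:
        for match in YEAR_RE.finditer(text):
            year = int(match.group(1))
            if 1001 <= year <= 2011:
                counts[year] += 1
    return counts

sample = ["The war began in 1861 and ended in 1865, costing 1,100 lives.",
          "In the 1500's the region was first mapped."]
print(count_year_mentions(sample))   # counts 1861, 1865 and 1500 once each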

Figure 1: Number of mentions of each year 1001AD-2011 in Wikipedia (Y axis is number of pages)

Figure 2: Number of mentions of each year 1001AD-2011 in Wikipedia (Y axis is log scale of page count to show growth rate)

Figure 3 shows a zoom-in on the period 1950-2011, showing that the initial spike of coverage leading into the Wikipedia Era begins in 2001, the year Wikipedia was first released, followed by three years of fairly level coverage, with the real acceleration beginning in 2004. Equally interesting is the leveling-off that begins in 2008, and the fact that there are nearly equal numbers of mentions of the last three years: 2009, 2010, and 2011. Does this reflect that Wikipedia is stagnating, or has it perhaps finally reached a threshold at which all human knowledge generated each year is now recorded on its pages and there is simply nothing more to record? If the latter were true, this would mean that most edits to Wikipedia today focus on contemporary knowledge, adding in events as they happen, turning Wikipedia into a daybook of modern history.

Figure 3: Number of mentions of each year 1950-2011 in Wikipedia (Y axis is number of pages)

Figure 4 offers an intriguing alternative. It plots the total number of articles in the English-language Wikipedia by year from 2001 to 2011 against the number of mentions of dates from that year. There are nearly as many mentions of 2007 as there were pages in Wikipedia that year (this does not mean every page mentioned that year, since a single page mentioning a year multiple times accounts for multiple entries in this graph). Since 2007, Wikipedia has continued to grow substantially each year, while the number of mentions of each of those years has leveled off. This suggests that Wikipedia’s growth is coming in the form of enhanced coverage of the past, and that it has reached a point where only 1.7-1.9 million new mentions of the current year are added, suggesting that the number of items deemed worthy of inclusion each year has peaked.

Figure 4: Size of Wikipedia versus number of mentions of that year 2001-2011

Of course, the total number of mentions of each year tells only one part of the story.  What was the emotional context of those mentions?  Were the events being described discussed in a more negative or a more positive light?

Figure 5 visualizes how “positive” or “negative” each year was according to Wikipedia (to normalize the raw tonal scores, the Y axis shows the number of standard deviations from the mean, known as the Z-score). Annual tone is calculated through a very simple measure: the tone of every article in Wikipedia is computed, and the tones of all articles mentioning a given year are then averaged (if a year is mentioned multiple times in an article, the article’s tone is counted multiple times towards this average). This is a very coarse measure and does not take into account that a year might be referenced in a positive light in an article that is otherwise highly negative. Instead, this measure captures the macro-level context of a year: at the scale of Wikipedia, if a year is mentioned primarily in negative articles, that suggests something important about that year.
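
A minimal Python sketch of that averaging and normalization step is given below, assuming each article has already been assigned a tone score and a list of the years it mentions (repeated once per mention); the data layout is an assumption made for illustration.

from statistics import mean, pstdev

def yearly_tone_zscores(articles):
    # `articles` is a list of (tone, years_mentioned) pairs; a year appears in
    # the list once per mention, so repeated mentions weight the tone more.
    totals, counts = {}, {}
    for tone, years in articles:
        for year in years:
            totals[year] = totals.get(year, 0.0) + tone
            counts[year] = counts.get(year, 0) + 1
    averages = {y: totals[y] / counts[y] for y in totals}
    mu = mean(averages.values())
    sigma = pstdev(averages.values()) or 1.0   # guard against a zero spread
    return {y: (avg - mu) / sigma for y, avg in averages.items()}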

Figure 5: Average tone of all articles mentioning each year 1001AD-2011 (Y axis is Z-score)

One of the most striking features of Figure 5 is the dramatic shift towards greater negativity between 1499 and 1500. Tone had already become steadily more negative from 1001 AD to 1499, shifting by an entire standard deviation over this period, but there is a sudden sharp shift of one full standard deviation between those two years, with tone remaining more negative until the most recent half-century. The suddenness of this shift suggests it is likely due to an artifact in Wikipedia or in the analysis process, rather than a genuine historical trend such as increasing scholarly questioning of worldly norms during that period. Possibilities include a shift in authorship or writing style, or an expanded historical documentary record that covers a greater class of events. Another striking plunge towards negativity occurs from 1861-1865, reflecting the American Civil War, with similar plunges around World Wars I and II. World War II shows nearly double the negativity of World War I, but just three quarters of that of the Civil War.

Visualizing Wikipedia over time and space

The Figures above show the power of visualizing Wikipedia temporally, but to really understand it as a global daybook of human activity, it is necessary to add the spatial dimension. The primary geographic databases used for looking up location coordinates are limited to roughly the last 200 years, so here the analysis was limited to 1800-present (2). Each location was associated with the closest date reference in the text and vice versa, leading to a spatially and temporally referenced network capturing the locations, and the connections among those locations through time, recorded in Wikipedia’s pages. For every pair of locations in an article with the same associated year, a link was recorded between them. The average tone of all articles mentioning both locations with respect to the same year was used to compute the color of that link, on a scale from bright green (high positivity) through bright red (high negativity). The importance of time and location in Wikipedia results in more than 3,851,063 nodes and 23,672,214 connections across all 212 maps from 1800-2012. The massive number of connections meant that most years simply became an unintelligible mess of crisscrossing links. To reduce the visual clutter, the first animation sequence discarded links that appeared in fewer than 10 articles (see Figure 6), preserving only the strongest links in the data. The second sequence displayed all links but discarded the tonal information and made each edge semi-transparent so that edges blend into one another (see Figure 7). The result is that an isolated link with no surrounding links appears very faint, while many links overlapping on top of each other produce a bright white flare. By focusing purely on the linking structure, this animation shows evolving connections across the world.
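
A minimal Python sketch of that link-construction rule follows; the article data structure and the red-to-green color ramp are assumptions made for illustration rather than the project’s actual code.

from collections import defaultdict

def build_link_network(articles):
    # Each article is assumed to be a dict with a "tone" score and a list of
    # (location, year) pairs extracted from its text.
    link_tones = defaultdict(list)
    for art in articles:
        locations_by_year = defaultdict(set)
        for loc, year in art["locations"]:
            locations_by_year[year].add(loc)
        for year, locs in locations_by_year.items():
            locs = sorted(locs)
            for i in range(len(locs)):
                for j in range(i + 1, len(locs)):
                    link_tones[(year, locs[i], locs[j])].append(art["tone"])
    # A link's tone is the average tone of all articles that contributed to it.
    return {key: sum(v) / len(v) for key, v in link_tones.items()}

def tone_to_rgb(tone):
    # Map a tone in [-1, 1] onto a red (negative) to green (positive) ramp.
    t = max(-1.0, min(1.0, tone))
    return (int(255 * (1 - t) / 2), int(255 * (1 + t) / 2), 0)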

Figure 6: Tone map (see video at  https://www.youtube.com/watch?v=KmCQVIVpzWg)

Figure 7: Intensity map (see video at https://www.youtube.com/watch?v=wzuOcP7oml0)

Interactively browsing Wikipedia through time and space

While animations are an extremely powerful tool for visualizing complex information, they do not allow users to interactively drill into the data to explore interesting trends. Ultimately one would like to convert those static images into an interactive interface for browsing Wikipedia through time and space. As an example, suppose one were interested in everything Wikipedia says about a certain area of Southern Libya in the 1840’s and 1850’s. Wikipedia’s own keyword search interface would not be useful here, as it does not support advanced Boolean searches, only searches for a specific entry. Since the Wikipedia search tool does not understand the geographic and date information contained on its pages, one would have to manually compile a list of the names of every city and location in the area of interest, download a copy of Wikipedia, and write a program to run a massive Boolean search along the lines of “(city1name OR city2name OR city3name OR … ) AND (1841 OR 1842 OR …)”. Obviously such a task would be infeasible for a large area, and highly labor-intensive and error-prone even for small queries. This is a fundamental mismatch in Wikipedia as it exists today: it contains one of the richest open archives of historical knowledge arrayed through time and space, but the only mechanism for interacting with it is a keyword search box that cannot take any of this information into account.
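
To make the scale of that manual effort concrete, the Python snippet below sketches such a brute-force Boolean filter over a local copy of the article texts; the place names and year range in the usage note are purely illustrative.

import re

def matching_articles(articles, place_names, years):
    # `articles` maps titles to full text; the query mirrors the pattern
    # (place1 OR place2 OR ...) AND (year1 OR year2 OR ...).
    place_re = re.compile("|".join(re.escape(p) for p in place_names), re.IGNORECASE)
    year_re = re.compile(r"\b(?:%s)\b" % "|".join(str(y) for y in years))
    return [title for title, text in articles.items()
            if place_re.search(text) and year_re.search(text)]

# e.g. matching_articles(dump, ["Murzuk", "Tajarhi", "Ghat"], range(1840, 1860))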

To prototype what such an interface might look like, all of the information from the animation sequences for Libya 1800 to 2012 described above was extracted and used to create a Google Earth KML file. Figure 8 links to a Google Earth file (3) that offers interactive browsing of Wikipedia’s coverage of Libya over this period. Libya was chosen because it offers a large geographic area with a fair amount of change over time, while still having few enough points that it could easily be loaded in Google Earth. Unfortunately, most geographic mapping tools today support only a small number of points, and Google Earth is one of the few systems that support date-stamped records. Each location in this demo is date-stamped to the year level, so the Google Earth time slider can be used to move through time and see which locations in Libya have been mentioned with respect to different time periods over the last 212 years (note that Google Earth operates at the day level, so even though this data is at the year level, the time slider will show individual days). The display can be narrowed to show only those locations mentioned with respect to a certain timeframe, or one can scroll through the entire 212 years as an animation to see which areas have attracted the attention of Wikipedia’s editors over time. Imagine being able to load up the entire world in this fashion and browse all of Wikipedia’s coverage in time and space!
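
As an indication of how such a file can be produced, the following Python sketch writes a year-stamped KML document of the kind Google Earth’s time slider can read; the placemark fields and example coordinates are assumptions for illustration, not the project’s actual export code.

from xml.sax.saxutils import escape

def write_kml(placemarks, path):
    # `placemarks` is an iterable of (name, lat, lon, year) tuples.
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n')
        for name, lat, lon, year in placemarks:
            f.write("<Placemark><name>%s</name>" % escape(name))
            # Google Earth's time slider reads the TimeStamp element; a bare
            # year is treated as that calendar year.
            f.write("<TimeStamp><when>%d</when></TimeStamp>" % year)
            f.write("<Point><coordinates>%f,%f</coordinates></Point>" % (lon, lat))
            f.write("</Placemark>\n")
        f.write("</Document></kml>\n")

# e.g. write_kml([("Tazirbu", 25.75, 21.0, 1879)], "libya.kml")  # approximate coordinates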

Figure 8: Interactive Google Earth file for Libya (see  http://www.sgi.com/go/wikipedia/LIBYA-1800-2012.KML)

The one-way nature of Wikipedia

The Google Earth demonstration illustrates several limitations of Wikipedia’s reliance on human editors to provide links between articles. For example, the Google Earth display shows mentions of Tajarhi, Libya in 1846 and 1848, reflecting that the entry for that city says slave trade traffic increased through there after Tunisia and Algeria abolished the trade, and it also shows a mention in 1819 reflecting a description of the town that year by the British naval explorer George Lyon (4). The article mentions both Tunisia and Algeria with respect to the slave trade, but those mentions are not links to those countries’ articles. The mention of George Lyon is also problematic: the actual Wikipedia page on his life is titled with his full name, “George Francis Lyon” (5), makes no mention of Tajarhi (only Tripoli and Murzuk), and is not linked from the Tajarhi page, requiring a visitor to manually keyword-search his name. The fact that these mentions of Tunisia, Algeria, and George Lyon have not been made into hyperlinks to their respective pages may at first seem only a small inconvenience. However, a data mining analysis of Wikipedia that looked only at which pages linked to which other pages (one of the most common ways Wikipedia is analyzed) would miss these connections. This illustrates the limitations of using linking data or other metadata to explore a large text corpus, and the importance of examining the content itself.

Along the same lines are Wikipedia’s “Infoboxes”, in which human editors can create a table that appears in the sidebar of an article with important key facts about that article. These are often used as metadata to assign dates and locations to articles in data mining applications. For example, the American Civil War entry (6) has an Infobox with a rich assortment of details, including the locations and dates of the war. However, many articles do not contain such Infoboxes, even when the article focuses on a specific event. For example, the Barasa-Ubaidat War (7), fought between 1860 and 1890 in north-eastern Libya and begun a year before the American Civil War, does not have an Infobox, and the only information on the dates and locations of the conflict appears in the article text itself. The limitations of Infoboxes are something to keep in mind, as many studies and datasets make use of them as a machine-friendly proxy for the factual contents of Wikipedia (8).

Another trend in Wikipedia apparent in this Google Earth display is the tendency for a connection between two people or places to be mentioned in one of their respective entries, but not in the other’s. For example, the entry for Tazirbu, Libya (9) notes that Gerhard Rohlfs was the first European to visit the oasis, in 1879. Rohlfs’ own entry (10), however, notes only that in 1874 he embarked upon a journey to the Kufra basin, in the same Kufra District in which Tazirbu is located, but does not mention Tazirbu itself or his visit there in 1879. The Kufra basin entry (11) notes that Rohlfs reached it in 1879, but again mentions nothing of Tazirbu or other details. The entry for Kufra District (12), in which both are located, mentions only that the name Kufra is a derivation of the Arabic word for a non-Muslim; it cites one of Rohlfs’ books, but only in the references list, and makes no mention of his travels in the text itself. Of course, Wikipedia entries must balance the desire to provide cross-links and updated information against the risk of turning each entry into a sea of links and repeated information. This is one of the areas where Wikipedia’s openness really shines: it opens the door for computer scientists, interface designers, and others to apply data mining algorithms to develop new interfaces to Wikipedia and to find new ways of discovering and displaying these connections transparently.

The ability to display information from across Wikipedia temporally and spatially allows a reader to place a given event in the context of world events of the period. For example, the Google Earth display contains a reference to Tripoli with respect to 1878 (the year prior to Rohlfs’ visit to Tazirbu) that links to the entry for the Italo-Turkish War (13). At first glance this war appears to have no relation to 1878, having occurred in 1911-1912. Yet the opening sentence of the introductory paragraph notes that the origins of this war, in which Italy was eventually awarded the region of modern-day Libya, began with the Congress of Berlin in 1878. Thus, while likely entirely unrelated to Rohlfs’ journey, it provides an additional point of context that can be found simply by connecting all of Wikipedia’s articles together.

Thus, a tremendous amount of information in Wikipedia is one-way: one entry provides information about the connections between other entries, but those entries do not in turn mention this connection. If one were interested in the travels of Gerhard Rohlfs, a natural start would be to pull up his Wikipedia entry. Yet his entry offers only a brief synopsis of his African journey, with no details about the cities he visited. Even Paul Friedrich August Ascherson, who accompanied him on his journey, is not mentioned, while Ascherson’s entry (14) prominently mentions his accompanying Rohlfs. One would have to keyword-search all of Wikipedia for any mention of Rohlfs’ name and then manually read through all of the material and synthesize its information in time and space to fully map out his journey. Using computational analysis, machines can do most of this work, presenting just the final analysis. This is one of the basic applications of data mining unstructured text repositories: converting their masses of words into knowledge graphs that recover these connections. In fact, this is much of what historical research is about: weaving a web of connections among people, places, and activities based on the incomplete and one-way records scattered across a vast archive of material.

The networks of Wikipedia

As a final set of analyses, four network visualizations were constructed to look at the broader structure of connections captured in Wikipedia. Figure 9 shows how category tags are connected through co-occurrence in category-tagged articles. Wikipedia allows contributors to assign metadata tags to each article describing the primary categories relevant to it. In this case, each category tag applied to an article was cross-linked with every other category tag on that article, across the entirety of Wikipedia, resulting in a massive network capturing how categories co-occur. This diagram reveals a central core of categories around which other sub-clusters of categories are tightly connected. Figure 10 shows the network of co-mentions of all person names across Wikipedia: a list of all person names appearing on each page was compiled, and links were formed connecting all person names appearing together in an article. This network shows a very different, far more diffuse structure, with much stronger clustering of small groups of people. Figure 11 shows the same approach applied to names of organizations. It is more similar to the category-tag network, but shows a more complex structure at the core, with clusters of names to which other clusters are tightly connected. Finally, Figure 12 shows the network of co-mentions of years across Wikipedia. This network illustrates that the closer a year is to the present, the more Wikipedia content revolves around it, reflecting the fact that entries across Wikipedia tend to be updated with new information and events from the current year, which draws connections between those earlier years and the present.
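
The category network in Figure 9 rests on a simple co-occurrence count; a minimal Python sketch of that step, assuming a mapping from article titles to their category tags, might look as follows.

from collections import Counter
from itertools import combinations

def category_cooccurrence(pages):
    # `pages` maps article titles to their lists of category tags; every pair
    # of tags on the same article adds one co-occurrence to the network.
    edges = Counter()
    for tags in pages.values():
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges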

Figure 9: Network of co-occurrences of category tags across Wikipedia

Figure 10: Network of co-occurrences of person names across Wikipedia

Figure 11: Network of co-occurrences of organization names across Wikipedia

Figure 12: Network of co-occurrences of years across Wikipedia

Conclusions

This study has surveyed the current landscape of the Big Data Humanities, Arts, and Social Sciences (HASS) disciplines and introduced the workflows, challenges, and opportunities of this emerging field.  As emerging HASS scholarship increasingly moves towards data-driven computationally-assisted exploration, new analytical mindsets are developing around whole-corpus data mining, data movement, and metadata construction.  Interactive exploration, visualization, and ad-hoc hypothesis testing play key roles in this new form of analysis, placing unique requirements on the underlying data storage and computation approaches. An exploration of Wikipedia illustrates all of these components operating together to visualize Wikipedia’s view of world history over the last two centuries through the lens of space, time, and emotion.

Acknowledgements

The author wishes to thank Silicon Graphics International (SGI) for providing access to one of their UV2000 supercomputers to support this project.

Summary
Part 1: Background

In part 1 of this article, the author describes the project background, purpose and some of the challenges of data collection.

Part 2: Data processing and Analytical methodologies

The methods by which the Wikipedia data was stored, processed, and analyzed are presented in this part of the article.

References and Useful Links

1. http://www.theatlantic.com/technology/archive/2012/03/the-missing-20th-century-how-copyright-protection-makes-books-vanish/255282/
2. Leetaru, Kalev. (forthcoming).  Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.  D-Lib Magazine.
3. Requires a free download of Google Earth http://www.google.com/earth/index.html
4. http://en.wikipedia.org/wiki/Tajarhi
5. http://en.wikipedia.org/wiki/George_Francis_Lyon
6. http://en.wikipedia.org/wiki/American_Civil_War
7. http://en.wikipedia.org/wiki/Barasa%E2%80%93Ubaidat_War
8. http://www.infochimps.com/collections/wikipedia-infoboxes
9. http://en.wikipedia.org/wiki/Tazirbu
10. http://en.wikipedia.org/wiki/Friedrich_Gerhard_Rohlfs
11. http://en.wikipedia.org/wiki/Kufra
12. http://en.wikipedia.org/wiki/Kufra_District
13. http://en.wikipedia.org/wiki/Italo-Turkish_War
14. http://en.wikipedia.org/wiki/Paul_Friedrich_August_Ascherson

Part 2: Data processing and Analytical methodologies

In this part of the article Kalev Leetaru describes the methods by which the Wikipedia data was stored, processed, and analyzed.

Read more >


In this part of the article Kalev Leetaru describes the methods by which the Wikipedia data was stored, processed, and analyzed. For other parts of this article click on the links here: Summary, Part 1, Part 3.

Storing the data for processing

Once the data arrives, it must be processed into a format that can be read by the analysis tools. Many collections are stored in proprietary or discipline-specific formats, requiring preparation and data reformatting stages. One large digital book archive arrives as two million ZIP files containing 750 million individual ASCII files, one for each page of each book in the archive. Few computer file systems can handle that many tiny files, and most analysis software expects to see each book as a single file. Thus, before any analysis can begin, each of these ZIP files must be uncompressed and the individual page files reformatted into a single ASCII or XML file per book. Other common delivery formats include PDF, EPUB, and DjVu, requiring similar preprocessing stages to extract the text layers. While XML is becoming a growing standard for the distribution of text content, the XML standard defines only how a file is structured, leaving individual vendors to decide on the specific encoding scheme they prefer. Thus, even when an archive is distributed as a single XML file, preprocessing tools are needed to extract the fields of interest. In the case of Wikipedia, the complete four-million-entry archive is available as a single XML file for download directly from the website and uses a fairly simple XML schema, making it easy to extract the text of each entry.
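
As a minimal illustration of that reformatting step, the Python sketch below collapses a single ZIP of per-page ASCII files into one text file per book; the page-file naming scheme is an assumption made for the example.

import zipfile
from pathlib import Path

def rebuild_book(zip_path, out_dir):
    # Collapse one ZIP of per-page ASCII files into a single text file per book.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Assumes the page files sort correctly by name, e.g. page_0001.txt ...
        pages = sorted(n for n in zf.namelist() if n.endswith(".txt"))
        book_text = "\n".join(zf.read(n).decode("utf-8", errors="replace")
                              for n in pages)
    out_path = out_dir / (Path(zip_path).stem + ".txt")
    out_path.write_text(book_text, encoding="utf-8")
    return out_path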

As the fields of interest are extracted from the source data, they must be stored in a format amenable to data analysis. In cases where only one or two software packages will be used for the analysis, the data can simply be converted into a file format they support. If multiple software packages will be used, it may make more sense to convert the data to an intermediate representation that can easily be converted to and from the other formats on demand. Relational database servers offer features such as indexes and specialized algorithms designed for datasets too large to fit into memory, enabling high-speed searching, browsing, and basic analysis of even very large collections, and many filters are available to convert to and from major file formats. Some servers, like the free edition of MySQL (1), are highly scalable yet extremely lightweight and can run on any Linux or Windows server. Alternatively, if it is not possible to run a database server, a simple XML format can be developed that includes only the fields of interest, or specialized formats such as packed data structures can be used to allow rapid randomized retrieval from the file. In the case of the Wikipedia project, a MySQL database was used to store the data, which was then exported to a special packed XML format designed for maximum processing efficiency during the large computation phases.
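
A minimal sketch of such an intermediate store is shown below, using Python’s built-in sqlite3 module as a lightweight stand-in for the MySQL server used in the actual project; the schema is illustrative only.

import sqlite3

def load_entries(db_path, entries):
    # `entries` is an iterable of (title, text) pairs extracted from the dump.
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS article (
                        id INTEGER PRIMARY KEY, title TEXT, body TEXT)""")
    conn.executemany("INSERT INTO article (title, body) VALUES (?, ?)", entries)
    conn.commit()
    conn.close()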

From words to connections: transforming a text archive into a knowledge base

Documents are inherently large collections of words, but to a computer each word holds the same meaning and importance as every other word, limiting the types of patterns that can be explored in an archive to simple word frequencies. The creation of higher-order representations that capture specific dimensions of that information – recognizing words indicating space, time, and emotion – allows automated analyses to move closer towards studying patterns in the actual meaning and focus of those documents. The first generation of Big Data analysis focused largely on examining such indicators in isolation, plotting the tone of discussion of a topic over time or mapping locations and listing the persons mentioned in that coverage. Connections among indicators have largely been ignored, primarily because the incredible richness of human text leads to networks of interconnections that can easily reach hundreds of trillions of links from relatively small collections. Yet historical research tends to revolve around these very connections and the interplay they capture between people, places, and dates and the actions and events that relate them. Thus, the grand challenge questions driving the second generation of Big Data research tend to revolve around weaving together the myriad connections scattered across an archive into a single cohesive network capturing how every piece of information fits into the global picture. This in turn is driving an increasing focus on connections and the enormous theoretical and computational challenges that accompany them. In the case of Wikipedia, mapping mentions of locations and creating timelines of date mentions and tone in isolation can be enlightening, but the real insight comes from coupling those dimensions, exploring how tone diffuses over space through time.

Thus, once a data archive has been assembled, the first stage of the analytical pipeline usually begins with the construction of new metadata layers over the data. This typically involves using various data mining algorithms to extract key pieces of information, such as names or locations, or to calculate various characteristics of the text, such as readability scores or emotion. The results of these algorithms are then saved as metadata layers to be used for subsequent access and analysis of the text. To explore Wikipedia’s view of world history, for example, data mining algorithms were needed to translate its large unstructured text corpus into a structured knowledge base. Each study uses a different set of data mining algorithms aimed at its specific needs, but location in particular is an emerging class of metadata that is gaining traction as a way of understanding information in a new light. Culturomics 2.0 (2) found that location was the single most prominent organizing dimension in a three-decade archive of more than 100 million print and broadcast news reports translated from vernacular languages across nearly every country in the world, appearing on average every 200-300 words. In the case of Wikipedia, previous studies of the linking structure of its pages have found that time and space form the two central dimensions around which the entire site is organized (3). Thus, for the metadata construction stage of the Wikipedia project, a fulltext geocoding algorithm was applied to all of the articles to automatically identify, disambiguate, and convert all textual geographic references to approximate mappable coordinates (4). This resulted in a new XML metadata layer recording every mention of a location in the text of each article, along with the corresponding latitude and longitude for mapping. For example, a reference to “Georgian authorities” would use the surrounding document text to determine whether it referred to the country in Europe or the US state, while a mention of “Cairo” would be disambiguated to establish whether it referred to the capital of Egypt or the small town in the state of Illinois in the US. A similar algorithm was used to identify mentions of dates. Each location was ultimately resolved to a centroid set of geographic coordinates that could be placed on a map, while each date was resolved to its corresponding year.
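
The following toy Python sketch illustrates the disambiguation idea with a two-entry gazetteer; the real geocoder described in (4) works against a gazetteer of millions of place names, so the candidates and context cues here are invented purely for the example.

# A two-entry gazetteer; real gazetteers contain millions of place names, and
# these candidates and context cues are invented purely for the example.
GAZETTEER = {
    "Cairo": [
        {"coords": (30.04, 31.24), "cues": ("Egypt", "Nile")},
        {"coords": (37.01, -89.18), "cues": ("Illinois", "United States")},
    ],
}

def resolve_place(name, context_text):
    # Pick the candidate whose contextual cue words appear in the surrounding text.
    best, best_score = None, -1
    for candidate in GAZETTEER.get(name, []):
        score = sum(cue in context_text for cue in candidate["cues"])
        if score > best_score:
            best, best_score = candidate, score
    return best["coords"] if best else None

# resolve_place("Cairo", "a small town in the state of Illinois") -> (37.01, -89.18)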

Wikipedia provides a facility for article contributors to manually annotate articles with mappable geographic coordinates. In fact, content enriched with various forms of metadata, such as the Text Encoding Initiative (TEI) (5), is becoming more commonplace in many archives. The US Department of State has annotated its historical Foreign Relations of the United States collection with inline TEI tags denoting mentions of person names, dates, and locations (6). However, only selected mentions are annotated, such as pivotal political figures, rather than every person mentioned in each document. This can lead to incomplete or even misleading results when relying on collection-provided metadata. In the case of Wikipedia, the human-provided geographic tags focus primarily on Europe and the Eastern United States, leading to a long history of academic papers that have relied on this metadata to erroneously conclude that Wikipedia is US- and European-centric. When switching to the content-based spatial data extracted by the fulltext geocoder, it becomes clear that Wikipedia’s coverage is actually quite even across the world, matching population centers (7). As an example of the vast richness obtained by moving from metadata to full text, the four million English Wikipedia articles contain 80,674,980 locations and 42,443,169 dates. An average article references 19 locations and 11 dates, with a location every 44 words and a date every 75 words on average. As one example, the History section of the entry on the Golden Retriever dog breed (8) lists 21 locations and 18 dates in 605 words, an average of a location every 29 words and a date every 34 words. This reflects the critical role of time and location in situating the narratives of encyclopedias.

Sentiment mining was also used to calculate the “tone” of each article on a 200-point scale from extremely negative to extremely positive. There are thousands of dictionaries available today for calculating everything from positive-negative to anxious-calm and fearful-confident (9). All dictionaries operate on a similar principle: a set of words representing the emotion in question is compiled into a list, and the document text is compared against this list to measure the prevalence of those words in the text. A document with words such as “awful”, “horrific” and “terrible” is likely to be perceived by a typical reader as more negative than one using words such as “wonderful”, “lovely”, and “fantastic”. Thus, by measuring what percentage of the document’s words are found in the positive dictionary, what percentage are found in the negative dictionary, and then subtracting the two, a rough estimate of the tonality of the text can be achieved. While quite primitive, such approaches can achieve fairly high accuracy at scale.
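
The dictionary approach can be sketched in a few lines of Perl. The word lists below are illustrative stand-ins for the much larger dictionaries actually used, and the scoring function simply returns the percentage of positive words minus the percentage of negative words:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy positive and negative dictionaries; real dictionaries contain
    # thousands of entries (see reference 9).
    my %positive = map { $_ => 1 } qw(wonderful lovely fantastic good great);
    my %negative = map { $_ => 1 } qw(awful horrific terrible bad dreadful);

    sub tone {
        my ($text) = @_;
        my @words = ($text =~ /([a-z']+)/gi);
        return 0 unless @words;
        my $pos = grep { $positive{ lc $_ } } @words;
        my $neg = grep { $negative{ lc $_ } } @words;
        # Tone = percentage of positive words minus percentage of negative words,
        # giving a score between -100 and +100.
        return 100 * ($pos - $neg) / scalar @words;
    }

    printf "%.2f\n", tone("The weather was lovely but the traffic was awful and terrible.");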

Computational resources

All of these dimensions must be brought together into an interconnected network of knowledge. To enable this research, SGI made available one of its UV2 supercomputers, with 4,000 processing cores and scalable to 64 terabytes of cache-coherent shared memory. This machine runs a standard Linux operating system across all 4,000 cores, meaning it appears to an end user as essentially a single massive desktop computer and can run any off-the-shelf Linux application unmodified across the entire machine. This is very different from a traditional cluster, which might have 4,000 cores, but spread across hundreds of separate physical computers, each running its own operating system and unable to share memory and other resources. This allowed the project to use a rapid prototyping approach to software development, supporting near-real-time, interactive, ad hoc exploration.
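
Because the machine presents itself as one ordinary Linux system, plain process-level parallelism is enough to spread independent per-article work across its cores. A minimal sketch, assuming the widely used Parallel::ForkManager CPAN module and a hypothetical directory of per-article text files:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Number of worker processes; on a large shared-memory machine this can be
    # scaled up towards the number of available cores.
    my $pm = Parallel::ForkManager->new(64);

    my @files = glob('articles/*.txt');   # hypothetical input layout

    for my $file (@files) {
        $pm->start and next;              # parent: move on to the next file
        # Child: run whatever per-article analysis is needed, writing results
        # to a shared filesystem (e.g. a RAM disk) for later merging.
        process_article($file);
        $pm->finish;
    }
    $pm->wait_all_children;

    sub process_article {
        my ($file) = @_;
        # placeholder for the actual metadata extraction
    }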

All of the metadata extraction, network compilation, workflows, and analysis were done using the Perl (10) programming language and the GraphViz (11) network visualization package. Perl is one of the few programming languages designed from the ground up for the processing and manipulation of text, especially efficiently extracting information based on complex patterns. One of the greatest benefits of Perl is that it offers many high-level primitives and constructs for working with text patterns, and as a scripting language it hides the memory management and other complexities of compiled languages. Often the greatest cost of a research project is the human time it takes to write a new tool or run an analysis, and the ad hoc, exploratory nature of much Big Data analysis means that an analyst is often testing a large number of ideas where the focus is simply on seeing what the results look like, not on computational efficiency.

For example, to generate the final network map visualizations, a set of Perl scripts was written to rapidly construct the networks using different parameters to find the best final results in terms of coloration, alpha blending, inclusion thresholds, and other criteria. A script using regular expressions and a hash table was used to extract and store an 800 gigabyte graph entirely in memory, with the program taking less than 10 minutes to write and less than 20 minutes to run. Thus, in less than half an hour, a wide array of parameter adjustments and algorithm tweaks could be tested, focusing on the underlying research questions, not the programming implementation. The shared-memory model of the UV2 meant the standard Linux GraphViz package, designed for desktop use, could be used without any modifications to render the final networks, scaling to hundreds of gigabytes of memory as needed. Finally, three terabytes of the machine’s memory were carved off to create a RAM disk, which is essentially a filesystem that exists entirely in system memory. While such filesystems are temporary, in that they are lost if the machine is powered down, their read/write performance is limited only by the speed of computer memory and is over 1,000 times faster than even traditional solid state disk. In this project, the use of a RAM disk meant that all 4,000 processor cores could read and write the same set of common files in non-linear order with little to no delay, whereas a traditional magnetic disk system would support only a fraction of this storage load.
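
The following sketch illustrates the regular-expression-and-hash-table pattern described above; the edge-list input format, filenames, and weight threshold are assumptions for illustration, not the project’s actual scripts. Co-occurring entities are accumulated in an in-memory hash and then written out as a DOT file that an unmodified GraphViz tool can render:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Accumulate edge weights in a hash table held entirely in memory;
    # on a large shared-memory machine this structure can grow very large.
    my %edge;
    open my $in, '<', 'cooccurrences.txt' or die $!;   # hypothetical input
    while (my $line = <$in>) {
        # Expect lines like:  "Cairo<TAB>London<TAB>1923"
        next unless $line =~ /^([^\t]+)\t([^\t]+)\t\d+/;
        $edge{"$1\t$2"}++;
    }
    close $in;

    # Write a GraphViz DOT file, keeping only edges above a weight threshold.
    my $threshold = 5;
    open my $out, '>', 'network.dot' or die $!;
    print {$out} "graph wikipedia {\n";
    for my $pair (keys %edge) {
        next if $edge{$pair} < $threshold;
        my ($a, $b) = split /\t/, $pair;
        print {$out} "  \"$a\" -- \"$b\" [weight=$edge{$pair}];\n";
    }
    print {$out} "}\n";
    close $out;

    # The DOT file can then be rendered with an unmodified GraphViz tool, e.g.:
    #   sfdp -Tpng network.dot -o network.png

Re-running such a script with a different threshold or coloring scheme takes minutes, which is what makes the iterative, parameter-tweaking workflow described above practical.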

Summary
Part 1: Background

In part 1 of this article, the author describes the project background, purpose and some of the challenges of data collection.

Part 3: Data analytics and Visualization

In part 3 of this article, the author describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data.

References

1. http://www.mysql.com/
2. Leetaru, K. (2011). “Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space.” First Monday, 16(9). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3663/3040
3. Bellomi, F. & Bonato, R. (2005). “Network Analysis for Wikipedia.” Proceedings of Wikimania.
4. Leetaru, K. (forthcoming). “Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.” D-Lib Magazine.
5. http://www.tei-c.org/index.xml
6. http://history.state.gov/historicaldocuments
7. Leetaru, K. (forthcoming). “Fulltext Geocoding Versus Spatial Metadata For Large Text Archives: Towards a Geographically Enriched Wikipedia.” D-Lib Magazine.
8. http://en.wikipedia.org/wiki/Golden_Retriever
9. Leetaru, K. (2011). “Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space.” First Monday, 16(9). http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3663/3040
10. http://www.perl.org/
11. http://www.graphviz.org/


Part 1: Background

This part of the article describes the project background, purpose and some of the challenges of data collection.

Read more >


This part of the article describes the project background, purpose and some of the challenges of data collection. For other parts of this article click on the links here: Summary, Part 2, Part 3.

A Big Data exploration of Wikipedia

The introduction of massive digitized and born-digital text archives, together with the emerging algorithms, computational methods, and computing platforms capable of exploring them, has revolutionized the Humanities, Arts, and Social Sciences (HASS) disciplines over the past decade. These days, scholars are able to explore historical patterns of human society across billions of book pages dating back more than three centuries, or to watch the pulse of contemporary civilization moment by moment through hundreds of millions of microblog posts, with a click of a mouse. The scale of these datasets and the methods used to analyze them have led to a new emphasis on interactive exploration, “test[ing] different assumptions, different datasets, and different algorithms … figur[ing] out whether you’re asking the right questions, and … pursuing intriguing possibilities that you’d otherwise have to drop for lack of time.”(1) Data scholars leverage off-the-shelf tools and plug-and-play data pipelines to rapidly and iteratively test new ideas and search for patterns, letting the data “speak for itself.” They are also increasingly becoming cross-trained experts capable of rapid ad hoc computing, analysis, and synthesis. At Facebook, “on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of [those] analyses to other members of the organization.”(1)

The classic image of the solitary scholar spending a professional lifetime examining the most nuanced details of a small collection of works is slowly giving way to the collaborative researcher exploring large-scale patterns across millions or even billions of works.  A driving force of this new approach to scholarship is the concept of whole-corpus analysis, in which data mining tools are applied to every work in a collection.  This is in contrast to the historical model of a researcher searching for specific works and analyzing only the trends found in that small set of documents.  There are two reasons for this shift towards larger-scale analysis: more complex topics being explored and the need for baseline indicators.  Advances in computing power have made it possible to move beyond the simple keyword searches of early research to more complex topics, but this requires more complex search mechanisms.  To study topical patterns in how books of the nineteenth century described “The West” using a traditional keyword search, one would have to compile a list of every city and landmark in the Western United States and construct a massive Boolean “OR” statement potentially including several million terms.  Geographic terms are often ambiguous (“Washington” can refer both to the state on the West coast and the US capital on the East coast; 40% of US locations share their name with another location elsewhere in the US) and so in addition to being impractical, the resulting queries would have a very high false-positive rate. Instead, algorithms can be applied to identify and disambiguate each geographic location in each document, annotating the text with their approximate locations, allowing native geographic search of the text.

The creation of baselines has also been a strong factor in driving whole-corpus analysis. Search for the raw number of mentions by year of nearly any keyword in a digitized book collection covering 1800-1900 and the resulting graph will likely show a strong increase in the use of the term over that century. The problem with this measure is that the number of digitized books published in each year is not constant: it increases at a linear to exponential rate depending on the book collection. This means that nearly any word will show a significant increase in the total number of raw mentions simply because the universe of text has increased. To compensate for this, measurement tools like the Google Books Ngram Viewer (2) calculate a word’s popularity each year not as the absolute number of mentions, but rather as the percentage of all words published that year. This effectively measures the “rate” at which a word is used, essentially normalizing away the impact of the increasing number of books each year. Yet, to do this, Google had to compute the total list of all unique words in all books published in each year, creating a whole-corpus baseline. Similarly, when calculating shifts in the “tone” towards a topic or its spatial association, corpus baselines are needed to determine whether the observed changes are specifically associated with that topic, or whether they merely reflect corpus-wide trends over that period.
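
The normalization itself is a single division per year: the raw number of mentions divided by the total number of words published that year. A small Perl sketch, with made-up counts standing in for a real corpus baseline:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical whole-corpus baseline: total words digitized per year.
    my %total_words = ( 1800 => 1_200_000, 1850 => 9_800_000, 1900 => 61_000_000 );

    # Hypothetical raw mention counts of a keyword in the same years.
    my %mentions = ( 1800 => 14, 1850 => 110, 1900 => 690 );

    # Raw counts rise simply because the corpus grows; the normalized rate
    # (here, mentions per million words) is what should be compared across years.
    for my $year (sort keys %mentions) {
        my $rate = 1_000_000 * $mentions{$year} / $total_words{$year};
        printf "%d  raw=%d  per-million-words=%.2f\n",
            $year, $mentions{$year}, $rate;
    }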

Into this emerging world of Big Data HASS scholarship came a collaboration with supercomputing company Silicon Graphics International (SGI), which made available its new 4,000-core, 64-terabyte shared-memory UV2 supercomputer to apply this interactive exploration approach to telling the story of Wikipedia’s chronicle of world history. Launched a little over a decade ago, Wikipedia has become an almost indispensable part of daily life, housing 22 million articles across 285 languages that are accessed more than 2.7 billion times a month from the United States alone. Today Alexa ranks it the 6th most popular site on the entire web, and it has become one of the largest general web-based reference works in existence (3). It is also unique among encyclopedias in that, in addition to being a community product of millions of contributors, Wikipedia actively encourages the downloading of its complete contents for data mining. In fact, it even has a dedicated download site containing the complete contents of the site in XML format, ready for computer processing (4). This openness has made it one of the most widely-used data sources for data mining, with Google Scholar returning more than 400,000 articles either studying or referencing Wikipedia.

As an encyclopedia, Wikipedia is essentially a massive historical daybook cataloging global activity through history arrayed by date and location.  Yet, most of the literature on Wikipedia thus far has focused on its topical knowledge, examining the linking structure of Wikipedia (which pages link to which other pages and what category tags are applied where) or studied a small number of entries intensively (5). Few studies have explored the historical record captured on Wikipedia’s pages.  In fact, one of the few previous studies to explore Wikipedia as a historical record visualized just 14,000 events cross-linked from entries that had been manually tagged by human contributors with both date and geographic location information (6). No study has delved into the contents of the pages themselves and looked at every location and every date mentioned across all four million English-language entries and the picture of history they yield from the collective views of the millions of contributors that have built Wikipedia over the past decade.

The Big Data workflow: acquiring the data

The notion of exploring Wikipedia’s view of history is a classic Big Data application: an open-ended exploration of “what’s interesting” in a large data collection, leveraging massive computing resources. While quite small in comparison to the hundreds-of-terabytes datasets that are becoming increasingly common in the Big Data realm of corporations and governments, the underlying question explored in this Wikipedia study is quite similar: finding overarching patterns in a large collection of unstructured text, learning new things about the world from those patterns, and doing all of this rapidly, interactively, and with minimal human investment.

As the name suggests, all Big Data projects begin with the selection and acquisition of data. In the HASS disciplines the data acquisition process can involve months of searching, license negotiations with data vendors, and elaborate preparations for data transfer. Data collections at these scales are often too large to simply download over the network (some collections can total hundreds of terabytes or even petabytes) and so historically have been shipped on USB drives. While most collections fit onto just one or two drives, the largest collections can require tens, hundreds, or even thousands of high-capacity USB drives or tape cartridges. Some collections are simply too large to move or may involve complex licensing restrictions that prevent them from being copied en masse. To address this, some data vendors are beginning to offer small local clusters housed at their facilities, where researchers can apply for an allocation to run their data mining algorithms on the vendor’s own cluster and retrieve just the analytical results, sidestepping the data movement problem entirely.

In some cases it is possible to leverage the high-speed research networks that connect many academic institutions to download smaller collections over the network. Some services require specialized file transfer software that may use network ports blocked by campus firewalls, or may require that the receiving machine install specialized software or obtain security certificates, which can be difficult at many institutions. Web-based APIs that allow files to be downloaded via standard authenticated web requests are more flexible and are supported on most academic computing resources. Such APIs also allow for nearly unlimited data transfer parallelism: most archives consist of massive numbers of small documents, so transfers can be parallelized simply by requesting multiple documents at once. Not all web-based APIs are well-suited for bulk transfers, however. Some APIs only allow documents to be requested a page at a time, requiring 600 individual requests to separately download each page of a single 600-page book. At the very minimum, APIs must allow the retrieval of an entire work at a time as a single file, either as a plain ASCII file with page-break characters indicating page boundaries (where applicable) or in XML format. Applications used to manage the downloading workflow must be capable of automatically restarting where they left off, since the downloading process can often take days or even weeks and can frequently be interrupted by network outages and hardware failures. The most flexible APIs allow an application to query the master inventory of all works, selecting only those works matching certain criteria (or a list of all documents), and to download a machine-friendly CSV or XML output that includes a direct link to download each document. Data mining tools are often developed for use on just one language, so a project might wish to download only English-language works, for example.
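
As a sketch of such a restartable downloading workflow (the inventory file, its column layout, and the download URLs are entirely hypothetical; LWP::UserAgent is the standard Perl HTTP client), the key idea is simply to skip anything already on disk so that an interrupted job can be relaunched without losing work:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 60 );

    mkdir 'works' unless -d 'works';

    # Inventory file: one work per line, "identifier,download_url" (hypothetical format).
    open my $inv, '<', 'inventory.csv' or die $!;
    while (my $line = <$inv>) {
        chomp $line;
        my ($id, $url) = split /,/, $line, 2;
        next unless $id && $url;

        my $outfile = "works/$id.xml";
        # Restartability: skip anything already downloaded in a previous run.
        next if -s $outfile;

        my $response = $ua->get($url);
        if ($response->is_success) {
            open my $out, '>', $outfile or die $!;
            print {$out} $response->decoded_content;
            close $out;
        } else {
            warn "FAILED $id: " . $response->status_line . " (will retry on next run)\n";
        }
    }
    close $inv;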

Many emerging projects perform data mining on the full textual content of each work, and thus require access to the Optical Character Recognition (OCR) output (7). However, handwritten works, works scanned from poor-quality originals (such as heavily-scratched service microform), or works that make use of Fraktur or other specialized fonts are highly resistant to OCR and thus normally do not yield usable OCR output. Some archives OCR every document and include the output as-is, leading to 10 MB files of random garbage characters, while others filter poor-quality OCR through an automated or manual review process. Those that exclude poor-quality OCR should indicate, through a metadata flag or other means, that the OCR file has been deliberately excluded for this work. Otherwise, it is difficult for automated downloading tools to distinguish between a work where the OCR file has been deliberately left out and a technical error that prevented the file from being downloaded (and thus should be requeued to try again). For those documents that include OCR content, archives should include as much metadata as possible on the specific organization that scanned the work, the library it was scanned from, the scanning software and imaging system, and the specific OCR software and version used. This information can often be used to incorporate domain knowledge about scanning practices or imaging and OCR pipeline nuances to optimize or enhance the processing of the resultant text.

Yet, perhaps the greatest challenge in the data acquisition process is policy-based rather than technical.  Unlike copyright status, for which there are clear guidelines in determining whether a work has entered the public domain (at least in the United States), there are no national policies or recommendations on what content should be made available for data mining.  In some cases archives may have received data from a commercial vendor or other source that may permit browsing, but not computational analysis.  In others, funding sources or institutional policy may permit data mining only by researchers at the home institution, or grant them exclusive early access.  Some archives permit unrestricted data mining on some content and only “non-consumptive” analysis of other material.  Yet, despite this varied landscape of access, few archives have written policies regarding data mining or clear guidelines on what material is available for analysis.  Most critically, however, while many archives include a flag for each work indicating whether it has entered public domain, no major archive today has a similar flag to indicate whether a work is available for data mining and under what restrictions.  This can cause long delays as archives must evaluate which material can be data mined, in some cases having to create policies and manually review content first.  As data mining becomes more commonplace, it is hoped that new national and international guidelines will be formed to help standardize the determination process and that archives will begin to include item-level metadata that indicates the availability of an item for data mining to vastly simplify this process.

Summary
Part 2: Data processing and Analytical methodologies

In part 2 of this article, the author describes the data processing and analytical methodologies applied to the Wikipedia content.

Part 3: Data analytics and Visualization

In part 3 of this article, the author describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data.

References

1. Loukides, M. (2010). “What is Data Science?” http://radar.oreilly.com/2010/06/what-is-data-science.html
2. Google Books Ngram Viewer. (online). http://books.google.com/ngrams/
3. Wikipedia. (online). http://en.wikipedia.org/wiki/Wikipedia
4. Wikipedia: Database download. (online). http://en.wikipedia.org/wiki/Wikipedia:Database_download
5. Giles, J. (2005). “Special Report: Internet encyclopedias go head to head.” Nature. http://www.nature.com/nature/journal/v438/n7070/full/438900a.html
6. Lloyd, G. (2011). “A history of the world in 100 seconds.” Ragtag.info. http://www.ragtag.info/2011/feb/2/history-world-100-seconds/
7. Leetaru, K. (2011). “Data Mining Methods for the Content Analyst: An Introduction to the Computational Analysis of Informational Content.” Routledge.


A Big Data Approach to the Humanities, Arts, and Social Sciences: Wikipedia’s View of the World through Supercomputing

Kalev Leetaru shares an innovative way to analyze Wikipedia’s view of world history using a Big Data approach to historical research.

Read more >


Summary

Wikipedia’s view of world history is explored and visualized through spatial, temporal, and emotional data mining, using a Big Data approach to historical research. Unlike previous studies, which have looked only at Wikipedia’s metadata, this study focuses on the complete fulltext of all four million English-language entries to identify every mention of a location and date across every entry, automatically disambiguating and converting each location to an approximate geographic coordinate for mapping and every date to a numeric year. More than 80 million locations and 42 million dates between 1000 AD and 2012 are extracted, averaging 19 locations and 11 dates per article. Wikipedia is seen to have four periods of growth over the past millennium: 1001-1500 (Middle Ages), 1501-1729 (Early Modern Period), 1730-2003 (Age of Enlightenment), and 2004-2011 (Wikipedia Era). Since 2007, Wikipedia has hit a limit of around 1.7-1.9 million new mentions of each year, with the majority of its growth coming in the form of enhanced historical coverage rather than increasing documentation of the present. Two animation sequences visualize Wikipedia’s view of the world over the past two centuries, while an interactive Google Earth display allows browsing of Wikipedia’s knowledgebase in time and space. The one-way nature of connections in Wikipedia, the lack of links, and the uneven distribution of Infoboxes all point to the limitations of metadata-based data mining of collections such as Wikipedia, and to the ability of fulltext analysis, and spatial and temporal analysis in particular, to overcome these limitations. Along the way, the underlying challenges and opportunities facing Big Data analysis in the Humanities, Arts, and Social Sciences (HASS) disciplines are explored, including computational approaches, the data acquisition workflow, data storage, metadata construction, and translating text into knowledge.

Part 1: Background

This part of the article describes the project background, purpose and some of the challenges of data collection.

Part 2: Data processing and Analytical methodologies

The methods by which the Wikipedia data was stored, processed, and analysed are presented in this part of the article.

Part 3: Data analytics and Visualization

This part of the article describes the analytical methodologies and visualization of knowledge extracted from the Wikipedia data.

 

