Articles

Research Trends is an online magazine providing objective insights into scientific trends based on bibliometric analyses.

Computational & Data Science, Infrastructure, & Interdisciplinary Research on University Campuses

Daniel Katz and Gabrielle Allen discuss strategies for advancing computational and data science at the university level.



Experiences and Lessons from the Center for Computation & Technology

[This paper is the work of Daniel S. Katz (CCT Director of Cyberinfrastructure Development, 2006 to 2009) and Gabrielle Allen (CCT Assistant Director, 2003 to 2008); it does not reflect the views or opinions of the CCT or LSU.]

Introduction

In recent years, numerous distinguished national panels (1) have critically examined modern developments in research and education and reached a similar conclusion: computational and data-enabled science, as the third pillar of research standing equally alongside theory and experiment, will radically transform all areas of education, scholarly inquiry, and industrial practice, as well as local and world economies. The panels also concluded that to facilitate this transformation, profound changes must be made throughout government, academia, and industry. The remarks made in the 2005 President's Information Technology Advisory Committee (PITAC) report (2) are still relevant: “Universities...have not effectively recognized the strategic significance of computational science in either their organizational structures or their research and educational planning.” Computational initiatives associated with universities have taken various forms: supercomputing centers that provide national, statewide, or local computing facilities and encourage research involving computation; faculty hiring initiatives focused on initiating research programs to change the university's expertise and culture; establishment of academic research centers on campuses that include formal involvement of faculty, for example through joint positions with departments; and multi-university or other partnerships where the university is represented by a single entity.

We believe that any academic institution wishing to advance computational and data science needs to first examine its status in three areas: cyberinfrastructure facilities, support for interdisciplinary research, and computational culture and expertise (Figure 1). Cyberinfrastructure facilities refers to the computational, storage, network, and visualization resources (local, national, and international) to which researchers have access; to the technical and professional support for these services; and to the connection of these services to desktop machines or experimental instruments in an end-to-end manner. Support for interdisciplinary research refers to the university's policies on joint appointments between units and associated promotion and tenure, policies and practices for university-wide curricula, and the academic appreciation of computational science that could rate, for example, software or data development in a similar manner to publications and citations. Finally, computational culture and expertise relates to the existence and prominence of faculty across a campus who develop or use computation as part of their research, and the provision of undergraduate and graduate courses that will train and educate students to work on research projects in the computational sciences.

Figure 1: Advancing a comprehensive computational science program requires coordinated initiatives in developing and supporting interdisciplinary research, enabling cyberinfrastructure, and underlying research and culture in computation.

Once the status of these areas has been reviewed, there are additional questions in designing a computational initiative. Should the cyberinfrastructure resources be state-of-the-art to enable leading edge research in computational science? Should faculty expertise in computational science be pervasive across all departments or concentrated in a few departments? Will the university administration back a long-term agenda in computational science and have the sustained desire to implement policies for changing culture? What is the timescale for change?

While there is some literature on issues relating to general interdisciplinary research (e.g., a National Academy review) (3), there is little written on the underlying visions, strategies, issues, practical implementations and best practices for computational initiatives. Further, what exists was usually written for a specific purpose, such as justifying an initiative for a state legislature, funding agency, or campus administration.

Louisiana Experiences

In April 2001, Louisiana Governor Foster asked the state Legislature to fund an Information Technology Initiative as a commitment to the 20-year Vision 2020 plan, adopted in 2000 to grow and diversify the state's economy. The legislature authorized a permanent $25 million per year commitment, divided among the state's five research institutions. LSU created the Center for Applied Information Technology and Learning (LSU CAPITAL), targeting funds in education, research, and economic development, with the intent that this investment would result in the creation of new businesses, increased graduates in IT areas, and increased patents and licenses. Edward Seidel was recruited from the Max Planck Institute for Gravitational Physics (AEI) to formulate a vision and detailed plan (4) to structure LSU CAPITAL into a research center related to computation and information technology, with a physical presence on the campus and a broad mission for interdisciplinary research at LSU and across the state. Seidel became director of LSU CAPITAL, reporting to the LSU vice chancellor of research and economic development. In October 2003, LSU CAPITAL was renamed the LSU Center for Computation & Technology, or CCT (http://www.cct.lsu.edu). LSU was lacking in all three of the areas identified in Figure 1 (cyberinfrastructure; support for interdisciplinary research and education; and computational research), which necessitated a three-pronged approach for the center's strategy (5,6).

Cyberinfrastructure

To address LSU's cyberinfrastructure needs, CCT worked to develop campus and regional networks, connect to the national high-speed backbone, and build sustainable computational resources on the campus. (A negative side effect of including a focus on the provision of cyberinfrastructure resources is that some people tend to label the center as just a High Performance Computing (HPC) resource provider, rather than a research center; this proved to be an issue with how the center was represented and seen by the LSU administration.) CCT led an effort to propose a statewide high-speed network (called LONI) to connect state research institutions with multiple 10-Gbps optical lambdas. Louisiana Governor Blanco then mentioned LONI as a priority in her State of the State address. At this time, National LambdaRail (NLR) was emerging as a high-speed optical national backbone without a plan to connect to Louisiana. In 2004, Governor Blanco committed $40 million over 10 years to fund LONI, including purchasing and deploying initial computational resources at five sites and supporting technicians and staff, to advance research, education, and industry in the state. The state also funded a membership in NLR to connect the state to computational power available throughout the nation and the world.

When the CCT was formed, LSU had recently deployed what were then significant computational resources: 128-node and 512-node dual-processor clusters managed by staff from the physics department, and a 46-node IBM Power2/Power3 machine managed by the university’s Information Technology Services (ITS). LSU created the HPC@LSU group, funded 50-50 by CCT and ITS, to jointly manage these systems, which were the only major compute resources in Louisiana. HPC@LSU also began to manage the LONI compute systems (IBM Power5 clusters) and, later, additional Dell systems for both LONI and LSU, including Queen Bee (the largest LONI system), operated as part of the TeraGrid, the US national HPC infrastructure.

CCT envisioned building a campus and national center for advancing computational sciences across all disciplines, with these groups' research activities integrated as closely as possible with the research computing environment. In this way, the services provided by the computing environment to the campus and nation would be the best possible, and the research output of the faculty, students, and staff would be advanced. CCT faculty would be able to lead nationally visible research activities, carrying out research programs that would not otherwise be possible, providing exemplars to the campus, and catalyzing activity in computational science approaches to basic sciences, engineering, humanities, business, etc. This was a key component of the CCT vision, one that has been successful at other centers (e.g., NCSA, SDSC, AEI) around the world.

Computational Research

Initially, there were very few computationally oriented faculty in Louisiana, which hindered research in computational science, state collaborations, and LSU's involvement in national or international projects involving computation. To address this, CCT's core strategy has been to recruit computationally oriented faculty to LSU, generally in joint 50-50 positions with departments, with tenure residing in the departments. This model has been discussed at length and has continuously been seen as the best model for strengthening departments in computational science and encouraging real buy-in to the overall initiative from the departments. CCT also implements other strategies for associating faculty with the center, both to encourage and support faculty already on the campus in taking an active role in the center's programs and research, and to help attract and recruit faculty whose research interests overlap with CCT's.

Research staff are also essential, making it possible to quickly bring in expertise in a particular computational area as a catalyst and tool for faculty recruitment, to form a bridge from center activities to the campus, to provide consistent support to strategically important areas, and to facilitate production level software development.

The fundamental group (in the CCT Core Computing Sciences Focus Area), centered around the Departments of Computer Science, Electrical and Computer Engineering, and Mathematics, was to provide the skills needed to build and sustain any program in computational science, including computational mathematics, scientific visualization, software toolkits, etc. Application groups were built to leverage strengths on campus, hiring possibilities, and new opportunities.

In addition, CCT’s Cyberinfrastructure Development (CyD) division aimed to better integrate CCT’s research and HPC activities with campus and national initiatives, with the mission to design, develop, and prototype cyberinfrastructure systems and software for current and future users of LSU's supercomputing systems, partnering where possible with the research groups at CCT to help professionalize prototype systems and support and expand their user base. CyD includes computational scientists, expected to spend 30-50% of their time on proposals led by scientists elsewhere at LSU or LONI, and the rest of their time on computational science activities that lead to new funding or projects and on internal support of HPC and LONI activities.

CCT’s education goal has been to cultivate the next generation of leaders in Louisiana’s knowledge-based economy, creating a highly skilled, diverse workforce. To reach this goal, objectives were set to assist in developing curricula and educational opportunities related to computation, to help hire faculty who would support an integrated effort to incorporate computation into the curricula, to offer programs that support activity in scientific computing, to attract and retain competitive students, and to advance opportunities for women and minorities in the STEM disciplines.

Interdisciplinary Research

The final component of the triangle, interdisciplinary research, was supported by CCT’s organization and projects. CCT faculty are generally able to lead and take part in world-class interdisciplinary research groups related to computation, organized in focus areas: Core Computing Sciences, Coast to Cosmos, Material World, Cultural Computing, and System Science & Engineering. Each focus area has a faculty lead responsible for building cross-cutting interdisciplinary research programs, administration, coordinating the hiring of new faculty and staff, and organizing their unit. Interdisciplinary research is driven by activities in strategically motivated, large-scale projects in the focus areas, faculty research groups, and the Cyberinfrastructure Development division. These projects provide support (students, postdocs, and direction) to the focus areas as well as broad outreach for education and training across the state. In addition, CCT tried to engage senior administrators and use CCT faculty to drive curriculum change on the campus.

Crosscutting Activities

Two large projects begun in 2007 were the LONI Institute and Cybertools.  The LONI Institute was a statewide multi-university collaboration, built on the success of the LONI university partnership, to coordinate the hiring of two faculty members at each university, in computer science, computational biology, and/or computational materials, and of one computational scientist at each university, to spur collaborative projects.  Cybertools was another multi-university collaboration that used computational science projects across the state to drive developments in tools that could use the state’s computing and networking resources, which in turn could enable new computational science projects.

Particularly from the state legislature's point of view, CCT was intended to catalyze and support new economic development in the state. In fact, the initial metrics for success provided for LSU CAPITAL included the number of resulting new businesses and patents. Economic development needs to be carefully planned and is a long-term initiative, where success can be hard to measure, particularly in the short term. An example success, though not originally planned, came in September 2008, when Electronic Arts (EA) announced that it would place its North American quality assurance and testing center at LSU, creating 20 full-time jobs and 600 half-time jobs, with an annual payroll of $5.7 million over the next two years. EA noted that education and research efforts at LSU, including CCT research areas, were a strong factor in the company's decision to locate this center in Louisiana.

Recent Developments and Concluding Thoughts

In 2008, Seidel was recruited to the National Science Foundation, and LSU appointed an interim director and co-director and began a search for a new permanent director, which led to a director being appointed from inside the university for a three-year term. Starting in 2009, LSU has faced several significant and ongoing budget cuts that are currently impacting the CCT, particularly in its ability to recruit and retain faculty and staff.

The issues faced at LSU are similar to those at other institutions responding to the nation's call for an advancement of computation, computational science, and interdisciplinary research. We believe it is important to carefully analyze the experiences of centers such as the one at LSU, as we have begun to do in this paper, in order to establish best practices for new initiatives or to lead to more fundamental change. From our experiences at CCT, we can highlight four key points that we feel are crucial for the success and sustainability of computational research centers such as CCT:

  • The three facets of computational science shown in Figure 1 have to be taken seriously on the campus at the highest levels and seen as an important component of academic research.
  • HPC facilities on campuses need to be integrated with national resources and provide a pathway for campus research to easily connect to national and international activities.
  • Education and training of students and faculty are crucial; vast improvements are needed over the small numbers currently reached through HPC center tutorials, and computation and computational thinking need to be part of new curricula across all disciplines.
  • Funding agencies should put more emphasis on broadening participation in computation: not just focusing on high-end systems that decreasing numbers of researchers can use, but making tools much more usable and intuitive, freeing all researchers from the limitations of their personal workstations, and providing access to simple tools for large-scale parameter studies, data archiving, visualization, and collaboration.

In addition, there are two points that we have learned specifically from the CCT experience:

  • The overall vision of the university on a given topic needs to be consistent across a broad spectrum of the university administration and faculty; it cannot be just one person’s vision, though it may start with one person.
  • The funding needs to be stable over a number of years; activities need to be sustained to be successful, and this needs to be clear to the community from the beginning.

References

1. World Technology Evaluation Center, Inc. (2009) “International Assessment of Research and Development in Simulation-based Engineering and Science”, http://www.wtec.org/sbes/
2. President’s Information Technology Advisory Committee (2005) “Report to the President of the US, Computational Science: Ensuring America’s Competitiveness”, http://www.nitrd.gov/pitac/reports/20050609_computational/computational.pdf
3. Committee on Facilitating Interdisciplinary Research, National Academy of Sciences, National Academy of Engineering, Institute of Medicine (2004) “Facilitating Interdisciplinary Research”, http://www.nap.edu/catalog/11153.html
4. Seidel, E., Allen, G., & Towns, J. (2003) “LSU CAPITAL Center (LSUC) Immediate Plans”, http://figshare.com/articles/Original_LSU_CAPITAL_plan/92822
5. CCT Strategic Plan (2006–2010), http://www.cct.lsu.edu/uploads/CCTStrategicPlan20062010.pdf
6. CCT Faculty Plan (2006), http://www.cct.lsu.edu/~gallen/Reports/FacultyPlan_2006.pdf

International Council for Science (ICSU) and the Challenges of Big Data in Science

Ray Harris discusses the challenges of Big Data and ICSU’s approach to Big Data analytics.



The Fourth Paradigm

The enormous amounts of data now available to science and to society at large have stimulated some authors to say that we are in the Fourth Paradigm of data-intensive science (1). The First Paradigm was the period of observation, description and experimentation characterised by early scientists and explorers such as Ptolemy and Ibn Battuta. The Second Paradigm was that of the development of theory to explain the way the world works, such as in Maxwell’s equations and Newton’s theory of gravitation and laws of motion. The Third Paradigm developed the earlier theories to create extensive simulations and models, such as those used in weather forecasting and in climatology. The reason for the step change to a new paradigm, the Fourth Paradigm, is that the volume of data available to us, now often termed Big Data, is so large that it is both presenting many new opportunities for analysis and requiring new modes of thinking, for example in the International Virtual Observatory Alliance and in citizen science.

Big Data

One clear example of Big Data is the Square Kilometre Array (SKA), planned to be constructed in South Africa and Australia. When the SKA is completed in 2024 it will produce in excess of one exabyte of raw data per day (1 exabyte = 10^18 bytes), which is more than the entire daily internet traffic at present. The SKA is a 1.5 billion Euro project that will have more than 3000 receiving dishes to produce a combined information collecting area of one square kilometre, and will use enough optical fibre to wrap twice around the Earth. Another example of Big Data is the Large Hadron Collider, at the European Organisation for Nuclear Research (CERN), which has 150 million sensors and is creating 22 petabytes of data in 2012 (1 petabyte = 10^15 bytes; see Figure 1). In biomedicine the Human Genome Project is determining the sequences of the three billion chemical base pairs that make up human DNA. In Earth observation there are over 200 satellites in orbit continuously collecting data about the atmosphere and the land, ocean and ice surfaces of planet Earth, with pixel sizes ranging from 50 cm to many tens of kilometres.
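
To put these volumes on a single scale, a rough back-of-the-envelope comparison may help. It uses only the round numbers quoted above, and note that raw SKA output and recorded LHC data are not strictly like-for-like quantities:

    # Rough comparison of the data volumes quoted above (the article's round numbers).
    PB = 10**15   # 1 petabyte
    EB = 10**18   # 1 exabyte

    ska_per_day = 1 * EB   # SKA: over one exabyte of raw data per day
    lhc_2012 = 22 * PB     # LHC: 22 petabytes of data in 2012

    print(f"One SKA day is ~{ska_per_day / lhc_2012:.0f}x the LHC's 2012 total")
    print(f"One SKA year is ~{ska_per_day * 365 / EB:.0f} EB of raw data")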

In a paper in the journal Science in 2011, Hilbert and Lopez (2) estimated that if all the data stored in the world today were written to CD-ROMs and the CD-ROMs piled up in a single stack, the stack would stretch all the way from the Earth to the Moon and a quarter of the way back again. A report by the International Data Corporation (IDC) (3) in 2010 estimated that by the year 2020 there will be 35 zettabytes (ZB; 1 ZB = 10^21 bytes) of digital data created per annum.
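
As a sanity check on that image, a short sketch can roughly reproduce the claim. The ~295 exabyte total is Hilbert and Lopez's 2007 estimate of optimally compressed stored information; the per-disc capacity and thickness are typical CD-ROM values assumed here, not figures taken from the paper:

    # Stack-of-CDs arithmetic (assumptions: ~700 MB and 1.2 mm per disc).
    TOTAL_BYTES = 295e18      # ~295 EB stored worldwide (Hilbert & Lopez, 2011)
    CD_BYTES = 700e6          # assumed CD-ROM capacity
    CD_THICKNESS_M = 1.2e-3   # assumed disc thickness (1.2 mm)
    EARTH_MOON_KM = 384_400   # mean Earth-Moon distance

    n_discs = TOTAL_BYTES / CD_BYTES
    stack_km = n_discs * CD_THICKNESS_M / 1000
    print(f"{n_discs:.1e} discs, stack of {stack_km:,.0f} km")
    print(f"= {stack_km / EARTH_MOON_KM:.2f} Earth-Moon distances")

Running this gives a stack of roughly 500,000 km, about 1.3 Earth-Moon distances, in line with the authors' illustration.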

Figure 1: Overview of data scale from megabytes to yottabytes (log scale).

International Council for Science

The International Council for Science (ICSU) is the coordinating organisation for science and is taking a leading role in further developing the capability of science to exploit the new era of the Fourth Paradigm. The members of ICSU are the 121 national scientific bodies, such as the Australian Academy of Science and the Royal Society, plus the 31 international science unions, such as the International Astronomical Union and the International Union of Crystallography. ICSU has always been committed to the principle of the universality of science, and in its vision (4) it sees:

“… a world where science is used for the benefit of all, excellence in science is valued and scientific knowledge is effectively linked to policy making. In such a world, universal and equitable access to high quality scientific data and information is a reality …”

To make universal and equitable access to data a reality, ICSU established three initiatives to address how it can encourage better management of science data (5):

  • Panel Area Assessment on Scientific Information and Data, 2003–2004.
  • Strategic Committee on Information and Data, 2007–2008.
  • Strategic Coordinating Committee on Information and Data, 2009–2011.

World Data System

One of the main outcomes of these ICSU initiatives is the establishment of the World Data System (WDS). In 1957, during the International Geophysical Year (IGY), several World Data Centres were initiated to act as repositories for data collected during the IGY. The number of these data centres increased over time, but they were never fully coordinated. The World Data System is now in the process of rejuvenating these data centres by establishing an active network of centres that practise professional data management. The objectives of the WDS are as follows:

  • Enable universal and equitable access to quality-assured scientific data, data services, products and information;
  • Ensure long term data stewardship;
  • Foster compliance to agreed-upon data standards and conventions;
  • Provide mechanisms to facilitate and improve access to data and data products.

By early 2012 over 150 expressions of interest in the WDS had been received by ICSU, resulting in over 60 formal applications for membership. Approved members of the World Data System so far include centres for Antarctic data (Hobart), climate data (Hamburg), ocean data (Washington DC), environment data (Beijing) and solid Earth physics data (Moscow) plus the International Laser Ranging Service and the International Very Long Baseline Interferometry Service. By 2013 it is anticipated that the WDS will comprise over 100 centres and networks of active, professional data management.

Further actions

There is still much to do in developing a professional approach to data management in science. The main outstanding issues were addressed by the ICSU Strategic Coordinating Committee on Information and Data noted above and include the following: better guidance for best practice on data management; improved definitions of the various terms used in the phrase “open access”; greater recognition of the publication of data by scientists as well as the publication of journal articles and books; practical help in data management for less economically developed countries through partnership with members of the ICSU family and others; and cooperation with commercial companies for mutual benefit.

Conclusion

Big Data presents science with many challenges, but at the same time presents many opportunities to influence how science grows and develops for the better, not least by adding data-driven science to hypothesis-driven science. Improvements in professional data management will result in better science.


References
1. Hey, T., Tansley, S. & Tolle, K. (eds.) (2009) “The Fourth Paradigm: Data-Intensive Scientific Discovery”, Microsoft Research
2. Hilbert, M. & Lopez, P. (2011) “The world’s technological capacity to store, communicate and compute information”, Science, 332, 1 April 2011, 60–65
3. IDC (2010) “IDC Digital Universe Study”, sponsored by EMC, May 2010, available at http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
4. ICSU Strategic Plan 2006–2011, International Council for Science, Paris, 64 pp.
5. All the reports are available at the ICSU website

Guiding Investments in Research: Using Data To Develop Science Funding Programs and Policies

Norman Braveman demonstrates how sophisticated text mining technologies can be used to analyze Big Data.



One important goal of organizations that provide funds for biomedical and behavioral research is to encourage and support research that leads to more effective health promotion, better disease prevention, and improved treatment of disease, thereby building a scientific evidence base. To ensure that an organization or its programs are effectively moving science toward this objective, funding organizations must continually assess and re-assess their goals, directions, and progress. While funding organizations can carry out these program assessments in a variety of ways, several discrete and interlinked components are common to all approaches, including development of:

  • a strategic plan that identifies organizational values, mission, priorities, and objectives;
  • an implementation plan listing the timelines, benchmarks, mechanisms of implementation, and the sequence of events related to the elements of the strategic plan;
  • a logic model, based on information gained from all stakeholders, which identifies inputs or available resources along with expected outcomes from the organization’s activities;
  • a gap analysis, which assesses progress in reaching organizational goals and in carrying out the implementation plan by asking where the organization currently stands relative to where it expected or wanted to be.

In the process of conducting a gap analysis, the organization also addresses specific questions about the current state of the science and the pathways to scientific advancement: what is needed to move science ahead, and what the barriers to and opportunities for progress are. Nevertheless, most program assessments by funding organizations use what I call ‘demographic information’, that is, information that answers questions about the number of grants in a portfolio, how much is being spent on a particular funding program, the mix of grant mechanisms (e.g., basic vs. translational vs. clinical research; investigator-initiated vs. solicited research; single-project grants vs. large center and multi-center grants), and the number of inventions or patents resulting from research supported by any individual portfolio of grants or group of grant portfolios.

While these kinds of measures may be excellent indicators of an organization's progress, with the exception of information about inventions and patents they are at least one step removed from measuring the impact of an organization’s grant portfolios on the content of science itself. To maximize the impact of organizational activities and programs on progress in science, the analysis should use science itself as the data that guide the planning, development, and implementation of programs and policies. This is, after all, what the scientists whose research the organization supports do when justifying the next steps in their own research.

There are times when organizations analyze the science of the grants in their portfolios by capturing key words in the titles, abstracts, progress reports, and/or grants or grant applications. These are generally tracked over time by program analysts. While the program analysts are typically highly experienced and/or trained, they carry out the analysis by hand, and from time to time, from document to document, or from person to person, the algorithm they use in classification and categorization can shift in small ways. Such shifts introduce a source of variability that can reduce the reliability, and perhaps even the validity, of the final results. Moreover, analyzing science by hand is a long, tedious, and expensive task, so the tendency is to do this kind of detailed analysis infrequently…clearly not in ‘real time’, as seems to be needed in this age of fast-paced discovery.

Scientific Fingerprinting

Fortunately, technology now exists that allows us to analyze the content of science in a valid, reliable, and timely way, overcoming many of the problems that crop up when we do it by hand. More than that, because this approach is computer-based, and therefore fast and reliable, it allows us to carry out assessments often and on a regular basis. The approach I’m referring to involves the formal textual analysis of scientific concepts and knowledge contained within documents such as strategic and implementation plans, grant applications, progress reports, and the scientific literature. We refer to the output of the textual analysis of a document as a ‘scientific fingerprint’ or simply a ‘fingerprint’.


Without going into the details of the underlying processing, a fingerprint is a precise abstract representation of a text that allows us to look into the text or content rather than only looking at the metadata or demographics. Because fingerprinting is concept driven rather than keyword driven, and because it uses an ontology (i.e., A is an example of B) as its base, a term does not need to appear in a document in order to be part of its fingerprint. For example, a document may contain all of the diagnostic characteristics of a disease but never name the disease, and the disease name will still appear in the fingerprint. The only requirement is that the diagnostic characteristics be identified as examples of the named disease somewhere in the scientific literature that makes up the database that is searched. Further, the concepts, or the weights given to individual concepts, comprising a scientific fingerprint of textual content can be adjusted to fit the views of experts in the field. Thus they are not adopted blindly or without validation by experts.
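
To make the idea concrete, here is a toy sketch of concept-driven matching. The mini ontology, terms, and weights are invented for illustration and bear no relation to the actual Fingerprint engine or its thesauri:

    # Toy concept-driven ("fingerprint") matching vs. plain keyword matching.
    # The ontology below encodes "A is an example/indicator of B".
    ONTOLOGY = {
        "polyuria": "diabetes mellitus",
        "polydipsia": "diabetes mellitus",
        "elevated fasting glucose": "diabetes mellitus",
        "insulin resistance": "diabetes mellitus",
    }

    def fingerprint(text):
        """Weight each broader concept by how many of its indicator terms
        appear, so a text can score for a disease it never names."""
        text = text.lower()
        weights = {}
        for term, concept in ONTOLOGY.items():
            if term in text:
                weights[concept] = weights.get(concept, 0.0) + 1.0
        return weights

    abstract = ("Patients presented with polyuria, polydipsia, and "
                "elevated fasting glucose.")

    print("keyword 'diabetes' found:", "diabetes" in abstract.lower())  # False
    print("fingerprint:", fingerprint(abstract))  # {'diabetes mellitus': 3.0}

Here a keyword search for the disease name finds nothing, while the concept-driven pass assigns the document to the disease via its indicators, which is the behaviour described above.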

Fingerprinting uses as its base for comparison and analysis the entirety of the Elsevier Scopus database, consisting of 45.5 million records, and the information needed to develop a fingerprint can be captured relatively easily. Because it is computer-based, the textual analysis of a grant portfolio is also much faster than analysis by hand, allowing an organization’s scientific grant portfolio to be assessed and reassessed continually. The method is applicable to any document and can be used at any stage of evaluation and program development; in short, fingerprinting makes continual, real-time assessment possible. Finally, as science changes, as reflected in the scientific literature itself, the fingerprint profile of a given area of science will change, and those changes can be documented and used in an analysis of progress both in science and in the organization’s grant portfolio.
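
One simple way to picture how fingerprints support comparison and tracking over time is to treat each fingerprint as a weighted concept vector and measure the overlap between two of them. The sketch below assumes cosine similarity and invented concept weights; the production system’s actual matching method is not described in this article:

    import math

    def cosine(fp_a, fp_b):
        # Cosine similarity between two {concept: weight} fingerprints.
        dot = sum(w * fp_b.get(c, 0.0) for c, w in fp_a.items())
        norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
        norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    grant_2010 = {"diabetes mellitus": 1.5, "insulin therapy": 0.8}
    grant_2012 = {"diabetes mellitus": 1.2, "stem cells": 1.0}
    print(round(cosine(grant_2010, grant_2012), 3))
    # Tracking this score over successive years shows how a portfolio
    # drifts as the underlying science changes.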

While this approach to textual analysis is in its infancy, fingerprinting can allow organizational decision making to be based on the state of the science, help align organizational goals with the current state of the science, and clarify the organization’s role and contributions within a specific area of science. Coupled with demographic data charting the organization’s performance, it can provide a fuller picture of the organization’s current role in moving science forward, and of the role it might play in future scientific development.

A full version of this paper can be found on Braveman BioMed Consultants’ website.


Big Data: Science Metrics and the black box of Science Policy

This contribution, by Julia Lane, illustrates how Big Data sets should be used to inform funding and science policy decisions.

Read more >


The deluge of data and metrics is generating much heat but shedding little light on the black box of science policy. The fundamental problem is conceptual: metrics that connect science funding interventions with numbers of documents miss the key link. Science is done by scientists. Dashboards of metrics that do not link back to scientists are like dashboards with no cables running to the engine. They do not provide policy makers with information on how or why funding changed the way in which scientists created and transmitted knowledge. In other words, while bibliometricians have made use of the data deluge to make enormous advances in understanding how to manage scientific documents, the science policy community needs to use the data deluge to make enormous advances in understanding how to manage science (1).

Missing causal links matters

If the focus of funding agencies turns to forcing scientists to produce scientific papers and patents, then scientists will do just that. But if, as the evidence suggests, the essence of science is the creation, transmission and adoption of knowledge via scientific networks, then by missing the causal links the agencies may distort and retard the very activity they wish to foster. Funding agencies must develop “the ability to define a clear policy intervention, assess its likely impact on the scientific community, find appropriate measures of scientific activities in the pre- and post-period, and define a clear counterfactual” (2) (italics added). This is no different in spirit from Louis Pasteur’s swan-necked flask experiment (see Figure 1), which demonstrated that spontaneous generation does not occur and that life is not created out of non-life (3). Like any scientist, we must develop the appropriate conceptual framework that enables us to write down the theory of change of how science policy interventions work, describing what makes the engine run (4).
Figure 1: Illustration of swan-necked flask experiment used by Louis Pasteur to test the hypothesis of spontaneous generation.

A sensible organizing framework, provided by Ian Foster, identifies individual scientists (or the scientific community, consisting of networks of scientists) as the “engine” that generates scientific ideas. In this framework the theory of change is that there is a link between funding and the way in which those networks assemble. Then, in turn, there is a link between scientific networks and the way in which ideas are created and transmitted, and hence used to generate scientific, social, economic and workforce “products”.


Figure 2: The practice of science (source: Ian Foster)

Big Data offers science funders a tremendous opportunity to capture those links, precisely because the causal links are often so long and tenuous that relying on manual, individual reporting is, quite simply, destined to fail. The science of science policy community has been developing a body of knowledge about how to think about and identify those links, rather than simply declaring, as the cartoon (below) would have it, “that a miracle occurred”. The Summer issue of the Journal of Policy Analysis and Management (5), from which the quote was drawn, features articles that document what a science of science policy means in practice; namely, bringing the same intellectual set of models, tools and data to science policy as has been brought to labor, education, and health policy, among many others (6). The September NSF SciSIP Principal Investigator conference (7) will demonstrate how far this new scientific field has come in moving towards more theoretically grounded metrics, in many cases both by building on the impressive work done by bibliometricians and by working with experts in the field. And the STAR METRICS program has built on the efforts of that community to begin to provide a linked data infrastructure on which those metrics can be founded (8). In sum, Big Data offers an enormous opportunity to advance the science of science policy. Making the links, so that science funders have a new understanding of what is needed to foster science, will shine new light on what has hitherto been a rather black box within which miracles occurred.

References

1. Lane, J. (2010) “Let’s make science metrics more scientific”, Nature 464, 488–489.
2. Furman, J. L., Murray, F. & Stern, S. (2012) “Growing Stem Cells: The Impact of Federal Funding Policy on the U.S. Scientific Frontier”, J. Pol. Anal. Manage. 31, 661–705.
3. Wikipedia, “File:Experiment Pasteur English.jpg”, at http://en.wikipedia.org/wiki/File:Experiment_Pasteur_English.jpg
4. Gertler, P.J., Martinez, S., Premand, P., Rawlings, L.B. & Vermeersch, C.M.J. (2011) “Impact Evaluation in Practice”, World Bank, at http://siteresources.worldbank.org/EXTHDOFFICE/Resources/5485726-1295455628620/Impact_Evaluation_in_Practice.pdf
5. Journal of Policy Analysis and Management, Volume 31, Issue 3 (Summer 2012), at http://onlinelibrary.wiley.com/doi/10.1002/pam.2012.31.issue-3/issuetoc
6. Lane, J. & Black, D. (2012) “Overview of the Science of Science Policy Symposium”, J. Pol. Anal. Manage. 31, 598–600.
7. NAS CNSTAT SciSIP Principal Investigator Conference (2012), at http://www7.nationalacademies.org/cnstat/SciSIP%20Invitation.pdf
8. Largent, M. A. & Lane, J. I. (2012) “Star Metrics and the Science of Science Policy”, Review of Policy Research 29, 431–438.

The Evolution of Big Data as a Research and Scientific Topic: Overview of the Literature

This overview explores the evolution of Big Data as a scientific topic of investigation in an article that frames the topic within the peer-reviewed literature.

Read more >


The term Big Data is used almost everywhere these days: in news articles and professional magazines, in tweets, YouTube videos and blog discussions. The term, coined by Roger Magoulas of O’Reilly Media in 2005 (1), refers to a wide range of large data sets that are almost impossible to manage and process using traditional data management tools, due to their size but also their complexity. Big Data can be seen in finance and business, where enormous amounts of stock exchange, banking, and online and onsite purchasing data flow through computerized systems every day, and are captured and stored for inventory monitoring and for analysis of customer and market behavior. It can also be seen in the life sciences, where big data sets such as genome sequences, clinical data and patient data are analyzed and used to advance breakthroughs in science and research. Other areas of research where Big Data is of central importance include astronomy, oceanography, and engineering, among many others. The leap in computational and storage power enables the collection, storage and analysis of these Big Data sets, and companies introducing innovative technological solutions to Big Data analytics are flourishing. In this article, we explore the term Big Data as it emerged in the peer-reviewed literature. As opposed to news items and social media articles, peer-reviewed articles offer a glimpse into Big Data as a topic of study, and into the scientific problems, methodologies and solutions that researchers are focusing on in relation to it. The purpose of this article, therefore, is to sketch the emergence of Big Data as a research topic from several points of view: (1) timeline, (2) geographic output, (3) disciplinary output, (4) types of published papers, and (5) thematic and conceptual development. To accomplish this overview we used Scopus.

Method

The term Big Data was searched for on Scopus using the index and author keywords fields. No variations of the term were used, in order to capture only this specific phrase. It should be noted that other phrases, such as “large datasets” or “big size data”, appear throughout the literature and might refer to the same concept as Big Data. However, the focus of this article was to capture the prevalent Big Data phrase itself and examine the ways in which the research community adopted it and embedded it in the mainstream research literature. The search results were further examined manually in order to verify the match between the articles’ content and the phrase Big Data. Special attention was given to articles from the 1960s and 1970s retrieved using the above fields. After close evaluation of the results set, only 4 older articles were removed, leaving 306 core articles. These core articles were then analyzed using the Scopus analytics tool, which enables different aggregated views of the results set based on year, source title, author, affiliation, country, document type and subject area. In addition, a content analysis of the titles and abstracts was performed in order to extract a timeline of themes and concepts within the results set.
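
As a rough illustration of the manual screening step, the Python sketch below (with an invented record structure standing in for exported Scopus results) keeps only records whose keywords contain the exact phrase, mirroring the decision to exclude near-miss matches:

    records = [
        {"year": 1970, "keywords": ["big data", "soundings"]},
        {"year": 1999, "keywords": ["large datasets"]},
        {"year": 2011, "keywords": ["Big Data", "cloud computing"]},
    ]

    # Keep only records whose keyword list contains the exact phrase,
    # case-insensitively, discarding near misses such as "large datasets".
    core = [r for r in records
            if any(k.lower() == "big data" for k in r["keywords"])]
    print(len(core), "core articles")  # prints: 2 core articles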

Results

The growth of research articles about Big Data from 2008 to the present is easily explained, as the topic has gained much attention over the last few years (see Figure 1). It is, however, interesting to take a closer look at older instances where the term was used. For example, the first appearance of the term Big Data is in a 1970 article on atmospheric and oceanic soundings (according to data available in Scopus; see study limitations). That article discusses the Barbados Oceanographic and Meteorological Experiment (BOMEX), which was conducted in 1969 (2) as a joint project of seven US departments and agencies with the cooperation of Barbados. The BOMEX site features a photo of a large computer probably used at the time to process the large amounts of data generated by the project (3). Other early occurrences of the term usually relate to computer modeling and software/hardware development for large data sets in areas such as linguistics, geography and engineering.

Figure 1: Timeline of Big Data as a topic of research. The dotted line is the exponential growth curve that best fits the data shown by the blue bars; the number of Big Data articles is growing faster than this best exponential fit.
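
The trend line in Figure 1 can be reproduced in principle by a log-linear least-squares fit, as in the following Python sketch; the yearly counts used here are placeholders, not the study’s actual figures:

    import numpy as np

    # Placeholder yearly publication counts (not the study's actual data).
    years = np.array([2006, 2007, 2008, 2009, 2010, 2011], dtype=float)
    counts = np.array([5.0, 7.0, 12.0, 22.0, 45.0, 95.0])

    # Fit log(counts) = a*year + b, i.e. counts ~ exp(b) * exp(a*year).
    a, b = np.polyfit(years, np.log(counts), 1)
    fitted = np.exp(a * years + b)

    for year, observed, expected in zip(years, counts, fitted):
        print(int(year), observed, round(float(expected), 1))
    # If the observed counts pull ahead of the fitted curve in the last
    # years, growth is faster than the best exponential fit.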

When segmenting the timeline and examining the subject areas covered in different timeframes, one can see that the early papers (i.e., until 2000) are led by engineering, especially computer engineering (neural networks, artificial intelligence, computer simulation, data management, mining and storage), but also areas such as building materials, electric generators, electrical engineering, telecommunication equipment, cellular telephone systems and electronics. From 2000 onwards, the field is led by computer science, followed by engineering and mathematics. Another interesting finding in terms of document types is that conference papers are most frequent, followed by journal articles (see Figures 2 and 3). The prominence of these conference papers also shows up in the thematic analysis of abstracts and titles below.

Figure 2: Document types of Big Data papers.

Figure 3: Conference papers and Articles growth over time.

The top subject area in this research field is, not surprisingly, computer science, but other disciplines investigating the topic are also noticeable, such as engineering, mathematics, business, and the social and decision sciences (see Figure 4). Other subject areas that are evident in the results set but not yet showing significant growth are chemistry, energy, arts and humanities, and environmental sciences. In the arts and humanities, for example, there is growing interest in the development of e-science infrastructure for humanities digital ecosystems (for instance, text mining), or in using census data to improve the allocation of funds from public resources.

Figure 4: Subject areas researching Big Data.

Finally, we took a look at the geographical distribution of papers. The USA has published the highest number of papers on Big Data by far, followed by China in second place (see Figure 5). In both countries the research on Big Data is concentrated in computer science and engineering. However, while in the USA these two areas are followed by biochemistry, genetics and molecular biology, in China they are followed by mathematics, materials science and physics. This observation coincides with other research findings, such as the report International Comparative Performance of the UK Research Base: 2011 (4), which indicated that the USA is strong in research areas such as medical, health and brain research while China is strong in areas such as computer science, engineering and mathematics.

Figure 5: Geographical Distribution of Big Data papers.

In addition to the overall characteristics of the publications on Big Data, we also conducted a thematic contextual analysis of the titles and abstracts in order to understand how the topics within this field have evolved. To accomplish this, the abstracts and titles of the articles were collected in two batches: one file containing abstracts and titles of articles from 1999-2005, and a second file covering 2006-2012. The analysis concentrated on these years rather than the entire set, as there were multiple publications per year during this period. The texts were then entered into the freely available visualization software Many Eyes, which was used to create phrase maps of the top 50 occurring keywords in these texts. These visualizations were produced by ignoring common and connecting words such as ‘and’, ‘the’, ‘of’ etc., and by treating terms separated by a single space as connected (see Figures 6 and 7).

These maps visualize two main characteristics of the text: (1) connections between terms, depicted by the gray lines, where a thicker line denotes a stronger relationship between the terms; and (2) the centrality of the terms, depicted by font size (the bigger the font, the more frequently a term appears in the text). Clusters may appear when single words are connected to each other but not to other clusters.
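
A rough approximation of this phrase-map construction, assuming adjacency-based connections and an invented stopword list (Many Eyes’ exact algorithm is not documented here), might look like this in Python:

    from collections import Counter

    STOPWORDS = {"and", "the", "of", "a", "to", "in", "for", "on"}

    def phrase_pairs(text):
        # Adjacent content-word pairs after stopword removal; the pair
        # count stands in for line thickness, word frequency for font size.
        words = [w for w in text.lower().split() if w not in STOPWORDS]
        return Counter(zip(words, words[1:]))

    titles = ("big data analytics in the cloud and big data storage "
              "for distributed computing")
    for pair, n in phrase_pairs(titles).most_common(3):
        print(pair, n)  # ('big', 'data') appears twice, the rest once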

The two most striking observations when looking at these maps are the greater complexity of Figure 7 compared with Figure 6, and the close connectivity of the themes in Figure 7 compared with their scattered appearance in Figure 6.

Figure 6: Phrase map of highly occurring keywords 1999-2005.

The thematic characteristics of the 1999-2005 abstracts and titles show several scattered connections between pairs of words, seen on the right and left sides of the map. For example, neural network analysis, on the right side of the map, is a common concept in artificial intelligence computing. This map is conceptually quite simple, with most concepts concentrated around computer-related terms such as ‘data mining’, ‘data sets’, XML and applications. Compared with Figure 7, it can easily be seen that the term ‘big’, although strongly connected to ‘data’, is not as prominent as it becomes in later years.

Figure 7: Phrase map of highly occurring keywords 2006-2012.

The map in Figure 7 represents a tighter network of terms, all closely related to one another and to the Big Data concept, and a much richer picture of the research on Big Data. There is a clear evolution from basic data mining to specific issues of interest such as data storage and management, which lead to cloud and distributed computing. It could be said that the first period presents a naïve picture, in which solutions and topics revolve around a more ‘traditional’ view of the topic using known concepts such as XML and data mining, while the second period shows a more complex view, demonstrating innovative solutions such as cloud computing with an emphasis on networks. The same holds for terms such as ‘model’, ‘framework’ and ‘analytics’ that appear in Figure 7, which indicate development and growth in research directions.

A comparison of these two maps also reveals growing diversity in the topics surrounding Big Data, such as ‘social data’ and ‘user data’, and even specific solutions such as ‘MapReduce’, a model for processing large datasets introduced by Google (http://mapreduce.meetup.com/), and ‘Hadoop’, an open source software framework that supports data-intensive distributed applications (www.hadoop.apache.org).
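
For readers unfamiliar with the MapReduce model mentioned above, the canonical word-count example can be sketched in a few lines of plain Python; real Hadoop jobs distribute the same map, shuffle and reduce phases across many machines:

    from itertools import groupby
    from operator import itemgetter

    docs = ["big data needs big tools", "data tools for big data"]

    # Map phase: emit a (word, 1) pair for every word in every document.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle phase: bring identical keys together (here, by sorting).
    mapped.sort(key=itemgetter(0))

    # Reduce phase: sum the values for each key.
    counts = {key: sum(v for _, v in group)
              for key, group in groupby(mapped, key=itemgetter(0))}
    print(counts)  # e.g. {'big': 3, 'data': 3, ...}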

As mentioned in the document-type analysis above, conference papers are central to research in this area. As Figure 7 shows, the ACM and IEEE conferences of 2010-2012 play an important role, as can be seen from the clear appearance of these terms and their connection to the topic.

Conclusions

Research on Big Data emerged in the 1970s but has seen an explosion of publications since 2008. Although the term is commonly associated with computer science, the data show that it is applied in many different disciplines, including the earth, health and environmental sciences, engineering, and the arts and humanities. Conferences, especially those sponsored by IEEE and/or ACM, lead the growth of publications in this area, followed by journal articles. Geographically, research is led by the USA, followed by China and some European countries.

A closer look at the concepts and themes within the abstracts and titles over time shows how this area, which began as a computer and technology focus area with some satellite applications, developed into a close-knit discipline featuring applications, methodologies and innovative solutions ranging from cloud to distributed computing and focusing on user experience.

In May 2012, Elsevier sponsored a two-day conference in Canberra, Australia dedicated to the topics of Big Data, E-Science and Science Policy (see videos and links to the presentations here: https://www.youtube.com/playlist?list=PL61DD522B24108837). The topic was treated from a variety of viewpoints, including the analytics of Big Data sets in publishing, digital scholarship, research assessment and science policy. The multi-dimensional character of this topic is seen in the literature as well as in social media and online publications. Big Data as a research topic continues to grow: it is probable that by the end of 2012 the number of publications will have doubled, if not more, and its analytics and applications will be seen across various disciplines.

Limitations

This study was conducted using Scopus.com in August 2012, and the numbers and percentages presented in this article reflect the publications indexed at that time. These are bound to change, as Scopus.com is updated daily with new publications, including articles in press.

In addition, the dates and document types presented in this study derive directly from Scopus’s coverage of sources and dates. A similar search on other databases might yield slightly different findings, varying with each database’s coverage.

Useful Links

1. http://strata.oreilly.com/2010/01/roger-magoulas-on-big-data.html
2. http://www.eol.ucar.edu/projects/bomex/
3. http://www.eol.ucar.edu/projects/bomex/images/DataAcquisitionSystem.jpg
4. http://www.bis.gov.uk/assets/biscore/science/docs/i/11-p123-international-comparative-performance-uk-research-base-2011.pdf

Big Data, E-Science and Science Policy: Managing and Measuring Research Outcome (part 2)

A presentation by Dr. Michiel Kolman, SVP Academic Relations, Elsevier at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012. Day 2.

Read more >


A presentation by Dr. Michiel Kolman, SVP Academic Relations, Elsevier at the Big Data, E-Science and Science Policy conference in Canberra, Australia, 16th-17th May 2012. Day 2.

Link to presentation.


Research Evaluation in Practice: Interview with Linda Butler

During your career, you have taken part in government-driven research projects using bibliometrics methodologies. Could you give an example or two of the outcomes of these research projects and the way they informed scientific funding? The most influential body of research I have undertaken relates to analyses of the way Australian academics responded to the […]

Read more >


Linda Butler

During your career, you have taken part in government-driven research projects using bibliometrics methodologies. Could you give an example or two of the outcomes of these research projects and the way they informed scientific funding?

The most influential body of research I have undertaken relates to analyses of the way Australian academics responded to the introduction of a sector-wide funding scheme that distributes research funding to universities on the basis of a very blunt formula. The formula is based on data on research students, success in obtaining competitive grant income, and the number of research outputs produced. For research outputs, a simple count is used: it does not matter where a publication appeared, the rewards are the same. By looking in detail at the higher education sector, and after eliminating other possible causal factors, I was able to demonstrate that the introduction of the formula led to Australian academics significantly increasing their productivity above long-term trend lines. While the increase was welcome, of major concern to policy makers were the findings that the increase in output was particularly high in lower-impact journals, and that Australia’s relative citation impact had fallen below that of a number of its traditional OECD comparators.

These findings were part, though not all, of the driver for Australia to introduce a new funding system for research.  The same blunt formula is still being used, but it is anticipated that much of the funding it distributes will before long be based on the results of the Excellence in Research for Australia (ERA) initiative, the second exercise of which will be conducted in 2014 (the first was held in 2012). The same research has also been influential in Norway and other Scandinavian countries where governments sought to avoid the pitfalls of simple publication counts by introducing a tiered system of outputs, with those in more prestigious journals or from more prestigious publishers receiving a higher weighting and therefore resulting in greater funding.

See also: Powerful Numbers: Interview with Dr. Diana Hicks

Looking at the literature, there appear to be far more research evaluation studies focusing on the life and medical sciences. Why, in your opinion, are these not as prevalent in the social sciences?

I believe this is primarily because quantitative indicators are seen as fairly robust in the biomedical disciplines and are therefore, on the whole, reasonably well accepted by researchers in those fields.  This is not the case for the social sciences.  There is nothing surprising in this. The biomedical literature is well covered by major bibliometric databases.  In addition, sociological studies have given us much evidence on the meaning of citations in the life sciences and this, together with evaluative studies that have been shown to correlate well with peer review, means researchers have some confidence that measures based on the data are reasonably robust – though always with the proviso they are not used as a blunt instrument in isolation from peer or expert interpretation of the results.

The same can’t be said for the social sciences (or the humanities and arts).  There is some evidence that a citation in these disciplines has a different meaning – their scholarship does not build on past research in the same way that it does in the life sciences.  It is also well known that coverage of the social sciences is very poor in many disciplines, and only moderate in the best cases.  Evaluative studies that use only the indexed journal literature have sometimes demonstrated poor correlation to peer review assessments, and there is understandably little confidence in the application of the standard measures used in the life sciences.

What can be done to measure arts & humanities as well as social sciences better?

I think the most promising initiatives are those coming out of the European Science Foundation, which has for a number of years been investigating the potential for a citation index specifically constructed to cover these disciplines.  The problem is that, as it would need to cover books and many journals not indexed by the major citation databases, it is a huge undertaking.  Given the current European financial climate I don’t have much confidence that this initiative will progress very far in the short-term.  It is also an initiative fraught with problems, as seen in the ESF’s first foray into this domain with its journal classification scheme. Discipline and national interest groups have been very vocal in their criticisms of the initial lists, and a citation index is likely to be just as controversial.

Many scholars in these disciplines pin their hopes on Google Scholar (GS) to provide measures that take account of all their forms of scholarship.  The problem with GS is that it is not a static database, but rather a search engine.  As GS itself clearly points out, if a website disappears, then all the citations from publications found solely in that website will also disappear, so over time there can be considerable variability in results, particularly for individual papers or researchers.  In addition, it has to date been impossible to obtain data from GS that would enable world benchmarks to be calculated – essential information for any evaluative studies.

Do you think that open access publishing will have an effect on journals’ content quality, citations tracking and general impact?

The answers to these questions depend on what “open access publishing” means.  If it refers to making articles in the journal literature that are currently only accessible through paid subscription services publicly available, I would expect the journal “gatekeepers” – the editors and reviewers – to continue with the same quality control measures that currently exist.  If all (or most) literature becomes open access, then the short-term citation advantage that is said to exist for those currently in open access form will disappear, but general impact could increase as all publications will have the potential to reach a much wider audience than was previously possible.

But if “open access publishing” is interpreted in its broadest sense – the publishing of all research output irrespective of whether or not it undergoes any form of peer review – then there is potential for negative impact on quality.  There is so much literature in existence that researchers need some form of assessment to allow them to identify the most appropriate literature and avoid the all too real danger of being swamped by the sheer volume of what is available.  Some form of peer validation is absolutely essential.  That is not to say that peer validation must take the same form as that used by journals – it may be in the form of online commentary, blogs, or the like – but it is essential in some format.

Any new mode of publication presents its own challenges for citation tracking. On the one hand, open access publishing offers huge possibilities for much more comprehensive coverage of the literature, and potential efficiencies in harvesting the data. On the other hand, it presents problems for constructing benchmarks against which to judge performance: how is the “world” to be defined? Will we be able to continue using existing techniques for delineating fields? Will author or institutional disambiguation become so difficult that few analysts will possess the knowledge and computing power required to do it?

What forms of measurement, other than citations, should in your opinion be applied when evaluating research quality and output impact (e.g., usage, patents)?

It is important to use a suite of indicators that is as multi-dimensional as possible. In addition to citation-based measures, other measures of quality that may be relevant include those based on journal rankings, publisher rankings, journal impact measures (e.g., SNIP, SJR) and success in competitive funding schemes. Any indicator chosen must be valid, must actually relate to the quality of research, must be transparent, and must enable the construction of appropriate field-specific benchmarks. Even then, no single indicator, nor even a diverse suite of indicators, will give a definitive answer on quality; the data still need to be interpreted by experts in the relevant disciplines who understand the nuances of what the data are showing.

Choosing indicators of wider impact is a much more fraught task.  Those that are readily available are either limited in their application (e.g. patents are not relevant for all disciplines), or refer merely to engagement rather than demonstrated achievement (e.g. data on non-academic presentations given, or meetings with end-users attended).  And perhaps the biggest hurdle is attribution – which piece (or body) of work led to a particular outcome?  For this reason, the current attempts to assess the wider impact of academic research are focussing on a case study approach rather than being limited to quantitative indicators.  The assessment of impact in the UK’s Research Excellence Framework is the major example of such an approach currently being undertaken, and much information on this assessment approach can be found on the website of the agency overseeing this process – the Higher Education Funding Council for England.

See also: Research Impact in the broadest sense: REF 14

During your years as a university academic, did you notice a change among university leaders and research managers in the perception and application of bibliometrics?

From a global perspective, the biggest change has occurred since the appearance of university rankings such as the Shanghai Jiao Tong and THE rankings.  Prior to this, few senior administrators had much knowledge of the use of bibliometrics in performance assessments, other than the ubiquitous journal impact factor. The weightings given to citation data in the university rankings now ensure that bibliometrics are at the forefront of universities’ strategic thinking, and many universities have signed up to obtain the data relating to their own institution and use it internally for performance assessment.

In Australia, most university research managers had at least a passing knowledge of the use of bibliometrics in evaluation exercises by the 1990s, through the analyses undertaken by the unit I headed at The Australian National University, the Research Evaluation and Policy Project.  However, their interest increased with the announcement that bibliometrics were to form an integral part of a new performance assessment system for Australian universities – the Research Quality Framework, which was ultimately superseded by the ERA framework.  This interest was further heightened by the appearance of the institutional rankings mentioned above. While ERA is not currently linked to any substantial funding outcomes, it is expected to have financial implications by the time the results have been published from the second exercise, to be held in 2014.  Australian universities are now acutely aware of the citation performance of their academics’ publications, and many monitor that performance internally through their research offices.

The downside of all this increased interest in, and exposure to, bibliometrics is the proliferation of what some commentators have labelled “amateur bibliometrics” – studies undertaken by those with little knowledge of existing sophisticated techniques and no understanding of the strengths and weaknesses of the underlying data.  Sometimes the data are seriously misused, particularly in their application to assessing the work of individuals.

What are your thoughts about using social media as an indicator of scientific trends and researchers’ impact?

I have deep reservations about the use of data from social media to construct performance indicators. Such data relate more to popularity than to the inherent quality of the underpinning research, and at this point in time are incredibly easy to manipulate. They may be used to develop some idea of the outreach of a particular idea or set of research outcomes, but are unlikely to provide much indication of any real impact on the broader community. As with many of the new Web 2.0 developments, the biggest challenge is determining the meaning of any data that can be harvested, and judging whether any of it relates to real impact on the research community, on policy, on practice, or on other end-users of that research.



Identifying emerging research topics in Wind Energy research using author-given keywords

The value of well constructed thesauri as means for effective searching and structuring of information is something a seasoned searcher is very familiar with. Thesauri are useful for numerous information management objectives such as grouping, defining and linking terms, and identifying synonyms and near-synonyms as well as broader and narrower terms. Searches based on thesauri […]

Read more >


The value of well-constructed thesauri as a means of effective searching and structuring of information is something a seasoned searcher is very familiar with. Thesauri are useful for numerous information management objectives such as grouping, defining and linking terms, and identifying synonyms and near-synonyms as well as broader and narrower terms. Searches based on thesaurus terms are considered better in terms of both recall and precision (1,2,3).

Yet the construction of a comprehensive thesaurus is a laborious task which often requires the intervention of an indexer who is expert in the subject. Terms incorporated in a thesaurus are selected carefully and examined for their capability to describe content accurately while keeping the integrity of the thesaurus as a whole; such terms are referred to as controlled vocabulary or controlled terms. Uncontrolled vocabulary, on the other hand, consists of freely assigned keywords which the authors use to describe their work. These terms can usually be found as part of an abstract, and appear in most databases as “author keywords” or “uncontrolled vocabularies”. In today’s fast-moving world of science, where new discoveries and technologies develop rapidly, the pace at which thesauri capture new areas of research may be questioned, and so the value of using author keywords in retrieving new, domain-specific research should be examined.

This study sought to examine how thesaurus keywords and author keywords capture new and emerging research in the field of “Wind Energy”. The research questions were as follows:

  1. Do author keywords include new terms that are not to be found in a thesaurus?
  2. Can new areas of research be identified through author keywords?
  3. Is there a time lapse between the appearance of a keyword assigned by an author and its appearance in a thesaurus?

Methodology

In order to answer these questions we analyzed the controlled and uncontrolled terms of 4,000 articles grouped under the main heading “Wind Power” in Compendex, published between 2005 and 2012. Compendex is a comprehensive bibliographic database of scientific and technical engineering research, covering all engineering disciplines. It includes millions of bibliographic citations and abstracts from thousands of engineering journals and conference proceedings. When combined with the Engineering Index Backfile (1884-1969), Compendex covers well over 120 years of core engineering literature.

In each Compendex record, the controlled and uncontrolled terms are listed and can be searched.  Over 17,000 terms were extracted from the Compendex records and sorted by frequency. Two separate files were created: one containing all the controlled terms and the second containing the author-given keywords (i.e. uncontrolled terms). For each term, the number of times it appeared in each year from 2005 to 2012 and the total number of articles in which it appeared were recorded. In addition, a simple trend analysis compared the average number of times each term appeared in papers published during 2009–2012 with the same measure calculated for 2005–2008. This trend analysis gave a view of terms whose usage increased in the more recent period relative to the earlier one.
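As a rough illustration, this trend comparison can be computed in a few lines of Python; the term counts below are invented stand-ins for the extracted Compendex data, not real figures.

from collections import Counter

# Hypothetical per-year paper counts for one term, standing in for the
# counts extracted from the Compendex records (values are illustrative only).
term_years = {
    "wind farms": Counter({2005: 40, 2006: 55, 2007: 70, 2008: 90,
                           2009: 120, 2010: 260, 2011: 757, 2012: 500}),
}

def trend_ratio(counts, early=(2005, 2008), late=(2009, 2012)):
    """Average yearly frequency in the late window divided by the average
    in the early window; values above 1 indicate growing usage."""
    early_avg = sum(counts[y] for y in range(early[0], early[1] + 1)) / (early[1] - early[0] + 1)
    late_avg = sum(counts[y] for y in range(late[0], late[1] + 1)) / (late[1] - late[0] + 1)
    return late_avg / early_avg if early_avg else float("inf")

for term, counts in term_years.items():
    print(term, round(trend_ratio(counts), 2))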

To answer the research questions, the following steps were taken (a code sketch of steps 1–3 follows the list):

  1. All author keywords that appear 100 times or more were collected.
  2. The author keywords were searched in the Compendex Thesaurus: if an author keyword appeared, the year in which it was introduced was recorded.
  3. The author keyword was then searched for in Compendex across all years and the year in which it first appeared was recorded.
  4. The author keywords that appeared 100 times or more were grouped into themes. In addition, these author keywords were searched for in Compendex in order to identify their corresponding articles and the topics they cover.
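A minimal Python sketch of steps 1–3, using hypothetical stand-ins for the Compendex keyword exports and thesaurus entry years (none of these values are real):

# Hypothetical stand-ins for the Compendex exports (no values here are real):
# author_keyword_counts - total occurrences of each author keyword;
# thesaurus_intro_year  - year a term entered the Compendex Thesaurus, if ever;
# first_use_year        - first year a term appeared as an author keyword.
author_keyword_counts = {"wind farms": 757, "offshore wind farms": 310, "wind speed": 520}
thesaurus_intro_year = {"wind speed": 1993}   # "wind farms" has no thesaurus entry
first_use_year = {"wind farms": 1985, "offshore wind farms": 1993, "wind speed": 1970}

for keyword, count in author_keyword_counts.items():
    if count < 100:                               # step 1: frequent keywords only
        continue
    intro = thesaurus_intro_year.get(keyword)     # step 2: thesaurus entry year, if any
    first = first_use_year[keyword]               # step 3: first appearance in Compendex
    lag = (intro - first) if intro is not None else None
    print(f"{keyword}: first used {first}, thesaurus entry {intro}, lag {lag}")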

Findings

Table 1 shows the most recurring uncontrolled terms. The terms were categorized into four topic groups – Environment, Mechanics, Integration and Computerization – and comprise: Renewable energies; Renewable energy source; Wind energy; Wind speed; Wind Resources; Doubly-fed induction generator; Offshore wind farms; Permanent magnet; Synchronous generator; Wind farm(s); Wind turbine generators; Wind generators; Wind generation; Wind energy conversion system; Control strategies; Power grids; Power output; Grid-connected; Simulation result.

Table 1 - Most recurring uncontrolled terms in the retrieved articles, categorized into four topic groups. Source: Engineering Village

Looking at the corresponding literature within Compendex, three main topics emerged from the author keywords which indicate specialized areas of research within the overall ‘wind power’ main heading. These terms did not appear in the Compendex thesaurus.

Wind Farms: This term first appeared as an uncontrolled term (i.e. author keyword) in 1985, in an article by NASA researchers (4). The term refers to large areas of land on which wind turbines are grouped. Examples of such wind farms are the Alta Wind Energy Center (AWEC), located in the Tehachapi-Mojave Wind Resource Area in the USA, and the Dabancheng Wind Farm in China. This research covers a wide variety of topics, ranging from agriculture and turbine mechanics to effects on the atmosphere and power grid integration. The term has shown substantial growth in use as an author keyword between 2006 and 2012, with a peak of 757 articles in 2011 (see Figure 1).

In the thesaurus, however, this term is included under “Farm buildings”, which also covers livestock buildings and other structures found on farms.


Figure 1 - Use of keyword Wind Farm by authors. Source: Engineering Village

Offshore wind farms: This term first appeared as an uncontrolled term in 1993 (5) and refers to the construction of wind farms in deep waters. Examples of such wind farms include Lillgrund Wind Farm in Sweden and Walney in the UK. In the thesaurus, articles with this keyword are assigned to the term “Ocean structures”, which of course also covers other structures such as ocean drilling platforms, gas pipelines and oil wells. The use of this term has been growing steadily (see Figure 2), with a substantial increase between 2008 and 2011.


Figure 2 - Use of keyword Offshore Wind Farms by authors. Source: Engineering Village

Most surprising, however, is the fact that the term Wind energy itself doesn’t appear in the thesaurus at all. The topic as a whole appears under “Wind Power”, which also applies to damage caused by wind, wind turbulence, wind speed and so forth. The term has been used by authors since 1976, when it first appeared in an article from the UK Government’s Department of the Environment, Building Research Establishment (6), and has seen constant growth between 2006 and 2012 (see Figure 3).


Figure 3 - Use of keyword Wind Energy by authors. Source: Engineering Village

Other emerging topics include: wind energy integration into power grids, effects of wind farms on the atmosphere, and computer simulation and control software for wind farms and turbines.  In addition, comparing the most common uncontrolled and controlled terms reveals apparent differences in focus. While the uncontrolled vocabulary highlights wind speed and wind farms, the controlled vocabulary features Wind power, Electric utilities, and Turbomachine blades. This could be because the Compendex thesaurus is engineering-focused, thus giving the mechanics of wind power conversion prominent descriptors. In this case, author-given keywords are valuable because they provide a supplementary view on these topics by depicting the environmental aspects of these research articles. Table 2 illustrates the different foci of the keywords.

Uncontrolled terms | Controlled terms
Wind speed (43 papers, 10%) | Wind power (172 papers, 41%)
Wind farm (37, 9%) | Wind turbines (171, 41%)
Wind farms (22, 5%) | Computer simulation (74, 18%)
Wind turbine blades (17, 4%) | Mathematical models (73, 18%)
Fatigue loads (12, 3%) | Aerodynamics (72, 17%)
Wind energy (12, 3%) | Electric utilities (63, 15%)
Wind turbine wakes (12, 3%) | Turbomachine blades (58, 14%)
Control strategies (11, 3%) | Wind effects (49, 12%)
Offshore wind farms (11, 3%) | Rotors (48, 12%)
Power systems (11, 3%) | Wakes (45, 11%)

Table 2 - Most common controlled and uncontrolled terms on search. Source: Engineering Village

Discussion

Wind energy is by no means a new area of exploration, yet in the past 4 to 5 years the area has seen considerable growth in research output, especially in wind turbine technology and wind harvesting. Although the data sample analyzed is small and covers one subject field only, our findings illustrate that author keywords may indeed include new terms that are not to be found in a thesaurus. The use of thesaurus terms is usually recommended as part of a precision strategy in searching; yet in our case the controlled terms have a more general scope. Table 3 below summarizes some of our major conclusions as they pertain to the properties of author-given keywords and controlled terms in the search process. Our findings show that using author-given keywords as a search strategy is beneficial when one searches for more specific technologies and applications, or for new research areas within the overall topic (see Table 3).

Recall (controlled terms): Using controlled terms retrieves a larger number of articles, since articles are lumped under broader descriptors.
Precision (uncontrolled terms): Uncontrolled terms are very specific and enable retrieval of detailed topics.
Discoverability (both): Uncontrolled terms enable the discovery of new topics and can serve as indicators of the latest discoveries made in a field; controlled terms enable the clustering of such topics, thus enabling connections between larger numbers of articles and topics.
Serendipity (controlled terms): Controlled terms are broader, thus retrieving a larger number of articles and enabling serendipity through browsing.
State of the art (uncontrolled terms): Uncontrolled terms depict the latest descriptors of methods, applications and processes in a certain topic.

Table 3 - Evaluation of the impact of controlled and uncontrolled terms on search.

Our analysis showed, for example, that strongly emerging areas identified in our sample are wind farms and offshore wind farms. These terms, although appearing in the author-given keywords for over 20 years, have not entered the Compendex thesaurus. This could be because the Compendex database is engineering-focused and built to serve engineers, and therefore groups these articles under terms that are mechanical in nature. However, this might hinder a broader understanding of the topics in context.

In this case, using the thesaurus as the basis for searching Wind Energy articles would create broader result sets. Depending on the purpose of the search, this could be viewed as a positive or negative outcome. Our analysis shows that the two types of terms have different properties and serve different purposes in the search process. In the analysis of emerging topics, author-given keywords are useful tools, as they enable one to specify a topic in a way that is difficult when using only terms from a controlled thesaurus.

References

1. Sihvonen, A., & Vakkari, P. (2004) “Subject knowledge, thesaurus-assisted query expansion and search success”, Proceedings of RIAO 2004 Conference, pp. 393-404.
2. Sihvonen, A., & Vakkari, P. (2004) “Subject knowledge improves interactive query expansion assisted by a thesaurus”, Journal of Documentation, Vol. 60, No. 6, pp. 673-690.
3. Shiri, A.A., Revie, C., & Chowdhury, G. (2002) “Thesaurus-enhanced search interfaces”, Journal of Information Science, Vol. 28, No. 2, pp. 111-122.
4. Neustadter, H.E., & Spera, D.A. (1985) “Method for Evaluating Wind Turbine Wake Effects on Wind Farm Performance”, Journal of Solar Energy Engineering, Transactions of the ASME, Vol. 107, No. 3, pp. 240-243.
5. Olsen, F., & Dyre, K. (1993) “Vindeby off-shore wind farm - construction and operation”, Wind Engineering, Vol. 17, No. 3, pp. 120-128.
6. Rayment, R. (1976) “Wind Energy in the UK”, Building Services Engineer, No. 44, pp. 63-69.


Bibliometrics and Urban Research, part II: Mapping author affiliations

The previous issue of Research Trends presented a preliminary keyword analysis of urban research, in which three branches of the overall discipline are defined and contrasted. The analysis shows that not only do researchers in these three areas discuss different elements of urban studies, they also tend to be based in different countries. Together these […]

Read more >


The previous issue of Research Trends presented a preliminary keyword analysis of urban research, in which three branches of the overall discipline are defined and contrasted. The analysis shows that not only do researchers in these three areas discuss different elements of urban studies, they also tend to be based in different countries. Together these suggest a “limited integration of research efforts undertaken by those who work explicitly in urban studies, social scientists who work in cities, and scientists who are concerned with the environmental impacts of urban development.” (1,2)

As well as looking at the countries that authors come from, it is also possible to look at author distributions in finer detail: rather than assigning all authors with a UK affiliation to the nation as a whole, we can view the specific location of each affiliation on a map (and only group together those that are actually in the same place). The methods used to map author affiliations from the Scopus database are set out by Bornmann et al. (3), and here we follow their process to show author distributions in the three branches of urban research: Sciences, Social Sciences and Urban Studies.

The affiliation plot

There are certain differences when you work with full author affiliations rather than country data alone. First, papers can be assigned to multiple locations within a country: for example, a paper co-authored by researchers from institutes in Lille and Paris is shown at both locations, rather than as a single paper for France. Second, distributions within a country can be seen: for example, the capital city might be host to all of a country’s active researchers, or they could be spread across the country. Third, you can make direct comparisons between cities or institutes to see which published the most.
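As a toy illustration of this per-location assignment, the snippet below counts each paper once for every distinct affiliation city it lists; the records are invented for the example.

from collections import Counter

# Invented example records: each paper lists the cities of its author affiliations.
papers = [
    {"id": "p1", "cities": ["Lille", "Paris"]},       # counted at both locations
    {"id": "p2", "cities": ["Paris"]},
    {"id": "p3", "cities": ["Amsterdam", "Paris"]},
]

location_counts = Counter()
for paper in papers:
    # A paper contributes once per distinct location, not once per author.
    for city in set(paper["cities"]):
        location_counts[city] += 1

print(location_counts)   # Counter({'Paris': 3, 'Lille': 1, 'Amsterdam': 1})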

The first grouping of urban research consists of relevant papers within a set of 38 journals assigned to the Thomson Reuters urban studies cluster. We have seen that papers come mainly from the US, the UK, Australia, Canada and the Netherlands; but there is a long list beyond the top five, and it quickly becomes difficult to retain a sense of all the countries. Plotting the locations on a map immediately shows the distribution of authors and the quantities from different regions of the world (see Figure 1a).


Figure 1a - Distribution of urban studies authors in 2010. Following the method described by Bornmann et al. (3), circles are sized and colored according to the number of papers originating from each location. Data source: Scopus
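A schematic version of such a plot can be produced with matplotlib’s scatter function, sizing and coloring each marker by its paper count; the coordinates and counts below are placeholders, and a real map would draw coastlines with a dedicated mapping library.

import matplotlib.pyplot as plt

# Placeholder (longitude, latitude, paper count) triples for a few locations.
locations = [(-0.13, 51.51, 25), (2.35, 48.86, 18), (116.40, 39.90, 12), (151.21, -33.87, 7)]
lons, lats, counts = zip(*locations)

plt.scatter(lons, lats,
            s=[20 * c for c in counts],    # marker area grows with paper count
            c=counts, cmap="viridis",      # color also encodes paper count
            alpha=0.6, edgecolors="black")
plt.colorbar(label="papers")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Papers per affiliation location (schematic)")
plt.show()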

Large countries such as the US, Australia and China benefit particularly from such a map, as institutes across the country can be located and compared. In China’s case, there are multiple papers from Beijing, Shanghai, Wuhan, Nanjing and Guangzhou, as well as Hong Kong.

The map also allows you to see the overall distribution at a single glance, including both the strong contributions in Europe and the US and the single papers from Argentina, Ghana, Nigeria, Ethiopia, Saudi Arabia, Pakistan, and Indonesia, among others.

We can also examine the same search over a number of years to see whether the distribution of authors changes over time. Figure 1b shows the publication years 2006 to 2010: while the smaller contributors appear and disappear each year, the larger locations remain fairly steady, and the concentration of authors in the US and Europe appears no weaker in 2010 than in previous years.


Figure 1b - Distribution of urban studies authors in the years 2006 to 2010. Following the method described by Bornmann et al. (3), circles are sized and colored according to the number of papers originating from each location. Data source: Scopus

In the map of 2010 author affiliations, 389 locations are marked, accounting for the 643 articles and reviews published. Each location therefore accounts for 1.65 papers on average; this represents a slight increase over previous years, when locations accounted for 1.46 to 1.60 papers on average (see Table 1).

Publication year Locations Papers Papers per location
2006 344 529 1.538
2007 347 553 1.594
2008 335 490 1.463
2009 371 553 1.491
2010 389 643 1.653

Table 1 - The number of locations (in author affiliations) for each year, and the number of papers published in each year in the urban studies grouping. Source: Scopus

From one discipline to another

The other two branches of urban research are those published in Social Science and in Science journals, respectively. These can be compared using the same approach as above, altered here to look only at the authors of the top-cited papers in each discipline. As the analysis includes both articles and reviews, which have different expected citation counts, we rank the articles and reviews separately and take the top 10% of each according to citations.  This allows us to map the distribution of the authors of the highest-impact articles and reviews together. Figure 2 shows the resulting distributions in the Social Sciences and Science clusters, plotted in different colors. Differences are apparent through a comparison of red (Social Science) and cyan (Science) authors. Some regions, such as South Africa and Australia, have more prominence in the Social Sciences; others, such as continental Europe, show a greater presence in the Sciences.
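The separate top-10% selection described here might be sketched as follows; the document types and citation counts are invented, and a full analysis would also control for field and publication year.

# Invented records: document type and citation count for each paper.
papers = [
    {"id": 0, "type": "article", "cites": 50}, {"id": 1, "type": "article", "cites": 3},
    {"id": 2, "type": "article", "cites": 12}, {"id": 3, "type": "review", "cites": 90},
    {"id": 4, "type": "review", "cites": 8},   {"id": 5, "type": "article", "cites": 1},
    {"id": 6, "type": "review", "cites": 30},  {"id": 7, "type": "article", "cites": 25},
    {"id": 8, "type": "article", "cites": 7},  {"id": 9, "type": "article", "cites": 40},
]

def top_decile(records):
    """Return the most-cited 10% of records (at least one record)."""
    ranked = sorted(records, key=lambda r: r["cites"], reverse=True)
    return ranked[:max(1, len(ranked) // 10)]

# Rank articles and reviews separately, then pool the two top slices.
highly_cited = []
for doc_type in ("article", "review"):
    subset = [r for r in papers if r["type"] == doc_type]
    highly_cited.extend(top_decile(subset))

print(sorted(r["id"] for r in highly_cited))   # -> [0, 3]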

Figure 2 - Distribution of highly-cited Social Science (red) and Science (cyan) urban research authors in 2010. Where authors in the different disciplines are from the same location, this is shown by a darker red or darker cyan than where there is no overlap. Data source: Scopus

The maps of author affiliations show a finer level of detail than any aggregated country data can provide, and they allow much more immediate interpretation of the affiliation data. We looked at the distributions of authors – whether including all authors, or only highly-cited ones – in the three identified branches of urban research.

There are two elements that may improve this approach further. The first is to include impact data more directly in the mapping process. The second would be to look at collaboration: here papers are duplicated for each affiliation, and there is no sense of the partnerships that go into each paper’s creation; a comparison of the collaborative trends in the various urban research clusters would add even deeper insight into their natures.

References

1. Kirby, A., & Kamalski, J. (2012) “Bibliometrics and Urban Research”, Research Trends, No. 28.
2. Kamalski, J., & Kirby, A. (2012, in press) “Bibliometrics and urban knowledge transfer”, Cities. http://dx.doi.org/10.1016/j.cities.2012.06.012.
3. Bornmann, L. et al. (2011) “Mapping excellence in the geography of science: An approach based on Scopus data”, Journal of Informetrics, Vol. 5, No. 4, pp. 537–546.
