About NORA

Table of contents

Introduction, motivation and aims

Contributors and funders

Data and technology

Overview of architecture

Data sources and acquisition

Data mapping

Luigi and MongoDB data store

VIVO backend

Neo4j backend

VIVO frontend

Tableau frontend

VOSviewer frontend

Search Module

General Search

Publications Search

Datasets Search

Grants Search

Patents Search

Clinical Trials Search

University Profiles

Sustainable Development Goals

International Collaboration

National Collaboration

Network Analytics

Map of Science

Collaboration Map

  1. Introduction, motivation and aims

National Open Research Analytics (NORA) is a prototype for Denmark, built with Dimensions data and open software tools and covering the years 2014-2019. NORA provides curated research analytics for stakeholders such as universities, hospitals, and governmental institutions. NORA is one of the outcomes of the OPERA project (Open Research Analytics), https://deffopera.dk. The NORA prototype was launched in November 2020.

Open and advanced research analytics at a national level, based on open data, open citations and open-source software, is not yet the norm. This makes it difficult to understand the many aspects, patterns, impacts and potentials of the Danish research landscape. The NORA pilot is an attempt to develop an analytical overview of the research landscape of the Danish universities and hospitals. NORA will focus on comprehensive coverage of metadata, citations and other impact indicators, on precise search functionalities, and on analytical visualizations that give overviews as well as access to explore the underlying data in full detail.

NORA is based on data on publications, datasets, patents, grants, and clinical trials from the Dimensions database, as well as data from the two Danish national indicators, the Open Access Indicator and the Bibliometric Research Indicator.

  1. Contributors and funders

NORA is a joint effort between several people and organizations:

Financial support - as part of the OPERA project 

2018-2019: DEFF, Denmark’s Electronic Research Library.

2019-2020: The Danish Agency for Science and Higher Education.

Data, tools and API services

Dimensions: Brian Kirkegaard Lunn and Bo Alroe, Digital Science

Danish indicators: Franck Falcoz, Vox Novitas

MongoDB: https://www.mongodb.com/ 

Neo4J: https://neo4j.com/ 

VIVO: https://duraspace.org/vivo/ 

Tableau: https://www.tableau.com/ 

VOSviewer: https://www.vosviewer.com/ 

Software deployment, integration and enhancement 

VIVO: Brian Lowe, Ontocale SRL

MongoDB, Neo4j, Tableau and VOSviewer: Pedro Parraguez and Nelson Guaman, Dataverz ApS

IT service and hardware

Michael Rasmussen and Martin Holmquist Schimmel, DTU Research IT service

Bibliometric development, visualization concepts and support

DTU: Karen Hytteballe Ibanez, Mogens Sandfær, Christina Steensboe, Nikoline Dohm Lauridsen

AAU: Anne Lyhne Høj, Poul Melchiorsen, Birger Larsen

AU: Kirsten Krogh Kruuse, Lars Lund Thomsen

CBS: Lene Hald, Claus Rosenkrantz, Dicte Westergaard Madsen

KU: Marianne Gauffriau, Adrian Price

RUC: Kasper Bruun

SDU: Karen Lindvig, Mette Detlevsen

  1. Data and technology

  1. Overview of architecture

  1. Data Ingestion

A more detailed explanation of the steps between the data inputs and ingestion into the MongoDB backend is provided in the figure below.

  1. Code Repository

The core code produced by OPERA-NORA has been documented and made publicly available at: https://github.com/RAP-research-output-impact 

  1. Data sources and acquisition

Dimensions 

NORA uses data from the commercial database Dimensions as its primary source. Although it is a commercial database, Dimensions offers a free version with access to publications and datasets, but with limited access to filters, patents, grants and clinical trials.

The NORA pilot contains all data from the Dimensions database in the year range 2014-2019 from Danish Universities and Hospitals, generously provided by Digital Science. All records in NORA:

The choice of Dimensions data aligns with the aim of the Open Research Analytics (OPERA) project: through tests and explorations, to establish open and advanced research analytics practices and systems at Danish universities. Dimensions is a newer alternative to more established citation databases. It offers more content types, which allows for a more holistic view of the Danish research landscape, and it provides free access to a large part of its content.

Limitations

The Dimensions database has some limitations which are reflected in the NORA dataset. These are addressed below:

Danish funders (especially public funders) are not yet well represented in Dimensions. This is a focus area for Digital Science, but at this point funding data on Danish-funded research output is underrepresented, which is why funder data is not yet included in the various NORA dashboards.

In Dimensions, organisational affiliation is based on an established connection between publisher metadata and the GRID organisational registry. If this connection is not established, the publication cannot be queried in Dimensions by “Organisation”, even though the publication is in fact indexed in Dimensions.

To make sure that all publications from Danish universities for 2014-2019 are indexed in NORA, we have decided to identify and index these “non-GRID” publications. This is done by querying the DOIs for 2014-2019 from Danish universities that are available from the Danish Research Database (DDF). All Danish universities register their local research output, which is then harvested by DDF. It has not been possible to add additional DOIs from Danish hospitals that are not also covered by the universities.

In the NORA publication record, we have added a filter, “Retrieval”, that shows how the publication was retrieved:

Establishing the correct affiliation between publication and organisation is a focus area for Digital Science, and the matching is continually being improved.
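The two retrieval routes described above (an organisation-based query and a complementary DOI-based query) can be illustrated with a minimal sketch. This is not the NORA pipeline itself: the record layout, the function name and the labels "GRID" and "DOI" are assumptions made for illustration, and the DOIs are made up.

```python
def merge_retrievals(grid_records, doi_records):
    """Merge two retrieval sets keyed by DOI and tag how each record was found.

    The "retrieval" labels used here are hypothetical placeholders.
    """
    merged = {}
    for rec in grid_records:
        # DOIs are case-insensitive, so normalise before using them as keys.
        merged[rec["doi"].lower()] = {**rec, "retrieval": "GRID"}
    for rec in doi_records:
        key = rec["doi"].lower()
        # Only add records that the organisation-based query missed.
        if key not in merged:
            merged[key] = {**rec, "retrieval": "DOI"}
    return list(merged.values())


grid_hits = [{"doi": "10.1000/a", "title": "Paper A"}]
doi_hits = [
    {"doi": "10.1000/A", "title": "Paper A"},  # same paper, different casing
    {"doi": "10.1000/b", "title": "Paper B"},  # missed by the organisation query
]
records = merge_retrievals(grid_hits, doi_hits)
```

Keeping the retrieval route on the record is what makes it possible to expose it later as a filter.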

 

Coverage

As with other commercial databases, Dimensions does not provide 100% coverage of research output from the Danish universities. In particular, universities with a strong focus on the humanities and social sciences are less well covered in terms of publications than universities oriented towards science, technology and medicine. We stress this data gap in the various dashboards, which offer visual analytical insights at an overall level.

The OPERA project group has conducted a number of tests focusing on coverage, quality and comparisons of Dimensions data. The results have been shared with Digital Science for the purpose of improving the coverage and quality of the Dimensions data displayed in NORA.

Update frequency 

The data in NORA was last updated in late September 2020. The next update is planned for December 2020.

Danish Research Indicators

Publication records are enriched with indicator data from two Danish research indicators:

Open Access Indicator: The Danish Open Access Indicator (OAI) monitors the share of Open Access at national and university level. In NORA, publication records are enriched with the following OAI-specific metadata:

Bibliometric Indicator: The Danish Bibliometric Research Indicator (BFI) is an element of the performance-based model for distribution of the new block grant based on the production of research-based publications for all Danish universities. In NORA, publication records are enriched with the following BFI-specific metadata:

  1. Data mapping

In order to display data at an overall level and enable large-scale analyses, a number of category mappings are implemented in NORA. The mappings are documented in detail on the GitHub page, with a condensed overview below.

Dimensions Subject Field of Research (FOR) 2-digit -> Main Subject Areas

The 22 Field of Research (FOR) categories used in Dimensions for article-level subject classification have been mapped to 5 subject areas in NORA. FOR is part of the Australian and New Zealand Standard Research Classification (ANZSRC).

Dimensions DK Universities -> University Acronyms

8 Danish universities identified in Dimensions based on GRID have been mapped to 8 university acronyms.

Dimensions DK Health Organisations -> Main Hospital regions

53 hospitals identified in Dimensions based on GRID (Search: Type = Healthcare AND Country = Denmark) have been mapped to 5 main hospital regions in NORA.

Dimensions Organisation types -> Organisation types

8 organisation types identified in Dimensions based on GRID (Search: Type) have been mapped to 4 types in NORA.

Dimensions Countries -> Continents

177 countries identified in Dimensions based on GRID have been mapped to 6 continents in NORA.
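As an illustration of how such a category mapping can be applied, the sketch below maps countries to continents with a lookup table. It is a minimal example under stated assumptions: the table is a tiny illustrative excerpt rather than the NORA mapping, and the function name and the "Unmapped" fallback are hypothetical.

```python
from collections import Counter

# Illustrative excerpt only; the actual mapping tables are documented
# in the project repository.
COUNTRY_TO_CONTINENT = {
    "Denmark": "Europe",
    "Sweden": "Europe",
    "Japan": "Asia",
    "Kenya": "Africa",
}


def map_categories(values, mapping, default="Unmapped"):
    """Map each detailed value to its broader category.

    Values missing from the mapping get an explicit fallback label so that
    gaps in the mapping stay visible instead of being silently dropped.
    """
    return [mapping.get(value, default) for value in values]


countries = ["Denmark", "Japan", "Sweden", "Atlantis"]
continents = map_categories(countries, COUNTRY_TO_CONTINENT)
counts = Counter(continents)
```

The same pattern would apply to the other mappings (subject areas, hospital regions, organisation types), each with its own lookup table.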

  1. Luigi and MongoDB data store

We chose MongoDB as a "single source of truth" to store structured and unstructured data in the form of a service-agnostic NoSQL document database that collects data from multiple external sources (e.g. the Dimensions API and CSVs with additional data collected internally).

Through internal APIs, MongoDB serves the data required by both Neo4j and the VIVO triple store. We store the “raw” data from each of our sources and then serve it on demand to the downstream applications in a consistent and reliable way.

  1. VIVO backend

The VIVO ontology provides a model for representing and interlinking the researchers and scholars in the Danish academic landscape to their collaborators and scholarly outputs, and also serves as a ready point of extension for representing additional details of the relationships implied by the data in the raw document store.  

NORA converts the raw document data to RDF triples structured according to the VIVO ontology. The VIVO software builds a Solr search index from these triples. NORA adds a number of custom VIVO search index modifiers to support fine-grained filtering of search results by multiple facets.
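To give an idea of what converting a raw document to RDF triples involves, here is a minimal sketch that serialises one publication record as N-Triples. It is not the NORA conversion code: the record fields, the `http://example.org/individual/` namespace and the choice of properties are assumptions for illustration. The BIBO class and property shown are commonly used alongside the VIVO ontology, but the actual NORA modelling may differ.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BIBO = "http://purl.org/ontology/bibo/"
BASE = "http://example.org/individual/"  # hypothetical namespace


def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def document_to_ntriples(doc):
    """Turn one raw publication document into a list of N-Triples lines."""
    subject = f"<{BASE}pub-{doc['id']}>"
    title = escape_literal(doc["title"])
    lines = [
        f"{subject} <{RDF_TYPE}> <{BIBO}AcademicArticle> .",
        f'{subject} <{RDFS_LABEL}> "{title}" .',
    ]
    if doc.get("doi"):
        doi = escape_literal(doc["doi"])
        lines.append(f'{subject} <{BIBO}doi> "{doi}" .')
    return lines


triples = document_to_ntriples(
    {"id": "123", "title": 'A "quoted" title', "doi": "10.1000/a"}
)
```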

  1. Neo4j backend

We chose the free Community Edition of Neo4j as our labelled property graph database because its navigational nature allows for fast analytics and flexible exploratory queries. It provides native graph storage, which is better suited to deep and variable-length traversals and path queries.

Combined with VIVO's triple store, this gives us the best of both worlds: the strong index-based nature of VIVO's triple store, which is ideal for traditional search operations, and the analytical flexibility of Neo4j's index-free adjacency graph database, which is particularly suited to serve as the backend of visualisation services such as VOSviewer.
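The idea of a variable-length traversal can be illustrated without a graph database. The sketch below is a plain-Python stand-in, not Neo4j code: it walks an adjacency list breadth-first and returns every node reachable within a given number of hops. The organisation names and the function name are hypothetical.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Return {node: hops} for every node reachable from start in 1..max_hops steps."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the requested depth
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    del distances[start]
    return distances


# Each node directly references its neighbours, so following a link
# does not require a lookup in a global index.
co_authorship = {
    "Org A": ["Org B"],
    "Org B": ["Org A", "Org C"],
    "Org C": ["Org B", "Org D"],
    "Org D": ["Org C"],
}
reachable = within_hops(co_authorship, "Org A", 2)
```

In a graph database the same question is expressed as a path pattern with a length range, and the engine performs the traversal natively.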

  1. VIVO frontend

NORA extends VIVO’s front-end interface to offer highly customized views of publication, dataset, grant, patent and clinical trial details as well as university profiles. Links to related entities generally offer the user a set of search results involving that entity, inviting further exploration through search filters rather than by traversing the graph link by link. VIVO’s search interface is significantly extended, not only to offer a large number of additional facets for result filtering but also to permit combining them in different AND and OR groupings.
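One possible way to think about combining facets in AND and OR groupings is sketched below. This is an illustrative model only, not the VIVO or Solr implementation: it assumes that values inside a group are combined with OR and that groups are combined with AND, and the facet names and records are hypothetical.

```python
def matches(record, groups):
    """Check a record against a list of OR-groups that are ANDed together.

    Each group is a list of (facet, value) pairs; a group matches if any
    pair matches, and the record matches only if every group matches.
    """
    return all(
        any(value in record.get(facet, []) for facet, value in group)
        for group in groups
    )


records = [
    {"id": 1, "university": ["DTU"], "year": [2018]},
    {"id": 2, "university": ["AAU"], "year": [2019]},
    {"id": 3, "university": ["KU"], "year": [2018]},
]

# (university = DTU OR university = AAU) AND (year = 2018)
query = [
    [("university", "DTU"), ("university", "AAU")],
    [("year", 2018)],
]
hits = [record["id"] for record in records if matches(record, query)]
```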

  1. Tableau frontend

We use Tableau Public in order to publish rich dashboards that summarise the more than 200K records in our database and provide an easily accessible and visual exploration surface for our users. Tableau is directly connected to our graph database. In that way, we can pull information from the graph itself, making it possible to, for example, explore the vast amount of international research collaborations.

  1. VOSviewer frontend

The OPERA project is one of the first users of the newest version of VOSviewer, which now exists as a rich web application. One of the biggest advantages of this online version of VOSviewer, apart from its ability to be embedded as a native web component, is the possibility of feeding it data from a server on the fly. In our case, we take advantage of this feature by connecting VOSviewer to our graph database through custom middleware developed internally, which feeds VOSviewer in near real time.
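The role of such a middleware layer can be sketched as a function that reshapes graph query results into a JSON network document. This is a hedged illustration, not the NORA middleware: the function name and input tuples are hypothetical, and the field names (`network`, `items`, `links`, `source_id`, `target_id`, `strength`) follow our understanding of the VOSviewer JSON network format and should be checked against the VOSviewer documentation.

```python
import json


def to_vosviewer_json(nodes, edges):
    """Serialise (id, label) nodes and (source, target, weight) edges as JSON."""
    items = [{"id": node_id, "label": label} for node_id, label in nodes]
    links = [
        {"source_id": source, "target_id": target, "strength": weight}
        for source, target, weight in edges
    ]
    return json.dumps({"network": {"items": items, "links": links}})


# Rows as they might come back from a graph query (made-up values).
payload = to_vosviewer_json(
    nodes=[(1, "DTU"), (2, "KU")],
    edges=[(1, 2, 5)],
)
```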

  1. Search Module

  1. General Search

The search module is the search interface for the Dimensions data 2014-2019 for Danish universities and hospitals. The module allows searching across content types based on predefined filters and/or free text search.

NORA includes five content types from Dimensions: publications, datasets, grants, patents and clinical trials.

General filters are similar across all content type records. This allows for cross content type filtering.

  1. Publications Search

Publication records are identified in Dimensions based on two different approaches:

The retrieval status is included in the publication record and as a filter.

Content type specific filters

  1. Datasets Search

Dataset records are identified in Dimensions based on two different approaches:

Content type specific filters

  1. Grants Search

Grants records are identified in Dimensions based on:

Content type specific filters:

Choices

Grants do not have a “Publication Year”, so Grant records are identified using the metadata field “Active Year”. This search parameter identifies all Grants that are active within the 2014-2019 time scope, regardless of start year. The filter “Start Year” therefore shows values outside the 2014-2019 scope.
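The effect of selecting on active years rather than on a single year can be shown with a small interval-overlap check. The sketch assumes that a grant counts as active in every year between its start and end year, which is our reading of “Active Year” and not a documented definition. The field names and grants are made up.

```python
SCOPE_START, SCOPE_END = 2014, 2019


def active_in_scope(start_year, end_year, scope=(SCOPE_START, SCOPE_END)):
    """True if the interval [start_year, end_year] overlaps the scope interval."""
    return start_year <= scope[1] and end_year >= scope[0]


grants = [
    {"id": "g1", "start_year": 2011, "end_year": 2015},  # starts before the scope
    {"id": "g2", "start_year": 2019, "end_year": 2023},  # ends after the scope
    {"id": "g3", "start_year": 2008, "end_year": 2012},  # never active in scope
]
selected = [
    grant["id"]
    for grant in grants
    if active_in_scope(grant["start_year"], grant["end_year"])
]
```

Grant g1 is selected even though it started in 2011, which is why start years outside 2014-2019 appear in the filter.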

  1. Patents Search

Patent records are identified in Dimensions based on:

Content type specific filters

  1. Clinical Trials Search

Clinical Trial records are identified in Dimensions based on:

Content type specific filters

Choices

Clinical trials do not have a “Publication Year”, so Clinical Trial records are identified using the metadata field “Active Year”. This search parameter identifies all Clinical Trials that are active within the 2014-2019 time scope, regardless of start year. The filter “Start Year” therefore shows values outside the 2014-2019 scope.

  1. University Profiles

We have created this view to provide a quick overview of some of the key variables that describe the production of scientific publications for each university. Each of the panels in this view acts as a cross-filter, which allows the user to click on any variable to isolate the results relevant to that particular filter. For example, clicking on a country provides a quick overview of the evolution of co-authoring collaborations with that country, the collaborating organisations in that country, the top publishers, top journals, topics and sustainable development goals.

Data

Dimensions data for scientific publications published between 2014 and 2019 by the selected Danish university.

The original query to extract data from Dimensions includes documents authored by that organisation, identified by its GRID id. In addition, a second, complementary query uses DOIs extracted from the Current Research Information System (CRIS) of the selected organisation for the 2014-2019 period, in order to add Danish DOIs and author affiliations missing from the Dimensions database.

Data gaps

Given restrictions in the data source, we are unable to include scientific publications co-authored with organisations that do not have a GRID id. We are also restricted to records that exist within the Dimensions dataset.

Choices

To maintain a sense of scale, “co-authoring collaborations” does not show collaborations within the selected university itself.

  1. Sustainable Development Goals

The focus of this dashboard is on the exploration of SDG-related scientific publications produced by one or more of the 8 Danish universities. The exploration is divided into four views: 1) “Denmark”, which provides an overview of the total SDG-related production for each of the 17 SDGs and the 8 Danish universities. This is an ideal starting point for an examination of the relative focus of the universities and the distribution of outputs across SDGs. 2) “Over time”, which, as the name implies, seeks to provide a sense of the overall trends and evolution, either for the 8 Danish universities together or one by one. 3) “Per University”, which puts the emphasis on a detailed examination of the SDGs one university at a time. 4) “Per goal”, which puts the emphasis on a detailed examination of the production one SDG at a time.

Data

Dimensions data for scientific publications tagged under one or more SDGs and published between 2014 and 2019 by one or more of the 8 Danish universities.

The original query to extract data from Dimensions includes documents authored by at least one organisation with a Danish GRID id in the categories “education” or “healthcare”. In addition, a second, complementary query uses DOIs extracted from the Current Research Information Systems (CRIS) of the Danish research institutions for the 2014-2019 period, in order to add Danish DOIs and author affiliations missing from the Dimensions database.

The overarching approach used by Dimensions was “to generate the 17 individual SDG publication training sets relied on producing and curating extensive and specific search queries, thus avoiding the need to have a manual build-up of a sufficiently large enough publication training set.

Keyword search strings for each of the goals were defined in order to produce training sets based on publications from the Dimensions platform. Key phrases and terminology were based on UN definitions of SDGs, including the target and indicator definitions, and narratives.”

Data gaps

The main limitation is related to the accuracy of the classification process. Additional information about the classification process as well as its strengths and weaknesses is available at: https://digitalscience.figshare.com/articles/Contextualizing_Sustainable_Development_Research/12200081

Choices

We only include scientific publications classified by Dimensions under one or more SDGs. To make the document counts and percentages easier to interpret and explain, we have not applied fractional counting. This means, for example, that a publication tagged under two SDGs is counted twice, once for each SDG under which it is tagged.
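The difference between the full counting used here and fractional counting can be made concrete with a small worked example. The publications and SDG tags below are made up, and the function names are hypothetical.

```python
from collections import Counter

publications = [
    {"id": "p1", "sdgs": ["SDG 3", "SDG 13"]},
    {"id": "p2", "sdgs": ["SDG 3"]},
    {"id": "p3", "sdgs": ["SDG 7", "SDG 13"]},
]


def full_counts(pubs):
    """Full counting: each publication adds 1 to every SDG it is tagged with."""
    counts = Counter()
    for pub in pubs:
        counts.update(set(pub["sdgs"]))
    return counts


def fractional_counts(pubs):
    """Fractional counting: each publication's single unit is split across its SDGs."""
    counts = Counter()
    for pub in pubs:
        tags = set(pub["sdgs"])
        for tag in tags:
            counts[tag] += 1 / len(tags)
    return counts


full = full_counts(publications)
fractional = fractional_counts(publications)
```

With full counting the per-SDG counts add up to more than the number of publications (here 5 counts for 3 publications), so counts per SDG should not be summed to obtain a total number of publications.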

  1. International Collaboration

The focus of this set of views is on co-authoring collaborations between the 8 Danish universities and the rest of the world. The dashboard consists of two main views: the “continent” view, which shows the continents and countries with which the Danish universities co-author scientific publications, and the “country” view, which goes one level deeper and shows the countries and the organisations within those countries.

Data 

Dimensions data for scientific publications published between 2014 and 2019 by one or more of the 8 Danish universities and one or more foreign organisations.

The original query to extract data from Dimensions includes documents authored by at least one organisation with a Danish GRID id in the categories “education” or “healthcare”. In addition, a second, complementary query uses DOIs extracted from the Current Research Information Systems (CRIS) of the Danish research institutions for the 2014-2019 period, in order to add Danish DOIs and author affiliations missing from the Dimensions database.

Data gaps

Given restrictions in the data source, we are unable to include scientific publications co-authored with organisations that do not have a GRID id. We are also restricted to records that exist within the Dimensions dataset.

Choices

For Denmark, we only include the following 8 Danish universities: DTU, CBS, AAU, AU, KU, RUC, ITU, SDU.

When we display the total number of publications for Europe, we are not counting publications authored only by Danish organisations.

In the country view, we are not showing Denmark to stay focused on international collaborations. We reserve that information for the view on national collaborations.

  1. National Collaboration

The focus of this set of views is on co-authoring collaborations among the 8 Danish universities and between those universities and other Danish organisations. The dashboard consists of two main views: 1) the “danish collaborations” view, which shows a map of Denmark and the organisations with which the 8 Danish universities collaborate, and 2) the “matrix” view, which focuses on the collaborations between the 8 Danish universities using an adjacency matrix.

Data

Dimensions data for scientific publications published between 2014 and 2019 by one or more of the 8 Danish universities, with or without co-authors from other Danish organisations.

The original query to extract data from Dimensions includes documents authored by at least one organisation with a Danish GRID id in the categories “education” or “healthcare”. In addition, a second, complementary query uses DOIs extracted from the Current Research Information Systems (CRIS) of the Danish research institutions for the 2014-2019 period, in order to add Danish DOIs and author affiliations missing from the Dimensions database.

Data gaps

Given restrictions in the data source, we are unable to include scientific publications co-authored with organisations that do not have a GRID id. We are also restricted to records that exist within the Dimensions dataset.

Choices

We are only including scientific publications that have as a co-author at least one of the following 8 Danish universities: DTU, CBS, AAU, AU, KU, RUC, ITU, SDU.

In the matrix view, we are not showing the diagonal that displays the publications between a given organisation and itself. This is to present a view focused on inter-organisational collaborations.  
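A matrix of this kind can be sketched as follows. This is a minimal illustration with made-up publications, not the dashboard's data preparation. It assumes that each publication is reduced to the list of universities among its affiliations, and it blanks the diagonal with `None`.

```python
from itertools import combinations

UNIVERSITIES = ["DTU", "CBS", "AAU", "AU", "KU", "RUC", "ITU", "SDU"]


def collaboration_matrix(publications, orgs=UNIVERSITIES):
    """Build a symmetric co-authorship matrix with the diagonal left blank."""
    index = {org: i for i, org in enumerate(orgs)}
    matrix = [[0] * len(orgs) for _ in orgs]
    for affiliations in publications:
        # Count each university at most once per publication.
        present = sorted({a for a in affiliations if a in index}, key=index.get)
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    for i in range(len(orgs)):
        matrix[i][i] = None  # an organisation paired with itself is not shown
    return matrix


pubs = [["DTU", "KU"], ["DTU", "KU", "AU"], ["DTU"]]
matrix = collaboration_matrix(pubs)
dtu, au, ku = (UNIVERSITIES.index(name) for name in ("DTU", "AU", "KU"))
```

A publication written by a single university (the third one above) contributes to no cell, which matches the focus on inter-organisational collaboration.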

  1. Network Analytics

Leveraging the newest version of VOSviewer, we have created a set of network views showing “maps of science” per university and for Denmark as a whole, as well as network maps showing inter-organisational collaborations.

The “maps” show nodes and edges in a network graph. Depending on the view, the nodes represent either organisations or topics. Likewise, the edges represent either co-authoring collaborations between organisations or the co-occurrence of topics on documents (the number of times a pair of topics co-occurred on publications). The spatial closeness of the nodes is a proxy for relatedness. However, it is important to note that a two-dimensional network layout cannot fully capture the actual relatedness between nodes. For a more detailed walkthrough of the network views and VOSviewer, please visit: https://www.vosviewer.com/getting-started

Data

Dimensions data for scientific publications published between 2014 and 2019 by one or more of the 8 Danish universities.

The original query to extract data from Dimensions includes documents authored by at least one organisation with a Danish GRID id in the categories “education” or “healthcare”. In addition, a second, complementary query uses DOIs extracted from the Current Research Information Systems (CRIS) of the Danish research institutions for the 2014-2019 period, in order to add Danish DOIs and author affiliations missing from the Dimensions database.

Data Gaps

The most important limitation refers to affiliation information that is not fully captured by Dimensions and records that have not been indexed by Dimensions.

Choices   

To simplify the interpretation of the network graphs, we do not use fractional counting when building them. In addition, the views show only the top “X” (often the top 1000) “relations” between nodes. For a network of inter-organisational collaborations, this means that only the 1000 largest co-authoring collaborations (based on the count of records) are displayed. Otherwise, the network becomes too “visually busy” to read.
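The two choices described above, full counting and a cut-off at the top relations, can be sketched together. This is an illustrative example with made-up data, not the code behind the views. The function name is hypothetical and `top_n` defaults to the 1000 mentioned above.

```python
from collections import Counter
from itertools import combinations


def top_links(publications, top_n=1000):
    """Count co-authoring pairs with full counting and keep the largest top_n."""
    weights = Counter()
    for orgs in publications:
        # Every pair of organisations on a publication gets a full count of 1,
        # regardless of how many organisations share the publication.
        for pair in combinations(sorted(set(orgs)), 2):
            weights[pair] += 1
    return weights.most_common(top_n)


pubs = [["A", "B"], ["A", "B", "C"], ["B", "C"], ["A", "B"]]
links = top_links(pubs, top_n=2)
```

Here the pair A-C, with a single shared publication, falls below the cut-off and would not be drawn.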

  1. Map of Science

This visualisation shows a network representation of the connections between the main fields of research covered by Danish publications (showing by default the top 1000 connections between topics).

Each node in the network represents one field of research at either the top or second level of the classification system (a publication can be categorised under one or more "Fields of Research (FOR)").

Each connection between two nodes represents one or more instances of those fields of research co-occurring on one or more publications.

The size of the node represents the number of publications connected to that field of research. The "thickness" of the connection represents the number of publications in which those two fields of research co-occurred.

  1. Collaboration Map

This visualisation shows a network representation of the top co-authoring collaborations for Danish research organisations (showing by default the top 1000 collaborations).

Each node in the network represents one organisation.

Each connection between two nodes represents one or more co-authoring instances between those organisations.

The size of the node represents the number of publications with co-authors outside that organisation. The "thickness" of the connection represents the number of co-authorship collaborations between the two organisations.