Knowledge graphs in the fight against COVID-19

For researchers, policy makers and others trying to combat the spread and mitigate the impact of COVID-19, trying to make sense of the data surrounding the virus is a Herculean task. Vast in volume and ceaselessly produced, this data emanates from domains as different as virology and economics and is produced by a multitude of people and organizations. Unsurprisingly the standards to which this data conforms are as multitudinous as its sources.

It just so happens that making sense of messy data from disparate sources is one of the things at which knowledge graphs excel. Moreover, knowledge graphs make it possible to derive new knowledge from intelligently connecting information residing in those disparate data repositories.

Given that the ability to better analyze data and gain new insights is of obvious use to people trying to respond to the pandemic, those working with knowledge graph technologies have started to talk about how those technologies – and their skills – might be employed to help tackle the problem.

To that end, François Scharffe convened a virtual Meetup on 1 April 2020, “Knowledge Graphs to Fight COVID-19”, which featured seven speakers presenting information about the COVID-19 initiatives on which they’ve been working.

In this post I summarize the material presented during that Meetup, along with a brief summary of other pandemic-facing efforts from those in the linked data space. But first two quick things.

First, by popular demand François has scheduled a follow-up Meetup on 8 April, “Knowledge Graphs to Fight COVID-19 – Part II”. Check it out if you’re interested and available 12:00-2:00 EDT.

Second, many if not most of the initiatives discussed below are actively seeking volunteer help. If one of these projects strikes your fancy and you think you might be able to help please do contact the organizer.

Virtual Meetup: “Knowledge Graphs to Fight COVID-19” (1 April 2020)

Opening remarks by François Scharffe

François Scharffe, (New York) Knowledge Graphs Meetup and The Knowledge Graph Conference
Recording (at 6:09)

François kicked off the Meetup by noting there are “a lot of people building knowledge graphs around the topic of COVID-19. There are different styles of effort: some people represent news, some people represent genetic information about the virus, some people get data.”

“The goal of this meeting,” he said, “will be to enable people to federate their efforts. Ideally we’d like a knowledge graph that is connected – all these knowledge graphs should be connected, they should be linked together.”

François noted that when they were contacting potential speakers a biologist observed that knowledge graphs are “not really what people need now.” On this point François acknowledged that now is not the time to introduce knowledge graph technologies to previously-uninitiated domain experts or to evangelize their use. Rather we should be using graph technologies right now to help make sense of already-available data.

“If access to knowledge in a well-organized way has any chance to help in that direction then that’s what we want to do. We just want to provide all the data that’s around in a nicely integrated way because that’s the power of knowledge graphs.”

The Fight Against COVID-19 by Unleashing the FORCE of the LOD Cloud Knowledge Graph

Kingsley Idehen, OpenLink Software
Recording (at 13:00) | Presentation (Google Slides)

The COVID-19 problem
- Understanding of the virus is sub-optimal due to shortcomings related to measurements (data), metrics (information) and insights (knowledge)
- This suboptimal understanding is compounded by the existence of disparate data, dashboards and data visualizations
- “The problem right now is that we’re in a crisis, time is of the essence, perfection is a journey to a destination that we hopefully never reach, but we have to have a solution right now that allows us to better understand this virus. And to do that what we have to be able to do is harmonize data, information and knowledge without the traditional impediments that come from data wrangling.”
  - Open standards play a vital role in this
The FORCE behind the solution
- Solution is to be found in the global generation of data, information and knowledge networks at rates that match or exceed the COVID-19 infection rate
- This entails leveraging what’s right before us – namely the internet, web documents, data networks and knowledge networks.
The LOD cloud knowledge graph
- In this public data network we use hyperlinks to create simple sentences that allow us to describe anything.
- “Because the LOD cloud itself is already extremely rich in data, information and knowledge that covers what I see as the main area of challenge. We need to understand the virus, right at the genetic level. We need to understand its spread, its rate of infection. We need to know rates of hospitalization, we need to know the death rate. But we don’t need to know this at the global level, we need to decompose this all the way down to specific countries, provinces, municipalities until you actually get to local groups.”
Harnessing the FORCE
- We want to get to a point where we progressively update the LOD cloud knowledge graph by using the power of hyperlinks as super keys to identify things
- By doing this you harmonize data across disparate data sources.
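To make the “hyperlinks as super keys” idea concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the IRIs, field names and figures are all invented for illustration, and a real deployment would work with RDF data rather than Python dictionaries. The point is only that two sources that share nothing but an identifier can be merged without any bespoke wrangling.

```python
from collections import defaultdict

# Hypothetical IRIs and figures, invented for illustration only.
REGION_A = "http://example.org/place/region-a"
REGION_B = "http://example.org/place/region-b"

# Two unrelated sources that happen to identify places with the same IRIs.
case_counts = [
    {"place": REGION_A, "confirmed_cases": 120},
    {"place": REGION_B, "confirmed_cases": 45},
]
hospital_data = [
    {"location": REGION_A, "hospitalized": 17},
]


def harmonize(sources):
    """Merge records from several sources, keyed on the IRI each one uses.

    `sources` is a list of (records, key_field) pairs, because every
    source may name its identifier field differently.
    """
    merged = defaultdict(dict)
    for records, key_field in sources:
        for record in records:
            iri = record[key_field]
            for field, value in record.items():
                if field != key_field:
                    merged[iri][field] = value
    return dict(merged)


combined = harmonize([(case_counts, "place"), (hospital_data, "location")])
print(combined[REGION_A])
```

Because the IRI is globally unique, no source-specific matching logic is needed; adding a third source is just one more entry in the list passed to `harmonize`.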

Creating a Knowledge Graph of COVID-19 Literature to Facilitate Meta Analysis

Gilles Vandewiele, IDLab (Internet Technology & Data Science Lab), Ghent University
Recording (at 29:45) | Presentation (Google Slides) | COVID-19 Literature Knowledge Graph

Information on the initiative from Twitter

He and others at the IDLab “created a Knowledge Graph from the recently published Kaggle dataset about COVID-19 literature. It contains different information for each paper such as author information, content information and citation information.”
“The CSV and JSON data from Kaggle has been mapped to RDF data using RML (Matey). The knowledge graph is currently hosted on Kaggle and open for contributions. We hope this allows for easier meta-analysis of current work around COVID-19.”
“The knowledge graph has different applications. One interesting application is creating embeddings with RDF2Vec for each of the entities in our KG, such as papers, authors or journals.”

Meetup presentation summary

“Today I’ll be discussing our ongoing effort on creating a knowledge graph of the available COVID-19 literature. By constructing this knowledge graph we hope to facilitate meta-analysis of other researchers.”
The CORD-19 dataset, the original dataset on which they built their knowledge graph
- Over 45,000 scholarly articles about various types of coronavirus
  - 33,000 full texts
- From different resource domains, and various sources (bioRxiv, medRxiv, PubMed)
- Data provided as a CSV with metadata
- Identifier links to a JSON with some of the metadata in the CSV, text from the abstract and from the body, a bibliography and information about figures and tables
How they went about constructing the knowledge graph
- Goal was to take these structured data formats and convert them to RDF for the knowledge graph
- The knowledge graph should contain at least the information that is already present in the structured data, as well as enriching that data
  - Data is enriched by linking all the entities or concepts in the text with external resources
  - The linking process extends the JSON with extra information
  - Because citation information exists a citation network can be constructed, and this extra information can be added to the JSON
- With the enriched JSON in place, it is converted to RDF with the RDF Mapping Language (RML)
  - More specifically, they used YARRRML, a YAML-based syntax for writing RML rules, to define the mapping from JSON to RDF in a YAML file (using the online tool Matey)
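For readers who have not seen a JSON-to-RDF mapping before, the sketch below shows the general shape of the conversion in plain Python. It is a hand-written stand-in, not RML or YARRRML and not the IDLab team’s actual mapping. The JSON layout and the `http://example.org/paper/` namespace are assumptions made up for the example; the Dublin Core and CiTO predicates are real vocabulary terms, but whether the project uses them is not stated in the talk.

```python
import json

DCTERMS = "http://purl.org/dc/terms/"
CITO = "http://purl.org/spar/cito/"
BASE = "http://example.org/paper/"  # hypothetical namespace for paper IRIs

# Hypothetical, simplified paper record (not the real CORD-19 JSON schema).
paper_json = """
{
  "paper_id": "p1",
  "title": "An example title",
  "authors": ["A. Author", "B. Author"],
  "cites": ["p2"]
}
"""


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes only."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def paper_to_ntriples(paper):
    """Turn one paper record into a list of N-Triples lines."""
    subject = f"<{BASE}{paper['paper_id']}>"
    lines = [f"{subject} <{DCTERMS}title> {literal(paper['title'])} ."]
    for name in paper.get("authors", []):
        lines.append(f"{subject} <{DCTERMS}creator> {literal(name)} .")
    for cited_id in paper.get("cites", []):
        # Citations become links between paper IRIs, i.e. the citation network.
        lines.append(f"{subject} <{CITO}cites> <{BASE}{cited_id}> .")
    return lines


triples = paper_to_ntriples(json.loads(paper_json))
print("\n".join(triples))
```

A declarative mapping language such as RML expresses the same subject, predicate and object templates as data rather than code, which is what makes the mapping easy to share and revise.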
Some potential applications
- An improved reading tool
- Advanced querying (e.g. with SPARQL)
  - SPARQL endpoint available
- General graph applications (e.g. use algorithm to extract most prominent reference works in the literature)
- Creating embeddings to find clusters
- Nearest neighbors
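As a rough illustration of the “most prominent reference works” idea, the sketch below ranks papers by how often they are cited. The talk does not say which algorithm is intended, so counting incoming citations is an assumption chosen for simplicity, and the citation edges are invented.

```python
from collections import Counter

# Hypothetical citation edges: (citing paper, cited paper).
citations = [
    ("p1", "p3"),
    ("p2", "p3"),
    ("p4", "p3"),
    ("p1", "p2"),
    ("p4", "p2"),
    ("p3", "p5"),
]


def most_cited(edges, top_n=3):
    """Rank papers by the number of incoming citations (in-degree)."""
    counts = Counter(cited for _citing, cited in edges)
    return counts.most_common(top_n)


ranking = most_cited(citations)
print(ranking)
```

In-degree is the crudest prominence measure; a graph algorithm such as PageRank would also weigh who is doing the citing.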
Some potential next steps
- Named Entity Recognition (NER) to link to other data sources
- Linking authors to ORCID based on papers and name
- Canonicalizing the citations
- Building applications based on embeddings
Ways others can help
- Software engineers can help with environment, architecture
- Data scientists can help with applying NLP techniques, linking data to external resources, building applications on top of the graph
- Semantic folks can help look for good ontologies, enrich the data by applying reasoning
- Front end folks are vital because nobody will use the tool if it’s not accessible
- Non-technical folks can help spread the word

COVID-19 Community Project

Peter W. Rose, Structural Bioinformatics Lab, San Diego Supercomputer Center, University of California San Diego
Recording (at 43:58) | Presentation (PDF) | COVID-19 Community Project GitHub

The goal of this community-based work is to link heterogeneous datasets about COVID-19, in three main areas
- The host (typically a human)
- Data about the virus itself
- The environment surrounding the virus
The Lab created a prototype COVID-19 graph in Jan. 2020, now obsolete
Did work looking at the community spread of the A2a strain of the coronavirus
The Community Project needs help
- Need expertise to help clean up messy, non-standardized data
- Need new datasets
- Need help with visualizations (in part supporting the efficient generation of up-to-date infographics)
- Need help with data analysis and forecast generation
This initiative is now a part of Graphs4Good (GraphHack) 2020 (organized by Neo4j)
In order to make it easy for the community to contribute, they have built an automated workflow
They are also “working with the schema.org people” in order to harvest data and annotate websites based on new schema.org vocabulary
Also a part of a National Science Foundation initiative, The NSF Convergence Accelerator, which aims to build an open knowledge network that connects different disciplines

COVID-19 Project by STKO Lab

Krzysztof Janowicz, STKO (Spatio-Temporal Knowledge Observatory) Lab, Geography Department of the University of California, Santa Barbara
Recording (at 52:59) | COVID-19 Project by STKO Lab
No presentation for this talk

Started in January 2020 as a side project
Because their domain is geographic information science, initial efforts were around disruption to air travel and tourism
Also became interested in disruption to supply chains
- While supply chains are normally fairly straightforward chains (that is, they don’t usually require graphs), they cannot handle “constantly changing regulation, different social quarantine and distancing measures, harvests being closed down” and other major and multiple disruptions to the supply chain caused by the coronavirus outbreak
- Supply chain domain experts expect the full force of supply chain disruptions for medical supplies, food supplies and just everyday goods to be felt in the weeks and months to come
- This is a time frame where many in the semantic domain may contribute because this is an area where “the role of linked data really shines”
- They want to do a “continuous integration” that integrates data on the fly and brings it back to geographic regions
- Aggregating the data from a range of different sources is a challenging cross-data integration problem, but is going to become more and more important as efforts start to get underway to reopen things
In summary, data integration is of the essence; it needs a lot of work of a kind in which they have expertise, and they’ll be able to ingest data and make sure it’s properly aligned to the geographic regions
- An example of the sort of linked data reasoning that can be employed here is that if quarantine and social distancing measures are in place for a region, then a community that’s part of this region will be subject to those same restrictions, so you don’t need to materialize everything in the graph
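The containment reasoning described here can be sketched in a few lines of Python. The region names, the hierarchy and the measures below are all made up, and a real system would more likely express this as a rule or property path over the graph. The idea is the same, though: restrictions are stored once, at the level where they were issued, and derived for contained regions on demand.

```python
# Hypothetical containment hierarchy: child region -> parent region.
PART_OF = {
    "community-x": "county-y",
    "county-y": "state-z",
    "community-w": "county-v",
    "county-v": "state-z",
}

# Measures asserted only on the region that issued them.
ASSERTED_MEASURES = {
    "county-y": {"stay-at-home order"},
    "state-z": {"social distancing"},
}


def measures_for(region):
    """Collect measures that apply to a region, including inherited ones."""
    measures = set()
    seen = set()  # guards against accidental cycles in the hierarchy
    current = region
    while current is not None and current not in seen:
        seen.add(current)
        measures |= ASSERTED_MEASURES.get(current, set())
        current = PART_OF.get(current)
    return measures


print(sorted(measures_for("community-x")))
```

Nothing is stored against `community-x` itself; its restrictions follow from where it sits in the hierarchy, which is the sense in which nothing needs to be materialized.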

Project Domino

Leo Meyerovich, Graphistry
Recording (at 1:00:27) | Presentation (PowerPoint) | Project Domino GitHub

Project Domino spun out of a related Graphistry project on election misinformation
Project Domino is aimed at scaling behavior change and anti-misinformation for COVID-19 disaster response by leveraging social networks
They are working alongside Disaster Tech on this initiative
To prevent death, one of the most important COVID-19 disaster efforts right now is to change the behavior of citizens and decision-makers
Project Domino has two directions
- Promote positive behaviors at a mass scale
  - E.g. provide information and information tools to leaders of sub-communities that are at risk
- “Defang” misinformation
  - Use lessons learned from security and fraud threat intelligence to map out and defuse misinformation before it spreads
This work isn’t easy due to issues of scale, timeliness, precision and sustainability
Current Twitter efforts include a data pipeline that can handle about 100 events a second, which seems to be what’s needed to cope with COVID-19 information
Also looking at integrating necessary information from social networks to correlate that information with things like clinical trial databases and fact checking databases
Timeliness important: if misinformation surfaces they want to be able to address it in ten minutes, not in days or weeks
Taking action on this includes internal actions like bringing in traditional intelligence analysts, and external, public-facing actions like the generation of timely alerts for social media platforms
Highlighted interventions
- Some around mass public health information, including addressing bad medical and behavioral information (e.g. misinformation about unsafe cures)
- Some around an increase in digital crime (e.g. increased number of phishing attempts) as bad actors attempt to take advantage of the situation

The timbr COVID-19 Knowledge Graph

Amit Weitzner, timbr.ai
Recording (at 1:13:38) | Presentation (PDF)

In mid-March joined an Israeli effort to build a COVID-19 knowledge graph and shortly thereafter received permission to make this available for open use
Objective of the graph is to address the lack of connections between research questions and the large amount of data that’s being published on the virus
- Wanted to take all the available data sources, build a knowledge graph and make it freely accessible to analysts, data scientists and researchers so that they can query it in SQL, Apache Spark, Python and R
Steps taken to build the timbr COVID-19 Knowledge Graph
- Curate a list of all data sources and research around COVID-19
- Crawl all the relevant datasets
- Clean and transform each dataset
- Generate an ontology based on the dataset
- Map the data to the ontology
- Curate the ontology by building more hierarchies and adding ontological rules
- Give access to researchers and data scientists
All of the crawlers are written in Python, with Apache Airflow used to schedule daily dataset updates
Once in the data lake they can use Apache Spark to transform and clean more complex datasets, or use Trifacta to load the data to SQL Server
timbr connects to existing databases and applies an ontology layer on top of it
timbr, a virtual layer that operates in standard SQL, transforms existing databases into knowledge graphs
- Allows seamless integration with business intelligence tools
timbr does all of this by having implemented semantic web principles in SQL, and is fully compatible with RDF
Current COVID-19 concepts in their graph include weather station data, online and TV news about the virus, and global personal COVID-19 cases and patients
- In an example, Amit generated a visualization of COVID-19 cases by country and by age, ordering the chart by age to make it easy to see countries where the average age of patients was high (the highest being Saudi Arabia, 68.0) and low (the lowest being India, 21.0)
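The aggregation behind a chart like the one described above is straightforward to sketch. The version below uses plain Python rather than timbr or SQL, and the patient records are invented placeholders, not data from the timbr graph.

```python
from collections import defaultdict
from statistics import mean

# Made-up patient records for illustration; not taken from the timbr graph.
patients = [
    {"country": "Country A", "age": 70},
    {"country": "Country A", "age": 60},
    {"country": "Country B", "age": 30},
    {"country": "Country B", "age": 20},
    {"country": "Country C", "age": 45},
]


def average_age_by_country(records):
    """Return (country, average age) pairs, highest average first."""
    ages = defaultdict(list)
    for record in records:
        ages[record["country"]].append(record["age"])
    averages = {country: mean(values) for country, values in ages.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


chart_rows = average_age_by_country(patients)
for country, avg_age in chart_rows:
    print(f"{country}: {avg_age:.1f}")
```

Sorting by the average is what puts the countries with the oldest and youngest patient populations at opposite ends of the chart.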

COVID❋GRAPH Project

Martin Preusse, Kaiser & Preusse
Recording (at 1:27:19) | COVID❋GRAPH Project

They kind of do what everyone else has done, Martin says, in creating a knowledge graph that collects datasets, in this case built in Neo4j
About mid-March Martin started to play with case data from Johns Hopkins University, putting it in a graph, and the graph has grown from there
There are now about 40 people involved in this initiative, “essentially the European Neo4j community”
This work is not merely to collect data, but to build applications in direct communication with end users
- They have relationships with the German Center for Diabetes Research, with clinical researchers working on COVID-19 right now, and with pharmaceutical companies working on COVID-19
- They have built applications on top of their knowledge graph to provide value to researchers working on COVID-19 and related topics
- Their first prototype application is available now, which allows users to search for genes and see articles and patents about them
They broke down their patent and paper datasets into fragments, and then connected them to the gene molecular world by searching
The gene symbols are widely-used, and these symbols are connected to gene transcript process IDs
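A minimal sketch of that kind of symbol search follows, assuming the simplest possible approach: exact, case-sensitive token matching. The talk does not describe the project’s actual matching logic, and the fragment identifiers and text below are invented. ACE2 and TMPRSS2 are real gene symbols, used here only as sample search terms.

```python
import re

# Sample gene symbols to search for.
GENE_SYMBOLS = {"ACE2", "TMPRSS2"}

# Hypothetical text fragments from papers and patents.
fragments = {
    "paper-1/frag-1": "This fragment mentions ACE2 and TMPRSS2 together.",
    "patent-7/frag-3": "This fragment mentions only TMPRSS2.",
    "paper-2/frag-5": "This fragment mentions no gene at all.",
}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def link_fragments_to_genes(fragments, symbols):
    """Map each fragment ID to the gene symbols found in its text."""
    links = {}
    for fragment_id, text in fragments.items():
        tokens = set(TOKEN.findall(text))
        found = sorted(tokens & symbols)
        if found:
            links[fragment_id] = found
    return links


def fragments_for_gene(links, symbol):
    """Look up every fragment that mentions a given gene symbol."""
    return sorted(fid for fid, genes in links.items() if symbol in genes)


links = link_fragments_to_genes(fragments, GENE_SYMBOLS)
print(fragments_for_gene(links, "TMPRSS2"))
```

Exact matching will miss synonyms and will trip over symbols that double as ordinary words, so a production pipeline would need something more careful than this.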
Next thing they want to load is GTEx, a gene expression database “that can say, essentially, a certain gene is expressed in a certain tissue in humans”
The project GitHub can be found here
