Knowledge graphs in the fight against COVID-19

For researchers, policy makers and others trying to combat the spread and mitigate the impact of COVID-19, making sense of the data surrounding the virus is a Herculean task. Vast in volume and ceaselessly produced, this data emanates from domains as different as virology and economics and is produced by a multitude of people and organizations. Unsurprisingly, the standards to which this data conforms are as multitudinous as its sources.

It just so happens that making sense of messy data from disparate sources is one of the things at which knowledge graphs excel. Moreover, knowledge graphs make it possible to derive new knowledge from intelligently connecting information residing in those disparate data repositories.

Given that the ability to better analyze data and gain new insights is of obvious use to people trying to respond to the pandemic, those working with knowledge graph technologies have started to talk about how those technologies – and their skills – might be employed to help tackle the problem.

To that end, François Scharffe convened a virtual Meetup on 1 April 2020, “Knowledge Graphs to Fight COVID-19”, which featured seven speakers presenting information about the COVID-19 initiatives on which they’ve been working.

In this post I summarize the material presented during that Meetup, along with a brief overview of other pandemic-facing efforts from those in the linked data space. But first, two quick things.

First, by popular demand François has scheduled a follow-up Meetup on 8 April, “Knowledge Graphs to Fight COVID-19 – Part II”. Check it out if you’re interested and available 12:00-2:00 EDT.

Second, many if not most of the initiatives discussed below are actively seeking volunteer help. If one of these projects strikes your fancy and you think you might be able to help, please do contact the organizer.

Virtual Meetup: “Knowledge Graphs to Fight COVID-19” (1 April 2020)

Opening remarks by François Scharffe

François Scharffe kicks off the virtual meetup

François Scharffe, (New York) Knowledge Graphs Meetup and The Knowledge Graph Conference
Recording (at 6:09)

François kicked off the Meetup by noting there are “a lot of people building knowledge graphs around the topic of COVID-19. There are different styles of effort: some people represent news, some people represent genetic information about the virus, some people get data.”

“The goal of this meeting,” he said, “will be to enable people to federate their efforts. Ideally we’d like a knowledge graph that is connected – all these knowledge graphs should be connected, they should be linked together.”

François noted that when they were contacting potential speakers a biologist observed that knowledge graphs are “not really what people need now.” On this point François acknowledged that now is not the time to introduce knowledge graph technologies to previously-uninitiated domain experts or to evangelize their use. Rather we should be using graph technologies right now to help make sense of already-available data.

“If access to knowledge in a well-organized way has any chance to help in that direction then that’s what we want to do. We just want to provide all the data that’s around in a nicely integrated way because that’s the power of knowledge graphs.”

The Fight Against COVID-19 by Unleashing the FORCE of the LOD Cloud Knowledge Graph

Slide from the presentation "The Fight Against COVID-19 by Unleashing the FORCE of the LOD Cloud Knowledge Graph" by Kingsley Idehen

Kingsley Idehen, OpenLink Software
Recording (at 13:00) | Presentation (Google Slides)

  • The COVID-19 problem
    • Understanding of the virus is suboptimal due to shortcomings related to measurements (data), metrics (information) and insights (knowledge)
    • This suboptimal understanding is compounded by the existence of disparate data, dashboards and data visualizations
    • “The problem right now is that we’re in a crisis, time is of the essence, perfection is a journey to a destination that we hopefully never reach, but we have to have a solution right now that allows us to better understand this virus. And to do that what we have to be able to do is harmonize data, information and knowledge without the traditional impediments that come from data wrangling.”
      • Open standards play a vital role in this
  • The FORCE behind the solution
    • Solution is to be found in the global generation of data, information and knowledge networks at rates that match or exceed the COVID-19 infection rate
    • This entails leveraging what’s right before us – namely the internet, web documents, data networks and knowledge networks.
  • The LOD cloud knowledge graph
    • In this public data network we use hyperlinks to create simple sentences that allow us to describe anything.
    • “Because the LOD cloud itself is already extremely rich in data, information and knowledge that covers what I see as the main area of challenge. We need to understand the virus, right at the genetic level. We need to understand its spread, its rate of infection. We need to know rates of hospitalization, we need to know the death rate. But we don’t need to know this at the global level, we need to decompose this all the way down to specific countries, provinces, municipalities until you actually get to local groups.”
  • Harnessing the FORCE
    • We want to get to a point where we progressively update the LOD cloud knowledge graph by using the power of hyperlinks as super keys to identify things
    • By doing this you harmonize data across disparate data sources.
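
The idea of hyperlinks acting as "super keys" can be illustrated with a minimal Python sketch. It is not taken from the presentation: the example.org IRI, the field names and the numbers are all made-up placeholders. It only shows that two sources with different local conventions can be merged once both identify a place by the same IRI.

```python
from collections import defaultdict

# Hypothetical shared identifier; any dereferenceable IRI would play this role.
REGION_A = "http://example.org/place/region-a"

# Two made-up sources that label the same place differently but share its IRI.
CASE_COUNTS = [{"id": REGION_A, "label": "Region A", "cases": 100}]
HOSPITAL_DATA = [{"id": REGION_A, "name": "region_a", "hospitalized": 20}]


def harmonize(*sources):
    """Merge records from all sources, keyed by their shared IRI."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            iri = record["id"]
            for key, value in record.items():
                if key != "id":
                    merged[iri][key] = value
    return dict(merged)


harmonized = harmonize(CASE_COUNTS, HOSPITAL_DATA)
print(harmonized[REGION_A])
```

Because the IRI is the join key, neither source has to adopt the other's labels or column names before the data can be combined.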

Creating a Knowledge Graph of COVID-19 Literature to Facilitate Meta Analysis

Slide from the presentation "Creating a Knowledge Graph of COVID-19 Literature to Facilitate Meta Analysis" by Gilles Vandewiele

Gilles Vandewiele, IDLab (Internet Technology & Data Science Lab), Ghent University
Recording (at 29:45) | Presentation (Google Slides) | COVID-19 Literature Knowledge Graph

Information on the initiative from Twitter

  • He and others at the IDLab “created a Knowledge Graph from the recently published Kaggle dataset about COVID-19 literature. It contains different information for each paper such as author information, content information and citation information.”
  • “The CSV and JSON data from Kaggle has been mapped to RDF data using RML (Matey). The knowledge graph is currently hosted on Kaggle and open for contributions. We hope this allows for easier meta-analysis of current work around COVID-19.”
  • “The knowledge graph has different applications. One interesting application is creating embeddings with RDF2Vec for each of the entities in our KG, such as papers, authors or journals.”

Meetup presentation summary

  • “Today I’ll be discussing our ongoing effort on creating a knowledge graph of the available COVID-19 literature. By constructing this knowledge graph we hope to facilitate meta-analysis of other researchers.”
  • The CORD-19 dataset, the original dataset on which they built their knowledge graph
    • Over 45,000 scholarly articles about various types of coronavirus
      • 33,000 full texts
    • From different resource domains, and various sources (bioRxiv, medRxiv, PubMed)
    • Data provided as a CSV with metadata
    • Identifier links to a JSON with some of the metadata in the CSV, text from the abstract and from the body, a bibliography and information about figures and tables
  • How they went about constructing the knowledge graph
    • Goal was to take these structured data formats and convert them to RDF for the knowledge graph
    • The knowledge graph should contain at least the information that is already present in the structured data, as well as enriching that data
      • Data is enriched by linking all the entities or concepts in the text with external resources
      • The linking process extends the JSON with extra information
      • Because citation information exists a citation network can be constructed, and this extra information can be added to the JSON
    • With the enriched JSON in place, it is converted to RDF with the RDF Mapping Language (RML)
      • More specifically, they used YARRRML, a YAML-based dialect of RML in which the mapping from JSON to RDF is defined in a YAML file (written with the online tool Matey)
  • Some potential applications
    • An improved reading tool
    • Advanced querying (e.g. with SPARQL)
    • General graph applications (e.g. use algorithm to extract most prominent reference works in the literature)
    • Creating embeddings to find clusters
    • Nearest neighbors
  • Some potential next steps
    • Named Entity Recognition (NER) to link to other data sources
    • Linking authors to ORCID based on papers and name
    • Canonicalizing the citations
    • Building applications based on embeddings
  • Ways others can help
    • Software engineers can help with environment, architecture
    • Data scientists can help with applying NLP techniques, linking data to external resources, building applications on top of the graph
    • Semantic folks can help look for good ontologies, enrich the data by applying reasoning
    • Front end folks are vital because nobody will use the tool if it’s not accessible
    • Non-technical folks can help spread the word
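
The JSON-to-RDF step described above can be sketched in plain Python. This is a simplified illustration, not the project's RML/YARRRML mapping: the record layout (`paper_id`, `title`, `authors`, `cites`), the example.org base IRI and the choice of Dublin Core and CiTO properties are assumptions made for the example.

```python
import json

# Assumed namespace and vocabulary choices (not the project's actual mapping).
BASE = "http://example.org/paper/"
DCT_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
CITO_CITES = "http://purl.org/spar/cito/cites"

# A hypothetical, heavily simplified paper record.
RECORD = """{"paper_id": "p1", "title": "Example coronavirus paper",
             "authors": ["A. Author", "B. Author"], "cites": ["p2"]}"""


def literal(text):
    """Serialize a string as an N-Triples literal, escaping special characters."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def paper_to_triples(paper):
    """Map one paper record to a list of N-Triples statements."""
    subject = f"<{BASE}{paper['paper_id']}>"
    triples = [f"{subject} <{DCT_TITLE}> {literal(paper['title'])} ."]
    for author in paper.get("authors", []):
        triples.append(f"{subject} <{DC_CREATOR}> {literal(author)} .")
    for cited_id in paper.get("cites", []):
        # Each citation becomes an edge in the citation network.
        triples.append(f"{subject} <{CITO_CITES}> <{BASE}{cited_id}> .")
    return triples


triples = paper_to_triples(json.loads(RECORD))
print("\n".join(triples))
```

A declarative mapping language such as RML expresses the same subject, predicate and object rules as data rather than code, which makes the mapping easier to review and to reuse across files.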

COVID-19 Community Project

Slide from the presentation "COVID-19 Community Project" by Peter W. Rose

Peter W. Rose, Structural Bioinformatics Lab, San Diego Supercomputer Center, University of California San Diego
Recording (at 43:58) | Presentation (PDF) | COVID-19 Community Project GitHub

  • The goal of this community-based work is to link heterogeneous datasets about COVID-19, in three main areas
    • The host (typically a human)
    • Data about the virus itself
    • The environment surrounding the virus
  • The Lab created a prototype COVID-19 graph in Jan. 2020, now obsolete
  • Did work looking at the community spread of the A2a strain of the coronavirus
  • The Community Project needs help
    • Need expertise to help clean up messy, non-standardized data
    • Need new datasets
    • Need help with visualizations (in part supporting the efficient generation of up-to-date infographics)
    • Need help with data analysis and forecast generation
  • This initiative is now a part of Graphs4Good (GraphHack) 2020 (organized by Neo4j)
  • In order to make it easy for the community to contribute, they have built an automated workflow
  • They are also “working with the schema.org people” in order to harvest data and annotate websites based on new schema.org vocabulary
  • Also a part of a National Science Foundation initiative, The NSF Convergence Accelerator, which aims to build an open knowledge network that connects different disciplines

COVID-19 Project by STKO Lab

Graph visualization from the COVID-19 Project by STKO Lab, UC Santa Barbara

Krzysztof Janowicz, STKO (Spatio-Temporal Knowledge Observatory) Lab, Geography Department of the University of California, Santa Barbara
Recording (at 52:59) | COVID-19 Project by STKO Lab
No presentation for this talk

  • Started in January 2020 as a side project
  • Because their domain is geographic information science, initial efforts were around disruption to air travel and tourism
  • Also became interested in disruption to supply chains
    • While supply chains are normally fairly straightforward chains (that is, they don’t usually require graphs), they cannot handle “constantly changing regulation, different social quarantine and distancing measures, harvests being closed down” and other major and multiple disruptions to the supply chain caused by the coronavirus outbreak
    • Supply chain domain experts expect the full force of supply chain disruptions for medical supplies, food supplies and just everyday goods to be felt in the weeks and months to come
    • This is a time frame where many in the semantic domain may contribute because this is an area where “the role of linked data really shines”
    • They want to do a “continuous integration” that integrates data on the fly and brings it back to geographic regions
    • Aggregating the data from a range of different sources is a challenging cross-data integration problem, but is going to become more and more important as efforts start to get underway to reopen things
  • In summary, data integration is of the essence and needs a lot of work of the kind they have expertise in; they’ll be able to ingest data and make sure it’s properly aligned to the geographic regions
    • An example of the sort of linked data reasoning that can be employed here is that if quarantine and social distancing measures are in place for a region, then a community that’s part of this region will be subject to those same restrictions, so you don’t need to materialize everything in the graph
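
The containment reasoning in that last example can be sketched in a few lines of Python. The region names and measures below are hypothetical, and a real system would more likely express this as an ontology rule or a query over a part-of property; the sketch only shows that restrictions stored once at the region level can be derived for any contained community on demand.

```python
# Hypothetical part-of hierarchy: each region points to the region containing it.
PART_OF = {"community-x": "county-y", "county-y": "state-z"}

# Restrictions are stored only at the level where they were declared.
RESTRICTIONS = {
    "state-z": {"social-distancing"},
    "county-y": {"quarantine"},
}


def effective_restrictions(region):
    """Collect restrictions declared on a region and on every region containing it."""
    found = set()
    seen = set()  # guards against accidental cycles in the hierarchy
    current = region
    while current is not None and current not in seen:
        seen.add(current)
        found |= RESTRICTIONS.get(current, set())
        current = PART_OF.get(current)
    return found


print(sorted(effective_restrictions("community-x")))
```

Nothing is stored against `community-x` itself, yet the lookup returns both measures, which is the sense in which not everything needs to be materialized in the graph.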

Project Domino

Slide from the presentation "Project Domino" by Leo Meyerovich

Leo Meyerovich, Graphistry
Recording (at 1:00:27) | Presentation (PowerPoint) | Project Domino GitHub

  • Project Domino spun out of a related Graphistry project on election misinformation
  • Project Domino is aimed at scaling behavior change and anti-misinformation for COVID-19 disaster response by leveraging social networks
  • They are working alongside Disaster Tech on this initiative
  • To prevent death, one of the most important COVID-19 disaster efforts right now is to change the behavior of citizens and decision-makers
  • Project Domino has two directions
    • Promote positive behaviors at a mass scale
      • E.g. provide information and information tools to leaders of sub-communities that are at risk
    • “Defang” misinformation
      • Use lessons learned from security and fraud threat intelligence to map out and defuse misinformation before it spreads
  • This work isn’t easy due to issues of scale, timeliness, precision and sustainability
  • Current Twitter efforts include a data pipeline that can handle about 100 events a second, which seems to be what’s needed to cope with COVID-19 information
  • Also looking at integrating necessary information from social networks to correlate that information with things like clinical trial databases and fact checking databases
  • Timeliness is important: if misinformation surfaces they want to be able to address it in ten minutes, not in days or weeks
  • Taking action on this includes internal actions like bringing in traditional intelligence analysts, and external, public-facing actions like the generation of timely alerts for social media platforms
  • Highlighted interventions
    • Some around mass public health information, including addressing bad medical and behavioral information (e.g. misinformation about unsafe cures)
    • Some around an increase in digital crime (e.g. increased number of phishing attempts) as bad actors attempt to take advantage of the situation

The timbr COVID-19 Knowledge Graph

Slide from the presentation "The timbr COVID-19 Knowledge Graph" by Amit Weitzner

Amit Weitzner, timbr.ai
Recording (at 1:13:38) | Presentation (PDF)

  • In mid-March joined an Israeli effort to build a COVID-19 knowledge graph and shortly thereafter received permission to make this available for open use
  • Objective of the graph is to address the lack of connections between research questions and the large amount of data that’s being published on the virus
    • Wanted to take all the available data sources, build a knowledge graph and make it freely accessible to analysts, data scientists and researchers so that they can query it in SQL, Apache Spark, Python and R
  • Steps taken to build the timbr COVID-19 Knowledge Graph
    • Curate a list of all data sources and research around COVID-19
    • Crawl all the relevant datasets
    • Clean and transform each dataset
    • Generate an ontology based on the dataset
    • Map the data to the ontology
    • Curate the ontology by building more hierarchies and adding ontological rules
    • Give access to researchers and data scientists
  • All of the crawlers are written in Python, with Apache Airflow used to schedule daily dataset updates
  • Once in the data lake they can use Apache Spark to transform and clean more complex datasets, or use Trifacta to load the data to SQL Server
  • timbr connects to existing databases and applies an ontology layer on top of them
  • timbr, a virtual layer that operates in standard SQL, transforms existing databases into knowledge graphs
    • Allows seamless integration with business intelligence tools
  • timbr does all of this by having implemented semantic web principles in SQL, and is fully compatible with RDF
  • Current COVID-19 concepts in their graph include weather station data, online and TV news about the virus, global personal COVID-19 cases and patients
    • In an example, Amit generated a visualization of COVID-19 cases by country, by age, ordering the chart by age to make it easy to see countries where the average age of patients was high (the highest being Saudi Arabia, 68.0) and low (the lowest being India, 21.0)
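
The general idea of layering concepts over an existing table and querying them in ordinary SQL can be sketched with Python's built-in `sqlite3` module. This is not timbr's syntax or implementation: the table, the views standing in for concepts and the rows are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up source table standing in for a crawled, cleaned dataset.
    CREATE TABLE raw_cases (
        case_id INTEGER PRIMARY KEY, country TEXT, age INTEGER, hospitalized INTEGER
    );
    INSERT INTO raw_cases VALUES
        (1, 'Country A', 70, 1), (2, 'Country A', 60, 0),
        (3, 'Country B', 20, 0), (4, 'Country B', 30, 1);

    -- A view acting as a concept mapped onto the source table.
    CREATE VIEW covid_case AS
        SELECT case_id, country, age, hospitalized FROM raw_cases;

    -- A narrower concept defined on top of the broader one.
    CREATE VIEW hospitalized_case AS
        SELECT case_id, country, age FROM covid_case WHERE hospitalized = 1;
""")

# Average patient age per country, ordered from highest to lowest.
rows = conn.execute(
    "SELECT country, AVG(age) AS avg_age FROM covid_case "
    "GROUP BY country ORDER BY avg_age DESC"
).fetchall()
hospitalized_count = conn.execute("SELECT COUNT(*) FROM hospitalized_case").fetchone()[0]
print(rows, hospitalized_count)
```

Because the concepts are exposed as relations, any tool that speaks SQL can query them without knowing how the underlying tables are laid out.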

COVID❋GRAPH Project

A visualization from the COVID❋GRAPH Project

Martin Preusse, Kaiser & Preusse
Recording (at 1:27:19) | COVID❋GRAPH Project

  • They kind of do what everyone else has done, Martin says, in creating a knowledge graph that collects datasets, in this case built in Neo4j
  • About mid-March Martin started to play with case data from Johns Hopkins University, putting it in a graph, and the graph has grown from there
  • There are now about 40 people involved in this initiative, “essentially the European Neo4j community”
  • This work is not merely to collect data, but to build applications in direct communication with end users
    • They have relationships with the German Center for Diabetes Research, with clinical researchers working on COVID-19 right now, and with pharmaceutical companies working on COVID-19
    • They have built applications on top of their knowledge graph to provide value to researchers working on COVID-19 and related topics
    • Their first prototype application is available now, which allows users to search for genes and see articles and patents about them
  • They broke down their patent and paper datasets into fragments, and then connected them to the gene molecular world by searching
  • The gene symbols are widely-used, and these symbols are connected to gene transcript process IDs
  • Next thing they want to load is GTEx, a gene expression database “that can say, essentially, a certain gene is expressed in a certain tissue in humans”
  • The project GitHub can be found here
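
Linking text fragments to genes by searching for gene symbols can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the fragments are invented placeholder text, the symbol list is a two-item sample, and matching is exact and case-sensitive on whole tokens.

```python
import re

# Small sample of gene symbols; a real run would load a full symbol list.
GENE_SYMBOLS = {"ACE2", "TMPRSS2"}

# Hypothetical fragments: (document id, section, text).
FRAGMENTS = [
    ("paper-1", "abstract", "This fragment mentions ACE2 and TMPRSS2."),
    ("patent-1", "claims", "A claim that refers to ACE2 twice: ACE2."),
    ("paper-2", "body", "No gene symbols here."),
]

TOKEN = re.compile(r"[A-Za-z0-9]+")


def link_fragments(fragments, symbols):
    """Return one (document, section, symbol) link per symbol found in a fragment."""
    links = []
    for doc_id, section, text in fragments:
        mentioned = sorted(set(TOKEN.findall(text)) & symbols)
        for symbol in mentioned:
            links.append((doc_id, section, symbol))
    return links


links = link_fragments(FRAGMENTS, GENE_SYMBOLS)
for link in links:
    print(link)
```

In a graph database each tuple would become a relationship from a fragment node to a gene node, so a search for a gene can return the articles and patents that mention it.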

Other coronavirus-related knowledge graphs and linked data initiatives

Graphs4Good 2020

As noted above and outlined in this blog post of 20 March, Neo4j has invited their users “to join us and the global development community in an effort to unite our skills and bring some good to our world.”

The post goes on to outline how those that have an idea for a project can create a GitHub project, and then register it. It also provides information on how interested parties can contribute to an existing project.

Here is the main hub for this initiative on GitHub, which currently lists 24 public repositories on the subject.

schema.org 7.0+ vocabulary additions

The folks at schema.org have “fast-tracked new vocabulary to assist the global response to the Coronavirus outbreak.” These changes include the addition of a special announcement type where sites can provide things like information on school closures and testing locations, much new vocabulary to better represent online events, and new types of places.

For more information you can check out my recent post on these changes.

The Kahun COVID-19 Knowledge Graph

As described in this article, the Israel-based medical technologies startup Kahun, “has released a new COVID-19 tool designed to help doctors make quicker decisions.”

The freely available tool is powered by an AI engine, which was fed more than 2,000 papers and articles sourced from the medical library PubMed. It allows doctors and researchers to devise a score from various symptoms to decide if a patient is at high risk of moving to a critical stage.

You can find the Kahun COVID-19 Knowledge Graph here.

Manipulating Kaggle-provided COVID-19 data

In this Medium article Yiwen Lai shares “some ideas on how to work on COVID-19 data provided by Kaggle.”

The data.world Coronavirus (COVID-19) Data Resource Hub

The folks at data.world have created a resource hub where they’ve “collected the most up-to-date and trusted open data related to COVID-19.”

The ThinkData Works COVID-19 data repository

In this blog post, Cheryl To discusses how data may be used to combat COVID-19, and goes on to describe how the team at “ThinkData Works has collected a massive repository of COVID-19 data and made it accessible through Namara.” You can find these Namara-accessible resources linked at the bottom of the post.

Aaron Bradley

Aaron Bradley is Knowledge Graph Strategist at Electronic Arts, and chief cook and bottle washer at The Graph Lounge.
