No, the title of this post isn’t some SEO ploy. Despite now being more than a year old, few have heard of this Google-led initiative. And that’s a pity, because some interesting work has been undertaken there.
I use the past tense because it’s not clear to what degree some parts of datacommons.org are a going concern. And I say “parts” because, as we’ll see, the Fact Check Markup Tool Data Feed is being continually updated.
As you’ve probably determined, datacommons.org isn’t a single thing, but rather a collection of data and tools. In broad strokes these comprise a knowledge graph linking data from a number of open sources, supporting schemas, an API to access data in the graph, a graph browser, and a fact-check dataset.
Let’s take a look at each of these in turn.
The Data Commons “Open Knowledge Graph” (OKG)
The Data Commons home page does a pretty good job of summarizing the nature of their knowledge graph:
Data Commons attempts to synthesize a single “Open Knowledge Graph” (OKG) from these different data sources. It links references to the same entities (such as cities, counties, organizations, etc.) across different datasets to nodes on the graph, so that users can access data about a particular entity aggregated from different sources without data cleaning or joining.
The “different data sources” referenced are data from open sources such as the US Census Bureau and the Centers for Disease Control. At the time of writing I see 29 different sources for datasets used by the Commons, mostly United States government agencies.
The sources are quite varied, and some are quite large – like the C.V. Starr Virtual Herbarium, which has data on more than 4,000,000 digitized plant specimens.
schema.datacommons.org
The Data Commons has its own schema, which effectively extends schema.org with additional vocabulary, although it frames this vocabulary simply as “additional schemas” rather than explicitly as an extension to schema.org (and, indeed, the home page notes that “At this stage many of the terms [are] still being refined and are not proposed for wider adoption”).
This vocabulary allows the Data Commons Graph to synthesize data from disparate sources by the addition of types and properties that are required to make sense of the source data, and to connect it to other things in the graph.
For example, the addition of the CriminalActivities type, the crimeType property and members of the FBI_CrimeTypeEnum enumeration allows FBI data to be merged into the Graph.
Here that vocabulary facilitates the generation of a chart (via the Graph browser) showing larceny-theft crime in Georgia from 2008 to 2017.
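To make this concrete, here is a speculative sketch of how the data behind such a chart might be retrieved with a SPARQL query (the query interface is described in the API section below). Only CriminalActivities, crimeType and FBI_CrimeTypeEnum come from the Data Commons schema as discussed above; geoId/13 as Georgia’s identifier, FBI_LarcenyTheft as the relevant enumeration member, and the StatisticalPopulation modelling are my illustrative assumptions.

```python
# Speculative sketch: finding larceny-theft crime populations in Georgia
# via the Python wrapper's SPARQL support. Property and enum names not
# given in the post (location, populationType, FBI_LarcenyTheft,
# geoId/13) are guesses for illustration.
import datacommons as dc

dc.set_api_key('YOUR_API_KEY')  # placeholder

query = '''
SELECT ?pop
WHERE {
  ?pop typeOf StatisticalPopulation .
  ?pop populationType CriminalActivities .
  ?pop crimeType FBI_LarcenyTheft .
  ?pop location geoId/13
}
'''
for row in dc.query(query):
    print(row['?pop'])
```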
The Data Commons API
“The Data Commons API”, says the API documentation, “is a set of APIs that allow developers to programmatically access the data in the Data Commons graph. This access is provided through a set of REST APIs, with additional wrappers in Python and R.”
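As a flavor of what this looks like in practice, here is a minimal sketch of local node exploration using the Python wrapper (pip install datacommons). The API key is a placeholder; geoId/06 is the Data Commons identifier (dcid) for California, and the output shapes shown in the comments are assumptions.

```python
# A minimal sketch of local node exploration via the Python wrapper.
import datacommons as dc

dc.set_api_key('YOUR_API_KEY')  # placeholder

# What name does the graph record for the node 'geoId/06' (California)?
names = dc.get_property_values(['geoId/06'], 'name')
print(names)  # expected shape: {'geoId/06': ['California']}

# Which counties does the graph place within California?
counties = dc.get_places_in(['geoId/06'], 'County')
print(counties['geoId/06'][:5])  # first five county dcids
```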
The APIs allow for local node exploration, domain-specific queries and SPARQL queries. The domain-specific APIs reflect the fact that “[a] substantial amount of the data in Data Commons is statistical in nature. The representation of this information uses the concept of a StatisticalPopulation with Observations on these populations.”
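The StatisticalPopulation/Observation pattern can be exercised through the wrapper as well. The sketch below reuses the FBI crime example from the schema section; as before, geoId/13 (Georgia) and FBI_LarcenyTheft are my assumptions, as are the observation date and period, and I’m assuming the wrapper’s get_populations and get_observations helpers behave as shown.

```python
# A hedged sketch of the StatisticalPopulation/Observation pattern.
# The dcids, enum value, and observation parameters are assumptions.
import datacommons as dc

dc.set_api_key('YOUR_API_KEY')  # placeholder

# Find the StatisticalPopulation for larceny-theft in Georgia (geoId/13)...
pops = dc.get_populations(
    ['geoId/13'], 'CriminalActivities',
    constraining_properties={'crimeType': 'FBI_LarcenyTheft'})

# ...then fetch an Observation of its count for a given year.
obs = dc.get_observations(
    list(pops.values()), 'count', 'measuredValue',
    '2017', observation_period='P1Y')
print(obs)  # expected shape: {population dcid: observed count}
```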
The Data Commons Graph Browser
The Graph Browser is “used both to explore what data is available and to understand the graph structure, to help use the API”.
The Browser comes in two flavors. The first is a (chartable) data view.
There is also a version of the browser offering a simple user interface to explore “data about places (zip codes, cities, counties, states) from a variety of sources, including the Census, FBI, Bureau of Labor Statistics, CDC and others.”
Data Commons fact check data and associated tools
The initial blog post introducing datacommons.org described it as a “schema.org-like initiative for the open sharing of data”, and noted that they were starting “with a dataset aimed at helping us understand the characteristics of misinformation.”
This initial release provided metadata for “a sample of fact checks from a number of different sources”, generated from sites employing schema.org/ClaimReview markup.
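For readers unfamiliar with that markup, here is a minimal, hypothetical example of the schema.org/ClaimReview JSON-LD a fact-checking site might embed in a page; the URL, names, claim and rating are invented for illustration.

```json
{
  "@context": "https://schema.org",
  "@type": "ClaimReview",
  "url": "https://factchecker.example.com/reviews/seawater-cold-cure",
  "claimReviewed": "Drinking seawater cures the common cold.",
  "itemReviewed": {
    "@type": "Claim",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2019-06-01"
  },
  "author": {"@type": "Organization", "name": "Example Fact Checker"},
  "reviewRating": {
    "@type": "Rating",
    "ratingValue": "1",
    "bestRating": "5",
    "worstRating": "1",
    "alternateName": "False"
  }
}
```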
By October 2018, Data Commons reported that the release of the sample dataset had stimulated further interest in the study of misinformation, and that they had “received requests from academia to update the fact check corpus regularly and allow more publishers and non-technical users to add ClaimReview markups.”
These requests were satisfied by the introduction of a suite of Google-based fact check tools: a fact check explorer providing a simple interface through which a user can search “fact check results from the web about a topic or person”; a fact-check markup tool (branded as “in partnership with datacommons.org”) that allows fact checks to be submitted without the need for ClaimReview markup; and APIs that allow the reading, writing and searching of ClaimReview data.
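As a quick illustration of the search side of those APIs, here is a minimal sketch querying the Fact Check Tools claims:search endpoint with Python’s requests library; the query term and API key are placeholders.

```python
# A minimal sketch of searching ClaimReview data via the Fact Check
# Tools API's claims:search method.
import requests

resp = requests.get(
    'https://factchecktools.googleapis.com/v1alpha1/claims:search',
    params={'query': 'climate change', 'key': 'YOUR_API_KEY'})
resp.raise_for_status()

# Each returned claim carries one or more ClaimReview records.
for claim in resp.json().get('claims', []):
    rating = claim['claimReview'][0].get('textualRating')
    print(claim.get('text'), '->', rating)
```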
In March of 2019 markup from the ClaimReview Read/Write API was made “available to Data Commons, joining all ClaimReview markup data created via the Fact Check Markup Tool.”
In summary
While Data Commons was originally conceived of as a means to “open up fact check data”, I’ve put the emphasis on its efforts to synthesize data from different open sources because I think this is an interesting application of one of the most compelling use cases for constructing knowledge graphs: providing a means by which data from heterogeneous sources can be meaningfully combined.
As noted earlier, aside from the regularly-updated Fact Check Markup Tool Data Feed, it’s not clear just how much the site or its “Open Knowledge Graph” is being worked on, and whether or not the datasets used in the OKG will be updated.
All in all the site has an experiment-y, proof-of-concept feel to it. If that’s what it is, the experiment around fact checks appears to have paid off, both by expanding the means by which fact checks can be provided to consuming search engines and applications, and by the greater number of fact checks the Markup Tool and Read/Write API have made available.
And if the Data Commons Open Knowledge Graph doesn’t end up being maintained, it has at the very least shown how data of disparate lineage can be meaningfully combined and queried through the use of semantic technologies.
Aaron Bradley is Senior Structured Content Architect at Telus Digital, and chief cook and bottle washer at The Graph Lounge.