Data Discovery Engine

Project Description

To make data more FAIR (findable, accessible, interoperable, reusable), metadata must flow efficiently. Streamlining this flow makes it easier and faster to capture metadata and minimizes repetitive effort. The Data Discovery Engine uses a "schema" playground to drive this work. This project is intended to provide a complete infrastructure ecosystem to disseminate and consume structured metadata using schema.org as the sharing mechanism.

The term "schema" refers to the organization of data as a blueprint and can be thought of as the graphical depiction of the database structure. A schema specifies the facts that can be entered into a database, or those that may be of interest to possible end-users. A schema also contains classes and properties; each class can be thought of as a 'type' with its own associated set of properties. The Data Discovery Engine registry includes classes from schema.org, which provides an extensive collection of schema classes, in addition to schema classes provided by the research community.
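As a rough illustration of classes and properties, the sketch below builds one metadata record in Python and serializes it as JSON-LD. The "@type" value names the schema.org class (Dataset), and the other keys are properties usable with that class. All values are placeholders, and the helper name to_jsonld is an assumption made for this example, not part of the Data Discovery Engine.

```python
import json

# A metadata record: "@type" names the schema.org class, and the remaining
# keys are properties usable with that class. All values are placeholders.
dataset_record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example cohort study",
    "description": "Placeholder description of a hypothetical dataset.",
    "identifier": "example-0001",
    "keywords": ["example", "placeholder"],
}


def to_jsonld(record: dict) -> str:
    """Serialize a metadata record as a JSON-LD string (hypothetical helper)."""
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_jsonld(dataset_record))
```

Embedding a string like this in a web page is the usual way structured metadata of this kind is exposed to search engines and other consumers.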

Use of a schema is important, as it offers a way to logically group objects such as tables, views, and stored procedures. In addition to providing structure to a database, associating data with an existing schema, or with one extended from it, makes the data interoperable. Deriving a new schema from an existing one contributes to data reusability and so helps make data more FAIR.
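One way to picture a schema extended from an existing one is as a class that inherits its parent's properties and adds a few of its own. The sketch below models that idea under stated assumptions: SchemaClass, ExampleClinicalDataset, and studyPhase are invented for illustration and do not describe how the Data Discovery Engine registry stores schemas.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SchemaClass:
    """A schema class: a named type with its own properties and an optional parent."""
    name: str
    properties: set = field(default_factory=set)
    parent: Optional["SchemaClass"] = None

    def all_properties(self) -> set:
        """Return this class's properties plus everything inherited from ancestors."""
        inherited = self.parent.all_properties() if self.parent else set()
        return inherited | self.properties


# A small subset of the properties usable with schema.org's Dataset class.
dataset = SchemaClass("Dataset", {"name", "description", "identifier"})

# Hypothetical community extension; the class and property names are invented.
clinical_dataset = SchemaClass(
    "ExampleClinicalDataset", {"studyPhase"}, parent=dataset
)

if __name__ == "__main__":
    print(sorted(clinical_dataset.all_properties()))
```

Because the derived class keeps every property of its parent, a consumer that only understands the parent class can still read the shared fields, which is the interoperability benefit described above.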

A sufficient set of cloud-hosted utilities will be provided to extend the success of consumer-space search engines (such as Google and Bing) to biomedical use cases from the CTSA community. In Phase II, we developed a set of metadata authoring widgets and a schema playground. Both focus on helping data providers construct metadata that follows schema.org best practices, to maximize its discoverability. In Phase III, we plan to streamline the path from distributed metadata to the data portals that consume it by building:

  • a metadata crawler to harvest the distributed metadata in real time;
  • a centralized application programming interface (API) to deliver harvested metadata to multiple downstream data portals (e.g., CTSASearch); and
  • a “metadata-plus” web application to convert non-schema.org compatible websites (e.g., GEO) to compatible ones.
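The harvesting step in the list above can be sketched as follows. This is a minimal illustration, not the project's crawler: it assumes pages embed their metadata as JSON-LD inside script elements of type application/ld+json (a common schema.org convention), it works on an HTML string instead of fetching pages over the network, and the names JsonLdExtractor, harvest, and SAMPLE_PAGE are invented for the example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the harvest


def harvest(html: str) -> list:
    """Return every JSON-LD record embedded in one HTML page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Placeholder page standing in for a data provider's landing page.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body><p>Landing page text.</p></body></html>
"""

if __name__ == "__main__":
    for record in harvest(SAMPLE_PAGE):
        print(record["@type"], "-", record["name"])
```

A centralized API could then serve the harvested records to downstream portals; how the actual service stores and exposes them is not specified here.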

These new utilities provide the incentives needed to promote the schema.org mechanism, in the hope of forming a positive feedback loop between data providers and their data consumers. Data Discovery Engine schemas need to continue evolving together with the community, and more metadata is needed from consumers to demonstrate those incentives. The harvesting of structured dataset metadata will provide programmatic access for portal developers.

EDUCATION MATERIALS:

Data Discovery Engine recorded webinar 2.20.2020

Data Discovery Engine Slide Deck_2.20.2020

CTSA NIAD Portal Slide Deck_2.20.2020

iTHRIV Portal

Archived projects do not have active meetings; however, Informatics Maturity & Best Practices Core community meetings occur on the 3rd Thursday of the month at 10 am PT/1 pm ET. Contact data2health@gmail.com for a meeting invitation.

Project Leads

Project Cores