Cloud-based Sandbox for Analytics (Natural Language Processing)

Project Description

This project continues with specific application to the National COVID Cohort Collaborative (N3C) initiative. The NLP sandbox is a platform where NLP stakeholders can meet and collaborate to continuously benchmark NLP tools on public and private data hosted at multiple data sites. The N3C-specific NLP sandbox launched in November 2020 to create annotations for dates, person names, and physical addresses. In 2021, new NLP tasks are being added on an ongoing basis. More information can be found at: Sage-Bionetworks/nlp-sandbox-schemas.
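To make the idea of benchmarking an annotation tool concrete, the following is a minimal sketch of how predicted annotations might be compared against a reference set. It is illustrative only: the Annotation class, its fields, the score function, and the sample note are hypothetical and are not taken from the NLP sandbox schemas or its evaluation code, which may define annotations and metrics differently. The sketch assumes exact-match scoring, where a prediction counts only if its offset, length, and type all agree with a reference annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical text-span annotation (not the sandbox's actual schema)."""
    start: int   # character offset of the span in the note
    length: int  # number of characters in the span
    kind: str    # e.g. "date", "person_name", "physical_address"


def score(gold, predicted):
    """Return exact-match precision, recall, and F1 for two annotation lists."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up clinical note used only for illustration.
note = "Seen on 2020-11-03 by Dr. Smith; follow-up on 2020-12-01."

# Reference ("gold") date annotations for the note.
gold = [Annotation(8, 10, "date"), Annotation(46, 10, "date")]

# Output of an imaginary tool: one correct date, one span of the wrong type.
predicted = [Annotation(8, 10, "date"), Annotation(26, 5, "person_name")]

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the tool finds one of the two reference dates and emits one annotation that is not in the reference set, so precision, recall, and F1 are each 0.5. A real benchmark would likely also consider partial overlaps and per-type scores.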

A sandbox is an isolated testing environment that enables users to run programs or execute files without affecting the application, system, or platform on which they run. The sandbox allows developers to test programming code for optimal use of the tool. 

This sandbox project is a continuation of Phase II collaborative work with the Informatics Enterprise Committee (iEC) working group that aims to deploy a suite of natural language processing (NLP) tools and to establish evaluation measures, evaluation tools, and best practices. The ability to share and compare methods for text analytics in support of clinical and translational research is a critical need in the biomedical community. In response to this need, the project will establish a cloud-based sandbox environment in which CTSA hubs can develop, evaluate, and share tools and methods. Our objectives are to: (1) reduce redundancies in these efforts and increase economies of scale across the CTSA network, (2) ensure the reproducibility and rigor of assessment tools and methods, and (3) expedite access to “best-of-breed” tools and methods by all CTSA network participants and partners. The project has three specific aims:

  1. To create a cloud-based environment that enables the systematic verification and validation of text analytics tools for specific tasks.
  2. To populate the “text analytics sandbox” with the necessary and appropriate reference datasets for use in shared verification/validation tasks.
  3. To demonstrate the “text analytics sandbox” by engaging a group of CTSA hubs to contribute tools and methods and by demonstrating their performance, reproducibility, and rigor in a shared environment.

The expected impacts of this work are to (1) improve data-driven recruitment to clinical trials and clinical research, (2) transition real-world data to real-world evidence, (3) create essential infrastructure for learning health systems, (4) create the phenotyping necessary for precision health, and (5) pave the way for artificial intelligence in digital health.

View the NLP Benchmark Proposal that describes stakeholders and identified use cases, as well as architecture.

Tools & Cloud Infrastructure Core community meetings occur on the last Tuesday of each month at 12 pm PT/3 pm ET. Contact for meeting invitation.

Project Leadership

Project Cores

Tools & Cloud Infrastructure

Creating cloud compute infrastructure for shareable, scalable dissemination and execution of tools across CTSA hubs