Cloud-based Sandbox for Best Practices in Clinical Machine Learning (ML)

Project Description

This project continues with a specific application to the National COVID Cohort Collaborative (N3C) initiative. The CD2H goals for the pre-N3C ML sandbox are being used to serve the needs of N3C. ML best practices include addressing missing data, selecting features, detecting overfitting and underfitting, comparing ML approaches, and supporting clinical interpretation. The team will implement these approaches within the N3C environment and provide standard operating procedures and instructions. The team will also develop approaches to detect and mitigate racial bias in ML.
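One simple way to look for group-level bias in a classifier is to compare true positive rates across demographic groups. The sketch below is illustrative only and is not the project's method: the function names, the group labels "A" and "B", and the toy labels are all assumptions made for this example, and the data are made up rather than drawn from N3C.

```python
from collections import defaultdict


def true_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: TPR}, computed over records whose true label is 1."""
    positives = defaultdict(int)  # records with true label 1, per group
    hits = defaultdict(int)       # of those, records predicted as 1
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                hits[group] += 1
    # Groups with no positive cases are omitted (their TPR is undefined).
    return {g: hits[g] / positives[g] for g in positives}


def equal_opportunity_gap(tpr_by_group):
    """Largest TPR difference between any two groups (0 means parity)."""
    rates = list(tpr_by_group.values())
    return max(rates) - min(rates)


# Toy, made-up labels purely for illustration (hypothetical groups A and B).
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

tpr = true_positive_rate_by_group(y_true, y_pred, groups)
gap = equal_opportunity_gap(tpr)
print(tpr, gap)
```

A large gap suggests the model misses true cases more often in one group than in another. It is a signal for further review, not proof of bias on its own, since small groups give noisy rates.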

A sandbox is an isolated testing environment that enables users to run programs or execute files without affecting the application, system, or platform on which they run. The sandbox allows developers to test code and determine how best to use a given tool.

This sandbox project is designed to create a best practices platform for deploying and evaluating clinical machine learning tools and algorithms. Developed in collaboration with the CTSA community, the platform will provide community-vetted solutions to common challenges in data preparation, state-of-the-art machine learning algorithms, analysis of sources of bias, and evaluation/validation (e.g., as a collection of open-source Python libraries and Jupyter notebooks).
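As an illustration of the data preparation challenges mentioned above, the following minimal sketch handles missing values by median imputation. It assumes records are plain dictionaries with `None` marking a missing value. The column names (`age`, `crp`) and helper names are hypothetical and chosen only for this example. Medians are fitted on the training rows alone, so no information from the test rows leaks into preprocessing, and a flag column records where a value was missing.

```python
from statistics import median


def fit_medians(train_rows, columns):
    """Compute per-column medians from training rows only, ignoring None.

    Raises statistics.StatisticsError if a column has no observed values.
    """
    medians = {}
    for col in columns:
        observed = [row[col] for row in train_rows if row[col] is not None]
        medians[col] = median(observed)
    return medians


def impute(rows, medians):
    """Fill None with the fitted medians and add a <column>_missing flag."""
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        for col, fill in medians.items():
            new_row[col + "_missing"] = int(row[col] is None)
            if row[col] is None:
                new_row[col] = fill
        imputed.append(new_row)
    return imputed


# Made-up example records (hypothetical columns, not N3C data).
train = [
    {"age": 40, "crp": 5.0},
    {"age": None, "crp": 12.0},
    {"age": 60, "crp": None},
    {"age": 50, "crp": 8.0},
]
test = [{"age": None, "crp": None}]

medians = fit_medians(train, ["age", "crp"])
test_imputed = impute(test, medians)
print(medians, test_imputed)
```

Keeping the missingness flag lets a downstream model learn from the fact that a value was absent, which can itself carry clinical meaning.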

GitHub Repository

Onboard to CD2H

Tools & Cloud Infrastructure Core community meetings occur on the last Tuesday of each month at 12 pm PT/3 pm ET. Contact data2health@gmail.com for a meeting invitation.

Project Leadership

Peter Robinson, MD, MSc

Jackson Laboratory

Co-Program Director

Project Cores

Tools & Cloud Infrastructure

Creating cloud compute infrastructure for shareable, scalable dissemination and execution of tools across CTSA hubs