COVID-19 Testing Without a COVID Data Commons is Throwing Away Valuable Data

Robert Grossman - Data Scientist at the University of Chicago, Director of the Open Commons Consortium

We Need Three Types of Large-Scale COVID-19 Testing

The importance of COVID-19 testing is now more than clear to everyone, but what may not be so obvious is what type of testing is required, what the tests tell us, and what we can do with the data.

Large scale, robust testing needs to answer three critical questions to inform an appropriate public health response:

1. Who has COVID-19 now? This is of course critical so that these people can be isolated to reduce the spread of COVID-19 and so that public health officials have accurate data for the planning that they need to do.

2. Who has had COVID-19 in the past, perhaps without showing any symptoms or just mild symptoms? Knowing the answer to this question is important for determining who can go back to work.

3. How is the COVID-19 virus mutating? Mutations can create new viruses that require new quarantine measures, new mitigation strategies, new treatments, new drugs and drug regimens, and new vaccinations.


We Also Need A COVID-19 Commons

But we also need more than testing. We need a national-scale data platform to provide the information decision makers need.

National testing for COVID-19 without a national system for managing COVID-19 related data is a missed opportunity. Answering the critical questions above, and using the answers to populate a national data system that makes aggregate data publicly accessible to all stakeholders and decision makers, will be essential to managing the different options for mitigating the effects of the pandemic (cancelling events, closing schools, sheltering in place, etc.), deciding when some of these can be relaxed, and deciding when mitigation actions are required again.

Modern systems for managing data can interoperate so that multiple systems can work together to provide the information that clinical research, public health, and local, state and national decision makers need. This means that we don’t need to build one single system, which would raise not only privacy concerns but also efficiency concerns, but rather a data ecosystem of systems that share information through well-defined interfaces and security and compliance rules. Often, data can be analyzed and questions about an individual’s infection can be answered without exchanging personally identifying information among all the systems, by using technologies such as cryptographic hashing.
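As a rough illustration of the hashing idea, the sketch below shows how two systems could refer to the same person through a keyed hash instead of exchanging a name or birth date. It is a minimal sketch under stated assumptions, not a description of any deployed system: the shared key, the choice of fields, and the names normalize and pseudonym are all hypothetical, and a real deployment would need careful key management and a more robust record-matching scheme.

```python
import hashlib
import hmac

# Hypothetical shared secret (an assumption for this sketch); in practice a
# trusted party would generate and manage it, and it would never be hard-coded.
SECRET_KEY = b"example-key-not-for-production"


def normalize(name: str, birth_date: str) -> bytes:
    """Canonicalize identifying fields so the same person yields the same bytes."""
    return f"{name.strip().lower()}|{birth_date.strip()}".encode("utf-8")


def pseudonym(name: str, birth_date: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) that stands in for the identity."""
    return hmac.new(key, normalize(name, birth_date), hashlib.sha256).hexdigest()


# System A (say, a testing lab) stores results under pseudonyms only.
lab_results = {pseudonym("Jane Doe", "1980-01-31"): "positive"}

# System B (say, a clinic) computes the same pseudonym locally and looks up
# the result without sending the name or birth date to System A.
token = pseudonym("  jane doe ", "1980-01-31")
result = lab_results.get(token)
```

A keyed hash is used here rather than a plain hash because names and birth dates are easy to guess, so an unkeyed digest could be reversed by trying candidates.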

A national data ecosystem for COVID-19 testing and related data has three important benefits. First, it can lead to better containment and mitigation strategies. Second, given the fact that in practice only some of the population will be tested, especially at the beginning, analyzing all of the data at large scale is necessary to see sometimes subtle patterns in how patients respond to treatments and drugs that are critical to improving health outcomes and may be missed otherwise. Third, testing is the best way to see new emerging hot spots, and to know when hot spots have cooled enough to restart, or partly restart, economic activity.

In an era in which data is the new oil and AI is the new factory, a good general rule of thumb when generating data through testing is to reserve 15% to 25% of the total funds for building and operating data platforms that can clean, manage, curate and analyze the data and make it available through an ecosystem of applications to healthcare providers, public health officials, public and private decision makers, and other stakeholders. These platforms are often called data commons (see below) because they provide a broad community and societal benefit when properly used.

It is also critical that the data be made available to everyone who provided it: both individual test results returned to those tested and aggregated data about the status of the neighborhoods where they live, work, and visit.


Some Background

Understanding COVID-19 tests requires understanding a few simple facts about COVID-19 and how new infectious diseases are studied.

First, different tests are used to determine whether you have an active COVID-19 infection (PCR tests) versus whether you have recovered from a COVID-19 infection and now have antibodies that will help keep you from getting COVID-19 in the future (serological tests).

Second, the COVID-19 virus, like all viruses, changes over time through mutations. A third type of test is used to determine whether the COVID-19 virus has mutated (molecular sequencing tests, which are also called next generation sequencing). Detecting when a virus has mutated is important since with enough mutations, new tests are required, new drugs and treatments are required, and new vaccinations are required.

Third, when a patient is sick with a well-studied disease, a doctor examines the test results of a single patient and can then make decisions about the best treatment. In contrast, with a new disease that is quickly infecting and impacting large populations, a different approach is needed. Patterns and trends in populations of individuals must be understood, including hot spots of new and emerging infections and the impact of different mitigation strategies. This is the role of epidemiologists and others involved in public health. It also requires a platform for collecting and analyzing all the necessary data.

Fourth, there are two quite different methodologies used in the analysis of public health data. The first methodology requires personally identifying information (PII). An example of this is contact tracing, in which all the people with whom an infected patient has had contact, and who may have been infected, are contacted and tested. These days, there are many new approaches to contact tracing being explored that involve smart phones, surveillance cameras, and other technologies. The second methodology only requires deidentified data and looks at aggregate data that doesn’t contain any PII. This analysis can be done at different levels, from neighborhoods or communities, to cities or portions of cities, to counties and states. Using deidentified data and aggregate data without any PII is often all that is needed to give local, state and national decision makers much of the information they need to make decisions about mitigation strategies and return-to-work strategies.


Donating Your Data

Population level data can be gathered in different ways.

One way is for citizens to use mobile and social apps to knowingly and voluntarily contribute data that includes whether they are well and, if not, their symptoms, along with information about their location. Information like this is absolutely critical for using location analytics to detect hot spots and emerging hot spots of COVID-19 activity.

It is important to note that data provided this way is in general not regulated as healthcare data, unless the application is operated by a healthcare provider or used as part of a user receiving healthcare or taking part in a healthcare-related research project.

Valuable data can be collected in this way with questions as simple as: “Do you have any flu-like symptoms, and how bad are they (green, yellow, red)?” Although this data would be incredibly noisy, when it is examined in aggregate at scale, changes in it can be detected with statistical significance. Collecting this data could be gamified and the status (green, yellow, red) of the neighborhood could be provided, and this could all be done anonymously. Some of the greatest impacts would come from answering questions such as:

Is there an emerging hot spot that requires social distancing or other mitigating actions?

Is a neighborhood or region in good enough shape that social distancing can be relaxed?
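To make the point about noisy data and scale concrete, here is a minimal sketch of one standard way to check whether the share of "red" reports in a neighborhood has risen from one week to the next: a two-proportion z-test. The choice of test, the function names, and the counts are all illustrative assumptions, not figures or methods from any real app.

```python
import math


def two_proportion_z(red_before: int, n_before: int,
                     red_after: int, n_after: int) -> float:
    """z statistic for a change in the share of 'red' reports between two periods.

    Assumes both periods have reports and the pooled share is strictly
    between 0 and 1 (otherwise the standard error is zero).
    """
    p_before = red_before / n_before
    p_after = red_after / n_after
    pooled = (red_before + red_after) / (n_before + n_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    return (p_after - p_before) / std_err


def one_sided_p_value(z: float) -> float:
    """Probability that a standard normal variable is at least z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts: the share of red reports rises from 2.0% to 2.6%.
z_large = two_proportion_z(200, 10_000, 260, 10_000)
p_large = one_sided_p_value(z_large)

# A similar-sized shift with only 100 reports per week is far less convincing.
z_small = two_proportion_z(2, 100, 3, 100)
p_small = one_sided_p_value(z_small)
```

With ten thousand reports per week the small rise stands out clearly from the noise, while with a hundred reports a comparable shift does not, which is why collecting this kind of data at scale matters.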


Other Ways That Your Data Is Collected

Another way that population-scale personal and location data is often collected is as a byproduct of using online and mobile apps. Often this is done with a simple click-through agreement, and the person providing the data is not aware of exactly what data is being collected and how it is being used. This type of information can be collected both with personally identifying information (PII) and without it, if the appropriate privacy-preserving and deidentification mechanisms are used. Balancing the utility of the information collected against the privacy guarantees provided is essential for this type of non-traditional data collection.


A COVID-19 Commons

A commons is a natural, cultural or digital resource accessible to all members of a community, or more broadly of a society. Examples include a pasture for animals to graze in a village, a dog park for dogs in a city neighborhood, or natural resources such as air and water for society in general. These resources are held in common, through a partnership, a not-for-profit, or other entity, but not owned privately for commercial gain.

A particularly important type of commons is the digital commons containing data. A specific type of digital commons is the data commons, which projects and communities use to create open digital resources that accelerate the rate of discovery and increase the impact and the benefits of the data they hold. More formally, data commons are software platforms that co-locate: 1) data, 2) cloud-based computing infrastructure, and 3) commonly used software applications, tools and services to create a resource for managing, analyzing, integrating and sharing data with a community [1].

An important step for a region to mitigate COVID-19 and understand back-to-work timing is to create a regional data commons for COVID-19 data.

Over the last several years, technology has been developed so data commons can interoperate and share data in safe and compliant ways. In this way, a national COVID-19 data commons can be developed that brings together some of the data in regional COVID-19 data commons, while leaving some of the more sensitive data in place. Even with some of the more sensitive data remaining in place, this information can still be used through what is called a federated analysis, in which data is left in place in a commons, an analysis is done over the data in each commons, and the results are returned to provide an integrated picture. In this way, data from multiple regional data commons can be used to provide a national view of COVID-19 related issues.
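The federated pattern described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the interface of any actual data commons platform: the RegionalCommons class, the record layout, and the region names are hypothetical, and a real system would add authentication, compliance checks, and safeguards such as suppressing very small counts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RegionalCommons:
    """Hypothetical stand-in for a regional commons; records never leave it."""
    name: str
    records: List[dict]

    def run_local_analysis(self) -> Dict[str, int]:
        """Compute aggregate counts locally and return only the summary."""
        tested = len(self.records)
        positive = sum(1 for r in self.records if r["result"] == "positive")
        return {"tested": tested, "positive": positive}


def federated_positivity(commons_list: List[RegionalCommons]) -> float:
    """Combine per-commons summaries into a single national positivity rate."""
    totals = {"tested": 0, "positive": 0}
    for commons in commons_list:
        summary = commons.run_local_analysis()  # only aggregates are returned
        totals["tested"] += summary["tested"]
        totals["positive"] += summary["positive"]
    return totals["positive"] / totals["tested"]


# Made-up example data for two regions.
midwest = RegionalCommons(
    "midwest",
    [{"result": "positive"}] + [{"result": "negative"}] * 3,
)
northeast = RegionalCommons(
    "northeast",
    [{"result": "positive"}] * 2 + [{"result": "negative"}] * 4,
)

national_rate = federated_positivity([midwest, northeast])
```

The coordinating function only ever sees the per-region counts, so the individual records stay where they were collected while still contributing to the national view.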


Changing the Regulatory Framework

Finally, we note that this is an important opportunity to revisit the regulatory framework for the different types of data discussed above. First, sharing of healthcare data is essential to working through this crisis. Yet under HIPAA and related regulations, an organization that shares healthcare data to improve health outcomes takes on only risk, including the threat of serious fines if a breach later occurs. This dramatically reduces what can be done to improve health outcomes. This is a good time to change the policy so that if healthcare data is shared for non-commercial purposes to improve research or health outcomes, best practices are followed, and there is no serious or gross negligence, then fines are dramatically reduced or eliminated if some exposure of data later occurs.

Another important regulatory opportunity concerns data that individuals contribute or donate to a non-commercial project designed to improve health outcomes or otherwise benefit society. The framework could clarify that such data is not healthcare data, while providing some basic privacy protection for those contributing it and safe harbor protections for those using it.



[1] Robert L. Grossman, Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data, Trends in Genetics 35, 2019, pages 223–234. PMID: 30691868 PMCID: PMC6474403

[2] Robert L. Grossman, Lauren C. Leiman, and the BloodPAC Policy and Reimbursement Working Group, The COVID-19 Testing We Need, Medium, April 2, 2020.