Directed, cooperative exploration of health data

This is an update and expansion of a post I wrote for the Health 2.0 Developer Challenge earlier this summer. The original is here.

With the recent launch of the Community Health Data Initiative (CHDI) and the emergence of a growing number of health apps, the volume of health data and the variety and number of tools to explore and analyze that data are expanding quickly. But sitting in the audience on June 2 at the HHS/IOM CHDI forum, watching teams demonstrate search and visualization applications for community health data, I found myself thinking about an unrealized opportunity: we lack the means to pool our capabilities and organize cooperative action around health data and its exploration.

What I envisioned while watching the health app demos that day was the inevitability of redundant searches by visitors to these sites. How many thousands of people would map air pollution against asthma prevalence, or obesity against the distribution of parks or grocery stores? On the whole, such engagement is a sizable positive step, encouraging interest in the epidemiology of health, healthcare and the environment. But so far, health apps stop short of delivering tools to create with the data.

As an epidemiologist, what I’m hoping to see emerge is a way to capture the knowledge created by the data exploration carried out by community members. In short, a way to distill scientific value from widespread interactions with health data – if only to help map out dark corners. In fact, there are many health problems for which the data are (vastly) more abundant than the time to analyze them. In most cases, scientists cannot explore all the plausible associations between a single disease and the hundreds – if not thousands – of potentially relevant variables and other diseases in even one dataset, such as NHANES. There is also a scarcity of perspectives in professional science; many discoveries are made as a result of the variability amateurs bring to the puzzles. The topics of searches themselves will reveal community interests and priorities, and the results of those searches – if captured and analyzed in real time – could highlight important data gaps or suggest previously unknown relationships worth investigating.
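
To make the idea of capturing searches a little more concrete, here is a minimal Python sketch of how an app might tally which pairs of variables visitors explore and list the pairs nobody has looked at yet. Everything in it is an assumption for illustration: the variable names, the search log and the helper functions are hypothetical, not part of any CHDI tool or dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical variable names; a real dataset such as NHANES has far more.
VARIABLES = ["asthma", "air_pollution", "obesity", "park_access", "grocery_access"]


def pair_key(a, b):
    """Order-independent key, so (a, b) and (b, a) count as the same search."""
    return tuple(sorted((a, b)))


def tally_searches(search_log):
    """Count how often each pair of variables was explored together."""
    return Counter(pair_key(a, b) for a, b in search_log)


def unexplored_pairs(variables, counts):
    """Return the pairs that no one has examined yet: the 'dark corners'."""
    return [
        pair_key(a, b)
        for a, b in combinations(variables, 2)
        if pair_key(a, b) not in counts
    ]


# Made-up search log, for illustration only.
search_log = [
    ("air_pollution", "asthma"),
    ("asthma", "air_pollution"),
    ("obesity", "park_access"),
    ("obesity", "grocery_access"),
]

counts = tally_searches(search_log)
gaps = unexplored_pairs(VARIABLES, counts)

print("Most explored:", counts.most_common(1))
print("Never explored:", gaps)
```

The tally points to what the community cares about, and the list of unexplored pairs points to where nobody has looked. Neither says anything about whether an association is real; that still takes analysis.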

So, while opening up new horizons of health data and providing tools for people to interact with this information, we can also design parallel tools to encourage popular epidemiology. It is easy to envision a health app that stores graphs and visualizations created by individuals, ties them to their constituent datasets, and provides parallel community discussions. Such an app would aim to capture the emergent, collective analysis of newly available US health data. It should encourage communities, agencies, amateur or citizen scientists, professional scientists, media outlets, and healthcare providers to cooperate on raising and answering specific questions, using our collective cognitive surplus (see Clay Shirky) to mine the new horizons of data and the possibilities that occur in the minds of individuals from every corner and background.

It could be productively modeled on distributed version control systems, like Git, which enable a diverse group of individuals to participate collaboratively in the creation of software. One can imagine a health app where individuals or groups could create repositories focused on a given health research question or topic. These repositories would contain annotated data, methods, tabular and graphical results, documentation, discussions and supporting code. Most importantly, they would have a trajectory that develops as material and knowledge in the repository accumulates. 

For example, a person or group with an interest in examining regional variability in tuberculosis incidence could check out a repository, carry out an analysis, and then write a short note summarizing their results and commit them to the pool. Another individual could review or replicate the same work, see the same results, and vote to underscore its accuracy or importance. A subsequent visitor could review the latest information and decide that a new variable, say air quality, should be included in the analysis. At this point, that person could fork the repository, import relevant air quality data and begin a new branch of research.
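
The tuberculosis example can be sketched in a few lines of Python. This is only an illustration of the commit, vote and fork steps under simplifying assumptions: the Repository and Commit classes, their fields and the analyst names are hypothetical, and this is not how Git itself stores history.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Commit:
    """One contribution: who made it, a short note, and the datasets used."""
    author: str
    note: str
    datasets: List[str]
    votes: int = 0


@dataclass
class Repository:
    """A research question plus the accumulated work on it (hypothetical model)."""
    question: str
    commits: List[Commit] = field(default_factory=list)
    parent: Optional["Repository"] = None

    def commit(self, author, note, datasets):
        """Add a result to the pool and return it so others can vote on it."""
        entry = Commit(author, note, list(datasets))
        self.commits.append(entry)
        return entry

    def fork(self, new_question):
        """Copy the history so a new line of inquiry can diverge from it."""
        history = [
            Commit(c.author, c.note, list(c.datasets), c.votes)
            for c in self.commits
        ]
        return Repository(new_question, history, parent=self)


# Made-up walk-through of the tuberculosis example.
tb = Repository("Regional variability in tuberculosis incidence")
first = tb.commit("analyst_a", "Summary of incidence by region.", ["tb_incidence"])

# A second person replicates the work and votes for it.
first.votes += 1

# A later visitor forks the repository to bring in air quality data.
air = tb.fork("Does air quality relate to regional TB incidence?")
air.commit(
    "analyst_b",
    "Added air quality as a variable.",
    ["tb_incidence", "air_quality"],
)
```

The fork copies the earlier commits rather than sharing them, so the new branch can change course without altering the original repository, while the parent link records where it came from.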

Open data will be more valuable if we also capture and share open methods. One way to seed such analyses would be to encourage university epidemiology and health sciences faculty (as well as public health practitioners and agencies) to build curricula and lesson plans around community health data. In other words, to write and post exercises that teach some aspect of epidemiology by examining the relationship between variables in health data. If students carried out different versions of these exercises each time, the repository of methods, results and interpretations would grow. Similarly, a growing number of projects (such as Educate to Innovate, Teaching Opportunities for Partners in Science, Retirees Enhancing Science Education through Experiments and Demonstrations, and National Lab Day) recruit volunteer scientists and technology educators in an attempt to improve student performance in science, technology and engineering and to bring scientific, statistical and data literacy to local communities. We can start to assemble such a volunteer army for health data, and provide them with the tools they need to succeed.

In fact, there is no requirement (other than simplicity) to stop at the analysis of existing data. While rather more complicated, we can imagine (and work towards) health apps that engage communities explicitly in the actual collection of data. Given the increasing difficulty of enrolling participants in national population-based surveys, such as NHANES, it is time to explore alternative methods for conducting population-based surveys. The academic health community has for some time been trying, with mixed success, to bring communities more into the research process. Providing tools to create value from health data – and the guidance of a community of volunteer experts – will complement and speed research and improvements to public health.