Harnessing “Big Data” to Diagnose Health Problems

The New York Times recently reported that a group of researchers from Microsoft Corporation was able to identify patients who would develop pancreatic cancer just by analyzing patterns in those patients’ search terms. Pancreatic cancer is often quite difficult to diagnose, and as a result is usually discovered late. By scanning internet users’ search terms, the researchers were able to identify those suffering from the disease before the patients themselves were even aware of their illness.

At the Computational Medicine Center (CMC) at Thomas Jefferson University, Center Director Dr. Isidore Rigoutsos and his colleagues take a similar approach to generating surprising health findings by mining and analyzing large public pools of healthcare data.

Dr. Rigoutsos discusses his approach:

How can anonymized data help individual patients?
We have now entered an era where much of what each one of us does during the day (whether it is work related or personal) involves interactions with websites and search engines. This modus operandi is generating tremendous new opportunities where the collective behavior of large groups of users can help create associations that one would not have fathomed otherwise. As you can imagine, search engines like Google and Bing are well-positioned to generate such associations. A characteristic example of what is possible was recently reported by Microsoft researchers. By evaluating user queries on Bing, the company’s search engine, and studying how these queries evolved over time, they were able to identify individuals who had just been diagnosed with pancreatic cancer.

There are two important things to stress here. First, the researchers were able to achieve this by using “anonymized data”, i.e., they generated their associations by analyzing what was asked of Bing (and not who asked it), which means that this could be achieved without compromising the privacy of the search engine’s users. Second, having identified a pattern in the way these users’ queries evolved over time, the researchers were able to develop a scheme for “predicting” an eventual pancreatic cancer diagnosis with very low false-positive rates. Even though they could do this confidently for a relatively small fraction (5-15 percent) of those who were eventually diagnosed, Microsoft’s team has nonetheless demonstrated the potential of such an approach in the context of what is admittedly a very difficult cancer to predict.

The article mentions a “Cortana for health.” Would it be something like having the computer interject to tell the patient to see a doctor based on their search queries?

In a nutshell, yes. Automated personal assistants are quickly becoming popular among users of providers such as Apple (Siri), Google (Assistant), Microsoft (Cortana), and Amazon (Alexa). What has contributed to the popularity of these assistants is their increasing ability to interact rather effectively with the user. For a company that already possesses the necessary infrastructure, it will be easy to “connect” the personal assistant module to an “artificial intelligence” module that has been trained specifically using associations culled from the analysis of vast numbers of user searches.

One can go further and envision an even faster feedback loop where the Assistant’s intervention can be triggered by a signal generated by one’s personal fitness tracker (e.g., Jawbone, Fitbit, Apple Watch, or Garmin) that can measure a person’s key parameters in real time and could flag something out of the ordinary. Such an intervention is not far-fetched, given this recent case in New Jersey.

What are some surprising uses for big data in solving healthcare problems?

It is perhaps premature to speak of problems that have been solved already. What is happening is the realization that Big Data has demonstrably changed the way we think and the way in which we do science. We are quickly transitioning from a “hypothesis-driven” realm to practicing “data-driven” science: what experiment we will do next is based on our knowledge of the domain and our having mined relevant data. This is certainly true of all of the work that is going on in the CMC. Experiments are now more focused and, arguably, better-designed than they would have been just a few years back. Most importantly, Big Data approaches are helping untether us from the constraints of the existing knowledge and spur us to explore possibilities that are not contained in our books.

In turn, how will this work help patients in the long run?

By harnessing the power of Big Data we can dismiss unpromising leads and get to novel actionable results much faster than before. This is certainly a big win. It is the ability to generate these novel results faster that all of us are very excited about. The first step typically involves the discovery that a previously unsuspected molecule can be used as a prognostic biomarker in a specific disease context. This is important new knowledge that can be put to beneficial use quickly. The second step involves establishing that this molecule plays key roles in the disease being studied and determining what events it mediates. The third step, therapy, leverages the findings of the second step and its goal is to design new drugs.

Aside from your work on RNA molecules, what other projects is the CMC working on and how will they help researchers?

Since it was founded in 2010, the focus of the work in the CMC has been on studying potent regulatory molecules that are relevant for multiple diseases. Along the way, we have been able to discover several new categories of regulatory molecules and to show that their presence or absence is controlled by a person’s sex, a person’s race, a person’s population origin, the type of the ailing tissue, the disease subtype, and other variables. These findings have given us and the rest of the community an unprecedented level of detailed knowledge that is now being linked to previously unsuspected events that underlie various diseases. Most important here is that all of these findings were enabled by the availability of data and the ability to mine this data, and would not have been possible a mere 10 years ago.
