In 2016, a major transformation occurred in how we evaluate clinical data for real-world effectiveness. The 21st Century Cures Act was signed into law, directing the FDA to establish a program for evaluating real-world evidence in the United States. It signaled a paradigm shift in medicine by formally recognizing the importance of real-world data (RWD) in bringing medical innovation to patients. But the immense promise of RWD comes with sizeable hurdles. Simply having massive quantities of data at your disposal does not automatically equate to having meaningful answers, especially when considering the sheer length and complexity of medical records.

Medical records in their raw form are far from being consistently organized or tabulated. Inconsistencies, redundancies, and format variations make it difficult to identify the relevant aspects of a patient’s medical journey. To gain deeper clinical insight from information that holds the nuances and critical details about patient care, we must rely on more sophisticated machine learning approaches like natural language processing for healthcare. We need to be able to process and make sense of the information-rich natural language that exists throughout medical records, regardless of where that information is stored, to understand a patient’s health journey at the highest fidelity.

The Challenges of Medical Data

The most important, insightful clinical data exist in narrative form, which captures the medical journey that a patient experiences as a series of events. Many important elements are hard to contextualize in ways that allow someone to make clinical sense of them while maintaining patient privacy and data security. The processes of inputting information are inherently flawed because electronic medical records (EMRs) come from administrative systems that are designed for reimbursement purposes, not research.

The challenges of medical records are directly caused by the overall lack of standardization, clear labeling and enforcement of data consistency. It’s also not possible to go back and retroactively apply standardization. Much of this variability comes down to medical notes, such as progress or office notes, and the individual style in which people fill in medical information. This reflects a broader issue related to the “subjectiveness” of language. It’s known that inconsistencies are commonplace as multiple health care providers input content and are not mandated to apply consistent standards. They also may lack the financial or technical resources to do so. This is additionally complicated by the fact that medical conditions often vary in how they present and disease usually impacts people differently.

Clinical content is harder to decipher when the stated intent of care does not match what the actual documentation reflects. The challenge is in trying to document everything that a health care provider has thought about and done in the care of a patient versus how that information is captured in the medical record. Therefore, you have to be careful to consider how the coding of notes is impacted by human interpretation.

There is also the sheer size and volume of information existing in patient records. Consider that one patient can make dozens of visits to different providers over several years and each type of visit can generate different levels of data. In this sense, the amount of medical information that’s created becomes immense. As medical record data continue to grow at an exponential pace, hospitals continue storing it in their own siloed way, based on individual workflows and operations. The result makes cross-comparisons difficult when, for example, contradictory outcomes emerge from the same procedures or hospitals use outdated data formats and modes of information sharing, such as faxes, scanned PDFs and pathology reports with handwritten scribbles. The trick is to bring order to this explosion of scattered data and highlight the key elements that lead to deeper insight into the nuances of a patient’s healthcare journey.

Machine Learning, Natural Language Processing and the Complexity of Health Data

To handle the Pandora’s box of health data that lives within medical records, we need to harness deep learning. Deep learning is a subset of machine learning that uses multi-layer neural networks to solve complex problems. One field that relies heavily on machine learning is natural language processing (NLP), which helps process and understand human language in a way that gets at the heart of what matters in the data. It’s like extracting a valuable metal from an ore.

By enhancing or highlighting specific, clinically relevant content, the “noise” that is captured for regulatory purposes becomes less prominent and less confusing. The results are data sets that can be tailored based on the area of focus. Machine learning, together with NLP, transforms data into an output that is “fit” enough for human review and from which we can draw insights. For instance, NLP can classify sections of medical records so that they are more searchable. The significance of this machine-based approach is that we can “read and summarize” thousands of pages of text incredibly quickly. This is a feat in scalable medical research that could never be achieved if we depended strictly on human methods of extraction.

Setting Machine Learning Up for Success

Machine learning is a voracious tool for data processing. But it requires a lot of time and training to be used in a meaningful way. Machines do not have the same level of cognitive reasoning as humans, so they need to be pointed in the right direction. Specifically, a model needs human experts to “teach” it basic rules to follow by labeling the data, which ensures that the correct information is extracted from medical records. By training the machine learning model to “read,” human experts are essentially guiding it with examples, themes and relevant concepts within the medical record text to create a coded (or structured) representation. With time, the model gets more efficient at extracting nuanced information, which human experts then vet to ensure that it accurately aligns with the clinical question being asked.
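To make the idea of “teaching” a model with labeled examples concrete, here is a minimal sketch of a naive Bayes text classifier written with the Python standard library only. It is an illustration, not the approach used in any production system: the labels, the example snippets and the function names (tokenize, train, predict) are all hypothetical, and real clinical models are trained on far larger, expert-annotated data sets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_examples):
    """Count words per label from (text, label) pairs supplied by experts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing so an unseen word does not rule out a label.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labeled snippets standing in for expert annotations.
LABELED_EXAMPLES = [
    ("patient started on metformin 500 mg twice daily", "medication"),
    ("continue lisinopril 10 mg daily", "medication"),
    ("hemoglobin a1c 7.2 percent", "lab_result"),
    ("serum creatinine 1.1 mg dl within normal range", "lab_result"),
    ("allergic to penicillin causes rash", "allergy"),
    ("reports allergy to sulfa drugs", "allergy"),
]

model = train(LABELED_EXAMPLES)
print(predict(model, "penicillin allergy noted"))
```

With only six examples the model is a toy, but the loop is the same one described above: experts supply labeled text, the model learns which words go with which label, and its predictions are then checked by people before they are trusted.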

A Deeper Dive With NLP

Extracting this meaningful information from jumbled medical records depends on a tool that can understand the unique “grammar” of individual medical records. Critical questions that we often ask ourselves include: How do we differentiate various sections of a medical record and classify the many document types stored within? How do we distinguish a patient’s history from their physical, discharge summary, lab results, visits, and the like?
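As a rough illustration of the first question, the sketch below splits a record into sections by looking for known headings. This is a deliberately simple, rule-based stand-in for the learned section classifiers discussed in this article; the heading list, the sample record and the split_sections function are hypothetical, and real records contain many more heading variants than a fixed list can cover.

```python
import re

# Hypothetical heading vocabulary; real records use many more variants.
SECTION_HEADINGS = {
    "history of present illness": "history",
    "past medical history": "history",
    "physical exam": "physical",
    "physical examination": "physical",
    "lab results": "labs",
    "laboratory data": "labs",
    "discharge summary": "discharge",
}

# A candidate heading is a run of letters and spaces followed by a colon.
HEADING_PATTERN = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.*)$")


def split_sections(record_text):
    """Group the lines of a record under the most recent recognized heading."""
    sections = {}
    current = "unclassified"
    for line in record_text.splitlines():
        match = HEADING_PATTERN.match(line)
        label = SECTION_HEADINGS.get(match.group(1).lower()) if match else None
        if label:
            current = label
            line = match.group(2)  # keep any text that follows the heading
        if line.strip():
            sections.setdefault(current, []).append(line.strip())
    return sections


SAMPLE_RECORD = """\
History of Present Illness: 54 year old with chest pain.
Pain began two days ago.
PHYSICAL EXAM:
BP 130/85, heart rate 88.
Lab Results: Troponin negative.
"""

sections = split_sections(SAMPLE_RECORD)
for name, lines in sections.items():
    print(name, "->", lines)
```

A fixed lookup like this breaks as soon as a provider writes “HPI” or omits the colon, which is exactly the kind of variability that motivates a trained model over hand-written rules.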

This is where NLP comes in. NLP relies on machine learning to be sensitive to the inconsistencies in how information is documented and the multiple ways a medical concept can be expressed, abbreviated or mistakenly written. NLP must also adapt to the constant, high-paced evolution of medicine, our understanding of disease, how we test for diseases and the new terms and updated lexicon that reflect this change. In other words, NLP is at the heart of navigating the semi-structured and unstructured data that exist throughout medical records, and of relating the extracted pieces to one another.
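The problem of one concept appearing under many spellings can be sketched with a small lookup plus fuzzy matching. The example below uses Python’s standard difflib module; the lexicon, the 0.85 similarity cutoff and the normalize_concept function are assumptions chosen for illustration and are not drawn from any clinical terminology standard.

```python
import difflib

# Hypothetical mini-lexicon mapping surface forms to one canonical concept.
LEXICON = {
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "hypertension": "hypertension",
    "high blood pressure": "hypertension",
    "htn": "hypertension",
    "type 2 diabetes": "type 2 diabetes mellitus",
    "t2dm": "type 2 diabetes mellitus",
}


def normalize_concept(mention, cutoff=0.85):
    """Map a free-text mention to a canonical concept, or None if unknown."""
    key = " ".join(mention.lower().split())  # lowercase, collapse whitespace
    if key in LEXICON:
        return LEXICON[key]
    # Fall back to fuzzy matching to catch misspellings such as "hypertenson".
    close = difflib.get_close_matches(key, list(LEXICON), n=1, cutoff=cutoff)
    return LEXICON[close[0]] if close else None


for mention in ["HTN", "Heart  attack", "hypertenson", "fracture"]:
    print(mention, "->", normalize_concept(mention))
```

A static dictionary has to be updated by hand whenever new terms appear, and fuzzy matching is risky for very short abbreviations, so this kind of lookup is better seen as a baseline that a learned model improves on.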

The Richness of RWD Insight

Tackling real-world health data and extracting rich insights depends on knowing the exact information that is being sought. It’s crucial to know how one piece or component of information fits into the larger picture of the patient’s entire medical journey. When clear goals are stated at the outset of building models, richer results from RWD will emerge.

Formulating relevant questions has never been as crucial as it is now, especially amid our evolving understanding of how the virus that causes COVID-19 works and affects people’s health. This includes, as an example, knowing the types and names of diagnostic tests that patients have taken or whether a ventilator was used at particular points of treatment. The source, or point of origin, of information also factors into the reliability of data. These are a few of the reasons why it is important from both an epidemiological and a treatment perspective to rely on rich, large-scale data for the answers.

The Need to Dive Deep and Clear Out the Noise

As mentioned, to reveal the entire patient experience it is critical to access and analyze every aspect of a medical record in order to feed NLP and machine learning tools. At Ciox, we meet these data challenges by unlocking and diving right into the complete medical record. We rely on deep expertise that can separate out the clean, information-rich signals from unstructured noise and validate results for clarity and accuracy.

With biomedical NLP, we can get to a higher level of detail about a patient’s medical journey, including lab results, medication adherence, specific allergies, disease history, specific outcomes, encounter dates and the nature of those encounters. Combined with the expertise of clinical information specialists with over 20 years of experience, this allows us to translate data into research-grade information to answer scientific questions.
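To show what turning narrative text into structured detail can look like, here is a small sketch that pulls an encounter date and lab values out of a single note using regular expressions. It is a simplified stand-in for biomedical NLP, not a description of Ciox’s pipeline: the note layout, the three lab names, the units and the extract_lab_results function are all hypothetical.

```python
import re
from datetime import datetime

# Assumed note layout: a US-style MM/DD/YYYY date, then free text in which
# labs are written as "<name> [of|is|:] <value> <unit>".
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
LAB_PATTERN = re.compile(
    r"\b(hemoglobin a1c|ldl|creatinine)\s*(?:of|is|:)?\s*"
    r"(\d+(?:\.\d+)?)\s*(%|mg/dl)",
    re.IGNORECASE,
)


def extract_lab_results(note):
    """Return one structured row per lab value found in a dated note."""
    encounter_date = None
    date_match = DATE_PATTERN.search(note)
    if date_match:
        month, day, year = map(int, date_match.groups())
        # datetime raises ValueError for impossible dates such as 13/45/2019.
        encounter_date = datetime(year, month, day).date().isoformat()
    rows = []
    for name, value, unit in LAB_PATTERN.findall(note):
        rows.append({
            "encounter_date": encounter_date,
            "lab": name.lower(),
            "value": float(value),
            "unit": unit.lower(),
        })
    return rows


NOTE = (
    "03/14/2019 Office visit. Hemoglobin A1c of 7.2 %, "
    "LDL: 131 mg/dL. Creatinine 1.1 mg/dL."
)
rows = extract_lab_results(NOTE)
for row in rows:
    print(row)
```

Each row is now a record that can be filtered, joined and counted, which is the practical meaning of moving from unstructured text to structured data. Patterns like these only work on text that happens to match them, so expert validation of the output remains essential.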

This technology and information management expertise enables Ciox to collect, identify and insert data into models at a quicker pace, saving valuable time and resources.

Where Health Data Will Take Medicine

We know how valuable medical records are in providing health care providers and clinical researchers with meaningful answers. This insight helps to paint a fuller medical picture of patient care and we need a greater understanding of RWD than that which exists today. Identifying and summarizing the full patient journey depends on sophisticated NLP that quickly extracts and makes sense of information that’s buried within an enormous mound of medical records. There needs to be a broader appreciation of how combining every bit of relevant, de-identified patient data can inform a full timeline and profile. Achieving this allows us to automate and quickly access rich summaries of patient journeys that can effectively inform medical innovation. Re-conforming unstructured data into structured formats will provide research organizations with data that are scaled to accelerate medical research and develop our ability to predict risk and effectively treat challenging conditions.

This article is a part of the series. Check out our previous article, Digital Health Innovation: The Predictive Impact of Curated Real-World Data in Times of Change, here.

About the authors, who are employees of Ciox Health, LLC: Ana Bargo, Data Scientist; Jeannine Cain, Healthcare Analyst; Shannon Fee, BioMedical Data Scientist; Mark Yap, Healthcare Analyst.

To learn more about Ciox’s DataFit Platform™, visit us at