Automated Data Quality and the EMA Data Quality Framework

Data quality is key in clinical trials and other healthcare studies. Recently, the European Medicines Agency (EMA) published a Data Quality Framework that outlines a number of key considerations around what it means to assure the quality of data.

You can read the entire EMA Data Quality Framework at this link. In this post, we’ll outline the key components of the framework and explain how automated data management systems, like ReadoutAI, relate to them.

First and foremost, we need to explain how systems like ReadoutAI perform automated data quality checks. ReadoutAI is both a “batch” system and a “streaming” system: data can arrive all at once, in one big chunk (batch), or piece by piece (streaming). Either way, data comes in, and each column of data is matched to a set of data quality checks. If all of the checks pass, terrific! The data is good, and ReadoutAI can proceed to its statistical analysis, if you want. If not, it flags the errors so you can fix them.
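To make that concrete, here is a minimal sketch of what a column-level check engine can look like. The check rules and function names are our own illustration for this post, not ReadoutAI’s actual API; the point is simply that batch mode is just streaming mode applied to every record.

```python
from typing import Any, Callable, Dict, List

# Illustrative only: each column maps to a list of checks, and a record
# fails if any value fails any of its column's checks.
Check = Callable[[Any], bool]

CHECKS: Dict[str, List[Check]] = {
    "age": [lambda v: isinstance(v, (int, float)), lambda v: 0 < v < 130],
    "temp_c": [lambda v: 30.0 <= v <= 45.0],
}

def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return flagged 'column=value' errors for one record (streaming mode)."""
    errors = []
    for column, value in record.items():
        for check in CHECKS.get(column, []):
            if not check(value):
                errors.append(f"{column}={value!r}")
                break
    return errors

def validate_batch(records):
    """Batch mode: apply the streaming check to every record."""
    return [validate_record(r) for r in records]
```

Because each record is validated independently, the same engine serves both ingestion styles.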

These data quality checks fall into three different categories:

  1. Cross Study Knowledge – There are many (many!) measurements about studies in trials, and almost all of these measurements are associated with values that are considered “Consistent with Life.” For example, a human being’s age must be greater than 0 (no one is negative years old) and also less than 130 (as human beings don’t live that long… yet…). These rules apply to temperature, age, white blood cell counts, and more. Even something as simple as someone’s level of education has rules that imply its quality – for instance, it would be highly unusual to see a value of 0.2834 as someone’s Level of Education. In the context of the framework, this is called Local Knowledge. With ReadoutAI, we’ve created a knowledge base of rules that cover labs, vitals, demographics and more, all of which are automatically applied to data, even in a streaming fashion.
  2. Study Specific Knowledge – This is a set of rules that define if data is valid, like Cross Study Knowledge, but that may be specific to a study. For instance, maybe a study leverages a Patient Reported Outcome (PRO) that is unusual or new. This PRO might have four acceptable values, and if a data point is not one of those four, it would be considered a data error. ReadoutAI can incorporate this type of local knowledge as well.
  3. Outliers – Finally, all of the data might look fine! Maybe something is measured for which there isn’t much local knowledge, and all of the data is a number, as expected. However, for 99 out of 100 people the value falls between 1.0 and 5.0, while for that 100th person the value is 427.2. This seems strange! Maybe it’s not an error, but it’s certainly something a system should flag for human review – perhaps someone meant to type 4.272 and misplaced the decimal. To address this, systems like ReadoutAI use statistical and machine-learning techniques to predict what “normal” ranges should be and identify data points that fall outside of them.
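The third category can be sketched with a simple z-score rule, which stands in here for the statistical/ML techniques mentioned above. The threshold of 4 standard deviations is an assumption we chose for illustration, not a value from ReadoutAI or the EMA framework.

```python
import statistics

def flag_outliers(values, z_threshold=4.0):
    """Return indices of values far outside the column's normal range.

    Illustrative sketch: a value is flagged when it lies more than
    z_threshold standard deviations from the column mean.
    """
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > z_threshold]

# The 427.2 example from above: 99 plausible values plus one oddity.
values = [round(1.0 + 0.04 * i, 2) for i in range(99)] + [427.2]
flagged = flag_outliers(values)
```

Real systems would use more robust techniques (the mean and standard deviation are themselves skewed by the outlier), but even this naive rule catches the misplaced decimal.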

So, now that we understand how systems like ReadoutAI address data quality in an automated, streaming way, how does that fit with EMA’s Data Quality Framework?

The framework defines five core “dimensions” – which are ways to think about data and whether you, as a person, would consider it to be high quality. Those dimensions are Reliability, Extensiveness, Coherence, Timeliness and Relevance.

We’ll break down each of those here.


Fundamentally, Reliability is all about “can we trust the data?” This is core to data quality. Among its other components (accuracy and precision), perhaps the two most interesting aspects of Reliability that the EMA defines are Plausibility and Traceability. Plausibility is really the trust part of reliability: “Does this data make sense?” Here is where we see the idea that an Age value must be greater than 0 and less than 130. The EMA also mentions that the facts should agree, and this is where detecting outliers is so crucial.

Traceability is about the provenance of the data – where did it come from? Can we trace it back to the source (say, a site)? ReadoutAI ingests data from other sources (say, an EDC), so this is about the chain of data from site to EDC to ReadoutAI. We believe that as automation permeates more and more trials, traceability actually becomes easier! Rather than Excel files being passed back and forth, data is logged automatically at a site into the EDC, which passes it to a system like ReadoutAI. When the machines talk, they can record where everything comes from.
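That “machines recording where everything comes from” idea can be as simple as each data point carrying its chain of custody as structured metadata. The field names below are assumptions for this sketch, not a real ReadoutAI schema.

```python
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    """Hypothetical data point that accumulates its provenance chain."""
    column: str
    value: object
    provenance: list = field(default_factory=list)  # ordered list of systems

    def record_hop(self, system: str):
        """Append the next system as the data moves along the chain."""
        self.provenance.append(system)

# Example chain of custody: site -> EDC -> analysis system.
dp = DataPoint("age", 42)
for system in ("Site 014", "EDC", "ReadoutAI"):
    dp.record_hop(system)
```

Because each hop is appended automatically at ingestion time, the full site-to-analysis trail is available for any value without anyone maintaining it by hand.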


The next dimension in EMA’s framework is Extensiveness, which is all about whether you have enough data. The framework defines a number of aspects here. The first is Completeness – how much of our data is usable? Non-usable data could be missing data or even just strange values, like outliers. Therefore, an approach like ReadoutAI’s that ensures data conforms to a standard is key to identifying missing values, erroneous values, or even just unusual values. All of those determine how complete the data is. Similar to Completeness is Missingness, which, just as its name implies, is about missing data. Note that we consider non-plausible values to be missing as well, since they can’t be used!

Completeness and Missingness tell you how extensive the data you currently have is. But what about the data you haven’t captured? This is where the next two aspects come into play. Coverage is meant to convey how much of a population is actually covered. This is hard! You might not even know (for instance, maybe your trial covers a rare disease where the epidemiology is still unclear). Similar to Coverage, Representativeness focuses on whether the underlying data really represents the population. This is precisely why ReadoutAI ties together both the data quality engine and our statistical/reporting engine: perhaps uniquely, you can use ReadoutAI to validate the data and automatically generate a demographics report to ensure that your population reflects the characteristics that it should.
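A minimal way to picture Completeness under the definition above – where non-plausible values count as missing – is a single usable-fraction metric. The plausibility rule reuses the age example from earlier in the post; the function itself is our illustration, not a framework formula.

```python
def completeness(values, is_plausible):
    """Fraction of values that are both present and plausible.

    Illustrative sketch: None counts as missing, and so does any value
    that fails the plausibility rule, since it can't be used either.
    """
    usable = sum(1 for v in values if v is not None and is_plausible(v))
    return usable / len(values)

# Example column: two missing ages and two implausible ones (-2, 430).
ages = [34, 57, None, -2, 61, 430, 28, 45, None, 52]
score = completeness(ages, lambda v: 0 < v < 130)
```

Here only 6 of the 10 values are usable, so the column is 60% complete even though 8 of 10 cells are filled in.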


Coherence is an interesting dimension in that it fundamentally boils down to whether the data is the same in both format and meaning. If so, it’s valid. Within that concept, the EMA discusses a number of aspects. Format Coherence means that the data is in the same format. For instance, are all the dates formatted as dates, and all the numbers as numbers (or, for a categorical variable, do all values come from the expected set)? Again, a system like ReadoutAI uses local knowledge to ensure data consistency and even uses algorithms to do things like automatically determine date formats. Here, again, is where automation shines.

The idea of Structural/Relational Coherence is that you should be able to consistently link the data across sources. This should be obvious, but data isn’t valid if you can’t ensure that linking data across sources connects the right patient to the right data.

Semantic Coherence is an interesting one – the core concept is that words should mean the same thing across the data sources. So, if one data source says “Age” meaning participant age, then wherever you see “Age” again, it should mean that (and not medicine shelf life or something). Here is where modern AI-based tools like ReadoutAI can shine. We use AI to identify which local knowledge in our knowledge base should be applied to which data, and we even leverage semantics to understand data for reporting. For instance, in our automated Adverse Event reporting, our AI automatically figures out which column of data relates to seriousness and which to treatment-relatedness by analyzing the column headers and values (to be clear, we do not make the assessment of whether an AE is serious or treatment-related! We just use AI to figure out which columns report that).

Finally, the EMA mentions Uniqueness. While it’s clear what this means (unique data should be unique!), this is actually not easy to do by hand (or in manual tools). For automated systems like ReadoutAI and others, it’s trivial.
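To show just how trivial Uniqueness is to automate, here is a sketch of a duplicate finder for a column that should be unique (subject IDs, say). The IDs are made up for illustration.

```python
from collections import Counter

def find_duplicates(values):
    """Return values that occur more than once, in first-seen order.

    Illustrative uniqueness check: tedious by hand across thousands of
    rows, but a couple of lines for an automated system.
    """
    counts = Counter(values)
    seen = set()
    dupes = []
    for v in values:
        if counts[v] > 1 and v not in seen:
            seen.add(v)
            dupes.append(v)
    return dupes
```

The same check runs unchanged whether the column has ten values or ten million, which is exactly the hand-versus-machine gap noted above.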

Timeliness and Relevance are the final two dimensions of data quality discussed in the EMA framework. Timeliness is just that – is the data too old and stale, or is it still usable for analysis? Similarly, Relevance asks whether the data is the right data to use in an analysis. Both of these aspects are, in our view, more related to the process of data collection; however, they do touch on automation. In our case, ReadoutAI can accept streaming data, not just batch data, and in that way we can be even more timely than manual approaches, because we can identify data issues earlier. For Relevance, by combining data quality with our automated reporting, we can ensure that we have the relevant data to generate a report – otherwise, the system can’t generate the report, by definition!
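Even a Timeliness rule can be automated as just another check. The 30-day staleness window below is an assumption we picked for illustration; it is not a value from the EMA framework or from ReadoutAI.

```python
from datetime import datetime, timedelta, timezone

def is_stale(recorded_at: datetime, now: datetime,
             max_age_days: int = 30) -> bool:
    """True if a record is older than the staleness window.

    Illustrative sketch: in a streaming setting, this runs the moment
    each record arrives, so stale data is flagged immediately.
    """
    return (now - recorded_at) > timedelta(days=max_age_days)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
old_record = datetime(2024, 1, 1, tzinfo=timezone.utc)
fresh_record = datetime(2024, 5, 20, tzinfo=timezone.utc)
```

In batch workflows this check runs only when someone remembers to run it; in a streaming workflow it runs on every record as it lands.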

This was a long one, so thank you for bearing with us!

We hope this helps shed some light on EMA’s framework, and, of course, if you’d like to learn more about ReadoutAI and our automated data management capability, please don’t hesitate to reach out!