Quality vs quantity of data

How much data is available to train AI vs how suitable that data is for training

Data quantity (or data volume) refers to the amount of data that exists, i.e., how much data is stored or used by an AI model. Generally, when it comes to training machine learning algorithms, it is assumed that more data is better. Models trained on larger datasets (i.e., datasets containing more data) are considered to be more accurate, more robust (to perform well in a variety of circumstances), and less likely to be biased. However, more data is not necessarily better if that data is of poor quality. 

Data quality describes whether data is fit for purpose. Good quality data is data that can support the outcomes it is being used for (e.g., developing an algorithm capable of diagnosing a particular disease). If poor quality data is used to train an algorithm, the algorithm’s ability to complete its task safely is likely to be undermined. That is why it is sometimes better to have less data of higher quality than more data of poorer quality.

The challenge is that whilst data quantity is relatively easy to measure (think, for example, of 2GB of data on a mobile phone plan vs. 4GB), data quality is hard to measure because it depends on the context. This is why an important first step in any AI project is to decide what ‘good’ looks like for the data being used to train the model. This might, for example, involve considering whether the data is up to date and whether any important information is missing.

For example, when it comes to assessing electronic health record (EHR) data:

  1. Verify where the information was sourced and how it was collected. Data that comes directly from a hospital is more likely to be accurate than data drawn from a general database. 
  2. Check when it was last updated and how often it has been updated. For more accurate results, the data should be as current as possible, and data spanning a long period should have been updated consistently. 
  3. Perform data integrity checks to identify missing values (e.g., missing fields that would compromise the data’s usefulness in research), duplicate entries, or other inconsistencies (see the sketch after this list). 
  4. Check that codes and descriptions are consistent across different records. If one hospital represents ‘diabetes’ as code 123456 and another records it as 234567, a computer would not recognise both as ‘diabetes’.
  5. Ask medical professionals to review a random sample of EHRs to ensure they are clinically relevant for the study in which the data is to be used.
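
Several of these checks can be automated. Below is a minimal sketch using pandas on a hypothetical EHR extract; the column names (patient_id, diagnosis_code, last_updated), the cut-off date, and the code mapping are all invented for illustration, not a standard.

```python
import pandas as pd

# Hypothetical EHR extract; column names and values are illustrative only.
ehr = pd.DataFrame({
    "patient_id": [101, 102, 102, 103],
    "diagnosis_code": ["123456", "234567", "234567", None],
    "last_updated": ["2024-01-10", "2019-06-01", "2019-06-01", "2023-11-30"],
})

# Integrity checks (step 3): missing values and duplicate entries.
print(ehr.isna().sum())                                # missing values per column
print(ehr.duplicated(subset=["patient_id"]).sum(), "duplicate patients")

# Currency check (step 2): flag records not updated recently.
ehr["last_updated"] = pd.to_datetime(ehr["last_updated"])
stale = ehr[ehr["last_updated"] < pd.Timestamp("2022-01-01")]
print(len(stale), "records not updated since 2022")

# Consistency check (step 4): map site-specific codes to one shared term
# (this mapping is invented for the example).
code_map = {"123456": "diabetes", "234567": "diabetes"}
ehr["diagnosis"] = ehr["diagnosis_code"].map(code_map)
print(ehr["diagnosis"].isna().sum(), "codes could not be harmonised")
```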

Data quantity and data quality are both important concepts in the development of AI models for healthcare, but they have different meanings and differ in complexity. 

Data quantity is the simpler concept. It is a quantifiable measure of how much data is available to train the AI algorithm. It’s generally assumed that more data is better and, indeed, there are several benefits to having a larger volume of data available. More data can increase statistical power, reduce sampling bias, and capture more variability, which can enable more complex, less biased, and more robust models. However, the assumption that more data is better only holds if the data is of sufficient quality. The challenge is that data quality is a far harder concept to measure.

There is no agreed-upon definition of data quality, as what ‘good’ looks like varies depending on the context. The simplest interpretation of ‘good quality data’ is data that is fit for purpose, i.e., data that is well-suited to the analytical problem at hand. However, whilst the precise requirements for high-quality data vary, the dimensions that any statement of these requirements should cover are broadly agreed:

  • Completeness
    Does the dataset contain all the elements needed? Are there any elements missing? Is the dataset biased in any way? 
  • Uniqueness
    Are there any duplicates in the dataset? For example, are any ‘patients’ appearing more than once? 
  • Conformance
    Is the dataset formatted to the required standard? For example, are all the values (such as dates) recorded in the expected format? 
  • Timeliness
    Is the data up to date? Can it be accessed easily when it is needed?   
  • Accuracy
    Is the data correct according to an agreed upon source of truth?  
  • Consistency
    Are specific metrics consistent across multiple sources? For example, does the number of patient records from a GP practice match the number of patients on the patient list?  
  • Concordance
    Is there agreement between elements within the dataset, for example, within an Electronic Health Record, or between two different datasets?  
  • Plausibility
    Does the data accurately represent the real-world object or construct that it’s supposed to?  
  • Relevance
    Is the data helpful and suitable for the task at hand?  
  • Usability
    How easily can the data be accessed, used, updated, maintained and managed?  
  • Security
    Is personal data appropriately protected and is access sufficiently controlled to ensure privacy and confidentiality?  
  • Information loss and degradation
    Will the quality of the data degrade over time?  
  • Flexibility
    Can the data be easily adapted for multiple tasks?  
  • Interpretability
    Can the data be readily understood?


The exact requirements for each of these dimensions should be agreed at the start of any AI development project as part of the data governance process. Once agreed, all datasets should be quality assured against them. Again, there is no universally agreed method for data quality assurance, but a variety of tools and techniques are available. The most frequently cited are:

  • Log review
    Document how the data in any dataset is supposed to be entered and check to see whether good practice was followed at the time of data collection. 
  • Element presence
    Check whether all data elements are present.  
  • Data element agreement
    Compare two or more elements within the dataset to see if they contain compatible information. 
  • Validity check
    Check whether recorded data values agree with common or external knowledge. For example, are recorded BMI values within an expected range? (See the sketch after this list.)  
  • Conformance check
    Assess whether there are any duplicates in the dataset, and whether the formatting of the dataset is in line with prespecified constraints.  
  • Data source agreement
    Check whether two or more data sources are in agreement. Do calculated values match between different datasets when calculated in the exact same way?  
  • Distribution comparison
    Compare summary statistics of the datasets with expected distributions of interest. For example, is the number of patients in the dataset diagnosed with diabetes, and the distribution of diagnosis in terms of age, sex, and ethnicity, in accordance with known population prevalence statistics?  
  • Gold standard
    Assess the values of specific elements within the datasets against a trusted reference standard or dataset (note that such a standard may not always be available).  
  • Qualitative assessment
    Conduct a qualitative assessment of the data based on focus groups or interviews with key stakeholders to see if the dataset aligns with their expectations.  
  • Security analyses
    Subject the data storage mechanism to a security test to see whether it is vulnerable to attack.
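
To make two of these methods concrete, the sketch below runs a validity check and a distribution comparison with pandas. The BMI bounds and the expected diabetes prevalence are placeholder figures chosen for illustration, not clinical reference values.

```python
import pandas as pd

# Hypothetical patient table; all values are illustrative.
patients = pd.DataFrame({
    "bmi": [22.4, 31.0, 250.0, 18.5],      # 250.0 is implausible
    "has_diabetes": [False, True, False, True],
})

# Validity check: are recorded BMI values within an expected range?
valid = patients["bmi"].between(10, 80)    # assumed plausible bounds
print((~valid).sum(), "BMI values outside the expected range")

# Distribution comparison: does observed prevalence match expectations?
expected_prevalence = 0.09                 # assumed population figure
observed = patients["has_diabetes"].mean()
print(f"observed {observed:.2f} vs expected {expected_prevalence:.2f}")
```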

Once an initial assessment has been completed, relevant actions to improve the data quality should be agreed. Such actions might include data curation (e.g., removing outliers), data harmonisation, data standardisation, or missing data interpolation (i.e., estimating unknown values from the surrounding known values). 
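
As a brief illustration, the sketch below applies two such actions in pandas: interpolating a missing value in a time series and removing an outlier. Linear interpolation and Tukey’s 1.5 × IQR fences are common defaults chosen for the example, not a prescribed method.

```python
import pandas as pd

# Hypothetical lab-value series with one gap and one outlier.
values = pd.Series([5.1, 5.3, None, 5.2, 99.0],
                   index=pd.date_range("2024-01-01", periods=5))

# Missing data interpolation: estimate the gap from neighbouring values.
values = values.interpolate(method="linear")

# Data curation: drop points outside Tukey's 1.5 * IQR fences.
q1, q3 = values.quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
cleaned = values[values.between(q1 - fence, q3 + fence)]
print(cleaned)  # the 99.0 reading has been removed
```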

In order to develop robust, reliable, and safe AI algorithms, data quality needs to be sufficiently high. Poor quality data inputs can lead to dangerous and poorly performing algorithms (hence the expression ‘rubbish in, rubbish out’). This is why it may sometimes be better to have less data of higher quality than more data of poorer quality. 


Further reading
  • Bian, Jiang et al. 2020. ‘Assessing the Practice of Data Quality Evaluation in a National Clinical Data Research Network through a Systematic Scoping Review in the Era of Real-World Data’. Journal of the American Medical Informatics Association 27(12): 1999–2010.
  • Chan, Kitty S., Jinnet B. Fowles, and Jonathan P. Weiner. 2010. ‘Review: Electronic Health Records and the Reliability and Validity of Quality Measures: A Review of the Literature’. Medical Care Research and Review 67(5): 503–27.
  • Kahn, Michael G. et al. 2016. ‘A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data’. eGEMs (Generating Evidence & Methods to improve patient outcomes) 4(1): 18.
  • Weiskopf, N. G., and C. Weng. 2013. ‘Methods and Dimensions of Electronic Health Record Data Quality Assessment: Enabling Reuse for Clinical Research’. Journal of the American Medical Informatics Association 20(1): 144–51.