Data preparation in healthcare: Part 3 – Cleaning healthcare data as structured data

Duration: 9 mins

Tags: ML


Date: October 9th, 2020




In our two previous ‘Data Preparation in Healthcare’ blogs, we discussed how the collection and preparation of individual data samples pose a challenge to overall data quality. However, as mentioned in Part One, healthcare datasets are composed of a multitude of samples. It is therefore fundamental that all samples share a common data structure, so that data quality is not affected.

Machine learning (“ML”) algorithms require a systematic data structure that every data sample follows. While the human brain would simply ignore missing data during the learning process, algorithms cannot process data with even minor inconsistencies.

Challenges in structuring data: 

  • Firstly – The size of the datasets. The larger the volume of the dataset, the higher the probability of inconsistencies in the structure of each data sample.

  • Secondly – The complexity of healthcare tasks. Healthcare tasks, such as clinical trials, are incredibly complex. For instance, they often require multiple experiments and sub-experiments, which makes structuring the data correctly an enormous challenge.

Here we will look into cleaning structured healthcare data in more detail.

Radiology: Hierarchical data

Radiology is a typical example of the importance of structured data.

Such datasets have multiple hierarchical levels, as shown in figure 1 below. Typically, a Computed Tomography (‘CT’) scan dataset consists of:

  • A patient set;

  • Multiple CT scans (‘exams’) per patient. These exams are done at different fixed dates to study the evolution of a pathology;

  • Different data collection phases per exam. As mentioned in Part Two, CT scans often require a contrast agent injection to highlight different biological features in the image. The elapsed time between the injection and the image collection defines an “enhancement phase”. Each enhancement phase highlights different features; and

  • Multiple slices (‘images’) per data collection parameter. These enable the radiologist to have 3D information (which is the ultimate goal of a CT scan).

Figure 1. Hierarchical levels of a CT scan dataset

Due to the large number of samples required (163,200 individual slices for a relatively small dataset as in our example above), it is easy to end up missing a slice or a whole exam. 

In these situations, mitigation depends on the level of the missing data:

  • Interpolation can reconstruct the image if there is one missing slice;

  • But if multiple slices are missing, it may be impossible to reconstruct them, and then the entire 3D scan exam will be unusable; and

  • A missing exam in a patient’s folder may jeopardize the use of all the data from that patient.
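The slice-level rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a scan is represented as an ordered list of 2D slices, a missing slice is marked with None, and "interpolation" is taken to mean a plain pixel-by-pixel average of the two neighbouring slices. The names interpolate_slice and repair_scan are hypothetical, and a real pipeline would likely use a more careful interpolation scheme.

```python
from typing import List, Optional

Slice = List[List[float]]  # one 2D image as rows of pixel intensities


def interpolate_slice(before: Slice, after: Slice) -> Slice:
    """Linear interpolation: average the two neighbouring slices pixel by pixel."""
    return [
        [(a + b) / 2.0 for a, b in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]


def repair_scan(slices: List[Optional[Slice]]) -> Optional[List[Slice]]:
    """Return a complete scan, or None if this sketch cannot repair it.

    `slices` is ordered along the scan axis; None marks a missing slice.
    """
    missing = [i for i, s in enumerate(slices) if s is None]
    if not missing:
        return list(slices)  # nothing to repair
    if len(missing) > 1:
        return None  # several slices missing: treat the exam as unusable
    i = missing[0]
    if i == 0 or i == len(slices) - 1:
        return None  # no neighbour on one side, so nothing to interpolate from
    repaired = list(slices)
    repaired[i] = interpolate_slice(slices[i - 1], slices[i + 1])
    return repaired


# Usage: a three-slice scan whose middle slice is missing.
scan = [[[0.0, 2.0]], None, [[4.0, 6.0]]]
fixed = repair_scan(scan)
```

Returning None rather than a partially filled scan makes the "unusable exam" case explicit, so the caller has to decide what to do with the patient's remaining data.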

Clinical Data Structure Example: Missing or corrupted data and imputation

Missing Data

Medical data is often contaminated by missing and/or corrupted values. Missing data occurs when no value is stored for a variable in an experiment or test. It is a common occurrence that can significantly affect the conclusions drawn from the data.

If the fact that data is missing carried no information (i.e. it did not correlate with anything), this wouldn’t matter. Alas, this is rarely the case. Patients without a cancer biopsy in their medical records are usually much less likely to have cancer, because healthy patients do not undergo exams for no reason. Even in a study designed with a limited number of measurements (and thus less information) to reduce exposure to missing data, there will almost always be at least one patient with no declared age, or who has refused to report their sex. Furthermore, in a longitudinal model, i.e. a model based on the complete patient pathway over time, measurements are never all taken at the same time points, which itself results in missing data.

Dealing with missing data is often an art. The data analyst must make predictions to fill the ‘holes’ in the dataset, with as little impact as possible, so that the dataset is essentially complete. These predictions rest on assumptions about how the missing data is associated with other covariables. In a pure prediction setting, such as ML, models can usually achieve good predictive ability even without consistent imputation (‘hole patching’). Advanced ML models can even let ‘holes’ be bona fide values by applying classical statistical methods.
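As a minimal sketch of ‘hole patching’, the snippet below fills missing values in a single numeric column with the median of the observed values and keeps a flag recording which entries were originally missing. Median imputation is a deliberately simple stand-in chosen for illustration, not a recommendation for any particular clinical dataset, and the function name impute_with_median is hypothetical.

```python
from statistics import median
from typing import List, Optional, Tuple


def impute_with_median(
    values: List[Optional[float]],
) -> Tuple[List[float], List[bool]]:
    """Fill None entries with the median of the observed values.

    Returns the completed column and a parallel list of flags that are
    True where the original value was missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    was_missing = [v is None for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, was_missing


# Usage: one patient has no declared age.
ages = [34.0, None, 58.0, 46.0]
filled_ages, age_missing = impute_with_median(ages)
```

Keeping the flag alongside the filled value matters because, as noted above, the absence of a measurement is rarely meaningless: a downstream model can use the flag as a feature in its own right.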

Corrupted Data

Data corruption, on the other hand, refers to errors in computer data that occur during writing, reading, storage, transmission, or processing, and that introduce unintended changes to the original data. The consequences can range from a minor loss of data to a system crash. Corrupted data is, therefore, data that no longer carries meaning. It is often easy to notice and remove, but extremely hard to fix. Fixing corrupted data is more complicated than dealing with missing data, as it relies entirely on weak assumptions. Appropriate mechanisms must be employed to detect and remedy data corruption to ensure data integrity.
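One common detection mechanism is a checksum: a digest is recorded when a file is first written and compared again whenever the file is read or transferred. The sketch below uses SHA-256 from Python's standard hashlib module. The helper names sha256_of and is_intact are hypothetical, and the example only detects corruption; remedying it would still require a clean copy of the data.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """True if the data still matches the digest recorded earlier."""
    return sha256_of(data) == expected_digest


# Usage: record a digest at collection time, check it after transfer.
original = b"patient_id,label\n001,H\n002,L\n"
recorded_digest = sha256_of(original)

received = b"patient_id,label\n001,H\n002,J\n"  # one byte changed in transit
transfer_ok = is_intact(received, recorded_digest)
```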

Human Example: Mistakes and ambiguity

Another possible source of bias is human error during data collection. Given the enormous amount of data to collect, and the fact that applying ML models to medical data is relatively new, data collection protocols are not yet fully adapted to ML datasets.

1. Data structure input errors

Imagine a dataset used to predict whether patients are at high or low risk of developing breast cancer. Each patient has a label, either ‘H’ (for High risk) or ‘L’ (for Low risk). But when you analyze the dataset, you notice a patient with the label ‘J’. This is not a new type of risk, but merely an input error from the clinician: instead of typing ‘H’, they mistyped ‘J’, as it is the closest letter on a keyboard (see figure 2).

Figure 2. A Keyboard
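An input error like the ‘J’ label can be caught with a simple check against the set of allowed values. The sketch below assumes labels are stored as a mapping from a patient identifier to a label string; the identifiers and the function name find_invalid_labels are hypothetical. It only flags suspicious entries and does not guess a correction, since deciding whether ‘J’ was meant to be ‘H’ is best left to whoever can check the source record.

```python
from typing import Dict

ALLOWED_LABELS = {"H", "L"}  # High risk, Low risk


def find_invalid_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Return the patients whose label is not one of the allowed values."""
    return {
        patient_id: label
        for patient_id, label in labels.items()
        if label not in ALLOWED_LABELS
    }


# Usage: patient "p2" carries the mistyped label.
risk_labels = {"p1": "H", "p2": "J", "p3": "L"}
to_review = find_invalid_labels(risk_labels)
```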

2. Data structure standardization

It is often possible to have multiple ways to collect the same data sample. When different people gather a dataset, numerous collection techniques often emerge. This is especially true in the healthcare industry because data is collected by different doctors/nurses, sometimes even in different hospitals or centers.

The classic example of this is the case-sensitivity of labels. Some centers will label their at-risk patients as ‘RISK’, whereas some centers will label them as ‘Risk’, or ‘risk’. Of course, this is a simplistic example and very easy to catch, but it still requires proper attention and data cleaning.
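For the case-sensitivity example, a minimal sketch of the cleaning step is to map every variant to one canonical spelling before any analysis. The function name normalize_label is hypothetical, and stripping surrounding whitespace is an extra assumption not mentioned above, added because stray spaces are a similar kind of inconsistency.

```python
def normalize_label(raw: str) -> str:
    """Map label variants such as 'Risk' or 'risk' to a single form."""
    return raw.strip().upper()


# Usage: labels gathered from three different centers.
center_labels = ["RISK", "Risk", " risk "]
cleaned = [normalize_label(label) for label in center_labels]
```

After normalization, all three centers use the same label, so a model no longer sees them as three separate categories.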


These examples highlight that data cleaning requires data-specific know-how and, since there is no general rule or theory, it should be done exhaustively for each new dataset, whether it is histology, radiology, high-throughput screening or genetics.

Today, researchers curate datasets after the data has been collected. A higher level of quality can be reached by defining collection protocols designed for ML from the start. At Owkin, we work closely with our partners in hospitals and pharmaceutical companies to create protocols for the collection of the next generation of datasets.