Data preparation in healthcare: Part 2
In the second part of this series, we continue to explore data modalities in which the preparation and collection of each data sample play an essential role in the overall quality of the data. We will look at two examples in detail: High-Throughput Screening (‘HTS’) and radiology.
High-Throughput Screening
HTS is a method used in drug discovery and development that allows a researcher to conduct millions of chemical, genetic, or pharmacological tests quickly. In HTS, we treat cells with chemical compounds (as seen in figure 1) and use a microscope to take pictures of the reaction (see figure 2). This process allows the researcher to rapidly identify active compounds, antibodies, or genes that modulate a particular biomolecular pathway.
In HTS, it is possible to treat cells with thousands, or even hundreds of thousands, of different compounds. So, while the probability of error is low for each individual picture, the overall probability of error across thousands of compounds is not negligible, and errors will appear.
Possible errors include:
- An insufficient number of cells in a well (the small container holding the cells); and
- A microscope camera fault that produces an image with insufficient contrast.
A common way to mitigate these potential errors is to apply quality control (‘QC’) heuristics to the pictures. This is a trade-off between the amount of data and the quality of data: given the large volume of data obtained in HTS, we can afford to lose some samples. The QC heuristics act as filters that drop any data that would otherwise harm the model’s training performance.
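To make this concrete, here is a minimal sketch of what such QC filters might look like. The threshold values, the contrast measure, and the `passes_qc` helper are all hypothetical assumptions for illustration; real HTS pipelines use more elaborate heuristics.

```python
import numpy as np

# Hypothetical QC thresholds -- real values depend on the assay and microscope.
MIN_CELL_COUNT = 50
MIN_CONTRAST = 0.05  # minimum std. dev. of normalized pixel intensities

def has_enough_cells(cell_count: int) -> bool:
    """Drop wells where too few cells were seeded or survived."""
    return cell_count >= MIN_CELL_COUNT

def has_enough_contrast(image: np.ndarray) -> bool:
    """Drop images whose contrast is too low (e.g., a camera fault)."""
    normalized = image.astype(np.float64) / 255.0
    return normalized.std() >= MIN_CONTRAST

def passes_qc(image: np.ndarray, cell_count: int) -> bool:
    """Keep a sample only if it passes every QC heuristic."""
    return has_enough_cells(cell_count) and has_enough_contrast(image)

# Usage: filter a batch of (image, cell_count) pairs before training.
# clean_batch = [(img, n) for img, n in batch if passes_qc(img, n)]
```

Each heuristic is deliberately cheap to compute, so the whole filter can run over millions of images before any model training starts.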
Radiology: image acquisition parameters
Computed Tomography (‘CT’) scanners and Magnetic Resonance Imaging (‘MRI’) scanners are complex physical machines that reconstruct images from physical measurements (as depicted in figure 3). These machines are heavily dependent on their specific configuration for data sample acquisition.
If we take the CT scanner as an example, its acquisition configuration creates heterogeneity through the following factors (a short code sketch for inspecting them follows the list):
- Independent contrast enhancement phases – First, radiocontrast agents are injected into the patient to enhance specific biological features. The machine then captures images at particular time points during the scan, called ‘phases’ (depicted in figure 4). Examinations acquired at different phases are therefore not directly comparable;
- Variations in reconstruction kernels (filters) – Manufacturers apply filters when reconstructing the image. These filters range from ‘soft’ (e.g., for visual assessment of the liver) to ‘sharp’ (e.g., for the lungs); we can see an example of this in figure 5 below. Each manufacturer defines its own kernels, which affect the image’s grain differently and thus introduce bias. Furthermore, the filters applied to an image are occasionally inappropriate for the organ of interest and can be a source of unwanted ‘noise.’ In such cases, normalization methods help mitigate the bias; and
- Different image definitions – Settings such as the ‘slice thickness’ or the ‘pixel spacing’ determine the final image’s resolution and depend on the CT scanner’s manufacturer and configuration. These parameters may allow a model to directly identify the manufacturer, and hence the patient’s hospital, which can be a source of bias. Furthermore, a low image definition degrades the signal’s quality and consequently harms the model’s performance.
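Acquisition parameters like these are stored in each scan’s DICOM metadata, so a simple first step is to read and group them before training. Below is a minimal sketch assuming the `pydicom` library and standard DICOM attribute names; not every manufacturer populates every tag, so the defensive lookups (and the use of the free-text series description as a phase heuristic) are assumptions, not guarantees.

```python
import pydicom

def read_acquisition_params(path: str) -> dict:
    """Extract acquisition metadata that drives image heterogeneity.

    Missing attributes are reported as None, since not every
    manufacturer populates every DICOM tag.
    """
    ds = pydicom.dcmread(path, stop_before_pixels=True)  # metadata only, no pixel data
    return {
        "manufacturer": getattr(ds, "Manufacturer", None),
        "kernel": getattr(ds, "ConvolutionKernel", None),        # 'soft' vs 'sharp' filters
        "slice_thickness": getattr(ds, "SliceThickness", None),  # in mm
        "pixel_spacing": getattr(ds, "PixelSpacing", None),      # [row, col] spacing in mm
        # Contrast phase is often encoded in free-text fields such as
        # SeriesDescription, so this is a heuristic rather than a standard tag.
        "series_description": getattr(ds, "SeriesDescription", None),
    }

# Usage: group examinations so that only comparable acquisitions are pooled
# together (e.g., same kernel and similar slice thickness) before training.
# params = [read_acquisition_params(p) for p in dicom_paths]
```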
Conclusion
In Part One and Part Two of this series, we described how the preparation and collection of data samples can impact the quality of the data and the performance of ML algorithms. At Owkin, our lab team is capable of handling data from all of these modalities; a full understanding of MRI, for example, requires in-depth proficiency in physics, mathematics, and computer science. Owkin also regularly holds radiology seminars with our academic partners to further improve our understanding of this multifaceted modality. Stay tuned for the final part of this ‘Data Preparation in Healthcare’ series, in which we describe the second big issue of data preparation in healthcare: cleaning structured healthcare data.