When individual cells are sequenced to identify the genes active in each one, modern technologies are prone to errors. Devices, the environment and biology itself can all cause failures and differences between measurements. Researchers have now developed algorithms that make it possible to predict and correct these sources of error.
Munich/Germany — A visionary project of enormous scope, the Human Cell Atlas aims to map out all the tissues of the human body at various time points with the goal of creating a reference database for the development of personalized medicine, i.e. the ability to distinguish healthy from diseased cells. This is made possible by a technology known as single-cell RNA sequencing, which helps researchers understand exactly which genes are switched on or off at any given moment in these tiny components of life.
The increased sensitivity of the technique, however, also means increased susceptibility to the batch effect. The batch effect describes fluctuations between measurements that can occur, for example, if the temperature of the device deviates even slightly or the processing time of the cells changes. Although several models exist for the correction of these deviations, those methods are highly dependent on the actual magnitude of the effect. Researchers at the Helmholtz Zentrum München joined forces with colleagues from the Technical University of Munich (TUM) and the British Wellcome Sanger Institute to develop a user-friendly, robust and sensitive measure called kBET that quantifies differences between experiments and therefore facilitates the comparison of different correction results.
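The general idea behind a nearest-neighbour batch test can be sketched as follows: if batches are well mixed, the batch labels among any cell's nearest neighbours should resemble the overall batch proportions, and the fraction of cells where a chi-squared test rejects this is a single number that can be compared before and after correction. This is a simplified illustration, not the published kBET implementation. It assumes exactly two batches (so the test has one degree of freedom), a brute-force neighbour search, and a hypothetical function name `batch_rejection_rate`.

```python
import math
import random


def batch_rejection_rate(points, batches, k=10, alpha=0.05):
    """Fraction of cells whose local batch mix differs from the global mix.

    Simplification (an assumption of this sketch): exactly two batch labels,
    so the chi-squared statistic has one degree of freedom.
    """
    labels = sorted(set(batches))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two batches")
    n = len(points)
    global_frac = [batches.count(label) / n for label in labels]
    rejected = 0
    for i in range(n):
        # Brute-force k nearest neighbours by Euclidean distance, excluding the cell itself.
        others = (j for j in range(n) if j != i)
        neighbours = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        # Pearson chi-squared statistic: observed vs expected batch counts.
        stat = 0.0
        for label, frac in zip(labels, global_frac):
            observed = sum(1 for j in neighbours if batches[j] == label)
            expected = frac * k
            stat += (observed - expected) ** 2 / expected
        # Upper tail of a chi-squared distribution with one degree of freedom.
        p_value = math.erfc(math.sqrt(stat / 2.0))
        if p_value < alpha:
            rejected += 1
    return rejected / n


# Toy data: the same cells, once well mixed and once with batch "b" shifted far away.
rng = random.Random(0)
cells = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
batch_of = ["a" if i % 2 == 0 else "b" for i in range(200)]
shifted = [(x + 100.0 if b == "b" else x, y) for (x, y), b in zip(cells, batch_of)]

mixed_rate = batch_rejection_rate(cells, batch_of)
shifted_rate = batch_rejection_rate(shifted, batch_of)
```

A low rejection rate suggests well-mixed batches and a high one suggests a strong batch effect, so the same number computed after each correction method gives a common yardstick for comparing them.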
Apart from the batch effect, a phenomenon known as dropout events poses a major challenge in single-cell sequencing. “Let’s say we sequence a cell and observe that a particular gene in the cell does not emit any signal at all,” explains Dr. Dr. Fabian Theis, Director of the Institute of Computational Biology (ICB) at the Helmholtz Zentrum München and professor of Mathematical Modeling of Biological Systems at the TUM. The underlying cause could be biological or technical in nature: either the gene produces no signal because it is simply not expressed, or it is expressed but was not detected for technical reasons.
To tell these cases apart, bioinformaticians Gökcen Eraslan and Lukas Simon from Theis’s group drew on sequencing data from a large number of single cells and developed a deep learning algorithm, i.e. a form of artificial intelligence based on neural networks, which are loosely modeled on learning processes in humans.
Drawing on a new probabilistic model and comparing the original and reconstructed data, the algorithm determines whether the absence of a gene signal has a biological or a technical cause. According to Theis, this model even allows cell type-specific corrections to be determined without two different cell types becoming artificially similar. As one of the first deep learning methods in the field of single-cell genomics, the algorithm has the added benefit that it scales up well to handle data sets containing millions of cells, he added.
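To make the biological-versus-technical distinction concrete, here is a minimal sketch of how a probabilistic count model can weigh the two explanations for an observed zero. The article does not name the model used, so the zero-inflated negative binomial below is an assumption, chosen because it is a distribution often applied to single-cell count data. The function names and parameter values are hypothetical. In a deep learning setting the three parameters would be predicted per gene and per cell by the network; here they are simply passed in.

```python
def nb_zero_probability(mean, dispersion):
    """P(count = 0) under a negative binomial with the given mean and
    inverse-dispersion (variance = mean + mean**2 / dispersion)."""
    return (dispersion / (dispersion + mean)) ** dispersion


def technical_zero_probability(mean, dispersion, dropout):
    """Posterior probability that an observed zero came from the dropout
    (zero-inflation) component rather than from genuinely low expression.

    Assumed mixture: with probability `dropout` the count is forced to zero,
    otherwise it is drawn from the negative binomial.
    """
    zero_from_expression = (1.0 - dropout) * nb_zero_probability(mean, dispersion)
    return dropout / (dropout + zero_from_expression)


# A gene the model expects to be highly expressed: a zero is most likely technical.
high = technical_zero_probability(mean=20.0, dispersion=2.0, dropout=0.3)

# A gene the model expects to be nearly silent: a zero is plausibly biological.
low = technical_zero_probability(mean=0.05, dispersion=2.0, dropout=0.3)
```

The design point is that the same observed zero is read differently depending on what the model expects for that gene in that cell, which is how a correction can stay specific to each cell type rather than pulling all cells toward a common average.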
But the technology is not intended to smooth out results. Instead, the chief goal is to identify and correct errors, Fabian Theis explains. The scientists can then share data that are as accurate as possible with colleagues worldwide and compare results with them, for example when the Helmholtz researchers contribute their algorithms and analyses to the Human Cell Atlas, where reliability and comparability of the data are of paramount importance.