Advanced data science to enhance decision-making in healthcare, based on complex but scarce data.

Technology

Our core expertise is data sciences applied to healthcare. We build predictive models and identify which (combinations of) markers, in a broad sense, should contribute to these models. These models enable predictions in clinical research, biomanufacturing or public health.

In most cases, when large companies such as Google, Amazon, Facebook, Apple, … discuss the concept of big data, they refer to their own situation, where many observations (millions of users) are available, each described by only a limited set of features. In this context, statistical learning, i.e. Machine Learning, is quite an easy task.

In the healthcare context, there is usually a huge number (potentially billions) of parameters, while at the same time there is generally a very limited number of observations (patients, batches). That is why at DNAlytics we rather talk about fat data. This is a very hard context in which to “learn” anything with data sciences methods. This is why we developed our unique technology: dedicated algorithms and proprietary analysis pipelines that properly cope with this specific healthcare context.
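By way of illustration, here is a minimal sketch in R (with simulated data, not a DNAlytics pipeline) of how a sparse method such as the LASSO can still recover a handful of informative markers when features vastly outnumber observations:

    # "Fat data": 50 observations, 10,000 features, only 5 of which matter
    library(glmnet)
    set.seed(42)
    n <- 50; p <- 10000
    X <- matrix(rnorm(n * p), nrow = n)
    beta <- c(rep(2, 5), rep(0, p - 5))            # only features 1..5 are informative
    y <- drop(X %*% beta) + rnorm(n)
    # LASSO with a cross-validated penalty: sparsity copes with p >> n
    fit <- cv.glmnet(X, y, alpha = 1)
    selected <- which(coef(fit, s = "lambda.min")[-1] != 0)
    head(selected)                                 # ideally recovers features 1..5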

Data Sciences

Data sciences is not a single domain, but rather a set of technologies related to several aspects of data processing.

Artificial Intelligence is a very trendy and popular denomination nowadays. It has no strict, consensually approved definition; we like to refer to AI as “a set of techniques allowing a computer to perform tasks usually only achievable by a human”. In a medical context, it can typically be used to diagnose a disease, to annotate or interpret medical images, to plan a treatment path, or to predict the response to a treatment. It encompasses many different families of approaches, including constraint programming and adversarial search, but also machine learning. Bioinformatics is often associated with AI as well, sometimes debatably.

Machine Learning or ML is a set of techniques allowing a computer system to learn a concept from a set of examples. In the case of medical diagnosis, this may be a set of patient data (-omics and clinical data), along with diagnostic labels (disease or healthy). An ML system is then trained to recognize new patients it has never seen before, and to classify them as healthy or as having the disease. In doing so, the ML system also learns which variables, which “biomarkers”, it has to focus on to establish those predictions. The same technique can be applied to biomanufacturing data to build a digital twin of the production process (for example to make predictions about the yield) and to identify the key process drivers (e.g. among raw material characteristics, bioreactor settings and process parameters, which features have an effective, deterministic influence on the process yield).
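As a hedged sketch of this idea (simulated data, and a random forest chosen purely for illustration, not as our proprietary method):

    library(randomForest)
    set.seed(1)
    # Simulated cohort: 100 patients, 200 features, binary diagnostic label
    X <- data.frame(matrix(rnorm(100 * 200), nrow = 100))
    label <- factor(ifelse(X$X1 + X$X2 > 0, "disease", "healthy"))
    train <- 1:80; test <- 81:100
    model <- randomForest(x = X[train, ], y = label[train], importance = TRUE)
    predict(model, X[test, ])          # classify patients never seen before
    # Rank the "biomarkers" the model focused on
    imp <- importance(model)
    head(imp[order(-imp[, "MeanDecreaseGini"]), ], 5)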

When it comes to some specific types of data, such as images or video, a subfield of ML is particularly effective: Deep Learning. Deep Learning is an evolution of the Artificial Neural Networks of the eighties. What makes them different? They are much more complex (more “neurons”), but above all they make heavy use of convolutions as a mathematical tool. Huge progress has also been made because data (images, notably) are accessible in much larger amounts than decades ago. Specific frameworks with a high level of expressiveness have been developed, such as Keras and TensorFlow, which we use as well when appropriate. In such cases, Python is generally the preferred programming language (although R can achieve much the same).
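For instance, a minimal Keras sketch in R (the layer sizes and the 64x64 grayscale input shape are illustrative assumptions):

    library(keras)
    model <- keras_model_sequential() %>%
      layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                    input_shape = c(64, 64, 1)) %>%    # convolution over image patches
      layer_max_pooling_2d(pool_size = c(2, 2)) %>%
      layer_flatten() %>%
      layer_dense(units = 64, activation = "relu") %>%
      layer_dense(units = 1, activation = "sigmoid")   # e.g. healthy vs disease
    model %>% compile(optimizer = "adam",
                      loss = "binary_crossentropy",
                      metrics = "accuracy")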

Bioinformatics is a set of computer science techniques used to manage and analyse mostly molecular biology data (DNA, RNA, epigenetics, proteomics, …). It proves very useful in drug development and in some clinical development programs.
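As a small illustration, a typical Bioconductor workflow with the limma package ranks differentially expressed genes between two groups (the expression matrix and group labels below are simulated placeholders):

    library(limma)                        # Bioconductor package
    # expr: genes x samples matrix; group: condition of each sample
    expr <- matrix(rnorm(1000 * 10), nrow = 1000,
                   dimnames = list(paste0("gene", 1:1000), NULL))
    group <- factor(c(rep("healthy", 5), rep("disease", 5)))
    design <- model.matrix(~ group)
    fit <- eBayes(lmFit(expr, design))    # moderated t-statistics per gene
    topTable(fit, coef = 2, number = 5)   # top candidate biomarkers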

Data sciences tools

For Machine Learning and Bioinformatics applications, we generally use the R programming language, a reference in data sciences. On top of this open-source framework, we build our own pieces of software, some redistributed as open source, some remaining closed source. This constitutes our own software library, from which we then build customer applications.

Once data science results are obtained (new predictive models, biomarkers, decision rules, risk scoring, …), we ensure that they can be exploited directly by healthcare professionals or by process experts in business departments, even without hands-on data sciences skills. Accordingly, our team also has software development capabilities, in order to implement these results in custom pieces of software or webservice applications.
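As one hedged example of what such an application can look like (the plumber package and the file layout are illustrative choices, not a fixed DNAlytics stack), a trained model can be exposed as a small web service:

    # api.R -- hypothetical file exposing a saved model over HTTP with plumber
    library(plumber)
    model <- readRDS("model.rds")    # assumed: a previously trained model

    #* Return a prediction for one observation
    #* @param value a numeric feature value
    #* @get /predict
    function(value) {
      predict(model, data.frame(x = as.numeric(value)))
    }

    # Launched elsewhere with: plumb("api.R")$run(port = 8000)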

The technologies developed at DNAlytics are effective and recognized in the scientific community, as demonstrated by more than 40 publications co-authored with DNAlytics collaborators, as well as by our open-source libraries, which are downloaded more than 2,000 times per month.

Data Management

Before data can be analysed and AI approaches applied, they must be available in a clean and consistent state. A large part of data sciences projects consists in retrieving, formatting, translating and cleaning data. Various platforms may support the effective management of data in different contexts, such as Bioconductor for bioinformatics data, SimpleITK for imaging data, or OpenClinica in clinical trials; the use of these platforms is generally complemented by custom data curation pipelines developed in R or Python.
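A tiny sketch of such a curation step in R (the column names and values are hypothetical):

    library(dplyr)
    library(tidyr)
    raw <- data.frame(patient = c("P1", "P2", "P2", "P3"),
                      crp = c("5.2", "3.1", "3.1", "NA"),   # numbers stored as text
                      sex = c("F", "M", "M", "F"))
    clean <- raw %>%
      distinct() %>%                                   # drop exact duplicate records
      mutate(crp = as.numeric(na_if(crp, "NA"))) %>%   # harmonize missing values
      drop_na(crp)                                     # keep analysable rows only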

Accessing Data

As a matter of fact, any data sciences project starts with… data! We generally get them directly from our customers (proprietary data), but we may complement them with additional data obtained from public repositories. Specifically for bioinformatics or drug development projects, we have experience in connecting to and extracting data from large global databases such as KEGG, 1000Genomes, GEO (gene and genome databases) or DrugBank (drug and drug-target data)…
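For example, public expression data can be pulled from GEO with the Bioconductor GEOquery package (the accession below is a placeholder):

    library(GEOquery)               # Bioconductor package
    # "GSE0000" is a placeholder; replace with a real series accession
    gse <- getGEO("GSE0000")[[1]]   # downloads the series as an ExpressionSet
    expr <- exprs(gse)              # genes x samples expression matrix
    pheno <- pData(gse)             # sample annotations (clinical variables)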

DevOps

To obtain the computing power we need, we make heavy use of cloud computing solutions, such as Amazon Web Services (AWS), for which we are a certified AWS Partner.

These infrastructures allow us to take advantage of the most recent computing technologies, including GPUs, an alternative to more classical CPUs. GPUs and CPUs each have specificities that make them the better choice for different kinds of mathematical operations; Deep Learning in particular makes heavy use of GPUs.
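For instance, before launching a Deep Learning job on such an instance, one can check from R whether TensorFlow actually sees a GPU:

    library(tensorflow)
    # List the accelerators TensorFlow can use on this machine
    tf$config$list_physical_devices("GPU")   # an empty list means CPU-only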

The deployment of data sciences applications, to make them available in practice, also requires mastering several IT frameworks, such as Docker, R Shiny or Conda/Anaconda (a non-exhaustive list).
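As a minimal illustration of the R Shiny framework mentioned above (the input and the scoring rule are purely hypothetical):

    library(shiny)
    # Toy decision-support app: a user enters a marker level and gets a score
    ui <- fluidPage(
      numericInput("marker", "Marker level:", value = 1),
      textOutput("score")
    )
    server <- function(input, output) {
      output$score <- renderText(
        paste("Risk score:", round(0.3 * input$marker, 2))   # placeholder rule
      )
    }
    # shinyApp(ui, server)   # run this line to launch the app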