Once a background topic, immunology has become mainstream in the past few years because of the Covid-19 pandemic. Immunologists’ roles are challenging, as they need to collect and systematically analyze large amounts of biomedical data about the immune system itself, as well as clinical data, to understand human health and disease progression. This research is a collective effort involving geographically distributed, multidisciplinary teams of basic researchers, clinicians, molecular biologists, computational analysts, data scientists, and software engineers.
As a leading bioscience research institute, we knew that having the right technology partners would be critical for our teams to conduct their work. That is why in 2018 we started working with Google Cloud to build a platform that enables scientists and researchers to collaborate and find new ways to diagnose and ultimately treat disease.
Google Cloud provides the computational resources and storage solutions to analyze and integrate large volumes of biomedical and clinical data at scale. These solutions can enable research teams to unlock critical insights from global data in a cost-effective, on-demand model. Further, Google Cloud can help customers orchestrate their data in a secure, compliant, and private way, so that data stays on the local server and only the insights and trained models are aggregated across platforms.
Google Workspace provides the tools needed to securely connect, create, and collaborate, all in one integrated solution. Together, these technologies provide the backbone of a collaborative research platform designed to help scientific teams arrive at novel insights into the immune system, predict disease onset, understand disease progression, and develop clinical therapies for treatment and prevention.
Deep immune profiling of the human immune system
In 2019, the Allen Institute for Immunology launched an ambitious effort to study health and disease in a longitudinal study tracking hundreds of individuals for several years. In this study, we collected and analyzed many diverse data sets derived from blood and tissue samples using methods such as genetic sequencing, techniques that detect and measure physical and chemical characteristics of a population of cells, and measurements of protein levels.
Additionally, detailed human metadata was collected, including demographics, diet and lifestyle habits, and health history. This study was done in close collaboration with partner researchers at the Fred Hutchinson Cancer Research Center, the Benaroya Research Institute, the University of Pennsylvania, the University of California San Diego, and the University of Colorado. Born out of this initiative was the Human Immune System Explorer (HISE), a scientific computing platform built entirely on Google Cloud to house and analyze this data.
HISE is a secure research platform to gather, analyze, interpret, and share findings. It is built on the conviction that science is a collaborative activity that involves careful cooperation among a team of multidisciplinary scientists; its functionality reaches beyond data storage and analysis into interpretation and discussion. Since its launch, it has been used to support multiple research collaborations, including studies of SARS-CoV-2 infections and long-term effects in long haulers.
Biomedical research requires a big data mindset
Data analysis at scale is one of the central tenets of HISE. In a longitudinal study, the same types of data are collected repeatedly over time for various groups of participants. Some data sets, particularly in genomics research, are large in size or number. For example, a starting data set for single cell sequencing analysis can be sizable (about 1 terabyte), and at the Allen Institute, where multiple cohorts of subjects are studied over a period of time, many of these data sets are generated. Automated analysis pipelines are the ideal solution to standardize analysis, scale computing resources as needed, and at the same time enable quality control verification by expert reviewers.
After standardized analysis, computational biologists engage in hypothesis-driven research, for instance to understand how much an individual’s immune system varies over time, how it responds after a Covid-19 vaccine has been administered, or how different groups of healthy subjects and patients compare to each other. For this step of analysis, data scientists can rely on a fully managed, Jupyter Notebook-based compute infrastructure tailored to biomedical research and built on top of Google Cloud’s Vertex AI Workbench. The Allen Institute built its own user interface on top of the Vertex AI backend and provided prebuilt images, so scientists can use the platform immediately and work effectively and efficiently.
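To make the first of those questions concrete, here is a minimal sketch of one way to quantify how much a measurement varies within each individual over time: the coefficient of variation (standard deviation divided by mean) across visits. This is an illustration only, not HISE code or the institute's method; the subject names, the record layout, and the numbers are all made up for the example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical longitudinal records: (subject_id, visit_number, value).
# The values are invented for illustration and are not study data.
measurements = [
    ("subject_a", 1, 10.0),
    ("subject_a", 2, 12.0),
    ("subject_a", 3, 14.0),
    ("subject_b", 1, 20.0),
    ("subject_b", 2, 20.0),
    ("subject_b", 3, 20.0),
]


def coefficient_of_variation(records):
    """Return {subject_id: sample stdev / mean} over each subject's visits."""
    by_subject = defaultdict(list)
    for subject_id, _visit, value in records:
        by_subject[subject_id].append(value)

    result = {}
    for subject_id, values in by_subject.items():
        if len(values) < 2:
            continue  # a sample standard deviation needs at least two visits
        average = mean(values)
        if average == 0:
            continue  # the ratio is undefined when the mean is zero
        result[subject_id] = stdev(values) / average
    return result


cv = coefficient_of_variation(measurements)
```

In this toy data, `subject_a` changes from visit to visit while `subject_b` stays constant, so `subject_a` ends up with the larger coefficient. A real analysis would run on many analytes per subject and would need to account for measurement noise and batch effects.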
Integrating diverse datasets leads to deeper insights
The complexity of the immune system is humbling. Overlaying the results of many research modalities is key to building up a cumulative picture of the patterns and dynamics that are at play to protect our body against intrusive bacteria and viruses, and secure immune balance. This means that we need to develop scientific algorithms to integrate data from diverse sources as well as enable researchers across scientific disciplines to jointly interpret results. Data visualization strategies feature prominently in interpretation. HISE offers a visualization framework enabling computational analysts to develop rich visualizations to inspect complex data sets. Instead of having to rely on select static images of results, collaborating scientists have direct access to these interactive visualizations to further inspect the data and gain deeper insights.
Science is a collaborative activity
Research is carried out by multidisciplinary teams. HISE supports all phases of the research lifecycle, providing a collaboration space to discuss ideation and planning, collate key data sets, visualizations, and other insights, and prepare summaries and findings for publication and dissemination. With tools like Google Docs, Slides, and Drive, Google Workspace lets research teams of all sizes connect, create, and collaborate from many devices and any location.
Tracking and reproducing the complexity of scientific analysis
Data analysis in biomedical research typically involves many steps and different processes. In this sense, it is no different from any other form of experimental science. Without strict management of the data, the complexity of the research endeavor often makes it incredibly difficult to understand the origin of interesting results.
In HISE, we track how data enters the system, what methods are used to generate new data sets, and how data is combined into summary data. Whether data is processed by automated pipelines, handled ad hoc by data scientists in an IDE using custom code, or summarized in a visualization, a scientist can review which original raw data sets were used and how these were transformed or combined, in any number of steps, to produce interesting insights.
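The idea of tracing a result back to its raw inputs can be sketched as a small lineage graph. The following is a hypothetical illustration, not HISE's actual data model: the `Dataset` class, its fields, and the example names are assumptions made for this sketch. Each derived data set records the method that produced it and the data sets it was built from, and a walk over those links recovers the raw sources.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical lineage record: a name, how it was made, and its inputs."""
    name: str
    method: str = "ingest"
    parents: list = field(default_factory=list)


def raw_sources(dataset):
    """Walk the lineage graph back to data sets that have no parents."""
    seen = set()
    roots = []
    stack = [dataset]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue  # shared ancestors are visited only once
        seen.add(id(current))
        if current.parents:
            stack.extend(current.parents)
        else:
            roots.append(current.name)
    return sorted(roots)


# Example chain: raw data -> pipeline output -> notebook merge -> visualization.
raw_a = Dataset("raw_a")
raw_b = Dataset("raw_b")
normalized = Dataset("normalized_a", method="pipeline", parents=[raw_a])
merged = Dataset("merged", method="notebook", parents=[normalized, raw_b])
plot = Dataset("summary_plot", method="visualization", parents=[merged])

sources = raw_sources(plot)
```

Here `sources` lists both raw inputs behind the final plot, even though one of them passed through an intermediate pipeline step first. A production system would also need to persist these records and capture code versions and parameters, which this sketch leaves out.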
Much has been said about the reproducibility crisis in science, the observation that the findings of many published studies cannot be independently replicated. A built-in tracking system like the one described here can be extremely helpful in reviewing how a data analysis was executed.
A typical phenomenon for any research project is that there are always more questions raised than can be addressed, as grants and other forms of funding set limits on budget and duration. Moreover, in cloud-based environments where resources can be spun up on demand, the potential for cost overruns is real and should be considered carefully. Near real-time cost monitoring and quota limitations are part of the functionality offered to scientists to assist them with managing their analysis costs against their budget. Separate tracking of storage costs can similarly help scientists devise data governance strategies.
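As a rough illustration of checking spend against a budget, the sketch below totals charges by category, so that compute and storage can be tracked separately, and flags when spending approaches or exceeds a limit. The function names, the 80% warning threshold, and the charge amounts are assumptions for this example; they do not describe HISE's implementation or any Google Cloud billing API.

```python
from collections import defaultdict


def totals_by_category(charges):
    """Sum (category, amount) charges into a {category: total} dict."""
    totals = defaultdict(float)
    for category, amount in charges:
        totals[category] += amount
    return dict(totals)


def budget_status(spend, budget, warn_fraction=0.8):
    """Classify spend against a budget as 'ok', 'warning', or 'over_budget'."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    if used >= 1.0:
        return "over_budget"
    if used >= warn_fraction:
        return "warning"
    return "ok"


# Hypothetical charges; the amounts are invented for illustration.
charges = [("compute", 300.0), ("storage", 50.0), ("compute", 550.0)]
totals = totals_by_category(charges)
compute_status = budget_status(totals["compute"], budget=1000.0)
storage_status = budget_status(totals["storage"], budget=1000.0)
```

With these numbers, compute spend has used 85% of its budget and is flagged as a warning, while storage is well within its limit. Keeping the two categories apart is what lets storage costs inform data governance decisions separately from analysis costs.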
Committing to open science
The Allen Institute has a deep commitment to open science. We want to share exciting new findings with fellow scientists and any interested person alike. For this reason, we recently launched the public portal of HISE, where we showcase work that excites us. You can read more about the heavy investment our research laboratory team has made in building fast, efficient, highly reliable, and highly repeatable data generation techniques; in understanding how the age of a fresh blood sample affects the ability to detect rare cell populations; and in a world-first, highly sophisticated procedure in which the exact same cell can be measured using three different types of genomic sampling techniques.
You can also learn more about the work of our computational analysts, who build algorithms to optimize automated analysis pipelines for sequencing data and to recognize how an individual’s immune system changes over time or how one person’s immune system compares to another’s.
If you are interested in reading about the latest findings, check out the public portal, and check back regularly, as new findings will be posted on an ongoing basis.
By: Paul Meijer, Ph.D. (Director of Software Development, Database and Pipelines at the Allen Institute)
Source: Google Cloud Blog