On February 9 and 10, 2017, we hosted the Gene Variation to 3D (GVto3D) workshop at the Institute for Systems Biology in Seattle, WA. The goal of the workshop was to explore the state of the field connecting genetic variation and 3D protein structure, and to bring together some of the key researchers working on interpreting genetic variation data. The workshop consisted of a mixture of talks, discussion sessions, and breakout groups. Twenty-five speakers provided short (15-minute) summaries of their research; see the program. The oral presentations connected the workshop theme to diverse topics such as RNA sequencing, Big Data technologies, precision medicine for specific diseases, and cancer research. One breakout group discussed existing ontologies, tools, and datasets in the field and considered potential architectures for an integrative framework. The second breakout group discussed unmet needs, ranging from improvements in the structural interpretation of splicing variants to more effective dissemination of knowledge to clinical geneticists, tumor boards, and the general public.
The workshop opened with a session on protein structures. Stephen Burley (Rutgers University, USA) provided examples of proteins in the PDB with mutations leading to functional effects on cellular movement and drug resistance; he stressed that “we cannot afford not to sequence each patient” and solicited feedback on needs for programmatic access to the protein structure resource. John Moult (University of Maryland, USA) discussed structure-based variant annotation and the difficulty of modeling gain-of-function mutations. He also highlighted that, even though knowledge of structure is useful, it has rarely led to success in the Critical Assessment of Genome Interpretation (CAGI) challenge. Geoff Barton (University of Dundee, UK) discussed learning about protein structure constraints by integrating knowledge of human variation; much can be learned by integrating observed human variation over genomic features and columns of protein domain alignments, and correlating it with variation among species.
The second session focused on the landscape of variation. Gustavo Glusman (Institute for Systems Biology, USA) presented the Kaviar database and discussed the various applications of genome variation databases: for filtering observations in individual genomes, as interpretation references, and as research objects in their own right. Lydie Lane (Swiss Institute of Bioinformatics, Switzerland) surveyed the landscape of human diversity at the protein level curated at neXtProt, and raised the question of how to scale up the ongoing annotation project in the context of cancers and genetic diseases. Emøke Bendixen (Aarhus University, Denmark) reported on the status of the 1000 Bull Genomes Project, gave examples of phenotypically equivalent variation in cows and humans, and stressed that the methods and data needs of precision breeding are very similar to those of precision medicine. Andy Kong (University of Michigan, USA) discussed advances in the identification of variant peptides from mass spectrometry-based shotgun proteomics, including the large number of post-translationally modified peptides and their significance for human disease. Finally, Hagen Tilgner (Weill Cornell Medicine, USA) surveyed the complexity of transcript isoforms. He elaborated on the effects of non-synonymous SNP combinations and combinations of alternative splicing events in the same gene on protein molecules, also noting that alternatively spliced regions often correspond to unstructured regions of proteins, affecting protein-protein interactions (3437557, 22749401). More generally, he added that the full-length information from long-read RNA sequencing, which captures all variable sites of each molecule, may provide highly useful additions for the proteomics community (24961374, 25985263, 24108091).
In the first of two sessions on mapping variation to structure, Marija Buljan (ETH Zürich, Switzerland) described ~7000 interactions among 315 soluble kinases and discussed mapping cancer mutations onto the kinase interaction interfaces. Julia Koehler Leman (Flatiron Institute, Simons Foundation, New York, USA) discussed methods, built on the previously established VIPUR pipeline and the RosettaMP framework for modeling membrane proteins, for accelerating the analysis of transmembrane proteins, which are difficult to study and have lower-quality structures in the PDB, yet account for the majority of drug targets. Elizabeth Brunk (UC San Diego, USA) presented multi-scale models spanning SNPs to reaction rates that integrate available protein structures: 45% of known reactions in humans have associated protein structures, and some networks are enriched for mutations in cancer. Both Dr. Brunk and Jianjiong Gao (Memorial Sloan Kettering Cancer Center, USA) reported on 3D clustering of cancer mutations, which helps interpret variants, a recurring theme in several presentations. Stating that “common mutations are rare, and rare mutations are common” (i.e., most known variants are rare, and a minority are common), Dr. Gao called for analysis of more samples per cancer type to improve detection of 3D mutational hotspots.
A second session on mapping hosted Torsten Schwede’s (SIB & University of Basel, Switzerland) presentation on methods for automated protein structure homology modeling and the mapping of variants from UniProt at SWISS-MODEL; he emphasized the challenge of modeling quaternary structures, interfaces, and ligands, where SVM-based scores that take sequence conservation into account allow biologically relevant protein-protein interfaces to be distinguished from artifacts. Rachel Karchin (Johns Hopkins University, USA) described applications of MuPIT, a toolkit for visualizing mutations on protein structures with support for easy integration. David Masica (Johns Hopkins University, USA) encouraged the study of continuous endophenotypes (instead of qualitative phenotypes) for elucidating the effects of variants mapped onto structures and clustered, using cystic fibrosis as an example. Adam Godzik (Sanford Burnham Prebys Medical Discovery Institute, USA) discussed analysis of cancer mutations in the context of 3D protein structures, evaluating whether mutations cluster in interfaces and at binding sites, and showed the interpretive value of exploring the flexibility inherent in protein structures and of mechanistically interpreting structure modifications.
During a session devoted to applications, Gil Omenn (University of Michigan, USA) warned against ignoring the differences between transcript and protein isoforms of key genes, identified the structural effects of transcript splicing variation as both a major gap in knowledge and a therapeutic opportunity, and discussed computational methods for their study, including the I-TASSER family of algorithms (3230717, 25549265, 22570420). Michael Hicks (Human Longevity Inc., USA) went on to report on 8600 novel and 11,500 missense variants in newly sequenced individual genomes, and described analyses of genetic variation mapped onto 3D structures of drug targets and protein-ligand interactions. Frances Pearl (University of Sussex, UK) concluded the applications session by contrasting the plentiful, widespread inactivating mutations with the rare, clustered gain-of-function mutations; she pointed out that mutation hotspots can distinguish tumor suppressors from oncogenes and that detailed modeling of water molecules can be crucial for predicting variant effects (28423505, 27284061).
The final session of the two-day workshop focused on Big Data technologies. Andreas Prlić (University of California, San Diego, USA) presented the status of human genome coverage in the Protein Data Bank and called for wide adoption of DataFrames as a scalable data exchange format to enable interoperable, quick and flexible querying. Ariel Rokem (University of Washington, USA) surveyed principles, tools and practices for open and reproducible data-driven discovery, emphasizing the challenges of ensuring persistence and longevity of research tools and datasets in an ever-changing digital world. Michael Heuer (AMPLab, University of California, Berkeley, USA) described massive speedups of standard DNA and RNA analysis pipelines by using specialized file formats and algorithms, in-memory cluster computing and support for interactive data analysis. Sheila Reynolds (Institute for Systems Biology, USA) finished out the final session, describing how multiple access modes (web-based, scripting, programmatic) help researchers efficiently compute on a single copy of big data in the ISB Cancer Genomics Cloud, democratizing access to this important resource.
After all the presentations and discussion sessions concluded, workshop participants separated into two breakout groups to brainstorm about how the community as a whole could accelerate progress in the field in ways that individual labs could not. Breakout group 1 focused on how tools and resources could be made more interoperable, to enable more widespread use of the tools and integration of their inputs and outputs. Important aspects that emerged in the discussion included:
- Adoption or development of standardized formats for the major data types (variants, splice isoforms, post-translational modifications, structures, sequence annotations, phenotypes, etc.).
- Mechanisms to scale up information exchange to large-scale queries using big data technologies such as DataFrames and BigQuery (see the sketch after this list).
- Use of ontologies to standardize terminology for the exchange of data and knowledge. These ontologies mostly exist already and need only be designated as the standard, although some extension may be required.
- Selection of initial tools that should be part of a pilot phase of the development and initial deployment of the interoperability framework.
- Development of a tool registry and portal that would serve as a web-accessible resource for finding relevant tools, their inputs and outputs, and also reference data files that can be used to demonstrate and validate the tools and their interoperation.
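To make the DataFrame point above concrete, here is a minimal sketch, assuming a pandas-style DataFrame with entirely hypothetical column names and example records, of how variant-to-structure mappings might be exchanged between tools and queried flexibly. It illustrates the idea discussed by the breakout group, not a format agreed upon at the workshop.

```python
import pandas as pd

# Hypothetical records mapping genomic variants to residues in PDB structures.
# Column names and values are assumptions for this sketch, not a standard.
variant_mappings = pd.DataFrame(
    [
        ("17", 43094464, "G", "A", "BRCA1", "P38398", 1175, "1JM7", "A"),
        ("7", 140753336, "A", "T", "BRAF", "P15056", 600, "4MNE", "B"),
    ],
    columns=[
        "chrom", "pos", "ref", "alt", "gene",
        "uniprot_ac", "protein_pos", "pdb_id", "chain",
    ],
)

# The kind of quick, flexible query an interoperability framework should support:
# all mapped variants in a gene of interest that fall on a resolved structure.
hits = variant_mappings[
    (variant_mappings["gene"] == "BRAF") & variant_mappings["pdb_id"].notna()
]
print(hits[["chrom", "pos", "gene", "protein_pos", "pdb_id", "chain"]])
```

In practice, the same tabular schema could be served at scale through engines such as BigQuery, which is the scaling route noted in the list above.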
Breakout group 2 focused on unmet needs and future directions. Salient points that were discussed included:
- How can we increase the actionability of variants observed in patients? Beyond facilitating access to knowledge on the structural impacts of variants, there is a need for a metric of confidence in the predicted impact. Gene-editing technologies are likely to enhance experimental studies of salient variants.
- There is also a need to recognize interactions among multiple variants within single genes and proteins, and mutation effects on protein-protein, protein-nucleic acid, or protein-ligand and drug interactions. Annotation of the context in which each variant could have an effect is also important. For instance, information on the cell types or cellular conditions in which specific interactions or protein complexes form, as well as annotation of epistatic relationships with mutations elsewhere in the genome, can help in interpreting a mutation’s influence on the cell.
- How can we improve the interpretation of variants affecting splicing? A proposal was made to create a mechanism for collecting donated RNA-seq data in order to derive a comprehensive set of splice variants and interpret them in the context of their effects on protein structure. It may also be useful to organize data on splice variants by type of alternative splicing event (e.g., exon swaps, intron retention).
- How can we standardize annotation pipelines and data integration methods? It was recognized that this has already been partially solved, independently, by various teams, so there would be benefit in implementing an interoperability framework.
- Who are the target audiences? Scientists, tumor boards, clinical geneticists, developers of targeted drugs, patients, and lay people with an interest in genetic testing.
- How can we improve documentation and outreach? Suggestions included the development of documentation videos and tutorials, and contributing to Wikipedia sections describing the impact of variants on protein structure, building on existing experience such as the Proteomics Standards Initiative of the Human Proteome Organization.
The workshop has already had a positive impact on the field: collaboration and interoperability are beginning to emerge. For example, the Kaviar database of human SNPs and the PeptideAtlas database of proteins detected by mass spectrometry now link to the MuPIT resource, so that variants in the former resources can be depicted using the tools of the latter.