In 2018 and again in 2020, DeepMind submitted its newly developed AI-based models, AlphaFold and AlphaFold 2, to the Critical Assessment of Techniques for Protein Structure Prediction (CASP13 and CASP14, respectively). CASP is an independent, community-wide experiment held every two years, in which participants test their predictive models on proteins whose experimental structures are not yet publicly available. The aim is to predict the 3D structures of these new proteins from their amino acid sequences. Importantly, performance is validated by a panel of independent assessors who do not know the identity of those making the submissions, and the participants have no access to the experimental structures of the proteins (a double-blinded design).
CASP13* drew 98 research groups from 21 countries, who submitted more than 57,000 predictions from 185 modelling methods across six categories:
- Overall structure of proteins.
- Refinement of an approximate structure closer to the experimental one.
- Estimated accuracy of a model, both for the overall structure and for each residue.
- The structure of protein oligomers.
- The ability to improve models using a variety of sparse data.
- The accuracy of protein structure features that are important for predicting the function.
The assessments are split into two stages: a short (72-hour) window intended for fully automated pipelines, followed by a longer (3-week) window that allows for more complex processes incorporating human input. Targets are further divided into two main evaluation categories: template-based modelling (TBM), which uses existing structures as a “backbone” for predicting analogous structures with differing sequences; and template-free modelling (FM), which covers targets that have no templates detectable from sequence. For completeness, there is a third category that falls into a grey area between these two classes.
For more than two decades, the precision of all state-of-the-art methods remained at the level of random chance. This changed in 2016 (CASP12), when accuracy finally began to improve markedly, with average results showing the correct overall topology. By 2018 (CASP13), most proposed methods not only recovered the correct overall topology but also captured many atomic-level details correctly. The committee reported that although many technical improvements contributed to the rise in prediction accuracy, the most notable contribution came from the use of deep neural networks (DNNs; Kryshtafovych et al., 2019).
Importantly, before CASP13, template-based modelling saw a dramatic increase in prediction accuracy across all six domains of assessment. Template-free modelling, however, proved more challenging and took longer to deliver good predictions. Traditionally, the most successful approaches were based on fragment assembly, optimised through stochastic sampling processes that were extremely computationally expensive and time-consuming. Neural network methods arrived in 2016, but they did not reach maturity until 2018, when the first incarnation of AlphaFold emerged alongside two other deep learning (DL) based methods (RaptorX and TripletRes). Although we are yet to see the full published report from CASP14, the results published online show clear domination by the latest incarnation of the AlphaFold model (AlphaFold 2).
For a detailed description of AlphaFold, please refer to the original article (Senior, A.W., Evans, R., Jumper, J. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020)). In short, a convolutional neural network (CNN) with two independent output heads was trained on multiple sequence alignment (MSA) features for proteins from the Protein Data Bank (PDB; Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000)). The model was designed around three terms: contact points of protein residues, protein structure, and a smoothing term to prevent steric clashes (unnatural overlap of any two non-bonding atoms). Contact points of protein residues are estimated from the distances between pairs of β-carbon atoms; the general structure of the protein is predicted from the backbone torsion angles (the angles at which amino acid residues are “projected” from the protein backbone); and steric clashes are prevented by optimising the van der Waals distance (the distance at which forces between two atoms become repulsive rather than attractive as the atoms approach one another). All parameters were optimised with L-BFGS, a gradient-based quasi-Newton method.
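To make the geometric ideas above more concrete, here is a minimal Python sketch of how residue contacts and steric clashes can be read off from pairwise β-carbon distances. It is not AlphaFold's network or potential. The function name `classify_pairs`, the toy coordinates, and both thresholds are assumptions made for illustration: 8 Å is a commonly used contact cutoff, and the 3 Å clash cutoff is simply a stand-in for a van der Waals limit.

```python
import math
from itertools import combinations

# Assumed thresholds in angstroms (not taken from the article).
CONTACT_CUTOFF = 8.0  # commonly used C-beta contact cutoff
CLASH_CUTOFF = 3.0    # illustrative stand-in for a van der Waals limit


def classify_pairs(cb_coords, min_separation=2):
    """Return (contacts, clashes) as lists of residue index pairs.

    Pairs fewer than `min_separation` positions apart in the sequence
    are skipped, because chain neighbours are trivially close.
    """
    contacts, clashes = [], []
    for i, j in combinations(range(len(cb_coords)), 2):
        if j - i < min_separation:
            continue
        d = math.dist(cb_coords[i], cb_coords[j])
        if d < CLASH_CUTOFF:
            clashes.append((i, j))   # atoms overlap unnaturally
        elif d < CONTACT_CUTOFF:
            contacts.append((i, j))  # close enough to count as a contact
    return contacts, clashes


# Hypothetical C-beta coordinates (in angstroms) for a five-residue toy chain.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0),
    (0.5, 2.0, 0.0),
]

contacts, clashes = classify_pairs(toy_coords)
print("contacts:", contacts)
print("clashes:", clashes)
```

In this toy chain, residues 0 and 4 sit about 2 Å apart and are flagged as a clash, which is the kind of overlap the smoothing term described above is meant to penalise.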
With this latest development, template-free predictions have finally reached an accuracy that matches not only template-based approaches but also laborious lab-based methods. It is exciting to see that, after some time of maturation, DL models are finally reaching the prediction accuracies necessary to push our understanding of biology to the next level.
Another important lesson from this study is its demonstration that analytics do not always require mathematically or structurally complex designs to tackle some of the most difficult conundrums in science. The researchers from DeepMind show that a profound understanding of the domain, coupled with a clever and relatively simple design, can sometimes produce a highly successful and useful algorithm that, despite its simplicity, has a far-reaching impact on a very complex problem. This strongly resonates with our research lab at Max Kelsen, where we previously showed that simple models, such as ResNet-34, are sometimes enough to deliver capabilities that offer new insights into biological phenomena such as glaucoma. We are looking forward to the full CASP14 publication in the near future.
*The full CASP14 report is planned for publication in mid-2021.