Modeling, understanding, and predicting >2 million Arabidopsis proteoforms in the post-genomic era
For most of the more than 2 million non-synonymous single nucleotide polymorphisms (SNPs) found in the assembled Arabidopsis genomes of the 1001 Genomes Project, the effect on protein structure and function is unknown. The influence of SNP-encoded single amino acid variations can be complex, including changes in protein expression, stability, transport, post-translational modification, and molecular interactions, among other effects. To obtain a comprehensive overview of the landscape of single amino acid variations in Arabidopsis, our project performs proteome-wide bioinformatic analyses of SNP effects in all Arabidopsis proteins. Structural modeling with AlphaFold and classical biophysics-based approaches reveals the 3D distribution and energetic impact of amino acid variations in Arabidopsis proteins and protein complexes. In silico deep mutational profiling with artificial intelligence methods allows the identification of regions with high evolutionary constraint and functional importance. Furthermore, our project develops bioinformatics tools for 3D visualization and effect prediction of protein variants. These tools are implemented in the SNPstar database, which provides a user-friendly interface for exploring the repertoire of SNPs in Arabidopsis.