20 May 2021

ChemRxiv

Authors: Héléna Alexandra Gaspar, Mohamed Ahmed, Thomas Edlich, Benedek Fabian, Zsolt Varszegi, Marwin Segler, Joshua Meyers, Marco Fiscato

Abstract

Proteochemometric (PCM) models of protein-ligand activity combine information from both the ligands and the proteins to which they bind. Several methods inspired by the field of natural language processing (NLP) have been proposed to represent protein sequences.
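
As a rough illustration of the PCM setup, the sketch below concatenates a ligand fingerprint with a fixed-length protein descriptor and fits a standard regressor on the joint feature vector. It assumes hypothetical data, RDKit Morgan fingerprints for the ligands, and a generic random-forest model; it is not the paper's exact pipeline.

```python
# Minimal PCM sketch (hypothetical data; not the paper's exact pipeline):
# each protein-ligand pair is featurised by concatenating a ligand
# fingerprint with a protein descriptor, then fed to a standard regressor.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def ligand_fp(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """Morgan (ECFP-like) bit fingerprint for the ligand."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.asarray(fp, dtype=float)

def pcm_features(smiles: str, protein_desc: np.ndarray) -> np.ndarray:
    # protein_desc: any fixed-length descriptor
    # (one-hot, NLP embedding, MSA similarity row, ...)
    return np.concatenate([ligand_fp(smiles), protein_desc])

# Toy example: two proteins represented as 2-dim one-hot vectors.
pairs = [("CCO", np.array([1.0, 0.0])), ("c1ccccc1", np.array([0.0, 1.0]))]
X = np.stack([pcm_features(s, p) for s, p in pairs])
y = np.array([5.2, 6.8])  # hypothetical pActivity labels
model = RandomForestRegressor(n_estimators=100).fit(X, y)
```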

Here, we present PCM benchmark results on three multi-protein datasets: protein kinases, rhodopsin-like GPCRs (ChEMBL binding and functional assays), and cytochrome P450 enzymes. Keeping ligand descriptors fixed, we evaluate our own protein embeddings, based on subword-segmented language models trained on mammalian sequences, against pre-existing NLP-based descriptors, protein-protein similarity matrices derived from multiple sequence alignments (MSAs), dummy one-hot protein encodings, and a combination of NLP-based and MSA-based descriptors. Our results show that performance gains over one-hot encodings are small, and that combining NLP-based and MSA-based descriptors consistently increases predictive performance across different splitting strategies. This work was presented at the 3rd RSC-BMCS / RSC-CICAG Artificial Intelligence in Chemistry symposium in September 2020.
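
For concreteness, the sketch below lays out the four families of protein descriptors compared in the benchmark. All protein names, dimensions, and values are illustrative assumptions rather than the paper's actual data.

```python
# Sketch of the protein descriptor variants compared in the benchmark
# (names, dimensions, and values are illustrative assumptions).
import numpy as np

proteins = ["KIN_A", "KIN_B", "KIN_C"]  # hypothetical protein IDs

# (1) Dummy one-hot encoding: the identity matrix over the protein set.
one_hot = {p: row for p, row in zip(proteins, np.eye(len(proteins)))}

# (2) NLP-based embedding, e.g. from a subword-segmented language model;
# random vectors stand in for real model outputs here.
rng = np.random.default_rng(0)
nlp_emb = {p: rng.normal(size=64) for p in proteins}

# (3) MSA-based descriptor: each protein's row of a pairwise similarity
# matrix derived from a multiple sequence alignment.
msa_sim = np.array([[1.0, 0.6, 0.2],
                    [0.6, 1.0, 0.3],
                    [0.2, 0.3, 1.0]])
msa_desc = {p: msa_sim[i] for i, p in enumerate(proteins)}

# (4) Combined NLP + MSA descriptor: simple concatenation of (2) and (3).
combined = {p: np.concatenate([nlp_emb[p], msa_desc[p]]) for p in proteins}
```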

