
While each Mendelian disease (a disease caused by a single gene) is individually rare, their aggregate number is significant. Discovering which gene causes a Mendelian disease is crucial for accurate diagnosis and treatment. Despite decades of effort, the genetic cause of over half of the identified Mendelian diseases remains unknown. To address this, we describe MENDELSEEK, a machine learning approach that predicts Mendelian genes by integrating each gene’s aggregate residue variation score with properties derived from its associated pathways, Gene Ontology processes, and protein language models. In benchmarking on 16,946 human genes with 10-fold cross-validation, MENDELSEEK achieves an area under the receiver operating characteristic curve (AUC) of 0.850 and an area under the precision-recall curve (AUPR) of 0.695, compared with 0.781 and 0.604 for the second-best method that uses residue variation, ENTPRISE+ENTPRISE-X, and 0.597 and 0.390 for the third-best approach, REVEL. Mendelian genes have significantly more protein-protein interactions than non-Mendelian genes and are evolutionarily ancient. Applying MENDELSEEK to 17,858 genes across the whole human genome predicts 1,024 novel Mendelian genes with a precision >0.7. Thus, MENDELSEEK represents a major improvement over the state-of-the-art and provides valuable insights into the biochemical features that distinguish Mendelian from non-Mendelian genes.
Citation: Zhou H, Skolnick J. Submitted. MENDELSEEK: An algorithm that predicts Mendelian Genes and elucidates what makes them special.
Source code: The source code of MENDELSEEK is freely available on GitHub.