Journal of computational biology : a journal of computational molecular cell biology

Effect of Protein Repetitiveness on Protein-Protein Interaction Prediction Results Using Support Vector Machines.

PMID 27529135


Many computational approaches that use support vector machines (SVMs) to predict protein-protein interactions report high performance. In fact, the performance of currently reported methods is significantly overestimated and is affected by object repetitiveness in the datasets used. To study the effect of dataset object repetitiveness on prediction results, we present novel methods that use graph maximum matching on protein-protein interaction datasets to construct positive datasets with or without repeated proteins; a corresponding series of negative datasets with different degrees of protein repetitiveness is constructed using the graph adjacency matrix. We analyze the relationship between SVM prediction results and the repeated proteins (repeat numbers and repeat rates), as well as the distributions of repeated proteins in the datasets. Protein repetitiveness in the positive and negative datasets affects the prediction result: high protein repetitiveness in either the positive or the negative dataset yields high prediction performance. This indicates that dealing with object repetitiveness in datasets is a key issue in protein-protein interaction prediction using SVMs, since real-world data contain a certain degree of repeated proteins.
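
The dataset construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a greedy maximal matching as a simpler stand-in for the maximum matching named in the abstract (the result has no repeated proteins but is not guaranteed to be the largest such set), and it assumes that "repeat number" means the count of proteins appearing in more than one pair and "repeat rate" means that count divided by the number of distinct proteins. The function names and the toy interaction list are hypothetical.

```python
from collections import Counter


def greedy_matching(pairs):
    """Select pairs so that no protein appears twice (greedy maximal matching).

    Stand-in for a true maximum matching: valid, but possibly smaller.
    """
    used = set()
    matching = []
    for a, b in pairs:
        # Skip self-interactions and any pair that reuses a protein.
        if a != b and a not in used and b not in used:
            matching.append((a, b))
            used.update((a, b))
    return matching


def repeat_stats(pairs):
    """Return (repeat_number, repeat_rate) under the assumed definitions."""
    # Count each protein once per pair it takes part in.
    counts = Counter(p for pair in pairs for p in set(pair))
    if not counts:
        return 0, 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated, repeated / len(counts)


def negative_candidates(pairs):
    """List protein pairs whose adjacency-matrix entry is 0 (no known interaction)."""
    proteins = sorted({p for pair in pairs for p in pair})
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    adj = [[0] * n for _ in range(n)]
    for a, b in pairs:
        adj[index[a]][index[b]] = 1
        adj[index[b]][index[a]] = 1
    return [
        (proteins[i], proteins[j])
        for i in range(n)
        for j in range(i + 1, n)
        if adj[i][j] == 0
    ]


# Hypothetical toy interaction list, for illustration only.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
matching = greedy_matching(pairs)        # positive set without repeated proteins
negatives = negative_candidates(pairs)   # candidate non-interacting pairs
```

Comparing `repeat_stats(pairs)` with `repeat_stats(matching)` shows the contrast the abstract studies: the full interaction list contains repeated proteins, whereas the matched subset contains none. Negative sets with different repetitiveness could then be drawn from `negatives` by varying how often each protein is reused.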