Design of protein solubility prediction model based on deep neural network
Abstract:
Protein solubility is an important research topic in bioinformatics. By analyzing protein solubility data and combining feature extraction with deep learning, we designed several convolutional neural network models to predict protein solubility. CD-HIT was used to denoise the raw protein data, and G-gap was used to convert each sample into a tensor, yielding feature data suited to convolutional neural networks; these features are the input to one channel of the model. To improve prediction accuracy, the SCRATCH tool was used to extract 6-dimensional sequence features and 51-dimensional structural features for each sample as additional features, which are the input to the other channel. Based on the characteristics of the data, four network models were designed by adjusting and combining series and parallel arrangements of the convolutional layers. The network structure and parameters were determined through comparative experiments. The results show that the prediction model based on Deep Dual-channel Convolutional Neural Networks (DDcCNN) achieved the best overall performance, with accuracy, recall, precision, and Matthews Correlation Coefficient (MCC) of 76.31%, 65.31%, 75.05%, and 0.55, respectively. Comparison experiments against prediction models built on traditional deep neural networks, support vector machines, random forests, and decision trees, as well as against existing published results, demonstrated the effectiveness of the proposed design.
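As an illustration of the G-gap feature named in the abstract and keywords, the sketch below computes a g-gap dipeptide frequency vector for one sequence. It assumes the common definition (frequency of residue pairs separated by g intervening residues over the 20 standard amino acids); the abstract does not specify the exact tensor layout used in the paper, and the names `AMINO_ACIDS`, `PAIR_INDEX`, and `g_gap_dipeptide_frequency` are hypothetical.

```python
from itertools import product

# The 20 standard amino acids; non-standard symbols (e.g. X) are skipped below.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each ordered residue pair ("AA", "AC", ...) to a position in a 400-element vector.
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}


def g_gap_dipeptide_frequency(sequence, gap):
    """Return 400 frequencies of residue pairs separated by `gap` residues.

    gap=0 gives the ordinary (adjacent) dipeptide composition.
    Assumed definition; the paper's exact preprocessing may differ.
    """
    counts = [0] * len(PAIR_INDEX)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        idx = PAIR_INDEX.get(pair)
        if idx is not None:  # ignore pairs containing non-standard residues
            counts[idx] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = g_gap_dipeptide_frequency("MKTAYIAKQR", gap=1)
    print(len(vec), sum(vec))
```

A 400-element vector of this kind can be reshaped to 20 x 20, and vectors for several gap values can be stacked, which is one plausible way to obtain an image-like input for a convolutional channel.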
Authors:
Wang Xianfang (王鲜芳); Liu Yifeng (刘依锋); Du Zhiyong (杜志勇); Zhu Mingdong (朱命冬); Li Qimeng (李启萌)
Affiliations:
School of Computer Science and Technology, Henan Institute of Technology, Xinxiang 453003, China; School of Management, Henan Institute of Technology, Xinxiang 453003, China; School of Electrical Engineering and Automation, Henan Institute of Technology, Xinxiang 453003, China; School of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
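Source:
Journal of Henan Normal University (Natural Science Edition), indexed in CAS and the Peking University Core Journals list, 2021, No. 2, pp. 31-39 (9 pages)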
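Funding:
National Natural Science Foundation of China (62072157, 61802116); Natural Science Foundation of Henan Province (202300410102); Doctoral Start-up Project of Henan Institute of Technology (KQ2002).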
Keywords:
deep dual-channel convolutional neural networks; protein solubility; G-gap dipeptide frequency; prediction model
Classification code:
TP181 [Automation and Computer Technology: Control Theory and Control Engineering]