Title
Artificial Intelligence Approach for Protein Sequence Analysis
Author
Mohamed, Farida Alaaeldin Mostafa.
Preparation committee
Researcher / فريده علاءالدين مصطفى محمد
Supervisor / نجوى بدر
Supervisor / رشا إسماعيل
Supervisor / ياسمين عفيفي
Publication date
2022.
Number of pages
94 p.
Language
English
Degree
Master's
Specialization
Information Systems
Approval date
1/1/2022
Place of approval
Ain Shams University - Faculty of Computer and Information Sciences - Information Systems

Abstract

Protein sequence analysis supports the prediction of protein function. This thesis proposes new deep learning models that classify proteins from features extracted in either 1D or 3D form, and investigates how data variations affect deep learning-based protein sequence classification when 3D features are used.
For the 1D features, several protein descriptors were decomposed, using Empirical Mode Decomposition, into modified feature descriptors that had not previously been employed in protein studies. As a novel contribution, we use a Convolutional Neural Network to learn from these descriptors and classify proteins by disease class. A dataset of 1563 protein sequences was classified into three disease classes: AIDS, Tumor suppressor, and Proto-oncogene.
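To make the decomposition step concrete, the following is a minimal sketch of Empirical Mode Decomposition applied to a numeric vector. It is not the thesis implementation: the function names (`local_extrema`, `envelope`, `sift`, `emd`) are hypothetical, the envelopes are piecewise linear rather than the cubic splines of standard EMD, sifting runs for a fixed number of passes instead of using a stopping criterion, and the `descriptor` vector is synthetic data standing in for a real protein descriptor.

```python
import math


def local_extrema(x):
    """Return index lists (maxima, minima) of strict interior extrema."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i] > x[i - 1] and x[i] > x[i + 1]:
            maxima.append(i)
        elif x[i] < x[i - 1] and x[i] < x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(idx, x):
    """Piecewise-linear envelope through the points (i, x[i]) for i in idx.

    Simplification: standard EMD uses cubic splines. Values before the first
    and after the last extremum are held constant.
    """
    env = [0.0] * len(x)
    for i in range(len(x)):
        if i <= idx[0]:
            env[i] = x[idx[0]]
        elif i >= idx[-1]:
            env[i] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            env[i] = x[a] + t * (x[b] - x[a])
    return env


def sift(x, n_sift=10):
    """Extract one IMF candidate by repeatedly removing the envelope mean."""
    h = list(x)
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if not maxima or not minima:
            break
        upper = envelope(maxima, h)
        lower = envelope(minima, h)
        h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
    return h


def emd(x, max_imfs=5):
    """Split x into IMFs plus a residue, so that sum(IMFs) + residue == x."""
    residue = [float(v) for v in x]
    imfs = []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) + len(minima) < 3:
            break  # too few oscillations left to extract another IMF
        imf = sift(residue)
        imfs.append(imf)
        residue = [r - m for r, m in zip(residue, imf)]
    return imfs, residue


# Synthetic stand-in for a descriptor vector (assumption, not thesis data).
descriptor = [math.sin(0.3 * i) + 0.5 * math.sin(2.1 * i) for i in range(64)]
imfs, residue = emd(descriptor)
```

By construction, adding all returned IMFs to the residue reproduces the input vector, which is a convenient sanity check for any EMD implementation.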
The Convolutional Neural Network model using modified feature descriptors significantly outperformed a Support Vector Machine with an RBF kernel, by 23.3% in accuracy. The CTDT modified feature descriptors improved the deep learning model's results by 19.5%, 39.6%, 23.3%, 29.9%, 24.3%, and 31.2% in AUC, MCC, accuracy, F1-score, recall, and precision, respectively.
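For readers unfamiliar with these evaluation metrics, the sketch below computes accuracy, precision, recall, F1-score, and the multiclass Matthews correlation coefficient (MCC) from true and predicted labels. It is an illustration only: the abstract does not say how per-class scores were averaged, so macro averaging is an assumption here, the function name `classification_metrics` is hypothetical, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    precisions, recalls, f1s = [], [], []
    true_counts, pred_counts = [], []
    for k in labels:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        t_k = sum(t == k for t in y_true)  # samples that truly belong to k
        p_k = sum(p == k for p in y_pred)  # samples predicted as k
        prec = tp / p_k if p_k else 0.0
        rec = tp / t_k if t_k else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
        true_counts.append(t_k)
        pred_counts.append(p_k)

    # Multiclass generalisation of MCC, computed from class counts.
    cov = correct * n - sum(t * p for t, p in zip(true_counts, pred_counts))
    denom = math.sqrt(
        (n * n - sum(p * p for p in pred_counts))
        * (n * n - sum(t * t for t in true_counts))
    )
    return {
        "accuracy": correct / n,
        "precision": sum(precisions) / len(labels),  # macro average (assumed)
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "mcc": cov / denom if denom else 0.0,
    }


# Toy labels for three classes (illustrative values, not thesis results).
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```

MCC uses every cell of the confusion matrix, which is one reason it can move differently from accuracy when classes are imbalanced.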
For the 3D features, five feature extraction groups were used, as a novel step, to create 3D features in two sizes (7x7x7 and 9x9x9). Three datasets that differ in type, size, and class balance were used in the assessment: a Disease dataset and two Phage Virion Proteins (PVP) datasets.
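The abstract does not describe how the five feature groups are arranged into a cube, so the following is only a hypothetical sketch of one simple way to obtain a 7x7x7 or 9x9x9 input from a flat feature vector: truncate or zero-pad it to side³ values and fill the cube in row-major order. The function name `to_cube` and the padding scheme are assumptions made for illustration.

```python
def to_cube(features, side=7, pad_value=0.0):
    """Arrange a flat feature vector into a side x side x side nested list.

    Assumption: values are truncated or padded with pad_value to side**3
    entries and placed in row-major order. The thesis may use another layout.
    """
    size = side ** 3
    flat = list(features)[:size]
    flat += [pad_value] * (size - len(flat))
    return [
        [[flat[(i * side + j) * side + k] for k in range(side)]
         for j in range(side)]
        for i in range(side)
    ]


# 300 placeholder values padded to 343 entries for the 7x7x7 case,
# and 1000 placeholder values truncated to 729 for the 9x9x9 case.
cube7 = to_cube(range(300), side=7)
cube9 = to_cube(range(1000), side=9)
```

A cube of this shape can be passed to a 3D convolutional layer after adding batch and channel dimensions in the chosen deep learning framework.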
Results showed that the 7x7x7 feature matrix has a positive correlation between its dimensions, which had a positive impact on the results, reaching 71% on the PVP-Balanced dataset and 86% on the Disease dataset. Using the sum of the first three Intrinsic Mode Function components had a better impact than using only the first component, improving accuracy to 86.6% on the Disease dataset. Dataset size had a significant positive impact on training the Convolutional Neural Network model, which reached 84%.
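The difference between using only the first Intrinsic Mode Function and using the sum of the first three can be sketched as an element-wise sum over the selected components. The helper name `combine_imfs` and the toy component values are hypothetical; in practice the components would come from an Empirical Mode Decomposition of a descriptor vector.

```python
def combine_imfs(imfs, count=3):
    """Element-wise sum of the first `count` IMF components."""
    chosen = imfs[:count]
    if not chosen:
        raise ValueError("at least one IMF component is required")
    return [sum(values) for values in zip(*chosen)]


# Toy components of equal length (illustrative values only).
toy_imfs = [
    [1.0, -1.0, 1.0],
    [0.5, 0.5, -0.5],
    [0.1, 0.2, 0.3],
    [9.0, 9.0, 9.0],  # a later component, ignored when count=3
]
first_only = combine_imfs(toy_imfs, count=1)
first_three = combine_imfs(toy_imfs, count=3)
```

Summing several components keeps more of the original signal's variation than the first component alone, which is consistent with the improvement reported above, although the abstract itself does not give the reason.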