Abstract

Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for predicting a phenomenon from past observations, unraveling underlying patterns in data, and providing insights about the problem. Yet, care should be taken not to use machine learning as a black-box tool, but rather to consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results.

Viral evolution remains a major obstacle to the viability of antiviral drugs. The ability to anticipate this evolution would assist in the early detection of drug-resistant strains and may help make antiviral drugs a more effective treatment strategy. In recent years, a deep learning model called the seq2seq neural network has emerged and has been widely used in natural language processing. In this thesis, we borrow this approach to predict next-generation sequences using a seq2seq LSTM neural network, treating the sequences as text data. We use one-hot vectors to represent the sequences as input to the model; this representation preserves the positional information of each nucleotide in the sequences. Four RNA virus sequence datasets were used to evaluate the proposed model, which achieved encouraging results. These results illustrate the potential of LSTM neural networks for DNA and RNA sequences in solving other sequence-related problems in bioinformatics.

The first part of this work studies the induction of sequence-to-sequence models, motivating their design and purpose whenever possible.
Our contributions follow with an original complexity analysis of these models, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn.

In the second part of this work, we analyze and discuss biological data from the perspective of deep learning models. The core of our contributions lies in the representation and preprocessing of biological data for use with machine learning algorithms. Our analysis demonstrates that preparing sequencing data in a new way leads to optimal accuracy.

The proposed approach consists of four main phases. In the first phase, the sequences in the datasets are preprocessed. In the second phase, the preprocessed sequences are transformed into a format suitable for training an LSTM network. Here, a one-hot encoding of the integer values is used, in which each value is represented by a binary vector that is all 0s except at the index of the value (here, the nucleotide), which is set to 1.
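
The encoding step described above can be illustrated with a minimal sketch. It assumes a four-letter RNA alphabet (`ACGU`) and uses hypothetical helper names (`integer_encode`, `one_hot_encode`, `one_hot_decode`); the thesis's actual vocabulary and implementation may differ.

```python
# Assumed alphabet: the four RNA nucleotides. The thesis's actual vocabulary
# (e.g. handling of gaps or ambiguous bases) is not specified in the abstract.
ALPHABET = "ACGU"
CHAR_TO_INT = {ch: i for i, ch in enumerate(ALPHABET)}


def integer_encode(sequence):
    """Map each nucleotide to its integer index in ALPHABET."""
    return [CHAR_TO_INT[ch] for ch in sequence.upper()]


def one_hot_encode(sequence):
    """Return one binary vector per position: all 0s except a 1 at the nucleotide's index."""
    vectors = []
    for index in integer_encode(sequence):
        vector = [0] * len(ALPHABET)
        vector[index] = 1
        vectors.append(vector)
    return vectors


def one_hot_decode(vectors):
    """Invert the encoding by reading the position of the 1 in each vector."""
    return "".join(ALPHABET[vector.index(1)] for vector in vectors)


# Encode a short hypothetical sequence and print one vector per nucleotide.
encoded = one_hot_encode("GAUC")
for nucleotide, vector in zip("GAUC", encoded):
    print(nucleotide, vector)
```

Because the output keeps one vector per position, in order, the position of each nucleotide is preserved. The resulting list has one row per nucleotide and one column per alphabet symbol, which corresponds to the usual (timesteps, features) layout of LSTM inputs.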