Abstract Document classification has become an important task for organizing and managing data, owing to the large number of electronic text documents now available from a range of sources and containing unstructured and semi-structured information. This thesis proposes a document classification multimodel for categorizing semi-structured and unstructured textual materials. The goal of this multimodel approach is to make textual documents easier to manage and sort. A stacked-ensemble meta-model technique integrates the outputs of the individual classifiers, with the aim of achieving better results than any of the individual models. At the preprocessing level, tokenization and various text normalization techniques are applied. At the feature level, Term Frequency-Inverse Document Frequency (TF-IDF) and Continuous Bag-of-Words (CBOW) are used to generate hand-crafted feature vectors. At the classification level, the proposed multimodel uses the stacked ensemble technique to integrate three independent classifiers: Deep Neural Networks (DNN), Recurrent Convolutional Neural Networks (RCNN), and Bidirectional LSTM (Bi-LSTM). The proposed model is validated on a dataset constructed from a variety of sources, with a large number of documents in each class. The experimental results show that the proposed model classifies all three document types effectively. For PDF document classification, it achieved accuracy of up to 0.9045 with TF-IDF features and 0.959 with CBOW features. For JSON document classification, it achieved accuracy of up to 0.914 with TF-IDF features and 0.956 with CBOW features. For XML document classification, it achieved accuracy of up to 0.92 with TF-IDF features and 0.959 with CBOW features.
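The stacking step described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis implementation: the three base models here are toy keyword scorers standing in for the trained DNN, RCNN, and Bi-LSTM classifiers, the meta-learner is assumed to be a simple nearest-centroid rule (the abstract does not say which meta-model is used), and the class labels, keywords, and documents are all hypothetical. What it shows is the general shape of a stacked ensemble: each base model emits a class-probability vector, the vectors are concatenated into meta-features, and a second-level model is fitted on those meta-features using held-out documents.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Probs = List[float]
BaseModel = Callable[[str], Probs]

CLASSES = ["finance", "sports"]  # hypothetical labels, for illustration only


def keyword_model(keywords: Dict[str, Sequence[str]]) -> BaseModel:
    """Toy stand-in for a trained base classifier (e.g. DNN, RCNN, Bi-LSTM).

    Scores each class by keyword hits with add-one smoothing and
    normalizes the scores into a probability vector over CLASSES.
    """
    def predict(doc: str) -> Probs:
        counts = Counter(doc.lower().split())
        scores = [1.0 + sum(counts[k] for k in keywords[c]) for c in CLASSES]
        total = sum(scores)
        return [s / total for s in scores]
    return predict


def meta_features(doc: str, base_models: Sequence[BaseModel]) -> List[float]:
    """Second-level input: concatenated probability outputs of all base models."""
    feats: List[float] = []
    for model in base_models:
        feats.extend(model(doc))
    return feats


def fit_meta(docs: Sequence[str], labels: Sequence[str],
             base_models: Sequence[BaseModel]) -> Dict[str, List[float]]:
    """Assumed meta-learner: one centroid per class in meta-feature space."""
    sums: Dict[str, List[float]] = {}
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        feats = meta_features(doc, base_models)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def stacked_predict(doc: str, base_models: Sequence[BaseModel],
                    centroids: Dict[str, List[float]]) -> str:
    """Return the class whose centroid is closest to the document's meta-features."""
    feats = meta_features(doc, base_models)

    def sq_dist(label: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(feats, centroids[label]))

    return min(centroids, key=sq_dist)


# Three hypothetical base models, each looking at different keywords.
base_models = [
    keyword_model({"finance": ["bank", "loan"], "sports": ["goal", "match"]}),
    keyword_model({"finance": ["stock", "market"], "sports": ["team", "league"]}),
    keyword_model({"finance": ["interest", "bank"], "sports": ["coach", "goal"]}),
]

# Held-out documents used only to fit the meta-learner (invented examples).
holdout_docs = [
    "the bank raised the loan interest rate",
    "stock market rally lifts bank shares",
    "the team scored a late goal in the match",
    "the coach praised the league leaders",
]
holdout_labels = ["finance", "finance", "sports", "sports"]

centroids = fit_meta(holdout_docs, holdout_labels, base_models)
prediction = stacked_predict("bank loan and stock market news", base_models, centroids)
```

In a real pipeline the keyword scorers would be replaced by the trained networks operating on TF-IDF or CBOW feature vectors, and the meta-learner would typically be fitted on out-of-fold predictions so that it does not see outputs the base models produced on their own training data.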