Title
Design of A Smart System for Converting Text to Photo Realistic Images Using Artificial Intelligence Techniques /
Author
Bayoumi, Razan Hany Abd ElAziz.
Preparation Committee
Researcher / Razan Hany Abd ElAziz Bayoumi
Supervisor / Abdel-Badeeh Mohamed Salem
Examiner / Hany Mohamed Kamal Mahdy
Examiner / Taymoor Mohamed Nazmy
Publication Date
2022.
Number of Pages
110 p. :
Language
English
Degree
Master's
Specialization
Computer Science (miscellaneous)
Date of Degree Approval
1/1/2022
Place of Degree Approval
Ain Shams University - Faculty of Computer and Information Sciences - Department of Computer Science
Table of Contents
Only 14 pages (of 110) are available for public view.

Abstract

Generating photo-realistic images from text, known as text-to-image synthesis, is a challenging problem in computer vision that has emerged in recent years. It can be seen as the reverse of image captioning: the aim is to generate realistic images from an input textual description. Converting textual features into pixels and understanding the relation between visual content and natural language is difficult because of the difference between the two latent spaces. Generating realistic images from input text has many useful applications, such as facilitating literacy, assisting communication tools for people with disabilities, and helping to generate prototypes, sketches, and architectural designs; realistic images of faces can also help Crime Scene Investigation (CSI) to recognize criminals.
Deep convolutional generative adversarial networks (DCGANs) are a rapidly evolving area that has driven a revolution in generative models. GANs are now widely used in many different frameworks to generate realistic images, both conditionally and unconditionally; they have become the main solution to this task and have shown promising results in synthesizing real-world images.
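As a rough illustration of the adversarial training referred to above, the following sketch pits a toy generator against a discriminator on flattened images. The layer sizes, learning rates, and use of fully connected layers (rather than the strided convolutions of a real DCGAN) are illustrative assumptions, not the configuration used in the thesis.

import torch
import torch.nn as nn

# Toy fully connected generator and discriminator; a real DCGAN uses
# strided and transposed convolutions instead of linear layers.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):
    # real: a (batch, 784) tensor of flattened images scaled to [-1, 1]
    batch = real.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: push real images toward 1, generated images toward 0.
    fake = G(torch.randn(batch, 100))
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the just-updated discriminator on the fakes.
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()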
One of the main goals of this thesis was to study recently proposed models and overcome the limitations of existing approaches; this was done by implementing an effective GAN architecture and training strategy that enables compelling text-to-image synthesis.
GANs are the main components on which this work is based. GANs are deep neural networks that show great potential in generating realistic images. The thesis presents a comparative study of GAN-based text-to-image generation models and introduces a comprehensive comparison and analysis of their architectures and results. The recently proposed work can be grouped into seven classes according to the idea and architecture of each model. The first class includes models that use the baseline architecture. The second class consists of multi-stage generation models. The third class contains models built on a single-stream generator with multiple discriminators. The fourth class's models use both I2T (image-to-text) and T2I (text-to-image) in the training process. In the fifth class, the generation process includes an intermediate layout layer. In the sixth class, the output images are generated based on both an input text and an input image. The seventh class covers other ideas and architectures.
To apply text-to-image synthesis to the faces category, a dedicated dataset had to be built first, owing to the shortage of suitable face datasets. A portion of the CelebA dataset images is used to generate English captions from their binary annotated attributes: several scripts with different word choices and sentence orders produce varied text descriptions that simulate human-written descriptions as closely as possible. The resulting dataset consists of 10,000 images, split into 8,000 images for the training phase and 2,000 images for the testing phase. Each image is annotated with one English text description.
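A minimal sketch of how such captions might be produced from binary attributes is shown below. The templates and the attribute-to-phrase mapping are hypothetical examples, not the actual scripts used in the thesis; only the CelebA attribute names themselves (e.g. Black_Hair, Smiling) are real annotations.

import random

# Hypothetical caption templates; the thesis uses several scripts with
# different words and sentence orders to vary the generated descriptions.
TEMPLATES = [
    "This person has {features}.",
    "A face with {features}.",
    "The person in the picture has {features}.",
]
# Hypothetical mapping from CelebA binary attributes to English phrases.
PHRASES = {
    "Black_Hair": "black hair",
    "Wavy_Hair": "wavy hair",
    "Smiling": "a smiling expression",
    "Eyeglasses": "eyeglasses",
}

def caption(attrs):
    # attrs: dict mapping CelebA attribute names to 0/1 flags
    feats = [PHRASES[a] for a, v in attrs.items() if v == 1 and a in PHRASES]
    random.shuffle(feats)  # vary word order to imitate human-written variation
    if len(feats) > 1:
        feats[-1] = "and " + feats[-1]
    return random.choice(TEMPLATES).format(features=", ".join(feats))

print(caption({"Black_Hair": 1, "Smiling": 1, "Eyeglasses": 0, "Wavy_Hair": 1}))
# e.g. "A face with wavy hair, black hair, and a smiling expression."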
This thesis also includes designing a system that generates images as realistic as possible from an input text description. Two GAN-based models are proposed. The first, AttnDM-GAN (Attentional Dynamic Memory Generative Adversarial Network), seeks to generate realistic output that is semantically harmonious with the input text description. AttnDM-GAN is a three-stage hybrid of the Attentional Generative Adversarial Network (AttnGAN) and the Dynamic Memory Generative Adversarial Network (DM-GAN). The first stage, Initial Image Generation, generates low-resolution 64x64 images conditioned on the encoded input textual description. The second stage, Attention Image Generation, generates higher-resolution 128x128 images, and the last stage, Dynamic Memory Based Image Refinement, refines the images to 256x256 resolution. The second proposed model, DMAttn-GAN (Dynamic Memory Attention Generative Adversarial Network), is a variation of AttnDM-GAN in which the second and third stages are swapped. On the Caltech-UCSD Birds 200 (CUB) dataset, the models are evaluated using the Frechet Inception Distance (FID): AttnDM-GAN scores 19.78 and DMAttn-GAN scores 17.04. Another experiment applies the two models to the faces category on the CelebAText-HD dataset, where DMAttn-GAN and AttnDM-GAN score 52.384 and 51.195 respectively.
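For reference, FID compares Gaussians fitted to Inception features of real and generated images (lower is better). The snippet below is a minimal sketch of the standard formula using NumPy and SciPy; it assumes the feature means and covariances have already been extracted, which is not shown here.

import numpy as np
from scipy import linalg

def fid(mu_r, sigma_r, mu_g, sigma_g):
    # FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^(1/2)),
    # between Gaussians fitted to real (r) and generated (g) features.
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))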