[Data Science] 4. Generative AI For Computer Vision
Link
https://app.datascientist.fr/learn/learning/57/62/194/844
Generative AI
- Generative AI enables users to quickly generate new content based on a variety of inputs. Inputs and outputs to these models can include text, images, sounds, animation, 3D models, or other types of data.
- Generative AI models use neural networks to identify the patterns and structures within existing data to generate new and original content.
- One of the breakthroughs with generative AI models is the ability to leverage different learning approaches, including unsupervised or semi-supervised learning, for training. This lets organizations use large amounts of unlabeled data more easily and quickly to create foundation models. As the name suggests, foundation models can be used as a base for AI systems that perform multiple tasks.
- Types
+ Language : Marketing (content), Note Taking, Gene Sequencing, Code Development, Essay Generation
+ Visual : Video Generation, 3D Models, Design, Image Generation
+ Auditory : Music Generation, Voice Generation
Generative Models
- Generative models aim to model the underlying distribution of the input data. They learn the joint probability distribution of the input features and the corresponding labels. Once trained, generative models can generate new samples that resemble the original data distribution.
- Generative models can be used for tasks such as generating new data, data augmentation, and handling missing or incomplete data. They can also be used for classification by estimating the class posterior probability given the input features using Bayes' theorem.
- Examples
+ Gaussian Mixture Models(GMM)
+ Hidden Markov Models(HMMs)
+ Variational Autoencoders(VAEs)
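- Sketch (not from the course material): a minimal generative classifier, assuming a toy 1-D dataset and one Gaussian per class. It fits the joint distribution as p(y) * p(x | y), classifies with Bayes' theorem, and draws new samples. The data values and the names `params`, `posterior`, and `sample` are illustrative assumptions.
```python
import math
import random
import statistics

random.seed(0)

# Toy 1-D training data: one feature per sample, two classes (hypothetical values).
data = {
    "cat": [random.gauss(4.0, 1.0) for _ in range(200)],
    "dog": [random.gauss(8.0, 1.5) for _ in range(200)],
}

# Fit the joint distribution p(x, y) = p(y) * p(x | y) with one Gaussian per class.
total = sum(len(xs) for xs in data.values())
params = {
    label: {
        "prior": len(xs) / total,
        "mean": statistics.fmean(xs),
        "std": statistics.stdev(xs),
    }
    for label, xs in data.items()
}


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x):
    """Bayes' theorem: p(y | x) = p(y) p(x | y) / sum over y' of p(y') p(x | y')."""
    joint = {
        label: p["prior"] * gaussian_pdf(x, p["mean"], p["std"])
        for label, p in params.items()
    }
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


def sample(label):
    """Generate a new synthetic feature value from the fitted class distribution."""
    p = params[label]
    return random.gauss(p["mean"], p["std"])


class_probabilities = posterior(4.5)  # class posterior for one feature value
synthetic_dog = sample("dog")         # new value drawn from the fitted "dog" Gaussian
```
- The same fitted model serves both uses listed above: `posterior` for classification and `sample` for data generation.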
Discriminative Models
- Discriminative models focus on learning the boundary or decision surface between different classes or categories in the input data. They learn the conditional probability distribution of the labels given the input features.
- Discriminative models aim to directly model the decision boundary without explicitly modeling the underlying distribution of the input features.
- Discriminative models are primarily used for classification tasks, where the goal is to assign labels to new, unseen instances based on their features.
- Examples
+ Logistic Regression
+ Support Vector Machines(SVMs)
+ Neural Networks (specifically when used for classification tasks like image recognition or sentiment analysis).
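- Sketch (not from the course material): a minimal logistic regression trained with plain gradient descent, assuming a toy 1-D dataset. It models p(y = 1 | x) directly and never models the distribution of x. The data, learning rate, and iteration count are illustrative assumptions.
```python
import math
import random

random.seed(0)

# Toy 1-D data (hypothetical): label 0 clusters near -2.0, label 1 near +2.0.
xs = [random.gauss(-2.0, 1.0) for _ in range(100)] + [random.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Model the conditional p(y = 1 | x) = sigmoid(w * x + b); p(x) is not modeled.
w, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        error = sigmoid(w * x + b) - y  # gradient of the log loss w.r.t. the logit
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(xs)
    b -= learning_rate * grad_b / len(xs)


def predict_proba(x):
    return sigmoid(w * x + b)


decision_boundary = -b / w  # the x at which p(y = 1 | x) = 0.5
```
- Unlike a generative model, this classifier cannot produce new samples; it only learns where the boundary between the two classes lies.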
Generative vs Discriminative
- It depends on the specific problem at hand and the available data.
- Generative models are useful when understanding and modeling the data distribution is crucial, and they can be used for various tasks beyond classification.
+ ex) Estimate the underlying distribution of a dataset (pictures of cats) and randomly generate realistic, yet synthetic, samples according to the estimated distribution
- Discriminative models, on the other hand, are often preferred for classification tasks when the primary objective is accuracy.
+ ex) Classify pictures into two classes, cats or dogs
Generative Adversarial Network(GAN)
- Real images -> Sample -> Discriminator -> Discriminator Loss
- Random input -> Generator -> Sample -> Discriminator -> Generator Loss
- Add randomness to generated picture -> Update to fool discriminator better -> Update to judge real/fake better
- Noise(Randn) -> Generator blocks(Linear, Batch Norm, ReLU) -> Linear, Sigmoid -> Generated sample -> Discriminator blocks(Linear, Leaky ReLU) -> Prediction
- Generator(Learning)
+ Learns to generate plausible data, make fakes that look real
+ The generated instances become negative training examples for the discriminator
+ P(Features)
+ Fake -> 1 (the generator is updated so that the discriminator labels its fakes as real)
- Discriminator(A classifier)
+ Learns to distinguish the generator's fake data from real data
+ The discriminator penalizes the generator for producing implausible results
+ P(Class | Features)
+ Real -> 1, Fake -> 0
- As training progresses, the generator gets closer to producing output that can fool the discriminator. Finally, if generator training goes well, the discriminator gets worse at telling the difference between real and fake. It starts to classify fake data as real, and its accuracy decreases.
- Both the generator and the discriminator are neural networks. The generator output is connected directly to the discriminator input. Through backpropagation, the discriminator's classification provides a signal that the generator uses to update its weights.
- Image-to-Image
- Head Models
- Image Synthesis
- Text-to-Image
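- Sketch (not from the course material): the two GAN losses written as binary cross-entropy over discriminator outputs, using the label convention above (discriminator: Real -> 1, Fake -> 0; generator: Fake -> 1). Averaging the real and fake halves of the discriminator loss is an assumed convention, and all probabilities below are made-up values rather than outputs of a trained network.
```python
import math


def bce(prediction, target):
    """Binary cross-entropy for one predicted probability in (0, 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    # The discriminator wants Real -> 1 and Fake -> 0.
    real_loss = sum(bce(p, 1) for p in d_real) / len(d_real)
    fake_loss = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return (real_loss + fake_loss) / 2


def generator_loss(d_fake):
    # The generator wants the discriminator to output 1 on its fakes.
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs early in training (fakes are easy to spot) ...
early_real, early_fake = [0.9, 0.8, 0.95], [0.1, 0.2, 0.05]
# ... and later, when the fakes are close to indistinguishable from real data.
late_real, late_fake = [0.55, 0.5, 0.6], [0.5, 0.45, 0.55]

early_d, late_d = discriminator_loss(early_real, early_fake), discriminator_loss(late_real, late_fake)
early_g, late_g = generator_loss(early_fake), generator_loss(late_fake)
```
- With these numbers the generator loss drops and the discriminator loss rises between the two stages, which matches the training dynamics described above.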
Diffusion Model
- Known as denoising diffusion probabilistic models(DDPMs)
- Diffusion models are generative models that determine vectors in latent space through a two-step process during training.
- The two steps are forward diffusion and reverse diffusion. The forward diffusion process slowly adds random noise to the training data, while the reverse process gradually removes the noise to reconstruct the data samples.
- Novel data can be generated by running the reverse denoising process starting from entirely random noise.
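- Sketch (not from the course material): the forward diffusion step only, using the closed form x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise from the DDPM formulation. The linear beta schedule (1e-4 to 0.02 over 1000 steps) and the four-value toy "image" are assumptions; the reverse process needs a trained denoising network and is not shown.
```python
import math
import random

random.seed(0)

T = 1000
# Linear noise schedule (assumed values): beta grows from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# a_bar_t is the running product of (1 - beta) up to step t.
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def forward_diffuse(x0, t):
    """Sample x_t directly from x_0: sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alpha_bars[t]
    noise = [random.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, noise)]
    return xt, noise


x0 = [1.0, -0.5, 0.25, 0.0]             # toy "image" of four pixel values
x_early, _ = forward_diffuse(x0, 10)     # still close to the original data
x_late, _ = forward_diffuse(x0, T - 1)   # almost pure Gaussian noise
```
- Because a_bar_t shrinks toward 0 as t grows, the signal term fades and the last step is nearly pure noise, which is why generation can start from random noise.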
Transformer
- Transformers were introduced for text and are now also applied to images and video; in many vision tasks they replace or complement CNNs.
- Input("Thinking", "Machines") -> 2 Encoders -> 2 Decoders
+ Encoder
* Self-Attention -> Add & Normalize -> Feed Forward -> Add & Normalize
+ Decoder
* Self-Attention -> Add & Normalize -> Encoder-Decoder Attention -> Add & Normalize -> Feed Forward -> Add & Normalize -> Linear -> Softmax
- The attention-based architecture of Transformer enables it to capture long-range relationships between different parts of an image.
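- Sketch (not from the course material): the "Add & Normalize" step that follows each sub-layer, assuming it means a residual connection followed by layer normalization. The learnable scale and shift of layer normalization are omitted, and the vectors are made-up values.
```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def add_and_normalize(sublayer_input, sublayer_output):
    # "Add": residual connection around the sub-layer (attention or feed forward).
    residual = [x + y for x, y in zip(sublayer_input, sublayer_output)]
    # "Normalize": layer normalization of the sum.
    return layer_norm(residual)


x = [1.0, 2.0, 3.0, 4.0]               # token representation entering the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # hypothetical sub-layer output
normalized = add_and_normalize(x, sublayer_out)
```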
Transformer in Computer Vision
- Images are divided into small regions known as "patches"
- The patches are then flattened and converted into sequences of vectors.
- These sequences of vectors pass through the attention layers of the Transformer model.
- The output of the transformer is used to perform the specific task, such as classification, detection, or segmentation.
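- Sketch (not from the course material): splitting an image into patches and flattening each patch into a vector, assuming the image is a nested list of shape height x width x channels. In an actual vision Transformer each flattened patch is typically also linearly projected and given a position embedding, which is not shown. The function name `image_to_patches` is hypothetical.
```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into a sequence of flattened patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channel values
            patches.append(patch)
    return patches


# Toy 4x4 image with 3 channels; each pixel stores [row, col, 0] for easy inspection.
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
# 4 patches, each a vector of 2 * 2 * 3 = 12 values
```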
Vision Transformer(ViT)
- The Vision Transformer (ViT) model is primarily used for image classification tasks.
- The ViT model can learn to recognize patterns and features in images.
- The ViT model can also be applied to other computer vision tasks such as object detection and segmentation.
Detection Transformer(DETR)
- Computer vision model used for object detection in images
- The encoder takes an image as input and generates a comprehensive representation
- The decoder then takes this representation and generates predictions for each detected object, including their class and position in the image.
Recurrent Neural Network(RNN)
- Weights
+ u : Weights to modify the input to the hidden state
+ v : Weights to modify the hidden state to the output
+ w : Weights to modify the current hidden state to the next
- Hidden state
+ The memory of the network
+ Calculated from the current input and the hidden state of the previous time step, using a non-linear activation function such as tanh or ReLU
- It is challenging for recurrent neural networks(RNNs) to memorize information over long sequences.
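- Sketch (not from the course material): one recurrent step using the three weight matrices above, assuming h_t = tanh(U x_t + W h_{t-1}) and y_t = V h_t with bias terms omitted. All weight values and sizes (input 2, hidden 3, output 1) are made up for illustration.
```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, U, W, V):
    """h_t = tanh(U x_t + W h_{t-1});  y_t = V h_t."""
    pre_activation = [a + b for a, b in zip(matvec(U, x), matvec(W, h_prev))]
    h = [math.tanh(p) for p in pre_activation]
    y = matvec(V, h)
    return h, y


U = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]                      # input -> hidden
W = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]       # hidden -> next hidden
V = [[1.0, -1.0, 0.5]]                                          # hidden -> output

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.0, 0.0, 0.0]  # initial hidden state (the network's "memory")
outputs = []
for x in sequence:
    h, y = rnn_step(x, h, U, W, V)  # the same weights are reused at every time step
    outputs.append(y)
```
- Each new hidden state passes through `W` and tanh again, so information from early inputs is repeatedly squashed, which is one way to see why long sequences are hard to memorize.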
Encoder-Decoder
- Encoder(RNN)
+ Reads the input sequence and summarizes the information
- Decoder(RNN)
+ Starts generating the output sequence, and these outputs are also taken into account for future outputs
- Because the encoder compresses the whole input into a single summary, it can lead to issues with long-range dependencies.
- It can store more information than recurrent neural networks(RNNs).
Attention
- A mechanism used in machine learning and deep learning models to focus on specific parts of the input or context while performing a task.
- It allows the model to selectively attend to different parts of the input sequence, giving more importance or "attention" to certain elements.
- The idea behind the attention mechanism was to allow the decoder to flexibly use the most relevant parts of the input sequence.
- It computes a weighted combination of all the encoded input vectors, with the most relevant vectors being assigned the highest weights.
- Attention provides a better measure of long-term dependencies and enables models to more efficiently memorize relevant information.
- Scaled Dot-Product Attention
+ Attention used in Transformer
+ Input vectors are projected by linear layers into three different representations
* Query
* Key
* Value
+ The attention map (attention weights) is computed from the query and key. The dot product between the query and key is calculated. The results are scaled. The Softmax function transforms the scaled results into probabilities. The attention weights are multiplied by the value to obtain a new vector.
* Query 4x3 * Key^T 3x4 -> Score 4x4 -> Scaling -> Scaled Score -> Softmax -> Scores as probabilities (Attention Map, Attention Weights) -> Attention Weights 4x4 * Value 4x3 -> Output of the attention layer 4x3
* Attention(Q, K, V) = Softmax(QK^T/√d_k)V
- Self Attention and Cross Attention
+ The Self Attention
* Used in encoder and decoder.
* Q, K and V are derived from the same input sequence.
+ The Cross Attention
* Used in decoder.
* Q is derived from input sequence A.
* K and V are derived from the same input sequence B.
- Multi-head Attention
+ Q, K, V -> Linear -> Scaled Dot-Product Attention(h) -> Concat -> Linear
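- Sketch (not from the course material): Attention(Q, K, V) = Softmax(QK^T/√d_k)V written with plain lists, using 4 tokens of dimension 3 to match the 4x3 shapes above. The learned linear projections that produce Q, K, and V are omitted (inputs are used directly), a single head is shown, and all values are made up.
```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    largest = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                        # (n_q x d_k) @ (d_k x n_k)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]              # attention map, rows sum to 1
    return matmul(weights, V), weights


X = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Self-attention: Q, K and V all come from the same sequence X.
output, weights = scaled_dot_product_attention(X, X, X)

# Cross-attention: Q comes from sequence A; K and V come from sequence B (here X).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cross_output, cross_weights = scaled_dot_product_attention(A, X, X)
```
- The output keeps one row per query: 4x3 for self-attention over `X`, and 2x3 for cross-attention because `A` has two rows.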
Hugging Face
- A popular open-source library and platform that provides a wide range of natural language processing (NLP) models, tools, and resources. It is widely used in the machine learning and NLP community for its extensive collection of pre-trained models and easy-to-use interfaces.
- Hugging Face's pre-trained models are capable of excelling in various tasks without requiring additional fine-tuning
- And with additional training on small targeted datasets, it is likely that you can adapt these models to any specific situation.
- Base model -> Very large dataset, Heavy computation, Days of training -> Pre-trained model -> Training possible on one GPU, Small dataset, Easy reproducibility -> Fine-tuned model
- Pipeline
+ 1. Create a pipeline() and specify the inference task
```python
from transformers import pipeline
classifier = pipeline("image-classification", model="my-awesome-food-model")
```
+ 2. Pass your input image to the pipeline() function
```python
classifier(image)  # image: a PIL.Image, a local file path, or a URL
```
- PreTrained Model
+ 1. Pre Processing
* Load the Pretrained Model for Pre Processing
```python
from transformers import AutoImageProcessor
image_processor = AutoImageProcessor.from_pretrained("my-awesome-food-model")
```
* Prepare the inputs
```python
from PIL import Image

image_path = "image.jpg"
image = Image.open(image_path)
# Resize/normalize the image and return PyTorch tensors
inputs = image_processor(image, return_tensors="pt")
```
+ 2. PreTrained Transformer Model
* Load the PreTrained Model
```python
from transformers import AutoModelForImageClassification
model = AutoModelForImageClassification.from_pretrained("my-awesome-food-model")
```
* Generate output from input
```python
logits = model(**inputs).logits  # raw, unnormalized class scores
```
+ 3. Post Processing
* Take the index of the highest logit and map it to its label to obtain the predicted label
```python
predicted_label = logits.argmax(-1).item()  # index of the highest-scoring class
model.config.id2label[predicted_label]  # id2label is a dict, so index it
# 'beignets'
```
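- Sketch (not from the course material): turning logits into probabilities and a ranked top-k list, in plain Python so it runs without downloading a model. The `logits` and `id2label` values below are hypothetical stand-ins for `model(**inputs).logits[0].tolist()` and `model.config.id2label`, and `top_k` is a hypothetical helper, not part of the library.
```python
import math


def softmax(logits):
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical stand-ins for the model outputs and label mapping.
logits = [0.2, 3.1, -1.0, 1.4]
id2label = {0: "apple_pie", 1: "beignets", 2: "bibimbap", 3: "churros"}

predictions = top_k(logits, id2label, k=2)
```
- The first entry corresponds to the `argmax` label from the post-processing step; the probabilities make it easier to judge how confident the prediction is.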
Q&A
- AI
- CNN
- Forward
- Backpropagation
- Generative AI
- Stable Diffusion
- ControlNet
- UNet
- Encoder
- Decoder
- LoRA
- Dreambooth
- Embedding
- Gradient
- Loss
- Activation function