Learning the Official BLIP-2 Library
Overview
The BLIP-2 model was proposed in "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 outperforms Flamingo, an 80-billion-parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.

The abstract from the paper is as follows:
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capability of zero-shot image-to-text generation that can follow natural language instructions.
Blip2Config
( vision_config = None, qformer_config = None, text_config = None, num_query_tokens = 32, **kwargs )
 
- vision_config (dict, optional) – Dictionary of configuration options used to initialize Blip2VisionConfig.
- qformer_config (dict, optional) – Dictionary of configuration options used to initialize Blip2QFormerConfig.
- text_config (dict, optional) – Dictionary of configuration options used to initialize the language model configuration.
- num_query_tokens (int, optional, defaults to 32) – The number of query tokens passed through the Transformer.
- kwargs (optional) – Dictionary of keyword arguments.
 
Blip2Config is the configuration class to store the configuration of a Blip2ForConditionalGeneration. It is used to instantiate a BLIP-2 model according to the specified arguments, defining the vision model, Q-Former model and language model configs.
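For reference, a minimal sketch of creating and printing a default configuration with the Blip2Config class shown above:
from transformers import Blip2Config
# Instantiate a BLIP-2 configuration with default (Salesforce/blip2-opt-2.7b style) settings
configuration = Blip2Config()
# Printing the configuration shows the top-level options and the sub-config model types
print(configuration)
Output: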
Blip2Config {
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "model_type": "blip-2",
  "num_query_tokens": 32,
  "qformer_config": {
    "model_type": "blip_2_qformer"
  },
  "text_config": {
    "model_type": "opt"
  },
  "transformers_version": "4.35.2",
  "use_decoder_only_language_model": true,
  "vision_config": {
    "model_type": "blip_2_vision_model"
  }
}
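A Blip2Config can also be assembled from separate sub-configurations. The sketch below assumes the Blip2Config.from_vision_qformer_text_configs helper and an OPTConfig for the frozen language model, matching the "opt" text_config shown above:
from transformers import Blip2VisionConfig, Blip2QFormerConfig, OPTConfig, Blip2Config
# Build each sub-configuration with default values
vision_config = Blip2VisionConfig()
qformer_config = Blip2QFormerConfig()
text_config = OPTConfig()
# Combine them into a single BLIP-2 configuration
config = Blip2Config.from_vision_qformer_text_configs(vision_config, qformer_config, text_config)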
 
Blip2VisionConfig
( hidden_size = 1408, intermediate_size = 6144, num_hidden_layers = 39, num_attention_heads = 16, image_size = 224, patch_size = 14, hidden_act = 'gelu', layer_norm_eps = 1e-06, attention_dropout = 0.0, initializer_range = 1e-10, qkv_bias = True, **kwargs )
 
- hidden_size (int, optional, defaults to 1408) – Dimensionality of the encoder layers and the pooler layer.
- intermediate_size (int, optional, defaults to 6144) – Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
- num_hidden_layers (int, optional, defaults to 39) – Number of hidden layers in the Transformer encoder.
 
This is the configuration class to store the configuration of a Blip2VisionModel.
from transformers import Blip2VisionConfig, Blip2VisionModel
# Initializing a Blip2VisionConfig with Salesforce/blip2-opt-2.7b style configuration
configuration = Blip2VisionConfig()
# Initializing a Blip2VisionModel (with random weights) from the Salesforce/blip2-opt-2.7b style configuration
model = Blip2VisionModel(configuration)
# Accessing the model configuration
configuration = model.config
# Print the configuration (shown below)
print(configuration)
 
Output:
Blip2VisionConfig {
  "attention_dropout": 0.0,
  "hidden_act": "gelu",
  "hidden_size": 1408,
  "image_size": 224,
  "initializer_range": 1e-10,
  "intermediate_size": 6144,
  "layer_norm_eps": 1e-06,
  "model_type": "blip_2_vision_model",
  "num_attention_heads": 16,
  "num_hidden_layers": 39,
  "patch_size": 14,
  "qkv_bias": true,
  "transformers_version": "4.35.2"
}
 
Blip2QFormerConfig
( vocab_size = 30522, hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size = 3072, hidden_act = 'gelu', hidden_dropout_prob = 0.1, attention_probs_dropout_prob = 0.1, max_position_embeddings = 512, initializer_range = 0.02, layer_norm_eps = 1e-12, pad_token_id = 0, position_embedding_type = 'absolute', cross_attention_frequency = 2, encoder_hidden_size = 1408, **kwargs )
 
- vocab_size (int, optional, defaults to 30522) – Vocabulary size of the Q-Former model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling the model.
- num_hidden_layers (int, optional, defaults to 12) – Number of hidden layers in the Transformer encoder.
 
from transformers import Blip2QFormerConfig, Blip2QFormerModel
# Initializing a BLIP-2 Salesforce/blip2-opt-2.7b style configuration
configuration = Blip2QFormerConfig()
# Initializing a model (with random weights) from the Salesforce/blip2-opt-2.7b style configuration
model = Blip2QFormerModel(configuration)
# Accessing the model configuration
configuration = model.config
# Print the configuration (shown below)
print(configuration)
 
Output:
Blip2QFormerConfig {
  "attention_probs_dropout_prob": 0.1,
  "cross_attention_frequency": 2,
  "encoder_hidden_size": 1408,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "blip_2_qformer",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "vocab_size": 30522
}
 
Blip2Model
outputs = model(**inputs)
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2Model
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the processor and the full BLIP-2 model (frozen vision encoder + Q-Former + OPT language model)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
model.to(device)
# Download a sample COCO image and build a VQA-style prompt
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Question: how many cats are there? Answer:"
# Preprocess the image-text pair and run a forward pass
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
outputs = model(**inputs)
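The forward pass above returns raw model outputs. For actual image-to-text generation (captioning or VQA), the conditional-generation variant is typically used instead. A minimal sketch, assuming the Blip2ForConditionalGeneration class with its generate() method and processor.batch_decode for decoding:
from PIL import Image
import requests
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
# Generate an answer conditioned on the image and the prompt, then decode it to text
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)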
 
get_text_features
text_features = model.get_text_features(**inputs)
import torch
from transformers import AutoTokenizer, Blip2Model
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/blip2-opt-2.7b")
# Tokenize the text and extract text features from the (frozen) language model
inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt")
text_features = model.get_text_features(**inputs)
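For the OPT-based (decoder-only) checkpoint used here, get_text_features is expected to return the frozen language model's output object. A hypothetical inspection, assuming that object exposes per-token logits:
# Assumption: the returned object exposes .logits over the language model's vocabulary
print(type(text_features).__name__)
print(text_features.logits.shape)  # expected: (batch_size, sequence_length, vocab_size)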
 
get_image_features
image_outputs = model.get_image_features(**inputs)
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2Model
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Preprocess the image and extract features from the frozen vision encoder
inputs = processor(images=image, return_tensors="pt")
image_outputs = model.get_image_features(**inputs)
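get_image_features returns the output of the frozen vision encoder. A hypothetical shape check, assuming a pooling-style output with last_hidden_state and pooler_output fields and the hidden_size of 1408 from Blip2VisionConfig above:
# Assumption: patch-level hidden states plus a pooled embedding of width 1408
print(image_outputs.last_hidden_state.shape)  # expected: (batch_size, num_patches + 1, 1408)
print(image_outputs.pooler_output.shape)      # expected: (batch_size, 1408)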
 
get_qformer_features
qformer_outputs = model.get_qformer_features(**inputs)
import torch
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2Model
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Preprocess the image and extract the Q-Former's query outputs
inputs = processor(images=image, return_tensors="pt")
qformer_outputs = model.get_qformer_features(**inputs)
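get_qformer_features returns the Q-Former's query outputs, i.e., the learned query tokens after attending to the frozen image features. A hypothetical shape check, assuming the defaults above (num_query_tokens = 32, Q-Former hidden_size = 768) and a last_hidden_state field on the output:
# Assumption: one 768-dim vector per learned query token
print(qformer_outputs.last_hidden_state.shape)  # expected: (batch_size, 32, 768)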