When the Transformer model was first released, it was mainly used for NLP tasks such as machine translation. As the attention mechanism grew in popularity, many derivative models built on the Transformer were released, and the Google team showed that the Transformer's attention mechanism can also be applied to computer vision tasks; in particular, the release of the Swin Transformer model firmly brought the Transformer into the field of computer vision. In previous articles we also introduced other Transformer-based models for computer vision tasks. The DETR model can be used not only for object detection but also for object segmentation, and in this article we will implement the DETR object detection model based on the Transformer.
from PIL import Image
import requests
import matplotlib.pyplot as plt
import torch
from torch import nn
from torchvision.models import resnet50
import torchvision.transforms as T

torch.set_grad_enabled(False)
The first step in implementing the Transformer-based object detection algorithm is to import the Python third-party libraries, the most important of which is torch. Make sure you have successfully installed the torch library before running the code in this article.
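If you are not sure whether the environment is ready, a quick sanity check (my own suggestion, not part of the original tutorial) is to print the installed versions:

# Optional sanity check: confirm torch and torchvision import correctly
import torch
import torchvision
print(torch.__version__, torchvision.__version__)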
classes = [
    'n/a', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'n/a',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'n/a',
    'backpack', 'umbrella', 'n/a', 'n/a', 'handbag', 'tie', 'suitcase',
    'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat',
    'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle',
    'n/a', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana',
    'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'n/a',
    'dining table', 'n/a', 'n/a', 'toilet', 'n/a', 'tv', 'laptop', 'mouse',
    'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
    'sink', 'refrigerator', 'n/a', 'book', 'clock', 'vase', 'scissors',
    'teddy bear', 'hair drier', 'toothbrush'
]

colors = [[0.000, 0.447, 0.741], [0.850, 0.325, 0.098], [0.929, 0.694, 0.125],
          [0.494, 0.184, 0.556], [0.466, 0.674, 0.188], [0.301, 0.745, 0.933]]
Here we create two lists: one stores all the object labels (the COCO categories) that the DETR model can detect, and the other holds color values that make our later visualization easier.
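Note that the 'n/a' entries are not real categories: they are placeholders that keep each label at the list index matching its COCO category ID, since the COCO ID space has gaps. A quick illustration (assuming the lists above are defined):

# COCO category IDs have gaps, so 'n/a' placeholders keep indices aligned
print(classes[1])   # 'person'
print(classes[12])  # 'n/a' -- an unused COCO category ID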
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

def box_cxcywh_to_xyxy(x):
    # convert boxes from (center_x, center_y, width, height) to (x0, y0, x1, y1)
    x_c, y_c, w, h = x.unbind(1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
         (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=1)

def rescale_bboxes(out_bbox, size):
    # rescale normalized [0, 1] boxes to absolute image coordinates
    img_w, img_h = size
    b = box_cxcywh_to_xyxy(out_bbox)
    b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    return b
Here we define the image preprocessing pipeline and two helper functions for converting and rescaling the predicted boxes, which makes labeling objects and visualizing results easier later on. Once this initialization is complete, we can build our DETR model.
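To make the coordinate conversion concrete, here is a small worked example (my own illustration, not from the original code): a normalized box centered at (0.5, 0.5) with width 0.4 and height 0.2, rescaled to an 800 x 600 image:

# Example: convert one normalized (cx, cy, w, h) box and rescale it
dummy_box = torch.tensor([[0.5, 0.5, 0.4, 0.2]])
print(box_cxcywh_to_xyxy(dummy_box))          # tensor([[0.3000, 0.4000, 0.7000, 0.6000]])
print(rescale_bboxes(dummy_box, (800, 600)))  # tensor([[240., 240., 560., 360.]])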
According to the framework diagram of the DETR model, there are two key components: a CNN convolutional neural network layer (the backbone), and the encoder and decoder of the Transformer model.
class detr_model(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        # CNN convolutional neural network layer (ResNet-50 without its classification head)
        self.backbone = resnet50()
        del self.backbone.fc
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # Transformer layer
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # class and box prediction layers
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))
        # spatial positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # ResNet-50 CNN convolutional neural network
        x = self.backbone.conv1(inputs)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)
        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        # convert from 2048 to 256 feature channels
        h = self.conv(x)
        # positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Transformer layer
        h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)
        # the final outputs are class labels and bounding boxes
        return {'pred_logits': self.linear_class(h),
                'pred_boxes': self.linear_bbox(h).sigmoid()}
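Before loading any pretrained weights, it can help to verify the output shapes with a randomly initialized model and a dummy image (a quick sketch I added; the input size is arbitrary as long as the feature map stays within the 50 x 50 positional-embedding limit):

# Shape check: 100 object queries, 92 class logits (91 classes + no-object), 4 box coords
model = detr_model(num_classes=91)
with torch.no_grad():
    out = model(torch.rand(1, 3, 800, 800))
print(out['pred_logits'].shape)  # torch.Size([1, 100, 92])
print(out['pred_boxes'].shape)   # torch.Size([1, 100, 4])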
After building the DETR model we can use it directly, but before doing so we need DETR's pretrained weights.
detr = detr_model(num_classes=91)
state_dict = torch.hub.load_state_dict_from_url(
    url='',  # URL of the pretrained DETR demo weights (left empty in the original)
    map_location='cpu', check_hash=True)
detr.load_state_dict(state_dict)
detr.eval()
Now that we have defined the DETR model, we use the torch.hub.load_state_dict_from_url function to download the pretrained DETR weights. Once the download is complete, we load the weights with load_state_dict and put the model into eval mode. After that we can use the DETR model.
def detect(im, model, transform):
    # preprocess the input image and add a batch dimension
    img = transform(im).unsqueeze(0)
    # the demo model only supports images up to 1600 x 1600 pixels
    assert img.shape[-2] <= 1600 and img.shape[-1] <= 1600, \
        'images up to a maximum of 1600 x 1600 pixels are supported'
    outputs = model(img)
    # compute class probabilities, dropping the no-object class
    probas = outputs['pred_logits'].softmax(-1)[0, :, :-1]
    # keep only predictions with confidence greater than 0.7
    keep = probas.max(-1).values > 0.7
    # rescale the normalized boxes to the original image size
    bboxes_scaled = rescale_bboxes(outputs['pred_boxes'][0, keep], im.size)
    return probas[keep], bboxes_scaled
Here we build a detect function so that we can run the model on an image for object detection. We then keep only the predictions whose confidence is greater than 0.7 and visualize their labels and boxes.
im = Image.open('11.jpg').convert('RGB')
scores, boxes = detect(im, detr, transform)

def plot_results(pil_img, prob, boxes):
    plt.figure(figsize=(16, 10))
    plt.imshow(pil_img)
    ax = plt.gca()
    for p, (xmin, ymin, xmax, ymax), c in zip(prob, boxes.tolist(), colors * 100):
        # draw the bounding box
        ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin,
                                   fill=False, color=c, linewidth=3))
        # label the box with the predicted class and its confidence
        cl = p.argmax()
        text = f'{classes[cl]}: {p[cl]:0.2f}'
        ax.text(xmin, ymin, text, fontsize=15,
                bbox=dict(facecolor='yellow', alpha=0.5))
    plt.axis('off')
    plt.show()

plot_results(im, scores, boxes)
First, we load the image to be tested and pass it to the DETR model for object detection. Once detection is complete, we obtain the object labels, confidence scores, and bounding-box information for the image. With this information we can visualize the results, and here we define a plot_results function to make the visualization easier.
The Transformer model was proposed by Google in the paper "Attention Is All You Need". With the release of the ViT model it became possible to apply the Transformer to computer vision tasks, and the DETR model introduced in this article is an object detection model that combines the Transformer with a CNN convolutional neural network.