When to use Meta's Segment Anything with KIADAM to drastically improve your model's accuracy

Rémi Coscoy
Jul 20, 2023
Updated: Aug 29, 2023

Our Kiadam tool generates a dataset for your object recognition or computer vision project by pasting images of the object you want to detect onto a chosen background. But what happens when the object is not rectangular? Its bounding box then contains many pixels that belong neither to the object nor to the new background, but to the background of the original photo, and these pixels can mislead the model.

Meta has released Segment Anything, an open-source model that can "cut out" virtually any object from its background with impressive accuracy. We'll show you how to use it to solve this problem, and when it is actually worth using.

Curious about Segment Anything? Check out the official page or this blog post.

Here are links to the different sections of this post:

How to use Segment Anything or SAM
Why use Segment Anything with Kiadam
First Experiment
When Segment Anything is the most useful
Conclusion

How to use Segment Anything or SAM

Link to the Colab Notebook

Good news: we put together our own Colab Notebook so you can make your own cutouts as easily as possible. Click the link and follow along!

First, head over to Roboflow, label your images, and download them to your computer as a zip with YOLOv5-style annotations. Make sure each image contains only one object.

Next, open our Google Colab Notebook.

Click Run All and upload your zip when the first cell prompts you. The cut-out images will then be downloaded to your computer automatically!
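The post doesn't show what the notebook does internally, but the core step can be sketched as follows. This is a minimal sketch, not the notebook's actual code: it assumes Meta's `segment_anything` package and a downloaded ViT-H checkpoint (the file name below is a placeholder), and the helper names `yolo_to_xyxy`, `mask_to_rgba` and `sam_cutout` are hypothetical. A YOLOv5 label line stores `class x_center y_center width height`, normalized to [0, 1], while SAM's box prompt expects pixel coordinates `[x0, y0, x1, y1]`, so the label is converted first; the resulting mask then becomes the alpha channel of a transparent cutout.

```python
import numpy as np


def yolo_to_xyxy(label_line: str, img_w: int, img_h: int) -> np.ndarray:
    """Convert one YOLOv5 label line (normalized center/size) to a pixel box [x0, y0, x1, y1]."""
    _cls, xc, yc, w, h = (float(v) for v in label_line.split())
    x0 = (xc - w / 2) * img_w
    y0 = (yc - h / 2) * img_h
    x1 = (xc + w / 2) * img_w
    y1 = (yc + h / 2) * img_h
    return np.array([x0, y0, x1, y1])


def mask_to_rgba(image: np.ndarray, mask: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Use the boolean mask as an alpha channel and crop to the box.

    Pixels outside the object get alpha 0, so they become transparent.
    """
    x0, y0, x1, y1 = np.round(box).astype(int)
    alpha = mask.astype(np.uint8) * 255
    rgba = np.dstack([image, alpha])  # HxWx4, uint8
    return rgba[y0:y1, x0:x1]


def sam_cutout(image: np.ndarray, box: np.ndarray, checkpoint: str = "sam_vit_h_4b8939.pth") -> np.ndarray:
    """Hypothetical helper: prompt SAM with the annotated box and return an RGBA cutout."""
    # Imported lazily so the helpers above work without SAM installed.
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    # SAM expects an HxWx3 uint8 image in RGB order by default (convert if loaded with OpenCV).
    predictor.set_image(image)
    # One box prompt, one mask: masks has shape (1, H, W).
    masks, _scores, _logits = predictor.predict(box=box, multimask_output=False)
    return mask_to_rgba(image, masks[0], box)
```

Saving the result as a PNG keeps the transparency, so a later pasting step can treat every transparent pixel as "not the object".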

Why use Segment Anything along with KIADAM

Check out another example of this technique in our previous blog post

Our data synthesis works by pasting images of the object you want to detect onto relevant backgrounds, combined with various data augmentation techniques to create new training images.

However, with a classic rectangular crop, part of the "original background" (here, for example, the gray tablecloth behind the beer bottle) remains in the object to be pasted.

This extra information has nothing to do with the object and can mislead your neural network during training.
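Here is a minimal NumPy sketch of the difference between the two pasting strategies. It is a hypothetical illustration, not KIADAM's implementation: `paste` assumes the object is an RGBA array whose RGB channels hold the rectangular crop and whose alpha channel holds the object mask (for instance a SAM mask), and that the object fits entirely inside the background. With `use_mask=False`, every pixel of the crop, leftover tablecloth included, overwrites the new background; with `use_mask=True`, only the object's pixels do.

```python
import numpy as np


def paste(background: np.ndarray, obj_rgba: np.ndarray, x: int, y: int, use_mask: bool = True) -> np.ndarray:
    """Paste an RGBA object onto an RGB background with its top-left corner at (x, y).

    use_mask=False copies the whole rectangle (classic crop).
    use_mask=True alpha-blends, so masked-out pixels keep the new background.
    Assumes uint8 images and that the object fits inside the background.
    """
    out = background.astype(np.float32)  # astype returns a copy; the input is untouched
    h, w = obj_rgba.shape[:2]
    rgb = obj_rgba[..., :3].astype(np.float32)
    if use_mask:
        alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
    else:
        alpha = np.ones((h, w, 1), dtype=np.float32)
    region = out[y:y + h, x:x + w]
    # Standard "over" compositing: alpha * foreground + (1 - alpha) * background.
    out[y:y + h, x:x + w] = alpha * rgb + (1.0 - alpha) * region
    return out.astype(np.uint8)
```

Blending with a soft alpha rather than a hard boolean test also lets smoothed mask edges fade into the new background, although SAM's raw masks are binary.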

Using SAM lets you cut out this unnecessary information and produce synthetic images that are closer to reality. In this case, however, the difference between the rectangular crop and the SAM cutout is slim. Let's see whether that difference really matters in our first experiment.

First Experiment

Open the Colab Notebook here

For the sake of the experiment, we created a testing set with as little variation as possible (a single background, constant lighting and constant object size), so that the only variable is the presence or absence of a SAM cutout.

In this notebook, we compare the results YOLOv5 obtains when trained on different datasets. The metric is mean Average Precision (mAP).
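For readers unfamiliar with the metric: a prediction counts as a true positive when its IoU (intersection over union) with a not-yet-matched ground-truth box reaches a threshold, and AP is the area under the resulting precision-recall curve; mAP averages AP over classes, so with a single class it equals AP. The sketch below computes all-point interpolated AP at IoU 0.5 for one class. It illustrates the metric only: YOLOv5's own evaluation differs in details such as how the curve is interpolated, and the post does not state which IoU threshold its numbers use, so this is not a way to reproduce them. The function names are hypothetical.

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes given as [x0, y0, x1, y1] in pixels."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """Single-class AP.

    preds: list of (image_id, score, box); gts: dict image_id -> list of boxes.
    """
    n_gt = sum(len(boxes) for boxes in gts.values())
    matched = {k: [False] * len(v) for k, v in gts.items()}
    tp = []
    # Walk predictions from most to least confident.
    for img, _score, box in sorted(preds, key=lambda p: -p[1]):
        best, best_j = 0.0, -1
        for j, g in enumerate(gts.get(img, [])):
            o = iou(box, g)
            if o > best:
                best, best_j = o, j
        # A ground-truth box can only be matched once; duplicates are false positives.
        if best >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp.append(1)
        else:
            tp.append(0)
    cum_tp = np.cumsum(np.array(tp, dtype=float))
    recall = cum_tp / max(n_gt, 1)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Pad the curve, make precision non-increasing, then sum the area under the steps.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

With two ground-truth bottles, a correct detection, then a false positive, then a second correct detection, this gives AP = 0.5 × 1 + 0.5 × 2/3 = 5/6; ranking the false positive last would give 1.0, which is why confidence ordering matters.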

Here is an excerpt of the testing set.

Three beer bottles placed onto a grey table — Image from the Testing Set

On the one hand, we train on a dataset of rectangular crops pasted onto a background:

Photos of beer bottles pasted onto a background of a table — Training set without SAM

On the other hand, we train on a dataset built from exactly the same images, cut out with SAM and pasted onto the same backgrounds:

Photos of beer bottle with SAM cutout pasted onto a table background — Training set with SAM
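Since the cutout should be the only variable, one way to build such a pair of datasets is to draw each random placement once and render it twice. The sketch below is hypothetical (the post does not describe how KIADAM places objects): `render_pair` assumes an RGBA object whose RGB channels are the original rectangular crop and whose alpha channel is the SAM mask, places it uniformly at random, and returns both images plus a YOLO-format label, which is identical for the two variants because the placement is.

```python
import numpy as np


def to_yolo_label(cls_id: int, x: int, y: int, w: int, h: int, img_w: int, img_h: int) -> str:
    """YOLO format: class x_center y_center width height, normalized to [0, 1]."""
    return (f"{cls_id} {(x + w / 2) / img_w:.6f} {(y + h / 2) / img_h:.6f} "
            f"{w / img_w:.6f} {h / img_h:.6f}")


def render_pair(background: np.ndarray, obj_rgba: np.ndarray, rng: np.random.Generator, cls_id: int = 0):
    """Render one random placement twice: rectangular crop vs. masked (SAM) cutout."""
    bh, bw = background.shape[:2]
    h, w = obj_rgba.shape[:2]
    # Hypothetical placement policy: uniform position, object fully inside the frame.
    x = int(rng.integers(0, bw - w + 1))
    y = int(rng.integers(0, bh - h + 1))

    rgb = obj_rgba[..., :3].astype(np.float32)
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0
    region = background[y:y + h, x:x + w].astype(np.float32)

    # Without SAM: the whole rectangle, leftover original background included.
    without_sam = background.copy()
    without_sam[y:y + h, x:x + w] = obj_rgba[..., :3]
    # With SAM: only masked pixels replace the new background.
    with_sam = background.copy()
    with_sam[y:y + h, x:x + w] = (alpha * rgb + (1.0 - alpha) * region).astype(np.uint8)

    label = to_yolo_label(cls_id, x, y, w, h, bw, bh)
    return without_sam, with_sam, label
```

Passing a seeded generator such as `np.random.default_rng(0)` makes the two datasets reproducible; calling `render_pair` for each background and object yields matched images in the two training sets.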

Here are our results:

Training Set	mAP
Without SAM	0.969
With SAM	0.966

The difference between using and not using SAM is negligible: 0.003 mAP, slightly in favor of the rectangular crops. Why is that?

Here is an image generated by our tool using SAM cutouts.

Photos of beer bottle pasted onto a table background

Here is an image generated with rectangular crops. As you can see, the two look quite similar. There are two main reasons for this:

How rectangular the object is
How different the leftover part of the original background is from the testing background

For a beer bottle, only the neck keeps it from being a rectangle, and the leftover background is gray, quite similar to the actual background of the testing set.

Three beer bottles placed on top of a table
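Both factors can be checked on a single photo before deciding whether SAM is worth running. The sketch below shows one hypothetical way to quantify them; neither measure nor any threshold comes from the post's experiments. `fill_ratio` is the fraction of the object's bounding box covered by its mask (1.0 for a perfect rectangle), and `leftover_color_gap` is the Euclidean distance between the mean color of the crop's non-object pixels and the mean color of a target background.

```python
import numpy as np


def fill_ratio(mask: np.ndarray) -> float:
    """Fraction of the mask's bounding box occupied by the object (1.0 = perfect rectangle)."""
    if not mask.any():
        return 0.0
    ys, xs = np.nonzero(mask)
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return float(mask.sum()) / float(box_area)


def leftover_color_gap(crop_rgb: np.ndarray, mask: np.ndarray, target_bg: np.ndarray) -> float:
    """Distance between the mean RGB of the crop's non-object pixels and the target background's mean RGB.

    mask must be a boolean HxW array aligned with crop_rgb (HxWx3).
    """
    leftover = crop_rgb[~mask].astype(np.float64)  # shape (N, 3)
    if leftover.size == 0:
        return 0.0  # nothing left over: the crop is already a perfect rectangle
    target_mean = target_bg.reshape(-1, 3).astype(np.float64).mean(axis=0)
    return float(np.linalg.norm(leftover.mean(axis=0) - target_mean))
```

By the post's own description, the scissors used later would score below 0.5 on `fill_ratio`, while a beer bottle, whose only non-rectangular part is its neck, would sit much closer to 1. A mean color is a crude summary of a background, so a large gap is a hint rather than proof that SAM will help.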

In this particular case, with an object as rectangular as a beer bottle and a testing background so close to the original one, SAM is not necessary. Let's now look at an object for which SAM is a game changer.

When Segment Anything is the most useful

Follow along in this Colab Notebook

We need an object whose shape is as far from a rectangle as possible. Look around your house and you probably have at least one: a pair of scissors.

A pair of scissors — Cropped scissors, with a lot of remaining background

In this crop, the scissors occupy less than half of the image; the rest is background. For the sake of the experiment, this white background is quite different from the yellow background used in the testing set.

Three pairs of scissors placed on top of a table (testing set)

We generated two training sets with our tool. Here is what the one with SAM looks like:

Photos of pairs of scissors extracted by SAM and pasted onto a photo of a table (training set with SAM)

It does look quite close to the testing set.

And here is the one using only rectangular crops, without SAM:

Photos of three pairs of scissors with rectangular crop pasted onto a photo of a table — Training set without SAM

Without SAM, a large part of the white background from the original image remains, and it is far more distracting than the leftover tablecloth in our first experiment with the beer bottles. But what are the actual results?

Training YOLOv5 on these two datasets and evaluating both on the same testing set, we obtain the following:

Training Set	mAP
Without SAM	0.825
With SAM	0.921

In this case, Segment Anything gives us a massive performance increase: mAP rises from 0.825 to 0.921, a gain of almost 0.1!

Conclusion

When using Segment Anything with Kiadam, you are likely to see the largest gains when the shape of the object to recognize differs significantly from a rectangle, and when the production environment is quite different from the one in which you originally photographed the objects.