Fooling (Around With) Google’s Cloud Vision API

Mitch Abramson
Aug 19, 2018

Google’s Cloud Vision API is free to try, with various purchase plans if you want to use it professionally or in real research. The home page indicates that it can detect multiple elements in a photo (for example: flowers, cats, dogs) and give a count for each. In this post, I test only the drag-and-drop feature, which analyzes a single image without requiring you to drill into the API code or pay for a plan. This type of test evaluates the image as a whole rather than counting elements within it; to get the counts, I believe you need to write code that makes custom API calls.
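If you do want counts rather than a single whole-image judgment, here is a rough sketch of what that might look like using the google-cloud-vision Python client’s object localization feature. The file name is just a placeholder, and exact import paths can vary by library version:

```python
# Rough sketch: per-object counts via the Vision API's object localization.
# Assumes the google-cloud-vision Python client is installed and that
# GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
from collections import Counter

from google.cloud import vision

client = vision.ImageAnnotatorClient()

# "boiler.jpg" is a placeholder for whatever image you want to analyze.
with open("boiler.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Object localization returns one annotation per detected object,
# so tallying the names gives a per-category count.
response = client.object_localization(image=image)
counts = Counter(obj.name for obj in response.localized_object_annotations)

for name, count in counts.most_common():
    print(f"{name}: {count}")
```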

Breaking software and testing its limits has always been an innate accidental talent of mine, and this test was no exception. I literally managed to fool it on my very first try.

First Image Upload (Default Tab)
Clicking the Web Tab

The image used is of pipes and steam valves connected to a giant boiler in the basement of MoMA PS1 (located in Queens). It is not an engine from a motor vehicle. However, if you go to Google and search the web, Google’s own search engine produces hits for this object, and even more for the museum it comes from.

That said, the image is rather specific and unusual. I am guessing that nothing like it was used during training, so the algorithm comes up with the closest match it can. This particular “engine” is coated in real gold, yet none of the labeling detects that either. It also does not connect this image to anything it can find on the web. Not shown: the “unsafe” tab attempts to ascertain whether there is adult or otherwise objectionable content. Most categories correctly assess this photo as clean, but “racy” gets a 3 out of 4. Does this “engine” get your motor running? I guess it takes all kinds.
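For what it is worth, the same safe-search categories are available through the API. Here is a rough sketch with the Python client (same assumptions and placeholder file name as above); the values come back as likelihood levels from VERY_UNLIKELY to VERY_LIKELY rather than the bars the demo page shows:

```python
# Rough sketch: safe-search ("unsafe content") scoring via the API.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("boiler.jpg", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

annotation = client.safe_search_detection(image=image).safe_search_annotation

# Each field is a likelihood level (VERY_UNLIKELY ... VERY_LIKELY).
for category in ("adult", "spoof", "medical", "violence", "racy"):
    print(category, getattr(annotation, category))
```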

One quick suggestion for Google: where is the “provide feedback” or “help train this model” link on the web page? If they provided a simple UI that let even casual users report where the machine learning went wrong, this information could be used to train better models. Better still, with the user’s permission, add the test image to a library and allow the community to weigh in on the computer vision results. Done right, you get a self-correcting community providing useful feedback, and over time the models just get better and better. The community helps guard against a single user trying to game or trick the system. As a further precaution, user-gathered feedback can go into a database for evaluation and only be used in model training once it has been sanity-checked.

Analyzing Text In An Image (Not so Easy for Computers)

This is a funny image I found in a shop window. All of the text reads “The Customer is Always Wrong.” On both my computer and my smartphone, this image displays vertically, but when I drag it onto the test area of the web page, it flips sideways. This happened with every image I tested (including the previous one from MoMA). As you can see in the results, the text detection comes close on some words, but does not get a single word in the image 100% right. Cursive written at an angle is probably tricky for computer vision algorithms, even today.
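For reference, the same OCR pass can be run through the API. A minimal sketch, again assuming the Python client and a placeholder file name:

```python
# Rough sketch: plain OCR (text detection) on a photo of a sign.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("shop_window.jpg", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)

# The first annotation is the full block of detected text;
# the rest are the individual words with bounding boxes.
if response.text_annotations:
    print(response.text_annotations[0].description)
```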

Something I would expect this algorithm to be heavily trained on is food. Though still far from perfect, it does fare much better on a photo of a home-cooked meal.

Categorizing a Food Photo

This image contains one completely vegetarian dish (a salad of radishes and other raw vegetables) alongside an Asian-influenced meal featuring a pork chop, sautéed greens, and boiled rice. The analysis on the page correctly includes many of these words as categories, with percentages for the probability that the photo contains each.

The labels are in a scrolling window that also includes:

  • Dish — 95%
  • Recipe — 55%
  • Commodity — 54%

Since it does not separate out the dishes, it is hard to tell whether the 58% likelihood for the “Vegetarian” classification is accurate for the salad and greens, or inaccurate as a label for the image as a whole.
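Those description/percentage pairs map directly onto the API’s label detection. A rough sketch of how to pull the same list with scores (same assumptions and a placeholder file name as before):

```python
# Rough sketch: label detection with confidence scores.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("dinner.jpg", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

# Each label has a description and a score between 0 and 1 (e.g. Dish, 0.95).
for label in client.label_detection(image=image).label_annotations:
    print(f"{label.description}: {label.score:.0%}")
```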

This particular image completely throws off the text analysis, both on the words on the placemat and on words it thinks it sees on the silverware and the plate (which aren’t really there).

One related point of interest, though: I had a flower pot with a small sign on it identifying the species of flowers as Calandiva. In the photo I took, the word “Calandiva” was tiny compared to the rest of the image content and not perfectly straight, and if you look closely, you can see something that is not text just after the last character of the word. To a computer, this is noise that could be confusing. In the foreground of my final test image, I had a decorative sculpture with the word “Love” written in cursive. Though the document-text-identification tab was unable to “feel the love,” it correctly identified every letter of the word for the species of flower. This technology has clearly come a long way.

Can you read “Calandiva” in this photo without a magnifying glass?
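The document-text tab corresponds to the API’s document text detection, which is tuned for dense text and returns a structured full-text result. A minimal sketch, with the file name once again just a placeholder:

```python
# Rough sketch: dense-text OCR (document text detection).
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("flower_pot.jpg", "rb") as f:  # placeholder file name
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)

# full_text_annotation holds the whole recognized text, with a
# page/block/paragraph/word hierarchy available for finer detail.
print(response.full_text_annotation.text)
```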


Mitch Abramson

I like to apply my own unique creative approach to technology, business, coding, cooking, and writing. Thanks for stopping by.