Calling BS

We both recently read Calling Bullshit (by Carl T. Bergstrom and Jevin D. West) and Lee reviewed the book in a blog post. We’ve talked together about the BS we see in and around testing for many years, and written about it too (e.g. Lee’s critical review of the Capgemini “World Quality Report 2020-21”). We then spotted a tweet from Michael Bolton and it inspired us to pen this joint blog post in the spirit of “calling bullshit”.

Some of the attributes that we think make a great tester include critical thinking, shining a light into dark places, and being brave enough to call out problems when they’re spotted. Why then do testers seem willing to walk past problematic statements made in blogs, LinkedIn posts and vendor marketing and not call them out? How do we advance testing if we continue to allow mistruths to go unchallenged?

In On Bullshit, the philosopher Frankfurt (2005) defines bullshit as something that is designed to impress but that was constructed absent direct concern for the truth. This distinguishes bullshit from lying, which entails a deliberate manipulation and subversion of truth (as understood by the liar).

While there is no shortage of material when it comes to BS around testing, consider this recent marketing from testing tool vendor, Applitools – on their official website page for Applitools Eyes – as an example.


Stable, Fast, and Efficient

Applitools Eyes is powered by Visual AI, the only AI powered computer vision that replicates the human eyes and brain to quickly spot functional and visual regressions. Tests infused with Visual AI are created 5.8x faster, run 3.8 more stable, and catch 45% more bugs vs traditional functional testing. In addition, tests powered by Visual AI can take advantage of the ultrafast speed and stability of the next generation of cross browser testing, our Ultrafast Test Cloud.

Let’s consider some attributes of human eyes and brains. Eyes don’t, in reality, “see” anything. They capture light and send signals, via the optic nerves, to our brain’s visual cortex for interpretation and correction. Notice how you have two eyes but only see one “smooth” picture – that’s some neat real-time processing right there. Not only that, your eyes are sending a constantly changing stream of signals, but note how the picture stays smooth and controlled, enabling you to make decisions, almost at an unconscious level.

So, how do we “replicate” this with a computer? The answer is: we don’t.

“Vision is the functional aspect of the brain that we understand the best, in humans and other animals,” Tenenbaum says. “And computer vision is one of the most successful areas of AI at this point. We take for granted that machines can now look at pictures and recognize faces very well, and detect other kinds of objects.”

However, even these sophisticated artificial intelligence systems don’t come close to what the human visual system can do, Yildirim says. “Our brains don’t just detect that there’s an object over there, or recognize and put a label on something,” he says. “We see all of the shapes, the geometry, the surfaces, the textures. We see a very rich world.”

Neuroscience News

While it might sound cool to make such a grandiose statement, the regressions being spotted are simply comparisons – no more and no less. The computer is programmed to look for variances. It is probably reasonable to assume that the more sophisticated the software, the better it will be at spotting regressions. Comparing state “A” with state “B” is, when all is said and done, one of the things computers were built for and are good at.
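To show how ordinary such a comparison is, here is a minimal sketch in plain Python. It illustrates the general idea only and is not a description of how Applitools Eyes works; the function name `diff_pixels`, the `tolerance` parameter and the tiny greyscale grids are all our own assumptions.

```python
from typing import List, Tuple

# Hypothetical greyscale "screenshot": rows of 0-255 brightness values.
Image = List[List[int]]


def diff_pixels(baseline: Image, current: Image, tolerance: int = 0) -> List[Tuple[int, int]]:
    """Return (row, col) positions where the images differ by more than tolerance."""
    same_shape = len(baseline) == len(current) and all(
        len(a) == len(b) for a, b in zip(baseline, current)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(baseline, current))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > tolerance
    ]


baseline = [[255, 255, 255], [0, 0, 0]]   # state "A"
current = [[255, 250, 255], [0, 0, 90]]   # state "B"

print(diff_pixels(baseline, current))                # [(0, 1), (1, 2)]
print(diff_pixels(baseline, current, tolerance=10))  # [(1, 2)]
```

The code reports where the two states differ. Whether either difference matters is a judgement it cannot make; that part is left to a person.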

Question – why are we replicating the “human eyes and brain”? We know that humans really struggle when it comes to checking in detail and for any length of time. Is the answer then to ask a computer to replicate that struggle? Why would we use software that replicates human limitations and errors? Perhaps we are not “replicating” at all but creating an illusion that makes for a grand sales pitch.

It seems to us that if the software was “replicating human eyes and brain” then surely it could do more than just spot a difference. Is it exploring the reasons why the difference has been spotted? Is it creating a bug ticket with replication steps and a detailed analysis of why the difference is unacceptable rather than acceptable? Can it compare the current state with previous states and make decisions about attractiveness, usability, suitability? Can it consider any of this based on a variety of contexts? If not, what’s stopping this? A human brain, a real human brain, that’s what. Perhaps it would be enough for Applitools to talk about how accurate the software is, avoiding this “human eyes and brain” hyperbole altogether.

Let’s turn our attention to the idea of “Tests infused with visual AI”. From the Oxford dictionary we can interpret “infused”, in this context, as “the introduction of a new element or quality into something”. From reading across one of the linked pages, we see many references to “AI” and we also see attention-grabbing words and phrases like “complex” and “deep learning”, along with the claim that “Applitools Visual AI has achieved 99.9999% accuracy”.

We take a couple of things out of this:

  1. We have been given no better understanding of why “tests infused with AI” are helpful to our testing or quality goals. While it might make for a nice sales pitch, it gives us nothing of substance to compare against anything we might already be using, whether to pinpoint likely advantages or to find even half-solid reasons for considering a switch.
  2. The claim that “Applitools Visual AI has achieved 99.9999% accuracy” requires some scrutiny. Notice that it does not claim to always achieve or provide you with “99.9999% accuracy” (is our sarcasm in overdrive suggesting the wording is deliberately designed to be misread this way?). “Six nines” is an amazing accuracy rate but the big (and unanswered) question is: under what conditions? High accuracy rates are achievable, and are far less impressive, when achieved under optimal conditions. If we started using this software tomorrow, what would our accuracy rate be? How long would it take to achieve 99.9999%? Is it even achievable in the real world outside of controlled conditions?
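Some back-of-the-envelope arithmetic shows how much evidence a figure like this would need. The sketch below is ours, not anything published by Applitools: the function names are hypothetical, and it assumes each comparison is an independent trial that is representative of real use, which is exactly the kind of condition the marketing copy does not state.

```python
import math


def expected_errors(accuracy: float, comparisons: int) -> float:
    """Expected number of wrong results over a number of comparisons."""
    return (1.0 - accuracy) * comparisons


def error_free_trials_needed(accuracy: float, confidence: float = 0.95) -> int:
    """Smallest number of consecutive error-free, independent trials needed to
    rule out (at the given confidence) an error rate worse than 1 - accuracy."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    error_rate = 1.0 - accuracy
    # Solve (1 - error_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-error_rate))


claimed = 0.999999  # the "99.9999%" figure quoted above

print(expected_errors(claimed, 1_000_000))   # about one error per million comparisons
print(error_free_trials_needed(claimed))     # in the region of three million trials
```

Under these assumptions, supporting the claim at 95% confidence would take around three million comparisons without a single mistake. Knowing how many comparisons were actually measured, and on what, is the missing context.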

Let’s move on to the next series of claims – “5.8x faster” and “3.8 [times] more stable” – and ask the question, compared to what? These are unsupported statistical claims. There’s no denying that they sound impressive, but there is nothing to substantiate the statistics.

Without knowing the source and context, a particular statistic is worth little. Yet numbers and statistics appear rigorous and reliable simply by virtue of being quantitative, and have a tendency to spread.

West, Jevin D.; Bergstrom, Carl T.. Calling Bullshit (p. 101). Penguin Books Ltd. Kindle Edition.

We’re not going to simply accept the numbers. What is the benchmark? Under what conditions? Which hardware, software, load, etc.? If we run the software, will we see these (or similar) speed and stability improvements? We can’t possibly know because the claims have nothing to back them up. This seems exceptionally strange when you are selling into a market that should be questioning such claims.

In all seriousness, what would a software tool pitch be without a claim to being superior to testing executed by humans? “…catch 45% more bugs vs traditional functional testing”. It’s nice to see that “manual” was cast aside in favour of “functional”, or was it? Is this a comparison to automated checks written without Applitools Eyes or is it a comparison to human testing? As the model is undefined, the “45% more bugs” claim is completely invalid.
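A relative figure like this means nothing without its baseline, as a few lines of arithmetic show. The baselines below are invented purely for illustration and the function name `bugs_found` is our own; none of these numbers come from Applitools.

```python
def bugs_found(baseline_bugs: int, relative_gain: float = 0.45) -> float:
    """Bugs found if a tool really did catch relative_gain more than the baseline."""
    return baseline_bugs * (1.0 + relative_gain)


# Hypothetical baselines: the same "45% more" covers very different outcomes.
for baseline in (2, 20, 200):
    extra = bugs_found(baseline) - baseline
    print(f"baseline of {baseline:>3} bugs -> about {extra:.1f} extra bugs")
```

Depending on the undisclosed baseline, “45% more” could mean roughly one extra bug or ninety, and it says nothing about whether those bugs matter.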

The final claim made here relates to their cloud, which apparently offers “ultrafast speed and stability” when it comes to cross-browser testing. A LinkedIn post (from 5th January 2021) amped up this claim:

Again we see a very specific number, “20x faster”, with nothing offered by way of explanation of what is being compared (e.g. overall execution time), and there can be no foundation to the claim that their solution performs so well compared to “any other”. They could provide data to show their tool’s performance compared against some other solutions, but they simply cannot claim it performs so much better than all alternatives. It’s also worth noting that the speed of check execution is not an indicator of quality in the approach, and the human elements of genuine testing (as opposed to algorithmic checking) are not amenable to being sped up without a reduction in their efficacy.

The amount of energy needed to refute bullshit is an order of magnitude bigger than [that needed] to produce it.

West, Jevin D.; Bergstrom, Carl T.. Calling Bullshit (p. 11). Penguin Books Ltd. Kindle Edition.

We’ve focused on Applitools marketing in this blog post – note that we’re not attacking the company, its people or its product, rather we’re critiquing their messaging. We see similar examples crossing our feeds on a daily basis – and so we assume most other testers do too. It’s easy to look at this nonsense and scroll on by, but if we do nothing and allow the BS to proliferate then we’re complicit in its ability to spread, be accepted and maybe even become viewed as fact over time. 

So, please join us in calling out the BS you see around testing. Flex those critical thinking muscles and communicate your objections and let’s hope that by doing so, we can help reduce the prevalence of such poor messaging around testing.
