I have just returned from the ESOMAR Congress, where I had the pleasure of presenting some work on Causal AI. I had dozens of inspiring conversations, and almost every one of them ended with the question, "Frank, what's your view on synthetic respondents?" Professionals right now are smelling snake oil but don't dare bring it up. Here is what I said, spiced with a little more depth.
"We do synthetic sampling, too" a researcher whispers to me. I realize that the terms are confusing. There is synthetic data or synthetic sampling, which is a technology to create a synthetic dataset based on real data. This data is similar in all aspects to the real data but different. In market research it is used to augment samples that lack sample size for certain segments or demographics.
"Information is like energy. It can only be transformed into another form of energy".
In principle this works, but when you have a small sample size, it is like magic to produce something big out of it. We know from information theory that it is not possible to produce information out of thin air. Information is like energy. It can only be transformed into another form of energy. Synthetic data can be useful, but it does not produce more information, it just uses the existing information better.
But synthetic data and synthetic sampling are not what we are talking about here, even though the word "synthetic" appears in both. Synthetic data/sampling software uses deep machine learning, but not Large Language Models.
Synthetic respondents
Early studies in 2023 showed that even GPT-3-based synthetic respondents performed very well in a large US political survey. In another study, synthetic respondents went shopping for toothpaste. By varying the prices in the prompts, the researchers generated a price-demand curve and found results surprisingly similar to a conjoint study conducted on the same brands.
How does the synthetic respondent approach work? Essentially, you prompt an LLM with a description of a particular respondent's demographic profile and include the survey question in the same prompt. Then you repeat those prompts for many different combinations of the demographic profile. Take age and gender: you might want 10 young males, 10 middle-aged males and 10 older males, then the same for females. With more demographics or other profile criteria, you could easily end up with hundreds or thousands of requests.
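A minimal sketch of that loop, assuming an OpenAI-style chat API; the model name, profile grid and survey question are illustrative, not taken from any of the studies mentioned:

```python
# Sketch of the synthetic-respondent loop: one LLM call per
# (profile, repetition) combination. Illustrative values throughout.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

AGES = ["18-29", "30-49", "50+"]
GENDERS = ["male", "female"]
N_PER_CELL = 10  # 10 synthetic respondents per demographic cell
QUESTION = "On a scale of 1-10, how likely are you to buy this toothpaste at $3.99?"

answers = []
for age, gender in product(AGES, GENDERS):
    persona = (
        f"You are a {age} year old {gender} consumer. "
        "Answer as this person would, with only the number."
    )
    for _ in range(N_PER_CELL):
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=1.0,  # randomness so the 10 'respondents' per cell differ
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": QUESTION},
            ],
        )
        answers.append((age, gender, reply.choices[0].message.content))
```

With three age bands, two genders and ten repetitions, this already makes 60 requests; every extra profile criterion multiplies that number.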
At the ESOMAR 2024 congress, a number of presentations analyzed this approach for concept testing, for example. In short, the results are mixed: sometimes it works, sometimes it doesn't. The bad news is that we don't know in advance which cases will work.
“Synthetic respondents focus on quant research, while digital twins focus on qual.”
Digital Twins
Some other presentations at ESOMAR, such as those from Mars and L'Oreal, showed an approach that we can call "digital twins". While synthetic respondents are designed more for quantitative studies, the digital twin approach has qualitative research in mind: we want to talk to the AI and hope it responds like our audience.
Digital twins do not rely on a cascade of prompts. Instead, the LLM is trained on specific information from or about a particular customer persona.
Have digital twins worked for Mars and L'Oreal? Again, the results are mixed. And you can tell: when these twins work exceptionally well, the brands are not shy about bragging about it.
Essentially, the brands take the existing research on a particular customer group, such as open-ended feedback or reactions to particular product concepts, and feed it into the AI.
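In its simplest form, this "training" is really grounding: the persona's research material is placed in the model's context rather than baked in via fine-tuning. A sketch of that common pattern, with a hypothetical folder of research files and an OpenAI-style API:

```python
# Sketch of a context-grounded digital twin: the model answers as one
# persona, grounded in real research material about that customer group.
# The folder, segment name and question are hypothetical.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Existing research on the customer group: open-ended survey feedback,
# concept-test verbatims, segmentation notes, etc.
research = "\n\n".join(
    p.read_text() for p in Path("research/health_conscious_shoppers").glob("*.txt")
)

twin_system_prompt = (
    "You are a digital twin of the 'health-conscious shopper' segment.\n"
    "Ground every answer strictly in the research material below; if the "
    "material does not cover a topic, say you don't know rather than invent.\n\n"
    f"RESEARCH MATERIAL:\n{research}"
)

def ask_twin(question: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": twin_system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return reply.choices[0].message.content

print(ask_twin("What would make you switch to a new snack brand?"))
```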
Brands find these digital twins useful and inspiring, but talking to them is usually not quite the same as talking to real customers. There are still problems such as hallucination and regression to the mean.
The big elephant in the room with qual is the question: how do I know that the digital twin is really a twin, that it gives similar answers to any previously unasked question? The answer can only be a head-to-head comparison with real respondents.
This is exactly what is missing in the studies on digital twins that I have seen. For me, it is not enough to rely on human judgement of plausibility. Plausibility means checking for congruence with expert opinion, not with reality.
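What such a head-to-head check could look like in its simplest form: ask the twin and real respondents the same held-out questions, then compare the answers. The numbers and thresholds below are purely illustrative:

```python
# Sketch of a head-to-head validation of a digital twin against real
# respondents on questions the twin has never seen. Illustrative data.
import numpy as np
from scipy import stats

# Mean ratings (1-10) per held-out question; in practice these come
# from a fresh survey of real respondents and the same prompts to the twin.
real = np.array([7.2, 4.1, 8.0, 5.5, 6.3, 3.9])
twin = np.array([6.8, 4.5, 7.6, 6.9, 6.1, 4.2])

r, p = stats.pearsonr(real, twin)
mae = np.mean(np.abs(real - twin))
print(f"correlation r={r:.2f} (p={p:.3f}), mean absolute error={mae:.2f}")

# A twin is only a twin if it tracks reality on unseen questions;
# human 'plausibility' judgments cannot substitute for this check.
if r < 0.8 or mae > 1.0:  # illustrative calibration thresholds
    print("Twin drifts from reality: recalibrate before trusting it.")
```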
What's missing? Why do these methods not yet work to our complete satisfaction?
Again and again: what's missing is Causality
I happened to be sitting in the main hall at the ESOMAR Congress when the Puratos Group started to present their work with digital twins. Puratos is a large company that supplies bakery products to supermarkets and bakeries. They asked an LLM to invent a new healthy bakery product and then baked the LLM's invention according to the "invented" recipe. The banana-lentil cake tasted terrible.
Puratos' approach to digital twins was different. Instead of feeding the LLM random descriptions of the target group, they condensed everything they knew into knowledge graphs: a causal representation of actual knowledge about their customers.
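To make the idea concrete, here is a toy version of such a graph: causal triples rendered into a prompt. The triples are invented for illustration and are not Puratos' actual knowledge:

```python
# Sketch of the knowledge-graph idea: encode what is known to DRIVE
# customer behavior as causal triples and hand that, not raw verbatims,
# to the LLM. All relationships below are invented for illustration.
causal_graph = [
    ("perceived naturalness", "increases", "purchase intent for bakery snacks"),
    ("seed and nut textures", "increase", "perceived naturalness"),
    ("visible whole grains", "increase", "perceived healthiness"),
    ("perceived healthiness", "increases", "willingness to pay"),
    ("artificial-sounding ingredients", "decrease", "purchase intent"),
]

graph_text = "\n".join(
    f"- {cause} {relation} {effect}" for cause, relation, effect in causal_graph
)

prompt = (
    "Using ONLY the causal relationships below, propose a new healthy "
    "bakery product that maximizes purchase intent:\n" + graph_text
)
print(prompt)
```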
The result, according to Puratos, was compelling. The machine invented a chia-based berry cake that tasted amazing in real life. The experience opened up a new area of product concepts for them, combining nuts and seeds with confectionery and bakery.
Think about it. What customers say isn't all that relevant to understanding what drives their thinking and decisions. Most digital twins today suffer from garbage-in-garbage-out syndrome.
“Most digital twins today suffer from the garbage-in-garbage-out syndrome.”
You may ask, "How can you say, dear Frank, that our internal data and insights are rubbish?" Well, descriptors are just that: they describe facts, NOT relationships. What you are looking for is NOT facts, but what LEADS to certain results. Important insights are typically causal insights, NOT facts or data.
We know that when we ask "Why did you rate it that way?" in an NPS survey, the most common response is not causally relevant to the results. Music loudspeaker customers say "because of the great sound", restaurant customers say "because it tastes good" and washing machine customers say "because it washes well". It turns out that none of these are the key drivers of customer behavior.
The more irrelevant stuff you feed your AI, the more irrelevant the results it will produce. They may sound plausible or look nice, but they turn out not to work.
A nice proof of this idea is a study that the amazing Dr Steffen Schmidt initiated. Before producing social ads with Midjourney, we conducted an implicit survey to measure implicit brand archetype perceptions of mobile brands alongside brand willingness. Using Supra Causal AI, we identified certain brand archetypes as the key attractors for the target group.
With this insight, we were able to prompt the AI to produce ads, which we then tested in real social contexts. The result: without further optimization, we achieved an 18% increase in market share over controls.
Another example: at Supra, we built a synthetic respondent approach to measure the price-demand curve of products without asking customers. The prompting was guided throughout by the question of what makes customers buy a product at a given price. In addition to the core demographics, we integrated information such as "is this person already a buyer of this brand", "has this person tried this product" and "is this person a heavy user of the category". With this and other information, we were able to reproduce most of our real, measured results with such precision that they would lead to the same price optimum.
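A simplified sketch of how such causally conditioned synthetic respondents can trace a demand curve. The model, prices, base rates and profile fields are illustrative stand-ins, not our actual setup:

```python
# Sketch: condition each synthetic respondent on causally relevant facts
# (buyer status, trial, category usage), ask for a buy/no-buy decision
# at each price point, and read the demand curve off the buy shares.
import random
from openai import OpenAI

client = OpenAI()
PRICES = [2.49, 2.99, 3.49, 3.99, 4.49]  # illustrative price ladder

def synthetic_buy_decision(price: float, profile: dict) -> bool:
    persona = (
        f"You are a consumer. Already buys this brand: {profile['buyer']}. "
        f"Has tried this product: {profile['tried']}. "
        f"Heavy user of the category: {profile['heavy_user']}."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user",
             "content": f"Would you buy this product at ${price:.2f}? Answer YES or NO."},
        ],
    )
    return "YES" in reply.choices[0].message.content.upper()

# In practice, profiles are drawn to match the real market's composition;
# the base rates here are made up.
profiles = [
    {"buyer": random.random() < 0.3,
     "tried": random.random() < 0.5,
     "heavy_user": random.random() < 0.2}
    for _ in range(100)
]

for price in PRICES:
    share = sum(synthetic_buy_decision(price, p) for p in profiles) / len(profiles)
    print(f"${price:.2f}: {share:.0%} would buy")
```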
"Causal in, truth out"
Is SYNTHETIC a bubble?
Yes and no. “Getting insights without surveys” promises not only lower costs and greater speed, but also the ability to do without insights departments and research agencies. It is particularly promising for use cases where audiences are hard to reach and expensive to sample, such as B2B.
As with any hype, it starts with over-promising. The million-dollar question is how to make these methods mature and trustworthy. For me, the way forward is clear:
Conduct Causal AI studies to understand the causal drivers of customer behavior. Complement this with established secondary knowledge of relationships.
Validate your twins against reality: LLMs are shiny black boxes, and it is tempting to believe them. But we should validate regularly to calibrate the approach.
Synthetics is a new field with amazing potential. Be smart: give your twins only excellent (causal) 'food' and validate them regularly.
This is how you 10x your insights.
P.S. Not sure how to achieve this causal-synthetic tandem? Hopefully I do not need to mention that you can always contact me for a chat.