In episode two of Speaking of AI, LXT’s Phil Hall discusses trending topics such as ChatGPT, Generative AI, and LLMs with John Kelvie, CEO of Bespoken.
Hi. Today’s guest on Speaking of AI is John Kelvie. John is founder and CEO of Bespoken. He has 20-plus years of industry experience and leadership as founder, CTO, and CEO of multiple organizations in the US and in Europe. When you find yourself at a conference losing the will to live after days of death by PowerPoint, John is the person you hope you might be seated with at dinner. He’s interesting, he’s funny, and he’s always ready to question the accepted wisdom on the topics of the day. John, it’s a pleasure to welcome you to Speaking of AI.
Thank you. And it’s a pleasure to be here talking with you.
So you’ve been in this industry since at least 2000. You’ve seen a lot. Things have certainly changed very, very quickly in the last 5 to 10 years. I have my own view on this. I joined the industry in about 2001 and I wasn’t sure this was ever going to work when I saw what the stage of the technology was at that point. I’d love to now hear some of your observations on where the state of the art is today and your expectations of progress dating back to the early 2000s.
I mean, it’s funny. What pulled me into the space was I was initially doing a speech recognition app, and so we were doing it at another startup actually with the same partner. We – and a couple other folks who were terrific – were doing audio advertisements that you could talk back to, which I thought was a really brilliant idea. A little premature, perhaps, but, a very good idea and one that I think we’re still working on. And that got me interested in conversational AI; how speech recognition worked. Being able to communicate hands-free, eyes-free, ambient computing, that whole kind of thing. And then when we saw Alexa coming out, I just thought, this is really going to bring so many people into this ecosystem within a few months.
We started Bespoken pretty soon thereafter, and I thought within a few months people are going to be living in the world that I’ve been living in and you’ve been living in too, Phil, which is really dealing with a lot of data and a lot of testing and a lot of diagnostics. The way I always like to explain to people was in my previous company, we were doing a lot of speech recognition work. I was the CTO. I was not a QA person, but I would write like ten lines of C code – maybe a little bit more – and then test it in a variety of ways that got progressively more complex over somewhere between a day to a couple of weeks because you might be introducing a noise filter, you might be doing something to optimize for an accent or trying to get a particular vocabulary or domain to work well. It was all testing and diagnostics. And I thought, thanks to Amazon, everyone’s going to live in this world and I have some insight into it. And I would say we really haven’t got any traction in the last five years. We have, but I would still say that people are not necessarily taking up that challenge quite as quickly as I thought they would.
And so my impression overall has been that it’s come along rather slowly, the progress. I mean, technology has certainly moved forward. I think that the products have moved forward. And what businesses are offering to customers, I think has moved forward a lot. But it’s been incremental. And it’s interesting having this conversation today, because if we had had it even at the end of last year, I would have had a different take. Now all of a sudden it just feels like everything has changed, and it’s because of OpenAI putting out ChatGPT, and GPT-3 and GPT-4, and it’s just changed everyone’s expectation for what’s possible. I see peers that have been in the space for a while that are saying, and I don’t think they’re trying to gloat, but I think that they were wondering why I kind of kept at it with conversational AI, like, see, there was something here. It wasn’t just being delusional.
I don’t know. I mean, I think it’s an interesting question about what to expect next because I really have no idea. I did not see OpenAI coming. I was not close to the ground on that. And it’s just amazing. And what’s even more amazing is we’re talking to customers now and now we’ve just added on a question to every customer we talk to about what are you doing with LLMs? And a lot of them are doing so much more than I expected. A lot of them are trying to push it into production. And I am amazed, and hopefully I’m not rambling here, but one thing I’ve learned is like, people are not like there’s a lot of testing that’s required. There’s a lot of data that’s required and people are not necessarily in love with those two facts, right? And so it’s like, how is this going to play out? I mean, if you don’t love testing and data, I think it’s going to be an interesting ride for a lot of folks.
Yes, indeed. And in fact, we’ve published a report on AI maturity last year and a follow-up report this year. And I think that’s one of the things that we picked up in that was that people are expecting well, one of the questions was, do you expect to need more training data, more data, less data? Is your spend on this going to increase? Is it going to stay the same? And nobody thought it was going to go down. Well, I don’t think they’re thrilled about spending money on it or investing in it. But out of 200 people in the first year and 300 plus in the second year, nobody thought that their spend on data was going to go down. So it was an interesting stat.
I’ve read some of your other interviews and some of your pieces on AI, and you’ve had a few things to say about misconceptions of AI. So I think you’ve said more than once that there is a gap between people’s perceptions of AI and the reality, and perhaps that the reality is a little bit more vanilla and needs to be a little bit more focused, and to be taken in smaller steps. Am I paraphrasing you accurately there?
You are. And that has been my contention for a long time. I’ll come to what I think has changed, but let’s pretend we were talking in November of last year. I would still stick to what I’ve always said, which is that people need to be very careful about the use case that they’re looking at. They need to understand the technology. They need to be ready to optimize for each of their use cases, even down to an intent level, and they need to not be thinking that this is just going to be like HAL 9000 that you’re going to be chit-chatting with and it’s going to get everything right.
A lot of work goes into it. And quite frankly, my thesis for a long time (which I don’t think I was alone in this) but I think the correct approach to the market was if you can use a website or an app for whatever that use case is if it exists and it’s readily available, people are going to choose that because it’s a more satisfying interaction. And I don’t think that’s a fact that’s been widely disseminated in the first wave of chatbots and voice bots. That’s why I think there has been a lot of disappointment and that’s also why so much of it did retrench, as we did as a company around the contact center because people are going to make phone calls no matter what.
And so that’s the one place where you can say, look, whether you like it or not, we’re going to give you a bot. You’d probably rather talk to a person. If you knew where our app was that would solve this problem, you would have already used that. But because you can’t find those things, you’ve called us up, and now here’s how we can use AI to solve some part of your problems.
That’s obviously very effective. And if you approach it correctly – and I see it as a lot of work, probably more complicated than it should be, but nonetheless rather complex – you can be very successful. And the ROI that we see with automation, with conversational automation, is tremendous. But then when you have something like ChatGPT that comes along, that’s where – now I’m not going to get up and moonwalk – but do I need to revise my answer? Because people at Google and the OpenAI folks, they obviously knew this was coming. I did not foresee it.
I don’t want to be a naysayer because I think it’s tremendous. But there’s some question marks there. But nonetheless, I mean, with that new state of the art, how many new use cases does it open up and does it create the potential for people to go beyond command and control, very tightly controlled domains? I think we’ll see. And I guess the other thing I expect, it’s ChatGPT now, but how long before there’s a VoiceGPT, and what is going to be the impact for that? I think it could be equally sort of seismic, if not more so.
Yeah, absolutely. The interpersonal level of that is going to be quite interesting. I think personal assistants, voice personal assistants are going very well these days. But I think the expectations on quality of synthesis and the quality of the interpersonal skills of the technology are going to really be an important area for development.
I guess I’ll just be interested to see how quickly people can get the new technology into production. That was another thing that was just startling to me is how customers are really thinking about some novel use cases they can use to delight their consumers. Are they going to be able to do it in a safe way? I imagine that’s something that maybe you all are looking at, Phil. I think the sort of safety of these bots is an interesting question for anyone who’s planning on deploying them.
Yeah. In fact, that was going to be my next question for you. I feel like I’ve caught you at a really interesting time. I’m quite glad having the conversation now, not October of last year or a few months from now. I’ve caught you at a point where there’s a certain amount of cognitive dissonance going on for you, because of what you’ve been exposed to and the rethinking of your position. You’ve talked a bit already about being very impressed by what’s been achieved, quite rightly so. I think everybody is. But talk about what you think are the risks, maybe? Where do you see the risks with the current stage of technology? And what do you see as the necessary mitigations?
The risk, I think is kind of obvious and rather profound, and I’ll be very interested to see how people deal with it, which is that we’re working on a benchmark for this. We did a benchmark previously of Alexa and Google and Siri, and so we want to revise that – probably do the same sort of benchmark again – but include ChatGPT, maybe some other of the interesting pieces that are out there. And in the way we evaluated the intelligence, and there’s a lot of ways to do it. But we found an excellent paper written on how to evaluate the sophistication of these NLPs on open-domain questions, which just means basically general knowledge. And then they had a very compelling way of breaking down what constituted complexity, things like temporal questions. Did this happen before that? That was an indication of a more sophisticated assistant. Things like composite questions.
Almost everybody would get it right if you said, “What day was Abraham Lincoln born on?” But a composite question would be, “What day was Abraham Lincoln’s son born on?” You know, then you’ve got to figure out, who’s his son? And then what day was he born? He had more than one son. You’re getting into more complex stuff.
And then obviously the classic: how well does it do at not giving you an answer when it’s completely made up, you know? Let’s say Lincoln didn’t have a daughter. “What day was his daughter born on?” They really like to give you an answer about a fictional daughter.
You know, they all sort of find that irresistible, and that was easily the worst category for them. On open domain, they’re in the fifties, sixties at best. Google easily does the best. But, you know, Siri and Alexa are respectable, but 60% means you’re doing great. Just great. And that was a great achievement for those guys.
If ChatGPT – or, you know, the GPT-4 interface – is able to get to 95%, and that’s conceivable, we haven’t assessed it, but if it got to that level of accuracy, what an achievement. I don’t know if it’s revolutionary, but it’s profound.
Nonetheless, that’s still 5% of the time that it’s wrong. And that’s a problem. We had a customer talking about a medical use case, not in the medical domain, without getting too specific. And we know people are wrong.
I mean, if I called my insurance company, they’re wrong a certain percentage of the time. But still, can you possibly tolerate a system that you know is wrong 5% of the time? And how do you mitigate that? I don’t know how people are going to do that, honestly. I really don’t.
We’ve started breaking our heads over how to do it and what we can offer to customers to assist them with that. We think it’s testing and data. But I just don’t know what else it would be. And I guess well, the other part that we sort of hypothesize is that you need the testing, you need the data and you may need an AI as good or better than the one that you’re testing to evaluate it.
At least here’s what we’re hearing from some customers. We can throw enormous data sets at it and it’ll chew through them and really potentially make a lot of sense of them which is fantastic. But how then do you go about verifying that size of data?
It’s impossible to do it manually, and it’s not obvious that you can do it exhaustively in an automated way either. So you have to do some sort of sampling, and you have to come up with some fairly rigorous testing process. I think it’s a huge challenge for folks that are out there.
Yeah, absolutely. You’ve triggered a thought with me here. I’m wondering if I was put through the same tests that you’re putting these products through, whether my accuracy could reach 95%. I have my doubts. My family would probably tell you that I’d get nowhere near 95%. And I wonder, I wonder whether people can work with 95% if you know that it’s 95% or so.
I’m rambling a bit here, but I think the question is if I achieve 95%, that’s probably okay because people know I’m fallible and they’re quite comfortable with going, yeah, well, it’s good to hear what you think, Phil, but you’re wrong. People are quite comfortable with that. But maybe with something like a ChatGPT, people have a higher expectation that this is the voice of authority. And so the risk is that it’s a two-fold risk. One part is that you don’t know which part is the 5% and the other is that it’s expressed so eloquently and so convincingly that you buy into it anyway.
I think it’s that you don’t know which 5% is not right. But you know, and again, I’m talking about cases that I would characterize as sort of mission-critical. I mean, there’s many use cases, obviously, where business does want it to be. Everyone’s going to say they want it to be perfect. It may not be perfect, but, you know, they may not have that tolerance. There’s lots and lots of use cases where I do think so much of the initial wave of chatbots was around FAQs. I mean, for an FAQ I certainly don’t expect perfection. Especially like, take us as a software company, right? I mean, I guess we should probably roll this out.
Not that our stuff is that complicated, but if you had a question about how our software works, if we deployed a large language model and it answered your question 95% of the time, it would be fantastic. If it covered things that we hadn’t come up with, our customers are not going to be angry over that.
They know how to get in touch with us. If we say that you’re supposed to use a colon and that’s a semicolon, life is going to go on. It’s not mission-critical. It’s not health care. It’s not financial services. So I do think that I am focusing on some of these not edge cases, but the more challenging ones, because I think there’s so many use cases where it’s changed and it’s better.
And in terms of where you think things are headed, I think that those financial and medical use cases are quite prominent in people’s thinking. Financial requires very quick decision-making, which is clearly a part of the skill set of technology. And do you think that these are viable use cases into the future or do you think that the risks are just too great?
That’s where I just don’t know. I think there are risks and there’s challenges, but I shouldn’t hold myself out as an expert on large language models. The core of our business is around classic speech recognition and NLU technology. So that’s what I really understand best. But I read, you know, a lot of the different articles and announcements and I play around with it. And I think I saw something that said ChatGPT had passed a medical exam, and the score was 60%. And I thought, I guess that’s a passing grade, but that’s a D, at least here in the US. I mean, I don’t want the doctor that got a D on that either. I think there’ll be so many regulations to go through, but to what you were just saying, Phil, as a business person, I would want it to be perfect.
I know our customers want everything to be perfect, but for me, as a patient, if it was 60 or 70% correct, I would be really impressed, because I’ve been to doctors. I’m in my forties and approaching older ages and I’ve seen a lot of doctors. They’re very intelligent people, but they’re wrong a lot of the time. And so something that can get you information that has that sort of breadth of knowledge, it obviously can bring a lot of benefit there.
Yeah. A very good friend of mine who had a first career as a doctor, as a general practitioner, she said to me that she goes into a doctor’s surgery and sees their degree on the wall and she says, I want to see their academic record on the wall, not their degree. I want to know whether they scraped in or blitzed it.
Yeah, yeah. I agree that they should be forced to publish that.
Yeah. So I’m just going to shift gears a little bit here. In terms of data, are you working with synthetic data? We hear people talking a lot about synthetic data at the moment. Where do you see things going in that area?
We do a lot of work with synthetic data, but this is where I would say the industry has not matured in the way that I thought it would. When I started off, I was doing my own testing and building my own system. I was using a fair amount of synthetic data, using my own voice a fair bit. And then I had a need for more data. And I started using Mechanical Turk, which was pretty low-quality data. Definitely. Anybody listening to this definitely talk to LXT, don’t try to use Mechanical Turk.
Thank you, John.
But then we got some crowdsourced data and then we got production data and we had a whole big mix of those and sorted into different datasets and used them in different ways. And I really expected that was the journey that everyone would be on. We’ve got a good client base now. I’m not going to claim that we have the most sophisticated customers, but we’ve got big names working on high-profile projects and it’s hard to even convince them of the benefit of doing the synthetic data testing.
I sat down with somebody who used to work at Apple and they said, well, you guys are always advocating for testing with all this different data. And did you know Apple, all we did was golden utterances, and I’m sure that’s not what their data scientists dealt with. But on the QA side, it’s just like we want only the most pristine audio. And I thought, well, that’s surprising to me. You really need more than that to evaluate the fitness of these engines. But customers that we see, they find that intimidating.
I think us too. When we’re dealing with large technology organizations who are building core technology, there’s no question they get it. They understand the value of this. But when you’re dealing with second tier customers who are users of the core technology rather than developers of the core technology, then that conversation becomes a little more difficult and takes a little longer for them to fully understand the value of the testing, etc.
Yeah, I mean, and the interesting thing is, they will frequently, when we’re starting off working with people, they buy into that whole roadmap. Let’s start with some synthetic data. Let’s then do crowdsource. Then maybe we can work in the production data. Production data is tricky because there’s obviously a lot of sensitivity around that.
And you do need a way to safely annotate it, which can be challenging. They sort of tend to get tired or something. So they let it go. Our customers are banks and entertainment companies. I don’t want to name specific ones in the context of something that’s a little bit negative. But, you know, they are not Amazon. I mean, we know Amazon is knee-deep in data. Same thing with Google and the like. And it’s often been said that whoever has the most data wins.
We’re running out of time here, but I just have one last question I’d like to ask you today, which is if you were running the interview, what is the one question that you would want to ask yourself and what’s the answer?
I like that question. We sort of talked about it, but I do think and you sort of asked this, Phil. But I mean, I would just put a pin on it, which is, “What is going to be the impact of LLMs on our space?” I think that that’s the most interesting question that we’re all confronted with now. I don’t come here today pretending to have the answer to that. You know, I will be a very wealthy person soon if I do find the answer to that.
But I think, from our point of view, we think that LLMs are going to get deployed quickly. We see immense benefit to them. I don’t want to say they’re overhyped or underhyped. But, this is a real leap forward in terms of the state of the art of technology. I think because of that, there’s immense interest in it.
People are going to start deploying it, finding use cases for it. It’s going to unlock use cases they couldn’t do before. And I think for customers that are out there that are thinking about it, the thing I would sort of say to them is: do it. But then you do want to work with partners such as Bespoken and LXT so that you are able to do it safely. So that you are evaluating the answers and the quality of the responses that these engines are giving to make sure that your customers are getting the best possible representation of your organization and your knowledge.
I think that that’s the promise that they hold. I also don’t think people should be frightened about it. This is where it goes back to an ROI question. You know, I mean, if we go back to the example of me as a patient, if it’s 70% correct on diagnosis, getting me the information, that is very useful 70% of the time. I think that’s still really helpful. What if you can get it to 75, though? You know, what if you get it to 78? What if you get it to 80? It’s worth your time to look at that and to do that optimization. That’s what we’ve seen from the outset. And that’s the sort of message, you know, or the mission that’s informed us.
And I think my hypothesis on how these LLMs fit in is that it’s going to continue to hold even more so, because the whole trend that I believe was the case with AI is that development goes down, and your testing and your quality assurance and your data go up in terms of importance.
And that equation is continuing to become more and more differentiated. So that’s what we see happening. But at the same time, I will gladly revisit this with you in a year or two years, and we can see how it plays out because it’s going to be, whatever happens, it’s going to be fun I think.
Yeah, I like your assessment of that. I like the idea of reconvening in a year or so and comparing notes on how that’s actually gone, how things have actually played out. So thank you. That was a well-selected question. I’m glad you asked, and thanks again for spending time with us today. It’s been a really stimulating and interesting conversation, as I expected it would be. Thanks very much and I look forward to speaking to you again before too long.
Yeah. Thank you, Phil. It’s great talking with you.