In the third episode of Speaking of AI, LXT’s Phil Hall chats with Saadia Kaffo Yaya, Program Manager Lead for AI/ML at Pinterest, about her views on Generative AI, AI data quality, synthetic data and much more.

Phil:

Today I’m with Saadia Kaffo Yaya. Saadia is a program manager with extensive experience and expertise in data-centric approaches to machine learning. She has worked in product development, quality assurance and specialized localization projects in France and in the U.S. Most recently, she has led data operations teams supporting machine learning for AI at Amazon and currently at Pinterest. She is a vision-driven, problem-solving risk taker, and we are thrilled to welcome Saadia as today’s guest on Speaking of AI. Hello and welcome, Saadia.

Saadia:

Thank you, merci, Phil. I’m very excited to be here today.

Phil:

That’s great. I’d like to kick off by asking about your impression of the very rapid emergence of generative AI.

Saadia:

Interesting. Rapid, indeed it has been. To be honest with you, I have been very much blindsided by ChatGPT and all of its sister and brother technologies of generative AI. And I want to say that it is a new field for all of us. It’s something new to explore, something new to integrate into our lives, the same way the Internet revolutionized how we work and how we live. I believe ChatGPT and generative AI are doing the same right now. And I’m quite excited to see what the future holds when it comes to this type of technology.

Phil:

Yeah, so I’m going to ask a question now about some current news items related to generative AI. This week in AI news, the European Union agreed to draft rules for AI, and they placed a clear focus on generative AI. And then in other news, Google announced a massive expansion of Bard. They listed 180 countries where it’s available. Now clearly, this is near-global coverage. But interestingly, the list of 180 didn’t include any countries from the EU. I’d like to bring those two data points together, knowing that you’ve lived and worked both in the EU and in the US. How do you see the EU regulations around privacy affecting the rollout of generative AI? And do you think this indicates that the EU is leading the way or trailing behind?

Saadia:

That’s a good question. Well, as someone who has lived in both the EU and the US, I want to say that the way technology is used and perceived is very different in the EU versus the US. My personal opinion, again, is that in the EU people are very intentional in how they put themselves out there. One example is social media. In the US you have people using social media a lot, sharing all of the ins and outs of their personal lives. Whereas in the EU, I feel it’s a different approach, where people are more reserved. Basically, people take more time to observe before they try and dabble with something new.

So from that experience I want to say, yes, I’m not surprised that the EU is already starting to think about the implications of generative AI, about how people’s lives are going to change. Because generative AI is sometimes replacing whole job categories. Customer service, for example, is one area where generative AI can be substituted for actual humans.

So from that standpoint, I understand the EU trying to put a framework in place, put some boundaries in there, to make sure that people’s rights, human rights, are respected. And in that sense I want to say they are kind of a pioneer, in my opinion. I don’t think they are falling behind. I think they are asking themselves the right questions, because as I said, this will impact our lives, for better or for worse. It is going to impact our lives. And I think it is good that governments in the EU are starting to think ahead: okay, how will this impact people? And then putting laws in place to prevent any abuse or any risk there.

Phil:

Yeah. Now that makes a lot of sense. Thank you. It does seem like it’s a real cultural orientation. It’s not just, you know, people in technology. It’s people and culture and technology and there’s something immovable there that you can’t just dismiss.

Saadia:

I agree with you. Because generative AI is new, right? And I don’t think those people in government have any expertise in generative AI, but still, they’re going out there and saying, hey, let’s understand this thing. And it’s culturally driven, because we want to preserve our culture somehow. Let’s make sure that we have boundaries in place. Yeah, I agree.

Phil:

Now, at LXT, twice in recent years in fact, we’ve carried out research projects talking to organizations that are rolling out AI programs, and something that’s very relevant, I think, to your current role and your recent roles is that we found in our survey a real concern over data quality. So I have some questions about data quality for you. How would you define data quality? How important is it from your perspective, and what risks do you take on if you get data quality wrong?

Saadia:

That’s a very good point you mention there. When you think about not only generative AI but any type of machine learning model, at some point, for those models to be able to perform at the level they do, they have been provided with some type of data, right? There’s the data that feeds into the model, there’s the insight behind the model, there’s the code behind the model. And then that model outputs something, right? Which is sometimes recognizing an object or localizing text in a paragraph. So historically, the way quality was addressed when it comes to machine learning models was to look at the code and try to fix the code, or look at the infrastructure behind it and try to fix that. But in more recent years, I want to say the last ten years, the focus has shifted to the data itself.

There is this saying in machine learning, “garbage in, garbage out,” meaning if the data you serve to your model is not of high quality, then the performance of your model will be directly affected by the quality of the input data. And that’s where the role I have right now, ML Data Ops, evolved from: you have people like me who help engineering teams and data scientists figure out the best approach to getting high-quality data for a performing model.

So data quality always starts with: what are we trying to do, what does the model do? Data quality is intrinsically related to the model itself, to the performance of the model. If you have a model, for example, that you want to recognize a dress in an image, you might feed that model with dress images and things that are not dresses as well. And then you give that data to a human, for example, have a human label that data, and you check the quality of what the human has done. From that, you can infer a bit about the quality of the model itself, basically.
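
To make that concrete, here is a minimal sketch of the kind of quality check Saadia describes: comparing human labels against a small, expert-verified “gold” set to estimate annotation quality. The item IDs, labels, and helper function are hypothetical, purely for illustration.

```python
# Minimal sketch: estimate annotation quality by comparing human labels
# against a small, expert-verified "gold" set. All item IDs, labels,
# and the helper function are hypothetical.

def annotation_accuracy(human_labels: dict, gold_labels: dict) -> float:
    """Share of audited items where the annotator matched the gold label."""
    audited = [item for item in gold_labels if item in human_labels]
    if not audited:
        return 0.0
    correct = sum(human_labels[item] == gold_labels[item] for item in audited)
    return correct / len(audited)

human_labels = {"img_001": "dress", "img_002": "not_dress", "img_003": "dress"}
gold_labels = {"img_001": "dress", "img_002": "dress", "img_003": "dress"}

print(f"Accuracy on gold set: {annotation_accuracy(human_labels, gold_labels):.0%}")
```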

Phil:

It does. So let’s look at the subtle distinctions. I understand the drivers behind data labeling, but where is the fine line between a well-annotated dataset and a poorly annotated one? And what happens when you fall on the wrong side of that line?

Saadia:

For me, the fine line between the two is being able to look at the data that has been labeled by a human, for example, and having a sort of quality control process in place. You cannot just take the data as it is and feed it into the model, right? In order to have a high-quality dataset, there are several levers you can pull when it comes to human annotation. The first lever: for those humans to label that data, they’ve been given a set of rules, right? How explicit are those rules? You usually give them examples, positive examples of what you’re trying to achieve and negative examples of what you’re trying to achieve.

How explicit is your pool of examples? So that’s the first lever you pull: how explicit are my labeling guidelines, as we call them? How explicit am I in explaining what I’m trying to do, in laying down exactly what I need them to give me? That’s the first thing. The second thing is the quality of the data itself. Does the data represent what I’m trying to achieve? For example, if I want a model to recognize dresses and I’m feeding the model various types of data, dresses but also non-dresses, am I just giving the model one given type of dress, basically? So the variety in the type of data you’re feeding the model also determines quality here.
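
As a rough illustration of how explicit labeling guidelines might be laid down, here is a hypothetical sketch; the task, rules, and example files are all invented for illustration, not drawn from the conversation.

```python
# Hypothetical sketch of explicit labeling guidelines: a task statement,
# unambiguous rules, and both positive and negative examples for annotators.
labeling_guidelines = {
    "task": "Label whether the image shows a dress.",
    "labels": ["dress", "not_dress"],
    "rules": [
        "Label 'dress' only if the garment is a one-piece dress.",
        "Dresses shown on mannequins or in sketches still count as 'dress'.",
        "Two-piece outfits and partial views of fabric are 'not_dress'.",
    ],
    "positive_examples": ["sundress_photo.jpg", "evening_gown.jpg"],
    "negative_examples": ["skirt_and_top.jpg", "fabric_swatch.jpg"],
}
```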

Now, that data is labeled by humans, right? Humans come in different shapes and with different beliefs, and we come with biases as well. How good are you at understanding the pool of people labeling the data, what their possible biases could be, and how you’re mitigating those biases? How am I rationalizing the problem I’m trying to solve to make sure that I’m getting the quality that I want? And one last thing I want to add, one challenge about data quality that I haven’t fully figured out yet: to train a model you sometimes have millions of judgments, of human labels, that you feed into the model, and you cannot look at every single judgment. So you look at a subset, which gives you an idea of the overall quality. So your sampling strategy, the way you sample that data, matters as well, because if you skew toward a given category, that will skew quality too.
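
On that last point about sampling strategy, here is a minimal sketch of a stratified audit sample, which keeps each label category represented proportionally rather than skewed; the judgments, categories, and audit fraction are hypothetical.

```python
import random
from collections import defaultdict

# Minimal sketch: stratified sampling of labeled judgments for a quality
# audit, so every label category appears in proportion to its frequency.
# The judgments and audit fraction here are hypothetical.

def stratified_audit_sample(judgments, audit_fraction=0.01, seed=42):
    """Draw roughly audit_fraction of judgments from each label category."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item_id, label in judgments:
        by_label[label].append((item_id, label))
    sample = []
    for label, items in by_label.items():
        k = max(1, round(len(items) * audit_fraction))  # at least one per category
        sample.extend(rng.sample(items, k))
    return sample

judgments = [(f"item_{i}", "dress" if i % 5 else "not_dress") for i in range(10_000)]
audit = stratified_audit_sample(judgments)
print(f"Auditing {len(audit)} of {len(judgments)} judgments")
```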

And lastly, one important thing: some datasets are very objective, right? You have the type of task where the answer is binary, zero or one. Is this relevant or irrelevant? That’s a binary answer. But you also have datasets where instead of looking for a right answer, you’re looking for people’s perception of things. For example, you might look at a query and ask: is this only art, is it fashion related, is it a quote, what is it? Now you want to gather all the possible impressions of that given image to give you a trend of what you’re looking for, which is sometimes subjective. And quality is sometimes very hard to define for those subjective datasets, compared to the ones that are very objective, basically.
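
For subjective datasets like the ones Saadia describes, quality is often approximated by inter-annotator agreement rather than by a single right answer. Here is a minimal sketch using Cohen’s kappa from scikit-learn; the labels are hypothetical.

```python
# Minimal sketch: for subjective tasks there is no single "right" answer,
# so agreement between annotators is often used as a quality proxy.
# The labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["art", "fashion", "quote", "fashion", "art", "art"]
annotator_b = ["art", "fashion", "fashion", "fashion", "art", "quote"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level
```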

Phil:

That’s a great answer. Thank you. And my own observation: you mentioned a ten-year period over which this data-driven, rather than code-driven, approach has been in place. I’ve been around a while, certainly long enough to remember before and after. My own observation was that early on in the shift toward data-centric approaches, the focus was on quantity of data: we just have to have massive amounts of data. And over the last ten years, I think I’ve seen that shift. Nobody’s saying we don’t want massive amounts of data, but the emphasis on quality has grown over that period. Initially, the message I was hearing at conferences ten years ago was that we don’t really care too much about annotation, all we want is volume of data, that’s the most important thing. And that doesn’t seem to be the message today.

Saadia:

Not anymore. And when you look at ChatGPT, for example, I’ve recently tested ChatGPT versus Bard, giving both generative AIs the same prompts, and I noticed that ChatGPT was actually way better than Bard. It was closer to how a human would think, and it captured more nuance. And I think the difference between the two is actually, as you say, the quality of the data. From what OpenAI is saying as well, they have really doubled down on quality rather than quantity. And I agree with what you’re saying, it’s a trend I also see throughout the industry. You know, I worked at Amazon, I’ve been at Pinterest, and it’s a trend I’m definitely seeing: start with high quality rather than gathering a huge amount of data. The technology around AI has also evolved, where models can now learn from themselves and get better over time. So the better the quality, the better they get at learning from themselves.

Phil:

So I cast my mind back about two years, and there seemed to be a very strong focus on synthetic data. I’m not hearing quite the same focus, quite the same emphasis, shall we say, on synthetic data today. But what’s your take on synthetic data versus real-world data plus human annotation?

Saadia:

That’s a good question. I’m going to start by saying that I’m no expert in synthetic data, but I think synthetic data can sometimes fill the gap. Let’s say you have companies that cannot afford human labels, because collecting human labels or real-world data costs money. You have public datasets that can be used, of course, right? But public datasets tend to be generalist. So if you have a specific use case, for example, you might want to rely on synthetic data. If privacy is an issue, for example, and you cannot use customer data, or you have a shortage of data, synthetic data can make up for that, basically, and fast-track your progress compared to quality human data. And for companies that might have the means to collect human data but don’t have the time, right, or want to optimize for speed, synthetic data can be a great way to fast-track production without having to compromise on something else, basically. So that’s my take on synthetic data.

Phil:

The key phrase in your response there was “fill the gap.” Does that indicate that you don’t see synthetic data replacing real-world, human-annotated data?

Saadia:

Not really. And I’ll tell you why. In the real world you have humans, people like us, and the goal is to have those machines learn from humans, to emulate what a human would think. Synthetic data, as you say, is not real-world data; it’s the opposite, well, not the opposite, but maybe complementary to real-world data. So from that perspective, I do not believe that synthetic data will take over from, you know, human logic or human feedback.

Phil:

Now you’ve somewhat preempted my next question in your response already, but what advice would you give to a startup about setting up a data pipeline for AI if they don’t actually have the deep pockets and resources that a major tech organization like Pinterest has?

Saadia:

That’s a very good question. So I’m going to slice and dice data labeling, you know, break the process of getting data into small chunks. The first step is to identify the customer problem you’re trying to solve. What am I trying to solve for? What is AI solving for me? What use cases do I want the AI to solve? So I’m going to assume that this startup has figured that out already. The next step is: what is the set of data I can gather that will help me solve that problem? And the third step is: now I’m going to build the model, feed the data into that model, check the results, and make sure that I’m on the right path to solve the customer problem I wanted to solve in the first place.

So there are different approaches in terms of collecting the data itself. They can use public datasets, as I mentioned earlier; that’s one way to solve the problem. Or they can buy synthetic data; that’s another way to accommodate it. Or they can gather data from their own customers, which is not something I would recommend, because then they need the consent of the customers to be able to use customer data to train a model.

Now I’m going to move to the second part, which is training the model itself. Either they go in-house and build a model from scratch, which I don’t recommend. I feel like in today’s world you can use an existing solution, rather pay for a solution that already exists and that might even have solved the data collection problem as well. So two in one: use something existing and fast-track. Because as a startup, sometimes what you want is to be the first to put something out there, right? Startups don’t have the luxury of research and development, or the means that big companies have. So speed is something you ideally don’t want to compromise on, but you also want to make sure the result is of good quality. So focus more on quality: outsource the tooling part and the data collection part, and use the resources you have internally to ask, okay, the data I’m getting, is it of good quality? Maximize for quality, maximize for speed, and save the time it would take to build it from scratch.
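
As a rough sketch of the “use an existing solution” advice, here is what loading a pre-trained classifier might look like instead of building a model from scratch. The transformers library and the image-classification task are illustrative assumptions, not tools named in the conversation.

```python
# Minimal sketch of the "use an existing solution" advice: load a
# pre-trained image classifier instead of training one from scratch.
# The transformers library and default model are illustrative assumptions.
from transformers import pipeline

# Downloads a general-purpose pre-trained image classification model.
classifier = pipeline("image-classification")

# Classify a local image; replace with a real file path.
results = classifier("product_photo.jpg")
for result in results:
    print(f"{result['label']}: {result['score']:.2f}")
```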

Phil:

I think that’s very good advice. So, my next question: I mentioned our research report, which we’ve worked on with a large number of companies entering the AI space at varying stages of maturity, and AI maturity has been the central focus of that research. What’s your take on AI maturity? How far off is it, and what will it take to be able to say we are achieving AI maturity globally and locally?

Saadia:

So, two questions. The first one is how far along we would say AI maturity is. Before, you know, the whole arrival of generative AI, I wanted to think that we were quite advanced. Why? Because AI was solving problems. For example, in the medical field, where you could use AI to diagnose people with cancer, things like that. So I was going to say that AI was reaching its maturity, but then Gen AI happened. And I want to say that we are only at the beginning of what AI can do for us when it comes to the technology itself, because I think there are different angles you can look at AI from. You can look at it from what it allows people to do, how it can enhance our productivity as humans, allow us to do more. You can look at it through the lens of how it can improve our health as humans, whether physical or mental health. We can think about it in terms of regulation as well: how are governments going to catch up with the pace at which new technologies are being released, to make sure that risks and biases are managed, that we have boundaries in place? Because the risk with AI is that it is moving so rapidly that it is replacing people’s jobs in some cases, as I said, when it comes to customer service, for example.

So how do you make sure that there’s no harm, right? How do you make sure that AI is not doing more harm than humans already are today, right? So that’s my answer when it comes to AI maturity. Now, you had a second question, I believe.

Phil:

Well, what would it take for us to be able to say that we are achieving AI maturity?

Saadia:

Wow. I want to say the first answer that comes to mind is: I do not know. But then the second option is, let me think about this. What would it take? I want to say that what it would take is the perfect balance of AI allowing us to be, literally, better humans. And for me, what makes a better human is mental and physical health, right? Because without mental health and physical health, there’s nothing humans can do. So how does AI enhance that for us? That would be the first thing for me: can AI do that? The second thing, again, is regulation: balancing what AI can do to enhance our lives against what AI can do to harm us.

Because we’re just at the beginning of this, and we have no idea what these technologies can do, right? So that, for me, would be the mark of AI being mature: reaching a point where we can use it for our own good and our own benefit without, you know, harming humanity, basically. That would be the best answer I can give to this question.

Phil:

And that really is a great answer to the question. Thank you. When I think about the maturity of our own research on this, even our research is so far behind the stage we need to be at to make this meaningful. But I love the way you framed that.

So I’m going to close off with just one more question. I’d say it’s another two-part question, but: if you were the person running this interview, what would be the big question you would want to ask yourself, and what is your answer to that question?

Saadia:

I thought about this before the interview, and it was about regulation and laws, actually. But you have grilled me a lot on that already. That was the first thing that came to mind, because AI is new, but then what are the implications when it comes to our laws, right? So what question would I ask myself today? That’s a good one. Let me think about it. I think one question you could ask me today is: do you think AI will take my job?

Phil:

Yeah. That is a question.

Saadia:

That is a question that I think you could’ve asked me. And to answer it: I don’t think so, not anytime soon. I believe humans are unique. Each of us has a unique way of approaching our jobs today. And from what I’ve seen so far from generative AI, the capabilities they have are quite generic. You know, they don’t have that edge that every human being has. And AI doesn’t feel emotions, as far as I’m aware. Those are senseless machines, if I may say so. I do not believe that AI will replace me or my job anytime soon.

Phil:

Yeah, actually, you know, you’ve made me think of something that hadn’t occurred to me before, which is that as this very small set of generative AI products develops, to some extent they will each develop personalities. But we’re talking about, what, right now the two big ones, ChatGPT and Bard, and others, Amazon among them, working on their own personas. That’s maybe fewer than ten personalities. Every day I deal with hundreds of personalities, and there are billions of personalities out there. It would be a shame to have a set of ten dominating our consciousness.

Saadia:

And that would relate, again, to biases in data, which we were talking about earlier. Those models right now are biased toward a given kind of, as you said, personality or mainstream quality. They do not represent, you know, the world at the moment. So we’ll see.

Phil:

Well, that’s a big enough topic that perhaps we could revisit it in six months or a year and…

Saadia:

Exactly.

Phil:

…come back for a further conversation where we just focus on culture. That could be great. Saadia, thank you so much for being our guest today on Speaking of AI. Your answers were deeply thoughtful and very interesting, and I’m sure people are going to really enjoy listening to and watching the interview. I can’t thank you enough. It’s been great having you on the show.

Saadia:

Thank you very much, Phil, for giving me this opportunity. I don’t get it often, so I was very excited to be here. I also enjoyed the thoughtfulness of your questions, so thank you for having me.

Phil:

Wonderful. Thanks again. Bye bye.

Saadia:

Bye.