Collecting quality training data: A how-to guide for better AI


Artificial intelligence (AI) is a top priority for companies looking to drive digital transformation and gain a competitive edge. Demand for AI will grow significantly over the next few years, and companies that pass on it run the risk of being left behind by their competitors.

An AI solution’s effectiveness and accuracy are directly linked to the quality of the data used to train its machine learning algorithms. Ensuring the availability of quality data begins with incorporating a data collection plan into the data strategy that a company creates to support its AI initiatives.

In this guide, we will discuss topics such as why data is essential to AI, what quality data looks like, how data collection fits into the lifecycle of AI, and how to create and sustain unbiased AI.


Key findings

AI investment

  • More than half of organizations surveyed (56%) are spending between $1 million and $50 million on AI annually, and 15% are spending $51 million or more.
  • Over a third of high revenue companies are spending between $51 million and $100 million on AI annually.
  • AI strategies are primarily driven by innovation and growth needs, where AI enables businesses to scale and innovate faster, and to secure competitive advantage.
  • Improving efficiency and productivity is seen as the leading problem that AI can solve across industries. Improved analytics and business expansion are also high priorities.

AI and maturity levels

  • 40% of organizations rate themselves within the three highest levels of AI maturity according to Gartner’s AI Maturity Model: Operational, Systemic, and Transformational.
  • Companies in the Systemic and Transformational levels of the AI maturity model are using AI to scale and drive competitive advantage and product innovation. They are also budgeting higher amounts overall for AI programs, and are using both supervised and semi-supervised machine learning methods.
  • AI investment and AI maturity correlate, with a quarter of AI Maturing organizations spending $51 million or more on AI, compared to just 8% of Experimenters (those organizations in the awareness and experimental stages of AI adoption).
  • Regarding the drivers behind AI strategies, businesses at the Transformational end of the scale have already experienced the benefits of better risk management and delivering on customer needs. Now, they are looking to scale up and accelerate product innovation. AI Experimenters are focused on driving innovation and managing large volumes of data.
  • AI Maturing organizations rely more on semi-supervised and fully supervised machine learning, while AI Nascent organizations report greater use of unsupervised machine learning.
  • Maturing organizations view quality training data as an essential contributor to AI success.
  • When asked about the benefits experienced as a result of high-quality training data for AI, Experimenters see efficiency and agility gains, while Maturing organizations report accelerated time to market and improved competitive advantage.
  • Two-thirds of all respondents expect their need for training data to increase over the next five years, and AI Maturing organizations indicate the highest need to increase their training data budgets over this timeframe.

AI trends by industry

  • The financial services industry is leading the way, with 43% of organizations having reached Systemic or Transformational levels of AI maturity.
  • The tech industry follows the financial services industry, with 22% of organizations having reached the highest levels of AI maturity.
  • Respondents state that efficiency and productivity gains are the most common goals of AI strategies for their industries as a whole, except for financial services companies, which say that improved analytics is the main goal.
  • The tech, retail, manufacturing/automotive and professional services industries are deploying AI to innovate and advance product development, while financial services companies are driven by competitive advantage and risk management.
  • Text data is the leading type used across all industries; future data types include audio, user behavior and video.

01 Why data is essential to effective AI

Many AI solutions are powered by a machine learning algorithm that informs every decision. These decisions are based on the algorithm’s processing of information, or training data, and a team of people who iteratively annotate or correct the algorithm’s incorrect responses in order to help improve its accuracy. Put another way, a machine learning algorithm has the raw potential to be trained based on the data to which it’s been exposed. But not just any data will do.

When bringing an AI solution to market, you’re placing your confidence in the solution’s ability to make sound decisions for your company and customers. As a business or technology lead, you must have confidence that your algorithms are making decisions that are accurate, reliable, and defensible.

Quality data and why it matters

To this end, algorithms must be trained using quality data. More specifically, the data must meet the following guidelines:

  • Complete:

    Data sets must contain all essential information, with no missing values.
  • Timely:

    Data must be updated to reflect current market conditions.
  • Consistent:

    Data must not change as it moves across a company’s network into different storage locations.
  • Distinct:

    There should not be any duplication or overlapping of values across data sets.
  • Accurate:

    Data must reflect actual, real-world scenarios backed by verifiable sources.
  • GDPR compliant:

    All data and personally identifiable information must abide by guidelines outlined in the EU’s security and privacy regulations.
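
As a rough illustration, a few of the guidelines above (complete, distinct, and timely) can be turned into automated checks. The Python sketch below assumes records are stored as dictionaries. The field names, the sample records, and the 365-day freshness threshold are assumptions made for this example, not requirements from this guide.

```python
from datetime import date, timedelta

REQUIRED_FIELDS = ("customer_id", "country", "updated_at")  # hypothetical schema


def find_incomplete(records):
    """Return records missing a required field or holding an empty value."""
    return [
        r for r in records
        if any(r.get(field) in (None, "") for field in REQUIRED_FIELDS)
    ]


def find_duplicates(records, key="customer_id"):
    """Return records whose key value was already seen earlier in the list."""
    seen, duplicates = set(), []
    for r in records:
        value = r.get(key)
        if value in seen:
            duplicates.append(r)
        seen.add(value)
    return duplicates


def find_stale(records, today, max_age_days=365):
    """Return records last updated more than max_age_days before today."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r.get("updated_at") and r["updated_at"] < cutoff]


# Illustrative sample data only
records = [
    {"customer_id": 1, "country": "DE", "updated_at": date(2024, 5, 1)},
    {"customer_id": 2, "country": "", "updated_at": date(2024, 4, 12)},
    {"customer_id": 1, "country": "DE", "updated_at": date(2024, 5, 1)},
    {"customer_id": 3, "country": "FR", "updated_at": date(2020, 1, 15)},
]

today = date(2024, 6, 1)
incomplete = find_incomplete(records)
duplicates = find_duplicates(records)
stale = find_stale(records, today)
print(len(incomplete), "incomplete,", len(duplicates), "duplicate,", len(stale), "stale")
```

Checks like these cover only part of data quality. Accuracy, consistency across storage locations, and regulatory compliance generally need review beyond a simple script.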

Once data is collected, it’s essential that you label it clearly and accurately so that your algorithm can make sense of it. While this guide isn’t focused on the details of labeling data, labeling is another important element for the effective training of supervised and semi-supervised machine learning algorithms.

02 How data fits in the AI solution lifecycle

Throughout the lifecycle of an AI solution, training data plays a part in nearly every stage, whether that is collecting the data, labeling the data, or using data to train your algorithm.

The AI solution life cycle

  1. Prototyping an AI solution
  2. Collecting the data
  3. Labeling and organizing data
  4. Using the data to train an algorithm
  5. Deploying the AI solution
  6. Collecting live data
  7. Using live data to improve the user experience

Because data plays such an integral part in the life cycle, one of the first things you should do is formulate a data collection plan. This will help ensure that you have enough quality, relevant data to develop your solution, get it quickly to market, and sustain it over the long term.



03 Data collection plans: What they are and how to build them

A data collection plan helps identify the data you need, where you will collect it from (whether in-house, via a public data source, or by creating a custom dataset), and who will collect it. Also note that if you’re collecting data in multiple languages, your data collection plan should be localized for each language locale. An experienced data collection partner can help develop your plans.

Crystallize your problem and solution

A good place to start a data collection plan is to write a statement identifying the end user, the nature of the problem, and the information you need to solve it. This can help crystallize your understanding of the situation and provide a point of reference for determining the data you need to build the dataset.
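
One way to keep these decisions in a single place is to record them in a small structure. The Python sketch below is purely illustrative: the `DataCollectionPlan` class, its field names, and the sample values are hypothetical and simply mirror the elements described above (end user, problem, information needed, sources, collectors, and locales).

```python
from dataclasses import dataclass, field


@dataclass
class DataCollectionPlan:
    """Hypothetical record of the decisions a data collection plan captures."""
    end_user: str
    problem: str
    information_needed: list
    sources: list       # e.g. "in-house", "public", "custom"
    collectors: list    # who will collect the data
    locales: list = field(default_factory=lambda: ["en-US"])

    def problem_statement(self):
        """Combine end user, problem, and needed information into one sentence."""
        needed = ", ".join(self.information_needed)
        return f"{self.end_user} need help with {self.problem}; solving it requires {needed}."


# Illustrative values only
plan = DataCollectionPlan(
    end_user="Drivers",
    problem="hands-free navigation",
    information_needed=["in-car voice commands", "background road noise"],
    sources=["custom"],
    collectors=["data collection partner"],
    locales=["en-US", "de-DE"],
)

print(plan.problem_statement())
```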

Determine how much data you need

When quantifying how much data you need to collect, consider the following factors:

  • The complexity of the problem:

    In a speech recognition scenario, for example, the amount of data you need is driven by the number of languages and dialects you’ll support. With computer vision solutions, consider the complexity of the scenario in which it will operate. For example, if you’re building an autonomous vehicle, you must account for roadways, other cars, pedestrians, construction zones, and other hazards. There are also numerous scenarios and edge cases that you must consider, as well as the algorithm’s ability to process and synthesize data from a car’s various sensors, including camera, LiDAR, radar, and ultrasound.
  • The complexity of your algorithm:

    For example, non-linear algorithms, which model relationships that cannot be described by a straight line, require more data in order to clearly map out the relationships between different data points. Common uses of non-linear algorithms include tracking anomalies in financial transactions, segmenting consumer behavior, and identifying patterns in inventory based on sales activity.
  • The number of features you are training for:

    A feature is a measurable property of the object you’re trying to analyze. For example, this might include the name, age, gender, ticket price, and seat number on a passenger manifest.
  • How much data you have on hand:

    An algorithm that’s been trained using a curated and diverse dataset of external and in-house data may provide greater value than one trained solely on in-house data. Therefore, it may be advisable to curate additional data from external sources and train the algorithm with a dataset containing features that are tangentially related to the business problem.

The amount of data needed to get your AI solution from pilot to production varies by use case. An experienced data provider can help you navigate the data collection process.

04 How to collect data for AI

Once you’ve developed a data collection plan, it’s time to start collecting data. In some cases, you may not have enough in-house data on hand, what you do have may be improperly formatted, or removing errors to improve the quality may not be cost-effective. In such cases, consider one of the following options:

  • Generate in-house data:

    Automatically generate data from line-of-business apps and websites or from sensors embedded across your company’s physical space and assets, or scrape data from social media, product review sites, and other online platforms.
  • Collect primary or custom data:

    Conduct focus groups or surveys with end users or scrape data from the web. Either way, this involves generating a unique dataset that’s tailored to your business problem.
  • Export data from one algorithm to another:

    Use one algorithm as the foundation to train another. This collection method can save time and money, but it only works when moving from a general context to one that’s more specific.
  • Generate synthetic data:

    Rather than using data containing personally identifiable information that will raise security and privacy concerns, you can generate synthetic data based on it. Synthetic data replaces all of the personal data with random, anonymized data that approximates the same relationships. One limitation is that it may not reflect the full scope of the problem you’re trying to solve.
  • Use open source datasets:

    Open source datasets can help accelerate your data training, and there are several online providers of open source datasets (such as Kaggle). When using an open source dataset, be sure to consider its relevance, the potential for any security and privacy concerns, and its reliability and freedom from bias.
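
To make the synthetic data option above more concrete, the Python sketch below shows only its simplest piece: replacing personal fields with random stand-ins while keeping the replacement consistent, so that rows belonging to the same person still line up. This is closer to pseudonymization than to full synthetic data generation, which would also need to model the statistical relationships in the data. The function names and sample rows are hypothetical.

```python
import random
import string


def random_token(rng, length=8):
    """Build a random uppercase token to stand in for a personal value."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def replace_personal_fields(rows, personal_fields, seed=0):
    """Replace each personal value with a random token.

    The same original value always maps to the same token, so rows that
    referred to the same person still refer to the same stand-in.
    """
    rng = random.Random(seed)  # not cryptographically secure; illustration only
    mapping = {}
    result = []
    for row in rows:
        new_row = dict(row)
        for field_name in personal_fields:
            key = (field_name, row[field_name])
            if key not in mapping:
                mapping[key] = random_token(rng)
            new_row[field_name] = mapping[key]
        result.append(new_row)
    return result


# Illustrative sample data only
rows = [
    {"name": "Ana Silva", "email": "ana@example.com", "purchase": 42.0},
    {"name": "Ben Okoro", "email": "ben@example.com", "purchase": 17.5},
    {"name": "Ana Silva", "email": "ana@example.com", "purchase": 8.25},
]

synthetic = replace_personal_fields(rows, personal_fields=("name", "email"))
for row in synthetic:
    print(row)
```

Non-personal fields such as `purchase` pass through unchanged in this sketch. Whether that is acceptable depends on your own privacy requirements.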

05 Different types of data for AI

The types of data you need to train your algorithm depend on the business problem you’re trying to solve and how users will interact with the AI solution, whether through text, speech, imagery, or gestures. The following are different types of data and the particular goals for each:

Audio data

Audio data covers a range of sounds including music, animals and other objects, human sounds (e.g. coughs, sneezes, or snores), and other background noises.

Audio data is used in training algorithms for a variety of uses, including virtual assistants, smart car systems, smart home devices and appliances, voice bots, and voice recognition-enabled security systems.

Speech data

When it comes to capturing human language, it’s best to capture the audio in a live environment that reflects the scenario in which your AI solution will be used. For example, when capturing audio for an in-car solution, it’s best to record the driver speaking while they are driving.

If budget or time constraints prevent you from taking this approach, do your best to approximate background noise and other aspects of the environment in which customers will use your solution.

And when recording speech data in foreign languages, be sure to account for the many dialects. If different dialects aren’t recorded, the variations in accent and pronunciation can make it difficult for an algorithm to understand a command or lead to inaccurate interpretations of the command, whether for the entire user base or only a fraction.

Once speech is recorded, it is often paired with a text transcription version of the recording.

There are three types of speech data:

  • Scripted speech:

    Scripted speech data is focused more on the different ways in which the same set of words might be pronounced, based on different accents, dialects, and speech mannerisms. Typical uses include voice commands or wake words.
  • Scenario-based speech:

    Scenario-based speech data is a step closer to natural language collection. The goal is to capture what and how a person would say something in a particular situation, such as speaking a command to an audio assistant or mobile app.
  • Unscripted/conversational speech data:

    Unscripted or conversational speech data is usually generated by recording two or more people talking about a particular topic. The primary goal is to train algorithms on the dynamics of multi-speaker conversations, such as changing of topics, flow of a conversation, unspoken assumptions between speakers, and multiple people talking at once.

Image/video/gesture recognition data

Visual data—video, still imagery, and gesture recognition imagery—is used in a wide variety of scenarios, such as robotics, gaming, autonomous vehicles, in-store inventory management, and defect detection systems. Imagery can be captured using a phone or camera, but if you’re doing it at scale (with multiple people taking pictures), everyone should use the same equipment.

In addition to the images, visual datasets include annotations describing what the algorithm should be looking for, and where it’s located in the image. This “ground truth” helps an algorithm understand the sorts of patterns it should look for and detect.
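
As a minimal sketch of what such an annotation might look like, the Python example below stores a label together with a bounding box and checks that the box fits inside the image. The class names, field names, and sample values are hypothetical. Real annotation formats vary by tool and dataset.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates, measured from the top-left corner."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class ImageAnnotation:
    """Hypothetical ground-truth record: what is in the image and where."""
    image_file: str
    image_width: int
    image_height: int
    label: str
    box: BoundingBox

    def is_valid(self):
        """Check that the box has positive size and lies inside the image."""
        b = self.box
        return (
            b.width > 0
            and b.height > 0
            and b.x >= 0
            and b.y >= 0
            and b.x + b.width <= self.image_width
            and b.y + b.height <= self.image_height
        )


# Illustrative values only
good = ImageAnnotation("shelf_001.jpg", 640, 480, "cereal_box", BoundingBox(100, 50, 200, 300))
bad = ImageAnnotation("shelf_002.jpg", 640, 480, "cereal_box", BoundingBox(500, 50, 200, 300))

print(good.is_valid(), bad.is_valid())
```

A validity check of this kind can catch some annotation errors early, before they reach training.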

Image and video datasets must contain hundreds, if not thousands, of high-quality images, and those images must be varied. In addition to creating your own custom dataset, consider one of the many public sources of imagery data, such as ImageNet and MS COCO (Common Objects in Context).


Text data

Text data can be gathered from a variety of sources, including documents, receipts, handwritten notes, and chatbot intent data. It is used to develop AI solutions that understand human language in text form, most notably in chatbots, search engines, virtual assistants, and optical character recognition.

Conversational AI systems, such as audio assistants and chatbots, also require large amounts of high-quality data in a variety of languages and dialects so they can be used by customers around the world.


06 Improving AI performance

Algorithm training isn’t over once an AI solution has been released to market. Data is a representation of real life, and over time the data used to train an algorithm becomes a less accurate reflection of current market conditions. This phenomenon is known as model drift, of which there are two types. Both require continued retraining of an algorithm.

  • Concept drift:

    Concept drift happens when the relationship between the training data and the AI solution outputs changes, either suddenly or gradually. For example, a retailer might use historical customer data to train an algorithm, but when a monumental shift in consumer behavior occurs, the algorithm’s predictions will no longer reflect reality.
  • Data drift:

    Data drift, sometimes called covariate drift, happens when the input data used to train an algorithm no longer reflects the actual input data used in production. Typical causes include changes brought about by seasonality, demographic shifts, or an algorithm being used in a new geography.
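
One common way to check for data drift is to compare the distribution of an input feature in the training data with its distribution in production. The Python sketch below uses the Population Stability Index (PSI), which is one of several possible measures and is not prescribed by this guide. The bin count, the sample values, and the threshold mentioned afterward are assumptions for illustration.

```python
import math


def population_stability_index(reference, production, bins=5, epsilon=1e-6):
    """Compare two numeric samples using the Population Stability Index (PSI).

    Bin edges are equal-width over the range of the reference sample;
    production values outside that range fall into the first or last bin.
    """
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # avoid zero width if all values are equal

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            index = int((v - low) / width)
            index = max(0, min(bins - 1, index))
            counts[index] += 1
        # epsilon keeps the logarithm defined when a bin is empty
        return [max(c / len(values), epsilon) for c in counts]

    ref_shares = bin_shares(reference)
    prod_shares = bin_shares(production)
    return sum(
        (p - r) * math.log(p / r) for r, p in zip(ref_shares, prod_shares)
    )


# Illustrative values only, e.g. order sizes seen in training and in production
training_values = [20, 22, 25, 27, 30, 32, 35, 37, 40, 42]
unchanged_values = list(training_values)
shifted_values = [v + 15 for v in training_values]

psi_unchanged = population_stability_index(training_values, unchanged_values)
psi_shifted = population_stability_index(training_values, shifted_values)
print(psi_unchanged, psi_shifted)
```

A frequently cited rule of thumb treats a PSI above roughly 0.25 as a sign of a significant shift, but the right threshold depends on the use case and should be tuned against your own data. Note that this check looks at inputs only. Detecting concept drift generally also requires comparing predictions against verified outcomes.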

07 Signs that it is time to retrain

The frequency of retraining required depends in part on the market in which an algorithm operates. For example, predicting college student attrition rates calls for retraining on a predictable, intermittent schedule, because the algorithm operates within a fairly non-hostile environment. By contrast, credit card fraud is an adversarial environment in which an AI solution must be perpetually retrained to adjust to changes in the threat landscape.

Establish a benchmark

Regardless of the environment, you must establish benchmarks and collect live data to track your AI solution’s performance and determine when additional training is needed.
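
A minimal Python sketch of this idea follows, assuming accuracy is the metric you track and that verified labels are available for a sample of live predictions. The benchmark value, the tolerance, and the sample data are hypothetical.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the verified labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def needs_retraining(benchmark_accuracy, live_accuracy, tolerance=0.05):
    """Flag retraining when live accuracy falls more than tolerance below the benchmark."""
    return (benchmark_accuracy - live_accuracy) > tolerance


benchmark = 0.92  # hypothetical accuracy measured at release

# Illustrative sample of live predictions and their verified labels
live_predictions = ["fraud", "ok", "ok", "ok", "fraud", "ok", "ok", "ok", "ok", "ok"]
live_labels = ["fraud", "ok", "fraud", "ok", "ok", "ok", "ok", "fraud", "ok", "ok"]

live = accuracy(live_predictions, live_labels)
retrain = needs_retraining(benchmark, live)
print(f"live accuracy: {live:.2f}, retrain: {retrain}")
```

Accuracy is only one possible benchmark. For imbalanced problems such as fraud detection, metrics like precision and recall are often more informative.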

Monitor feedback

User feedback can also help alert you to when additional training is needed. If customers complain that your chatbot isn’t providing help in a specific area, that could be an opportunity to collect more data and train or retrain the algorithm.

08 Reducing bias through data collection

Humans will inescapably introduce bias when training an algorithm. For example, in the test of a widely-used facial recognition solution, the ACLU found that the software “incorrectly matched 28 members of Congress, identifying them as other people who have been arrested for a crime.” Of particular note, the false matches were disproportionately of people of color.

Following are some of the different types of biases that can materialize:

  • Bias with pre-processing:

    You don’t have domain expertise, don’t fully understand the data, and don’t have sufficient understanding of the variables.
  • Feature engineering bias:

    A machine learning model’s treatment of an attribute, or set of attributes (e.g. social status, gender, ethnic characteristics) negatively impacts its results or predictions.
  • Data selection bias:

    A training dataset isn’t large or representative enough, leading to a misrepresentation of the actual population.
  • Model training bias:

    There’s an inconsistency between the actual and trained model results.
  • Model validation bias:

    A machine learning model’s performance has not been sufficiently assessed using testing data, rather than training data.
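
Data selection bias in particular lends itself to a simple check: compare how often each group appears in the training dataset with how often it appears in the population you intend to serve. The Python sketch below is one hedged way to do that. The group names, the population shares, and the 0.8 ratio used as a cut-off are assumptions for illustration.

```python
from collections import Counter


def representation_gaps(samples, group_key, population_shares, min_ratio=0.8):
    """Find groups whose share of the dataset is well below their population share.

    Returns {group: (dataset_share, population_share)} for every group whose
    dataset share is less than min_ratio times its population share.
    """
    counts = Counter(sample[group_key] for sample in samples)
    total = len(samples)
    gaps = {}
    for group, population_share in population_shares.items():
        dataset_share = counts.get(group, 0) / total
        if dataset_share < min_ratio * population_share:
            gaps[group] = (dataset_share, population_share)
    return gaps


# Illustrative speech dataset: 70, 25, and 5 recordings across three dialects
samples = (
    [{"dialect": "dialect_a"}] * 70
    + [{"dialect": "dialect_b"}] * 25
    + [{"dialect": "dialect_c"}] * 5
)

# Hypothetical share of each dialect among the intended users
population_shares = {"dialect_a": 0.5, "dialect_b": 0.3, "dialect_c": 0.2}

gaps = representation_gaps(samples, "dialect", population_shares)
print(gaps)
```

A flagged group is a prompt to collect more data for that group. A check like this does not by itself show that a model treats groups fairly.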

Data collection and human intervention can reduce those biases, but doing so takes planning ahead, an understanding of the domain in question, and an understanding of how the presence or absence of a data point can skew the AI solution’s results.

At a more fundamental level, by working with people from a diverse range of backgrounds, collaboration styles, and worldviews, you can mitigate biases and develop responsible AI solutions.

09 The benefits of a data collection partner

Launching an in-house data collection initiative can be a costly endeavor that ties up resources and slows down your ability to build, train, test, and release a machine learning algorithm to market. By working with a trusted partner that can collect large volumes of reliable data, you can gain their expertise and stay focused on your business.

Cost savings

Managing the continual process of data collection in-house can be cost-prohibitive. Working with an experienced data collection partner allows you to ensure that data is collected correctly and thoroughly from day one, helping you to build a reliable data pipeline and focus your internal resources on driving innovation.

Bias reduction

Data collection partners with a diverse network of contributors can help to build a robust data pipeline that represents a wider range of demographics, dialects, and experiences.


Generating a sustainable, scalable training data pipeline takes specific expertise. Experienced data collection partners can provide quality assurance methodologies and processes that ensure a consistent, reliable flow of quality, AI-ready data.

Time to market acceleration

Creating a scalable data collection program from scratch isn’t out of the question, but by the time you source enough data at the right level of quality and diversity, your competitors may have already captured your market share. Alternatively, an expert data collection partner can design a data collection program, onboard contributors and generate data in a matter of weeks, making all the difference in the success of your AI initiatives.

10 What to look for in a data collection partner

If you’re ready to work with a data collection partner, consider the following factors:


You need assurances that the data you receive is of a high standard from the start to avoid costly rework and delays. Furthermore, your data partner should have established quality control processes that ensure consistency and reliability. And as your data volumes increase, you need to know that they can continue to meet your requirements.


Your AI roadmap is unique, and so are your data needs. Finding a data partner equipped to collect data and create datasets specific to your requirements, no matter how complex, is critical to differentiating yourself in the market.


For AI solutions to work globally, you need a partner that can quickly expand your program to many regions, generating the necessary volume of data in each locale. The ability to manage thousands of contributors and process large volumes of data while maintaining high quality are important capabilities to consider when looking for a data collection partner.


Your data needs and requirements are likely to change as you develop your data pipeline and train and test your models. It’s important to select a data partner that can adjust on the fly and continue to deliver high-quality datasets.


Data collection should not be a bottleneck in your development cycle. Select a partner with a reputation for delivering data quickly, without compromising quality. This includes recruiting the right program participants in a timely manner, ensuring guidelines are clear and precise, and delivering data on time.

Privacy and security

Select a partner that follows the latest privacy guidelines and protects your data with multiple layers of security. Also, look for a partner that can support your specific privacy and security requirements.

Ethical crowdsourcing

Working with a diverse, global group of contributors will help you build more inclusive AI solutions. When selecting a data collection partner, be sure to review their crowdsourcing policies and procedures, including pay rates and pay frequency.

11 Building a custom data collection process

In some cases, you may need to build out a custom data collection process, such as when expanding to a different geography, launching a new business initiative, or conducting business in a market with especially sensitive security and privacy concerns. Creating a data collection plan that’s tailored to your needs enables you to collect the right amount of data that’s specific to your brand or target audience.

In addition, you can account for environmental factors that are unique to a particular industry or country and gain the level of visibility and insight that are necessary to build an explainable AI solution. No matter your challenges, an experienced data collection partner can help you build a custom data collection process that meets your needs.

Learn more about LXT’s custom data collection services here.


Ready to discuss your data collection needs?

Contact our experts today