
#12 AI Data Quality: Crap in – Crap out

1     Introduction

No matter how hard we work on training an AI model, the essence of the matter is data quality. In a nutshell: crap in = crap out. In all our previous papers except the last one (#11), we somehow assumed that we had good data to train our model. Unfortunately, good-quality data is a very complicated feat to pull off. In general, the data in life and around us are messy, too rare or too plentiful, hidden (usually in plain sight, but hidden nonetheless), misunderstood, non-homogeneous, biased, etc. So, in short, it is hell.

Despite this state of affairs, we do need quality data to get a quality model. As with a human being, if the teacher is bad, the teaching is not going to be brilliant. The pupil is only as good as his teacher. Just remember that Alexander the Great had Aristotle as a tutor. It kind of paid off for him.

But of course, amazing data on a poor architecture will do no good either. It is a balancing act. Today, we will focus on the data side of the problem, more specifically on data preparation: collecting, cleaning, labelling and balancing, with a focus on cost efficiency.

I will assume, so as not to repeat myself too much, that you have either read my previous papers on training AI, or that you are at least aware that we need data to train a Neural Network.

2       Data Sources: never ready to go

Raw data is rarely ready for Machine Learning (ML) without serious work.

For a change, I will not use my usual Smell detector example to illustrate this paper. Instead, I will use a situation more relatable for all the business readers out there: we will build a simple Chatbot for Customer support. We are the ACME company, we have an amazing ACME product, and our customers sometimes struggle and ask for help. We need to help our support team by relieving them of the most basic questions, leaving them with the difficult and most interesting ones. So we will build a Chatbot. For the sake of this paper, we will consider that we use a Machine Learning model trained from scratch. In reality, this is unlikely to be the chosen solution, since it is fairly expensive and complicated to build, but it is the simplest to explain and also an excellent solution if done well.

So, the scope of this paper is to reach the point where we can start training our model. In other words, we place ourselves before the training we presented in paper #8. Our aim is to create the dataset that will be split into 3 parts: 80% for training, 10% for performance validation, and 10% for testing.
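
As a quick illustration, here is a minimal sketch of that 80/10/10 split, assuming scikit-learn is available; "samples" and "labels" are hypothetical placeholders for the cleaned, labelled data we will build throughout this paper.

```python
# A minimal sketch of the 80/10/10 split, assuming scikit-learn.
# "samples" and "labels" are hypothetical placeholders for the cleaned,
# labelled data built in the rest of this paper.
from sklearn.model_selection import train_test_split

def split_dataset(samples, labels, seed=42):
    # First set aside 80% for training, keeping 20% for later.
    x_train, x_rest, y_train, y_rest = train_test_split(
        samples, labels, test_size=0.20, random_state=seed, stratify=labels)
    # Split the remaining 20% in half: 10% validation, 10% test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```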

To achieve that stage, we need to:

  1. Identify the sources of the data that will be used for the training, and gather this data into a dataset.
  2. Clean the data of unnecessary information, so the model can focus on what it really needs to learn.
  3. Label the data to run supervised learning, meaning that we attach the expected output to each input.
  4. Balance the data to train the model on a world that represents the actual real-life situation. We want fair, balanced data to produce fair and balanced results.
  5. Possibly augment the data. The amount of data available may well not be enough to do a good job; as we have seen in paper #11, a too-small set can lead to bias. So, we might want to use techniques that create fake but realistic data to feed our model.

Once we have achieved these steps, we have a large enough dataset that represents the real world in a fair way. Only then can we start the training.

3      Data Preparation

3.1     Gathering

The first step in the process is to identify where the data will come from. Each source comes with a series of considerations:

  1. Format: What are the format and structure of the data source? Some data come as numbers, text, images, web pages or PDFs; structured, unstructured, etc.
  2. Volume: What is the volume of data available? From a handful to petabytes of data, the way to handle them will differ.
  3. Accessibility: How accessible are these data? Do we have them readily available, or do we have to go through a long series of steps (secured data for instance) to get them? The effort and timeline implied can have important consequences.
  4. Messiness: Are the data “messy”? In other words, do we have to do a lot of cleaning to get some consistency across the data? Will words always mean the same thing, or can they be interpreted in very different ways? Do we have a lot of jargon and technical language?
  5. Noise: Is there “noise”? Will there be a lot of irrelevant information in the source? Irrelevant information makes it harder for the model to learn, as it distracts its attention from what matters.
  6. Bias: Is there bias in the source? Data could very well be focused on one specific area of the business. For instance, the biggest customers in size, or the biggest customers in number of purchase orders, or customers located in a specific area. The list is long and the effect very dangerous.
  7. Cost: How expensive is it to extract and process the data source?

With all these criteria in mind, we must identify the various sources of information. Let’s do just that, shall we? Let’s see what sources we could consider for training our chatbot. The following list is not exhaustive and, in reality, highly dependent on your business, but it should give you a good overview.

3.1.1    Historical Customer Interaction Logs

Description: Transcripts or records of past customer interactions with ACME’s support team via live chats, emails, or phone calls (converted to text via speech-to-text). Captures real-world queries, responses, and conversational patterns.

Examples:

  • Chat log: “Customer: My ACME gadget won’t turn on. Agent: Try a hard reset by holding the power button for 10 seconds.”
  • Email: “Subject: Delivery Issue. Customer: My order hasn’t arrived. Agent: Tracking shows delivery tomorrow.”
  • Call transcript: “Customer: How do I update my gadget? Agent: Download the latest firmware from our site.”

Characteristics:

  • Format: Unstructured text, often in CRM systems or email archives.
  • Volume: High (thousands of interactions over months/years).
  • Accessibility: Readily available from ACME’s support channels but may require extraction.

Pros:

  • Authentic: Reflects real customer language (e.g., slang, urgency) and agent responses.
  • High volume: Provides ample data for training.
  • Domain-specific: Matches ACME’s products and support style.

Cons:

  • Messiness: Contains typos (“plese”), slang (“broke”), or inconsistent phrasing. Call transcripts may have speech-to-text errors (e.g., “firmware” as “fire alarm”).
  • Noise: Includes irrelevant chit-chat (e.g., “Hope you’re having a great day!”).
  • Bias: May over-represent complaints or frequent customers.
  • Privacy: Contains sensitive data (e.g., names, order IDs), needing anonymization for GDPR/CCPA compliance.
  • Cost: Extracting and preprocessing large logs can be time-intensive.

Business Consideration: Logs are a rich, authentic source but require significant cleaning to avoid “crap out” (e.g., a chatbot misinterpreting slang, increasing escalations). Choosing logs means investing in preprocessing tools like AWS Glue.

3.1.2    Customer Support Knowledge Base and FAQs

Description: ACME’s internal documentation, such as FAQs, help articles, or troubleshooting guides, typically hosted on a support portal or CRM system. Provides structured answers to common queries.

Examples:

  • FAQ: “Q: How do I return my ACME gadget? A: Returns accepted within 30 days with proof of purchase.”
  • Help Article: “Troubleshooting Gadget: Restart by holding power button for 10 seconds.”
  • Policy: “Warranty: 1 year for manufacturing defects, contact support for claims.”

Characteristics:

  • Format: Semi-structured text (e.g., Q&A pairs, bullet points), often in HTML or PDF.
  • Volume: Moderate (dozens to hundreds of articles).
  • Accessibility: Easily accessible from ACME’s support site, but may need scraping or reformatting.

Pros:

  • Authoritative: Ensures policy-compliant, accurate responses.
  • Structured: Easier to process than logs, reducing cleaning effort.
  • Focused: Covers common queries, ideal for basic chatbot tasks.

Cons:

  • Outdated Content: May not reflect recent product updates (e.g., old warranty terms).
  • Inconsistency: Varied terminology across articles (e.g., “reset” vs. “reboot”).
  • Limited Scope: Misses edge cases (e.g., rare technical issues), risking underfitting.
  • Static: Lacks conversational tone, needing augmentation with logs.

Business Consideration: FAQs are a cost-effective starting point but require updates and supplementation to avoid a chatbot giving outdated or incomplete answers, costing customer trust. Choosing FAQs means prioritizing regular maintenance.

3.1.3     CRM and Ticketing System Data

Description: Records from ACME’s CRM, including customer profiles, ticket details (queries, resolutions), and escalation notes. Provides context on customer history and issue patterns.

Examples:

  • Ticket: “Customer ID: 456, Issue: Gadget not charging, Resolution: Sent replacement cable.”
  • CRM Note: “Customer prefers email, made 3 purchases in 2024.”
  • Category: “60% of tickets are about delivery delays.”

Characteristics:

  • Format: Structured (e.g., fields for issue, resolution) but with unstructured notes.
  • Volume: High (hundreds to thousands of tickets).
  • Accessibility: Available in CRM, but may require API extraction or manual export.

Pros:

  • Contextual: Enables personalized responses (e.g., referencing past purchases).
  • Pattern-rich: Shows frequent issues (e.g., delivery delays), guiding chatbot focus.
  • Structured: Easier to parse than logs for key fields (e.g., resolution).

Cons:

  • Incompleteness: Notes may be vague (e.g., “Issue fixed”) or missing.
  • Bias: Over-represents problematic cases, skewing chatbot priorities.
  • Privacy: Contains personal data, needing anonymization.
  • Integration: Proprietary formats may complicate extraction.

Business Consideration: CRM data enhances personalization but requires effort to fill gaps and remove bias, or the chatbot may ignore non-ticket queries, increasing escalations. Choosing CRM data means investing in integration tools.

3.1.4     Product and Service Information

Description: Detailed data on ACME’s products (e.g., gadgets) and services (e.g., warranties, returns), sourced from internal databases, websites, or catalogs. Supports product-related queries.

Examples:

  • Product Database: “ACME Gadget X: $199, 16GB RAM, 1-year warranty.”
  • Policy Page: “Returns: 30 days, contact support with order number.”
  • Inventory: “Gadget X in stock at Store A, out of stock online.”

Characteristics:

  • Format: Structured (e.g., database tables, JSON) or semi-structured (e.g., website text).
  • Volume: Moderate (data for each product/service).
  • Accessibility: Available internally, but may need API or web scraping.

Pros:

  • Accurate: Ensures correct product details (e.g., warranty terms).
  • Dynamic: Can integrate real-time data (e.g., inventory via APIs).
  • Essential: Directly supports queries like “What’s the gadget’s price?”

Cons:

  • Dynamic Changes: Frequent updates (e.g., price changes) require real-time sync.
  • Complexity: Technical specs need simplification for conversational responses.
  • Fragmentation: Spread across systems (e.g., website, ERP), needing consolidation.

Business Consideration: Product data is critical for accurate answers, but risks outdated responses without real-time integration, potentially losing sales. Choosing this source means prioritizing system connectivity.

3.1.5     Customer Feedback and Surveys

Description: Feedback from ACME’s customers via surveys, reviews (e.g., website, social media), or post-interaction forms, capturing sentiments, issues, or suggestions.

Examples:

  • Survey: “Support was slow but resolved my gadget issue.”
  • Review: “ACME gadget is great, but delivery took 10 days.”
  • Feedback Form: “Suggestion: Add phone support for urgent queries.”

Characteristics:

  • Format: Unstructured (free-text) or semi-structured (ratings + comments).
  • Volume: Low to moderate (hundreds of responses).
  • Accessibility: Collected via forms or scraped from public platforms (e.g., Twitter).

Pros:

  • Sentiment-rich: Trains the chatbot to handle emotions (e.g., calming upset customers).
  • Issue-focused: Highlights pain points (e.g., slow delivery), guiding improvements.
  • Customer-centric: Reflects real user experiences.

Cons:

  • Unstructured: Varied phrasing (e.g., “slow” vs. “took forever”) needs NLP.
  • Bias: Over-represents negative feedback, skewing tone.
  • Sparsity: Limited volume misses broad scenarios.
  • Noise: Includes irrelevant comments (e.g., “Love your ads!”).

Business Consideration: Feedback adds emotional intelligence but requires parsing and balancing to avoid a negative-toned chatbot, risking customer alienation. Choosing feedback means investing in NLP tools.

3.1.6     Social Media Interactions

Description: Public or private customer interactions on ACME’s social media (e.g., Twitter, Facebook), including comments, DMs, or posts about support issues.

Examples:

  • Tweet: “@ACME Why is my gadget’s battery draining so fast? 😣”
  • DM: “Customer: My order’s late. ACME: Please share your tracking number.”
  • Comment: “Love my ACME gadget, but setup was tricky.”

Characteristics:

  • Format: Unstructured text, often short and informal.
  • Volume: Moderate (hundreds to thousands of posts/DMs).
  • Accessibility: Publicly available or via platform APIs, but scraping may be needed.

Pros:

  • Real-time: Captures current customer sentiments and issues.
  • Informal: Reflects casual language (e.g., emojis, slang), training conversational tone.
  • Public Insight: Shows common complaints visible to others.

Cons:

  • Noise: Includes irrelevant posts (e.g., “Great ad!”) or spam.
  • Bias: Amplifies vocal or negative customers.
  • Privacy: Public data needs ethical handling; DMs require consent.
  • Volume Limits: May not provide enough data alone.

Business Consideration: Social media adds a modern, conversational edge but requires heavy filtering to avoid skewed or noisy data, risking off-tone responses. Choosing this source means investing in scraping and cleaning tools.

3.1.7    Synthetic data

AI has another trick up its sleeve: synthetic data. The principle is quite simple: we use an AI system to analyse the data we have and generate new data from it. A well-known way to do that is to use Generative Adversarial Networks (GANs) or language models to mimic real-world data without using actual customer or sensitive information. This would be the topic of a paper of its own. Just remember that we can in fact generate data to train a model. This has a cost and requires some know-how as well. It is a project of its own.

3.1.8    We need to make choices

Having a lot of different sources is great; better to have choices than to rely on a few options. Nonetheless, it is unlikely we will use every source for our project.

First of all, each source presents strengths and weaknesses. A log will offer facts and authenticity, while a good-quality FAQ will provide accuracy and relevance. They also present challenges, as the log can contain irrelevant information, and the FAQ might be incomplete. Perfection would be great, but it will not be the daily life of the AI data engineer.

Cost will often be a decision maker. Logs are cheap to collect but labour-intensive, i.e. expensive, to clean. Quantity can be a curse if quality is not paired with it. The FAQ is cheap but will not be updated anytime soon and can be stuck in the past.

Quality will vary and it is a major decision factor. Crap in, crap out! The variability, bias and noise contained in the data must be taken into account. Even if the data is cheap to process and available in quantity, if it is all biased, we still have a big problem.

Scalability is also important. Often internal data are relevant but scarce while external data might be available in quantity but less relevant to our business case.

The level of preparation required varies a lot by data source. All sources require cleaning, labelling and balancing, but a FAQ requires far less cleaning than a log.

All in all, choices must be made and a strategy must be created to select the data and have a long-term plan for the project. But whatever your strategy is, never forget to prioritize relevance. Without relevance, all your efforts are lost.

A typical strategy might start with logs (high volume), FAQs (accurate), and CRM data (contextual), supplemented by synthetic data for the gaps. This mix ensures a quality dataset. But raw data is still unwashed. Next, we clean it.

3.2     Cleaning

3.2.1    What is Data Cleaning?

Data cleaning is the process of removing or correcting errors, noise, and inconsistencies in raw data to prepare it for ML. It is a critical step because data are very often messy: typos, outdated FAQ entries, and so on. If we do not clean them, we will create a chatbot that provides wrong answers, increases escalations instead of decreasing them, and ultimately requires costly re-training. In fact, dirty data could even prevent proper training from happening by provoking divergence instead of convergence. The data must be consistent to produce consistent results. Clean data can even reduce the cost of training by requiring fewer epochs.


3.2.2    Data cleaning tasks

  • Removing noise: eliminate irrelevant information. For instance, using a chat log, we would remove the “How are you today” or “I’m a big fan of your products” parts of the conversations.
  • Standardizing formats: Fix inconsistencies like variations of data format in a CRM, or the use of several different words to express the same thing (like “reboot”, “reset”, etc.).
  • Correcting errors: from typos to speech-to-text errors from support calls (“firmware” becoming “fire alarm”).
  • Handling missing data: Vague or missing information can be replaced by accurate and useful information. “Issue fixed” could become an explicit description of what has been fixed and how.
  • Anonymizing sensitive data: Remove or mask personal information, both for legal compliance and to prevent the system from giving out personal information in its answers (“We did it that way for Joe Blog when his company got into financial trouble”. You get the idea.)
  • Filtering outdated content: Discard old irrelevant data about the n-th previous versions of your product, irrelevant workarounds, fixed bugs and so on.

3.2.3    Cleaning data techniques

  • Manual cleaning: of course, manual cleaning done by humans can always work. It is labour-intensive and not as good as we would wish: humans are prone to errors and bias, and can also miss information. They will not necessarily be as thorough as automated cleaning. Nonetheless, it can be the solution for complex tasks that are difficult to define for a machine.
  • Automated cleaning:

Text processing: use Natural Language Processing (NLP) to fix typos, standardize terms, or remove noise.

Data profiling: identify inconsistencies or missing values.

Anonymizing tools: mask sensitive data automatically.

If we were looking for tools in the AWS ecosystem, we could use AWS Glue, Amazon Comprehend and Amazon SageMaker to automate these tasks. SageMaker Clarify would help us validate the job done, using statistical checks.
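
To make the idea tangible, here is a deliberately simplified sketch of automated cleaning on chat-log lines. The synonym map, noise patterns and masking rule are invented for illustration; a real pipeline would rely on proper NLP tooling such as the services above.

```python
# A simplified sketch of automated cleaning on chat-log lines. The rules,
# synonym map and regexes are illustrative assumptions, not a production
# pipeline (real projects would use NLP tooling for this).
import re

SYNONYMS = {"reboot": "reset", "restart": "reset"}   # standardize terminology
NOISE = re.compile(r"(hope you.?re having a great day|how are you today)[!.]?",
                   re.I)
ORDER_ID = re.compile(r"\border\s*#?\d+\b", re.I)    # naive identifier masking

def clean_line(text: str) -> str:
    text = NOISE.sub("", text)                       # remove chit-chat noise
    text = ORDER_ID.sub("[ORDER_ID]", text)          # anonymize identifiers
    words = [SYNONYMS.get(w.lower(), w) for w in text.split()]
    return " ".join(words).strip()

print(clean_line("Hope you're having a great day! Please reboot order #4521"))
# -> "Please reset [ORDER_ID]"
```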

3.2.4    Cleaning data challenges

  • Over-Cleaning: Removing too much data (e.g., filtering valid but rare queries) can reduce dataset size, risking underfitting.
  • Cost Trade-offs: Manual cleaning is accurate but expensive; automated tools like Glue save time but require setup.
  • Bias Preservation: Cleaning must avoid amplifying bias (e.g., keeping only positive feedback skews the chatbot’s tone).

3.3     Labelling

Definition: Labelling is the process of assigning meaningful tags or categories (called labels) to cleaned data to define what it represents, enabling supervised machine learning (ML).


In other words, we attach meaning to the data we have, for the system to learn from it. Imagine you are trying to teach a child to recognise dangerous animals in the wild. You would show pictures of animals, and you would say: “Dangerous” or “Friendly”. That would be labelling images for your child. From there, the child could learn and possibly recognise animals that look dangerous or friendly. So, you show a lion as dangerous, then a tiger as dangerous, then a puma, and hopefully, when the cheetah shows his face, the child will guess dangerous. That is the general idea of labelling data. You would do this labelling on clean data rather than messy data, of course, because if we show pictures of cars as dangerous and clouds as friendly in the list, it will confuse the child’s understanding of the animal kingdom. So, we’d rather do that on a clean set of images that has been tailored for the purpose of the training.

3.3.1    Labelling levels

But of course, in our previous example we used a binary type of labelling, which is easy to set up. In our chatbot, we could label events as “Technical issue”, “After-sale issue”, “Repair request”, and so on.

With such labelling, the system could miss important nuances. Just as an animal can range from “absolutely dangerous”, like a wild bear, to “fairly dangerous”, like a fox, a technical issue can be minor, medium or major, requiring very different reaction speeds from the technical team. We could perfectly well add nuances to the labels to address this issue. And of course, there is a cost to this decision.

For a start, the labelling effort is clearly much higher: for each data entry, we must assess the level of urgency. This is likely to require human intervention. Then, there is of course the bias that the human can add to the label. One human could consider a fox very dangerous, while another one, more used to dogs, will find it mildly dangerous. The same applies to the level of urgency required to treat a technical issue.

So, as always, every improvement has a time cost and a financial cost, plus some unforeseen costs like bias.
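
To illustrate what such nuanced labels could look like in practice, here is a small, invented example of labelled chatbot data carrying both an intent and a severity label; the queries and label names are assumptions for the sake of the example.

```python
# Invented labelled examples for the chatbot. Each entry carries an
# "intent" label plus a "severity" label to capture the nuance above.
labelled_data = [
    {"query": "My gadget won't turn on at all",
     "intent": "technical_issue", "severity": "major"},
    {"query": "The screen flickers once in a while",
     "intent": "technical_issue", "severity": "minor"},
    {"query": "How do I return my gadget?",
     "intent": "after_sale_issue", "severity": "medium"},
    {"query": "My gadget arrived with a cracked case",
     "intent": "repair_request", "severity": "major"},
]
```

Note how the second label dimension doubles the number of decisions each labeller must make, which is exactly where the extra time, financial and bias costs come from.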

3.3.2    Choosing the labels wisely

Definition: Choosing labels involves deciding which categories or intents best represent the cleaned data’s meaning, a strategic and creative task that balances specificity, coverage, and fairness.

This is the part where understanding the real purpose of the AI model is essential. Labelling things for the sake of labelling is not going to achieve much, no more than adding clouds to a system dedicated to recognising animals. So, labelling is a crucial element of the AI system design. There are challenges along the way. Let’s have a look at them.

Subjectivity: Different labellers may interpret queries differently (e.g., “My gadget’s weird” as “complaint” or “technical issue”), risking inconsistency.

Ambiguity: Vague data (e.g., the CRM’s “Issue fixed”) makes intent unclear, requiring judgment or additional context.

Granularity: Too few labels (e.g., “positive”/“negative” for feedback) oversimplify, causing underfitting; too many (e.g., 50 intents) confuse the model.

Bias: Over-representing common queries (e.g., complaints in logs) skews labels, making the chatbot ignore rare but valid queries.

Scale: Labelling thousands of logs is time-intensive, needing automation.

Cost: Manual labelling is accurate but expensive; automated tools reduce costs but risk errors if not validated.

To avoid these risks, we can create clear guidelines for the project, use bias-detection tools and, of course, use label validation tools to cross-check labels against a subset of the data to ensure consistency and fairness.
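
One concrete way to cross-check consistency is to have two labellers tag the same subset and measure their agreement, for instance with Cohen’s kappa. The sketch below assumes scikit-learn, and the labels are invented.

```python
# Cross-checking labelling consistency: two labellers tag the same
# subset, and Cohen's kappa measures their agreement (scikit-learn).
# The labels are invented; a kappa well below ~0.8 usually means the
# labelling guidelines need tightening.
from sklearn.metrics import cohen_kappa_score

labeller_a = ["complaint", "technical_issue", "complaint", "repair_request"]
labeller_b = ["complaint", "complaint", "complaint", "repair_request"]

kappa = cohen_kappa_score(labeller_a, labeller_b)
print(f"Inter-annotator agreement (kappa): {kappa:.2f}")
```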

3.4     Balancing


Balancing the training dataset is about fairly representing the real world. In the case of our chatbot, we want to make sure that the data covers all query types. For instance, we want to include complaints but also praise; over-representing one would not be beneficial for the fairness of our system. We do not want to ignore rare situations either. In other words, the dataset used for the training should be a fair representation of the business in the real world. And like every attempt at fairness, it is easier said than done.

It is easy to say that we want to include rare cases, or ensure all types of queries are represented, but what are the right proportions? What if the available data is not balanced in the first place? We could imagine that all queries related to broken widgets went through the phone and that we have no real traces of those conversations. We know they exist but cannot use them for training. This could lead to an over-representation of the software complaints, which happen in writing. In cases like that, we may have to use synthetic data, created by an AI tool (like a GAN) to simulate the data we do not have. Creating synthetic data is also called “augmenting”. This is the topic of the next section.
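
To make the simplest form of balancing concrete, here is a minimal oversampling sketch, assuming scikit-learn and the record structure from the labelling example above; the strategy itself is an illustrative assumption, not a prescribed method.

```python
# A minimal balancing sketch: oversample under-represented query types
# until every intent is as frequent as the largest one. Real projects
# might prefer class weights or synthetic data (see next section).
from collections import Counter
from sklearn.utils import resample

def oversample(records, key="intent", seed=42):
    counts = Counter(r[key] for r in records)
    target = max(counts.values())  # match the most frequent intent
    balanced = []
    for intent in counts:
        group = [r for r in records if r[key] == intent]
        balanced.extend(resample(group, replace=True,
                                 n_samples=target, random_state=seed))
    return balanced
```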

3.5     Augmenting

GANs (Generative Adversarial Networks) are an AI-driven method to create synthetic data, used when real data is scarce.

Principle: Two AI systems are pitted “against” each other. One creates data, and the other verifies whether they are believable. The generator aims to produce data that is indistinguishable from real data, while the discriminator tries to identify whether the data is real or generated. The adversarial nature of the system works the following way: the generator tries to fool the discriminator, while the discriminator tries to correctly classify the data. This process continues until the generator can produce data that the discriminator cannot distinguish from real data.
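
As a rough illustration of this loop, here is a bare-bones sketch assuming PyTorch and an already-numeric encoding of the data; the layer sizes and hyperparameters are arbitrary placeholders, not a working chatbot-data GAN.

```python
# A bare-bones sketch of the adversarial loop, assuming PyTorch and an
# already-numeric encoding of the data. Sizes and learning rates are
# arbitrary placeholders; real tabular/text GANs need far more care.
import torch
import torch.nn as nn

DIM, NOISE_DIM, BATCH = 16, 8, 64
gen = nn.Sequential(nn.Linear(NOISE_DIM, 32), nn.ReLU(), nn.Linear(32, DIM))
disc = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)

real = torch.randn(BATCH, DIM)  # stand-in for real, encoded queries

for step in range(200):
    # 1. Train the discriminator to tell real data from generated data.
    fake = gen(torch.randn(BATCH, NOISE_DIM)).detach()
    d_loss = (loss_fn(disc(real), torch.ones(BATCH, 1)) +
              loss_fn(disc(fake), torch.zeros(BATCH, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2. Train the generator to fool the discriminator.
    fake = gen(torch.randn(BATCH, NOISE_DIM))
    g_loss = loss_fn(disc(fake), torch.ones(BATCH, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```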

Benefits: GANs can create rare or missing queries, for instance our phone-based interactions, and produce their equivalent in a processable format. They are also scalable and cost-effective: once the system is in place, it can produce data cheaply.

Limitations: Without the proper expertise, such a system can produce totally out-of-control data. The quality of the output depends on the quality of the GAN setup. So, while it is indeed scalable, reaching the point where you can produce data cheaply might not be cheap or easy at all.

Globally, GANs are a tool worth knowing about, and very handy when the project depends on the availability of data we do not have in large enough quantity.

4      Where is the Intelligence?

In our Hunt for Intelligence, we are looking for places where AI is a serious contender for replacing human intelligence. In this chapter about data quality, we have seen no sign of intelligence that could confirm this idea. What we have seen is that we organise data in a way that lets the system identify patterns in it. Should the data be skewed, biased or unbalanced, the whole system is at risk. The quality of the data used to feed the AI system is essential to produce a quality system. In other words, Garbage In – Garbage Out!
