
#10 Artificial Intelligence: Training going wrong
Summary
Topic: AI Architecture: Neural Network Training & Fitting
Summary: Building on Paper #9’s training of a smell detector using supervised learning and backpropagation, this paper explores why the model might fail in practice: underfitting (too simplistic), overfitting (too rigid), and the underlying issues of bias and variance. We examine how intent shapes the outcome: ‘edible’ smells differ in survival versus civilized contexts, just as ‘cat’ detection varies for pet feeding versus rodent poisoning. Through examples, we show how underfitting leads to random guesses (e.g., feeding rats), while overfitting causes oversensitivity (e.g., skipping shorthairs). We introduce bias (consistent errors) and variance (prediction variability) as the culprits, offering insights into fixing these issues with better data and intent alignment.
We recommend reading the previous articles in the series for an easier understanding.
Keywords: AI; Neural network; Machine Learning; Deep learning; Neuron; Synapses; Layers; Backpropagation; Epoch; AI Training; Underfitting; Overfitting; Data quality
Author: Sylvain LIÈGE
Note: This Paper was NOT written by AI, although AI might be used for research purposes.
1 Introduction
In our previous paper (#9), we presented how to train an AI model using supervised learning. We explained the principle of going from a model that knows nothing to a model that can predict whether a smell is edible or not. In practice, of course, it does not always work immediately. The training can go wrong even though our validation and testing phases seem to have worked properly, as we will see with issues like underfitting and overfitting. Indeed, what do we do when, in practical use, the model behaves badly and does not recognise edible smells correctly? What can go wrong and how can we fix it?
This is the topic of this paper in which we will cover in particular two well-known phenomena: underfitting and overfitting.
2 A well-trained model starts with a proper intent
When we train our model, we define an objective. Like any expert, an AI model is specialised in doing one thing very well, but just as for humans passing an exam, we need to define what “very well” means.
Our first reaction to the question of a well-trained model would be to say: “We want the model to assess the edibility of a smell.” Or “We want the system to recognise a cat.”
Then, the IT guy in me who has seen software projects for too long will say: “Define ‘edible’” or “Define ‘cat’”. Let me explain…
Imagine that our “Edible detection machine” is a survival tool, something adventurers take with them to survive in hostile or unknown environments; a tool that is designed to manage a level of acceptable risk. In a case like this, mouldy bread or cheese that is well past its prime might still be a very decent option compared to an unknown mushroom. Obviously, if our machine is for mundane situations, such as detecting whether food has gone off before cooking at home, the behaviour should be different. In civilization, we’re picky. In the wild, we’re pragmatic.
The problem remains the same for a cat detector. If the purpose is to distribute food to a pet cat and control its weight, or instead to distribute poison to get rid of rodents, obviously the model should behave differently. The lives of animals are at stake, and controlling weight is very different from killing pests. In both cases we still need to detect cats, but the intent is different.
So, a ‘well-trained’ model isn’t just accurate—it’s accurate for the job. Misdefine the job, and no amount of training saves you.
3 Essence of the problem: Pattern recognition: a matter of life and death
What we are really doing when we train our AI model is recognising patterns. We say, generally speaking, “If you get something ‘like’ this, then the correct decision is this; if you get something like that, the correct decision is that.”
Ultimately, and we’ll come back to that topic in more detail, an AI system like the one we are building is essentially a probabilistic system. It is making a mathematical decision based on the past, hoping that the future is indeed like the past. This is why we use the term pattern recognition. By using the word “recognition” we clearly feel that it can go wrong. Our brain is in fact a pattern recognition machine. Its job is to match what it is given with things it knows. And our brain is the ultimate machine for that job. This is how we make split-second decisions thousands of times a day. Every time you cross a street, your brain is checking whether the context matches a safe, known one. If yes, you cross; if not, you stay. And if your pattern recognition goes wrong, you die. Needless to say, you want it to go right more often than not. Of course, it is the same mechanism for food recognition, assessing whether someone is dangerous in the street, and in fact almost everything we do in life. Patterns, patterns, patterns! Do we recognise what is submitted to us? And how do we get good at that game? Experience, experience and experience!
This is exactly what we do when we train an AI system. We give it as much experience as we can to recognise something new as matching something we already know.
Now, this is all good, but of course, our training can go wrong. If one spends years learning how to cross the tiny streets of a village, then when confronted with crossing a highway, one can make a very bad decision indeed. This is not because this person is stupid; it is because the situation does not match anything they know. In other words, put me, an urban guy, in the middle of the Amazon forest for a day and my chances of survival are close to zero. My brain power has nothing to do with it. Neither do my pattern recognition skills, unless they have been trained with the right data.
The phenomena we are now about to study are exactly that: what happens when the training is not adequate? What can go wrong? How do we recognise that the training is wrong? How do we know in which direction we must correct it?
These issues have names: Underfitting, Overfitting, Variance and Bias. Let’s have a look at them.
4 The Data Sets
In our previous paper (#9) about training the model, we fed the model with a set of data. We used supervised learning, i.e. the data were tagged and we could calculate a Loss for each training example.
Once we have run a number of epochs (complete passes through the training data) that we believe is enough to train our model, we need to verify that the model is working fine. To do so, we organise the available data into separate sets.
We will split the data into 3 subsets: the Training set, the Validation set and the Testing set. We will use about 80% of the available data for training, 10% for validation and the remaining 10% for testing. In most cases, and especially when we have a large data set available, the split between the 3 subsets is done at random.
Random splitting is common because it ensures the sets are unbiased samples of the data, but in high-stakes cases—like our survival smell detector—we might carefully select validation and testing data to include a wider range (e.g., toxic mushrooms, new cat breeds) to better detect issues.
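As a minimal sketch of that 80/10/10 split (assuming Python and scikit-learn, and using made-up placeholder arrays rather than real smell data), it could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 smells, each described by 4 features
# (e.g. musty, floral, intensity, duration), labelled 1 = edible, 0 = not.
rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] < 0.5).astype(int)  # arbitrary labels, just for the sketch

# First set aside 20% that the training will never see...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# ...then cut that 20% in half: 10% validation, 10% testing.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=True, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```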
The training set is the set of data used during the backpropagation process described in Paper #9. It contains the knowledge that the model is trying to acquire.
4.1 Validation Set
Purpose: The validation set is used during training to monitor the model’s performance on unseen data. Based on the performance we will tune hyperparameters (e.g., number of layers, learning rate, epochs), and detect issues like overfitting or underfitting early. It acts as a checkpoint to guide the training process.
Timing: Validation is performed iteratively, often after each epoch (Paper #9). For example, after training the smell detector for one epoch, you might check its validation accuracy and compare it to training accuracy.
Response to Poor Performance: If validation performance is unsatisfactory (e.g., training accuracy 95%, validation accuracy 60%), you go back to training to adjust the model.
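Continuing the same placeholder sketch (reusing X_train, y_train, X_val and y_val from the split above), the per-epoch check might look like this; here scikit-learn’s MLPClassifier with warm_start is used so that each call to fit adds one more pass over the training data, but the exact mechanism depends on your framework:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# warm_start=True + max_iter=1 means every fit() call continues training
# for one more epoch instead of starting from scratch.
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1,
                      warm_start=True, random_state=0)

for epoch in range(50):
    model.fit(X_train, y_train)  # one more pass over the training set
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))

    # A large gap (e.g. train 95%, validation 60%) is the validation set
    # warning us about overfitting; both numbers staying low suggests
    # underfitting. Either way, we go back and adjust the model.
    print(f"epoch {epoch:2d}  train={train_acc:.2f}  val={val_acc:.2f}")
```

(With max_iter=1, scikit-learn prints convergence warnings; they are harmless in this sketch.)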
4.2 Testing Set
Purpose: The testing set is used after training is complete to provide a final, unbiased evaluation of the model’s performance on completely unseen data. It estimates how the model will perform in the real world, without further tuning.
Timing: Testing is performed only once, after you’ve finalized the model based on validation.
Response to Poor Performance: If testing performance is unsatisfactory (e.g., testing accuracy 50%), you don’t go back to training directly—at least not in the same way as with validation. The testing set is meant to be a final check, so poor performance indicates a deeper issue with the model development process.
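Still within the same placeholder sketch, the testing set is used exactly once, after the architecture and hyperparameters have been frozen based on validation:

```python
# Final, one-off evaluation on data the model has never influenced.
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"final test accuracy: {test_acc:.2f}")
# If this number is poor, we revisit the data or the model design;
# we do not keep re-running the test until it happens to look good.
```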
5 Underfitting
Underfitting is when the model’s too simple or half-baked to catch the pattern. It’s a lazy learner. It pretends the world is simpler than it really is. It becomes the world of maybes.
What would it look like in our 2 examples, the edible food detector and the cat feeder?
5.1 Pet Feeder Mode Feeding System
Behaviour: The model barely distinguishes cats from rodents, or anything furry. It might feed everything (cats, rats, even a squirrel) or nothing at all, with accuracy hovering around 50% (like a coin flip).
Example: Trained on 50 cat pics, it gets only 60% right—misses some tabbies, feeds half the rats. Test it on a new shorthair cat or a mouse? Feeds both—or neither.
Why: It didn’t learn “cat = fluffy, whiskers, big eyes.” It’s stuck on “furry blob = maybe feed,” too vague to be useful.
Real-World Impact: Your pet starves while the neighbourhood rats feast. “It’s like a buffet for pests!”
5.2 Rodent Poison Mode Feeding System
Behaviour: Fails to spot rodent traits (e.g., long tails, small size). It might poison everything (rats, cats, even a passing dog) or nothing, again near-random.
Example: Trained on 50 rat pics, it poisons only 55% correctly, misses some rats, hits a kitten. Test it on a new rat breed or a tabby? Poisons both, or spares both.
Why: It didn’t grasp “rodent = skinny, sneaky, tailed.” It’s more “thing moving = maybe poison,” too sloppy to target pests.
Real-World Impact: Either your house stays rodent-infested, or your pet’s in peril. “A poison party gone wrong!”
5.3 Food Detector Scenario
What about our edible food detector? What would be its behaviour in an underfitted model?
5.3.1 Survival Mode Food Detector
Behaviour: Labels most smells as edible (or barely any), failing to prioritise sustenance over danger. Accuracy hovers around 50%, like a guess.
Example: Trained on 50 survival smells (e.g., mouldy bread [musty=5, intensity=0.4] → edible). Gets only 60% right—accepts some mould but also toxic mushrooms, skips a faint root that’s fine. Test on a new earthy smell (e.g., [earthy=3, duration=0.5])? Calls it inedible—or edible—randomly.
Why: It didn’t learn “edible = calories, not poison.” It’s stuck on “smell = maybe okay,” too vague. No survival instinct.
Real-World Impact: You eat poison or starve in the wild. “It’s a jungle buffet—with risks!”
5.3.2 Civilized Mode Food Detector
Behaviour: Can’t tell fresh from spoiled—accepts or rejects everything haphazardly.
Example: Trained on 50 home smells (e.g., fresh bread [floral=2, pleasantness=0.9] → edible). Hits 55% accuracy—rejects some fresh bread, accepts spoiled cheese half the time. Test on a new faint floral smell (e.g., [floral=1, intensity=0.2])? Flip a coin—edible or not.
Why: Missed “edible = fresh, pleasant.” It’s more “smell = food, maybe,” too sloppy for picky eaters.
Real-World Impact: Dinner’s a gamble—spoiled soup or skipped dessert. “A kitchen nightmare!”
5.4 What it means mathematically
Underfitting occurs when the model is too simple to capture the underlying pattern in the data—it’s like trying to fit a straight line to a curvy problem. Mathematically, this often means the model has high bias (section 7.2 below) and isn’t flexible enough to learn complex relationships. The loss (paper #9) remains high even on training data. This often happens when the model has too few layers or neurons (Paper #7), like a neural network with just one layer and three neurons, or when we don’t train it long enough (Paper #9’s 50 epochs might not be enough for a simple model to learn).
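As an illustrative sketch on synthetic data (not the smell detector itself), a network that is too small keeps a relatively high loss even on its own training data, while a more capable one fits the same curve comfortably:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)  # a curvy pattern

# Too simple for the curve: a single hidden neuron.
tiny = MLPRegressor(hidden_layer_sizes=(1,), max_iter=5000, random_state=0)
tiny.fit(X, y)
print("training loss, tiny model:  ",
      mean_squared_error(y, tiny.predict(X)))   # stays relatively high: underfitting

# More capacity: the same data, learned much better.
bigger = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
bigger.fit(X, y)
print("training loss, bigger model:",
      mean_squared_error(y, bigger.predict(X)))
```

The telltale sign is that the tiny model’s loss stays high on the very data it was trained on.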

6 Overfitting
Overfitting happens when the model gets too cozy with the training data, memorizing specifics instead of learning general patterns. It’s like a student cramming for a test, acing it, but blanking on anything new—it’s too rigid, too particular. It pretends the world is exactly like its training set, missing the bigger picture.
What would this look like in our two examples, the edible food detector and the cat feeder?
6.1 Pet Feeder System
Behaviour: The model is obsessed with the exact cats it was trained on, rejecting anything slightly different. It feeds only the “perfect” cats from its training set, ignoring others that are still cats.
Example: Trained on 50 cat pics—all tabbies with specific fluffiness (e.g., [fluffiness=5, whiskers=3] → feed). It gets 98% accuracy on training, feeding every tabby perfectly. Test it on a new shorthair cat ([fluffiness=3, whiskers=2]) or a Persian? It skips them—doesn’t recognize them as cats.
Why: It learned “cat = tabby, fluffiness exactly 5,” not “cat = furry, whiskers, big eyes.” It’s too tied to training details, like memorizing a script word-for-word but failing at improvisation.
Real-World Impact: Your pet—a shorthair—goes hungry while you wonder why the feeder’s so picky. “It’s a tabby snob, not a cat lover!”
6.2 Rodent Poison System
Behaviour: The model fixates on the exact rodents it was trained on, missing new ones and misclassifying pets as targets. It poisons only the “perfect” training rats, sparing others or hitting the wrong targets.
Example: Trained on 50 rat pics (e.g., [tail_length=6, size=3] → poison), it nails training at 97%—poisons every rat in the set. Test it on a new rat breed ([tail_length=5, size=4]) or a kitten ([fluffiness=4])? It spares the rat (not an exact match) but poisons the kitten (close enough to a rat’s size).
Why: It didn’t grasp “rodent = skinny, sneaky, tailed.” It’s stuck on “rodent = tail exactly 6, size exactly 3,” too rigid to adapt. Paper #9 warned of this: a cat detector trained only on Siamese cats fails on tabbies—it’s the same overfitting trap.
Real-World Impact: Rodents keep sneaking around, but your kitten’s in danger. “A poisoner with a vendetta for the wrong crowd!”
6.3 Food Detector Scenario
6.3.1 Survival Mode Food Detector
Behaviour: The model memorizes the exact survival smells it was trained on, rejecting anything new—even if it’s edible in a survival context. It’s too strict, missing life-saving options.
Example: Trained on 100 survival smells (e.g., mouldy bread [musty=5, intensity=0.4] → edible), it gets 95% accuracy on training—perfect on those smells. Test it on a new earthy root ([earthy=3, duration=0.5]) that’s edible in survival mode? Rejects it—not an exact match to training.
Why: It learned “edible = musty exactly 5,” not “edible = calories, not poison.” It’s too fixated on training specifics, like a survivalist who only eats one type of mouldy bread.
Real-World Impact: You starve in the wild, ignoring a root that could’ve saved you. “A jungle snob—no feast for you!”
6.3.2 Civilized Mode Food Detector
Behaviour: The model sticks to the exact “fresh” smells it knows, rejecting new ones that are still safe, or accepting spoiled ones that match training quirks. It’s too rigid for a picky eater’s needs.
Example: Trained on 100 home smells (e.g., fresh bread [floral=2, pleasantness=0.9] → edible), it hits 96% on training—spot-on for those smells. Test it on a new faint floral smell ([floral=1, intensity=0.2]) or spoiled cheese ([musty=4])? Skips the floral (not floral enough) but might accept the cheese if it matches a training quirk.
Why: It didn’t learn “edible = fresh, pleasant.” It’s stuck on “edible = floral exactly 2,” too narrow for a civilized kitchen.
Real-World Impact: You miss a perfectly good dessert, or worse, serve spoiled cheese at dinner. “A kitchen critic with no taste for variety!”
6.4 What it means mathematically
Overfitting occurs when the model is too complex, fitting the training data too closely, including noise—it’s like memorizing a textbook without understanding the bigger picture. You learn all the dates in the History lesson but fail the exam when asked what drove the world to war. Mathematically, this often means high variance (Section 7.1 below) and a model that is too flexible. In Paper #9, our smell detector’s loss would be very low on training data (e.g., 95% accuracy), because the model fits every detail, even quirks (e.g., [musty=5] → edible). But in validation, the loss spikes because the model can’t generalize. This is often due to too many layers or neurons (Paper #7), too many epochs (Paper #9), or a training set with low variety.
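Again as an illustrative sketch on synthetic data: a very flexible network trained on a tiny, noisy set can drive its training loss close to zero while doing much worse on unseen validation data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X_small = rng.uniform(-3, 3, size=(20, 1))            # a tiny, noisy training set
y_small = np.sin(X_small).ravel() + 0.3 * rng.normal(size=20)
X_unseen = rng.uniform(-3, 3, size=(200, 1))          # plenty of unseen data
y_unseen = np.sin(X_unseen).ravel() + 0.3 * rng.normal(size=200)

# A very flexible network, no regularisation, trained hard on 20 points.
flexible = MLPRegressor(hidden_layer_sizes=(100, 100), alpha=0.0,
                        solver="lbfgs", max_iter=5000, random_state=0)
flexible.fit(X_small, y_small)

print("training loss:  ", mean_squared_error(y_small, flexible.predict(X_small)))
print("validation loss:", mean_squared_error(y_unseen, flexible.predict(X_unseen)))
# Typically the training loss is tiny while the validation loss is much
# larger: the model has memorised the 20 points, noise included.
```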

7 Bias and Variance
7.1 Variance
Variance is a concept whose name speaks for itself. It expresses the variability of the results around a centre. So, if you are playing darts, you would regularly hit the board around the same zone. This zone can be the centre of the dart board, but not necessarily. Variance does not purely define how good you are at reaching the centre; it defines how good you are at reaching the same zone.
Variance is how sensitive the system is to the context. If you are playing darts, it would mean you are influenced by the music playing, the speed of the wind, the weight of the darts and, of course, the number of beers you have swallowed. With a low variance, you would be fairly impervious to these various elements and your darts would always land in a small zone on the board.
High variance often stems from a combination of model complexity (e.g., too many layers) and training data issues (e.g., small size or low variety). Your system is specialised in recognising Siamese cats, and as soon as you submit a Persian, the system goes into guessing mode. Its capability to generalise to all sorts of cats is weak. It is very sensitive to variations among cats and provides results that are not very trustworthy.
In dart terms, we would have a player who is very good with, say, 22-gram darts but unable to play with 25-gram darts. Not a very good player. A low-variance model is steadier, hitting the same spot no matter the distractions, but if that spot is off-target, it’s still no good; we’ll see that with the concept of bias.
High variance can also come from sources that are less easy to fix. In Paper #7, I described the architecture of a neural network. This architecture, if you remember, can create a more or less “intelligent” system. We can add layers and make the layers more or less rich in neurons. And of course, we can train with more or less data, more or less varied, more or less relevant. So, in fact, a lot of things can go wrong.
We usually identify a high-variance problem at the validation stage, or even the testing stage. But this assumes that the data used for these phases vary enough from the training data; otherwise, the problem would not show.
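One way to see variance empirically is to train the same architecture several times on different random samples of the same underlying problem and measure how much its predictions at one fixed input spread out. A minimal sketch, again on synthetic data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)

def fresh_sample(n=40):
    """A new noisy sample of the same underlying pattern (a sine curve)."""
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X).ravel() + 0.3 * rng.normal(size=n)
    return X, y

x_probe = np.array([[1.0]])          # one fixed input we keep asking about
predictions = []
for _ in range(20):                  # 20 models, 20 different training samples
    X, y = fresh_sample()
    model = MLPRegressor(hidden_layer_sizes=(50, 50), alpha=0.0,
                         solver="lbfgs", max_iter=2000, random_state=0)
    model.fit(X, y)
    predictions.append(model.predict(x_probe)[0])

predictions = np.array(predictions)
print("spread of predictions (variance):", predictions.var())
print("average prediction:", predictions.mean(), " true value:", np.sin(1.0))
# A wide spread means the model is very sensitive to which training sample
# it happened to see: that is high variance. The gap between the average
# prediction and the true value is the bias, covered in the next section.
```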


7.2 Bias
Bias in AI mirrors how we use the word in daily life—a consistent skew in judgment, like seeing the world through a tinted lens. We say that someone has a political bias when that person analyses everything with the same lens. This is true in many situations in life. Say, for instance, that John works 60 hours per week by default. John may see everyone as lazy, since most people work around 40 hours. That judgement would in fact be “lazy” regardless of the intrinsic laziness of the person being judged. We would then say that John is biased in his judgement, always landing in the “lazy” area of the spectrum.
In AI, we have the same phenomenon when a model predicts everything with the same type of error. Our pet recognition system would, for instance, have a tendency to identify as a cat everything that has fur or is about the size of a common cat. If we have a big rat or a big guinea pig, they become cats. The system is biased towards cats.
Our dart player would have a tendency to hit, say, the top left of the board. He is aiming at the centre of the dart board, but the darts keep ending up in the top left.
That top-left spot is the ‘central zone’—the average of their throws, or the ‘expected prediction.’ Bias is the distance between this central zone and the true target (the bullseye). High bias means the central zone is consistently off-target, like John’s lazy judgments. In AI, we observe this in the validation or testing phases, when the model faces new data and its predictions are systematically skewed.
In other words, a system with no bias has its central zone right on the target: predictions average out to the truth. A cat has the same chance of being correctly recognized as a guinea pig—no systematic skew. But if bias is high, like our dart player stuck in the top left, the model’s predictions are consistently off, no matter how much we train it.
Like variance, bias is observed in validation or testing phases, not training.

7.3 Variance and Bias in summary
In practice, we can have 4 different combinations of Variance and Bias: Variance can be high or low, and so can Bias. With the dart-board image:
- Low Bias, low Variance: the darts are tightly grouped on the bullseye. This is what we want.
- Low Bias, high Variance: the darts are centred on the bullseye on average, but widely scattered.
- High Bias, low Variance: the darts are tightly grouped, but away from the bullseye.
- High Bias, high Variance: the darts are scattered and off-target.
When a model has a low variance and low bias, we say it is Balanced. Mathematically, it would look like this:
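The standard way to write this is the bias-variance decomposition of the expected squared error, where y is an observed value, f is the true underlying function, f̂ is our trained model, and σ² is the irreducible noise in the data:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

A balanced model is one where both the Bias² term and the Variance term are small, so the expected error approaches the irreducible noise floor.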

8 Let’s wrap it up
When we train a neural network, it is in order to achieve a clear objective. The first step is to define this objective, as with our pet feeder versus our rodent poisoner. Both are made to distribute something to animals, but one wants to feed the cats while trying not to feed the squirrels, whereas the other aims at killing the rats while keeping our cats safe. Getting it wrong clearly does not have the same consequences in each case. We will therefore aim at a fairly different behaviour from our AI system.
We then need to build a data set that will be used to train our system. This data set will be split in 3 subsets: the training set, the validation set and the testing set.
The validation set will be used to assess the quality of the training. During this training, 2 main defects can occur. One is to end up with a lazy solution that guesses more often than it makes an educated decision. That is underfitting. The second is when the training gets overzealous and the model performs so well on the training data that it loses any capability to adapt to data that is even slightly different. That is overfitting. Both situations are problematic and generally unacceptable.
We can measure 2 different behaviours of the system when we validate or test it. These 2 measures are the Variance and the Bias.
The Variance tells us how widely the answers are spread around their central zone. In a way, it measures how scattered the system is around its own typical answer. A large Variance is not good. The system cannot be trusted; there is too much imprecision in its answers.
The Bias is another problem. It defines how far the answers are from the target. It is like a bunch of darts all grouped on the board, but not on the bull’s eye.
These situations are, in general, the result of underfitting or overfitting systems.
To fix these issues, we need to either revisit the data we are using for training and/or the underlying infrastructure of the model, like the number of layers, the number of neurons per layer, and other parameters we can tweak.
Ultimately, we want to build a system that learns general patterns from the training data and can handle new data in a way that respects those patterns, the “spirit” of the initial data, so to speak. The concepts we covered—underfitting, overfitting, bias, and variance—are the tools we use to understand and correct the AI model’s behaviour.
9 Where is the Intelligence?
As usual, we ask: where is the intelligence? Once again, I’m afraid my answer is: nowhere. Nowhere in the sense that all we have done from the very beginning of this adventure is create mathematical models of the world. The genius is that we know how to build them, but we are totally unable to do it without the help of mathematics and computers. Interestingly, the mathematics involved in our whole system is fairly basic, but the intelligence, if there is one, lies in combining a lot of simple mathematical principles into a larger “mathematical engine” that is capable of recognising real-world situations and making decisions about them based on what it has learnt in the past, just like a human being would do. Quite amazing!

Sylvain LIÈGE PhD.