
#9 AI Training & Back Propagation

 

1  Introduction

In our previous paper, we presented what is called forward propagation: the process of feeding information into the system and transforming it, step by step, in order to get a result. This is great, but we have a huge problem: to produce the “good” result, all the parameters inside the digital neural network must be correct. I mean here all the “a” and “b” of all the “a.x+b” functions (one per neuron), and all the weights “w” on each synapse. We therefore need to train our system. Training is a concept that appears all the time in the context of AI. Indeed, at its creation an AI system is as dumb as can be.

In this paper we will cover the topic of training and its essential concept of backpropagation. It will require some more mathematical wizardry, some of which we have already seen in our paper about differential calculus.

We will use mathematics in this paper, but you don’t need to be a mathematician to benefit from it.

2  Forward Propagation Recap

One neuron gets a value, x, as an entry point. This value is transformed using a linear function of the form a.x+b. The result of this function is then transformed again using an activation function. The result of this activation is sent to the next neurons via synapses. On each synapse sits a synaptic weight that transforms the number once more.

We have something like:

x ==> y = a.x + b ==> z = Activation(y) ==> t = w.z ==> t is given as entry value to the next neuron.

This is how the system works once it is complete: we give it a problem to solve and it gives a result that we hope is appropriate. But at the start, the system knows nothing. It is as dumb as a stone. So, we need to tweak the various parameters of that system to the correct set of values to get it to produce a correct result. This is done via training, and training relies on a new concept: backpropagation.
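To make the recap concrete, here is a minimal sketch in Python of that chain for a single neuron. The parameter values and the choice of ReLU as activation are illustrative assumptions, not taken from the earlier papers:

```python
# Minimal sketch of one neuron's forward pass (illustrative values).

def relu(y):
    """ReLU activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, y)

def neuron_forward(x, a, b, w):
    y = a * x + b   # linear step: y = a.x + b
    z = relu(y)     # activation step: z = Activation(y)
    t = w * z       # synaptic weight: t = w.z
    return t        # t is handed to the next neuron

# Example: a "spicy" smell encoded as 3, with arbitrary parameters.
print(neuron_forward(x=3.0, a=0.4, b=0.1, w=0.5))  # prints 0.65
```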


Let’s investigate!

3  The Training concept

AI is full of these concepts and actions that can be scary for anyone not aware of what they really mean. We all have heard of AI systems learning stuff, but how on earth does a computer learn? And is it really a new thing?

3.1  Computer’s knowledge of the past


Before AI as we know it, the closest thing to “learning” was expert systems. Picture a computer in the 1970s or 80s, tasked with sniffing out bad food. No neural network, no tweaking weights—just a database and a rulebook. An expert system was like a super-smart librarian: humans (the “experts”) fed it a pile of facts (a database) and a set of “if-then” rules. For our food: 

  • Database: “Spicy = 3, Intensity = 0.5, known rotten.”
  • Rule: “If Spicy > 2 and Intensity > 0.4, then Edible = 0.2.”

The computer didn’t guess and adjust—it looked up the data, matched the rule, and spat out the result: just logic. Was this learning? Not really—it didn’t adapt or improve. It “knew” what humans told it, like a recipe card: “Spicy + intense = bad.” Expert systems ruled fields like medicine (diagnosing diseases) or engineering (fixing machines), but they were static—no tasting new food to refine the rules.
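As a hypothetical sketch (the facts and thresholds below are made up for illustration, not from any real system), the whole “super-smart librarian” fits in a few lines of if-then code:

```python
# Hypothetical expert-system rule for the food example: pure lookup and
# if-then logic. No weights, no training - a human expert typed it all in.

facts = {"spicy": 3, "intensity": 0.5}  # the "database"

def edibility(facts):
    # Rule hard-coded by a human expert, forever static:
    if facts["spicy"] > 2 and facts["intensity"] > 0.4:
        return 0.2  # "mostly inedible"
    return 0.8      # default: "probably fine"

print(edibility(facts))  # 0.2 - the system "knows" only what it was told
```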

There were other tricks, like programming with heuristics—fancy trial-and-error rules. Think chess programs: “If pawn’s here, move there.” Still not learning—just smarter rule sets. Or statistical models—crunching averages, like “70% of spicy meats are rotten.” Closer to learning, but no self-adjusting. Expert systems with databases and rules were king—codified human smarts, not machine discovery.

In other words, the system was only as good as its series of rules, and these rules had to be known by humans from the very start. This is a crucial element for understanding the paradigm shift we are witnessing. In the old times, humans had to do all the thinking, the abstraction, the knowledge design, etc. to create a computer able to make good decisions. Creating a good expert system was difficult and demanded seriously competent human beings to shape it. The big news is: this is true no more! And that, my friends, is thanks to mathematics and a whole bunch of sciences combined with it.

3.2  What is Training for AI?

Training for AI is the magic that gets the world so excited. Indeed, we have devised a way to teach a computer things that we don’t know ourselves. Yes, you read that right: things that we do not know ourselves! This is the most amazing feat you can think of.

As a general concept, training an AI system consists in creating a monstrous mathematical function that mimics the world. I say monstrous because this function is so complicated that it defies our human understanding. The astonishing fact is that we know how to create it, but we are unable to explain how it works in detail, unlike the expert systems of the past.

In this paper I will focus on the easiest training to understand: Supervised Training. The principle is quite simple: we have a bunch of data for which we know what result we are expecting. In our case, say we have a huge database of smells for which we know if they are edible or not. We will enter each smell in the system, check the result provided and, if the result is not good, modify the system to do better next time. It is called “Supervised Learning” because we do know what the result should be: the system guesses, checks the tag, adjusts.

In fact, there are other types of training for an AI system, but we will not cover them here. The best known is “Unsupervised Learning”, where the system learns by itself. Let’s stick to the simplest for now!

4  What are we looking for exactly?

To understand the objective of this paper, we have to re-establish a few important points.

4.1  Mathematical functions predict the future

In our paper about Differential Calculus, we explained that mathematical functions are a way to predict the future. The example we used was the calculation of the time it will take to reach a destination, given the average speed we intend to maintain:

\[ f(\text{distance}, \text{speed}, \text{pauses}) = \frac{\text{distance}}{\text{speed}} + \text{pauses} \]

But in the situation we are in, we know that a function can predict our problem’s solution; the big problem is that we do not know this function.

So, we used differential calculus to “discover” the best function to describe our problem.

This is what we did for predicting the resale price of a car after 10 years, based on its purchase price.

4.2  A digital neural network is a complex combination of functions

I will now ask for your attention while I explain what is going on.

Problem: we have built a system that is a complex combination of mathematical functions. For this paper we will use a small system, but real cases can contain a huge number of them. We have also established that we need weights on the synapses. The problem is therefore to calculate all these a, b and w.

We now need a technique/strategy to find these numbers. Indeed, at the start we would typically initiate these numbers with default values that are extremely unlikely to provide the correct answer. The first step is using what we call Error Propagation.

 

4.3  Good news: let’s simplify!

Real life is very, very complicated. We humans want to understand it and model it in order to make predictions, so we often simplify problems. It is true in economics, for instance, where models can involve a huge number of variables; since we don’t know what to do with them all, we turn many of the variables into constants.

Let’s take an example:

Picture a model predicting chili sales. In reality, sales depend on tons of variables: price (say, $2), demand (how many folks crave spicy), supply (chili harvests), weather (hot summers boost appetites), and more. Each could wiggle—price might drop to $1.50, demand could spike if a food trend hits. A full model might look like:

\[ \text{Sales} = a \cdot \text{Price} + b \cdot \text{Demand} + c \cdot \text{Supply} + d \cdot \text{Weather} \]

Here, a, b, c and d are variables, shifting with market vibes. But that’s a beast to calculate—too many moving parts!

So, economists simplify: swap some variables for constants. Maybe they say, “Weather’s steady this year,” and “Supply’s locked at 100 tons.” The last two terms then stop moving: c · Supply + d · Weather collapses into a single constant, say 50. Now it’s:

\[ \text{Sales} = a \cdot \text{Price} + b \cdot \text{Demand} + 50 \]

Fewer knobs to twist—price and demand still flex, but the rest are nailed down. As we all know, it works ...or not! But we still do it. We have no choice: we cannot handle certain levels of complexity.

It would be great to do the same kind of simplification with our digital neural network, so let’s do it.

What we observe in our network is that the two most important parameters along the workflow are a and w. Both are multipliers. In fact, w is crucial in the system.

In real brains, synapses adapt—stronger connections mean louder signals, weaker ones fade. Our w mimics that: a big w means “Spicy’s a big deal!”; a small w whispers, “Eh, not so much.” If a and w were separate, we’d split that control—the synapse could say “loud” while the neuron says “quiet”. This would complicate our task seriously. Consequently, we will merge them into one and replace a by w. This way we keep the essence of the problem but remove a potentially extremely annoying divergent behaviour between the two.
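In symbols, and under this simplification (my compact restatement, not a formula from the earlier papers), the per-neuron chain collapses from two multipliers to one:

\[ t = w \cdot \text{Activation}(a \cdot x + b) \quad \Longrightarrow \quad t = \text{Activation}(w \cdot x + b) \]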


Our system is now much simpler, and the chances of finding the right values have increased drastically. By doing so, we’ve turned a lively dance into a simpler step, losing some truth but gaining speed. The amount of calculation involved in this new design is vastly smaller.

5  How to train your Network?

5.1  General training strategy

We will now study how we train a network using the Supervised Learning strategy. It consists in putting together a series of data for which we know the expected output. For instance, we have accumulated a huge number of “smells” for which we know if they are edible or not. It is “supervised” because we are able to judge, from the outside, if a result is good or not.

The general idea is the following:

Split the data set into 3 subsets: Training Data (80%), Validation Data (10%), Testing Data (10%).

Take the Training Data and do the following:

  • Enter a smell
  • Get the system to produce its result
  • Compare the result with the one expected and calculate the error
  • Modify each component of the network so that the result for this smell gets closer to the truth next time.

We do this operation with 80% of all the smells. This operation has a name: an epoch. The word hails from the Greek “epokhē”—a pause or fixed point—used by astronomers like Ptolemy as a reference point for tracking stars.

Now, it would be amazing if running the whole epoch only once were enough. Unfortunately, this is unlikely. Tweaking the parameters once for each data point is not enough: every data point pulls the parameters in its own direction, and these pulls must settle into a compromise. We must run several epochs to have a chance of getting it right. How many times? will you ask rightly. Then, I am pleased to tell you: ...it depends! For now, I will summarise that there are 2 main strategies available. The easy one is to decide on a fixed number of runs, like 50; this can work for fairly simple domains of activity. The second one is more complicated and involves ...mathematics again. The idea in this case is to estimate mathematically whether the progress of the whole system is becoming stable, i.e. when we think that we cannot really do better.
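Here is a minimal Python sketch of that loop, using the easy fixed-number-of-epochs strategy. The `model` object with its `forward` and `backward` methods, and the exact 80/10/10 slicing, are assumptions for illustration only:

```python
import random

def train(dataset, model, epochs=50):
    """Sketch of supervised training with a fixed epoch budget (strategy 1)."""
    random.shuffle(dataset)
    n = len(dataset)
    train_set = dataset[:int(0.8 * n)]            # 80% for training
    val_set = dataset[int(0.8 * n):int(0.9 * n)]  # 10% for validation
    test_set = dataset[int(0.9 * n):]             # 10% for testing

    for epoch in range(epochs):      # one full pass over train_set = one epoch
        for smell, truth in train_set:
            guess = model.forward(smell)   # forward propagation
            model.backward(guess, truth)   # backpropagation: nudge every w and b
    return model, val_set, test_set
```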

Anyways, after a certain number of epochs, we stop and try to validate our model.

To validate, we use the Validation Data set.

  • Enter the smells from the Validation Data set.
  • Compare the results with the correct result.
  • Decide if the global result is satisfactory. If not, start from Step 1.

We repeat this loop until the results obtained on the Validation set are satisfactory.

We then take another 10%: the Testing Data, and check the result. Hopefully the Testing Data set gives a good result and the system is properly trained.

5.2  Model initiation

When we create the model, we have to replace all our “w”s by proper numbers. Now, as we have stated, we have no idea what these values should be. Consequently, we will initiate them with random numbers. There are several possible strategies for setting up these numbers, but I believe this is out of our scope. Let’s simply consider that they are random numbers, usually between -1 and 1.

What about the various “b”s? Good news, they get the same diet: a number between -1 and 1 will do, thank you very much!

That was easy! It’s all wrong but it was easy!
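A minimal sketch of that initiation, assuming plain uniform random numbers (real frameworks use fancier initialisation schemes, but that is out of our scope):

```python
import random

def init_parameters(n_weights):
    """Give every w and the b a random start in [-1, 1]: all wrong, all easy."""
    ws = [random.uniform(-1.0, 1.0) for _ in range(n_weights)]
    b = random.uniform(-1.0, 1.0)
    return ws, b

ws, b = init_parameters(3)
print(ws, b)  # e.g. [0.42, -0.87, 0.13] -0.55
```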

5.3  Backpropagation

Before we start, I remind you that our model tests whether a smell is “edible”. The output of our system is a number between 0 and 1: 0 meaning “Don’t even try”, 1 meaning “Enjoy”, and 0.5 meaning “Your bet is as good as mine. Are you a gambler?”.

Imagine you’ve built your digital neural network to sniff out whether food is edible based on its smell—like in paper #7. You feed it data: smell type (spicy = 3), intensity (0.5), and so on. In forward propagation (paper #8), those numbers zip through neurons, get transformed by functions (like w.x + b), tweaked by activation functions (like ReLU), and weighted by synapses until the network spits out a guess: “Yep, this smell’s edible!” (say, a score of 0.8).

But what if it’s wrong? Maybe that spicy smell was from rotten chili sauce, and the real answer should’ve been 0.2 (mostly inedible). The network’s guess of 0.8 is off, and we need to fix it. That’s where error propagation comes in—it’s like telling the network, “Hey, you messed up, let’s figure out who’s to blame and how to do better next time.” It’s the heart of backpropagation, sending the mistake backward to tweak the system.

5.3.1  Error Function

What’s the Error?

First, we measure how wrong the network was. In our smell example:

  1. Real answer (what we know): 0.2 (mostly inedible).
  2. Network’s guess: 0.8 (mostly edible).
  3. Error: The difference between 0.8 and 0.2 …with a twist.

The error number will be slightly modified: it will be squared. Why would we want to do that? It is in order to punish big mistakes more than small ones. For instance, if we have a mistake of 0.1 or a mistake of 0.6, the scale between the 2 is 600%. But if we square them, 0.1² = 0.01, while 0.6² = 0.36. The scale between these 2 numbers is now 3600%. So, the error of 0.6 will have an impact 36 times bigger than an error of 0.1.
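As a quick check of that arithmetic:

```python
# Squaring punishes the big mistake far more than the small one.
small, big = 0.1, 0.6             # raw errors: ratio 6 (600%)
print(small ** 2, big ** 2)       # 0.01 and 0.36
print((big ** 2) / (small ** 2))  # 36.0, i.e. 3600%
```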

In paper #8, we saw data flow forward: input → hidden layers → output. Now, error propagation sends the blame backward: output → hidden layers → input. It’s like retracing your steps after a bad cooking attempt to see which ingredient ruined the dish.

5.3.2  Modifying the parameters

The question we have now is: how do we modify the parameters of the network? How do we know what is correct, or simply “better”? These are just numbers and functions combined with each other. It is pretty hard to know. Well, I have good news: mathematicians have an answer to that problem!

The solution to this problem is called a gradient. And for the first time since I started writing these papers, I will ask for your stronger attention. What I will present now will look complicated, but what I intend to convey is not complicated. I know it sounds counter-intuitive, but it is true.

For a start, I will explain the concept of gradient.

5.4  What is a gradient?

We now need to introduce a concept that should not be complicated to understand: the gradient of a line.

The gradient can be simply defined as the slope of the line. In other words, it describes how flat or how steep a line is. It represents how much progress is made up or down while moving along the line.

It is a fairly common concept that you encounter when you drive in a mountain area: you will see signs telling you the gradient of the road as a percentage.

If a sign displays, for instance, 20%, it tells you that for every 100 metres you drive horizontally, you will go up or down by 20 metres. In other words, the gradient is defined as the amount of change up or down divided by the horizontal distance:

\[ \frac{\text{Change Up or Down}}{\text{Flat Distance driven}} \]

This formula is illustrated in the following diagram:

(Diagram: AI Training & Gradient)

Now, we have to add an extra concept to this gradient. On a straight line, the gradient is constant: it is the same wherever you move along the line. Life is rarely that simple, and a road in the mountains certainly does not look like a straight line; it is more like a curve with hills and valleys. On a curve, things are quite different: the slope, as in a mountain, varies depending on where you are. Sometimes the slope is steep and sometimes it is gentle. The flatter the road, the smaller the gradient. If we push this logic, what is the gradient when the road is flat? Well, this is an excellent situation, because the gradient is equal to 0 (we move along the road but there is no change up or down). So, wherever the gradient is zero, we know we are flat.

(Diagram: AI Gradient descent)

This concept will be essential to understand what is happening now.

5.5  The heart of the system: The Gradient of the Loss

Principle:

The Loss function is defined by:

\[ L = (y - t)^2 \]

It represents how far off we are compared to the correct result, y being the result our system produced and t being the “truth”—and we want it near zero.

Based on the result of this function, we want to modify w and b so that this famous L gets as close as possible to 0.

But y itself can be expressed in terms of the other parameters:

\[ y = \text{Sigmoid}(w \cdot x + b) \]

Now, I will ask you to believe me, because if you don’t, I’ll have to develop the whole mathematical story and I promised not to do that. I’m a man of my word.

To remember:

We know how to calculate the impact that a change in the value of w will have on the loss L. In other words, we are able to calculate without any doubt whether increasing or decreasing w is a good idea. We can also know if this change should be big or tiny. This is all done via the principle of Differential Calculus presented in paper #4. If the gradient of the impact is big, w is a big troublemaker and should be treated accordingly. If the gradient of the impact is small, w is doing a fairly good job and should be modified only gently.

This is also possible for b, which will be treated in the same manner.

Explained another way:

Just in case the previous explanation was not clear, I will try another way:

After calculating the loss, we need to decide how much to tweak the neuron’s parameters—those are w and b.

To figure this out, we check the impact a change in w has on the loss. This comes from a tricky gradient calculation—don’t worry about that—but it shows how much w affects our loss. If the gradient’s big, w’s a major player in the mistake, so we adjust it with a bigger step. If the gradient’s small, w’s not doing much to the loss, so we tweak it just a little.

We do the same with b.
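For the curious, here is what that “tricky gradient calculation” looks like for our merged neuron y = Sigmoid(w.x + b) with loss L = (y − t)². The chain-rule expressions are standard calculus; treat the code as an illustrative sketch of the maths this paper deliberately skips:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def gradients(w, b, x, t):
    """Chain rule for L = (y - t)^2 with y = Sigmoid(w*x + b)."""
    y = sigmoid(w * x + b)
    dL_dy = 2.0 * (y - t)      # how the loss reacts to the output y
    dy_du = y * (1.0 - y)      # the Sigmoid's slope at this point
    dL_dw = dL_dy * dy_du * x  # blame assigned to w
    dL_db = dL_dy * dy_du      # blame assigned to b
    return dL_dw, dL_db
```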

The adjustment

The adjustment itself is calculated simply:

  • New w = w – learning_rate . gradient for w
  • New b = b – learning_rate . gradient for b

We introduce here a new concept: the Learning Rate. This is usually a small number, decided at the start of the training, typically 0.1. So, nothing to worry about.

The adjustment formula explained:

The above formula says that the new w or new b is equal to the old value, minus “Learning rate . Gradient”. Where does this come from?

It is rather simple. You remember that the gradient is an indicator of how big a troublemaker w or b is, right? So, the idea is that if they are big troublemakers we want to correct them harder. The gradient’s value is high for big trouble and low for little trouble. Consider the Learning_Rate a constant, like 0.1. If the gradient is small, say 0.2, the modification will be 0.1 x 0.2 = 0.02. But if the gradient is high, say 0.8, then the modification becomes 0.1 x 0.8 = 0.08, which is much bigger. It’s what we want: big gradient = big correction; small gradient = gentle change.
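In code, the adjustment is a one-liner per parameter. The numbers below reuse the illustrative gradients from the explanation above; the starting values are made up:

```python
learning_rate = 0.1
w, b = 0.7, -0.3            # current parameters (made up)
grad_w, grad_b = 0.8, 0.2   # gradients, as if just computed

w = w - learning_rate * grad_w  # 0.7 - 0.08 = 0.62: big troublemaker, big step
b = b - learning_rate * grad_b  # -0.3 - 0.02 = -0.32: small trouble, gentle step
print(w, b)
```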

The whole Gradient Loss process visually

Like any function, the loss function can be drawn. For our squared loss, it has a U shape. Gradient descent means visually walking along this curve until we reach the bottom area. At every step, when we modify w (or b), we slide along the U shape.

(Diagram: AI Gradient descent)

This operation is applied to each neuron, back to the entry point.

5.6  Gradient Loss in numbers (optional read)

This section is for illustrating with numbers. You can safely skip it if numbers are not your cup of tea.

I will reuse the same numbers as at the beginning: the neuron outputs 0.8 (very edible) and the real value is 0.2 (seriously not edible).

We start at the end—that 0.8 from the output neuron. The loss is (0.8 – 0.2)² = 0.6² = 0.36. Let’s assume that the weight was 0.7. We need to know how much this final weight (0.7) is to blame. This is where differential calculus (paper #3) comes in, with a bit of gradient descent magic. We calculate the “gradient”—how much the loss changes if we tweak that weight.

Without diving too deep into math (I promised to keep it light!), the gradient depends on: 

  • How far off we were (0.8 – 0.2 = 0.6).
  • The activation function’s slope (let’s say we used Sigmoid from paper #8, which curves between 0 and 1).
  • The hidden neuron’s output (let’s call it 0.9 for now, from forward propagation—in other words, x).

The gradient might look something like 0.6 x Sigmoid_Slope x 0.9, but let’s say it works out to 0.12. This tells us: “Hey, this weight pushed the guess too high.” We adjust it with a learning rate (multiplied by the gradient)—a small step, say Learning_Rate = 0.1:

New Weight = 0.7 – (0.1 x 0.12) = 0.7 – 0.012 = 0.688

A tiny tweak from 0.7, but it’s a start!
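If you want to reproduce these numbers, here is a sketch. Note that the exact chain-rule gradient comes out a little larger than the rounded 0.12 used above, which is fine: that figure was only a “let’s say”:

```python
y, t, x, w, lr = 0.8, 0.2, 0.9, 0.7, 0.1  # values from the example

loss = (y - t) ** 2            # (0.8 - 0.2)^2 = 0.36
sigmoid_slope = y * (1.0 - y)  # Sigmoid's slope at y = 0.8: 0.16
grad = 2.0 * (y - t) * sigmoid_slope * x  # 0.1728; rounded above to "say, 0.12"
new_w = w - lr * grad          # 0.7 - 0.01728 = 0.683, close to the 0.688 above
print(loss, grad, new_w)
```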

From the output, we step back—each hidden neuron gets its gradient and tweak, all the way to the input.

6  The Bigger Picture

Let’s take some distance again from all these maths and all these steps. What have we done, in broad terms?

  • Feed the system a known smell.
  • Compare the result given with the known correct result.
  • Calculate how wrong the system is and correct each neuron accordingly, starting from the exit point back to the entry point.
  • Do these 3 steps with each smell of the batch.
  • Run the batch (epoch) several times.
  • Re-run the model on each item of the Validation Data.
  • If the validation results are not satisfactory, restart from step 1.
  • If the validation results are satisfactory, re-run the model with the Testing Data.
  • Hopefully the result is satisfactory. If not, scratch your head and figure out where you got it wrong; maybe your data are not good.

These steps are summarised in the next 3 diagrams:

7  Data quality

The process we have studied in this paper depends heavily on data quality. Data quality is a topic in itself, but be aware that the magic we have described here is possible only when you have a good and large set of quality data.

We described our system as a smell detector, but the example usually given is recognising a cat in a picture: we send pictures of animals, and we know when it is a cat and when it is not. Easy, right? Not so fast! Imagine that we feed the system exclusively with Siamese cats, in crystal-clear images. Then our system will be good at ...finding crystal-clear Siamese cats, not at recognising cats. For this reason, the building of the data set is essential. You get the picture of how we train an AI model in a supervised learning manner; just be aware that data quality matters.

8  Where is the Intelligence?

In this paper we have presented the multiple aspects and tools involved in training an AI neural network. It is this “invention” that allowed us to take a giant step towards performant systems. We have seen that mathematics is heavily involved and that, without it, none of this would even be thinkable.

Now, if we get back to our recurring question, i.e. “Where is the Intelligence?”, I would argue that there is not much of it, in the sense we commonly use the word. It is all a beautiful combination of mathematical formulas, cleverly engineered. We convert the real world into numbers, then these numbers are transformed by functions, then these functions are mathematically analysed and improved. We do that until we find the solution. That’s it.

So, it is not today either that we have any intelligence at work. Sorry.
