
#08 - Artificial Intelligence: Neural Network – Forward Propagation

1  Introduction

In our previous paper, we presented the various components of a digital neural network: the entry neurons that digest the initial data, the functions inside each neuron that transform the data, and the synapses that convey the data from neuron to neuron.

It is now time to take a closer look at what actually happens to these data and how a series of numbers at the entry is converted into a decision at the exit. For instance, how do we recognize an animal or a person on an image; how do we decide that an object on the road is a person requiring the car to stop or a piece of junk presenting no danger; and, for our use case, how is a smell assessed as edible food or not?

We will use mathematics in this paper. Some funny functions will be introduced with some tricky formulas. There is no need for you to remember them or even to understand the mathematics behind the formulas. Understanding what each one does is enough. You don’t need to be a mathematician to benefit from this paper.

2   Processing our smell data

2.1  Neurons and Synapses

A digital neural network is made of:

  • Neurons that take numbers as input and produce new numbers as outputs.
  • Synapses that convey the numbers from one neuron to another. These synapses can decide whether or not to transmit the information and can amplify or decrease the importance of the information on the way. Now, I must emphasise that naming these connections “synapses” is definitely not conventional. You are unlikely to find this name used in any AI papers. You will find them under the name of “weights”, with a much more mathematical approach in general. Probably enough to lose most people on the way. But I think that sticking to “synapse” for the moment is good enough and should make your life easier. Just be aware it is unusual.

2.2  Step 1: Inside a neuron

As we have established, a neuron will take one or more numbers as an entry point. In order to simplify our presentation, we will consider only one number for now. We are therefore in the case of the following network:


Figure 1: One data per neuron in entry layer

For our smell example, a neuron might receive a number between 1 and 7 to represent smell types, remember? Now what happens?

As we have explained earlier, the transformation of information operated by a biological neuron will be performed by a mathematical function in the digital world. The simplest mathematical function we can use is of the form:

\[ f(x) = a \cdot x + b \]

x is the entry number. It is multiplied by a number “a” and a bias “b” is added to the result. Now, the question you are about to ask is: what are “a” and “b”? Indeed, that is THE question. Unfortunately, we don’t know yet. I’ll come back to that.

So, inside a neuron, a simple transformation is performed with the above function. We will come back to the synapses a little later.
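To make this a bit more concrete, here is a minimal Python sketch of my own of this single-input transformation; the values of “a” and “b” below are arbitrary placeholders, precisely because we do not know the real ones yet.

```python
# Minimal sketch of one neuron's linear transformation: f(x) = a*x + b.
# "a" and "b" are placeholders here; in a real network they are learned during training.
def neuron_linear(x, a=0.5, b=1.0):
    return a * x + b

# Example: a smell type encoded as the number 3 enters the neuron.
print(neuron_linear(3))  # 0.5 * 3 + 1.0 = 2.5
```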

2.3  The multi-parameter case

This is all well and good, but what happens when we send more than one value to a neuron? I mean, in the following case:


Figure 2: Combined data per neuron in entry layer

In this situation, the second entry neuron from the top will receive both i and c. In this case, the function applied inside the neuron is simply:

\[ f(i, c) = a_1 \cdot i + a_2 \cdot c + b \]

Each parameter will receive its own coefficient. This rule can work for as many parameters as we wish.
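As a quick illustration (the coefficients below are made-up placeholders, not learned values), this is what the multi-parameter case looks like in Python:

```python
# Sketch of a neuron receiving two inputs i and c: f(i, c) = a1*i + a2*c + b.
# The coefficients a1, a2 and the bias b are illustrative placeholders.
def neuron_linear_2(i, c, a1=0.5, a2=2.0, b=1.0):
    return a1 * i + a2 * c + b

# The same idea generalises to any number of parameters:
def neuron_linear_n(inputs, coefficients, b):
    return sum(a * x for a, x in zip(coefficients, inputs)) + b

print(neuron_linear_2(3, 5))                    # 0.5*3 + 2.0*5 + 1.0 = 12.5
print(neuron_linear_n([3, 5], [0.5, 2.0], 1.0)) # same result: 12.5
```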

3  Activation functions

In practice, as usual, things are more complicated. After applying the initial linear function (a.x + b), we also apply a complementary function: the activation function.

The activation function acts like a smart filter or tuner for each neuron, deciding how strongly or weakly the neuron should send its signal forward based on the information it receives. Instead of just turning the signal on or off (like an “all or nothing” switch), it adjusts the signal’s intensity—making it stronger or weaker—depending on how important or relevant the input is. This flexibility helps the network learn and recognize complex patterns, like spotting tiny faults in a circuit board or predicting customer preferences, by allowing the neurons to pass along nuanced, graded signals. It’s like the neuron is whispering, speaking normally, or shouting, rather than just being silent or yelling. This ability to vary the signal strength lets the network handle tricky, real-world problems in a more human-like way, going beyond simple straight-line calculations.

So, I’m afraid I have to take you slightly more into mathematics to study this part of the system. I’ll do my best to keep it simple.

We will now study four types of activation functions. These four types are often used for simple problems that need only simple neural networks. The type of function selected by AI engineers will depend on the type of problem to solve. Please don’t give up because of the names; it is not that complicated.

3.1  ReLU (Rectified Linear Unit)

This function is dead simple. It takes the result of the neuron’s transformation, let’s call it “z”, and if this z is lower than zero, then it becomes zero; in other words, the information is not transferred. If z is greater than zero, then it transfers z as is. So, less than zero becomes zero; greater than zero is left unchanged.

Mathematically it becomes:

\[ f(z) = \max(0, z) \]

Figure 3: ReLU function

Fairly straightforward!
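If you prefer to see it in code, here is a tiny sketch of this behaviour (mine, purely for illustration):

```python
def relu(z):
    # Anything below zero is cut to zero; positive values pass through unchanged.
    return max(0, z)

print(relu(-2.5))  # 0
print(relu(2.5))   # 2.5
```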

ReLU functions are very popular for several reasons. Most importantly, they are extremely inexpensive to compute: checking whether a number is less than zero costs almost nothing. They are used to extract hierarchical features (e.g., edges, shapes, patterns), which makes them very popular and useful in computer vision: image classification, object detection, facial recognition, recognizing road signs, etc.

Natural language processing (NLP) tasks, such as sentiment analysis or text classification, also use this solution.

The ReLU function is typically used on the hidden layers (between entry and exit) of the network.

AI Forward Propagation
AI Forward Propagation

Figure 4: ReLU activation function in hidden layer

3.2  Sigmoid function

This solution is more complex mathematically and less widely used than ReLU in the hidden layers. It is nevertheless a popular solution, and its principle is interesting.

Sigmoid outputs values between 0 and 1, which are easy to interpret as probabilities, making it a natural choice for the output layer in binary classification tasks (e.g., fault/no-fault detection).

Its mathematical formula is:

\[ f(z) = \frac{1}{1 + e^{-z}} \]

Figure 5: Sigmoid function

So, the sigmoid function helps the network think in shades of gray, not just black and white, making it great for tasks where you need to measure confidence or likelihood. It is often used in the last step of a neural network, when we need to decide the probability of an event. It converts any number into a number between 0 and 1, like a probability. This gradual, smooth adjustment helps the neural network make nuanced decisions.

This activation function is typically used on the last layer of the network.


Figure 6: Sigmoid function applied on exit layer
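Here is a minimal sketch of the sigmoid in Python, using the standard math module; the input values are arbitrary examples:

```python
import math

def sigmoid(z):
    # Squashes any number into the interval (0, 1), like a probability.
    return 1 / (1 + math.exp(-z))

print(sigmoid(-4))  # about 0.018 -> "very unlikely"
print(sigmoid(0))   # 0.5         -> "undecided"
print(sigmoid(4))   # about 0.982 -> "very likely"
```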

3.3  Hyperbolic Tangent – Tanh

This function has a scary name but, fear not, it is not that complicated. It is basically the same as the sigmoid above, but slightly more refined. Instead of giving a result between 0 and 1, it gives a result between -1 and 1. Consequently, it has more room for finesse. It also has some mathematical properties that make it a smoother choice. I will come back to that in a future paper, when we see how these functions are calculated.

Just for the sake of being complete, here is this function (no need to remember that):

 

\[ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \]

 

Figure 7: Tanh function

Here is a comparison of the Sigmoid and the Tanh:

Figure 8: Sigmoid vs Tanh functions

As we can see in the previous diagram, the output of the tanh is richer than the flat sigmoid: it is centred on zero and spans from -1 to 1 instead of being squeezed between 0 and 1. The two functions are equivalent in nature but not in practical terms.
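To make the comparison tangible, this little sketch (illustrative inputs only) prints both functions side by side:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Python's math module provides tanh directly.
for z in (-2, -1, 0, 1, 2):
    print(f"z={z:+d}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):.3f}")
# sigmoid stays between 0 and 1; tanh is centred on 0 and ranges from -1 to 1.
```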

3.4  Parametric ReLU

The three previous activation functions have something in common: they are “fixed”. It means that once they are attached to a neuron, their behaviour does not change over time. It is extremely convenient. Unfortunately, it is not always enough. In particular, the very popular ReLU function has some weaknesses that need to be addressed: it has the misfortune of displaying what is named the “dying ReLU” behaviour, where the output ends up being mostly zero.

Of course, there are solutions to this problem, and one of them consists of adding a parameter to this function (hence the name).

\[ f(z) = \max(\alpha z, z) \]

Instead of producing an output of zero for every value of z < 0, it outputs z multiplied by alpha. This alpha is usually a tiny number, much smaller than 1. This little difference has several consequences, of which I will highlight only one: it offers a variety of results for all the values of z < 0. This variety is tiny, so every negative value of z becomes a very small negative number: the negative values of z are compressed close to zero instead of being cut off entirely.

It looks like this:

Figure 9: PReLU function

You’d use Parametric ReLU (PReLU) instead of standard ReLU when you want to give your neural network a little more flexibility to learn from tricky or unusual data, especially if some neurons in your network are “dying” (stopping learning) because they keep getting negative inputs. PReLU lets the network figure out the best way to handle negative numbers during training, rather than just cutting them off at 0 like ReLU does. This can help the network catch subtle patterns or faint signals in your data that ReLU might miss.
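Here is a sketch of PReLU in the same style as the ReLU snippet above; in a real network the value of alpha would be learned during training, here it is just a small placeholder:

```python
def prelu(z, alpha=0.01):
    # Positive values pass through unchanged; negative values are scaled
    # by a small alpha instead of being cut to zero, so the neuron keeps "talking".
    return z if z > 0 else alpha * z

print(prelu(2.5))   # 2.5
print(prelu(-2.5))  # about -0.025: small, but not zero
```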

3.5  The Entry Layer case

The entry layer case is often slightly different from the hidden layer case. What I have presented is a sort of generalisation of how the neural network works. In practice, the entry layer is a bit different. It does not matter much for the general understanding, but just to set the record straight, I will add a couple of details about it.

More often than not, the entry layer does not do all the business we see in the hidden layers. It usually takes the raw data and sends it to the first hidden layer without any transformation. This is not mandatory behaviour, and what has been described in my papers so far is still technically correct; but if you encounter a neural network that does nothing in the entry layer, be aware that this is also a possibility and it is correct as well.


Figure 10: A typical entry layer performs no transformation

3.6  Synapses and synaptic weight

I want to remind you that the term “synapse” is not the convention in AI. I will now explain why.

In a biological neural network, we have seen that the information coming out of a neuron is sent on to other neurons. We have also established that the information is not necessarily sent to every connected neuron. We represented that behaviour with a “switch”, like on this diagram:


Figure 11: Synapses opening the flow or not

The thing is: the concept of a switch is slightly misleading, or shall I say oversimplifying. In reality, what we have is closer to a dial. The information sent can be amplified, decreased or even switched off. This allows for a much better processing of the information. What the brain does is say: this neuron has transformed the received information, and this information is of more or less importance to the next neurons. For instance, the type of the smell might be far less important than the recognition of the smell. If the brain cannot recognise what the smell is about, knowing it is fruity might not be that useful. But then, it gets slightly more complicated than that, because the importance of the information might be different for every next neuron. So, if a neuron receives information x and transforms it into a new number y, then y can be of different importance for each of the next 10 neurons. One can make great use of it and we want to amplify it, one can consider it of neutral importance, and one can find it close to useless. The corresponding synapses would then put a weight (coefficient) of, say, 5 on the first one, 1 on the middle one and 0.2 on the last one.

With a system like this, the brain really fine-tunes how it processes information. By building our digital neural network this way, we get closer to the brain’s system. And as you can now see, my initial “switch” is far too poor to do this job. We will therefore replace these switches with a better solution: weights or coefficients. Consequently, our digital neural network will look like this:


Figure 12: Synapses dialing the strength of information
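To picture the “dial” in code, here is a small illustrative sketch of my own of one neuron’s output being fanned out to three next neurons; the weights 5, 1 and 0.2 are the ones from the example above:

```python
# One neuron's output y is sent to three next neurons.
# Each synapse applies its own weight: amplify, pass as-is, or attenuate.
y = 2.0                    # output of the sending neuron (illustrative value)
weights = [5.0, 1.0, 0.2]  # synaptic weights towards the three receiving neurons

signals = [w * y for w in weights]
print(signals)  # [10.0, 2.0, 0.4] -> amplified, neutral, almost switched off
```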

4  The complete workflow: Forward Propagation

Let’s recap the complete workflow from entry to output. This workflow has a name: Forward Propagation.

One neuron gets a value as an entry point, x. This value is transformed using a linear function of the form a.x + b. Then the result of this function is transformed again using an activation function. The result of this activation is sent to the next neurons via synapses. On each synapse is a synaptic weight that transforms the number once more.

We have something like:

x ==> y = a.x + b ==> z = Activation(y) ==> t = m.z ==> t is given as entry value to the next neuron.

This transformation happens for every single neuron and every single synapse. That is a lot of transformations. This complexity, combined with the number of neurons and the structure of the network, is what gives the “intelligent power” to the system.
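Putting it all together, here is a hedged end-to-end sketch of this forward propagation chain for one neuron feeding the next; every number in it (a, b, the synaptic weight m, the exit neuron’s own coefficients) is an arbitrary placeholder, since we have not yet seen how these values are learned:

```python
import math

def relu(z):
    return max(0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward_one_step(x, a, b, m, activation):
    y = a * x + b       # linear transformation inside the neuron
    z = activation(y)   # activation function
    t = m * z           # synaptic weight applied on the way out
    return t

# Illustrative chain: entry value -> hidden neuron (ReLU) -> exit neuron (sigmoid).
x = 3.0                                       # raw entry value (e.g. a smell type)
h = forward_one_step(x, a=0.5, b=1.0, m=2.0, activation=relu)
out = sigmoid(0.6 * h - 1.0)                  # exit neuron turning h into a probability
print(h, out)  # h = 5.0, out is about 0.88: read it as "88% confident"
```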

5  Conclusion

In this paper, we have presented the various components of a simple digital neural network, ending with the global process of Forward Propagation. We have seen what a neuron is made of and how it handles one or more parameters. We have seen how the synapses are in fact weights (coefficients) that modulate the impact of an output. We have seen that the activation functions have a huge impact on the behaviour of the network, and that we choose them based on the nature of the problem to solve.
