Artificial Intelligence & Mathematics: Differential Calculus: The Genie in the Bottle

1  Introduction

In our previous white paper, we introduced algebra as an essential part of the AI world, since it allows us to convert the real world into computable data. In this paper we will continue to explore the mathematical marvels used in AI to create systems that can “predict” the future, produce results in forms such as text, images, or videos, or decide what action to take while driving a car. We will keep the mathematics to a minimum to make this paper accessible to all. My apologies to the gifted mathematicians who might read this paper: I will definitely simplify the mechanisms. My purpose is not to teach mathematics but to convey the principles of how it works. As you will see, mathematics is amazing!

2  Principle to predict the future

In ancient times, oracles predicted the future by reading the sky, reading palms, watching the flight of birds, or sacrificing animals or even humans. Reading the future in a bath of hot blood is most impressive, you will surely agree.

Today, humanity still wants to know the future, but most of the modern world has stopped killing animals for that purpose. Instead, we use computers, and the oracles are now named Artificial Intelligence. There is less decorum around it, which we may regret. (But I think that decorum took a serious blow when people stopped wearing hats. Nothing like a top hat or a derby to make an impression. Not to mention the number of articles written about the Queen of England’s hats. But I digress…)

So, what we want to achieve is to predict the future based on the past rather than on the flight of birds. Mathematics has something fairly good for that: functions. The idea is that we feed a value into a function as input and get a new value as a result.

2.1  Mathematical Functions

As a very simple example, we can have a function that calculates how long it will take to reach a destination based on the average speed.
If d = distance to travel, s = average speed (distance per hour), and p = total duration of pauses (in hours), then the duration of your trip is given by:

F(distance, speed, pauses)

Written more briefly:

F(d, s, p) = (d/s) + p

In other words, if we name y the result of this calculation, we have:

y = (d/s) + p

That looks fancy, doesn’t it?

In case you have any doubt, this is indeed predicting the future, since the function tells you how long it will take to arrive at your destination. Things can happen and the prediction might be wrong. But think of Waze or Google Maps telling you how long it will take to arrive: it is based on this principle.
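For readers who like to see things run, here is the same function as a minimal Python sketch. The function name, the parameter names, and the units (kilometres and hours) are my own assumptions for illustration, as is the sample trip.

```python
def trip_duration(distance_km: float, avg_speed_kmh: float, pauses_h: float = 0.0) -> float:
    """Predicted trip duration in hours, i.e. y = (d / s) + p."""
    if avg_speed_kmh <= 0:
        raise ValueError("average speed must be positive")
    return distance_km / avg_speed_kmh + pauses_h


# Hypothetical trip: 300 km at an average of 100 km/h, with a 30-minute pause.
print(trip_duration(300, 100, 0.5))  # 3.5 (hours)
```

We give the function values for d, s, and p, and it gives back y: a prediction of how long the trip will take.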

2.2  Finding the Mathematical Function from the Data

In the previous example, we could write the function ourselves because we already know how to calculate the result. This is no fun. How about calculating the resale price of a car after 10 years, or the expected first salary of a graduate based on the diploma, the city, and the family’s wealth at birth? That sounds more exciting, right? But the problem is that we do not have the function to calculate that. No one can come and say:

Salary = Diploma … City … Wealth

And still, we’d want to know!

The good news is that we can approximate this function, often with pretty accurate results. To do that we will use Machine Learning and Differential Calculus.

The concept is the following: we take the data we have about the past and discover which function best matches it. Then we apply that function to the present to predict the future. And guess what: it does work! So, let’s see how we do that…

3  Differential Calculus at work

3.1  Principle of Gradient Descent

To calculate a function from data, we do the following:

  • Create a function as a starting point. This function can be random or cleverly chosen, but we know it is very likely a poor fit.
  • For each data point we have from the past, calculate the “distance” between the result our current function gives and the real result.
  • Modify the function so that its result moves closer to the real data.

After choosing the starting function once, we repeat the last two steps for each data point, as many times as needed, until the function fits the data well.
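The three steps above can be sketched in a few lines of Python. This is only a sketch under loud assumptions: the data points are made up, the function is the simplest possible straight line `resale = a * purchase` (a slope and nothing else), the “distance” is the squared error, and `learning_rate` is a hand-picked value. None of it comes from a real dataset.

```python
# Made-up history: (purchase price, resale price after 10 years), in thousands.
history = [(10.0, 3.0), (20.0, 6.0), (30.0, 9.0), (40.0, 12.0)]

# Step 1: a starting function, resale = a * purchase, with a deliberately bad a.
a = 0.0
learning_rate = 0.0005  # size of each correction (assumed, hand-picked)

for epoch in range(200):  # repeat as many times as needed
    for purchase, resale in history:
        # Step 2: the "distance" between the prediction and the real result.
        error = a * purchase - resale
        # Step 3: nudge a against the derivative of the squared error,
        # which is 2 * error * purchase with respect to a.
        a -= learning_rate * 2 * error * purchase

print(round(a, 3))  # 0.3: these made-up cars keep 30% of their purchase price
```

The starting value of `a` is wrong on purpose. The loop corrects it a little at each data point, and with these numbers it settles on the slope that was hidden in the data.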

To illustrate this visually, I have to simplify to a 2-dimensional problem; otherwise it becomes messy pretty quickly. For instance, let’s predict the resale price of a car after 10 years based on its purchase price.

On the first diagram, the triangles represent the history of car resale prices after 10 years, plotted against the initial purchase price. The orange line is the starting point. We have to start somewhere, and we do not know which function will represent these data best, so we have to find it empirically. The function we are looking for is the blue line, the target.

For each data point, we enter the initial price and get the result given by the orange function. Then we calculate the “distance” between reality and the prediction. This is the job of what is called the “loss function”. We then apply a correction to the function that brings it closer to reality: we “pull” the orange function towards the best result (the blue line). This is where differential calculus comes in: the derivative of the loss with respect to the function’s parameters (the gradient) tells us in which direction to pull, hence the name gradient descent. Basically, we do that for all the data points, and we keep doing it for as long as the average distance between the points and the function can be improved. In fact, there are many more things to take into consideration here, but that will be the topic of another paper. Let’s keep it simple.
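For the curious, here is what the “distance” and the “pull” can look like on paper. This is one common choice, not the only one. I am assuming that the orange function is a straight line with a slope a and an intercept b, that the loss is the mean squared error over the N data points (x_i, y_i), and that the size of each correction is set by a small number η, usually called the learning rate. These symbols are mine, introduced for illustration.

```latex
% Assumed model: a straight line with slope a and intercept b
\[ f(x) = a x + b \]

% Assumed loss: mean squared error over the N data points (x_i, y_i)
\[ L(a, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( a x_i + b - y_i \bigr)^2 \]

% Derivatives of the loss with respect to each parameter
\[ \frac{\partial L}{\partial a} = \frac{2}{N} \sum_{i=1}^{N} \bigl( a x_i + b - y_i \bigr) x_i
\qquad
\frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} \bigl( a x_i + b - y_i \bigr) \]

% Correction step: move each parameter against its derivative
\[ a \leftarrow a - \eta \, \frac{\partial L}{\partial a}
\qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b} \]
```

The derivatives tell us in which direction the loss grows, so stepping the opposite way makes the line fit the data a little better each time.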

Once we have convergence (i.e., the predictions have stabilized) and there is no more improvement to make, we have a final function that matches the past pretty well. If we enter the price of a car today, we can predict its value in 10 years quite accurately, as long as the future resembles the past. Magic!

4  Perspective

Some people might read the previous paragraphs and say: “Is that it?” The answer, in a nutshell, is yes, with a catch. The example I used here has 1 parameter so that I could display it on a 2D graph. But of course, if you take more fascinating examples, like predicting the salary of a student based on their family’s wealth at birth, the geography, the university, and probably many other parameters we’d love to introduce, then we have a problem.

Indeed, the example I used could almost be done by hand with a small calculator or even pen and paper, should you have the time. As soon as you add “dimensions” to the problem, the difficulty becomes simply enormous. In fact, it becomes properly inhuman. And this is precisely where things change with computers: we know mathematically how to do it, and we know how to tell a computer to do it. The computer has the processing power the human does not have and …bingo! It becomes possible. We can indeed do the same for a problem with 5, 10, 50, or 500,000 parameters. It takes longer. It takes a huge amount of processing power, but it can be done! And then, when you do that on a problem that appears impossible, because it is indeed humanly impossible, you wonder if the system knows something you don’t. And in a way …it does.
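To give a feel for what “adding dimensions” means in practice, here is a small sketch of the same idea with three parameters instead of one. Everything in it is an assumption made for illustration: the function names `predict` and `fit`, the made-up inputs, the hidden rule used to generate the targets, the squared-error loss, and the hand-picked `learning_rate` and `epochs`. Real systems use far more data and specialised libraries.

```python
def predict(weights, bias, features):
    """Linear function: a weighted sum of the inputs plus a constant."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fit(rows, targets, learning_rate=0.1, epochs=2000):
    """Gradient descent on the mean squared error, for any number of inputs."""
    n = len(rows)
    weights = [0.0] * len(rows[0])  # starting point: a deliberately poor function
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, target in zip(rows, targets):
            error = predict(weights, bias, features) - target  # the "distance"
            for j, x in enumerate(features):
                grad_w[j] += 2 * error * x / n
            grad_b += 2 * error / n
        # Correction: move every parameter against its derivative.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Made-up data: three inputs per row, targets generated by a hidden rule
# (2*x1 - 1*x2 + 0.5*x3 + 1) that the loop has to rediscover.
rows = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0),
]
targets = [2 * x1 - x2 + 0.5 * x3 + 1 for x1, x2, x3 in rows]

weights, bias = fit(rows, targets)
print([round(w, 3) for w in weights], round(bias, 3))  # close to [2.0, -1.0, 0.5] and 1.0
```

Nothing in `fit` depends on there being three inputs. The same loop would run with 50 or 500,000 of them. It would simply need far more data and far more processing time.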

5  Where is the Intelligence?

Well, so far, we still have not encountered anything that can be described as “Intelligence”. What we have done is use mathematics to build more mathematics. This mathematics is absolutely amazing, as it can be used to predict things in the real world. The result is that the computer can “predict” things that humans cannot. Not only can it do that, but it does it pretty well. Nonetheless, the word “intelligence” does not really seem appropriate, at least not in the human sense of the word.