Bikeshare Usage Prediction
This demo shows how to build a neural network from scratch and use it for a prediction problem on a real dataset (from the UCI Machine Learning Repository). Our goal here is to predict the number of bike share users on a given day.
By building a neural network from the ground up, we’ll get a much better understanding of gradient descent, backpropagation, and other concepts that are important to know before we dig into topics such as CNNs, RNNs, and LSTMs. (I will only cover the high-level concepts of deep neural networks (DNNs) here; if you want the formulas and the implementation, please check the GitHub repo of this demo.)
First, to understand DNNs you have to understand the terminology below.
Each circle (a neuron, or perceptron) in the network graph looks at input data and decides how to categorize that data. In our case the dataset gives us many inputs, such as ‘season’, ‘weathersit’, ‘mnth’, ‘hr’, and ‘weekday’. We need to feed all the useful data to our DNN so that it can learn how important each of those inputs is.
You might be wondering: “How does it know whether season (input a) or weekday (input b) is more important in making this prediction?” Well, when we initialize a neural network, we don’t know what information will be most important in making a decision. It’s up to the neural network to learn for itself which data is most important and adjust how it considers that data. It does this with something called weights.
These weights start out as random values, and as the neural network learns more about what kind of input data leads to more or fewer bike share users, it adjusts the weights based on any errors in prediction that the previous weights resulted in. This is called training the neural network.
A higher weight means the neural network considers that input more important than other inputs. A lower weight means that the data is considered less important.
An extreme example would be if weekday had no effect at all on the number of users; then the weight of the weekday input would be zero and it would have no effect on the output of the perceptron.
Each input to a perceptron has an associated weight that represents its importance and these weights are determined during the learning process of a neural network, called training.
In the next step, the weighted input data is summed to produce a single value that will help determine the final output: the number of bike share users on a given day.
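To make the weighted sum concrete, here is a minimal sketch in NumPy. The feature values and weights are made up for illustration (they are not taken from the dataset or from the demo's code), and the weight for ‘weekday’ is set to zero to show that a zero-weight input contributes nothing to the result.

```python
import numpy as np

# Hypothetical, already-scaled feature values for one record;
# the names mirror dataset columns but the numbers are made up.
features = {"season": 0.25, "weathersit": 0.5, "hr": 0.75, "weekday": 1.0}

# Illustrative weights; a real network would learn these during training.
weights = {"season": 0.4, "weathersit": -0.2, "hr": 0.8, "weekday": 0.0}

x = np.array(list(features.values()))
w = np.array([weights[name] for name in features])

# Linear combination: each input times its weight, summed into one value.
weighted_sum = float(np.dot(w, x))
print(weighted_sum)
```

Changing the ‘weekday’ value here leaves the sum unchanged, because its weight is zero.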
Activation Function (We don’t use it in this demo)
We don’t use any activation functions in our network, but a classification task, such as assigning inputs to different classes or answering a yes/no question, needs one. There, the activation function transforms the linear weighted sum into a value between 0 and 1. The closer that value is to 0 or to 1, the more confidently the input can be assigned to one class or the other.
Anyway, the result of the perceptron’s summation is turned into an output signal by feeding the linear combination into an activation function. It is the nonlinear activation function that allows such networks to solve nontrivial problems using only a small number of nodes. In artificial neural networks this function is also called a transfer function (not to be confused with a linear system’s transfer function). Here are some common activation functions:
- Heaviside step function
- Sigmoid function
- Rectified linear unit (ReLU)
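As a rough sketch, the three functions listed above can be written in a few lines of NumPy. The function names are my own, and the step function here returns 1 at exactly zero, which is one common convention (others use 0 or 0.5 there).

```python
import numpy as np

def heaviside_step(x):
    """Return 1.0 where x >= 0 and 0.0 elsewhere (convention assumed here)."""
    return np.where(np.asarray(x) >= 0, 1.0, 0.0)

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def relu(x):
    """Pass positive values through unchanged and clip negatives to 0."""
    return np.maximum(0.0, x)

z = np.array([-2.0, 0.0, 2.0])  # example weighted sums
print(heaviside_step(z))
print(sigmoid(z))
print(relu(z))
```

The step function gives a hard 0/1 decision, while the sigmoid gives a smooth value between 0 and 1 that also has a usable derivative for gradient descent.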
Gradient Descent (Rate of change or slope)
Our goal here is to calculate the gradient of the error, which means we need the partial derivatives of the error with respect to each of the weights. Gradient descent works in small steps, moving down the mountain a little at a time. After calculating the output for some input data, we perform an update step, where we calculate the error gradients and then update the weights. This happens over and over while training a network, corresponding to the many steps it takes to make it down the mountain. To sum up, there are two points.
- The goal of gradient descent is to start at a random point \((m_0, b_0)\) on this error surface and find the global minimum point \((m^*, b^*)\).
- The larger this error is, the larger the step should be. When the error is small, our steps can be smaller since the weights are near the minimum.
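Here is a minimal sketch of those two points for the line \(f(x)=mx+b\), assuming mean squared error as the error measure. The data is made up, the function name `gradient_step` is my own, and I start from \((0, 0)\) instead of a random point so the run is reproducible.

```python
import numpy as np

def gradient_step(m, b, x, y, learning_rate):
    """One gradient-descent update of (m, b) for mean squared error."""
    error = (m * x + b) - y
    # Partial derivatives of mean(error**2) with respect to m and b.
    grad_m = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient; a larger error gives a larger step.
    return m - learning_rate * grad_m, b - learning_rate * grad_b

# Made-up data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, b = 0.0, 0.0  # starting guess (m0, b0)
for _ in range(2000):
    m, b = gradient_step(m, b, x, y, learning_rate=0.05)
print(m, b)
```

Because the update is the learning rate times the gradient, the steps shrink on their own as the error gets small near the minimum.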
For the mathematically inclined
If you want to minimize the equation x² (whose derivative is 2x), and your guess for the solution is 3, then you can take a baby step (0.1 times the gradient) in the direction opposite to the gradient, which is 6 at x=3. So the next guess is 2.4, the next one 1.92, the next about 1.54… until finally we approach zero. (Ref. hackernoon)
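The same x² example as a tiny script, assuming each step is 0.1 times the gradient at the current guess. The function name `minimize_square` is just for illustration.

```python
def minimize_square(x, learning_rate=0.1, steps=3):
    """Gradient descent on f(x) = x**2; returns every guess along the way."""
    guesses = [x]
    for _ in range(steps):
        gradient = 2 * x                  # derivative of x**2
        x = x - learning_rate * gradient  # move against the gradient
        guesses.append(x)
    return guesses

guesses = minimize_square(3.0)
print([round(g, 3) for g in guesses])  # [3.0, 2.4, 1.92, 1.536]
```

Each guess is 0.8 times the previous one, so the sequence keeps shrinking toward zero without ever jumping past it.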
2D example – Relationship of \(f(x)=mx+b\)
Imagine a marble at the rim of a large bowl. This is the first guess of a solution. It starts far from the bottom center of the bowl, but it eventually gets there by rolling there, bit by bit. Instead of teleporting instantaneously from the rim to the bottom, it takes a gradual path, following the path of least resistance.
Using a learning rate that is too high is like attempting a black diamond run on your first attempt at snowboarding. It increases the chances of injury and repeated failure, possibly leading to demotivation. Be sure to check the post Linear Regression with Numpy if you want to see how gradient descent works under the hood.
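To see what a too-high learning rate does in practice, here is a small sketch that minimizes x² with two different rates. The specific rates (0.1 and 1.1) are just values I picked to show the contrast.

```python
def descend(x, learning_rate, steps=5):
    """Run a few gradient-descent steps on f(x) = x**2 and return the result."""
    for _ in range(steps):
        x = x - learning_rate * 2 * x  # gradient of x**2 is 2x
    return x

small_rate = descend(3.0, learning_rate=0.1)  # moves steadily toward 0
large_rate = descend(3.0, learning_rate=1.1)  # overshoots further each step
print(abs(small_rate), abs(large_rate))
```

With the large rate every step jumps past the minimum and lands farther away on the other side, so the guess drifts away from zero instead of settling.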
3D example – Relationship of \(f(x)=mx+b\)
There is another pretty awesome animated GIF that I found on the internet, in ‘A Brief Introduction To Gradient Descent’. As you can see, on the right-hand side of the GIF there is an error surface of predictions, given our data.
For more detail, please check the post A Brief Introduction To Gradient Descent, which I think explains the whole gradient concept in a really concise way!
We won’t go through all the mathematics of backpropagation here; you may check the code in the demo below. It is quite hard for me to explain the whole backprop process, especially since I am not good at calculus (I may write a post about how to apply calculus in the future). So, if you want to know what happens under the hood, there is an excellent post for reference: A Step by Step Backpropagation Example.
Earlier, we saw how to update weights with gradient descent. Backpropagation, short for the backward propagation of errors, is a common method of training artificial neural networks and is used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: propagation and weight update.
Since the output of a layer is determined by the weights between layers, the error resulting from units is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backwards to the hidden layers.
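The two-phase cycle can be sketched for a tiny network with one hidden layer. This is not the code from the repo: I am assuming a sigmoid hidden layer, a linear output, a squared-error loss, and a single made-up training example, and all names (`train_step`, `w_in_hidden`, `w_hidden_out`) are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, y, w_in_hidden, w_hidden_out, learning_rate):
    """One propagation + weight-update cycle for a single example."""
    # Phase 1a: forward propagation.
    hidden_out = sigmoid(x @ w_in_hidden)   # one value per hidden unit
    prediction = hidden_out @ w_hidden_out  # linear output, a single number

    # Phase 1b: backward propagation of the error.
    output_error = y - prediction
    # The output error is scaled by each hidden unit's outgoing weight.
    hidden_error = output_error * w_hidden_out
    # Multiply by the sigmoid derivative to get the hidden error term.
    hidden_term = hidden_error * hidden_out * (1.0 - hidden_out)

    # Phase 2: weight update (a gradient-descent step).
    w_hidden_out = w_hidden_out + learning_rate * output_error * hidden_out
    w_in_hidden = w_in_hidden + learning_rate * np.outer(x, hidden_term)
    return w_in_hidden, w_hidden_out, output_error

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.1])  # made-up inputs
y = 0.4                         # made-up target
w_in_hidden = rng.normal(0.0, 0.5, size=(3, 2))  # random starting weights
w_hidden_out = rng.normal(0.0, 0.5, size=2)

errors = []
for _ in range(200):
    w_in_hidden, w_hidden_out, err = train_step(
        x, y, w_in_hidden, w_hidden_out, learning_rate=0.3
    )
    errors.append(abs(err))
print(errors[0], errors[-1])
```

Note that `hidden_error` is computed with the weights as they were before the update, which is why it is calculated ahead of the weight-update phase.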
You may find the source code of this demo in this repo.