The gradient descent algorithm is a first-order optimization algorithm used to find a local minimum of a differentiable function (its mirror image, gradient ascent, finds a local maximum).
In simpler words, it is a technique primarily used to minimize the cost/loss function of machine learning algorithms such as neural networks and regression.
The algorithm iteratively updates the parameters of the machine learning model to minimize the error between the actual and predicted values.
The resulting optimized models are used as practical tools in machine learning, deep learning, and data science projects.
Types of Gradient Descent Algorithm
There are three types of gradient descent algorithms in machine learning model optimization.
1. Batch Gradient Descent
Batch gradient descent is suited to datasets that are small enough to be processed in full. It processes the entire set of training data for each iteration of gradient descent: the gradients of all training samples are averaged, and that mean gradient is used to update the parameters. There is therefore only one gradient descent step per epoch. For convex cost functions, convergence to the global minimum is guaranteed, but each step can be very slow for larger datasets.
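As a minimal sketch, assuming a one-parameter linear model y_predicted = weight * x and NumPy arrays for the data (function and variable names here are illustrative, not from any particular library), a single batch gradient descent step could look like this:

import numpy as np

def batch_gd_step(weight, X, y, learning_rate=0.01):
    # one epoch = one step: the gradient is averaged over ALL training samples
    y_predicted = weight * X
    gradient = np.mean(2 * X * (y_predicted - y))   # d(MSE)/d(weight) over the full dataset
    return weight - learning_rate * gradient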
2. Stochastic Gradient Descent
We use stochastic gradient descent (SGD) when we have large sets of data. This type of GD algorithm processes one training sample per iteration: only a single sample is used to compute the gradient, and the weights are updated after every iteration, which makes each step much cheaper than in batch gradient descent.
However, it does not guarantee smooth convergence, since the updates can oscillate around the minimum. Because only one example is considered at a time, the cost fluctuates from one training example to the next rather than decreasing monotonically; over time, however, the cost trends downward.
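A rough sketch of one SGD epoch, under the same illustrative linear-model assumptions as above:

import numpy as np

def sgd_epoch(weight, X, y, learning_rate=0.01):
    # one epoch: shuffle the data, then update the weight after EVERY single sample
    for i in np.random.permutation(len(X)):
        y_predicted = weight * X[i]
        gradient = 2 * X[i] * (y_predicted - y[i])  # gradient from one sample only
        weight = weight - learning_rate * gradient
    return weight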
3. Mini Batch Gradient Descent
The mini-batch GD algorithm is a mixture of the other two, and in practice it is often more efficient than either. We choose a fixed-size batch of training examples, smaller than the full dataset, and refer to it as a mini-batch.
If x samples are processed per iteration, x must be less than the total number of training samples; when x equals the total number of training samples, the algorithm behaves exactly like batch gradient descent.
Just like Stochastic Gradient Descent, the average cost fluctuates over time as we are taking an average of a small number of samples per iteration.
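A minimal sketch of one mini-batch epoch, again assuming the illustrative linear model and NumPy arrays:

import numpy as np

def mini_batch_gd_epoch(weight, X, y, batch_size=32, learning_rate=0.01):
    # one epoch: shuffle, then update once per mini-batch of batch_size samples
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        y_predicted = weight * X[batch]
        gradient = np.mean(2 * X[batch] * (y_predicted - y[batch]))  # averaged over the mini-batch only
        weight = weight - learning_rate * gradient
    return weight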
Gradient Descent Algorithm in Action: Steps to Follow
Let’s suppose:
Cost function (mean squared error): C = Σ(y_predicted - y_actual)^2 / n, where n is the number of samples (this matches the code in step 2 below).
1. We initialize the weights randomly.
2. We calculate the cost.
import numpy as np

def Mean_Squared_Error(y_actual, y_predicted):
    # mean of the squared differences between actual and predicted values
    cost = np.sum((y_actual - y_predicted) ** 2) / len(y_actual)
    return cost
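For example, calling the function above with small illustrative arrays:

y_actual = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.1, 1.9, 3.2])
print(Mean_Squared_Error(y_actual, y_predicted))   # ≈ 0.02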
3. Calculate the change in cost between iterations. If it is still greater than the stopping threshold, we keep updating the weights; otherwise, we stop.
def gradient_descent(DataSet1, DataSet2, NoOfIterations=1000,
                     learning_rate=0.0001, stopping_threshold=1e-6):
    curr_weight = 0.1                      # step 1: initial weight
    n = float(len(DataSet1))
    costs = []
    weights = []
    prev_cost = None
    for i in range(NoOfIterations):
        y_predicted = curr_weight * DataSet1
        curr_cost = Mean_Squared_Error(DataSet2, y_predicted)
        # stop once the change in cost drops below the stopping threshold
        if prev_cost is not None and abs(prev_cost - curr_cost) <= stopping_threshold:
            break
        prev_cost = curr_cost
        costs.append(curr_cost)
        weights.append(curr_weight)
        # the weight update itself is derived in steps 4-6 below
Within each iteration, the weight is then updated with the gradient descent rule:

New Weight = W - α (d(C) / d(W))

where,
- C – the cost
- W – the old weight
- α – the learning rate (typically between 0.0 and 1.0)
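For example, with purely illustrative values W = 0.5, α = 0.01, and d(C)/d(W) = 2.0, the new weight would be 0.5 - 0.01 × 2.0 = 0.48.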
4. Derive the partial derivatives of the cost with respect to the weight using the chain rule (assuming a sigmoid output unit):
- d(C) / d(y_predicted) = (y_predicted - y_actual)   (constant factors from the mean are absorbed into the learning rate)
- d(y_predicted) / d(z) = y_predicted (1 - y_predicted)
- d(z) / d(W) = I1
where I1 is input 1, the input connected to the weight W, and z is the weighted input to the output unit.
5. Calculate their values.
6. Multiply the values together (chain rule), and substitute the result into the above equation to calculate the new weight.
curr_weight = curr_weight - (learning_rate * weight_derivative)
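For illustration, assuming the sigmoid-output case from step 4 and purely made-up values, the weight_derivative used above could be computed like this:

y_actual, y_predicted, I1 = 1.0, 0.7, 0.5        # illustrative values only
dC_dy = y_predicted - y_actual                    # d(C)/d(y_predicted) = -0.3
dy_dz = y_predicted * (1 - y_predicted)           # sigmoid derivative = 0.21
dz_dW = I1                                        # the input connected to the weight
weight_derivative = dC_dy * dy_dz * dz_dW         # -0.3 * 0.21 * 0.5 = -0.0315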
7. If the change in cost is still greater than the stopping threshold, repeat the process from step 2.
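Putting the steps together, here is one possible end-to-end sketch for a one-parameter linear model (y_predicted = weight · x). The gradient line uses the MSE derivative for this linear model rather than the sigmoid case from step 4, and the data and learning rate in the usage example are illustrative choices so that the toy example converges quickly:

import numpy as np

def Mean_Squared_Error(y_actual, y_predicted):
    return np.sum((y_actual - y_predicted) ** 2) / len(y_actual)

def gradient_descent(X, y, NoOfIterations=1000, learning_rate=0.0001, stopping_threshold=1e-6):
    curr_weight = 0.1                                    # step 1: initial weight
    prev_cost = None
    for i in range(NoOfIterations):
        y_predicted = curr_weight * X                    # step 2: predict and compute the cost
        curr_cost = Mean_Squared_Error(y, y_predicted)
        if prev_cost is not None and abs(prev_cost - curr_cost) <= stopping_threshold:
            break                                        # steps 3/7: stop once the cost barely changes
        prev_cost = curr_cost
        weight_derivative = np.mean(2 * X * (y_predicted - y))            # steps 4-5: d(MSE)/d(weight)
        curr_weight = curr_weight - learning_rate * weight_derivative     # step 6: update the weight
    return curr_weight

# illustrative usage: data generated from y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
print(gradient_descent(X, y, learning_rate=0.01))   # ≈ 2.0, close to the true slope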