K-means clustering is a popular unsupervised clustering algorithm that has many applications in the real world. It works by initializing
k centers, where k is chosen by the user. The distance from every point to each of these centers is then calculated, and every point is assigned to the center closest to it.
New centers are formed at each iteration by taking the mean of all points assigned to a particular cluster. The algorithm stops when no points move between clusters. scikit-learn provides a built-in function, k_means, to run this algorithm.
The k-means algorithm is widely used in applications that need to group similar items together. For example, it can group similar TV shows and movies on streaming platforms like Netflix, and those groups can then be used in the recommendation system.
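The assign-and-update loop described above can be sketched in plain NumPy. This is only an illustrative toy, not scikit-learn's implementation: the function name `lloyd_kmeans`, the `seed` argument, and the choice to start from k randomly picked data points are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Toy k-means: assign points to the nearest center, then move centers to cluster means."""
    rng = np.random.default_rng(seed)
    # assumption: initialize the k centers as k randomly chosen data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # distance of every point from every center, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when no point moved to a different cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # new center = mean of the points assigned to it (left unchanged if the cluster is empty)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)
centers, labels = lloyd_kmeans(X, k=3)
print(centers)
print(labels)
```

The result depends on the starting centers, which is one reason the library function offers several initialization options and repeated runs.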
sklearn.cluster.k_means(X, n_clusters, *, sample_weight=None, init='k-means++', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, algorithm='lloyd', return_n_iter=False)
This function takes the following arguments:
- X(array): Observations to be clustered.
- n_clusters(int): Defines the number of clusters.
- sample_weight(array): Defines the weight given to each observation. If set to None, all observations get equal weights.
- init(string): Defines how the initial centroids are generated. It can be either 'k-means++' (picks the initial centroids by sampling from an empirical probability distribution over the points) or 'random' (chooses the initial centroids at random from the observations).
- n_init(int): Defines how many times the k-means algorithm will run with different initial centers. The run with the lowest inertia is the one returned.
- max_iter(int): Defines the upper limit on the number of iterations.
- tol(float): Defines the tolerance level. The algorithm is considered converged, and stops, when the relative change in the cluster centers between two consecutive iterations (measured with the Frobenius norm) is less than tol. Default value is 0.0001.
- verbose(bool): If set to True, progress messages are printed while the algorithm runs, which helps to see how the training is progressing.
- random_state(int): An integer fixes the random number generation used for centroid initialization, which makes the results reproducible. None can also be given as an argument.
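- copy_x(bool): The data is centered before distances are computed because this is more numerically accurate. If set to True (the default), the original data is not modified. If set to False, the original data is modified in place and restored before the function returns, although small numerical differences may be introduced.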
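- algorithm(string): Can be 'lloyd' (the default) or 'elkan'. The older names 'auto' and 'full' were deprecated in scikit-learn 1.1 in favor of 'lloyd'. Each works best in different circumstances.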
- return_n_iter(bool): Returns the number of iterations if set to True.
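A few of these parameters are easier to see in action. The sketch below uses small made-up arrays and an arbitrary seed of 42 (both are assumptions for illustration) to show what sample_weight, random_state, and return_n_iter do.

```python
import numpy as np
from sklearn.cluster import k_means

# sample_weight: with a single cluster, the center is the weighted mean of the points
X_small = np.array([[0.0], [10.0]])
centroid, label, inertia = k_means(
    X_small, n_clusters=1, sample_weight=np.array([1.0, 3.0]),
    n_init=10, random_state=0)
print(centroid)  # close to [[7.5]], i.e. (0*1 + 10*3) / 4

points = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
                   [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)

# random_state: the same seed should reproduce the same cluster labels
run_a = k_means(points, n_clusters=3, init='random', n_init=10, random_state=42)
run_b = k_means(points, n_clusters=3, init='random', n_init=10, random_state=42)
print(np.array_equal(run_a[1], run_b[1]))

# return_n_iter: adds a fourth element to the returned tuple
run_c = k_means(points, n_clusters=3, init='random', n_init=10,
                random_state=42, return_n_iter=True)
print(len(run_a), len(run_c))  # 3 4
```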
The k_means function returns the following values:
- centroid, ndarray(n_clusters, n_features): Gives the centers found at the last iteration of the algorithm.
- label, ndarray(n_samples): Gives the cluster number of every observation.
- inertia, float: Sum of the squared distances between every point and its assigned (closest) center.
- best_n_iter, int: If the return_n_iter parameter is True, the number of iterations of the best run is also returned.
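To make the return values concrete, the following sketch unpacks them and recomputes the inertia by hand from the centers and labels. The data points and the seed are arbitrary choices for illustration; the recomputed value should agree with the returned inertia up to floating-point error.

```python
import numpy as np
from sklearn.cluster import k_means

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)

centroid, label, inertia, best_n_iter = k_means(
    X, n_clusters=3, n_init=10, random_state=0, return_n_iter=True)

# squared Euclidean distance from each point to the center it was assigned to
squared_dist = ((X - centroid[label]) ** 2).sum(axis=1)
recomputed_inertia = squared_dist.sum()

print(centroid.shape)   # (3, 2): one row per cluster
print(label.shape)      # (10,): one label per observation
print(inertia, recomputed_inertia)
print(best_n_iter)
```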
The following code example shows how to use the k_means function:
    from sklearn.cluster import k_means
    import numpy as np
    # defining the points on which the algorithm is to be run
    points = [[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]]
    X = np.array(points)
    # fitting the model; k_means returns (centroid, label, inertia, best_n_iter)
    model = k_means(X, n_clusters=3, init='random', n_init=10,
                    max_iter=10, tol=0.0002, algorithm='lloyd',
                    return_n_iter=True)

    # printing the centres
    print("Centres: ", "\n", model[0])
    # printing the cluster label of each observation
    print("Classification: ", model[1])
    # printing the inertia
    print("Inertia", model[2])
    # printing the number of iterations used by the best run
    print("Iterations: ", model[3])
- Line 1: Importing the k_means function.
- Line 2: Importing NumPy for the conversion of data.
- Lines 4-5: Defining the points on which k-means clustering is to be applied.
- Line 6: Converting the list to a NumPy array.
- Lines 8-10: Calling k_means. Parameters are set according to needs; any parameters mentioned above that are not specified keep their default values.
- Lines 13-19: Printing the return values to check the results.
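As a possible follow-up, the labels can be used to collect the observations that belong to each cluster. This is a minimal sketch on the same ten points; the `clusters` dictionary and the fixed random_state are additions for illustration only.

```python
import numpy as np
from sklearn.cluster import k_means

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)

centroid, label, inertia, n_iter = k_means(
    X, n_clusters=3, init='random', n_init=10, max_iter=10,
    tol=0.0002, random_state=0, return_n_iter=True)

# map each cluster number to the observations assigned to it
clusters = {int(c): X[label == c] for c in np.unique(label)}
for c, members in clusters.items():
    print(f"cluster {c}: center {centroid[c]}, {len(members)} points")
```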