How to perform K-means clustering using cluster.k_means() in scikit-learn?

K-means clustering is a popular unsupervised clustering algorithm that has many applications in the real world. It works by initializing k centers, where k is chosen by the user. The distance from every point to each of these centers is then calculated, and every point is assigned to the center closest to it.

New centers are formed at each iteration by taking the mean of all points assigned to a particular cluster. The algorithm stops when no points move between clusters or, in practice, when the centers change by less than a tolerance or an iteration limit is reached. scikit-learn provides a built-in function to run this algorithm.
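
The assign-and-update loop described above can be sketched in plain NumPy. This is a minimal illustration and not scikit-learn's implementation: the helper name lloyd_kmeans, its seed argument, and the choice of picking random observations as starting centers are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Toy k-means returning (centers, labels). Hypothetical helper, not scikit-learn code."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centers by picking k distinct observations at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each center becomes the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)
centers, labels = lloyd_kmeans(X, k=3)
print("Centers:\n", centers)
print("Labels:", labels)
```

The result depends on the starting centers, which is why the library function runs several initialisations and keeps the best one.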

The K-means algorithm is widely used in applications that require grouping similar items together. For example, it can be used to group similar TV shows and movies on streaming platforms like Netflix, and these groups can then be used by the recommendation system.

Syntax

sklearn.cluster.k_means(
    X,
    n_clusters,
    *,
    sample_weight=None,
    init='k-means++',
    n_init=10,
    max_iter=300,
    verbose=False,
    tol=0.0001,
    random_state=None,
    copy_x=True,
    algorithm='lloyd',
    return_n_iter=False)

Parameters

This function takes the following arguments:

  • X(array): Observations to be clustered. 
  • n_clusters(int): Defines the number of clusters.
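  • sample_weight(array): Defines the weight given to each observation. If set to None, all observations get equal weight.
  • init(string): Defines how the initial centroids are generated. It can be either 'k-means++' (picks initial centroids by sampling observations with a probability based on their distance from the centroids already chosen, which usually speeds up convergence) or 'random' (picks the initial centroids at random from the observations). An array of initial centers or a callable is also accepted.
  • n_init(int): Defines how many times the k-means algorithm will run with different initial centroids. The run with the lowest inertia is returned.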
  • max_iter(int): Defines the upper limit on the number of iterations.
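  • tol(float): Defines the relative tolerance used to declare convergence. The algorithm stops when the change in the cluster centers between two consecutive iterations (measured by the Frobenius norm of their difference) falls below this tolerance. Default value is 0.0001.
  • verbose(bool): If set to True, progress messages are printed while the algorithm runs.
  • random_state(int): Seeds the random number generator used for centroid initialization. Passing an integer makes the results reproducible. None can also be given, in which case results may differ between runs.
  • copy_x(bool): If set to True (the default), the original data is not modified. If set to False, the data is centered in place before the algorithm runs and restored afterwards, which may introduce small numerical differences.
  • algorithm(string): Can be 'lloyd' (the classic algorithm) or 'elkan' (uses the triangle inequality and can be faster on well-separated clusters, at the cost of more memory). The older values 'auto' and 'full' were deprecated in scikit-learn 1.1 and are no longer accepted by later releases.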
  • return_n_iter(bool): Returns the number of iterations if set true.
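
As a small sketch of how some of these parameters behave, the snippet below fixes random_state to get repeatable results and passes a sample_weight array. The data, the seed 42, and the weight of 5.0 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.cluster import k_means

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)

# Two runs with the same random_state should give identical results.
run_a = k_means(X, n_clusters=3, init="k-means++", n_init=10, random_state=42)
run_b = k_means(X, n_clusters=3, init="k-means++", n_init=10, random_state=42)
same_labels = np.array_equal(run_a[1], run_b[1])
print("Same labels on both runs:", same_labels)

# Give one observation (index 6, the point [10, 4]) five times the weight of the others.
weights = np.ones(len(X))
weights[6] = 5.0
weighted = k_means(X, n_clusters=3, n_init=10, random_state=42,
                   sample_weight=weights)
print("Centers with weights:\n", weighted[0])
```

Every argument after n_clusters is passed by keyword here, which keeps the call readable and independent of argument order.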

Return Values

The k_means() function returns the following values:

  • centroid, ndarray(n_clusters, n_features): The final cluster centers from the best run of the algorithm.
  • label, ndarray(n_samples): The index of the cluster that each observation is assigned to.
  • inertia, float: The sum of squared distances from every observation to its closest center.
  • best_n_iter, int: The number of iterations of the best run. It is returned only if the return_n_iter parameter is set to True.
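
To make the return values concrete, the sketch below unpacks the tuple and recomputes the inertia by hand as the sum of squared distances from each observation to its assigned center. The variable names and the seed are illustrative choices, not part of the library.

```python
import numpy as np
from sklearn.cluster import k_means

X = np.array([[1, 2], [0, 5], [3, 5], [9, 1], [4, 7],
              [3, 5], [10, 4], [6, 2], [3, 1], [8, 1]], dtype=float)

# With return_n_iter=True the function returns four values instead of three.
centroids, labels, inertia, best_n_iter = k_means(
    X, n_clusters=3, n_init=10, random_state=0, return_n_iter=True)

# Recompute the inertia manually: squared distance of each point to its own center.
manual_inertia = ((X - centroids[labels]) ** 2).sum()

print("Centroid array shape:", centroids.shape)
print("Label array shape:", labels.shape)
print("Inertia matches manual sum:", np.isclose(inertia, manual_inertia))
print("Iterations of the best run:", best_n_iter)
```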

Explanation

The following code example shows how to use the sklearn.cluster.k_means() function.

from sklearn.cluster import k_means
import numpy as np
# defining the points on which the algorithm is to be run
points = [[1,2], [0,5], [3,5], [9,1], [4,7], [3,5], [10,4], [6,2], [3,1], [8,1]]
X = np.array(points)
# running k-means; the result is a tuple (centres, labels, inertia, iterations)
model = k_means(X, n_clusters=3, init='random', n_init=10, max_iter=10, tol=0.0002, algorithm='lloyd', return_n_iter=True)
# printing the centres
print("Centres: ", "\n", model[0])
# printing the cluster label assigned to each observation
print("Cluster labels: ", model[1])
# printing the inertia
print("Inertia: ", model[2])
# printing the number of iterations of the best run
print("Iterations: ", model[3])
  • Line 1: Importing the k_means() function from the sklearn library.
  • Line 2: Importing NumPy for the conversion of data.
  • Line 4: Defining the points on which k-means clustering is to be applied.
  • Line 5: Converting the list to a NumPy array.
  • Line 7: Running k_means() and storing the returned tuple in model. Only some of the parameters described above are set explicitly; the rest keep their default values.
  • Lines 8-15: Printing the return values to check the results.

Output

[Figure: K-means clustering using cluster.k_means()]
