cluster.ward_tree() is a function in the scikit-learn library, located in the sklearn.cluster module. It performs hierarchical clustering using Ward’s linkage algorithm and returns the resulting merge tree: an array listing the two children of every merged node, together with the number of connected components, the number of leaves, and the parent of each node.
To understand Ward’s linkage algorithm, it helps to first know what hierarchical clustering is.
Hierarchical clustering is an unsupervised learning method that groups similar data points into a hierarchy of nested clusters. In its agglomerative form, every data point starts as its own cluster and the two closest clusters are merged repeatedly until one cluster remains. Ward’s linkage is one of the criteria used to measure the dissimilarity between clusters during this process: at each step it merges the pair of clusters whose union causes the smallest increase in within-cluster variance.
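To make the merging criterion concrete, here is a minimal sketch that computes the increase in within-cluster sum of squares for candidate merges. The helper names (within_cluster_ss, ward_merge_cost, ward_closed_form) and the three tiny clusters are made up for illustration and are not part of scikit-learn.

```python
import numpy as np


def within_cluster_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())


def ward_merge_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    merged = np.vstack([a, b])
    return within_cluster_ss(merged) - within_cluster_ss(a) - within_cluster_ss(b)


def ward_closed_form(a, b):
    """Same quantity written in terms of cluster sizes and centroids."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return len(a) * len(b) / (len(a) + len(b)) * float(diff @ diff)


# Three small hypothetical clusters in 2-D
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0], [1.0, 1.0]])
c = np.array([[10.0, 10.0], [11.0, 10.0]])

cost_ab = ward_merge_cost(a, b)
cost_ac = ward_merge_cost(a, c)
cost_bc = ward_merge_cost(b, c)

# Ward's criterion picks the merge with the smallest increase
print({"a+b": cost_ab, "a+c": cost_ac, "b+c": cost_bc})
```

Here the clusters a and b sit next to each other, so merging them is far cheaper than merging either one with c, and Ward’s criterion would merge a and b first.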
Parameters
The cluster.ward_tree function takes the following parameters:
- X: The input data, an array-like of shape (n_samples, n_features).
- connectivity (optional): A connectivity matrix, given as an array-like or a sparse matrix, that defines the neighbors of each sample. When it is provided, only neighboring clusters can be merged, which can influence the clustering result. The default value is None, meaning the clustering is unstructured and any two clusters may be merged.
- n_clusters (optional): An integer at which construction of the tree stops early. This saves computation time when the number of clusters is not small compared to the number of samples. Early stopping only applies when connectivity is provided. The default value is None, which builds the full tree.
- return_distance (optional): A boolean value indicating whether to also return the distance between the clusters merged at each step. The default value is False.
All arguments after X must be passed by keyword. Options such as compute_full_tree, distance_threshold, and memory belong to the AgglomerativeClustering estimator, which calls ward_tree internally; they are not arguments of ward_tree itself.
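As an illustration of the connectivity argument, the sketch below builds a nearest-neighbor graph with kneighbors_graph and stops the tree early at three clusters. The dataset size, the choice of 10 neighbors, and the cluster count are arbitrary assumptions made for this example. If the neighbor graph happens to be disconnected, scikit-learn may emit a warning and complete it automatically.

```python
from sklearn.cluster import ward_tree
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# Small synthetic dataset (sizes chosen only for illustration)
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
n_samples = X.shape[0]

# Sparse graph linking each sample to its 10 nearest neighbors
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# Structured Ward clustering, stopped once 3 clusters remain
children, n_components, n_leaves, parents, distances = ward_tree(
    X, connectivity=connectivity, n_clusters=3, return_distance=True
)

print(children.shape)  # one row per merge that was performed
print(parents.shape)   # one entry per node in the partial tree
```

Because the tree is stopped early, the children array is incomplete, and the parents array is usually the more convenient output to work with in this setting.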
Return Values
It returns a tuple containing the following values:
- children: An array listing the two children of every merged (non-leaf) node. For a full tree it has shape (n_samples - 1, 2), where n_samples is the number of data points. Each row represents a merging step, and its two columns hold the indices of the merged nodes. Indices below n_samples refer to the original data points, and row i creates the new node with index n_samples + i. Unlike a SciPy linkage matrix, this array does not contain merge distances or cluster sizes.
- n_connected_components: The number of connected components in the connectivity matrix or graph. If the connectivity parameter is not provided, it is equal to 1.
- n_leaves: The number of leaves in the hierarchical clustering tree, which is equal to the number of input data points.
- parents: An array representing the hierarchical structure of the clustering, in which the element at index i is the parent of the node with index i. For a full tree it has shape (2 * n_samples - 1,), with the first n_samples indices corresponding to the original data points. It is only computed when a connectivity matrix is provided; otherwise None is returned in its place.
- distances: An array containing the distances between clusters at the merging steps. It is returned only when return_distance is True. For a full tree it has shape (n_samples - 1,), and the distance at index i is the distance between the nodes with indices children[i, 0] and children[i, 1] at the moment they are merged.
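The return values can be inspected directly on a tiny input. The following sketch uses eight random 2-D points (an arbitrary choice for illustration) and no connectivity matrix, so it shows the unstructured case.

```python
import numpy as np
from sklearn.cluster import ward_tree

# Eight random points in 2-D, purely for illustration
rng = np.random.RandomState(0)
X = rng.rand(8, 2)

# return_distance=True adds a fifth element to the returned tuple
children, n_components, n_leaves, parents, distances = ward_tree(
    X, return_distance=True
)

print("children shape:", children.shape)
print("connected components:", n_components)
print("leaves:", n_leaves)
print("parents:", parents)
print("distances shape:", distances.shape)
```

Without return_distance=True the call returns only the first four values, so unpacking into five names would fail.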
Coding Example
Here’s a worked coding example using cluster.ward_tree on a synthetic dataset, with step-by-step explanations:
Step 1: Import the required libraries and load the dataset
from sklearn.cluster import ward_tree
from sklearn.datasets import make_blobs
# Generate a random dataset
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
Step 2: Perform hierarchical clustering using ward_tree
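# ward_tree returns a tuple; parents is None because no connectivity is given
children, n_components, n_leaves, parents = ward_tree(X)
Step 3: Analyze the returned tree
The children array returned by ward_tree represents the hierarchical clustering. Every row describes one merging step, and its two columns hold the indices of the two nodes that were merged. Row i creates a new node with index n_leaves + i. Merge distances are not part of this array; they are only returned when return_distance=True is passed.
Step 4: Visualize the ward tree
A dendrogram is a visual representation of the hierarchical clustering, and the SciPy library can plot one once the children are converted into a SciPy linkage matrix. As a first step, the tree can be listed as text by numbering each new node, starting from n_leaves:
import itertools
# New nodes are numbered from n_leaves upward, one per row of children
count = itertools.count(n_leaves)
[{'node#': next(count), 'left_node': x[0], 'right_node': x[1]} for x in children]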
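In an interactive session, the above code snippet displays the ward tree as a list of dictionaries, one per merging step, each giving the id of the new node and the ids of its left and right children.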
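To go from this listing to an actual dendrogram, the output of ward_tree has to be turned into the four-column linkage matrix that SciPy expects. The sketch below is one way to do that, assuming the same make_blobs dataset as above; the counts and linkage_matrix variables are names chosen for this example. It passes no_plot=True so that it runs without matplotlib; removing that argument draws the plot when matplotlib is installed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import ward_tree
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Ask for the merge distances as well; parents is None without connectivity
children, _, n_leaves, _, distances = ward_tree(X, return_distance=True)

# Count how many original samples sit under each merged node
counts = np.zeros(children.shape[0])
for i, (left, right) in enumerate(children):
    for child in (left, right):
        if child < n_leaves:
            counts[i] += 1  # a leaf contributes one sample
        else:
            counts[i] += counts[child - n_leaves]  # an earlier merge

# SciPy linkage format: [left child, right child, distance, sample count]
linkage_matrix = np.column_stack([children, distances, counts]).astype(float)

# Compute the dendrogram layout without drawing it
tree = dendrogram(linkage_matrix, no_plot=True)
print(len(tree["leaves"]), "leaves in the dendrogram")
```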
