
How To Find Documents That Are In The Same Cluster With Kmeans

I have clustered various articles together with the Scikit-learn framework. Below are the top 15 words in each cluster: Cluster 0: whales islands seaworld hurricane whale odile sto

Solution 1:

You can use the fit_predict() function to perform the clustering and obtain the indices of the resulting clusters.

Obtaining the cluster index of every document

You can try the following:

from sklearn.cluster import KMeans

km = KMeans(init='k-means++', max_iter=10000, n_init=1,
            verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)

# The input data has shape (m, n); the clusters array has shape (m,) and holds the cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)

# Example: get all documents in cluster 0
import numpy as np

cluster_0 = np.where(clusters == 0)[0]
# cluster_0 now contains the row indices of all documents in this cluster; to get their rows of the tf-idf matrix:
X_cluster_0 = X_train_tfidf[cluster_0]
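If you want the members of every cluster rather than just cluster 0, one option is to build a mapping from cluster index to document indices. The following is only a sketch: the four-sentence corpus, the choice of 2 clusters, and the name docs_by_cluster are made up for illustration and are not from the question.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real articles
docs = [
    "whales swim near the islands",
    "a whale was seen at seaworld",
    "the hurricane hit the coast",
    "hurricane odile caused flooding",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # one cluster index per document

# Map each cluster index to the list of document indices assigned to it
docs_by_cluster = defaultdict(list)
for doc_idx, label in enumerate(labels):
    docs_by_cluster[int(label)].append(doc_idx)

for label, indices in sorted(docs_by_cluster.items()):
    print(label, [docs[i] for i in indices])
```

Which documents end up together depends on the data and the initialization, so treat the printed grouping as illustrative only.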

Finding the distance of each document to each centroid

You can get the centroids with centroids = km.cluster_centers_, which in your case should have shape 25 (number of clusters) x n (number of features). To calculate, for example, the Euclidean distance of a document to a centroid, you can use SciPy (see the scipy.spatial.distance documentation for the other available distance metrics):

# Example: distance from one document to one cluster centroid
# (this assumes X_cluster_0 is a dense array)
from scipy.spatial.distance import euclidean

distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)

Update: Distances with Sparse & Dense matrices

The distance metrics in scipy.spatial.distance require dense inputs, so if X_cluster_0 is a sparse matrix you can convert the row to a dense array first:

d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])  # .toarray() is the explicit form of the .A shorthand
print(d)

Alternatively, you could use scikit-learn's euclidean_distances() function, which also works with sparse matrices:

from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[0].reshape(1, -1))
# This is equivalent to the SciPy example above, but note that euclidean_distances
# expects 2-D inputs and returns a matrix (here 1 x 1), not a scalar
print(D)

Note that with the scikit-learn function you can also calculate the whole distance matrix at once:

D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
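A fitted KMeans object can also produce this document-to-centroid distance matrix itself through its transform() method, which makes it easy to rank the members of a cluster by how close they are to its centroid. The sketch below assumes a made-up four-sentence corpus and 2 clusters; the names members and ranked are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical toy corpus standing in for the real articles
docs = [
    "whales swim near the islands",
    "a whale was seen at seaworld",
    "the hurricane hit the coast",
    "hurricane odile caused flooding",
]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns an (n_documents, n_clusters) matrix of distances to the centroids
D = km.transform(X)

# Row indices of the documents in cluster 0, ordered from closest to farthest
# from that cluster's centroid
members = np.where(km.labels_ == 0)[0]
ranked = members[np.argsort(D[members, 0])]

for doc_idx in ranked:
    print(round(float(D[doc_idx, 0]), 3), docs[doc_idx])
```

The documents at the top of such a ranking are often the most representative of the cluster, which can help when you inspect what a cluster is about.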

Update: Structure and Type of X_cluster_0:

X_cluster_0 and X_train_tfidf are both sparse matrices (see the docs for scipy.sparse.csr_matrix).

The interpretation of a dump such as

(0, 13535)    0.115880661286
(0, 17447)    0.117608794277
(0, 44849)    0.414829246262
(0, 14574)    0.10214258736
.             .
.             .

would be as follows: (0, 13535) refers to document 0 and feature 13535, so row number 0 and column number 13535 in your bag of words matrix. The following floating point number 0.115880661286 represents the tf-idf score for that feature in the given document.
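To read the same (row, column) score triples programmatically instead of from a printed dump, you can look at the indices and data arrays of a CSR row. This is a minimal sketch on a small hand-made matrix; the values are invented and are not real tf-idf scores.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 5 matrix standing in for a tf-idf matrix
X = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.25, 0.0],
    [0.75, 0.0, 0.0, 0.0, 0.0],
]))

row = X.getrow(0)
# row.indices holds the column (feature) numbers of the stored entries,
# row.data holds the corresponding scores
entries = list(zip(row.indices.tolist(), row.data.tolist()))
for feature_idx, score in entries:
    print((0, feature_idx), score)
```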

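To find out the exact word, you could try hasher.get_feature_names()[13535] (check len(hasher.get_feature_names()) first to see how many features you have). In scikit-learn 1.0 and later the method is called get_feature_names_out(); get_feature_names() was removed in 1.2.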

If your corpus variable document_text_list is a list of lists, then the corresponding document would simply be document_text_list[0].
