How To Find Documents That Are In The Same Cluster With Kmeans
Solution 1:
You can use the fit_predict() function to perform the clustering and obtain the cluster index assigned to every document.
Obtaining the cluster index of every document
You can try the following:
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(init='k-means++', max_iter=10000, n_init=1,
            verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)
# Note that your input data has shape (m, n) and the clusters array has shape (m,);
# it contains the cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)
# Example to get all documents in cluster 0
cluster_0 = np.where(clusters == 0)[0]
# cluster_0 now contains the row indices of all documents in this cluster;
# to get the actual document vectors you'd do:
X_cluster_0 = X_train_tfidf[cluster_0]
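As a rough end-to-end illustration of the same idea, here is a minimal sketch that groups a tiny corpus by cluster. The corpus, the variable names (docs, docs_by_cluster) and the choice of 2 clusters are made up for the example and are not from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten play",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse matrix, one row per document
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)                  # one cluster index per document

# Map every cluster index to the documents assigned to it.
docs_by_cluster = {}
for cluster_id in np.unique(labels):
    member_rows = np.where(labels == cluster_id)[0]
    docs_by_cluster[int(cluster_id)] = [docs[i] for i in member_rows]

for cluster_id, members in docs_by_cluster.items():
    print(cluster_id, members)
```

Because the row order of X matches the order of docs, the indices returned by np.where can be used directly to index the original list.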
Finding the distance of each document to each centroid
You can get the centroids with centroids = km.cluster_centers_, which in your case should have shape 25 (number of clusters) x n (number of features). To calculate, e.g., the Euclidean distance of a document to a centroid you can use SciPy (see the scipy.spatial.distance documentation for the available distance metrics):
# Example: distance from one document to one cluster centroid
# (both arguments must be dense 1-D arrays)
from scipy.spatial.distance import euclidean
distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)
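If you want the distance of every document to every centroid, a different route from the SciPy call is KMeans.transform(), which returns exactly that matrix. The following is only a sketch on made-up dense 2-D data (the blobs and the random seed are assumptions for illustration); it also checks one entry against a hand-computed norm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Hypothetical dense data: two well-separated blobs of 2-D points.
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
               rng.normal(5.0, 0.1, size=(5, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# transform() returns an (n_samples, n_clusters) array of Euclidean
# distances from every sample to every centroid.
distances = km.transform(X)

# Distance of sample 0 to centroid 0, computed by hand for comparison.
by_hand = np.linalg.norm(X[0] - km.cluster_centers_[0])
print(distances[0, 0], by_hand)
```

transform() also accepts the sparse tf-idf matrix directly, so it can be a convenient alternative when you need all distances at once.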
Update: Distances with Sparse & Dense matrices
The distance metrics in scipy.spatial.distance require dense inputs, so if X_cluster_0 is a sparse matrix you could convert the row you need to a dense array:
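# Note the .toarray() on the row of X_cluster_0: it converts the sparse row
# to a dense array (older SciPy releases also offered the .A shorthand)
d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])
print(d)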
Alternatively you could use scikit-learn's euclidean_distances() function, which also works with sparse matrices:
from sklearn.metrics.pairwise import euclidean_distances
# This is the equivalent of the scipy example above; note, however, that
# euclidean_distances expects 2-D inputs and returns a matrix, not a scalar
D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[0:1])
print(D)
Note that with the scikit method you can also calculate the whole distance matrix at once:
D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
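To make the shape of that distance matrix concrete, here is a small sketch with hand-picked numbers. The matrix X and the two centroids are invented for the example; in your case they would be X_cluster_0 (or X_train_tfidf) and km.cluster_centers_.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical sparse "documents" (3 rows, 4 features) and 2 dense centroids.
X = csr_matrix(np.array([[1.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 2.0, 0.0],
                         [0.0, 3.0, 0.0, 4.0]]))
centroids = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 0.0, 0.0]])

# D[i, j] is the distance from document i to centroid j; X stays sparse.
D = euclidean_distances(X, centroids)
# Index of the closest centroid for every document.
nearest = D.argmin(axis=1)
print(D)
print(nearest)
```

Row i of D holds the distances of document i to every centroid, so an argmin along axis 1 gives the closest centroid per document.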
Update: Structure and Type of X_cluster_0
X_cluster_0 and X_train_tfidf are both sparse matrices (see the docs for scipy.sparse.csr_matrix).
The interpretation of a dump such as
(0, 13535) 0.115880661286
(0, 17447) 0.117608794277
(0, 44849) 0.414829246262
(0, 14574) 0.10214258736
. .
. .
would be as follows: (0, 13535) refers to document 0 and feature 13535, i.e. row number 0 and column number 13535 in your bag-of-words matrix. The floating point number that follows, 0.115880661286, is the tf-idf score for that feature in the given document.
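The same (row, column) value reading can be reproduced on a small matrix. This is a sketch with invented numbers; the exact layout that print() produces for a sparse matrix varies between SciPy versions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2 x 6 tf-idf-like matrix; most entries are zero.
dense = np.array([[0.0, 0.5, 0.0, 0.0, 0.25, 0.0],
                  [0.7, 0.0, 0.0, 0.1, 0.0, 0.0]])
X = csr_matrix(dense)

print(X)  # lists the stored non-zero entries as (row, column) value

# Walk the stored entries of row 0 explicitly.
row = X.getrow(0)
entries = [(int(col), float(val)) for col, val in zip(row.indices, row.data)]
for col, val in entries:
    print(f"document 0, feature {col}: score {val}")
```

Only the non-zero entries are stored, which is why the dump lists a handful of features per document rather than all n columns.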
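To find out the exact word you could try hasher.get_feature_names_out()[13535] (get_feature_names() in older scikit-learn releases); check len(hasher.get_feature_names_out()) first to see how many features you have. This only works if hasher is a vectorizer that stores its vocabulary, such as CountVectorizer or TfidfVectorizer; a HashingVectorizer cannot map column indices back to words.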
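A version-independent way to do the same lookup is to invert the vectorizer's vocabulary_ dictionary. The sketch below assumes a TfidfVectorizer and a made-up three-document corpus; neither is taken from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus and vectorizer; the names are for illustration only.
docs = ["red apple", "green apple", "red car"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index; invert it to look terms up by column.
index_to_term = {int(index): term for term, index in vectorizer.vocabulary_.items()}

print(X.shape[1], "features")
row = X.getrow(0)
# Terms that actually occur in document 0, recovered from its column indices.
terms_in_doc_0 = sorted(index_to_term[int(col)] for col in row.indices)
print(terms_in_doc_0)
```

The inverted dictionary plays the same role as the list of feature names: index 13535 in your matrix would be looked up as index_to_term[13535].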
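If your corpus variable document_text_list is a list of lists, then the corresponding document would simply be document_text_list[0]. Keep in mind that the row numbers in X_cluster_0 count only the documents in that cluster; to get back to the original corpus, look up the row number in cluster_0 first.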