
Output Top 2 Classes From A Multiclass Classification Algorithm

I am working on a multiclass classification problem for text, where I have a lot of different classes (15+). I have trained a LinearSVC SVM model (the method is just an example). How can I output the top 2 predicted classes for each sample instead of only the single best one?

Solution 1:

LinearSVC does not provide predict_proba, but it does provide decision_function, which gives the signed distance from the hyperplane.

From the documentation:

decision_function(self, X):

Predict confidence scores for samples.

The confidence score for a sample is the signed distance of that sample to the hyperplane.
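In the multiclass case, decision_function returns one score per class, so its output has shape (n_samples, n_classes). A minimal sketch to illustrate (the toy dataset here is made up, not from the question):

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# toy 5-class dataset, purely for illustration
X, y = make_classification(n_samples=100, n_informative=10,
                           n_classes=5, n_clusters_per_class=1)
clf = LinearSVC().fit(X, y)

scores = clf.decision_function(X)
print(scores.shape)  # (100, 5): one signed distance per class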

Based on @warped's comments, we can use the decision_function output to find the top n predicted classes from the model.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# toy multiclass dataset
X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
clf.fit(X_train, y_train)  # fit on the training split only

# sort the decision scores per row and keep the indices of the
# top n classes, highest score first
top_n_classes = 2
predictions = clf.decision_function(
                    X_test).argsort()[:, -top_n_classes:][:, ::-1]
pred_df = pd.DataFrame(predictions,
                       columns=[f'{i+1}_pred' for i in range(top_n_classes)])

df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)

df
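If you want to measure how often the true class lands in the top 2, you can check membership row by row. A small follow-up sketch, reusing the predictions array and y_test from the code above:

import numpy as np

# fraction of test samples whose true class is among the top-2 predictions
top_2_accuracy = np.mean([yt in row for yt, row in zip(y_test, predictions)])
print(f'top-2 accuracy: {top_2_accuracy:.3f}')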

Solution 2:

LinearSVC has a method called decision_function, which gives confidence scores for the individual classes:

The confidence score for a sample is the signed distance of that sample to the hyperplane.

Example with a 3-class dataset:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
import numpy as np

# dummy 3-class dataset
X, y = make_classification(n_classes=3, n_clusters_per_class=1)

# train classifier and get decision scores
clf = LinearSVC().fit(X, y)
decision = clf.decision_function(X)
decision = np.round(decision, 2)

prediction = clf.predict(X)

# looking at decision scores and the predicted class:
for a, b in zip(decision, prediction):
    print(a, b)

[...]
[ 3.04 -0.61 -7.1 ] 0
[-4.99  1.85 -1.62] 1
[ 3.01 -0.98 -5.93] 0
[-2.61 -1.12  2.64] 2
[-3.43 -0.65  1.32] 2
[-1.78 -1.67  4.15] 2
[...]

You can see that the classifier takes the class with the maximum score as its prediction.
To get the best two, you would take the two highest scores, as in the sketch below.
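One way to do that is np.argsort on the decision scores. A short sketch reusing decision and clf from the code above:

# indices of the two largest scores per row, best class first
top2 = np.argsort(decision, axis=1)[:, -2:][:, ::-1]
print(top2[:5])  # the first column should match clf.predict(X)[:5]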

Edit:

Note what signed distance means:

sign of the decision function:

+: yes (data point belongs to class)

-: no (data point does not belong to class)

absolute value of the decision function:

denotes confidence in the decision.

Example from the first row in the code above:

[ 3.04 -0.61 -7.1 ] 0

Decision for class 0: 3.04 => the classifier thinks that the data belongs to class 0, with a confidence score of 3.04.

Decision for class 1: -0.61 => the classifier thinks that the data does not belong to class 1, with a confidence score of 0.61.

Decision for class 2: -7.1 => the classifier thinks that the data does not belong to class 2, with a confidence score of 7.1.
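As a quick consistency check, in the multiclass (one-vs-rest) case predict is simply the argmax of the decision scores, so the sign/magnitude reading above lines up with the predicted class. A sketch reusing clf and prediction from the example:

# predict == argmax of the (unrounded) decision scores
scores = clf.decision_function(X)
assert (scores.argmax(axis=1) == prediction).all()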
