Output Top 2 Classes From A Multiclass Classification Algorithm
Solution 1:
LinearSVC does not provide predict_proba, but it does provide decision_function, which gives the signed distance from the hyperplane.
From the documentation:
decision_function(self, X):
Predict confidence scores for samples.
The confidence score for a sample is the signed distance of that sample to the hyperplane.
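To make the quoted definition concrete, here is a minimal sketch that recomputes the scores by hand from the fitted coef_ and intercept_ attributes. The toy dataset and variable names are illustrative assumptions, not part of the original answer. The raw score is the linear form w·x + b, so it is proportional to the geometric distance; dividing by the norm of w would give the distance itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Illustrative toy data (an assumption, not from the original answer)
X, y = make_classification(n_samples=200, n_informative=5,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

clf = LinearSVC(random_state=0).fit(X, y)

scores = clf.decision_function(X)          # shape (n_samples, n_classes)
manual = X @ clf.coef_.T + clf.intercept_  # one linear score per class

print(scores.shape)
print(np.allclose(scores, manual))
```

With more than two classes there is one column of scores per class, which is what makes a top-n ranking possible.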
Based on @warped's comments, we can use the decision_function output to find the top n predicted classes from the model.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
# fit on the training split only, so the test rows stay unseen
clf.fit(X_train, y_train)

top_n_classes = 2
# sort scores ascending, keep the last n columns, reverse to best-first
predictions = clf.decision_function(
    X_test).argsort()[:, -top_n_classes:][:, ::-1]
pred_df = pd.DataFrame(predictions,
                       columns=[f'{i+1}_pred' for i in range(top_n_classes)])
df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)
df
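One caveat worth spelling out: argsort returns column positions in the decision_function output, not class labels. The two coincide here only because the labels are 0 to 4. The sketch below assumes hypothetical string labels (the names array is invented for illustration) and maps the positions back to labels through the fitted classes_ attribute.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_informative=6,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
# Hypothetical string labels, to show that column index != label
names = np.array(["bird", "cat", "dog", "fish"])
y_named = names[y]

clf = LinearSVC(random_state=0).fit(X, y_named)

top_n = 2
scores = clf.decision_function(X)
# Column indices of the top_n scores, best first
top_idx = np.argsort(scores, axis=1)[:, -top_n:][:, ::-1]
# Translate column indices into the labels the model was trained on
top_labels = clf.classes_[top_idx]

print(top_labels[:5])
```

The first column of top_labels should match clf.predict(X), since predict picks the class with the highest score.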
Solution 2:
LinearSVC has a method called decision_function, which gives confidence scores for individual classes:
The confidence score for a sample is the signed distance of that sample to the hyperplane.
Example with a 3-class dataset:
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
import numpy as np

# dummy dataset
X, y = make_classification(n_classes=3, n_clusters_per_class=1)

# train classifier and get decision scores
clf = LinearSVC().fit(X, y)
decision = clf.decision_function(X)
decision = np.round(decision, 2)
prediction = clf.predict(X)

# looking at decision scores and the predicted class:
for a, b in zip(decision, prediction):
    print(a, b)
[...]
[ 3.04 -0.61 -7.1 ] 0
[-4.99  1.85 -1.62] 1
[ 3.01 -0.98 -5.93] 0
[-2.61 -1.12  2.64] 2
[-3.43 -0.65  1.32] 2
[-1.78 -1.67  4.15] 2
[...]
You can see that the classifier takes the class with the maximum score as its prediction.
To get the best two, you would take the two highest scores.
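A minimal sketch of that step, using three of the rounded score rows from the output above as a hard-coded array (so it runs without refitting the model):

```python
import numpy as np

# Rounded decision scores copied from the output above
decision = np.array([[ 3.04, -0.61, -7.10],
                     [-4.99,  1.85, -1.62],
                     [-2.61, -1.12,  2.64]])

# argsort is ascending, so take the last two columns and reverse them
top2 = np.argsort(decision, axis=1)[:, -2:][:, ::-1]
print(top2)
# [[0 1]
#  [1 2]
#  [2 1]]
```

The first column is the predicted class and the second is the runner-up.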
Edit:
Note what "signed distance" means:
Sign of the decision function:
  +: yes (data point belongs to class)
  -: no (data point does not belong to class)
Absolute value of the decision function:
  denotes confidence in the decision.
Example from the first row of the output above:
[ 3.04 -0.61 -7.1 ] 0
Classes are counted from 1 here, so class 1 is the label 0 shown in the output.
Decision for class 1: 3.04 => this classifier thinks that the data belongs to class 1, with a certainty score of 3.04.
Decision for class 2: -0.61 => this classifier thinks that the data does not belong to class 2, with a certainty score of 0.61.
Decision for class 3: -7.1 => this classifier thinks that the data does not belong to class 3, with a certainty score of 7.1.
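The same reading can be sketched in a few lines of code. The row is hard-coded from the example, and the wording of the printed verdicts is only illustrative. Note that a sample can have every score negative, or more than one positive; the prediction is still the class with the highest score.

```python
import numpy as np

row = np.array([3.04, -0.61, -7.1])  # first row of the output above

lines = []
for k, score in enumerate(row, start=1):
    # sign gives the yes/no decision, magnitude gives the certainty
    verdict = "belongs" if score > 0 else "does not belong"
    lines.append(f"class {k}: {verdict}, certainty {abs(score):.2f}")

print("\n".join(lines))
# class 1: belongs, certainty 3.04
# class 2: does not belong, certainty 0.61
# class 3: does not belong, certainty 7.10
```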