Dimension Mismatch When I Try To Apply Tf-idf To Test Set
I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier What I have tried now is the following: def test_tfidf(data
Solution 1:
When using transformes (TfidfVectorizer
in this case), you must use the same object ot transform both train and test data. The transformer is typically fitted using the training data only, and then re-used to transform the test data.
The correct way to do this in your case:
deftfidf(data, ngrams = 1):
df_temp = data.copy(deep = True)
df_temp = basic_preprocessing(df_temp)
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
tfidf_vectorizer.fit(df_temp['Text'])
list_corpus = df_temp["Text"].tolist()
list_labels = df_temp["Label"].tolist()
X = tfidf_vectorizer.transform(list_corpus)
return X, list_labels, tfidf_vectorizer
deftest_tfidf(data, vectorizer, ngrams = 1):
df_temp = data.copy(deep = True)
df_temp = basic_preprocessing(df_temp)
# No need to create a new TfidfVectorizer here!
list_corpus = df_temp["Text"].tolist()
list_labels = df_temp["Label"].tolist()
X = vectorizer.transform(list_corpus)
return X, list_labels
# this method is copied from the other SO questiondeftraining_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
clf = MultinomialNB() # Gaussian Naive Bayes
clf.fit(X_train_naive, y_train_naive)
res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
y_pred = clf.predict(X_test_naive)
f1 = f1_score(y_pred, y_test_naive, average = 'weighted')
pres = precision_score(y_pred, y_test_naive, average = 'weighted')
rec = recall_score(y_pred, y_test_naive, average = 'weighted')
acc = accuracy_score(y_pred, y_test_naive)
res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)
return res
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)
Post a Comment for "Dimension Mismatch When I Try To Apply Tf-idf To Test Set"