However, when we classify with a generative model, we lose the probability score that traditional classifiers such as logistic regression provide. Those models return a probability for each class, which serves as a confidence level and is essential for decision-making: it tells users how strongly the model backs each prediction. A generative model's response may align well with the intended classification, but it does not come with an explicit per-class probability. This is a real limitation, particularly in high-stakes applications where knowing the model's confidence is crucial.
One way to work around this limitation and gain a better sense of confidence in LLM classification results is to instruct the model, in the prompt, to return a "confidence level" on some scale, such as 1.0 through 5.0, alongside the predicted class for a given text example. But can this self-reported confidence value be trusted?
An Experiment Is Needed
To help answer that question, I ran a small experiment on an open-source dataset from Hugging Face to see whether the confidence level returned in the LLM's response is statistically meaningful.
AG is a collection of more than 1 million news articles. ComeToMyHead, an academic news search engine, has gathered articles from over 2000 news sources in more than one year of activity. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
Xiang Zhang (xiang.zhang@nyu.edu) constructed the AG’s news topic classification dataset from the dataset above. It is used as a classification benchmark in Zhang, Xiang, Junbo Zhao, and Yann LeCun. “Character-level Convolutional Networks for Text Classification.” ArXiv:1509.01626.
This code will download the dataset and cache it locally:
from datasets import load_dataset
import pandas as pd

# Download the AG News dataset (or load it from the local cache)
dataset = load_dataset("wangrongsheng/ag_news", cache_dir="./data/ag_news")

# Human-readable class names for the four topics
labels = dataset["train"].features["label"].names

# Convert the training split to a pandas DataFrame for easier handling
df = pd.DataFrame(dataset["train"])
print(labels)
df.head()

Figure 1. Sample labeled data of ‘Business’ topics.
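The balanced 1,000-row test set (250 rows per class) can be drawn along these lines; this is only a sketch, where the labelName column and the sampled_df name are chosen to match the analysis code further below, and the mapping of the dataset's label names to WORLD/SPORTS/BUSINESS/SCITECH is one possible normalization rather than the exact code used:
# Sketch: map numeric labels to names and draw 250 rows per class
# (the normalization of "Sci/Tech" etc. to the prompt's categories is an assumption)
df["labelName"] = df["label"].map(lambda i: labels[i].upper().replace("/", ""))

sampled_df = (
    df.groupby("labelName", group_keys=False)
      .apply(lambda g: g.sample(n=250, random_state=42))
      .reset_index(drop=True)
)
print(sampled_df["labelName"].value_counts())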
The Classification Prompt
Here is the prompt I used with the GPT-4 model to get classification results for 1,000 rows of the data, sampled evenly across the classes:
classify_prompt = f"""
Examine the text delimited below by ``` and classify it into one
of the following categories: WORLD, SPORTS, BUSINESS, or SCITECH.
```{text}```
output:
On a single line write the category that best fits and
a confidence score in the range 1 to 5 (1 being the least confident)
separated by a comma.
On the following line write a brief explanation of why you chose that category.
Do not include any other information in your response.
"""
The raw response looks like this:
WORLD, 5.0
The text describes a geopolitical event involving an explosion in Baghdad, Iraq, which is a newsworthy incident related to global affairs typically covered in the WORLD news category.
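The two-line reply can then be parsed into the predicted class, confidence level, and correctness flag, for example as sketched here; the raw_response column and the parsing details are illustrative, while the output column names match the analysis code below:
def parse_response(raw: str):
    # Split the two-line reply into (category, confidence, explanation)
    first_line, _, explanation = raw.partition("\n")
    category, _, confidence = first_line.partition(",")
    return category.strip().upper(), float(confidence), explanation.strip()

# Hypothetical enrichment of the sampled dataframe, assuming the raw replies
# were stored in a raw_response column
parsed = sampled_df["raw_response"].apply(parse_response)
sampled_df["predicted_class"] = parsed.map(lambda p: p[0])
sampled_df["confidence_level"] = parsed.map(lambda p: p[1])
sampled_df["explanation"] = parsed.map(lambda p: p[2])
sampled_df["correct"] = sampled_df["predicted_class"] == sampled_df["labelName"]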
Enhancing the sampled dataset to add these values results in the following table:
Figure 2. Prediction results.
Analyzing Results
I’ll extract the labeled and predicted pairs from the data to create a classification report using the following code:
from sklearn.metrics import confusion_matrix, classification_report

def generate_classification_report(labeled_predicted_pairs, label_names):
    # Extract labels and predictions from the pairs
    labels = [pair[0] for pair in labeled_predicted_pairs]
    predictions = [pair[1] for pair in labeled_predicted_pairs]

    # Map string labels to numerical values
    label_to_num = {label: i for i, label in enumerate(label_names)}
    num_labels = [label_to_num[label] for label in labels]
    num_predictions = [label_to_num[pred] for pred in predictions]

    # Generate confusion matrix
    conf_matrix = confusion_matrix(num_labels, num_predictions)

    # Generate classification report
    class_report = classification_report(num_labels, num_predictions, target_names=label_names)
    return conf_matrix, class_report
# Get labeled_predicted_pairs from sampled_df
labeled_predicted_pairs = list(zip(sampled_df['labelName'], sampled_df['predicted_class']))
label_values = sampled_df['labelName'].unique()
# Generate report
conf_matrix, class_report = generate_classification_report(labeled_predicted_pairs, label_values)
#print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)
Which produces this:
Classification Report:
               precision    recall  f1-score   support

       WORLD       0.84      0.88      0.86       250
      SPORTS       0.96      1.00      0.98       250
    BUSINESS       0.74      0.90      0.82       250
     SCITECH       0.92      0.63      0.75       250

    accuracy                           0.85      1000
   macro avg       0.86      0.85      0.85      1000
weighted avg       0.86      0.85      0.85      1000
The overall accuracy is 85%, which is pretty good, but we can dig a little deeper. The 'support' column shows how many samples of each class were in the dataset; I made sure to select a balanced set of samples for each class.
To understand how these break out into true and false positives and negatives, we'll graph the "confusion matrix", which shows actual versus predicted counts for each class. These counts are what the precision, recall, and f1-scores in the report above are calculated from.
# Plot confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=label_values, yticklabels=label_values)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

Figure 3. Multi-class Confusion Matrix.
How Confident Was The Model?
We can combine the confidence level with each prediction's correctness to see how the confidence values are distributed for true versus false predictions:
true_predictions = sampled_df[sampled_df['correct'] == True]
false_predictions = sampled_df[sampled_df['correct'] == False]
plt.figure(figsize=(12, 6))
# Plot for true predictions
sns.histplot(true_predictions['confidence_level'], bins=10, kde=True, color='blue', label='True Predictions', alpha=0.6, stat='density')
# Plot for false predictions
sns.histplot(false_predictions['confidence_level'], bins=10, kde=True, color='red', label='False Predictions', alpha=0.6, stat='density')
plt.xlabel('Confidence Level')
plt.ylabel('Normalized Frequency')
plt.title('Comparison of Confidence Levels between True and False Predictions')
plt.legend()
plt.show()
Figure 4. Confidence Distributions.
In the next section, I'll use a logistic regression to fit the true/false predictions to their associated confidence scores, then check if there's a significant correlation using the calculated p_value and significance_level.
Statistical Tests
This code will load some scikit-learn tools, along with scipy.stats and statsmodels, useful for performing a few tests, the first of which will be the logistic regression and a p_value check.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import scipy.stats as stats
import statsmodels.api as sm
# Independent variable (X) and dependent variable (y)
X = df['Confidence_Level'].values.reshape(-1, 1)
y = df['Prediction_Correctness'].values
model = LogisticRegression()
model.fit(X, y)
# Estimated coefficient
coef = model.coef_[0][0]
intercept = model.intercept_[0]
print(f"Coefficient for Confidence Level: {coef}")
print(f"Intercept: {intercept}")
# Calculate the odds ratio
odds_ratio = np.exp(coef)
print(f"Odds Ratio: {odds_ratio}")
# Interpretation
if odds_ratio > 1:
    print(f"For each unit increase in confidence level, the odds of a correct prediction increase by {round((odds_ratio - 1) * 100, 2)}%.")
else:
    print(f"For each unit increase in confidence level, the odds of a correct prediction decrease by {round((1 - odds_ratio) * 100, 2)}%.")
# Fit the model using statsmodels to get p-value
X_with_const = sm.add_constant(X) # Add a constant term for intercept
logit_model = sm.Logit(y, X_with_const)
result = logit_model.fit()
print(result.summary())
# Extract p-value for confidence level coefficient
p_value = result.pvalues[1]
print(f"P-value for Confidence Level Coefficient: {p_value}")
# Check significance level
significance_level = 0.05
if p_value < significance_level:
    print("The coefficient for confidence level is statistically significant.")
else:
    print("The coefficient for confidence level is not statistically significant.")
The results of this test are as follows, and it appears we have a statistically significant result. (The statsmodels coefficient differs slightly from the scikit-learn one because scikit-learn's LogisticRegression applies L2 regularization by default.)
Coefficient for Confidence Level: 1.4061994575718302
Intercept: -4.464738502617202
Odds Ratio: 4.080418095576085
For each unit increase in confidence level, the odds of a correct prediction
increase by 308.04%.
Optimization terminated successfully.
         Current function value: 0.360052
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                 1000
Model:                          Logit   Df Residuals:                      998
Method:                           MLE   Df Model:                            1
Date:                Sat, 02 Nov 2024   Pseudo R-squ.:                  0.1339
Time:                        14:10:07   Log-Likelihood:                -360.05
converged:                       True   LL-Null:                       -415.71
Covariance Type:            nonrobust   LLR p-value:                 5.056e-26
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.6039      0.664     -6.936      0.000      -5.905      -3.303
x1             1.4384      0.152      9.449      0.000       1.140       1.737
==============================================================================
P-value for Confidence Level Coefficient: 3.416898004216482e-21
The coefficient for confidence level is statistically significant.
To double-check this initial result, I’m going to use a Point-Biserial Correlation test.
Point-Biserial Correlation
The point-biserial correlation is a special case of the Pearson correlation used when one variable is continuous and the other is binary. It is computed as: $$ r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X} \sqrt{\frac{n_1 n_0}{n^2}} $$ where:
- $\bar{X}_1$ = Mean confidence level for correct predictions.
- $\bar{X}_0$ = Mean confidence level for incorrect predictions.
- $s_X$ = Standard deviation of all confidence levels.
- $n_1,n_0$ = Number of correct and incorrect predictions.
- $n$ = Total number of predictions.
Test for significance:
- t-test: $$ t = r_{pb}\sqrt{\frac{n-2}{1-r_{pb}^2}} $$
- Degrees of freedom $n-2$
Let’s see how this can be done in code, and what the results are.
# Separate the data for correct and incorrect predictions
correct_predictions = df[df['Prediction_Correctness'] == 1]['Confidence_Level']
incorrect_predictions = df[df['Prediction_Correctness'] == 0]['Confidence_Level']
# Calculate the means
mean_correct = correct_predictions.mean()
mean_incorrect = incorrect_predictions.mean()
# Calculate the standard deviation of all confidence levels
std_confidence = df['Confidence_Level'].std()
# Calculate the counts
n_correct = len(correct_predictions)
n_incorrect = len(incorrect_predictions)
n_total = len(df)
# Calculate point-biserial correlation coefficient
r_pb = (mean_correct - mean_incorrect) / std_confidence * np.sqrt((n_correct * n_incorrect) / n_total**2)
print(f"Point-Biserial Correlation Coefficient (r_pb): {r_pb}")
# Test for significance using t-test
t_value = r_pb * np.sqrt((n_total - 2) / (1 - r_pb**2))
degrees_of_freedom = n_total - 2
p_value = 2 * (1 - stats.t.cdf(abs(t_value), df=degrees_of_freedom))
print(f"T-value: {t_value}")
print(f"P-value: {p_value}")
# Check significance level
significance_level = 0.05
if p_value < significance_level:
    print("The point-biserial correlation is statistically significant.")
else:
    print("The point-biserial correlation is not statistically significant.")
And the results are apparently significant:
Point-Biserial Correlation Coefficient (r_pb): 0.3647830613321057
T-value: 12.37676335481931
P-value: 0.0
The point-biserial correlation is statistically significant.
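As a quick cross-check, scipy also provides this test directly through stats.pointbiserialr. The coefficient should agree closely with the manual calculation (which uses the sample standard deviation, so a tiny difference is possible), and its p-value is typically reported as a very small nonzero number rather than rounding to 0.0 the way the 1 - cdf calculation above does:
# Cross-check with scipy's built-in point-biserial test
# (equivalent to a Pearson correlation between correctness and confidence)
r_pb_scipy, p_value_scipy = stats.pointbiserialr(
    df['Prediction_Correctness'], df['Confidence_Level']
)
print(f"scipy r_pb: {r_pb_scipy}")
print(f"scipy p-value: {p_value_scipy}")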
To add another test of significance, I will use a Mann-Whitney U test.
Mann-Whitney U Test
The Mann-Whitney U test assesses whether the distributions of confidence levels differ between true and false predictions.
The code and the results follow:
# Perform Mann-Whitney U Test
u_statistic, p_value = stats.mannwhitneyu(correct_predictions, incorrect_predictions, alternative='two-sided')
print(f"Mann-Whitney U Statistic: {u_statistic}")
print(f"P-value: {p_value}")
# Check significance level
significance_level = 0.05
if p_value < significance_level:
    print("The Mann-Whitney U test is statistically significant, indicating a difference in distributions between correct and incorrect predictions.")
else:
    print("The Mann-Whitney U test is not statistically significant, indicating no difference in distributions between correct and incorrect predictions.")
Mann-Whitney U Statistic: 90831.5
P-value: 4.6967698934230915e-26
The Mann-Whitney U test is statistically significant, indicating a difference in
distributions between correct and incorrect predictions.
ROC Curve
The ROC curve gives another view of how well the confidence level scores provided by the GPT-4 model separate correct from incorrect predictions.
# ROC Curve
fpr, tpr, _ = roc_curve(df['Prediction_Correctness'], df['Confidence_Level'])
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 5))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

Figure 5. ROC Curve.
If the confidence scores carried no information about whether a prediction was correct, the area under the ROC curve would be approximately 50%. With these results, the ROC curve area is 73%, a meaningful improvement over chance, although not an outstanding one.
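One way to put these confidence scores to work when routing results for human review is to pick a cutoff from the ROC thresholds, for example with Youden's J statistic; this is only a sketch with illustrative variable names:
# Sketch: pick a confidence cutoff for human review from the ROC data.
# Youden's J statistic selects the threshold that maximizes (TPR - FPR).
fpr, tpr, thresholds = roc_curve(df['Prediction_Correctness'], df['Confidence_Level'])
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(f"Suggested confidence cutoff: {best_threshold}")

# Predictions below the cutoff could be routed to a human reviewer
flag_for_review = df['Confidence_Level'] < best_threshold
print(f"Rows flagged for review: {flag_for_review.sum()} of {len(df)}")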
Conclusion
In this small experiment, where the prompt asked the LLM for a "confidence level" along with a brief explanation of its decision, the confidence the model expressed shows a statistically significant association with whether its predictions were correct.
A larger study with different types of text, different class sets, and different LLMs might be worthwhile, to see whether this type of prompting improves classification accuracy or at least provides a usable measure of confidence when presenting results for human review.