Summary Evaluation Challenges
Summary evaluation metrics sometimes fall short of capturing the qualities that matter most when judging a summary. Traditional machine learning approaches to natural language processing (NLP) have covered a lot of ground in this area. Widely used measures such as ROUGE focus on surface-level token overlap and n-gram matches. While effective for evaluating lexical similarity, these approaches offer limited insight into aspects such as factual accuracy or semantic completeness [1].
For example, consider two summaries that state “The study found 73% improvement” and “Nearly three-quarters showed positive outcomes.” Both express the same underlying fact, yet traditional metrics treat them as distinct. This illustrates how lexical approaches can struggle to account for semantic equivalence and factual consistency.
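As a quick illustration (a sketch, not part of the evaluation framework itself), scoring these two statements with the rouge-score package shows zero lexical overlap despite their semantic equivalence:

```python
# Lexical metrics see no overlap between two semantically equivalent statements.
from rouge_score import rouge_scorer

a = "The study found 73% improvement"
b = "Nearly three-quarters showed positive outcomes."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(a, b))  # ROUGE-1 and ROUGE-L F1 come out 0.0 -- no shared tokens
```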
I did some experimentation with the Natural Language Toolkit (NLTK) and managed to get a good first cut working with the traditional ML approach, but my accuracy scoring was falling short on many edge cases such as the one above, and others described in [1]. So, a new experiment was needed.
A Hybrid Approach
Initial explorations with established NLP metrics demonstrated both their strengths and their limitations: extending them to assess semantic accuracy did not prove robust. LLM-based evaluation, with careful prompt design, seemed better suited to that kind of semantic assessment.
At the same time, relying solely on LLM-based evaluation for the more deterministic coherence and completeness scores seemed like overkill. Weighing the different quality dimensions, I settled on a tailored “hybrid” approach:
- Accuracy: Dependent on semantic understanding and factual consistency assessment - LLM
- Completeness: Amenable to topic extraction and coverage analysis - NLTK
- Coherence: Supported by established linguistic measures - ROUGE
- Overall Score: Weighted average of the above components.
This seemed to be worth exploring with some experimentation.
Architecture: Three Evaluation Paradigms
Accuracy: LLM-as-Judge with Structured Prompting
For accuracy, large language models provide the needed semantic reasoning to assess equivalence and factual consistency. Structured prompting ensures consistent, interpretable judgments:
prompt = f"Compare the <source> and <summary> provided. Using only the
information provided in the <source>, characterize the accuracy of the
<summary> on a scale of 0=Poor, 1=Fair, 2=Good, or 3=Excellent.
Provide a short rationale for your score.
<source>
{source_text}
</source>
<summary>
{summary_text}
</summary>
Format as JSON: {{'score': int, 'rationale': str}}"
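As a rough sketch of how the filled-in prompt might be used (not necessarily the exact code in [2]), it can be sent to the OpenAI Chat Completions API and the reply parsed as JSON; the `response_format` hint and zero temperature are additions I would suggest for more reliable parsing:

```python
# Minimal judge call: send the structured prompt and decode the JSON verdict.
# `prompt` is the f-string shown above, already filled with source_text/summary_text.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",                           # any supported OpenAI judge model
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # nudge the model toward valid JSON
    temperature=0,                            # favor consistent judgments
)
verdict = json.loads(response.choices[0].message.content)
accuracy_score, rationale = verdict["score"], verdict["rationale"]
```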
Key design considerations:
- Explicit source constraint prevents hallucinations or injection of external knowledge
- Discrete 0–3 scale improves judgment consistency
- Required rationale encourages reasoning and enables debugging
- JSON formatting ensures parseable responses
The rationale requirement, in particular, enhances judgment quality by prompting more deliberate decisions.
Completeness: Enhanced Topic Extraction
Completeness is measured through topic extraction methods that extend beyond TF-IDF. By incorporating document structure knowledge and positional weighting, the evaluation captures salient content more effectively:
# Weight sentences by position and content characteristics
if i == 0 or i == len(sentences) - 1:
    position_weight = 1.5
elif i < len(sentences) * 0.2 or i > len(sentences) * 0.8:
    position_weight = 1.2
else:
    position_weight = 1.0
combined_score = position_weight * tfidf_score * length_weight
Semantic similarity (cosine similarity on sentence embeddings) is then used to match extracted topics with summary content. A coverage threshold of 0.4 balances precision and recall. This approach captures topic coverage that lexical matching alone may miss, especially when summaries express concepts using different terminology.
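A minimal sketch of that matching step, assuming a sentence-transformers embedding model (the framework in [2] may use a different embedding backend):

```python
# Coverage check: what fraction of extracted source topics does the summary cover?
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def topic_coverage(topics, summary_sentences, threshold=0.4):
    """Fraction of extracted topics matched by at least one summary sentence."""
    topic_emb = model.encode(topics)
    summary_emb = model.encode(summary_sentences)
    sims = cosine_similarity(topic_emb, summary_emb)  # shape: (topics, summary sentences)
    return float((sims.max(axis=1) >= threshold).mean())
```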
Coherence: ROUGE Metrics
For coherence, established computational linguistics methods are sufficient and efficient:
coherence = (0.4 * semantic + 0.25 * discourse + 0.08 * (rouge1 + rouge2 + rougeL + lexical))
Components:
- ROUGE-1/2/L: Official F1 scores using rouge-score library with stemming
- Semantic coherence: Sentence embedding similarities combined with ROUGE-L
- Discourse coherence: Penn Discourse Treebank marker analysis
- Lexical diversity: Type-Token Ratio with windowed normalization
- Contradiction penalty: 1.0 (no contradictions) to 0.0 (many contradictions)
- Readability: Flesch-Kincaid grade normalized as max(0, min(1, (20-grade)/20))
- Length penalty: 1.0 (normal length) to 0.3 minimum (very short texts)
Deterministic methods are appropriate here because coherence has objective, measurable properties and reliable metrics already exist.
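The ROUGE-1/2/L terms can be computed directly with the rouge-score package; a minimal sketch, assuming the source article serves as the reference text (the framework's exact pairing may differ):

```python
# ROUGE F1 components with stemming, as used in the coherence formula above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(source_text, summary_text)  # (reference, candidate) ordering

rouge1 = scores["rouge1"].fmeasure
rouge2 = scores["rouge2"].fmeasure
rougeL = scores["rougeL"].fmeasure
```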
Overall Score Weighting Rationale
I chose a 60/25/15 weighting across accuracy, completeness, and coherence to reflect practical priorities: (0.6 * Accuracy + 0.25 * Completeness + 0.15 * Coherence).
- Accuracy receives the greatest emphasis, since factual errors undermine trust.
- Completeness is weighted significantly, as omission of key information reduces usefulness.
- Coherence is weighted lower, since awkward phrasing is less harmful than inaccuracy.
These weights can be tuned for domain-specific applications.
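As a sanity check on the arithmetic, a minimal sketch of the weighted average, applied to the component scores from the example run shown later in this post:

```python
def overall_score(accuracy, completeness, coherence, weights=(0.6, 0.25, 0.15)):
    """Weighted average of the three component scores (weights tunable per domain)."""
    w_acc, w_comp, w_coh = weights
    return w_acc * accuracy + w_comp * completeness + w_coh * coherence

# Component scores from the example run below: 0.6*1.000 + 0.25*0.532 + 0.15*0.437
print(round(overall_score(1.000, 0.532, 0.437), 3))  # 0.799
```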
Prompt Engineering for Consistent Judgments
Reliability in accuracy scoring depends heavily on careful prompt design. Critical elements include:
- Constraining evaluation to provided source material
- Using a discrete scale rather than continuous ranges
- Requiring explicit reasoning
- Enforcing structured output
Among these, the source constraint proved particularly important, as unconstrained prompts encouraged models to draw on external knowledge rather than evaluate the given material.
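Enforcing structured output also implies parsing it defensively. A minimal sketch, assuming the 0–3 judge score is normalized to 0–1 for the overall calculation (my assumption, not confirmed by [2]):

```python
import json

def parse_judgment(raw: str) -> tuple[float, str]:
    """Parse the judge's JSON reply; fall back gracefully on malformed output."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
        rationale = str(data.get("rationale", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, "unparseable judge response"
    score = max(0, min(3, score))  # clamp to the discrete 0-3 scale
    return score / 3.0, rationale  # assumed normalization to 0-1
```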
Performance Characteristics
The hybrid framework yields overall score distributions that align with intuitive quality assessments:
- 0.8–1.0: High-quality summaries with minimal issues
- 0.6–0.8: Good summaries with some gaps in completeness or flow
- 0.4–0.6: Usable but with noticeable issues
- 0.0–0.4: Significant problems in accuracy, completeness, or coherence
The system captures error types that purely algorithmic methods tend to miss:
- Semantic paraphrasing
- Factual contradictions expressed with different wording
- Important topic omissions despite high lexical overlap
- Issues with discourse and readability
The overall score establishes thresholds that can be used to flag summaries for closer examination by human evaluators.
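As an illustration of that triage step, here is a minimal sketch that maps the score bands above to review actions (the routing policy itself is my own example, not part of the framework):

```python
def triage(overall: float) -> str:
    """Map an overall score to a review action based on the bands above."""
    if overall >= 0.8:
        return "pass"           # high quality, minimal issues
    if overall >= 0.6:
        return "spot-check"     # good, some gaps in completeness or flow
    if overall >= 0.4:
        return "human review"   # usable but with noticeable issues
    return "reject or rewrite"  # significant problems
```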
Summary Evaluation Using a Standard Baseline
To test this framework I chose the CNN/DailyMail dataset, which contains roughly 287K news articles paired with reference summaries in its training split. It is a standard baseline used in many summary evaluation experiments [1].
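For reference, a minimal sketch of loading and sampling the dataset with the Hugging Face datasets library (the tool in [2] may load it differently):

```python
import random
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0", split="train")  # ~287K article/summary pairs
random.seed(42)                                             # optional, for reproducibility
sample = ds.select(random.sample(range(len(ds)), k=5))

for row in sample:
    article, reference_summary = row["article"], row["highlights"]
```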
Summary Evaluation Tool for CNN/DailyMail Dataset
This module provides functionality to evaluate reference summaries from the CNN/DailyMail dataset using a comprehensive SummaryEvaluator with ROUGE-enhanced coherence scoring [2].
Key Features:
- Loads and samples from CNN/DailyMail dataset (287k+ articles)
- Evaluates summaries across three dimensions:
  - Accuracy: LLM-as-judge scoring using OpenAI models (GPT-4, GPT-4o)
  - Completeness: Topic coverage analysis using TF-IDF and semantic similarity
  - Coherence: ROUGE-enhanced scoring with discourse analysis and readability
- Flexible output formats: pretty console display or CSV export
- Configurable sampling with optional seeding for reproducibility
- Command-line interface with comprehensive argument parsing
Usage Examples:
python evalsummary.py # default 5 samples, GPT-4o, pretty output
python evalsummary.py -n 10 -m gpt-4.1 # 10 samples with GPT-4.1
python evalsummary.py -f csv > results.csv # CSV export for analysis
python evalsummary.py --seed 42 -n 20 # Reproducible results
Example output
📥 Loading CNN/DailyMail dataset...
✅ Dataset loaded - Train: 287113
🚀 Evaluating 1 random reference summaries using gpt-4o
================================================================================
Article 1 (index 221292):
Bradford midfielder Billy Knott has taken to Twitter to thank John Terry for
giving his dad a signed shirt following his side's stunning 4-2 FA Cup win
against Chelsea on Saturday. The former Chelsea youngster was still revelling
in the win on Sunday morning as he posed with his father, Steve, and the
famous No 26 shirt of the Blues captain. Knott tweeted: 'Thanks to jt top man
give the shirt to my dad. What a day that was every were we goooo [sic].'
Billy Knott shows off his signed shirt from...
Reference Summary:
Billy Knott received signed shirt from John Terry following Bradford's
4-2 FA Cup win against Chelsea. Knott took to Twitter to thank 'top man' Terry
for the gift. Bantams are through to FA Cup fifth round following memorable win.
--------------------------------------------------------------------------------
📊 SUMMARY EVALUATION RESULTS:
----------------------------------------
🎯 Accuracy: 1.000 (100.0%)
📋 Completeness: 0.532 (53.2%)
🔗 Coherence: 0.437 (43.7%)
⭐ Overall: 0.799 (79.9%)
💭 Accuracy Rationale: The summary accurately captures the key points
from the source: Billy Knott received a signed shirt from John Terry,
he thanked Terry on Twitter, and Bradford's win against Chelsea allowed
them to progress to the FA Cup fifth round. The summary is concise and includes
all the main elements mentioned in the source.
Conclusion
The hybrid framework combines evaluation paradigms to provide robust, semantically aware summary assessment. It leverages LLMs for accuracy, structured NLP methods for completeness, and classical measures for coherence, producing a more comprehensive evaluation than any single method alone.
The result is an approach that balances semantic nuance with consistency and efficiency, offering a flexible foundation for summary evaluation across diverse domains, with overall scores providing thresholds for flagging summaries that warrant closer human review.
References
[1] Fabbri, A. R., et al. (2020). SummEval: Re-evaluating Summarization Evaluation. arXiv preprint arXiv:2007.12626. https://arxiv.org/pdf/2007.12626
[2] O’Shea, M. (2024). Summary Evaluation Framework. GitHub repository. https://github.com/oshea00/summaryeval