Summary Evaluation Challenges
Summary evaluation metrics sometimes fall short of capturing the qualities that matter most when judging a summary. Traditional machine learning approaches to natural language processing (NLP) have covered a lot of ground in this area. Widely used measures such as ROUGE focus on surface-level token overlap and n-gram matches. While effective for evaluating lexical similarity, these approaches offer limited insight into aspects such as factual accuracy or semantic completeness [1].
For example, consider two summaries that state “The study found 73% improvement” and “Nearly three-quarters showed positive outcomes.” Both express the same underlying fact, yet traditional metrics treat them as distinct. This illustrates how lexical approaches can struggle to account for semantic equivalence and factual consistency.
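As a quick illustration (a sketch, not part of the evaluation framework itself), scoring these two statements with the rouge-score package shows zero lexical overlap despite their semantic equivalence:

```python
# Lexical metrics see no overlap between two semantically equivalent statements.
from rouge_score import rouge_scorer

a = "The study found 73% improvement"
b = "Nearly three-quarters showed positive outcomes."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(a, b))  # ROUGE-1 and ROUGE-L F1 come out 0.0 -- no shared tokens
```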
I did some experimentation with the Natural Language Toolkit (NLTK) and managed to get a good first cut working with the traditional ML approach, but my accuracy scoring was falling short on many edge cases such as the one above, and others described in [1]. So, a new experiment was needed.
A Hybrid Approach
Initial explorations with established NLP metrics demonstrated both their strengths and their limitations: extending them to assess semantic accuracy did not prove robust. LLM-based evaluation, with careful prompt design, seemed better suited to that kind of semantic assessment.
At the same time, relying solely on LLM-based evaluation for the more deterministic coherence and completeness scores seemed like overkill. Weighing the different quality dimensions, I settled on a tailored “hybrid” approach:
- Accuracy: Dependent on semantic understanding and factual consistency assessment - LLM
- Completeness: Amenable to topic extraction and coverage analysis - NLTK
- Coherence: Supported by established linguistic measures - ROUGE
- Overall Score: Weighted average of the above components.
This seemed to be worth exploring with some experimentation.
Architecture: Three Evaluation Paradigms
Accuracy: LLM-as-Judge with Structured Prompting
For accuracy, large language models provide the needed semantic reasoning to assess equivalence and factual consistency. Structured prompting ensures consistent, interpretable judgments:
prompt = f"Compare the <source> and <summary> provided. Using only the
information provided in the <source>, characterize the accuracy of the
<summary> on a scale of 0=Poor, 1=Fair, 2=Good, or 3=Excellent.
Provide a short rationale for your score.
<source>
{source_text}
</source>
<summary>
{summary_text}
</summary>
Format as JSON: {{'score': int, 'rationale': str}}"
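As a rough sketch of how the filled-in prompt might be used (not necessarily the exact code in [2]), it can be sent to the OpenAI Chat Completions API and the reply parsed as JSON; the `response_format` hint and zero temperature are additions I would suggest for more reliable parsing:

```python
# Minimal judge call: send the structured prompt and decode the JSON verdict.
# `prompt` is the f-string shown above, already filled with source_text/summary_text.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",                           # any supported OpenAI judge model
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # nudge the model toward valid JSON
    temperature=0,                            # favor consistent judgments
)
verdict = json.loads(response.choices[0].message.content)
accuracy_score, rationale = verdict["score"], verdict["rationale"]
```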
Key design considerations:
- Explicit source constraint prevents hallucinations or injection of external knowledge
- Discrete 0–3 scale improves judgment consistency
- Required rationale encourages reasoning and enables debugging
- JSON formatting ensures parseable responses
The rationale requirement, in particular, enhances judgment quality by prompting more deliberate decisions.
Completeness: Enhanced Topic Extraction
Completeness is measured through topic extraction methods that extend beyond TF-IDF. By incorporating document structure knowledge and positional weighting, the evaluation captures salient content more effectively:
# Weight sentences by position and content characteristics
if i == 0 or i == len(sentences) - 1:
    position_weight = 1.5
elif i < len(sentences) * 0.2 or i > len(sentences) * 0.8:
    position_weight = 1.2
else:
    position_weight = 1.0
combined_score = position_weight * tfidf_score * length_weight
Semantic similarity (cosine similarity on sentence embeddings) is then used to match extracted topics with summary content. A coverage threshold of 0.4 balances precision and recall. This approach captures topic coverage that lexical matching alone may miss, especially when summaries express concepts using different terminology.
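A minimal sketch of that matching step, assuming a sentence-transformers embedding model (the framework in [2] may use a different embedding backend):

```python
# Coverage check: what fraction of extracted source topics does the summary cover?
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def topic_coverage(topics, summary_sentences, threshold=0.4):
    """Fraction of extracted topics matched by at least one summary sentence."""
    topic_emb = model.encode(topics)
    summary_emb = model.encode(summary_sentences)
    sims = cosine_similarity(topic_emb, summary_emb)  # shape: (topics, summary sentences)
    return float((sims.max(axis=1) >= threshold).mean())
```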
Coherence: ROUGE Metrics
For coherence, established computational linguistics methods are sufficient and efficient:
coherence = (0.4 * semantic + 0.25 * discourse + 0.08 * (rouge1 + rouge2 + rougeL + lexical))
Components:
- ROUGE-1/2/L: Official F1 scores using rouge-score library with stemming
- Semantic coherence: Sentence embedding similarities combined with ROUGE-L
- Discourse coherence: Penn Discourse Treebank marker analysis
- Lexical diversity: Type-Token Ratio with windowed normalization
- Contradiction penalty: 1.0 (no contradictions) to 0.0 (many contradictions)
- Readability: Flesch-Kincaid grade normalized as max(0, min(1, (20-grade)/20))
- Length penalty: 1.0 (normal length) to 0.3 minimum (very short texts)
Deterministic methods are appropriate here because coherence has objective, measurable properties and reliable metrics already exist.
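The ROUGE-1/2/L terms can be computed directly with the rouge-score package; a minimal sketch, assuming the source article serves as the reference text (the framework's exact pairing may differ):

```python
# ROUGE F1 components with stemming, as used in the coherence formula above.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(source_text, summary_text)  # (reference, candidate) ordering

rouge1 = scores["rouge1"].fmeasure
rouge2 = scores["rouge2"].fmeasure
rougeL = scores["rougeL"].fmeasure
```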
Overall Score Weighting Rationale
I chose a 60/25/15 weighting across accuracy, completeness, and coherence to reflect practical priorities: (0.6 * Accuracy + 0.25 * Completeness + 0.15 * Coherence).
- Accuracy receives the greatest emphasis, since factual errors undermine trust.
- Completeness is weighted significantly, as omission of key information reduces usefulness.
- Coherence is weighted lower, since awkward phrasing is less harmful than inaccuracy.
These weights can be tuned for domain-specific applications.
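As a sanity check on the arithmetic, a minimal sketch of the weighted average, applied to the component scores from the example run shown later in this post:

```python
def overall_score(accuracy, completeness, coherence, weights=(0.6, 0.25, 0.15)):
    """Weighted average of the three component scores (weights tunable per domain)."""
    w_acc, w_comp, w_coh = weights
    return w_acc * accuracy + w_comp * completeness + w_coh * coherence

# Component scores from the example run below: 0.6*1.000 + 0.25*0.532 + 0.15*0.437
print(round(overall_score(1.000, 0.532, 0.437), 3))  # 0.799
```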
Prompt Engineering for Consistent Judgments
Reliability in accuracy scoring depends heavily on careful prompt design. Critical elements include:
- Constraining evaluation to provided source material
- Using a discrete scale rather than continuous ranges
- Requiring explicit reasoning
- Enforcing structured output
Among these, the source constraint proved particularly important, as unconstrained prompts encouraged models to draw on external knowledge rather than evaluate the given material.
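Enforcing structured output also implies parsing it defensively. A minimal sketch, assuming the 0–3 judge score is normalized to 0–1 for the overall calculation (my assumption, not confirmed by [2]):

```python
import json

def parse_judgment(raw: str) -> tuple[float, str]:
    """Parse the judge's JSON reply; fall back gracefully on malformed output."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
        rationale = str(data.get("rationale", ""))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0, "unparseable judge response"
    score = max(0, min(3, score))  # clamp to the discrete 0-3 scale
    return score / 3.0, rationale  # assumed normalization to 0-1
```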
Performance Characteristics
The hybrid framework yields overall score distributions that align with intuitive quality assessments:
- 0.8–1.0: High-quality summaries with minimal issues
- 0.6–0.8: Good summaries with some gaps in completeness or flow
- 0.4–0.6: Usable but with noticeable issues
- 0.0–0.4: Significant problems in accuracy, completeness, or coherence
The system captures error types that purely algorithmic methods tend to miss:
- Semantic paraphrasing
- Factual contradictions expressed with different wording
- Important topic omissions despite high lexical overlap
- Issues with discourse and readability
The overall score establishes thresholds that can be used to flag summaries for closer examination by human evaluators.
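As an illustration of that triage step, here is a minimal sketch that maps the score bands above to review actions (the routing policy itself is my own example, not part of the framework):

```python
def triage(overall: float) -> str:
    """Map an overall score to a review action based on the bands above."""
    if overall >= 0.8:
        return "pass"           # high quality, minimal issues
    if overall >= 0.6:
        return "spot-check"     # good, some gaps in completeness or flow
    if overall >= 0.4:
        return "human review"   # usable but with noticeable issues
    return "reject or rewrite"  # significant problems
```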
Summary Evaluation Using a Standard Baseline
To test this framework I chose the CNN/DailyMail dataset, which contains roughly 287K news articles paired with reference summaries in its training split. It is a standard baseline used in many summary evaluation experiments [1].
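For reference, a minimal sketch of loading and sampling the dataset with the Hugging Face datasets library (the tool in [2] may load it differently):

```python
import random
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0", split="train")  # ~287K article/summary pairs
random.seed(42)                                             # optional, for reproducibility
sample = ds.select(random.sample(range(len(ds)), k=5))

for row in sample:
    article, reference_summary = row["article"], row["highlights"]
```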
Summary Evaluation Tool for CNN/DailyMail Dataset
This module provides functionality to evaluate reference summaries from the CNN/DailyMail dataset using a comprehensive SummaryEvaluator with ROUGE-enhanced coherence scoring [2].
Key Features:
- Loads and samples from CNN/DailyMail dataset (287k+ articles)
- Evaluates summaries across three dimensions:
  - Accuracy: LLM-as-judge scoring using OpenAI models (GPT-4, GPT-4o)
  - Completeness: Topic coverage analysis using TF-IDF and semantic similarity
  - Coherence: ROUGE-enhanced scoring with discourse analysis and readability
- Flexible output formats: pretty console display or CSV export
- Configurable sampling with optional seeding for reproducibility
- Command-line interface with comprehensive argument parsing
Usage Examples:
python evalsummary.py # default 5 samples, GPT-4o, pretty output
python evalsummary.py -n 10 -m gpt-4.1 # 10 samples with GPT-4.1
python evalsummary.py -f csv > results.csv # CSV export for analysis
python evalsummary.py --seed 42 -n 20 # Reproducible results
Example output
📥 Loading CNN/DailyMail dataset...
✅ Dataset loaded - Train: 287113
🚀 Evaluating 1 random reference summaries using gpt-4o
================================================================================
Article 1 (index 221292):
Bradford midfielder Billy Knott has taken to Twitter to thank John Terry for
giving his dad a signed shirt following his side's stunning 4-2 FA Cup win
against Chelsea on Saturday. The former Chelsea youngster was still revelling
in the win on Sunday morning as he posed with his father, Steve, and the
famous No 26 shirt of the Blues captain. Knott tweeted: 'Thanks to jt top man
give the shirt to my dad. What a day that was every were we goooo [sic].'
Billy Knott shows off his signed shirt from...
Reference Summary:
Billy Knott received signed shirt from John Terry following Bradford's
4-2 FA Cup win against Chelsea. Knott took to Twitter to thank 'top man' Terry
for the gift. Bantams are through to FA Cup fifth round following memorable win.
--------------------------------------------------------------------------------
📊 SUMMARY EVALUATION RESULTS:
----------------------------------------
🎯 Accuracy: 1.000 (100.0%)
📋 Completeness: 0.532 (53.2%)
🔗 Coherence: 0.437 (43.7%)
⭐ Overall: 0.799 (79.9%)
💭 Accuracy Rationale: The summary accurately captures the key points
from the source: Billy Knott received a signed shirt from John Terry,
he thanked Terry on Twitter, and Bradford's win against Chelsea allowed
them to progress to the FA Cup fifth round. The summary is concise and includes
all the main elements mentioned in the source.
Conclusion
The hybrid framework combines evaluation paradigms to provide robust, semantically aware summary assessment. It leverages LLMs for accuracy, structured NLP methods for completeness, and classical measures for coherence, producing a more comprehensive evaluation than any single method alone.
The result is an approach that balances semantic nuance with consistency and efficiency, offering a flexible foundation for summary evaluation across diverse domains, with overall scores providing thresholds for flagging summaries that warrant closer human review.
References
[1] Fabbri, A. R., et al. (2020). SummEval: Re-evaluating Summarization Evaluation. arXiv preprint arXiv:2007.12626. https://arxiv.org/pdf/2007.12626
[2] O’Shea, M. (2024). Summary Evaluation Framework. GitHub repository. https://github.com/oshea00/summaryeval