Data model
Every record shape the pipeline produces, every field on each, and
when each field is populated. All four core types are frozen
dataclasses defined in thesisagents.core.models.
Query
The input contract: what the user is asking for.
@dataclass(frozen=True)
class Query:
    keywords: str  # NFC-normalised, whitespace-collapsed
    sources: tuple[str, ...]  # subset of ALL_SOURCES
    max_results: int = 25  # 1..200 per source
    year_from: int | None = None  # inclusive lower bound
    year_to: int | None = None  # inclusive upper bound
    top_tier_only: bool = True  # apply the venue whitelist
    min_citations: int | None = None  # discard below this (MCP-only flag)
| Field | Required | Notes |
|---|---|---|
| keywords | yes | Must be non-empty after |
| sources | yes | Empty tuple is rejected at construction. Use |
| max_results | no | Clamped to |
| year_from / year_to | no | Either or both. |
| top_tier_only | no | When True (default), filters to the curated CS top-tier whitelist + arXiv pass-through. |
| min_citations | no | Surfaced via the MCP |
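The construction-time rules above (NFC normalisation, whitespace collapsing, a non-empty sources tuple, and the 1..200 clamp) could be enforced in a frozen dataclass roughly as follows. This is a minimal sketch, not the project's code: QuerySketch, the contents of ALL_SOURCES, and the choice of ValueError are assumptions made for illustration.

```python
from __future__ import annotations

import unicodedata
from dataclasses import dataclass

# Hypothetical registry; the project's real ALL_SOURCES is not shown here.
ALL_SOURCES = ("arxiv", "pubmed", "openalex")


@dataclass(frozen=True)
class QuerySketch:
    keywords: str
    sources: tuple[str, ...]
    max_results: int = 25

    def __post_init__(self) -> None:
        # NFC-normalise, then collapse runs of whitespace to single spaces.
        cleaned = " ".join(unicodedata.normalize("NFC", self.keywords).split())
        if not cleaned:
            raise ValueError("keywords must be non-empty")
        if not self.sources:
            raise ValueError("sources must not be empty")
        unknown = set(self.sources) - set(ALL_SOURCES)
        if unknown:  # assumed behaviour for names outside ALL_SOURCES
            raise ValueError(f"unknown sources: {sorted(unknown)}")
        # A frozen dataclass needs object.__setattr__ to store derived values.
        object.__setattr__(self, "keywords", cleaned)
        object.__setattr__(self, "max_results", min(max(self.max_results, 1), 200))
```

Because the instance is frozen, normalisation has to happen once in `__post_init__`; after that every consumer sees the cleaned values.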
Paper
The normalised result shape every source plugin produces. See
thesisagents.core.models.Paper for the source.
@dataclass(frozen=True)
class Paper:
    source: str  # source plugin name (e.g. "arxiv")
    source_id: str  # the plugin's stable record ID
    title: str
    authors: tuple[str, ...]
    year: int | None
    venue: str | None  # the REAL publication venue
    abstract: str
    url: str  # canonical landing page
    doi: str | None = None
    arxiv_id: str | None = None
    pdf_url: str | None = None  # PDF if publicly accessible
    citation_count: int | None = None
    raw: dict[str, Any] | None = None  # the source's raw payload
    summary: PaperSummary | None = None
Field reference
| Field | Required | Populated by |
|---|---|---|
| source | yes | Source plugin name ( |
| source_id | yes | The plugin’s stable ID for the record (arXiv: ID without version suffix; pubmed: PMID; openalex: opaque ID). Used to form the BibTeX key fallback when DOI is missing. |
| title | yes | Plain text, no Markdown. CJK supported. |
| authors | yes | Tuple of |
| year | no | |
| venue | no | Real publication venue when available. |
| abstract | yes | Plain text. May be empty string for entries with no abstract (the lightweight tier falls back to bullet placeholders). |
| url | yes | The canonical landing page URL: what a human would click to read the paper. arXiv: |
| doi | no | |
| arxiv_id | no | The bare arXiv ID ( |
| pdf_url | no | A publicly fetchable PDF URL. |
| citation_count | no | Integer when the source reports one. Used in the rank score. |
| raw | no | The source’s raw payload (parsed JSON / dict). Available for debugging and for the LLM-as-agent flow ( |
| summary | no | A |
Derived methods
paper.bibtex_key() # → "vaswani2017attention" (lowercase, ASCII-folded)
paper.to_dict() # → JSON-serialisable dict (drops `raw` when huge)
Paper.from_dict(data) # → Paper (round-trip equality)
The BibTeX key generation is:
1. Last name of first author (ASCII-folded, lowercase).
2. Four-digit year (or nd when missing).
3. First non-stopword from the title (lowercase, ASCII-folded).
4. If a collision: append a, b, c, … per the project’s collision counter.
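The key-generation steps can be sketched as a standalone function. This is an illustration under stated assumptions, not the body of paper.bibtex_key(): the stopword list, the rule that the last whitespace-separated token of the author string is the last name, and the `taken` set standing in for the collision counter are all hypothetical.

```python
from __future__ import annotations

import re
import unicodedata

# Hypothetical stopword list; the project's actual list is not shown here.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "to", "is"}


def ascii_fold(text: str) -> str:
    """Lowercase, strip diacritics, and drop anything that is not a-z or 0-9."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def bibtex_key(first_author: str, year: int | None, title: str, taken: set[str]) -> str:
    # Assumption: the last whitespace-separated token is the last name.
    last_name = ascii_fold(first_author.split()[-1])
    year_part = f"{year:04d}" if year is not None else "nd"
    words = (ascii_fold(word) for word in title.split())
    first_word = next((w for w in words if w and w not in STOPWORDS), "")
    base = f"{last_name}{year_part}{first_word}"
    if base not in taken:
        return base
    # On collision, try the suffixes a, b, c, ... in order.
    for suffix in "abcdefghijklmnopqrstuvwxyz":
        if base + suffix not in taken:
            return base + suffix
    raise ValueError(f"too many collisions for {base!r}")
```

With an empty `taken` set, `bibtex_key("Ashish Vaswani", 2017, "Attention Is All You Need", set())` yields the key shown in the comment on paper.bibtex_key() above.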
PaperSummary
The structured per-paper summary attached as Paper.summary.
Three usage tiers stack additively:
@dataclass(frozen=True)
class PaperSummary:
    # Flat tier: enriched-flat exporter renders these one per slide
    motivation: str = ""
    contributions: tuple[str, ...] = ()
    method: str = ""
    results: str = ""
    limitations: str = ""
    takeaways: tuple[str, ...] = ()

    # Rich tier: thesis-style exporter activates when has_rich_fields()
    pain_points: tuple[str, ...] = ()
    research_question: str = ""
    contributions_detailed: tuple[ContributionDetail, ...] = ()
    headline_metrics: tuple[Metric, ...] = ()
    technique_table: tuple[TechniqueRow, ...] = ()
    literature_positioning: tuple[LiteratureRow, ...] = ()
    system_flow: tuple[str, ...] = ()
    method_sections: tuple[MethodSection, ...] = ()
    evaluation_method: str = ""
    research_questions: tuple[str, ...] = ()
    rq_results: tuple[RqResult, ...] = ()
    contribution_summary: str = ""
    core_observation: str = ""
    future_work: tuple[str, ...] = ()

    # Provenance
    model: str = ""  # "claude-opus-4-7 (LLM-as-agent, read 12-page PDF)"
    raw_text_chars: int = 0  # length of source text that was summarised
    language: str = "en"
has_rich_fields() returns True when any rich-tier field has
non-empty content; this is what the .pptx exporter checks to
pick between the enriched-flat and thesis-style layouts.
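One way such a check could look, shown on a trimmed-down stand-in for PaperSummary. SummarySketch and the RICH_FIELDS constant are hypothetical names; the real method may enumerate its fields differently.

```python
from __future__ import annotations

from dataclasses import dataclass

# Names of the rich-tier fields in this trimmed-down sketch.
RICH_FIELDS = ("pain_points", "core_observation")


@dataclass(frozen=True)
class SummarySketch:
    motivation: str = ""  # flat tier
    takeaways: tuple[str, ...] = ()  # flat tier
    pain_points: tuple[str, ...] = ()  # rich tier
    core_observation: str = ""  # rich tier
    language: str = "en"  # provenance

    def has_rich_fields(self) -> bool:
        # Empty strings and empty tuples are falsy, so any() is True
        # only when at least one rich-tier field carries content.
        return any(getattr(self, name) for name in RICH_FIELDS)
```

A summary with only flat-tier content returns False here and would take the enriched-flat layout; setting any rich-tier field flips the result to True.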
When each tier is populated
| Source | Flat tier | Rich tier |
|---|---|---|
| CLI | yes | yes; Claude prompts produce both tiers |
| MCP | yes if the LLM writes them | yes if the LLM writes them |
| Hand-authored regen script | yes | yes |
| CLI default (no enrichment) | empty | empty |
Nested types (rich tier)
ContributionDetail
@dataclass(frozen=True)
class ContributionDetail:
    title: str  # "Two-tower fine-tuning"
    description: str  # one sentence explaining what + why
    bullets: tuple[str, ...] = ()  # 2-4 supporting points
Renders as one stack on the contributions slide. Cap the slide at ≤ 4 contributions — the overflow check trips above that.
Metric
@dataclass(frozen=True)
class Metric:
    name: str  # "Top-1 accuracy on ImageNet-1k"
    value: str  # "84.7%"
    delta: str = ""  # "+2.3% vs. ViT-B/16"
Renders as one row of the KPI slide. Aim for 3–5 metrics.
TechniqueRow
@dataclass(frozen=True)
class TechniqueRow:
    technique: str  # "RoPE positional encoding"
    used_for: str  # "long-context generalisation"
    note: str = ""  # optional aside
Renders as one row of the technique table.
LiteratureRow
@dataclass(frozen=True)
class LiteratureRow:
    work: str  # citation key or short ref ("BERT (2019)")
    contribution: str  # what they did
    delta: str  # what this paper adds beyond them
Renders as one row of the literature-positioning table.
MethodSection
@dataclass(frozen=True)
class MethodSection:
    title: str  # "3.1 Encoder"
    bullets: tuple[str, ...]  # 3-6 bullets, ≤ 28 chars each for column layout
Renders as one column block on the method slide. The
_METHOD_SECTIONS_PER_SLIDE = 2 cap means the exporter splits into
multiple method slides automatically.
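The automatic split amounts to chunking the section tuple by the per-slide cap. A minimal sketch, assuming a hypothetical helper name (paginate_sections) and plain strings standing in for MethodSection instances:

```python
from __future__ import annotations

_METHOD_SECTIONS_PER_SLIDE = 2


def paginate_sections(
    sections: tuple[str, ...],
    per_slide: int = _METHOD_SECTIONS_PER_SLIDE,
) -> list[tuple[str, ...]]:
    """Split sections into consecutive groups of at most per_slide items."""
    return [
        sections[start:start + per_slide]
        for start in range(0, len(sections), per_slide)
    ]
```

Five sections therefore produce three method slides, with the last slide holding a single column block.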
EvaluationSection
@dataclass(frozen=True)
class EvaluationSection:
    title: str  # "4.1 ImageNet-1k benchmark"
    bullets: tuple[str, ...]  # 3-6 bullets
Same shape as MethodSection; same _EVALUATION_SECTIONS_PER_SLIDE = 2 cap.
RqResult
@dataclass(frozen=True)
class RqResult:
    research_question: str  # "RQ1: Does X improve Y under constraint Z?"
    headline: str  # one-sentence answer
    table: tuple[tuple[str, ...], ...] = ()  # rows of cells; first row is header
    notes: tuple[str, ...] = ()  # 2-4 supporting bullets below the table
Renders as one slide per RQ. The pipeline pairs research_questions[i]
with rq_results[i] by index; lengths must match.
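Index-based pairing with a length check could look like the sketch below. The helper name pair_rq_results and the use of ValueError are assumptions; how the pipeline actually reports a mismatch is not documented here.

```python
from __future__ import annotations

from typing import Any


def pair_rq_results(
    research_questions: tuple[str, ...],
    rq_results: tuple[Any, ...],
) -> list[tuple[str, Any]]:
    """Pair research_questions[i] with rq_results[i]; lengths must match."""
    if len(research_questions) != len(rq_results):
        raise ValueError(
            f"{len(research_questions)} research questions "
            f"but {len(rq_results)} results"
        )
    return list(zip(research_questions, rq_results))
```

Checking the lengths up front avoids zip silently truncating to the shorter tuple, which would drop an RQ slide without warning.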
PaperCollection
The pipeline’s output and every exporter’s input.
@dataclass(frozen=True)
class PaperCollection:
    query: Query  # the originating query (for provenance)
    papers: tuple[Paper, ...]  # deduplicated, ranked
| Field | Notes |
|---|---|
| query | Used by the |
| papers | Tuple (frozen). Order matters: exporters render in the order given. |
Helpers
len(collection) # → len(collection.papers)
collection.to_dict() # → JSON-serialisable
PaperCollection.from_dict(d) # → round-trip
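The round-trip contract can be illustrated on two trimmed-down stand-ins. PaperSketch and CollectionSketch are hypothetical; the real to_dict() and from_dict() handle more fields, and may not use dataclasses.asdict at all.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class PaperSketch:
    title: str
    authors: tuple[str, ...]

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> PaperSketch:
        # JSON has no tuple type, so lists are converted back to tuples.
        return cls(title=data["title"], authors=tuple(data["authors"]))


@dataclass(frozen=True)
class CollectionSketch:
    papers: tuple[PaperSketch, ...]

    def __len__(self) -> int:
        return len(self.papers)

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)  # recurses into nested dataclasses

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> CollectionSketch:
        return cls(papers=tuple(PaperSketch.from_dict(p) for p in data["papers"]))


original = CollectionSketch(papers=(PaperSketch("Title", ("A. Author", "B. Author")),))
restored = CollectionSketch.from_dict(json.loads(json.dumps(original.to_dict())))
```

Converting lists back to tuples in from_dict is what makes `restored == original` hold after a trip through JSON.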
ExportOptions
The exporter contract.
@dataclass(frozen=True)
class ExportOptions:
    formats: tuple[str, ...]  # subset of ALL_EXPORTS
    out_dir: str  # filesystem path
    filename_stem: str | None = None  # override autogen
    include_abstract: bool = True  # off → drops abstract + summary
    language: str = "en"  # slide-deck language code
    max_slides_per_paper: int = 25  # 0 = unlimited
| Field | Notes |
|---|---|
| formats | Validated against |
| out_dir | Created if missing. Path-traversal-safe (resolved via |
| filename_stem | When |
| include_abstract | False produces a deck that’s title + authors + link slides only; useful when you want a one-sentence summary deck for hundreds of papers. |
| language | Must be one of the 14 supported slide-deck languages. Unknown codes fall back to |
| max_slides_per_paper | Caps each paper’s slide count; the exporter drops lower-priority sections (figures, paper-tables, contribution-summary, pagination tails) until the count fits. Cover / overview / contributions / metrics / core observation / references are always kept. Pass |
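The slide cap can be pictured as dropping whole sections in priority order until the deck fits. The sketch below is an assumption-heavy illustration: the section names, the drop order (taken from the order listed above), and the helper fit_to_cap are hypothetical, and the real exporter may trim more finely than whole sections.

```python
from __future__ import annotations

# Assumed drop order: first entry is dropped first.
DROP_ORDER = ("figures", "paper_tables", "contribution_summary", "pagination_tails")


def fit_to_cap(slides: list[tuple[str, str]], cap: int) -> list[tuple[str, str]]:
    """Each slide is a (section, title) pair; cap == 0 means unlimited."""
    kept = list(slides)
    if cap == 0:
        return kept
    for section in DROP_ORDER:
        if len(kept) <= cap:
            break
        # Remove every slide belonging to the current low-priority section.
        kept = [slide for slide in kept if slide[0] != section]
    # Sections outside DROP_ORDER are never removed, so the result can
    # still exceed the cap when the always-kept slides alone are too many.
    return kept
```

For example, a five-slide deck with two figure slides and a cap of 3 loses only the figure slides; the cover, the paper table, and the references remain.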
Identifier parsing
thesisagents.core.identifiers.parse_identifier(value: str) is
the single entry point for resolving a --paper argument. It
returns a ParsedIdentifier carrying:
@dataclass(frozen=True)
class ParsedIdentifier:
    value: str  # canonical form (e.g. "2401.08741")
    kind: IdentifierKind  # ARXIV | DOI | PMID | IEEE_DOC
Accepted input forms:
| Kind | Examples |
|---|---|
| arXiv | |
| DOI | |
| PMID | |
| IEEE | |
For any value that matches none of these forms, the CLI raises a
friendly error: could not classify identifier.
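A classifier of this kind is often a short chain of regular expressions. The sketch below is not the project's parse_identifier: the patterns, the optional prefixes, and the use of ValueError are assumptions, and IEEE_DOC matching is left out because its accepted forms are not shown here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum


class IdentifierKind(Enum):
    ARXIV = "arxiv"
    DOI = "doi"
    PMID = "pmid"
    IEEE_DOC = "ieee_doc"  # declared for completeness; not matched below


@dataclass(frozen=True)
class ParsedIdentifier:
    value: str
    kind: IdentifierKind


# Assumed patterns; each captures the canonical form in group 1.
_PATTERNS = (
    (re.compile(r"^(?:arxiv:)?(\d{4}\.\d{4,5})(?:v\d+)?$", re.IGNORECASE), IdentifierKind.ARXIV),
    (re.compile(r"^(?:doi:)?(10\.\d{4,9}/\S+)$", re.IGNORECASE), IdentifierKind.DOI),
    (re.compile(r"^(?:pmid:)?(\d{1,8})$", re.IGNORECASE), IdentifierKind.PMID),
)


def parse_identifier(value: str) -> ParsedIdentifier:
    text = value.strip()
    for pattern, kind in _PATTERNS:
        match = pattern.match(text)
        if match:
            return ParsedIdentifier(value=match.group(1), kind=kind)
    raise ValueError(f"could not classify identifier: {value!r}")
```

Under these assumed patterns, an arXiv version suffix such as `v2` is stripped so the canonical value is the bare ID.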
Exceptions
The whole project’s error type hierarchy is in
thesisagents.core.exceptions:
ThesisAgentsError # base — surfaces as exit code 2
├── ConfigError # missing API key, malformed env var
├── FetchError
│ ├── RateLimitError # 429 / explicit upstream rate limit
│ ├── ParseError # malformed JSON / XML / HTML
│ └── SourceUnavailableError # 5xx that retries can't recover
├── CacheError # disk-cache I/O failure
└── ExportError # exporter failed to write
Every fetcher’s top-level method wraps upstream exceptions into
one of the above. Surface code (CLI / MCP / GUI) catches the base
ThesisAgentsError and renders it as a one-line error message;
unexpected exceptions surface as a stack trace so bugs are loud.
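The hierarchy and the catch-at-the-surface pattern can be sketched as follows. The class names mirror the tree above; parse_payload and run are hypothetical helpers, and the exact message format of the real CLI is not shown here.

```python
from __future__ import annotations

import json
import sys
from typing import Any, Callable


class ThesisAgentsError(Exception): ...
class ConfigError(ThesisAgentsError): ...
class FetchError(ThesisAgentsError): ...
class RateLimitError(FetchError): ...
class ParseError(FetchError): ...
class SourceUnavailableError(FetchError): ...
class CacheError(ThesisAgentsError): ...
class ExportError(ThesisAgentsError): ...


def parse_payload(text: str) -> Any:
    """Fetcher-side wrapping: an upstream exception becomes a project error."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ParseError(f"malformed JSON: {exc.msg}") from exc


def run(action: Callable[[], Any]) -> int:
    """Surface-side handling: project errors become a one-line message."""
    try:
        action()
    except ThesisAgentsError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    # Anything else propagates with its stack trace, so bugs stay loud.
    return 0
```

Using `raise ... from exc` keeps the upstream exception attached as `__cause__`, which helps debugging without leaking third-party exception types to the surface code.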