Data model

Every record shape the pipeline produces, every field on each, and when each field is populated. All four core types are frozen dataclasses defined in thesisagents.core.models.

Query

The input contract: what the user is asking for.

@dataclass(frozen=True)
class Query:
    keywords: str                     # NFC-normalised, whitespace-collapsed
    sources: tuple[str, ...]          # subset of ALL_SOURCES
    max_results: int = 25             # 1..200 per source
    year_from: int | None = None      # inclusive lower bound
    year_to: int | None = None        # inclusive upper bound
    top_tier_only: bool = True        # apply the venue whitelist
    min_citations: int | None = None  # discard below this (MCP-only flag)

| Field | Required | Notes |
| --- | --- | --- |
| keywords | yes | Must be non-empty after normalize_query. URL/HTML encoding is per-source; you pass plain text. |
| sources | yes | Empty tuple is rejected at construction. Use ALL_SOURCES from core.constants to get the full list. |
| max_results | no | Clamped to [1, MAX_RESULTS_PER_SOURCE] (200) by pydantic validation. |
| year_from / year_to | no | Either or both. year_from > year_to is rejected. |
| top_tier_only | no | When True (default), filters to the curated CS top-tier whitelist + arXiv pass-through. |
| min_citations | no | Surfaced via the MCP search tool’s min_citations parameter only. |

Paper

The normalised result shape every source plugin produces. See thesisagents.core.models.Paper for the source.

@dataclass(frozen=True)
class Paper:
    source: str                       # source plugin name (e.g. "arxiv")
    source_id: str                    # the plugin's stable record ID
    title: str
    authors: tuple[str, ...]
    year: int | None
    venue: str | None                 # the REAL publication venue
    abstract: str
    url: str                          # canonical landing page
    doi: str | None = None
    arxiv_id: str | None = None
    pdf_url: str | None = None        # PDF if publicly accessible
    citation_count: int | None = None
    raw: dict[str, Any] | None = None # the source's raw payload
    summary: PaperSummary | None = None

Field reference

| Field | Required | Populated by |
| --- | --- | --- |
| source | yes | Source plugin name ("arxiv", "pubmed", "openalex", …). Always one of ALL_SOURCES. |
| source_id | yes | The plugin’s stable ID for the record (arXiv: ID without version suffix; pubmed: PMID; openalex: opaque ID). Used to form the BibTeX key fallback when DOI is missing. |
| title | yes | Plain text, no Markdown. CJK supported. |
| authors | yes | Tuple of "Firstname Lastname" strings. Empty tuple is allowed (rare; some preprint servers omit authors). The exporter shows the first three and abbreviates the rest. |
| year | no | None only when the source genuinely lacks year metadata. Slide layouts substitute "n.d." (configurable via i18n). |
| venue | no | Real publication venue when available. None for preprints with no venue, scrape results that can’t determine a venue, and local PDFs. |
| abstract | yes | Plain text. May be empty string for entries with no abstract (the lightweight tier falls back to bullet placeholders). |
| url | yes | The canonical landing page URL: what a human would click to read the paper. arXiv: https://arxiv.org/abs/<id> (no v<N> suffix). DOI papers: https://doi.org/<doi>. Locally-fed PDFs: file:///.... |
| doi | no | 10.x/y form, no https://doi.org/ prefix. Used as the BibTeX key when present. |
| arxiv_id | no | The bare arXiv ID (2401.08741), no v<N> suffix. Strip the version when populating. |
| pdf_url | no | A publicly fetchable PDF URL. None when the paper is paywalled or the source can’t surface a PDF link. The pipeline’s paywall gate triggers when more than 30% of results have pdf_url=None. |
| citation_count | no | Integer when the source reports one. Used in the rank score. |
| raw | no | The source’s raw payload (parsed JSON / dict). Available for debugging and for the LLM-as-agent flow (raw["extracted_text"] when populated by --pdf). Excluded from .json export when too large. |
| summary | no | A PaperSummary dataclass, populated by --enrich, the LLM-as-agent flow, or hand-authored regen scripts. |

Derived methods

paper.bibtex_key()      # → "vaswani2017attention" (lowercase, ASCII-folded)
paper.to_dict()         # → JSON-serialisable dict (drops `raw` when huge)
Paper.from_dict(data)   # → Paper (round-trip equality)

The BibTeX key generation is:

  1. Last name of first author (ASCII-folded, lowercase).

  2. Four-digit year (or nd when missing).

  3. First non-stopword from the title (lowercase, ASCII-folded).

  4. If a collision: append a, b, c, … per the project’s collision counter.
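
The four steps above can be sketched as a standalone function. Several details are assumptions made for illustration: the stopword set, the "anon" placeholder for records with no authors, and the `taken` argument that stands in for the project’s collision counter.

```python
import re
import unicodedata

# Hypothetical stopword list; the project's real list is not shown in this page.
STOPWORDS = frozenset({"a", "an", "the", "of", "on", "in", "for", "and", "to", "is"})


def ascii_fold(text: str) -> str:
    """Strip diacritics, drop remaining non-ASCII characters, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def bibtex_key(authors, year, title, taken=frozenset()):
    """Build a key such as 'vaswani2017attention' from paper metadata."""
    # Step 1: last name of the first author ("anon" is an assumed placeholder).
    last_name = ascii_fold(authors[0].split()[-1]) if authors else "anon"
    last_name = re.sub(r"[^a-z]", "", last_name)
    # Step 2: four-digit year, or "nd" when the year is missing.
    year_part = f"{year:04d}" if year is not None else "nd"
    # Step 3: first title word that is not a stopword.
    words = re.findall(r"[a-z0-9]+", ascii_fold(title))
    first_word = next((w for w in words if w not in STOPWORDS), "")
    base = f"{last_name}{year_part}{first_word}"
    # Step 4: on collision, append a, b, c, ... until the key is free.
    if base not in taken:
        return base
    for suffix in "abcdefghijklmnopqrstuvwxyz":
        if base + suffix not in taken:
            return base + suffix
    raise ValueError(f"too many collisions for key {base!r}")
```

For example, bibtex_key(("Ashish Vaswani",), 2017, "Attention Is All You Need") yields "vaswani2017attention", matching the comment in the listing above.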

PaperSummary

The structured per-paper summary attached as Paper.summary. Three usage tiers stack additively:

@dataclass(frozen=True)
class PaperSummary:
    # Flat tier — enriched-flat exporter renders these one per slide
    motivation: str = ""
    contributions: tuple[str, ...] = ()
    method: str = ""
    results: str = ""
    limitations: str = ""
    takeaways: tuple[str, ...] = ()

    # Rich tier — thesis-style exporter activates when has_rich_fields()
    pain_points: tuple[str, ...] = ()
    research_question: str = ""
    contributions_detailed: tuple[ContributionDetail, ...] = ()
    headline_metrics: tuple[Metric, ...] = ()
    technique_table: tuple[TechniqueRow, ...] = ()
    literature_positioning: tuple[LiteratureRow, ...] = ()
    system_flow: tuple[str, ...] = ()
    method_sections: tuple[MethodSection, ...] = ()
    evaluation_method: str = ""
    research_questions: tuple[str, ...] = ()
    rq_results: tuple[RqResult, ...] = ()
    contribution_summary: str = ""
    core_observation: str = ""
    future_work: tuple[str, ...] = ()

    # Provenance
    model: str = ""               # "claude-opus-4-7 (LLM-as-agent, read 12-page PDF)"
    raw_text_chars: int = 0       # length of source text that was summarised
    language: str = "en"

has_rich_fields() returns True when any rich-tier field has non-empty content; this is what the .pptx exporter checks to pick between the enriched-flat and thesis-style layouts.
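
One way such a check could be written is shown below. SummarySketch is a hypothetical stand-in with only a few of the fields listed above, and RICH_FIELDS is an assumed helper constant; the real implementation may enumerate the rich tier differently.

```python
from dataclasses import dataclass

# Assumed helper: names of the rich-tier fields present in this sketch.
RICH_FIELDS = ("pain_points", "research_question", "core_observation")


@dataclass(frozen=True)
class SummarySketch:
    """Hypothetical subset of PaperSummary, for illustration only."""

    # Flat tier
    motivation: str = ""
    contributions: tuple[str, ...] = ()
    # Rich tier
    pain_points: tuple[str, ...] = ()
    research_question: str = ""
    core_observation: str = ""
    # Provenance
    model: str = ""

    def has_rich_fields(self) -> bool:
        # Empty strings and empty tuples are falsy, so any() is True
        # only when at least one rich-tier field carries content.
        return any(getattr(self, name) for name in RICH_FIELDS)
```

In this sketch a summary with only flat-tier or provenance fields set returns False, so an exporter built on it would pick the enriched-flat layout.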

When each tier is populated

| Source | Flat tier | Rich tier |
| --- | --- | --- |
| CLI --enrich (Python pipeline) | yes | yes (Claude prompts produce both tiers) |
| MCP export from LLM-as-agent | yes if the LLM writes them | yes if the LLM writes them |
| Hand-authored regen script | yes | yes |
| CLI default (no enrichment) | empty | empty |

Nested types (rich tier)

ContributionDetail

@dataclass(frozen=True)
class ContributionDetail:
    title: str                # "Two-tower fine-tuning"
    description: str          # one sentence explaining what + why
    bullets: tuple[str, ...] = ()  # 2-4 supporting points

Renders as one stack on the contributions slide. Cap the slide at ≤ 4 contributions — the overflow check trips above that.

Metric

@dataclass(frozen=True)
class Metric:
    name: str                 # "Top-1 accuracy on ImageNet-1k"
    value: str                # "84.7%"
    delta: str = ""           # "+2.3% vs. ViT-B/16"

Renders as one row of the KPI slide. Aim for 3–5 metrics.

TechniqueRow

@dataclass(frozen=True)
class TechniqueRow:
    technique: str            # "RoPE positional encoding"
    used_for: str             # "long-context generalisation"
    note: str = ""            # optional aside

Renders as one row of the technique table.

LiteratureRow

@dataclass(frozen=True)
class LiteratureRow:
    work: str                 # citation key or short ref ("BERT (2019)")
    contribution: str         # what they did
    delta: str                # what this paper adds beyond them

Renders as one row of the literature-positioning table.

MethodSection

@dataclass(frozen=True)
class MethodSection:
    title: str                # "3.1 Encoder"
    bullets: tuple[str, ...]  # 3-6 bullets, ≤ 28 chars each for column layout

Renders as one column block on the method slide. The exporter places at most _METHOD_SECTIONS_PER_SLIDE = 2 sections on a slide and splits longer lists across multiple method slides automatically.
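
The per-slide split amounts to chunking the section tuple in groups of two. A minimal sketch, assuming the exporter keeps sections in their original order; paginate_sections is a hypothetical helper name.

```python
_METHOD_SECTIONS_PER_SLIDE = 2  # cap quoted above


def paginate_sections(sections, per_slide=_METHOD_SECTIONS_PER_SLIDE):
    """Split a sequence of sections into per-slide groups, preserving order."""
    return [
        tuple(sections[start:start + per_slide])
        for start in range(0, len(sections), per_slide)
    ]
```

Three sections titled "3.1", "3.2", and "3.3" would therefore land on two slides: ("3.1", "3.2") and ("3.3",).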

EvaluationSection

@dataclass(frozen=True)
class EvaluationSection:
    title: str                # "4.1 ImageNet-1k benchmark"
    bullets: tuple[str, ...]  # 3-6 bullets

Same shape as MethodSection. The analogous cap, _EVALUATION_SECTIONS_PER_SLIDE = 2, applies.

RqResult

@dataclass(frozen=True)
class RqResult:
    research_question: str    # "RQ1: Does X improve Y under constraint Z?"
    headline: str             # one-sentence answer
    table: tuple[tuple[str, ...], ...] = ()  # rows of cells; first row is header
    notes: tuple[str, ...] = ()  # 2-4 supporting bullets below the table

Renders as one slide per RQ. The pipeline pairs research_questions[i] with rq_results[i] by index; lengths must match.
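
The index-based pairing can be illustrated as follows. This is a sketch only: pair_rq_results is a hypothetical helper, and raising ValueError on a length mismatch is an assumption about how the "lengths must match" rule is enforced.

```python
def pair_rq_results(research_questions, rq_results):
    """Pair research_questions[i] with rq_results[i], one pair per RQ slide."""
    if len(research_questions) != len(rq_results):
        # Assumed failure mode; the real pipeline may report this differently.
        raise ValueError(
            f"{len(research_questions)} research questions "
            f"but {len(rq_results)} results"
        )
    return list(zip(research_questions, rq_results))
```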

PaperCollection

The pipeline’s output and every exporter’s input.

@dataclass(frozen=True)
class PaperCollection:
    query: Query              # the originating query (for provenance)
    papers: tuple[Paper, ...] # deduplicated, ranked

| Field | Notes |
| --- | --- |
| query | Used by the .xlsx exporter’s “Query” provenance sheet, by the .md exporter’s header, and by the .pptx exporter’s footer. |
| papers | Tuple (frozen). Order matters: exporters render in the order given. |

Helpers

len(collection)                # → len(collection.papers)
collection.to_dict()           # → JSON-serialisable
PaperCollection.from_dict(d)   # → round-trip
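
The round-trip guarantee has one subtlety worth showing: JSON has no tuple type, so from_dict has to turn lists back into tuples for equality to hold. The sketch below uses a hypothetical two-field PaperSketch rather than the real Paper or PaperCollection.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperSketch:
    """Hypothetical two-field record, for illustration only."""

    title: str
    authors: tuple[str, ...]

    def to_dict(self) -> dict:
        # Tuples become lists so the result is JSON-serialisable.
        return {"title": self.title, "authors": list(self.authors)}

    @classmethod
    def from_dict(cls, data: dict) -> "PaperSketch":
        # Convert lists back to tuples so round-trip equality holds.
        return cls(title=data["title"], authors=tuple(data["authors"]))


def round_trip(paper: PaperSketch) -> PaperSketch:
    """Serialise to a JSON string and parse it back."""
    return PaperSketch.from_dict(json.loads(json.dumps(paper.to_dict())))
```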

ExportOptions

The exporter contract.

@dataclass(frozen=True)
class ExportOptions:
    formats: tuple[str, ...]       # subset of ALL_EXPORTS
    out_dir: str                   # filesystem path
    filename_stem: str | None = None  # override autogen
    include_abstract: bool = True  # off → drops abstract + summary
    language: str = "en"           # slide-deck language code
    max_slides_per_paper: int = 25 # 0 = unlimited

| Field | Notes |
| --- | --- |
| formats | Validated against ALL_EXPORTS = ("bib", "md", "pptx", "xlsx", "pdf", "json", "ris", "csv", "csl"). |
| out_dir | Created if missing. Path-traversal-safe (resolved via utils.path_safety). |
| filename_stem | When None, the pipeline generates {slug-of-query}-{YYYYMMDD-HHMMSS}. Hand-authored regen scripts typically set this to the BibTeX key. |
| include_abstract | False produces a deck that’s title + authors + link slides only, which is useful when you want a one-sentence summary deck for hundreds of papers. |
| language | One of the 14 supported slide-deck languages. Unknown codes are not rejected; they fall back to en via normalise_language. |
| max_slides_per_paper | Caps each paper’s slide count; the exporter drops lower-priority sections (figures, paper-tables, contribution-summary, pagination tails) until the count fits. Cover / overview / contributions / metrics / core observation / references are always kept. Pass 0 to disable the cap. |
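
The default filename_stem pattern, {slug-of-query}-{YYYYMMDD-HHMMSS}, could be produced along these lines. The slug rules here (lowercase, runs of non-alphanumerics replaced by a hyphen) are an assumption, and both function names are hypothetical.

```python
import re
from datetime import datetime


def slugify(text: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with one hyphen (assumed rules)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def default_filename_stem(keywords: str, now: datetime) -> str:
    """Build '{slug-of-query}-{YYYYMMDD-HHMMSS}' from the query keywords."""
    return f"{slugify(keywords)}-{now.strftime('%Y%m%d-%H%M%S')}"
```

Passing the timestamp in as an argument, rather than calling datetime.now() inside the function, keeps the sketch deterministic and easy to test.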

Identifier parsing

thesisagents.core.identifiers.parse_identifier(value: str) is the single entry point for resolving a --paper argument. It returns a ParsedIdentifier carrying:

@dataclass(frozen=True)
class ParsedIdentifier:
    value: str           # canonical form (e.g. "2401.08741")
    kind: IdentifierKind # ARXIV | DOI | PMID | IEEE_DOC

Accepted input forms:

| Kind | Examples |
| --- | --- |
| arXiv | 2401.08741, 2401.08741v2, arXiv:2401.08741, https://arxiv.org/abs/2401.08741, https://arxiv.org/pdf/2401.08741v2.pdf, cs.LG/0001001 (legacy) |
| DOI | 10.1145/3411764.3445005, doi:10.1145/..., https://doi.org/10.1145/... |
| PMID | 34567890, https://pubmed.ncbi.nlm.nih.gov/34567890/ |
| IEEE | https://ieeexplore.ieee.org/document/10965643 (number is the IEEE document ID, not the DOI) |

For any value that matches none of these forms, the CLI raises a friendly error: could not classify identifier.
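
A regex-based classifier covering the forms listed above might look like the sketch below. It is not the project’s implementation: the patterns, their ordering, the enum values, and the use of ValueError are all assumptions chosen to reproduce the documented examples.

```python
import re
from dataclasses import dataclass
from enum import Enum


class IdentifierKind(Enum):
    ARXIV = "arxiv"
    DOI = "doi"
    PMID = "pmid"
    IEEE_DOC = "ieee_doc"


@dataclass(frozen=True)
class ParsedIdentifier:
    value: str
    kind: IdentifierKind


_ARXIV_PREFIX = r"(?:^|arxiv:|arxiv\.org/(?:abs|pdf)/)"
_ARXIV_SUFFIX = r"(?:v\d+)?(?:\.pdf)?$"  # version and .pdf are stripped

# Assumed patterns, tried in order; group 1 is the canonical value.
_PATTERNS = (
    (IdentifierKind.IEEE_DOC, re.compile(r"ieeexplore\.ieee\.org/document/(\d+)", re.I)),
    (IdentifierKind.PMID, re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", re.I)),
    (IdentifierKind.DOI, re.compile(r"^(?:doi:|https?://doi\.org/)?(10\.\d{4,9}/\S+)$", re.I)),
    (IdentifierKind.ARXIV, re.compile(_ARXIV_PREFIX + r"(\d{4}\.\d{4,5})" + _ARXIV_SUFFIX, re.I)),
    # Legacy arXiv IDs such as cs.LG/0001001.
    (IdentifierKind.ARXIV, re.compile(_ARXIV_PREFIX + r"([a-z-]+(?:\.[a-z]{2})?/\d{7})" + _ARXIV_SUFFIX, re.I)),
    (IdentifierKind.PMID, re.compile(r"^(\d{1,8})$")),
)


def parse_identifier(value: str) -> ParsedIdentifier:
    """Classify a --paper argument, or raise ValueError when nothing matches."""
    text = value.strip()
    for kind, pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            return ParsedIdentifier(value=match.group(1), kind=kind)
    raise ValueError(f"could not classify identifier: {value!r}")
```

In this sketch parse_identifier("https://arxiv.org/pdf/2401.08741v2.pdf") returns the canonical value "2401.08741" with kind ARXIV, in line with the rule that the version suffix is dropped.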

Exceptions

The project’s full exception hierarchy is defined in thesisagents.core.exceptions:

ThesisAgentsError                     # base — surfaces as exit code 2
├── ConfigError                          # missing API key, malformed env var
├── FetchError
│   ├── RateLimitError                   # 429 / explicit upstream rate limit
│   ├── ParseError                       # malformed JSON / XML / HTML
│   └── SourceUnavailableError           # 5xx that retries can't recover
├── CacheError                           # disk-cache I/O failure
└── ExportError                          # exporter failed to write

Every fetcher’s top-level method wraps upstream exceptions into one of the above. Surface code (CLI / MCP / GUI) catches the base ThesisAgentsError and renders it as a one-line error message; unexpected exceptions surface as a stack trace so bugs are loud.
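
The hierarchy and the catch-at-the-surface pattern can be sketched as plain Python classes. The class names follow the tree above; run_cli is a hypothetical wrapper, and the exact error message format is an assumption.

```python
import sys


class ThesisAgentsError(Exception):
    """Base class; surfaces as exit code 2."""


class ConfigError(ThesisAgentsError):
    """Missing API key, malformed env var."""


class FetchError(ThesisAgentsError):
    """Base class for fetch failures."""


class RateLimitError(FetchError):
    """429 / explicit upstream rate limit."""


class ParseError(FetchError):
    """Malformed JSON / XML / HTML."""


class SourceUnavailableError(FetchError):
    """5xx that retries can't recover."""


class CacheError(ThesisAgentsError):
    """Disk-cache I/O failure."""


class ExportError(ThesisAgentsError):
    """Exporter failed to write."""


def run_cli(action) -> int:
    """Run a callable and map project errors to a one-line message and exit code 2."""
    try:
        action()
    except ThesisAgentsError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 2
    # Anything else is deliberately not caught, so bugs surface as a stack trace.
    return 0
```

Catching only the base class keeps the surface code short: a RateLimitError raised deep inside a fetcher is reported the same way as a ConfigError, while a stray KeyError still propagates.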