Architecture

How ThesisAgents is organised, why those boundaries exist, and how a single keyword turns into a thesis-style .pptx.

One-paragraph summary

A user (a human, a CLI process, an MCP-aware LLM, or the desktop GUI) submits a Query. The pipeline fans out to per-source Fetcher plugins, normalises each plugin’s payload into a shared Paper record, deduplicates by DOI, arXiv ID, or a hash of the normalised title, ranks by recency and citation count, optionally enriches each paper with a structured PaperSummary, and hands the resulting PaperCollection to one or more Exporter plugins (.pptx, .xlsx, .bib, .md, .json, .ris, .csv, .csl.json). All outbound HTTP goes through one HTTPS-only client per source, each source’s rate limit is enforced by a token bucket, and every fetcher test uses a recorded fixture (zero live HTTP in the test suite).

Layered view

┌─────────────────────────────────────────────────────────────┐
│  Surfaces                                                   │
│  CLI · MCP server · Desktop GUI (PySide6) · Python library  │
├─────────────────────────────────────────────────────────────┤
│  Pipeline                                                   │
│  Query → fetch → normalise → dedup → rank → enrich → export │
├─────────────────────────────────────────────────────────────┤
│  Core domain                                                │
│  Paper · PaperCollection · PaperSummary · RqResult · Query  │
├──────────────────────────┬──────────────────────────────────┤
│  Fetchers                │  Exporters                       │
│  arxiv, semantic_scholar │  pptx (3 tiers) · xlsx · bibtex  │
│  openalex, pubmed, …     │  markdown · json · pptx_edit     │
├──────────────────────────┴──────────────────────────────────┤
│  Infra                                                      │
│  HTTPS-only client · token-bucket rate limit · cache · i18n │
└─────────────────────────────────────────────────────────────┘

Dependencies only flow downward. Surfaces depend on the pipeline, the pipeline depends on the core domain + fetchers + exporters, and everything depends on infra. An exporter never imports a fetcher — it only consumes a PaperCollection.

Top-level layout

ThesisAgents/
├── thesisagents/                 # main package — core runtime
│   ├── core/                       # domain (Paper, Query, dedup, rank, pipeline)
│   ├── fetchers/                   # HTTPS-only http client + Fetcher base
│   ├── exporters/                  # pptx / xlsx / bib / md / json / ris / csv / csl + pptx_edit + i18n
│   ├── intelligence/               # PDF + Anthropic summariser ([intelligence] extra)
│   ├── mcp/                        # FastMCP server registering 12 tools ([mcp] extra)
│   ├── gui/                        # PySide6 desktop UI ([gui] extra)
│   ├── utils/                      # logging, path safety, async helpers
│   ├── cli.py                      # argparse CLI
│   └── __main__.py                 # `python -m thesisagents`
├── sources/<name>/                 # per-source plugins (arxiv, pubmed, …)
│   ├── __init__.py                 # exports `fetcher_class`
│   ├── fetcher.py                  # Fetcher subclass
│   ├── parser.py                   # payload → Paper
│   └── config.py                   # RateLimit + endpoint URL
├── tests/                          # pytest suite + recorded fixtures
├── docs/                           # Sphinx (en + 13 language stubs)
├── scripts/                        # regen / fixture-record helpers
└── pyproject.toml                  # metadata, ruff, bandit, extras

Core vs source plugins

The split between thesisagents/ and sources/<name>/ is about dependency surface and failure isolation; it does not mean “anything source-related is a plugin.”

A feature is a source plugin when ANY of the following holds:

  1. It needs a heavy or optional runtime dep (vendor SDK, Selenium).

  2. It needs failure isolation — a flaky upstream should not break the rest of the pipeline.

  3. It needs an independent release cadence — a Scholar HTML layout change should ship without re-shipping the engine.

A feature stays in core when:

  • It runs on the default dep set (no extras).

  • It serves the everyday workflow every user expects to work (arxiv, semantic_scholar, pubmed, openalex are core; scholar scrape and ieee scrape are opt-in plugins).

Concrete consequence: a flaky ACM endpoint cannot break an arXiv search. Each fetcher catches its own exceptions and returns an empty result; the pipeline aggregates whatever non-empty results came back.
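
The isolation rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: safe_fetch, fan_out, and the two toy fetchers are hypothetical names, and papers are plain strings here rather than Paper records.

```python
import asyncio
import logging

log = logging.getLogger("fanout_sketch")


async def safe_fetch(name, fetch):
    """Run one fetcher; on any error, log it and return an empty list."""
    try:
        return await fetch()
    except Exception as exc:  # isolation boundary: nothing escapes a fetcher
        log.warning("source %s failed: %s", name, exc)
        return []


async def fan_out(fetchers):
    """Run all fetchers concurrently and keep whatever came back."""
    batches = await asyncio.gather(
        *(safe_fetch(name, fetch) for name, fetch in fetchers.items())
    )
    return [paper for batch in batches for paper in batch]


async def arxiv_ok():  # hypothetical healthy source
    return ["arxiv:paper-1", "arxiv:paper-2"]


async def acm_flaky():  # hypothetical failing source
    raise ConnectionError("upstream timed out")


papers = asyncio.run(fan_out({"arxiv": arxiv_ok, "acm": acm_flaky}))
```

Catching inside the per-source wrapper, rather than around the gather call, is what keeps one failing source from cancelling the others.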

The pipeline

                Query
                  │
                  ▼
          ┌───────────────┐
          │ load_fetcher  │  one per Query.source
          └───────────────┘
                  │
                  ▼ (asyncio.gather, per-source semaphore)
       ┌──────────┴──────────┐
       ▼          ▼          ▼
   Fetcher     Fetcher     Fetcher     ← per-source token-bucket rate limit
   .fetch()    .fetch()    .fetch()       on the HTTPS-only async client
       │          │          │
       └──────────┼──────────┘
                  ▼
            list[Paper]
                  │
                  ▼
            ┌──────────┐
            │ dedupe   │  by DOI → arXiv ID → SHA-256(title+1st-author+year)
            └──────────┘
                  │
                  ▼
            ┌──────────┐
            │ rank     │  0.5·recency + 0.5·log(1 + citation_count)/20
            └──────────┘
                  │
                  ▼
        (optional) top-tier filter
                  │
                  ▼
          ┌────────────────┐
          │ oa_resolver    │  arXiv ID → Unpaywall → S2 → CORE → arXiv title;
          └────────────────┘  fills pdf_url for paywalled-source papers
                  │
                  ▼
        (optional) enrich      PDF → PaperSummary
                  │
                  ▼
          PaperCollection
                  │
                  ▼
          ┌───────────────┐
          │ Exporter      │  pptx, xlsx, bibtex, md, json, ris, csv, csl
          └───────────────┘

OA PDF resolution

thesisagents.core.oa_resolver runs after dedup + rank + top-tier filter. For every paper still missing pdf_url, five strategies fire in order, returning the first hit:

  1. arXiv-ID direct — if the paper carries arxiv_id (set by the openalex / pubmed / crossref / semantic_scholar parsers when the upstream identified an arXiv preprint), derive https://arxiv.org/pdf/{arxiv_id}.pdf directly. Zero network round-trip; highest precision; fastest.

  2. Unpaywall (https://api.unpaywall.org/v2/{doi}) — free, no API key; needs THESISAGENTS_CONTACT_EMAIL for politeness. ~50M papers indexed.

  3. Semantic Scholar OA index — S2’s openAccessPdf field is partially disjoint from Unpaywall; when one misses, the other often hits. Free, no API key required (rate-limited).

  4. CORE.ac.uk — aggregator of 200M+ OA repository items (institutional repos, regional preprint servers, OA journals). Needs THESISAGENTS_CORE_API_KEY (free); skipped silently when unset.

  5. arXiv title search — for papers without a DOI / arxiv_id, search arXiv by the paper’s title. Exact-match on the normalised title.

Every lookup is best-effort and never raises. A paper that none of the five strategies resolves continues through the pipeline with pdf_url=None, and the downstream paywall gate / per-paper renderer falls back to the lightweight tier.

Disable the resolver for a single run with the CLI’s --no-oa-resolve flag, or with run_search(query, resolve_oa=False) from Python.
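
The first-hit strategy chain described in this section can be sketched like this. It is an illustration only: resolve_pdf_url, arxiv_direct, and broken_lookup are hypothetical names, papers are plain dicts, and only the network-free arXiv-ID strategy is implemented.

```python
from typing import Callable, Optional

Strategy = Callable[[dict], Optional[str]]


def arxiv_direct(paper: dict) -> Optional[str]:
    """Strategy 1: derive the PDF URL from arxiv_id with no network call."""
    arxiv_id = paper.get("arxiv_id")
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf" if arxiv_id else None


def broken_lookup(paper: dict) -> Optional[str]:
    """Stand-in for a network strategy whose upstream is down."""
    raise TimeoutError("upstream down")


def resolve_pdf_url(paper: dict, strategies: list[Strategy]) -> Optional[str]:
    """Try strategies in order; return the first hit and never raise."""
    if paper.get("pdf_url"):
        return paper["pdf_url"]
    for strategy in strategies:
        try:
            url = strategy(paper)
        except Exception:
            continue  # best-effort: a failing lookup counts as a miss
        if url:
            return url
    return None


chain = [arxiv_direct, broken_lookup]
hit = resolve_pdf_url({"title": "A", "arxiv_id": "1706.03762"}, chain)
miss = resolve_pdf_url({"title": "B"}, chain)  # falls through to None
```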

Dedup

thesisagents.core.dedup is a three-pass merge:

  1. Strong-ID pass — papers sharing a DOI or arXiv ID are merged into one, keeping the most complete record (longest abstract, most authors, citation count from the source that has it).

  2. Title pass — among papers without strong IDs, normalise the title (NFKC + lowercase + strip punctuation), then SHA-256-hash title + first_author + year. Identical hashes are merged.

  3. Field union — for merged duplicates, every optional field (doi, arxiv_id, pdf_url, venue, citation_count, abstract) is taken from whichever source had it.

The dedup pass is O(N) — the bottleneck is hashing, not the field union step.
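
A minimal sketch of the key derivation and field union, assuming papers are plain dicts. The helper names (title_key, dedup_key, merge, dedup) are hypothetical, and the sketch simplifies two things: it keeps the first non-empty value per field instead of picking the most complete record, and it does not link a DOI-only record to an arXiv-only record of the same paper.

```python
import hashlib
import string
import unicodedata

_PUNCT = str.maketrans("", "", string.punctuation)
_EMPTY = (None, "", [])


def title_key(title, first_author, year):
    """NFKC-normalise, lowercase, strip punctuation, then SHA-256."""
    norm = unicodedata.normalize("NFKC", title).lower().translate(_PUNCT)
    norm = " ".join(norm.split())  # collapse whitespace
    raw = f"{norm}|{first_author.lower()}|{year}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dedup_key(paper):
    """Strong IDs win; papers without one fall back to the title hash."""
    if paper.get("doi"):
        return "doi:" + paper["doi"].lower()
    if paper.get("arxiv_id"):
        return "arxiv:" + paper["arxiv_id"]
    return "title:" + title_key(paper["title"], paper["authors"][0], paper["year"])


def merge(a, b):
    """Field union: keep a's value unless it is empty, then take b's."""
    return {k: a.get(k) if a.get(k) not in _EMPTY else b.get(k)
            for k in a.keys() | b.keys()}


def dedup(papers):
    """Single O(N) pass: one dict lookup per paper."""
    merged = {}
    for paper in papers:
        key = dedup_key(paper)
        merged[key] = merge(merged[key], paper) if key in merged else paper
    return list(merged.values())


unique = dedup([
    {"title": "A Study of Widgets", "authors": ["Lee"], "year": 2021,
     "doi": "10.0000/EXAMPLE", "abstract": None},
    {"title": "A study of widgets", "authors": ["Lee"], "year": 2021,
     "doi": "10.0000/example", "abstract": "We study widgets."},
    {"title": "Gadgets: A Survey", "authors": ["Kim"], "year": 2019},
    {"title": "gadgets a survey.", "authors": ["kim"], "year": 2019,
     "venue": "Example Conf"},
])
```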

Ranking

Default rank score: 0.5 · normalised_year + 0.5 · log(1 + citation_count) / 20.

Older but heavily-cited papers (the “Attention Is All You Need” of any field) still win against recent unknowns; very recent papers without citations are surfaced because the recency term keeps them in the top quartile.

To narrow results for a single query, use the optional min_citations filter on the MCP search tool; it filters by citation count rather than changing the 0.5 / 0.5 weight split.
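
The default score can be written out as a small function. Two details are assumptions because the text does not define them: normalised_year is taken to be a min-max scaling of the publication year over the result set's year range, and log is taken to be the natural logarithm. The function name rank_score and the sample rows are hypothetical.

```python
import math


def rank_score(year, citation_count, oldest, newest):
    """0.5 * normalised_year + 0.5 * log(1 + citation_count) / 20."""
    span = max(newest - oldest, 1)  # avoid division by zero
    normalised_year = min(max((year - oldest) / span, 0.0), 1.0)
    return 0.5 * normalised_year + 0.5 * math.log(1 + citation_count) / 20


# (label, year, citation_count): illustrative rows, not real data
papers = [
    ("old, uncited", 2015, 0),
    ("mid, some citations", 2020, 150),
    ("recent, uncited", 2024, 0),
]
ranked = sorted(
    papers,
    key=lambda p: rank_score(p[1], p[2], oldest=2015, newest=2025),
    reverse=True,
)
```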

Enrichment

Two distinct paths. The decision tree:

ANTHROPIC_API_KEY set?
├── yes → Python pipeline: pypdf/pymupdf extracts text,
│         thesisagents.intelligence.summarise calls the
│         Anthropic API, returns a structured PaperSummary
│         (motivation, contributions, method, results,
│         limitations + the rich tier).
└── no  → LLM-as-agent: the MCP client (e.g. Claude Code)
          calls fetch_pdf_text(), reads the text in its own
          context, writes a summary dict, passes it to export().
          No API key needed.

Both paths produce the same PaperSummary shape; the exporter doesn’t know or care which one wrote it.

The data model

Three frozen dataclasses carry the entire flow. Their fields are described in detail in Data model; a one-line summary:

  • Query: keywords, sources, max_results, year window, flags.

  • Paper: title / authors / year / venue / abstract / URLs / IDs / citation count / optional summary (a PaperSummary).

  • PaperCollection: a query (Query) plus papers (tuple[Paper]).

Frozen by design: any “edit” creates a new instance via dataclasses.replace(paper, summary=...). This makes the pipeline trivially safe to fan out across asyncio tasks.
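
A minimal sketch of the frozen-dataclass pattern. The two classes below are cut-down stand-ins with only a few fields, not the real Paper and PaperSummary definitions.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PaperSummary:  # stand-in with a single field
    motivation: str


@dataclass(frozen=True)
class Paper:  # stand-in with three fields
    title: str
    year: int
    summary: Optional[PaperSummary] = None


original = Paper(title="Example paper", year=2024)
# An "edit" builds a new instance; the original is left untouched.
enriched = replace(original, summary=PaperSummary(motivation="why it matters"))

try:
    original.summary = enriched.summary  # in-place mutation is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because no task can mutate a record another task is holding, instances can be shared across coroutines without locks.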

Surfaces

Each surface is a thin adapter over the same pipeline.

CLI (thesisagents.cli)

argparse parses flags into a Query or a single-paper identifier. The CLI is the only surface that does its own asyncio.run; the library APIs return coroutines.

MCP server (thesisagents.mcp)

FastMCP registers twelve tools. The agent calls them in sequence (list_sources → search → fetch_pdf_text per paper → export); the server is stateless across tool calls so the agent’s context is the only place state lives. See MCP doc.

Desktop GUI (thesisagents.gui)

PySide6 widgets call the same run_search / export_collection that the CLI does, but on a QThreadPool worker so the UI thread stays responsive. See GUI doc.

Python library

Anything in thesisagents.core.pipeline is importable from your own code:

import asyncio

from thesisagents.core.models import ExportOptions, Query
from thesisagents.core.pipeline import run_search
from thesisagents.exporters import export_collection


async def main():
    # Search a single source for up to 10 papers.
    q = Query(keywords="transformer", sources=("arxiv",), max_results=10)
    collection = await run_search(q)
    # Exporters are synchronous, so there is no await here.
    written = export_collection(
        collection,
        ExportOptions(formats=("pptx", "bibtex"), out_dir="./exports"),
    )
    print(written)


# The library returns coroutines, so the caller owns the event loop.
asyncio.run(main())

Infrastructure

HTTPS-only HTTP client

thesisagents.fetchers.http.get_client(source) returns a per-source httpx.AsyncClient that:

  • Refuses any URL whose scheme isn’t https (refused both at request time AND mid-redirect).

  • Carries the source’s User-Agent.

  • Routes every request through the source’s token-bucket rate limiter.

  • Retries 429 / 5xx with exponential backoff + jitter.

  • Pools connections for the process lifetime.

There is exactly one client per source per process. Re-entering the pipeline reuses the same client. shutdown_clients() closes all clients at CLI exit; it’s tolerant of clients whose loop already closed (test-suite isolation requirement).
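
The scheme check and the retry delay can be sketched with the standard library alone. This is not the real client: require_https, next_hop, backoff_delay, and InsecureURLError are hypothetical names, and the base and cap values are illustrative.

```python
import random
from urllib.parse import urljoin, urlsplit


class InsecureURLError(ValueError):
    """Raised when a request or redirect target is not https."""


def require_https(url: str) -> str:
    """Request-time check: refuse anything whose scheme is not https."""
    if urlsplit(url).scheme.lower() != "https":
        raise InsecureURLError(f"refusing non-https URL: {url}")
    return url


def next_hop(current_url: str, location_header: str) -> str:
    """Mid-redirect check: resolve the Location header, then re-check."""
    return require_https(urljoin(current_url, location_header))


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter for 429 / 5xx retries."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


same_host = next_hop("https://example.org/a", "/b")
try:
    next_hop("https://example.org/a", "http://example.org/b")
    downgrade_refused = False
except InsecureURLError:
    downgrade_refused = True
```

Re-checking every hop matters because a server can answer an https request with a redirect to a plain-http URL.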

Rate limiting

Token bucket in thesisagents.fetchers.rate_limit. Each source declares its bucket parameters in sources/<name>/config.py:

RATE_LIMIT = RateLimit(
    requests_per_second=1 / 3.0,   # 1 request every 3 s
    burst=1,
    jitter_seconds=0.5,
)

The bucket is a decorator on the HTTP client — retries also go through it. There is no way to bypass the bucket without deleting code from the source plugin.
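
For readers unfamiliar with the technique, here is a minimal asyncio token bucket that takes the same three parameters as the RateLimit above. The class is a sketch, not the project's implementation: each caller reserves a token immediately, which may push the balance negative, and then sleeps until that reservation is covered.

```python
import asyncio
import random
import time


class TokenBucket:
    """Holds up to `burst` tokens, refilled at `requests_per_second`."""

    def __init__(self, requests_per_second, burst=1, jitter_seconds=0.0):
        self.rate = requests_per_second
        self.capacity = float(burst)
        self.jitter = jitter_seconds
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        earned = (now - self.updated) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now

    async def acquire(self):
        self._refill()
        self.tokens -= 1.0  # reserve now; a negative balance is a debt
        delay = max(0.0, -self.tokens / self.rate)
        if delay > 0:
            await asyncio.sleep(delay + random.uniform(0, self.jitter))


async def demo():
    bucket = TokenBucket(requests_per_second=20, burst=1)
    start = time.monotonic()
    for _ in range(3):  # first call is free, the next two wait about 50 ms each
        await bucket.acquire()
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

Reserving before sleeping keeps concurrent callers ordered without a lock, since the bookkeeping runs synchronously between awaits.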

Cache

thesisagents.core.cache provides an SHA-256-keyed disk cache for raw responses. Default TTL is 24h; override per-source if needed. Tests redirect the cache root to tmp_path so they never touch the user’s cache.
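
A SHA-256-keyed disk cache with a TTL can be sketched in a few lines. DiskCache and its methods are hypothetical names, and using the file's modification time as the expiry clock is an assumption of this sketch.

```python
import hashlib
import tempfile
import time
from pathlib import Path


class DiskCache:
    def __init__(self, root, ttl_seconds=24 * 3600):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl_seconds

    def _path(self, key: str) -> Path:
        # Hashing the key yields a fixed-length, filesystem-safe file name.
        return self.root / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def get(self, key: str):
        path = self._path(key)
        if not path.exists():
            return None
        if time.time() - path.stat().st_mtime > self.ttl:
            path.unlink()  # expired: drop the entry and report a miss
            return None
        return path.read_bytes()

    def put(self, key: str, payload: bytes) -> None:
        self._path(key).write_bytes(payload)


# A temporary directory keeps the demo away from any real cache root.
with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.org/api?q=transformer"
    cache = DiskCache(tmp)
    cache.put(url, b'{"ok": true}')
    hit = cache.get(url)
    miss = cache.get("https://example.org/api?q=other")
    expired = DiskCache(tmp, ttl_seconds=-1).get(url)  # negative TTL forces expiry
```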

i18n

Two separate tables, one per scope:

  • thesisagents.exporters.i18n — slide-deck strings (“Agenda”, “References”, “Paper N of M”, “Background”, etc.) in all 14 supported languages. Coverage enforced by tests/test_i18n.py::test_every_language_has_every_key.

  • thesisagents.gui.i18n — UI label strings, identical language set, coverage enforced by tests/gui/test_i18n.py.

Adding a new key requires filling in all 14 languages.
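
A coverage check of this kind might look like the sketch below. The table shape (language code mapped to a key/string dict), the sample strings, and the missing_keys helper are assumptions for illustration, not the contents of the real test.

```python
# Hypothetical three-language table; the real tables cover 14 languages.
STRINGS = {
    "en": {"agenda": "Agenda", "references": "References"},
    "de": {"agenda": "Agenda", "references": "Literatur"},
    "fr": {"agenda": "Ordre du jour"},  # "references" deliberately missing
}


def missing_keys(table, reference="en"):
    """Map each language to the reference keys it lacks (empty when complete)."""
    expected = set(table[reference])
    return {
        lang: sorted(expected - set(entries))
        for lang, entries in table.items()
        if expected - set(entries)
    }


gaps = missing_keys(STRINGS)
```

A test would assert that the result is empty, so adding a key to one language fails CI until every other language has it.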

Source plugin contract

A source plugin lives at sources/<name>/ and must expose:

  • sources/<name>/__init__.py setting fetcher_class = FetcherClass.

  • sources/<name>/fetcher.py with a Fetcher subclass.

  • sources/<name>/parser.py converting raw payloads → Paper.

  • sources/<name>/config.py declaring the RateLimit.

The pipeline finds plugins by injecting sources/ into sys.path at startup (thesisagents.app.source_manager). At fetch time it imports <name>, reads fetcher_class, and instantiates it with the shared HTTP client + cache.
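
The discovery step can be sketched with importlib. load_fetcher_class and the throwaway demo_source_sketch package are hypothetical; the real loader also passes the HTTP client and cache to the constructor, which this sketch omits.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def load_fetcher_class(sources_dir, name):
    """Put sources_dir on sys.path, import <name>, return its fetcher_class."""
    sources_dir = str(sources_dir)
    if sources_dir not in sys.path:
        sys.path.insert(0, sources_dir)
    importlib.invalidate_caches()  # the package may have just been created
    module = importlib.import_module(name)
    return module.fetcher_class


# Build a throwaway plugin package so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "demo_source_sketch"
    plugin.mkdir()
    (plugin / "__init__.py").write_text(textwrap.dedent('''
        class DemoFetcher:
            source = "demo"

        fetcher_class = DemoFetcher
    '''))
    cls = load_fetcher_class(tmp, "demo_source_sketch")
    fetcher_name = cls.__name__
    sys.path.remove(tmp)  # undo the path injection for the demo
```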

Full authoring guide: Source plugin authoring.

Slide-deck rendering tiers

The .pptx exporter dispatches to one of three layouts based on how much info each paper carries:

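┌────────────────┬──────────────────────────────────────────────┬─────────────────────────────────┐
│ Tier           │ Trigger                                      │ Slides per paper                │
├────────────────┼──────────────────────────────────────────────┼─────────────────────────────────┤
│ Lightweight    │ only abstract populated                      │ 4–6 (cover + agenda +           │
│                │                                              │ Background / Approach /         │
│                │                                              │ Findings sentence buckets +     │
│                │                                              │ references)                     │
├────────────────┼──────────────────────────────────────────────┼─────────────────────────────────┤
│ Enriched-flat  │ Paper.summary has motivation /               │ one slide per non-empty section │
│                │ contributions / method / results /           │                                 │
│                │ limitations / takeaways                      │                                 │
├────────────────┼──────────────────────────────────────────────┼─────────────────────────────────┤
│ Thesis-style   │ Paper.summary.has_rich_fields() is true      │ 20+ slides per paper            │
│                │ (pain_points, research_question,             │                                 │
│                │ contributions_detailed, headline_metrics,    │                                 │
│                │ technique_table, evaluation_sections,        │                                 │
│                │ system_flow, research_questions, rq_results, │                                 │
│                │ core_observation, limitations, future_work,  │                                 │
│                │ …)                                           │                                 │
└────────────────┴──────────────────────────────────────────────┴─────────────────────────────────┘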

All three tiers share the same shape-naming convention so pptx_edit.update_slide(..., title=...) looks up shapes by name.
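
The tier dispatch can be sketched as a three-way check. The classes are cut-down stand-ins, select_tier is a hypothetical name, and this has_rich_fields body inspects only two of the rich fields listed for the thesis-style tier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaperSummary:  # stand-in: one flat field, two rich fields
    motivation: str = ""
    research_questions: tuple = ()
    headline_metrics: tuple = ()

    def has_rich_fields(self) -> bool:
        return bool(self.research_questions or self.headline_metrics)


@dataclass(frozen=True)
class Paper:  # stand-in
    abstract: str
    summary: Optional[PaperSummary] = None


def select_tier(paper: Paper) -> str:
    """Pick the richest layout the paper's data can support."""
    if paper.summary is None:
        return "lightweight"
    if paper.summary.has_rich_fields():
        return "thesis-style"
    return "enriched-flat"


tiers = [
    select_tier(Paper(abstract="...")),
    select_tier(Paper(abstract="...", summary=PaperSummary(motivation="m"))),
    select_tier(Paper(abstract="...",
                      summary=PaperSummary(research_questions=("RQ1",)))),
]
```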

Post-build visual-identity passes

After the chosen tier builds the deck on the light palette, three non-invasive walk-and-rewrite passes run before the file is saved:

  1. Typography (_apply_typography(prs, language)) — walks every text run, writes <a:latin typeface=…> AND <a:ea typeface=…> on the run’s XML based on _FONT_FAMILIES[language]. Setting only run.font.name (the Latin slot) leaves CJK glyphs in PowerPoint’s default East-Asian font; both slots matter.

  2. Accent geometry (_decorate_with_accents(prs)) — adds the accent_top bar to every content slide and an accent_left band to the cover. Both are full-width / full-height navy rectangles the user never sees as separate shapes but instantly reads as “this deck has an identity”.

  3. Dark-mode recolour (_apply_dark_mode(prs), runs when ExportOptions.dark_mode=True, which is the default) — walks every slide / shape / run / table cell and swaps light-palette RGBs to their dark equivalents via _LIGHT_TO_DARK_TEXT + _LIGHT_TO_DARK_FILL dicts. The slide background switches to #12151B; body text goes to #E5E7EB; the teal accent (#0E7490) goes to a brighter #2DD4BF. The pass is intentionally non-invasive: it doesn’t refactor the 100+ direct _BRAND_* constant references in the builders, it just rewrites RGBs after the fact.

The three passes ship with regression tests in tests/test_exporters.py: test_pptx_default_is_dark_mode, test_pptx_dark_mode_has_no_invisible_runs (no run is rgb=None or black), test_pptx_dark_mode_no_light_text_on_light_fill (no near-white text inside a near-white-filled callout), and test_pptx_no_red_text_runs (red #C0392B is banned for text).

Why the design choices

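┌──────────────────────────────────┬──────────────────────────────────────────────────────────┐
│ Choice                           │ Reason                                                   │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ Per-source plugins, not adapters │ A flaky upstream (Scholar layout change, IEEE token      │
│                                  │ expiry) shouldn’t break the whole pipeline. Plugins fail │
│                                  │ in isolation.                                            │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ Async I/O, sync exporters        │ Network is parallelisable; rendering a .pptx is          │
│                                  │ CPU-bound and finishes in milliseconds, so there is no   │
│                                  │ win from making it async.                                │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ One HTTPS-only client per source │ Shared connection pools + token bucket. Multiple clients │
│                                  │ per source would defeat both.                            │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ Frozen dataclasses               │ Trivially thread/coroutine-safe; “edits” create new      │
│                                  │ instances via dataclasses.replace.                       │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ Recorded fixtures only           │ Tests run offline, deterministically, in <30 s. Live     │
│                                  │ HTTP would make CI flaky and rate-limited.               │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ Two i18n tables (UI vs deck)     │ Lets the UI ship with fewer translations than the deck   │
│                                  │ if needed; today both cover all 14 languages, but the    │
│                                  │ split keeps optionality.                                 │
├──────────────────────────────────┼──────────────────────────────────────────────────────────┤
│ No global mutable state          │ Singletons (HTTP clients, cache handle, rate-limit       │
│                                  │ buckets) are encapsulated in module-level classes. GUI   │
│                                  │ session state is the only mutable per-session state, and │
│                                  │ it’s per-session by design.                              │
└──────────────────────────────────┴──────────────────────────────────────────────────────────┘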

Performance notes

  • The bottleneck for a typical search is network latency, not CPU. Async fan-out across sources brings a 10-source search down from sum(latency) to max(latency).

  • The pptx exporter is the single biggest CPU consumer — about 200 ms per paper for the thesis-style tier. Lightweight tier is 10× faster.

  • Dedup is O(N) on the number of papers; with --max 200 × 11 sources that’s ~2200 papers max, and dedup still finishes in under 50 ms.

  • The [intelligence] extra’s Anthropic API call is the dominant cost when --enrich is on — typically 5–15 s per paper. The pipeline batches these with a per-source semaphore.

See thesisagents/utils/profiling.py for with section("name"): helpers if you’re chasing a regression.