# Architecture

How ThesisAgents is organised, why those boundaries exist, and how a single keyword turns into a thesis-style `.pptx`.

## One-paragraph summary

A user (a human, a CLI process, an MCP-aware LLM, or the desktop GUI) submits a `Query`. The pipeline fans out to per-source **Fetcher** plugins, normalises each plugin's payload into a shared `Paper` record, deduplicates by DOI / arXiv-ID / normalised-title hash, ranks by recency + citation count, optionally enriches each paper with a structured `PaperSummary`, and hands the resulting `PaperCollection` to one or more **Exporter** plugins (`.pptx`, `.xlsx`, `.bib`, `.md`, `.json`, `.ris`, `.csv`, `.csl.json`). All outbound HTTP goes through one HTTPS-only client per source, all per-source rate limits live in a token bucket, and every fetcher test uses a recorded fixture (zero live HTTP in the test suite).

## Layered view

```
┌─────────────────────────────────────────────────────────────┐
│  Surfaces                                                   │
│  CLI · MCP server · Desktop GUI (PySide6) · Python library  │
├─────────────────────────────────────────────────────────────┤
│  Pipeline                                                   │
│  Query → fetch → normalise → dedup → rank → enrich → export │
├─────────────────────────────────────────────────────────────┤
│  Core domain                                                │
│  Paper · PaperCollection · PaperSummary · RqResult · Query  │
├──────────────────────────┬──────────────────────────────────┤
│  Fetchers                │  Exporters                       │
│  arxiv, semantic_scholar │  pptx (3 tiers) · xlsx · bibtex  │
│  openalex, pubmed, …     │  markdown · json · pptx_edit     │
├──────────────────────────┴──────────────────────────────────┤
│  Infra                                                      │
│  HTTPS-only client · token-bucket rate limit · cache · i18n │
└─────────────────────────────────────────────────────────────┘
```

Dependencies only flow downward. Surfaces depend on the pipeline, the pipeline depends on the core domain + fetchers + exporters, and everything depends on infra. **An exporter never imports a fetcher**; it only consumes a `PaperCollection`.
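The downward-only dependency rule lends itself to an automated check. The sketch below is one hypothetical way to do it, not something the project is known to ship: it parses a module's source with the standard-library `ast` module and reports imports that point at a forbidden package. The helper name `forbidden_imports` and the two sample module texts are assumptions made for illustration.

```python
import ast


def forbidden_imports(source: str, forbidden_prefix: str) -> list[str]:
    """Return imported module names in `source` that fall under `forbidden_prefix`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        hits.extend(
            name for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return hits


# Hypothetical exporter module texts, used only to exercise the check.
GOOD_EXPORTER = "from thesisagents.core.models import PaperCollection\n"
BAD_EXPORTER = "import json\nfrom thesisagents.fetchers.http import get_client\n"

print(forbidden_imports(GOOD_EXPORTER, "thesisagents.fetchers"))  # no violations
print(forbidden_imports(BAD_EXPORTER, "thesisagents.fetchers"))   # one violation
```

A check like this could run over every file under `exporters/` in a unit test, which would turn the layering rule into something CI can enforce.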
## Top-level layout

```
ThesisAgents/
├── thesisagents/          # main package: core runtime
│   ├── core/              # domain (Paper, Query, dedup, rank, pipeline)
│   ├── fetchers/          # HTTPS-only http client + Fetcher base
│   ├── exporters/         # pptx / xlsx / bib / md / json / ris / csv / csl + pptx_edit + i18n
│   ├── intelligence/      # PDF + Anthropic summariser ([intelligence] extra)
│   ├── mcp/               # FastMCP server registering 12 tools ([mcp] extra)
│   ├── gui/               # PySide6 desktop UI ([gui] extra)
│   ├── utils/             # logging, path safety, async helpers
│   ├── cli.py             # argparse CLI
│   └── __main__.py        # `python -m thesisagents`
├── sources//              # per-source plugins (arxiv, pubmed, …)
│   ├── __init__.py        # exports `fetcher_class`
│   ├── fetcher.py         # Fetcher subclass
│   ├── parser.py          # payload → Paper
│   └── config.py          # RateLimit + endpoint URL
├── tests/                 # pytest suite + recorded fixtures
├── docs/                  # Sphinx (en + 13 language stubs)
├── scripts/               # regen / fixture-record helpers
└── pyproject.toml         # metadata, ruff, bandit, extras
```

## Core vs source plugins

The split between `thesisagents/` and `sources//` is about **dependency surface and failure isolation**, not "anything source-related is a plugin."

A feature is a **source plugin** when ANY of the following holds:

1. It needs a heavy or optional runtime dep (vendor SDK, Selenium).
2. It needs failure isolation: a flaky upstream should not break the rest of the pipeline.
3. It needs an independent release cadence: a Scholar HTML layout change should ship without re-shipping the engine.

A feature stays in **core** when:

- It runs on the default dep set (no extras).
- It serves the everyday workflow every user expects to work (arxiv, semantic_scholar, pubmed, openalex are core; scholar scrape and ieee scrape are opt-in plugins).

Concrete consequence: a flaky ACM endpoint cannot break an arXiv search. Each fetcher catches its own exceptions and returns an empty result; the pipeline aggregates whatever non-empty results came back.
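The isolation guarantee can be illustrated with a small `asyncio` sketch. The fetcher functions and the `fetch_all` helper are hypothetical stand-ins, not the project's real classes; the point is only that each source swallows its own failure and the aggregate keeps whatever succeeded.

```python
import asyncio


async def fetch_arxiv(keywords: str) -> list[str]:
    # Hypothetical healthy source: returns paper titles.
    await asyncio.sleep(0)
    return [f"arXiv paper about {keywords}"]


async def fetch_acm(keywords: str) -> list[str]:
    # Hypothetical flaky source: always fails.
    await asyncio.sleep(0)
    raise ConnectionError("upstream unavailable")


async def safe_fetch(fetch, keywords: str) -> list[str]:
    """Run one fetcher; convert any failure into an empty result."""
    try:
        return await fetch(keywords)
    except Exception:
        return []


async def fetch_all(keywords: str) -> list[str]:
    batches = await asyncio.gather(
        safe_fetch(fetch_arxiv, keywords),
        safe_fetch(fetch_acm, keywords),
    )
    # Flatten; failed sources contribute empty batches and vanish here.
    return [paper for batch in batches for paper in batch]


papers = asyncio.run(fetch_all("transformer"))
print(papers)
```

Wrapping each call rather than passing `return_exceptions=True` to `asyncio.gather` keeps the error handling next to the source that failed, which matches the "each fetcher catches its own exceptions" wording.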
## The pipeline

```
                Query
                  │
                  ▼
          ┌───────────────┐
          │ load_fetcher  │  one per Query.source
          └───────────────┘
                  │
                  ▼   (asyncio.gather, per-source semaphore)
       ┌──────────┴──────────┐
       ▼          ▼          ▼
    Fetcher    Fetcher    Fetcher    ← per-source token-bucket rate limit
    .fetch()   .fetch()   .fetch()     on the HTTPS-only async client
       │          │          │
       └──────────┼──────────┘
                  ▼
             list[Paper]
                  │
                  ▼
            ┌──────────┐
            │  dedupe  │  by DOI → arXiv ID → SHA-256(title+1st-author+year)
            └──────────┘
                  │
                  ▼
            ┌──────────┐
            │   rank   │  weighted sum of recency and log(1 + citation_count)
            └──────────┘
                  │
                  ▼   (optional) top-tier filter
                  │
                  ▼
         ┌────────────────┐
         │  oa_resolver   │  Unpaywall + arXiv title fallback;
         └────────────────┘  fills pdf_url for paywalled-source papers
                  │
                  ▼   (optional) enrich PDF → PaperSummary
                  │
                  ▼
           PaperCollection
                  │
                  ▼
          ┌───────────────┐
          │   Exporter    │  pptx, xlsx, bibtex, md, json, ris, csv, csl
          └───────────────┘
```

### OA PDF resolution

`thesisagents.core.oa_resolver` runs after dedup + rank + top-tier filter. For every paper still missing `pdf_url`, five strategies fire in order, returning the first hit:

1. **arXiv-ID direct**: if the paper carries `arxiv_id` (set by the openalex / pubmed / crossref / semantic_scholar parsers when the upstream identified an arXiv preprint), derive `https://arxiv.org/pdf/{arxiv_id}.pdf` directly. Zero network round-trip; highest precision; fastest.
2. **Unpaywall** (`https://api.unpaywall.org/v2/{doi}`): free, no API key; needs `THESISAGENTS_CONTACT_EMAIL` for politeness. ~50M papers indexed.
3. **Semantic Scholar OA index**: S2's `openAccessPdf` field is partially disjoint from Unpaywall; when one misses, the other often hits. Free, no API key required (rate-limited).
4. **CORE.ac.uk**: aggregator of 200M+ OA repository items (institutional repos, regional preprint servers, OA journals). Needs `THESISAGENTS_CORE_API_KEY` (free); skipped silently when unset.
5. **arXiv title search**: for papers without a DOI / arxiv_id, search arXiv by the paper's title. Exact-match on the normalised title.
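The ordered, first-hit-wins lookup can be sketched as a chain of small strategy functions, each treated as best-effort. Only the arXiv URL pattern comes from the list above; the `Paper` stand-in, the function names, the stubbed failing strategy, and the example DOI are assumptions for illustration, and the real resolver's signatures may differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Paper:
    # Minimal stand-in for the real Paper record (assumption).
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    pdf_url: Optional[str] = None


def arxiv_direct(paper: Paper) -> Optional[str]:
    # Strategy 1: derive the PDF URL from the arXiv ID, no network needed.
    if paper.arxiv_id:
        return f"https://arxiv.org/pdf/{paper.arxiv_id}.pdf"
    return None


def failing_lookup(paper: Paper) -> Optional[str]:
    # Stub for a network-backed strategy whose upstream errors out.
    raise TimeoutError("simulated upstream timeout")


Strategy = Callable[[Paper], Optional[str]]


def resolve_pdf_url(paper: Paper, strategies: list[Strategy]) -> Optional[str]:
    """Try each strategy in order; return the first hit and never raise."""
    if paper.pdf_url:
        return paper.pdf_url
    for strategy in strategies:
        try:
            url = strategy(paper)
        except Exception:
            continue  # a failed strategy simply yields no hit
        if url:
            return url
    return None


chain = [failing_lookup, arxiv_direct]
print(resolve_pdf_url(Paper("Some preprint", arxiv_id="1706.03762"), chain))
print(resolve_pdf_url(Paper("Paywalled paper", doi="10.1000/example"), chain))
```

Putting the zero-cost arXiv derivation first in a real chain means the network-backed strategies only run for papers that actually need them.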
Every lookup is best-effort and never raises; a paper that resists all five passes through with `pdf_url=None`, and the downstream paywall gate / per-paper renderer falls back to the lightweight tier. Resolution can be disabled per run via the CLI's `--no-oa-resolve` flag or `run_search(query, resolve_oa=False)` from Python.

### Dedup

`thesisagents.core.dedup` is a three-pass merge:

1. Strong-ID pass: papers sharing a DOI or arXiv ID are merged into one, keeping the most complete record (longest abstract, most authors, citation count from the source that has it).
2. Title pass: among papers without strong IDs, normalise the title (NFKC + lowercase + strip punctuation), then SHA-256-hash `title + first_author + year`. Identical hashes are merged.
3. Field union: for merged duplicates, every optional field (`doi`, `arxiv_id`, `pdf_url`, `venue`, `citation_count`, `abstract`) is taken from whichever source had it.

The dedup pass is O(N); the bottleneck is hashing, not the field union step.

### Ranking

Default rank score: `0.5 · normalised_year + 0.5 · log(1 + citation_count) / 20`. Older but heavily-cited papers (the "Attention Is All You Need" of any field) still win against recent unknowns; very recent papers without citations are surfaced because the recency term keeps them in the top quartile. To narrow results per query, use the optional `min_citations` filter on the MCP `search` tool; as a filter, it drops low-citation papers rather than changing the weight split.

### Enrichment

Two distinct paths. The decision tree:

```
ANTHROPIC_API_KEY set?
├── yes → Python pipeline: pypdf/pymupdf extracts text,
│         thesisagents.intelligence.summarise calls the
│         Anthropic API, returns a structured PaperSummary
│         (motivation, contributions, method, results,
│         limitations + the rich tier).
└── no  → LLM-as-agent: the MCP client (e.g. Claude Code)
          calls fetch_pdf_text(), reads the text in its own
          context, writes a summary dict, passes it to
          export(). No API key needed.
```

Both paths produce the same `PaperSummary` shape; the exporter doesn't know or care which one wrote it.

## The data model

Three frozen dataclasses carry the entire flow. Their fields are described in detail in [Data model](data_model.md); a one-line summary:

- **`Query`**: keywords, sources, max_results, year window, flags.
- **`Paper`**: title / authors / year / venue / abstract / URLs / IDs / citation count / optional `summary: PaperSummary`.
- **`PaperCollection`**: `query: Query` + `papers: tuple[Paper, ...]`.

Frozen by design: any "edit" creates a new instance via `dataclasses.replace(paper, summary=...)`. This makes the pipeline trivially safe to fan out across asyncio tasks.

## Surfaces

Each surface is a thin adapter over the same pipeline.

### CLI (`thesisagents.cli`)

`argparse` parses flags into a `Query` or a single-paper identifier. The CLI is the only surface that does its own `asyncio.run`; the library APIs return coroutines.

### MCP server (`thesisagents.mcp`)

FastMCP registers twelve tools. The agent calls them in sequence (`list_sources` → `search` → `fetch_pdf_text` per paper → `export`); the server is stateless across tool calls, so the agent's context is the only place state lives. See [MCP doc](mcp.md).

### Desktop GUI (`thesisagents.gui`)

PySide6 widgets call the same `run_search` / `export_collection` that the CLI does, but on a `QThreadPool` worker so the UI thread stays responsive. See [GUI doc](gui.md).
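Keeping a blocking search off the UI thread follows a common pattern. The sketch below uses the standard library's `ThreadPoolExecutor` as a stand-in for `QThreadPool` so that it runs without PySide6; `fake_run_search` is a hypothetical placeholder for the real coroutine, and the done-callback plays the role a Qt signal would play.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_run_search(keywords: str) -> list[str]:
    # Placeholder for the real async pipeline call (assumption).
    await asyncio.sleep(0.01)
    return [f"result for {keywords}"]


def search_worker(keywords: str) -> list[str]:
    # Runs in a worker thread, so it can own its own event loop.
    return asyncio.run(fake_run_search(keywords))


results: list[str] = []

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(search_worker, "transformer")
    # In a GUI this would emit a signal instead of appending to a list.
    future.add_done_callback(lambda f: results.extend(f.result()))
    # The submitting thread stays free to handle events at this point.

print(results)
```

In a real Qt application the callback must hand the result back to the UI thread (for example through a signal) rather than touching widgets from the worker.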
### Python library

Anything in `thesisagents.core.pipeline` is importable from your own code:

```python
import asyncio

from thesisagents.core.models import ExportOptions, Query
from thesisagents.core.pipeline import run_search
from thesisagents.exporters import export_collection


async def main():
    q = Query(keywords="transformer", sources=("arxiv",), max_results=10)
    collection = await run_search(q)
    written = export_collection(
        collection,
        ExportOptions(formats=("pptx", "bibtex"), out_dir="./exports"),
    )
    print(written)


asyncio.run(main())
```

## Infrastructure

### HTTPS-only HTTP client

`thesisagents.fetchers.http.get_client(source)` returns a per-source `httpx.AsyncClient` that:

- Refuses any URL whose scheme isn't `https` (refused both at request time AND mid-redirect).
- Carries the source's User-Agent.
- Routes every request through the source's token-bucket rate limiter.
- Retries 429 / 5xx with exponential backoff + jitter.
- Pools connections for the process lifetime.

There is exactly **one** client per source per process. Re-entering the pipeline reuses the same client. `shutdown_clients()` closes all clients at CLI exit; it's tolerant of clients whose loop already closed (test-suite isolation requirement).

### Rate limiting

Token bucket in `thesisagents.fetchers.rate_limit`. Each source declares its bucket parameters in `sources//config.py`:

```python
RATE_LIMIT = RateLimit(
    requests_per_second=1 / 3.0,  # 1 request every 3 s
    burst=1,
    jitter_seconds=0.5,
)
```

The bucket is a decorator on the HTTP client, so **retries also go through it**. There is no way to bypass the bucket without deleting code from the source plugin.

### Cache

`thesisagents.core.cache` provides an SHA-256-keyed disk cache for raw responses. Default TTL is 24h; override per-source if needed. Tests redirect the cache root to `tmp_path` so they never touch the user's cache.
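As a rough illustration of a SHA-256-keyed disk cache with a TTL, the sketch below stores raw bytes under the hex digest of a request URL. The class name `DiskCache`, the key derivation, the mtime-based expiry, and the file layout are all assumptions; the real `thesisagents.core.cache` API is not shown in this document.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from typing import Optional


class DiskCache:
    """Hypothetical SHA-256-keyed cache; expiry is based on file mtime."""

    def __init__(self, root: Path, ttl_seconds: float = 24 * 3600):
        self.root = root
        self.ttl_seconds = ttl_seconds
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / digest

    def get(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        if not path.exists():
            return None
        if time.time() - path.stat().st_mtime > self.ttl_seconds:
            path.unlink()  # expired entry: drop it and report a miss
            return None
        return path.read_bytes()

    def put(self, key: str, payload: bytes) -> None:
        self._path(key).write_bytes(payload)


# A temporary directory keeps the example away from any real cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(Path(tmp))
    url = "https://example.org/api?q=transformer"
    miss = cache.get(url)
    cache.put(url, b'{"papers": []}')
    hit = cache.get(url)
    # A negative TTL forces every stored entry to count as expired.
    expired = DiskCache(Path(tmp), ttl_seconds=-1).get(url)

print(miss, hit, expired)
```

Pointing the root at a temporary directory mirrors the way the test suite is described as redirecting the cache to `tmp_path`.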
### i18n

Two separate tables to balance scope:

- `thesisagents.exporters.i18n`: slide-deck strings ("Agenda", "References", "Paper N of M", "Background", etc.) in all 14 supported languages. Coverage enforced by `tests/test_i18n.py::test_every_language_has_every_key`.
- `thesisagents.gui.i18n`: UI label strings, identical language set, coverage enforced by `tests/gui/test_i18n.py`.

Adding a new key requires filling in all 14 languages.

## Source plugin contract

A source plugin lives at `sources//` and must expose:

- `sources//__init__.py` setting `fetcher_class = FetcherClass`.
- `sources//fetcher.py` with a `Fetcher` subclass.
- `sources//parser.py` converting raw payloads → `Paper`.
- `sources//config.py` declaring the `RateLimit`.

The pipeline finds plugins by injecting `sources/` into `sys.path` at startup (`thesisagents.app.source_manager`). At fetch time it imports the plugin's package, reads `fetcher_class`, and instantiates it with the shared HTTP client + cache.

Full authoring guide: [Source plugin authoring](source_plugins.md).

## Slide-deck rendering tiers

The `.pptx` exporter dispatches to one of three layouts based on how much info each paper carries:

| Tier | Trigger | Slides per paper |
|---|---|---|
| Lightweight | only `abstract` populated | 4–6 (cover + agenda + Background / Approach / Findings sentence buckets + references) |
| Enriched-flat | `Paper.summary` has `motivation` / `contributions` / `method` / `results` / `limitations` / `takeaways` | one slide per non-empty section |
| Thesis-style | `Paper.summary.has_rich_fields()` is true (pain_points, research_question, contributions_detailed, headline_metrics, technique_table, evaluation_sections, system_flow, research_questions, rq_results, core_observation, limitations, future_work, ...) | 20+ slides per paper |

All three tiers share the same shape-naming convention, so `pptx_edit.update_slide(..., title=...)` can look up shapes by name.
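The dispatch rule in the table can be sketched as a small function. The `Summary` stand-in carries only three illustrative fields and `pick_tier` is a hypothetical name; the real `PaperSummary.has_rich_fields()` checks many more attributes than shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Summary:
    # Tiny stand-in for PaperSummary (assumption): one flat field, two rich ones.
    motivation: str = ""
    pain_points: tuple[str, ...] = ()
    research_question: str = ""

    def has_rich_fields(self) -> bool:
        return bool(self.pain_points or self.research_question)


def pick_tier(summary: Optional[Summary]) -> str:
    """Choose a layout tier from how much information is available."""
    if summary is None:
        return "lightweight"  # only the abstract is available
    if summary.has_rich_fields():
        return "thesis-style"
    return "enriched-flat"


print(pick_tier(None))
print(pick_tier(Summary(motivation="Why this matters")))
print(pick_tier(Summary(pain_points=("slow training",))))
```

Checking the rich fields before the flat ones matters: a summary that carries both should land in the richer tier.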
### Post-build visual-identity passes

After the chosen tier builds the deck on the light palette, three non-invasive walk-and-rewrite passes run before the file is saved:

1. **Typography** (`_apply_typography(prs, language)`): walks every text run and writes both the Latin and the East-Asian typeface elements on the run's XML based on `_FONT_FAMILIES[language]`. Setting only `run.font.name` (the Latin slot) leaves CJK glyphs in PowerPoint's default East-Asian font; both slots matter.
2. **Accent geometry** (`_decorate_with_accents(prs)`): adds the `accent_top` bar to every content slide and an `accent_left` band to the cover. Both are full-width / full-height navy rectangles that the user never notices as separate shapes but instantly reads as "this deck has an identity".
3. **Dark-mode recolour** (`_apply_dark_mode(prs)`, runs when `ExportOptions.dark_mode=True`, which is the default): walks every slide / shape / run / table cell and swaps light-palette RGBs to their dark equivalents via the `_LIGHT_TO_DARK_TEXT` + `_LIGHT_TO_DARK_FILL` dicts. The slide background switches to `#12151B`; body text goes to `#E5E7EB`; the teal accent (`#0E7490`) goes to a brighter `#2DD4BF`. The pass is intentionally non-invasive: it doesn't refactor the 100+ direct `_BRAND_*` constant references in the builders, it just rewrites RGBs after the fact.

The three passes ship with regression tests in `tests/test_exporters.py`: `test_pptx_default_is_dark_mode`, `test_pptx_dark_mode_has_no_invisible_runs` (no run is `rgb=None` or black), `test_pptx_dark_mode_no_light_text_on_light_fill` (no near-white text inside a near-white-filled callout), and `test_pptx_no_red_text_runs` (red `#C0392B` is banned for text).

## Why the design choices

| Choice | Reason |
|---|---|
| **Per-source plugins, not adapters** | A flaky upstream (Scholar layout change, IEEE token expiry) shouldn't break the whole pipeline. Plugins fail in isolation. |
| **Async I/O, sync exporters** | Network is parallelisable; rendering a `.pptx` is CPU-bound and finishes in milliseconds, so there is no win from making it async. |
| **One HTTPS-only client per source** | Shared connection pools + token bucket. Multiple clients per source would defeat both. |
| **Frozen dataclasses** | Trivially thread/coroutine-safe; "edits" create new instances via `dataclasses.replace`. |
| **Recorded fixtures only** | Tests run offline, deterministically, in <30 s. Live HTTP would make CI flaky and rate-limited. |
| **Two i18n tables (UI vs deck)** | Lets the UI ship with fewer translations than the deck if needed; today both cover all 14 languages, but the split keeps optionality. |
| **No global mutable state** | Singletons (HTTP clients, cache handle, rate-limit buckets) are encapsulated in module-level classes. Streamlit `st.session_state` is the only mutable per-session state, and it's per-session by design. |

## Performance notes

- The bottleneck for a typical search is **network latency**, not CPU. Async fan-out across sources brings a 10-source search down from `sum(latency)` to `max(latency)`.
- The `pptx` exporter is the single biggest CPU consumer: about 200 ms per paper for the thesis-style tier. Lightweight tier is 10× faster.
- Dedup is O(N) on the number of papers; with `--max 200` × 11 sources that's ~2200 papers max, and dedup still finishes in under 50 ms.
- The `[intelligence]` extra's Anthropic API call is the dominant cost when `--enrich` is on, typically 5–15 s per paper. The pipeline batches these with a per-source semaphore.

See `thesisagents/utils/profiling.py` for `with section("name"):` helpers if you're chasing a regression.
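The `sum(latency)` versus `max(latency)` claim is easy to check with a timing helper. The `section` context manager below is a guess at the general shape of such a helper, not the code in `thesisagents/utils/profiling.py`, and the simulated latencies are arbitrary.

```python
import asyncio
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def section(name: str):
    """Record wall-clock seconds spent inside the block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


async def fake_fetch(latency: float) -> None:
    await asyncio.sleep(latency)  # stands in for one source's round-trip


async def sequential(latencies: list[float]) -> None:
    for latency in latencies:
        await fake_fetch(latency)


async def concurrent(latencies: list[float]) -> None:
    await asyncio.gather(*(fake_fetch(latency) for latency in latencies))


LATENCIES = [0.05, 0.10, 0.15]  # arbitrary simulated per-source latencies

with section("sequential"):
    asyncio.run(sequential(LATENCIES))
with section("concurrent"):
    asyncio.run(concurrent(LATENCIES))

print({name: round(seconds, 2) for name, seconds in timings.items()})
```

The sequential timing should land near the sum of the latencies and the concurrent one near the largest single latency, plus a little event-loop overhead.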