# Source plugin authoring guide Adding a new academic source — your own institution's repository, a vendor API, a regional preprint server — without touching the core engine. Plugins are how ThesisAgents stays extensible while keeping its dependency surface small. ## When to write a plugin Write a source plugin when ANY of: 1. **The source needs a heavy or optional dependency** (vendor SDK, Selenium for JS-rendered pages). Putting it in core forces every user to install it; putting it in a plugin makes it opt-in. 2. **The source's failure mode could break the rest of the pipeline.** A flaky upstream should fail in isolation; aggregating other sources' results should still succeed. 3. **The source has independent release cadence.** A Scholar HTML layout change should be patchable without re-shipping the engine. If your source uses only `httpx` and returns clean JSON, it's arguably core material — but the plugin pattern is cheap, so going through it is usually the right default. ## File layout ``` sources// ├── __init__.py # exports fetcher_class ├── fetcher.py # the actual plugin ├── parser.py # raw payload → Paper └── config.py # RateLimit + endpoint URLs ``` The directory `sources//` must match the **source name** the user will pass to `--source `. Stick to lowercase, underscores allowed (e.g. `semantic_scholar`). The pipeline finds your plugin by injecting `sources/` into `sys.path` at startup. The injection is done by `thesisagents.app.source_manager` for runtime and by `tests/conftest.py` for the test suite — you don't need to touch either. ## Step-by-step ### 1. Pick a name and register it Add your source name to `thesisagents/core/constants.py`: ```python PLUGIN_SOURCES: tuple[str, ...] = ( # existing plugin sources "ieee", "springer", "scholar", # ⇣ your new source "your_name", ) ``` `ALL_SOURCES = CORE_SOURCES + PLUGIN_SOURCES` picks it up automatically. If your plugin should be in the default search mix (no API key, ToS-friendly), also add it to `DEFAULT_SOURCES`. If it needs an opt-in env var, leave it out — the pipeline will skip it silently when its `Fetcher` raises `ConfigError` at construction. ### 2. Declare the rate limit + endpoint `sources/your_name/config.py`: ```python from thesisagents.fetchers.rate_limit import RateLimit ENDPOINT = "https://api.your-source.example/v1/search" RATE_LIMIT = RateLimit( requests_per_second=2, # match upstream's published ToS burst=4, # how many can fire back-to-back jitter_seconds=0.2, # random delay added per request ) USER_AGENT = "ThesisAgents/0.1 (+https://github.com/Integration-Automation/ThesisAgents)" ``` Pick a conservative rate limit. The bucket is the only thing protecting you from getting your IP blocked; if your source publishes "10 req/s" you should run at 5 req/s with jitter to account for short bursts. ### 3. Write the parser `sources/your_name/parser.py`: ```python from __future__ import annotations from typing import Any from thesisagents.core.models import Paper def parse_search_payload(payload: dict[str, Any]) -> list[Paper]: """Convert the source's raw search response into a list of Paper.""" return [_parse_one(entry) for entry in payload.get("results", [])] def _parse_one(entry: dict[str, Any]) -> Paper: return Paper( source="your_name", source_id=str(entry["id"]), title=entry["title"].strip(), authors=tuple(a["name"] for a in entry.get("authors", [])), year=entry.get("year"), venue=entry.get("venue"), abstract=entry.get("abstract", "") or "", url=entry["landing_page_url"], doi=entry.get("doi"), arxiv_id=entry.get("arxiv_id"), pdf_url=_pick_pdf_url(entry), citation_count=entry.get("citation_count"), raw=entry, ) def _pick_pdf_url(entry: dict[str, Any]) -> str | None: """Return the publicly-fetchable PDF URL or None.""" for link in entry.get("links", []): if link.get("type") == "application/pdf" and link["url"].startswith("https://"): return link["url"] return None ``` Field rules: - **Always** populate `source`, `source_id`, `title`, `authors`, `abstract`, `url`. The pipeline will reject papers missing any required field. - **Strip versioning** from `arxiv_id` (`2401.08741v2 → 2401.08741`). - **Strip URL prefixes** from `doi` (`https://doi.org/10.x/y → 10.x/y`). - **HTTPS only** for `pdf_url`. The downloader refuses non-HTTPS. - Keep the raw payload in `raw` so the LLM-as-agent flow and debug logging can use it. ### 4. Write the fetcher `sources/your_name/fetcher.py`: ```python from __future__ import annotations import os from typing import Any from thesisagents.core.exceptions import ( ConfigError, ParseError, RateLimitError, SourceUnavailableError, ) from thesisagents.core.models import Paper, Query from thesisagents.fetchers.base import Fetcher from thesisagents.fetchers.http import get_client from .config import ENDPOINT, RATE_LIMIT, USER_AGENT from .parser import parse_search_payload class YourFetcher(Fetcher): """Plugin for the YourSource search API.""" name = "your_name" rate_limit = RATE_LIMIT user_agent = USER_AGENT def __init__(self) -> None: # OPTIONAL: enforce env-var presence here. The pipeline will # catch ConfigError and silently skip your plugin. self._api_key = os.environ.get("THESISAGENTS_YOUR_NAME_API_KEY") if self._api_key is None: raise ConfigError( "THESISAGENTS_YOUR_NAME_API_KEY not set; YourSource plugin disabled" ) async def fetch(self, query: Query) -> list[Paper]: client = get_client("your_name") try: response = await client.get( ENDPOINT, params={ "q": query.keywords, "limit": query.max_results, "year_from": query.year_from, "year_to": query.year_to, }, headers={"X-API-Key": self._api_key}, ) response.raise_for_status() except RateLimitError: raise except Exception as err: raise SourceUnavailableError(f"YourSource unreachable: {err}") from err try: payload: dict[str, Any] = response.json() except ValueError as err: raise ParseError(f"YourSource returned non-JSON: {err}") from err return parse_search_payload(payload) ``` `get_client("your_name")` is the only legal way to hit the network. It returns the per-source HTTPS-only `httpx.AsyncClient` with your plugin's User-Agent + rate-limit decorator + retry policy already applied. **Do not construct your own `httpx.AsyncClient` or call `httpx.get` / `requests.get` directly.** ### 5. Wire up the registration `sources/your_name/__init__.py`: ```python from .fetcher import YourFetcher fetcher_class = YourFetcher __all__ = ["fetcher_class"] ``` The pipeline's plugin loader reads `fetcher_class` from each source's `__init__.py` and instantiates it. The attribute name is fixed; the class name is free. ### 6. Record a fixture and write the test Tests are **hermetic** — no live HTTP. Every fetcher test uses a recorded fixture loaded via a monkey-patched HTTP transport. Record one fixture per test scenario: ```bash python scripts/record_fixture.py --source your_name \ --query "transformer attention" --max 5 ``` This writes `tests/fixtures/your_name/transformer-attention.json` (or `.html` / `.xml` for non-JSON sources). The recording script strips any user-specific tokens from the request before saving. Then write the test: ```python # tests/sources/your_name/test_your_name.py import json from pathlib import Path import pytest from thesisagents.core.models import Query from your_name.fetcher import YourFetcher @pytest.fixture() def transformer_fixture(): p = Path(__file__).parent.parent.parent / "fixtures" / "your_name" / "transformer-attention.json" return json.loads(p.read_text(encoding="utf-8")) def test_fetcher_parses_transformer_results(http_recorder, transformer_fixture, monkeypatch): monkeypatch.setenv("THESISAGENTS_YOUR_NAME_API_KEY", "test-key") http_recorder.add_response( url="https://api.your-source.example/v1/search", params={"q": "transformer attention", "limit": 5}, json=transformer_fixture, ) fetcher = YourFetcher() papers = await fetcher.fetch( Query(keywords="transformer attention", sources=("your_name",), max_results=5) ) assert len(papers) > 0 assert papers[0].source == "your_name" assert papers[0].title # always present assert papers[0].url.startswith("https://") ``` Add tests for: - **Happy path** (above) — recorded fixture parses cleanly. - **Empty result set** — fixture with zero entries returns `[]`. - **Missing optional fields** — entries with no DOI / no abstract / no year still parse without raising. - **Malformed JSON** — `ParseError` raised on broken response. - **HTTP 429** — `RateLimitError` raised. - **HTTP 500** — `SourceUnavailableError` raised. - **Unicode** — title / authors in CJK / Cyrillic / Devanagari parse cleanly. - **No API key** (if your plugin needs one) — `ConfigError` raised at `__init__`. The `http_recorder` fixture is defined in `tests/conftest.py`. ### 7. Verify Run the full chain: ```bash # Unit tests for your plugin python -m pytest tests/sources/your_name/ # Integration: plugin shows up in the source list python -c "from thesisagents.app.source_manager import list_sources; \ print('your_name' in [s.name for s in list_sources()])" # Live smoke (only if you have credentials): THESISAGENTS_YOUR_NAME_API_KEY=... \ thesisagents --query "diffusion models" --source your_name --max 5 \ --out ./smoke/your_name/ # Lint + security python -m ruff check sources/your_name/ python -m bandit -c pyproject.toml -r sources/your_name/ ``` All three must pass before commit (the project's Definition of Done). ### 8. Update docs - Add your source to the table in [Configuration](configuration.md) with its rate limit + any required env var. - Add your source to the "Available source plugins" table in [CLI](cli.md). - Document any caveats (e.g. "results are limited to titles + abstracts; full text not available via the API"). ## Common pitfalls ### Constructing your own `httpx.AsyncClient` **Don't.** Use `get_client(your_source_name)`. It applies: - HTTPS-only enforcement (refuses plain HTTP, even after redirect). - Your declared rate-limit token bucket. - Exponential backoff on 429 / 5xx (which also goes through the bucket). - A per-source User-Agent that respects upstream attribution rules. Constructing your own bypasses all four; you'll get IP-blocked within a day. ### Hardcoding an API key **Don't.** Load from `os.environ.get("THESISAGENTS_..._API_KEY")` and document the variable in [Configuration](configuration.md). The GUI's Settings page picks up any variable matching the `THESISAGENTS_..._API_KEY` pattern automatically when extended. ### Returning records without `source_id` The dedup pass uses `source_id` as a fallback when DOI / arXiv ID are missing. Without it, every record from your source is treated as a separate paper even when titles match. ### Returning HTML rendered into `abstract` The exporter expects plain text. If your source returns HTML, strip it with `beautifulsoup4`: ```python from bs4 import BeautifulSoup raw_html = entry.get("abstract_html", "") abstract = BeautifulSoup(raw_html, "lxml").get_text(separator=" ").strip() ``` ### Forgetting to set `pdf_url=None` when the link isn't public A paywalled PDF link with `https://` will pass the HTTPS check but return 403 at download time. Better to leave `pdf_url=None` so the paywall gate triggers correctly. ### Using `xml.etree` on untrusted input For sources that return XML (PubMed, arXiv Atom feed, ...), use `defusedxml` not `xml.etree`. The bandit rule `B405` will flag the unsafe usage at lint time. ### Putting source-specific HTML selectors in core If your plugin needs to parse HTML with `bs4` selectors, those selectors live in `sources/your_name/parser.py`. They never go under `thesisagents/core/`. ## When a plugin should be promoted to core You'll know it's time when: - Every user wants the plugin enabled (no opt-in env var). - The plugin uses only the core dep set. - The upstream has stable rate limits and a stable contract. - The plugin has had no breaking changes in 6+ months. To promote: move the code into `thesisagents/core//`, update the import paths in `core_manager`, and remove the entry from `PLUGIN_SOURCES` in `core/constants.py`. The user-visible interface (the `--source ` flag) doesn't change. This has happened exactly zero times to date — the plugin pattern turns out to be the right home for most sources permanently. ## Worked examples in-tree - `sources/arxiv/` — JSON / Atom hybrid, no API key, low rate limit. Good starting point for a simple read-only source. - `sources/pubmed/` — XML response, optional API key, two-step flow (search → fetch full record). Good example of multi-call patterns. - `sources/ieee/` — dual API + scrape paths gated by different env vars. Good example of optional dependency handling. - `sources/scholar/` — pure HTML scrape with Selenium fallback, ToS-opt-in. Good example of when not to do this — the code is fragile by necessity. - `sources/springer/` — `ConfigError` at construction when key is missing. Good example of soft-skip integration. - `sources/europepmc/` — open REST API, no key, JSON. Clean example of the over-fetch-then-year-post-filter pattern + structured-vs-flat author fallback. - `sources/doaj/` — open API where the **query rides in the URL path** (percent-encoded), not a query parameter. Good example of a non-standard endpoint shape. - `sources/hal/` — Solr-backed API whose fields are arrays even when single-valued. Good example of defensive array-or-scalar unwrapping. - `sources/core/` — opt-in via `THESISAGENTS_CORE_API_KEY` passed as a Bearer header. Good example of header-based auth + soft-skip.