Source plugin authoring guide
This guide covers adding a new academic source (your own institution’s repository, a vendor API, a regional preprint server) without touching the core engine. Plugins are how ThesisAgents stays extensible while keeping its dependency surface small.
When to write a plugin
Write a source plugin when any of the following applies:
The source needs a heavy or optional dependency (vendor SDK, Selenium for JS-rendered pages). Putting it in core forces every user to install it; putting it in a plugin makes it opt-in.
The source’s failure mode could break the rest of the pipeline. A flaky upstream should fail in isolation; aggregating other sources’ results should still succeed.
The source has independent release cadence. A Scholar HTML layout change should be patchable without re-shipping the engine.
If your source uses only httpx and returns clean JSON, it’s
arguably core material — but the plugin pattern is cheap, so going
through it is usually the right default.
File layout
sources/<your_name>/
├── __init__.py # exports fetcher_class
├── fetcher.py # the actual plugin
├── parser.py # raw payload → Paper
└── config.py # RateLimit + endpoint URLs
The directory sources/<your_name>/ must match the source name
the user will pass to --source <your_name>. Stick to lowercase,
underscores allowed (e.g. semantic_scholar).
The pipeline finds your plugin by injecting sources/ into
sys.path at startup. The injection is done by
thesisagents.app.source_manager for runtime and by
tests/conftest.py for the test suite — you don’t need to touch
either.
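To make the discovery mechanism concrete, here is a minimal sketch of what injecting sources/ into sys.path amounts to. The function name inject_sources_dir and the project_root argument are assumptions for illustration; the real logic lives in thesisagents.app.source_manager and may differ.

```python
import sys
from pathlib import Path


def inject_sources_dir(project_root: Path) -> Path:
    """Put <project_root>/sources on sys.path once (hypothetical helper)."""
    sources_dir = (project_root / "sources").resolve()
    entry = str(sources_dir)
    # Guard against duplicates so repeated startup calls stay idempotent.
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return sources_dir
```

Once the directory is on the path, each plugin package is importable by its bare name, which is why a test can write `from your_name.fetcher import YourFetcher` without a `sources.` prefix.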
Step-by-step
1. Pick a name and register it
Add your source name to thesisagents/core/constants.py:
PLUGIN_SOURCES: tuple[str, ...] = (
    # existing plugin sources
    "ieee",
    "springer",
    "scholar",
    # ⇣ your new source
    "your_name",
)
ALL_SOURCES = CORE_SOURCES + PLUGIN_SOURCES picks it up
automatically. If your plugin should be in the default search mix
(no API key, ToS-friendly), also add it to DEFAULT_SOURCES. If
it needs an opt-in env var, leave it out — the pipeline will skip
it silently when its Fetcher raises ConfigError at
construction.
2. Declare the rate limit + endpoint
sources/your_name/config.py:
from thesisagents.fetchers.rate_limit import RateLimit

ENDPOINT = "https://api.your-source.example/v1/search"

RATE_LIMIT = RateLimit(
    requests_per_second=2,  # match upstream's published ToS
    burst=4,                # how many can fire back-to-back
    jitter_seconds=0.2,     # random delay added per request
)

USER_AGENT = "ThesisAgents/0.1 (+https://github.com/Integration-Automation/ThesisAgents)"
Pick a conservative rate limit. The bucket is the only thing protecting you from getting your IP blocked; if your source publishes “10 req/s” you should run at 5 req/s with jitter to account for short bursts.
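To show how the three RateLimit parameters interact, the following is an illustrative token bucket. It is a sketch under assumptions, not the implementation in thesisagents.fetchers.rate_limit: the TokenBucket class, its reserve method, and the injectable clock are all hypothetical names.

```python
import random
import time


class TokenBucket:
    """Illustrative token bucket; NOT the real thesisagents RateLimit."""

    def __init__(self, requests_per_second, burst, jitter_seconds=0.0,
                 clock=time.monotonic):
        self.rate = float(requests_per_second)
        self.burst = float(burst)
        self.jitter_seconds = jitter_seconds
        self._clock = clock
        self._tokens = self.burst  # start full, so `burst` calls go out at once
        self._last = clock()

    def reserve(self):
        """Claim one request slot; return the seconds to sleep before sending."""
        now = self._clock()
        # Refill for the elapsed time, capped at the burst size.
        self._tokens = min(self.burst,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        self._tokens -= 1.0  # may go negative: that is queued debt
        wait = max(0.0, -self._tokens / self.rate)
        # Jitter spreads requests out so they do not land in lockstep.
        return wait + random.uniform(0.0, self.jitter_seconds)
```

With requests_per_second=2 and burst=4, the first four calls return no wait (plus jitter), and each later call is pushed back by another half second. A caller would sleep for the returned value before issuing its request.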
3. Write the parser
sources/your_name/parser.py:
from __future__ import annotations

from typing import Any

from thesisagents.core.models import Paper


def parse_search_payload(payload: dict[str, Any]) -> list[Paper]:
    """Convert the source's raw search response into a list of Paper."""
    return [_parse_one(entry) for entry in payload.get("results", [])]


def _parse_one(entry: dict[str, Any]) -> Paper:
    return Paper(
        source="your_name",
        source_id=str(entry["id"]),
        title=entry["title"].strip(),
        authors=tuple(a["name"] for a in entry.get("authors", [])),
        year=entry.get("year"),
        venue=entry.get("venue"),
        abstract=entry.get("abstract", "") or "",
        url=entry["landing_page_url"],
        doi=entry.get("doi"),
        arxiv_id=entry.get("arxiv_id"),
        pdf_url=_pick_pdf_url(entry),
        citation_count=entry.get("citation_count"),
        raw=entry,
    )


def _pick_pdf_url(entry: dict[str, Any]) -> str | None:
    """Return the publicly-fetchable PDF URL or None."""
    for link in entry.get("links", []):
        # .get() so a link entry without a "url" key is skipped, not a KeyError
        url = link.get("url") or ""
        if link.get("type") == "application/pdf" and url.startswith("https://"):
            return url
    return None
Field rules:
Always populate
source, source_id, title, authors, abstract, url. The pipeline will reject papers missing any required field.
Strip versioning from arxiv_id (2401.08741v2 → 2401.08741).
Strip URL prefixes from doi (https://doi.org/10.x/y → 10.x/y).
HTTPS only for pdf_url. The downloader refuses non-HTTPS.
Keep the raw payload in raw so the LLM-as-agent flow and debug logging can use it.
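The arxiv_id and doi rules can be handled with two small helpers called from the parser. This is a sketch: the names normalize_arxiv_id and normalize_doi are hypothetical, and the set of DOI prefixes covered here (doi.org, dx.doi.org, and a leading "doi:") is an assumption about what upstream sources emit.

```python
from __future__ import annotations

import re

# Trailing version marker such as "v2" at the end of an arXiv identifier.
_ARXIV_VERSION = re.compile(r"v\d+$")
# Resolver URL or "doi:" label in front of the bare DOI (assumed variants).
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_arxiv_id(value: str | None) -> str | None:
    """Drop the version suffix: '2401.08741v2' becomes '2401.08741'."""
    if not value:
        return None
    return _ARXIV_VERSION.sub("", value.strip()) or None


def normalize_doi(value: str | None) -> str | None:
    """Drop the resolver prefix: 'https://doi.org/10.x/y' becomes '10.x/y'."""
    if not value:
        return None
    return _DOI_PREFIX.sub("", value.strip()) or None
```

In the parser these would wrap the raw lookups, for example `doi=normalize_doi(entry.get("doi"))`.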
4. Write the fetcher
sources/your_name/fetcher.py:
from __future__ import annotations

import os
from typing import Any

from thesisagents.core.exceptions import (
    ConfigError, ParseError, RateLimitError, SourceUnavailableError,
)
from thesisagents.core.models import Paper, Query
from thesisagents.fetchers.base import Fetcher
from thesisagents.fetchers.http import get_client

from .config import ENDPOINT, RATE_LIMIT, USER_AGENT
from .parser import parse_search_payload


class YourFetcher(Fetcher):
    """Plugin for the YourSource search API."""

    name = "your_name"
    rate_limit = RATE_LIMIT
    user_agent = USER_AGENT

    def __init__(self) -> None:
        # OPTIONAL: enforce env-var presence here. The pipeline will
        # catch ConfigError and silently skip your plugin.
        self._api_key = os.environ.get("THESISAGENTS_YOUR_NAME_API_KEY")
        # `not` also rejects an empty string, which is never a usable key
        if not self._api_key:
            raise ConfigError(
                "THESISAGENTS_YOUR_NAME_API_KEY not set; YourSource plugin disabled"
            )

    async def fetch(self, query: Query) -> list[Paper]:
        client = get_client("your_name")
        try:
            response = await client.get(
                ENDPOINT,
                params={
                    "q": query.keywords,
                    "limit": query.max_results,
                    "year_from": query.year_from,
                    "year_to": query.year_to,
                },
                headers={"X-API-Key": self._api_key},
            )
            response.raise_for_status()
        except RateLimitError:
            # Let rate-limit signals through untouched
            raise
        except Exception as err:
            raise SourceUnavailableError(f"YourSource unreachable: {err}") from err
        try:
            payload: dict[str, Any] = response.json()
        except ValueError as err:
            raise ParseError(f"YourSource returned non-JSON: {err}") from err
        return parse_search_payload(payload)
get_client("your_name") is the only legal way to hit the network.
It returns the per-source HTTPS-only httpx.AsyncClient with your
plugin’s User-Agent + rate-limit decorator + retry policy already
applied. Do not construct your own httpx.AsyncClient or call
httpx.get / requests.get directly.
5. Wire up the registration
sources/your_name/__init__.py:
from .fetcher import YourFetcher
fetcher_class = YourFetcher
__all__ = ["fetcher_class"]
The pipeline’s plugin loader reads fetcher_class from each
source’s __init__.py and instantiates it. The attribute name is
fixed; the class name is free.
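The loader contract described above, together with the soft-skip on ConfigError from step 1, can be sketched roughly as follows. This is an assumption about the shape of the logic, not the actual loader: load_fetchers is a hypothetical name, ConfigError here is a local stand-in for thesisagents.core.exceptions.ConfigError, and the two demo modules are fabricated in memory purely so the sketch runs on its own.

```python
import importlib
import sys
import types


class ConfigError(Exception):
    """Stand-in for thesisagents.core.exceptions.ConfigError."""


def load_fetchers(source_names):
    """Import each plugin package and instantiate its fetcher_class."""
    fetchers = {}
    for name in source_names:
        module = importlib.import_module(name)
        fetcher_class = getattr(module, "fetcher_class")  # fixed attribute name
        try:
            fetchers[name] = fetcher_class()
        except ConfigError:
            # Missing opt-in configuration: skip this source, keep the rest.
            continue
    return fetchers


# --- demo with two in-memory plugin modules (hypothetical names) ---
class DemoFetcher:
    pass


class NeedsKeyFetcher:
    def __init__(self):
        raise ConfigError("API key not set")


for mod_name, cls in (("demo_ok_src", DemoFetcher), ("demo_nokey_src", NeedsKeyFetcher)):
    module = types.ModuleType(mod_name)
    module.fetcher_class = cls
    sys.modules[mod_name] = module

loaded = load_fetchers(["demo_ok_src", "demo_nokey_src"])
print(sorted(loaded))
```

Only the plugin whose constructor succeeds ends up in the result, which is the behaviour a plugin relies on when it raises ConfigError for a missing key.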
6. Record a fixture and write the test
Tests are hermetic — no live HTTP. Every fetcher test uses a recorded fixture loaded via a monkey-patched HTTP transport.
Record one fixture per test scenario:
python scripts/record_fixture.py --source your_name \
--query "transformer attention" --max 5
This writes tests/fixtures/your_name/transformer-attention.json
(or .html / .xml for non-JSON sources). The recording script
strips any user-specific tokens from the request before saving.
Then write the test:
# tests/sources/your_name/test_your_name.py
import json
from pathlib import Path

import pytest

from thesisagents.core.models import Query
from your_name.fetcher import YourFetcher


@pytest.fixture()
def transformer_fixture():
    # tests/sources/your_name/ -> tests/fixtures/your_name/
    p = Path(__file__).parent.parent.parent / "fixtures" / "your_name" / "transformer-attention.json"
    return json.loads(p.read_text(encoding="utf-8"))


# fetch() is a coroutine, so the test itself must be async. The marker below
# assumes pytest-asyncio; drop it if the project runs async tests automatically.
@pytest.mark.asyncio
async def test_fetcher_parses_transformer_results(http_recorder, transformer_fixture, monkeypatch):
    monkeypatch.setenv("THESISAGENTS_YOUR_NAME_API_KEY", "test-key")
    http_recorder.add_response(
        url="https://api.your-source.example/v1/search",
        params={"q": "transformer attention", "limit": 5},
        json=transformer_fixture,
    )
    fetcher = YourFetcher()
    papers = await fetcher.fetch(
        Query(keywords="transformer attention", sources=("your_name",), max_results=5)
    )
    assert len(papers) > 0
    assert papers[0].source == "your_name"
    assert papers[0].title  # always present
    assert papers[0].url.startswith("https://")
Add tests for:
Happy path (above) — recorded fixture parses cleanly.
Empty result set — fixture with zero entries returns [].
Missing optional fields — entries with no DOI / no abstract / no year still parse without raising.
Malformed JSON — ParseError raised on broken response.
HTTP 429 — RateLimitError raised.
HTTP 500 — SourceUnavailableError raised.
Unicode — title / authors in CJK / Cyrillic / Devanagari parse cleanly.
No API key (if your plugin needs one) — ConfigError raised at __init__.
The http_recorder fixture is defined in tests/conftest.py.
7. Verify
Run the full chain:
# Unit tests for your plugin
python -m pytest tests/sources/your_name/
# Integration: plugin shows up in the source list
python -c "from thesisagents.app.source_manager import list_sources; \
print('your_name' in [s.name for s in list_sources()])"
# Live smoke (only if you have credentials):
THESISAGENTS_YOUR_NAME_API_KEY=... \
thesisagents --query "diffusion models" --source your_name --max 5 \
--out ./smoke/your_name/
# Lint + security
python -m ruff check sources/your_name/
python -m bandit -c pyproject.toml -r sources/your_name/
All of these checks must pass before commit (the project’s Definition of Done); the live smoke test applies only if you have credentials.
8. Update docs
Add your source to the table in Configuration with its rate limit + any required env var.
Add your source to the “Available source plugins” table in CLI.
Document any caveats (e.g. “results are limited to titles + abstracts; full text not available via the API”).
Common pitfalls
Constructing your own httpx.AsyncClient
Don’t. Use get_client(your_source_name). It applies:
HTTPS-only enforcement (refuses plain HTTP, even after redirect).
Your declared rate-limit token bucket.
Exponential backoff on 429 / 5xx (which also goes through the bucket).
A per-source User-Agent that respects upstream attribution rules.
Constructing your own bypasses all four; you’ll get IP-blocked within a day.
Hardcoding an API key
Don’t. Load from os.environ.get("THESISAGENTS_..._API_KEY")
and document the variable in Configuration.
The GUI’s Settings page picks up any variable matching the
THESISAGENTS_..._API_KEY pattern automatically when extended.
Returning records without source_id
The dedup pass uses source_id as a fallback when DOI / arXiv ID
are missing. Without it, every record from your source is treated
as a separate paper even when titles match.
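The fallback order can be pictured with a small sketch. The Record dataclass, dedup_key, and dedupe are hypothetical stand-ins, and the exact precedence (DOI, then arXiv ID, then source plus source_id) is an assumption drawn from the paragraph above rather than from the pipeline's code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Stand-in for the handful of Paper fields that matter for dedup."""
    source: str
    source_id: str | None
    title: str
    doi: str | None = None
    arxiv_id: str | None = None


def dedup_key(rec: Record) -> tuple[str, str] | None:
    """Strongest available identifier, or None if the record has none."""
    if rec.doi:
        return ("doi", rec.doi.lower())
    if rec.arxiv_id:
        return ("arxiv", rec.arxiv_id)
    if rec.source_id:
        return ("source", f"{rec.source}:{rec.source_id}")
    return None


def dedupe(records: list[Record]) -> list[Record]:
    """Keep the first record per key; keyless records can never be merged."""
    seen: set[tuple[str, str]] = set()
    unique: list[Record] = []
    for rec in records:
        key = dedup_key(rec)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(rec)
    return unique
```

Two records with the same source and source_id collapse into one, while two records with identical titles but no identifier at all both survive.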
Returning HTML rendered into abstract
The exporter expects plain text. If your source returns HTML,
strip it with beautifulsoup4:
from bs4 import BeautifulSoup
raw_html = entry.get("abstract_html", "")
abstract = BeautifulSoup(raw_html, "lxml").get_text(separator=" ").strip()
Forgetting to set pdf_url=None when the link isn’t public
A paywalled PDF link with https:// will pass the HTTPS check but
return 403 at download time. Better to leave pdf_url=None so the
paywall gate triggers correctly.
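One way to apply this in a parser is to require an explicit open-access signal before accepting a link. The sketch below assumes the upstream payload marks public links with an "open_access" boolean; that field name is hypothetical, so substitute whatever your source actually provides.

```python
from __future__ import annotations

from typing import Any


def pick_public_pdf_url(entry: dict[str, Any]) -> str | None:
    """Return an HTTPS PDF link only when upstream marks it as public."""
    for link in entry.get("links", []):
        url = link.get("url") or ""
        if link.get("type") != "application/pdf":
            continue
        if not url.startswith("https://"):
            continue
        # "open_access" is a hypothetical field name, assumed for illustration.
        if not link.get("open_access", False):
            continue
        return url
    # No public link found: leave pdf_url unset rather than guess.
    return None
```

Defaulting to None when the flag is absent errs on the side of treating a link as paywalled.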
Using xml.etree on untrusted input
For sources that return XML (PubMed, arXiv Atom feed, …), use
defusedxml not xml.etree. The bandit rule B405 will flag the
unsafe usage at lint time.
Putting source-specific HTML selectors in core
If your plugin needs to parse HTML with bs4 selectors, those
selectors live in sources/your_name/parser.py. They never go
under thesisagents/core/.
When a plugin should be promoted to core
You’ll know it’s time when:
Every user wants the plugin enabled (no opt-in env var).
The plugin uses only the core dep set.
The upstream has stable rate limits and a stable contract.
The plugin has had no breaking changes in 6+ months.
To promote: move the code into thesisagents/core/<source>/,
update the import paths in core_manager, and remove the entry
from PLUGIN_SOURCES in core/constants.py. The user-visible
interface (the --source <name> flag) doesn’t change.
This has happened exactly zero times to date — the plugin pattern turns out to be the right home for most sources permanently.
Worked examples in-tree
sources/arxiv/ — JSON / Atom hybrid, no API key, low rate limit. Good starting point for a simple read-only source.
sources/pubmed/ — XML response, optional API key, two-step flow (search → fetch full record). Good example of multi-call patterns.
sources/ieee/ — dual API + scrape paths gated by different env vars. Good example of optional dependency handling.
sources/scholar/ — pure HTML scrape with Selenium fallback, ToS-opt-in. Good example of when not to do this — the code is fragile by necessity.
sources/springer/ — ConfigError at construction when key is missing. Good example of soft-skip integration.
sources/europepmc/ — open REST API, no key, JSON. Clean example of the over-fetch-then-year-post-filter pattern + structured-vs-flat author fallback.
sources/doaj/ — open API where the query rides in the URL path (percent-encoded), not a query parameter. Good example of a non-standard endpoint shape.
sources/hal/ — Solr-backed API whose fields are arrays even when single-valued. Good example of defensive array-or-scalar unwrapping.
sources/core/ — opt-in via THESISAGENTS_CORE_API_KEY passed as a Bearer header. Good example of header-based auth + soft-skip.