Source plugin authoring guide

This guide covers adding a new academic source (your own institution’s repository, a vendor API, a regional preprint server) without touching the core engine. Plugins are how ThesisAgents stays extensible while keeping its dependency surface small.

When to write a plugin

Write a source plugin when ANY of the following holds:

The source needs a heavy or optional dependency (vendor SDK, Selenium for JS-rendered pages). Putting it in core forces every user to install it; putting it in a plugin makes it opt-in.
The source’s failure mode could break the rest of the pipeline. A flaky upstream should fail in isolation; aggregating other sources’ results should still succeed.
The source has an independent release cadence. A Scholar HTML layout change should be patchable without re-shipping the engine.

If your source uses only httpx and returns clean JSON, it’s arguably core material — but the plugin pattern is cheap, so going through it is usually the right default.

File layout

sources/<your_name>/
├── __init__.py        # exports fetcher_class
├── fetcher.py         # the actual plugin
├── parser.py          # raw payload → Paper
└── config.py          # RateLimit + endpoint URLs

The directory sources/<your_name>/ must match the source name the user will pass to --source <your_name>. Stick to lowercase, underscores allowed (e.g. semantic_scholar).

The pipeline finds your plugin by injecting sources/ into sys.path at startup. The injection is done by thesisagents.app.source_manager for runtime and by tests/conftest.py for the test suite — you don’t need to touch either.
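
The path injection described above can be sketched in a few lines. This is an illustration only, not the code in thesisagents.app.source_manager; inject_sources_dir and import_plugin are hypothetical names.

```python
import importlib
import sys
from pathlib import Path
from types import ModuleType


def inject_sources_dir(sources_dir: Path) -> None:
    """Put the plugin root at the front of sys.path (safe to call twice)."""
    entry = str(sources_dir.resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)


def import_plugin(name: str) -> ModuleType:
    """Import sources/<name>/ as the top-level package `name`."""
    return importlib.import_module(name)
```

Because sources/ itself sits on sys.path, each plugin directory becomes a top-level package, which is why the test in step 6 can write `from your_name.fetcher import YourFetcher`.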

Step-by-step

1. Pick a name and register it

Add your source name to thesisagents/core/constants.py:

PLUGIN_SOURCES: tuple[str, ...] = (
    # existing plugin sources
    "ieee",
    "springer",
    "scholar",
    # ⇣ your new source
    "your_name",
)

ALL_SOURCES = CORE_SOURCES + PLUGIN_SOURCES picks it up automatically. If your plugin should be in the default search mix (no API key, ToS-friendly), also add it to DEFAULT_SOURCES. If it needs an opt-in env var, leave it out — the pipeline will skip it silently when its Fetcher raises ConfigError at construction.

2. Declare the rate limit + endpoint

sources/your_name/config.py:

from thesisagents.fetchers.rate_limit import RateLimit

ENDPOINT = "https://api.your-source.example/v1/search"

RATE_LIMIT = RateLimit(
    requests_per_second=2,    # stay at or below upstream's published limit
    burst=4,                   # how many can fire back-to-back
    jitter_seconds=0.2,        # random delay added per request
)

USER_AGENT = "ThesisAgents/0.1 (+https://github.com/Integration-Automation/ThesisAgents)"

Pick a conservative rate limit. The bucket is the only thing protecting you from getting your IP blocked; if your source publishes “10 req/s” you should run at 5 req/s with jitter to account for short bursts.
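
To see how the three RateLimit fields interact, here is a minimal token-bucket sketch. It is not the implementation in thesisagents.fetchers.rate_limit. The TokenBucket class, its reserve method, and the injectable clock are assumptions made for illustration.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TokenBucket:
    """Illustrative token bucket; not the ThesisAgents implementation."""

    requests_per_second: float
    burst: int
    jitter_seconds: float = 0.0
    clock: Callable[[], float] = time.monotonic
    _tokens: float = field(init=False)
    _last: float = field(init=False)

    def __post_init__(self) -> None:
        self._tokens = float(self.burst)  # start full: `burst` calls may fire at once
        self._last = self.clock()

    def reserve(self) -> float:
        """Take one token and return how long the caller should sleep first."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        refilled = self._tokens + (now - self._last) * self.requests_per_second
        self._tokens = min(float(self.burst), refilled)
        self._last = now
        self._tokens -= 1.0
        # A negative balance is a debt that has to be waited out.
        wait = max(0.0, -self._tokens / self.requests_per_second)
        return wait + random.uniform(0.0, self.jitter_seconds)
```

With requests_per_second=2 and burst=4, the first four calls return a zero wait and the fifth returns 0.5 s. The jitter is added on top so that retries from several processes do not line up.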

3. Write the parser

sources/your_name/parser.py:

from __future__ import annotations

from typing import Any

from thesisagents.core.models import Paper


def parse_search_payload(payload: dict[str, Any]) -> list[Paper]:
    """Convert the source's raw search response into a list of Paper."""
    return [_parse_one(entry) for entry in payload.get("results", [])]


def _parse_one(entry: dict[str, Any]) -> Paper:
    return Paper(
        source="your_name",
        source_id=str(entry["id"]),
        title=entry["title"].strip(),
        authors=tuple(a["name"] for a in entry.get("authors", [])),
        year=entry.get("year"),
        venue=entry.get("venue"),
        abstract=entry.get("abstract", "") or "",
        url=entry["landing_page_url"],
        doi=entry.get("doi"),
        arxiv_id=entry.get("arxiv_id"),
        pdf_url=_pick_pdf_url(entry),
        citation_count=entry.get("citation_count"),
        raw=entry,
    )


def _pick_pdf_url(entry: dict[str, Any]) -> str | None:
    """Return the publicly-fetchable PDF URL or None."""
    for link in entry.get("links", []):
        url = link.get("url") or ""  # tolerate links with no "url" key
        if link.get("type") == "application/pdf" and url.startswith("https://"):
            return url
    return None

Field rules:

Always populate source, source_id, title, authors, abstract, url. The pipeline will reject papers missing any required field.
Strip versioning from arxiv_id (2401.08741v2 → 2401.08741).
Strip URL prefixes from doi (https://doi.org/10.x/y → 10.x/y).
HTTPS only for pdf_url. The downloader refuses non-HTTPS.
Keep the raw payload in raw so the LLM-as-agent flow and debug logging can use it.
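
The parser in this step passes doi and arxiv_id through unchanged, so the two stripping rules need a helper somewhere. A minimal sketch follows. normalize_arxiv_id and normalize_doi are hypothetical names, and the accepted prefixes (doi.org, dx.doi.org, doi:) are an assumption about what upstreams typically send.

```python
from __future__ import annotations

import re

_ARXIV_VERSION = re.compile(r"v\d+$")
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_arxiv_id(value: str | None) -> str | None:
    """Drop a trailing version suffix, e.g. 2401.08741v2 becomes 2401.08741."""
    value = (value or "").strip()
    if not value:
        return None
    return _ARXIV_VERSION.sub("", value)


def normalize_doi(value: str | None) -> str | None:
    """Drop a resolver URL or doi: prefix, leaving the bare 10.x/y form."""
    value = (value or "").strip()
    if not value:
        return None
    return _DOI_PREFIX.sub("", value)
```

In _parse_one these would wrap the lookups, for example doi=normalize_doi(entry.get("doi")).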

4. Write the fetcher

sources/your_name/fetcher.py:

from __future__ import annotations

import os
from typing import Any

from thesisagents.core.exceptions import (
    ConfigError, ParseError, RateLimitError, SourceUnavailableError,
)
from thesisagents.core.models import Paper, Query
from thesisagents.fetchers.base import Fetcher
from thesisagents.fetchers.http import get_client

from .config import ENDPOINT, RATE_LIMIT, USER_AGENT
from .parser import parse_search_payload


class YourFetcher(Fetcher):
    """Plugin for the YourSource search API."""

    name = "your_name"
    rate_limit = RATE_LIMIT
    user_agent = USER_AGENT

    def __init__(self) -> None:
        # OPTIONAL: enforce env-var presence here. The pipeline will
        # catch ConfigError and silently skip your plugin.
        self._api_key = os.environ.get("THESISAGENTS_YOUR_NAME_API_KEY")
        if not self._api_key:  # unset or empty
            raise ConfigError(
                "THESISAGENTS_YOUR_NAME_API_KEY not set; YourSource plugin disabled"
            )

    async def fetch(self, query: Query) -> list[Paper]:
        client = get_client("your_name")
        try:
            response = await client.get(
                ENDPOINT,
                params={
                    "q": query.keywords,
                    "limit": query.max_results,
                    "year_from": query.year_from,
                    "year_to": query.year_to,
                },
                headers={"X-API-Key": self._api_key},
            )
            response.raise_for_status()
        except RateLimitError:
            raise
        except Exception as err:
            raise SourceUnavailableError(f"YourSource unreachable: {err}") from err

        try:
            payload: dict[str, Any] = response.json()
        except ValueError as err:
            raise ParseError(f"YourSource returned non-JSON: {err}") from err

        return parse_search_payload(payload)

get_client("your_name") is the only legal way to hit the network. It returns the per-source HTTPS-only httpx.AsyncClient with your plugin’s User-Agent + rate-limit decorator + retry policy already applied. Do not construct your own httpx.AsyncClient or call httpx.get / requests.get directly.
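
One detail in fetch() deserves care: when the query has no year filter, year_from and year_to are presumably None (an assumption about Query's defaults), and some upstreams reject or misread empty filter values. A small sketch of omitting unset filters before the request is shown below. build_params is a hypothetical helper, not part of the Fetcher base class.

```python
from __future__ import annotations

from typing import Any


def build_params(
    keywords: str,
    max_results: int,
    year_from: int | None = None,
    year_to: int | None = None,
) -> dict[str, Any]:
    """Build the query string, leaving out filters the user did not set."""
    params = {
        "q": keywords,
        "limit": max_results,
        "year_from": year_from,
        "year_to": year_to,
    }
    # Drop None values so the upstream never sees an empty filter.
    return {key: value for key, value in params.items() if value is not None}
```

With no year filter this yields exactly {"q": ..., "limit": ...}, which also matches the params registered with http_recorder in the step 6 test.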

5. Wire up the registration

sources/your_name/__init__.py:

from .fetcher import YourFetcher

fetcher_class = YourFetcher

__all__ = ["fetcher_class"]

The pipeline’s plugin loader reads fetcher_class from each source’s __init__.py and instantiates it. The attribute name is fixed; the class name is free.
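
How a loader might consume that attribute, including the soft-skip on ConfigError, can be sketched as follows. This is not the real plugin loader. load_fetchers is a hypothetical name, and the ConfigError class here is a local stand-in for thesisagents.core.exceptions.ConfigError so the snippet runs on its own.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Stand-in for thesisagents.core.exceptions.ConfigError."""


def load_fetchers(names):
    """Instantiate each plugin's fetcher_class, skipping unconfigured ones."""
    fetchers = {}
    for name in names:
        module = importlib.import_module(name)
        cls = getattr(module, "fetcher_class", None)
        if cls is None:
            # A missing attribute is a packaging bug, so fail loudly.
            raise AttributeError(f"{name}/__init__.py does not export fetcher_class")
        try:
            fetchers[name] = cls()
        except ConfigError as err:
            # Missing credentials: leave the plugin out, keep the rest.
            logger.debug("skipping %s: %s", name, err)
    return fetchers
```

The asymmetry is deliberate in this sketch: a missing fetcher_class fails loudly, while a missing API key only removes that one source from the run.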

6. Record a fixture and write the test

Tests are hermetic — no live HTTP. Every fetcher test uses a recorded fixture loaded via a monkey-patched HTTP transport.

Record one fixture per test scenario:

python scripts/record_fixture.py --source your_name \
    --query "transformer attention" --max 5

This writes tests/fixtures/your_name/transformer-attention.json (or .html / .xml for non-JSON sources). The recording script strips any user-specific tokens from the request before saving.

Then write the test:

# tests/sources/your_name/test_your_name.py
import asyncio
import json
from pathlib import Path

import pytest
from thesisagents.core.models import Query
from your_name.fetcher import YourFetcher


@pytest.fixture()
def transformer_fixture():
    p = Path(__file__).parent.parent.parent / "fixtures" / "your_name" / "transformer-attention.json"
    return json.loads(p.read_text(encoding="utf-8"))


def test_fetcher_parses_transformer_results(http_recorder, transformer_fixture, monkeypatch):
    monkeypatch.setenv("THESISAGENTS_YOUR_NAME_API_KEY", "test-key")
    http_recorder.add_response(
        url="https://api.your-source.example/v1/search",
        params={"q": "transformer attention", "limit": 5},
        json=transformer_fixture,
    )
    fetcher = YourFetcher()
    # fetch() is a coroutine, so drive it with asyncio.run() in this sync test.
    papers = asyncio.run(
        fetcher.fetch(
            Query(keywords="transformer attention", sources=("your_name",), max_results=5)
        )
    )
    assert len(papers) > 0
    assert papers[0].source == "your_name"
    assert papers[0].title  # always present
    assert papers[0].url.startswith("https://")

Add tests for:

Happy path (above) — recorded fixture parses cleanly.
Empty result set — fixture with zero entries returns [].
Missing optional fields — entries with no DOI / no abstract / no year still parse without raising.
Malformed JSON — ParseError raised on broken response.
HTTP 429 — RateLimitError raised.
HTTP 500 — SourceUnavailableError raised.
Unicode — title / authors in CJK / Cyrillic / Devanagari parse cleanly.
No API key (if your plugin needs one) — ConfigError raised at __init__.

The http_recorder fixture is defined in tests/conftest.py.

7. Verify

Run the full chain:

# Unit tests for your plugin
python -m pytest tests/sources/your_name/

# Integration: plugin shows up in the source list
python -c "from thesisagents.app.source_manager import list_sources; \
    print('your_name' in [s.name for s in list_sources()])"

# Live smoke (only if you have credentials):
THESISAGENTS_YOUR_NAME_API_KEY=... \
    thesisagents --query "diffusion models" --source your_name --max 5 \
                   --out ./smoke/your_name/

# Lint + security
python -m ruff check sources/your_name/
python -m bandit -c pyproject.toml -r sources/your_name/

Everything except the optional live smoke run must pass before commit (the project’s Definition of Done).

8. Update docs

Add your source to the table in Configuration with its rate limit + any required env var.
Add your source to the “Available source plugins” table in CLI.
Document any caveats (e.g. “results are limited to titles + abstracts; full text not available via the API”).

Common pitfalls

Constructing your own `httpx.AsyncClient`

Don’t. Use get_client(your_source_name). It applies:

HTTPS-only enforcement (refuses plain HTTP, even after redirect).
Your declared rate-limit token bucket.
Exponential backoff on 429 / 5xx (which also goes through the bucket).
A per-source User-Agent that respects upstream attribution rules.

Constructing your own client bypasses all four and puts you at real risk of being IP-blocked by the upstream.
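
For intuition about the third item, the retry schedule can be sketched as below. The base delay, the cap, and the set of retryable statuses are assumptions chosen for illustration. The real policy lives behind get_client and may differ.

```python
import random


def should_retry(status_code: int) -> bool:
    """Retry on rate limiting and on server-side errors only."""
    return status_code == 429 or 500 <= status_code < 600


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  jitter: float = 0.0) -> float:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0.0, jitter)
```

Client errors such as 404 are not retried in this sketch, since repeating the same request cannot fix them and only burns tokens from the bucket.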

Hardcoding an API key

Don’t. Load from os.environ.get("THESISAGENTS_..._API_KEY") and document the variable in Configuration. The GUI’s Settings page picks up any variable matching the THESISAGENTS_..._API_KEY pattern automatically when extended.

Returning records without `source_id`

The dedup pass uses source_id as a fallback when DOI / arXiv ID are missing. Without it, every record from your source is treated as a separate paper even when titles match.
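
The fallback order can be pictured with a small sketch. This is not the pipeline's dedup pass. dedup_key and dedupe are hypothetical, and records are plain dicts here rather than Paper objects so the snippet is self-contained.

```python
from __future__ import annotations

from typing import Any, Mapping


def dedup_key(paper: Mapping[str, Any]) -> tuple[str, str] | None:
    """Pick the strongest identifier available, or None if there is none."""
    doi = paper.get("doi")
    if doi:
        return ("doi", doi.lower())  # DOIs are case-insensitive
    arxiv_id = paper.get("arxiv_id")
    if arxiv_id:
        return ("arxiv", arxiv_id)
    source_id = paper.get("source_id")
    if source_id:
        # Only unique within one source, so namespace it by source name.
        return ("source", f"{paper['source']}:{source_id}")
    return None  # nothing to match on; the record cannot be merged


def dedupe(papers: list[Mapping[str, Any]]) -> list[Mapping[str, Any]]:
    """Keep the first record seen for each key; keyless records all survive."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for paper in papers:
        key = dedup_key(paper)
        if key is None or key not in seen:
            unique.append(paper)
        if key is not None:
            seen.add(key)
    return unique
```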

Returning HTML rendered into `abstract`

The exporter expects plain text. If your source returns HTML, strip it with beautifulsoup4:

from bs4 import BeautifulSoup

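raw_html = entry.get("abstract_html") or ""  # tolerate a missing or null field
# "lxml" requires the lxml package; "html.parser" is the stdlib alternative.
abstract = BeautifulSoup(raw_html, "lxml").get_text(separator=" ").strip()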
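
If you would rather not add a dependency for simple markup, a stdlib-only variant is possible. The sketch below is an alternative, not what the in-tree plugins do. html_to_text is a hypothetical helper, and it keeps the text of every element, including script and style, so it only suits simple abstracts.

```python
from __future__ import annotations

from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def html_to_text(raw_html: str | None) -> str:
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(raw_html or "")
    parser.close()  # flush any buffered trailing text
    return " ".join(" ".join(parser.chunks).split())
```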

Forgetting to set `pdf_url=None` when the link isn’t public

A paywalled PDF link with https:// will pass the HTTPS check but return 403 at download time. Better to leave pdf_url=None so the paywall gate triggers correctly.
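
If the upstream marks which links are openly readable, the PDF picker can check that flag as well. In the sketch below the open_access field is purely hypothetical: substitute whatever your source's payload actually provides, and fall back to None when it provides nothing.

```python
from __future__ import annotations

from typing import Any


def pick_public_pdf_url(entry: dict[str, Any]) -> str | None:
    """Return an HTTPS PDF link only when the source flags it as open."""
    for link in entry.get("links", []):
        url = link.get("url") or ""
        is_pdf = link.get("type") == "application/pdf"
        is_open = link.get("open_access") is True  # hypothetical payload field
        if is_pdf and is_open and url.startswith("https://"):
            return url
    return None  # unknown or paywalled: let the paywall gate handle it
```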

Using `xml.etree` on untrusted input

For sources that return XML (PubMed, arXiv Atom feed, …), use defusedxml, not xml.etree. The bandit rule B405 will flag the unsafe import at lint time.

Putting source-specific HTML selectors in core

If your plugin needs to parse HTML with bs4 selectors, those selectors live in sources/your_name/parser.py. They never go under thesisagents/core/.

When a plugin should be promoted to core

You’ll know it’s time when:

Every user wants the plugin enabled (no opt-in env var).
The plugin uses only the core dep set.
The upstream has stable rate limits and a stable contract.
The plugin has had no breaking changes in 6+ months.

To promote: move the code into thesisagents/core/<source>/, update the import paths in core_manager, and move the entry from PLUGIN_SOURCES to CORE_SOURCES in core/constants.py so it stays in ALL_SOURCES. The user-visible interface (the --source <name> flag) doesn’t change.

This has happened exactly zero times to date — the plugin pattern turns out to be the right home for most sources permanently.

Worked examples in-tree

sources/arxiv/ — JSON / Atom hybrid, no API key, low rate limit. Good starting point for a simple read-only source.
sources/pubmed/ — XML response, optional API key, two-step flow (search → fetch full record). Good example of multi-call patterns.
sources/ieee/ — dual API + scrape paths gated by different env vars. Good example of optional dependency handling.
sources/scholar/ — pure HTML scrape with Selenium fallback, ToS-opt-in. Good example of when not to do this — the code is fragile by necessity.
sources/springer/ — ConfigError at construction when key is missing. Good example of soft-skip integration.
sources/europepmc/ — open REST API, no key, JSON. Clean example of the over-fetch-then-year-post-filter pattern + structured-vs-flat author fallback.
sources/doaj/ — open API where the query rides in the URL path (percent-encoded), not a query parameter. Good example of a non-standard endpoint shape.
sources/hal/ — Solr-backed API whose fields are arrays even when single-valued. Good example of defensive array-or-scalar unwrapping.
sources/core/ — opt-in via THESISAGENTS_CORE_API_KEY passed as a Bearer header. Good example of header-based auth + soft-skip.
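
Two of the patterns named above are small enough to sketch here: array-or-scalar unwrapping and a query carried in the URL path. The helpers are hypothetical illustrations, not copies of the sources/hal/ or sources/doaj/ code, and the URL is the placeholder endpoint used earlier in this guide.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import quote


def first_value(value: Any) -> Any:
    """Return the first element of a list or tuple, or the scalar itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def query_in_path(base_url: str, query: str) -> str:
    """Append the percent-encoded query as the final path segment."""
    # safe="" also encodes "/", so the query cannot add path segments.
    return f"{base_url.rstrip('/')}/{quote(query, safe='')}"
```

For example, first_value(["A title"]) and first_value("A title") both give "A title", and a query containing a space ends up as %20 in the path.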