Source plugin authoring guide

Adding a new academic source — your own institution’s repository, a vendor API, a regional preprint server — without touching the core engine. Plugins are how ThesisAgents stays extensible while keeping its dependency surface small.

When to write a plugin

Write a source plugin when ANY of:

  1. The source needs a heavy or optional dependency (vendor SDK, Selenium for JS-rendered pages). Putting it in core forces every user to install it; putting it in a plugin makes it opt-in.

  2. The source’s failure mode could break the rest of the pipeline. A flaky upstream should fail in isolation; aggregating other sources’ results should still succeed.

  3. The source has independent release cadence. A Scholar HTML layout change should be patchable without re-shipping the engine.

If your source uses only httpx and returns clean JSON, it’s arguably core material — but the plugin pattern is cheap, so going through it is usually the right default.

File layout

sources/<your_name>/
├── __init__.py        # exports fetcher_class
├── fetcher.py         # the actual plugin
├── parser.py          # raw payload → Paper
└── config.py          # RateLimit + endpoint URLs

The directory sources/<your_name>/ must match the source name the user will pass to --source <your_name>. Stick to lowercase, underscores allowed (e.g. semantic_scholar).

The pipeline finds your plugin by injecting sources/ into sys.path at startup. The injection is done by thesisagents.app.source_manager for runtime and by tests/conftest.py for the test suite — you don’t need to touch either.

Step-by-step

1. Pick a name and register it

Add your source name to thesisagents/core/constants.py:

PLUGIN_SOURCES: tuple[str, ...] = (
    # existing plugin sources
    "ieee",
    "springer",
    "scholar",
    # ⇣ your new source
    "your_name",
)

ALL_SOURCES = CORE_SOURCES + PLUGIN_SOURCES picks it up automatically. If your plugin should be in the default search mix (no API key, ToS-friendly), also add it to DEFAULT_SOURCES. If it needs an opt-in env var, leave it out — the pipeline will skip it silently when its Fetcher raises ConfigError at construction.

2. Declare the rate limit + endpoint

sources/your_name/config.py:

from thesisagents.fetchers.rate_limit import RateLimit

ENDPOINT = "https://api.your-source.example/v1/search"

RATE_LIMIT = RateLimit(
    requests_per_second=2,    # match upstream's published ToS
    burst=4,                   # how many can fire back-to-back
    jitter_seconds=0.2,        # random delay added per request
)

USER_AGENT = "ThesisAgents/0.1 (+https://github.com/Integration-Automation/ThesisAgents)"

Pick a conservative rate limit. The bucket is the only thing protecting you from getting your IP blocked; if your source publishes “10 req/s” you should run at 5 req/s with jitter to account for short bursts.

3. Write the parser

sources/your_name/parser.py:

from __future__ import annotations

from typing import Any

from thesisagents.core.models import Paper


def parse_search_payload(payload: dict[str, Any]) -> list[Paper]:
    """Convert the source's raw search response into a list of Paper."""
    return [_parse_one(entry) for entry in payload.get("results", [])]


def _parse_one(entry: dict[str, Any]) -> Paper:
    return Paper(
        source="your_name",
        source_id=str(entry["id"]),
        title=entry["title"].strip(),
        authors=tuple(a["name"] for a in entry.get("authors", [])),
        year=entry.get("year"),
        venue=entry.get("venue"),
        abstract=entry.get("abstract", "") or "",
        url=entry["landing_page_url"],
        doi=entry.get("doi"),
        arxiv_id=entry.get("arxiv_id"),
        pdf_url=_pick_pdf_url(entry),
        citation_count=entry.get("citation_count"),
        raw=entry,
    )


def _pick_pdf_url(entry: dict[str, Any]) -> str | None:
    """Return the publicly-fetchable PDF URL or None."""
    for link in entry.get("links", []):
        if link.get("type") == "application/pdf" and link["url"].startswith("https://"):
            return link["url"]
    return None

Field rules:

  • Always populate source, source_id, title, authors, abstract, url. The pipeline will reject papers missing any required field.

  • Strip versioning from arxiv_id (2401.08741v2 2401.08741).

  • Strip URL prefixes from doi (https://doi.org/10.x/y 10.x/y).

  • HTTPS only for pdf_url. The downloader refuses non-HTTPS.

  • Keep the raw payload in raw so the LLM-as-agent flow and debug logging can use it.

4. Write the fetcher

sources/your_name/fetcher.py:

from __future__ import annotations

import os
from typing import Any

from thesisagents.core.exceptions import (
    ConfigError, ParseError, RateLimitError, SourceUnavailableError,
)
from thesisagents.core.models import Paper, Query
from thesisagents.fetchers.base import Fetcher
from thesisagents.fetchers.http import get_client

from .config import ENDPOINT, RATE_LIMIT, USER_AGENT
from .parser import parse_search_payload


class YourFetcher(Fetcher):
    """Plugin for the YourSource search API."""

    name = "your_name"
    rate_limit = RATE_LIMIT
    user_agent = USER_AGENT

    def __init__(self) -> None:
        # OPTIONAL: enforce env-var presence here. The pipeline will
        # catch ConfigError and silently skip your plugin.
        self._api_key = os.environ.get("THESISAGENTS_YOUR_NAME_API_KEY")
        if self._api_key is None:
            raise ConfigError(
                "THESISAGENTS_YOUR_NAME_API_KEY not set; YourSource plugin disabled"
            )

    async def fetch(self, query: Query) -> list[Paper]:
        client = get_client("your_name")
        try:
            response = await client.get(
                ENDPOINT,
                params={
                    "q": query.keywords,
                    "limit": query.max_results,
                    "year_from": query.year_from,
                    "year_to": query.year_to,
                },
                headers={"X-API-Key": self._api_key},
            )
            response.raise_for_status()
        except RateLimitError:
            raise
        except Exception as err:
            raise SourceUnavailableError(f"YourSource unreachable: {err}") from err

        try:
            payload: dict[str, Any] = response.json()
        except ValueError as err:
            raise ParseError(f"YourSource returned non-JSON: {err}") from err

        return parse_search_payload(payload)

get_client("your_name") is the only legal way to hit the network. It returns the per-source HTTPS-only httpx.AsyncClient with your plugin’s User-Agent + rate-limit decorator + retry policy already applied. Do not construct your own httpx.AsyncClient or call httpx.get / requests.get directly.

5. Wire up the registration

sources/your_name/__init__.py:

from .fetcher import YourFetcher

fetcher_class = YourFetcher

__all__ = ["fetcher_class"]

The pipeline’s plugin loader reads fetcher_class from each source’s __init__.py and instantiates it. The attribute name is fixed; the class name is free.

6. Record a fixture and write the test

Tests are hermetic — no live HTTP. Every fetcher test uses a recorded fixture loaded via a monkey-patched HTTP transport.

Record one fixture per test scenario:

python scripts/record_fixture.py --source your_name \
    --query "transformer attention" --max 5

This writes tests/fixtures/your_name/transformer-attention.json (or .html / .xml for non-JSON sources). The recording script strips any user-specific tokens from the request before saving.

Then write the test:

# tests/sources/your_name/test_your_name.py
import json
from pathlib import Path

import pytest
from thesisagents.core.models import Query
from your_name.fetcher import YourFetcher


@pytest.fixture()
def transformer_fixture():
    p = Path(__file__).parent.parent.parent / "fixtures" / "your_name" / "transformer-attention.json"
    return json.loads(p.read_text(encoding="utf-8"))


def test_fetcher_parses_transformer_results(http_recorder, transformer_fixture, monkeypatch):
    monkeypatch.setenv("THESISAGENTS_YOUR_NAME_API_KEY", "test-key")
    http_recorder.add_response(
        url="https://api.your-source.example/v1/search",
        params={"q": "transformer attention", "limit": 5},
        json=transformer_fixture,
    )
    fetcher = YourFetcher()
    papers = await fetcher.fetch(
        Query(keywords="transformer attention", sources=("your_name",), max_results=5)
    )
    assert len(papers) > 0
    assert papers[0].source == "your_name"
    assert papers[0].title  # always present
    assert papers[0].url.startswith("https://")

Add tests for:

  • Happy path (above) — recorded fixture parses cleanly.

  • Empty result set — fixture with zero entries returns [].

  • Missing optional fields — entries with no DOI / no abstract / no year still parse without raising.

  • Malformed JSONParseError raised on broken response.

  • HTTP 429RateLimitError raised.

  • HTTP 500SourceUnavailableError raised.

  • Unicode — title / authors in CJK / Cyrillic / Devanagari parse cleanly.

  • No API key (if your plugin needs one) — ConfigError raised at __init__.

The http_recorder fixture is defined in tests/conftest.py.

7. Verify

Run the full chain:

# Unit tests for your plugin
python -m pytest tests/sources/your_name/

# Integration: plugin shows up in the source list
python -c "from thesisagents.app.source_manager import list_sources; \
    print('your_name' in [s.name for s in list_sources()])"

# Live smoke (only if you have credentials):
THESISAGENTS_YOUR_NAME_API_KEY=... \
    thesisagents --query "diffusion models" --source your_name --max 5 \
                   --out ./smoke/your_name/

# Lint + security
python -m ruff check sources/your_name/
python -m bandit -c pyproject.toml -r sources/your_name/

All three must pass before commit (the project’s Definition of Done).

8. Update docs

  • Add your source to the table in Configuration with its rate limit + any required env var.

  • Add your source to the “Available source plugins” table in CLI.

  • Document any caveats (e.g. “results are limited to titles + abstracts; full text not available via the API”).

Common pitfalls

Constructing your own httpx.AsyncClient

Don’t. Use get_client(your_source_name). It applies:

  • HTTPS-only enforcement (refuses plain HTTP, even after redirect).

  • Your declared rate-limit token bucket.

  • Exponential backoff on 429 / 5xx (which also goes through the bucket).

  • A per-source User-Agent that respects upstream attribution rules.

Constructing your own bypasses all four; you’ll get IP-blocked within a day.

Hardcoding an API key

Don’t. Load from os.environ.get("THESISAGENTS_..._API_KEY") and document the variable in Configuration. The GUI’s Settings page picks up any variable matching the THESISAGENTS_..._API_KEY pattern automatically when extended.

Returning records without source_id

The dedup pass uses source_id as a fallback when DOI / arXiv ID are missing. Without it, every record from your source is treated as a separate paper even when titles match.

Returning HTML rendered into abstract

The exporter expects plain text. If your source returns HTML, strip it with beautifulsoup4:

from bs4 import BeautifulSoup

raw_html = entry.get("abstract_html", "")
abstract = BeautifulSoup(raw_html, "lxml").get_text(separator=" ").strip()

Using xml.etree on untrusted input

For sources that return XML (PubMed, arXiv Atom feed, …), use defusedxml not xml.etree. The bandit rule B405 will flag the unsafe usage at lint time.

Putting source-specific HTML selectors in core

If your plugin needs to parse HTML with bs4 selectors, those selectors live in sources/your_name/parser.py. They never go under thesisagents/core/.

When a plugin should be promoted to core

You’ll know it’s time when:

  • Every user wants the plugin enabled (no opt-in env var).

  • The plugin uses only the core dep set.

  • The upstream has stable rate limits and a stable contract.

  • The plugin has had no breaking changes in 6+ months.

To promote: move the code into thesisagents/core/<source>/, update the import paths in core_manager, and remove the entry from PLUGIN_SOURCES in core/constants.py. The user-visible interface (the --source <name> flag) doesn’t change.

This has happened exactly zero times to date — the plugin pattern turns out to be the right home for most sources permanently.

Worked examples in-tree

  • sources/arxiv/ — JSON / Atom hybrid, no API key, low rate limit. Good starting point for a simple read-only source.

  • sources/pubmed/ — XML response, optional API key, two-step flow (search → fetch full record). Good example of multi-call patterns.

  • sources/ieee/ — dual API + scrape paths gated by different env vars. Good example of optional dependency handling.

  • sources/scholar/ — pure HTML scrape with Selenium fallback, ToS-opt-in. Good example of when not to do this — the code is fragile by necessity.

  • sources/springer/ConfigError at construction when key is missing. Good example of soft-skip integration.

  • sources/europepmc/ — open REST API, no key, JSON. Clean example of the over-fetch-then-year-post-filter pattern + structured-vs-flat author fallback.

  • sources/doaj/ — open API where the query rides in the URL path (percent-encoded), not a query parameter. Good example of a non-standard endpoint shape.

  • sources/hal/ — Solr-backed API whose fields are arrays even when single-valued. Good example of defensive array-or-scalar unwrapping.

  • sources/core/ — opt-in via THESISAGENTS_CORE_API_KEY passed as a Bearer header. Good example of header-based auth + soft-skip.