# Configuration reference Every knob the user can turn — environment variables, CLI flags that map to env vars, GUI Settings page fields, and the on-disk QSettings store. This page is the single source of truth; other docs link here. ## Environment variables All env vars are read **at the moment a fetcher or extra is constructed**, not lazily — so a `set ANTHROPIC_API_KEY=...` after the CLI is running has no effect. Set them in the shell before launching, or use the GUI's Settings page, which mirrors each value into `os.environ` before any fetcher initialises. ### LLM enrichment | Variable | Default | Effect | |---|---|---| | `ANTHROPIC_API_KEY` | unset | Required for the Python `--enrich` pipeline. Triggers auto-enrichment when set (override with `--lightweight`). Never needed for the LLM-as-agent flow over MCP. | | `THESISAGENTS_LLM_MODEL` | `claude-opus-4-7` | Override the model used when `--enrich` is on. `--llm-model` on the CLI takes precedence. | ### Source plugin keys | Variable | Default | Effect | |---|---|---| | `THESISAGENTS_S2_API_KEY` | unset | Higher rate limit on the Semantic Scholar plugin (1/s anonymous → 10/s with key). **Also used by the OA resolver's S2 `openAccessPdf` lookup step** — without the key the resolver's S2 calls hit the anonymous tier and rate-limit fast. Free key at . | | `THESISAGENTS_NCBI_API_KEY` | unset | Raises PubMed's anonymous limit from 3 req/s to 10 req/s. | | `THESISAGENTS_IEEE_API_KEY` | unset | Switches the IEEE plugin from the scrape fallback to the official Xplore API (`ieeexploreapi.ieee.org`). Surfaces `pdf_url` for papers in your subscription scope. Apply at . | | `THESISAGENTS_DISABLE_IEEE_SCRAPING` | unset | **IEEE plugin is now default-ON.** Set `=1` to opt out of the scrape fallback. IEEE Xplore ToS are grey on automated traffic — set this if you don't want the scrape path running. | | `THESISAGENTS_SPRINGER_API_KEY` | unset | Free key from . **Required** for the Springer plugin — it raises `ConfigError` at construction without a key, which the pipeline silently skips. | | `THESISAGENTS_CROSSREF_PLUS_TOKEN` | unset | Crossref Plus subscriber token. Attached to requests as `Crossref-Plus-API-Token: Bearer `. Raises rate limits and improves cache freshness on the `acm` and `crossref` plugins. | | `THESISAGENTS_DISABLE_SCHOLAR_SCRAPING` | unset | **Scholar plugin is now default-ON.** Set `=1` to opt out. Google's ToS forbids automated access; default-on for coverage, opt-out if you'd rather not take the captcha / IP-block risk. | | `THESISAGENTS_DISABLE_WEBRUNNER` | unset | **Scholar + IEEE plugins + the PDF downloader for paywalled publisher CDNs all default to driving a real visible Chrome through WebRunner** (`je_web_runner` is a default dependency) — publisher bot-detection is far less aggressive on real browsers. PDF download host list includes `ieeexplore.ieee.org`, `dl.acm.org`, `link.springer.com`, `sciencedirect.com`, `onlinelibrary.wiley.com`, `tandfonline.com`, `academic.oup.com`, `nature.com`, `science.org`, plus a few engineering-society CDNs. Set `=1` to force the httpx paths instead (useful for CI / Docker without a Chrome binary). | | `THESISAGENTS_CHROME_PROFILE_DIR` | unset | When set, passes `--user-data-dir=` to Chrome so cookies / login state survive across CLI invocations. Used by **both** Scholar (one-time Google sign-in suppresses Scholar captchas) and IEEE (institutional auth cookies surface paywalled metadata). | | `THESISAGENTS_CORE_API_KEY` | unset | Free key from . Enables both the OA resolver's CORE.ac.uk lookup step (200M+ institutional / regional OA repository items) **and** the `core` search source. Skipped silently when unset — the `core` source drops out and the other OA strategies (Unpaywall, Semantic Scholar, arXiv) still run. | | `THESISAGENTS_CONTACT_EMAIL` | unset | Sent to Crossref / OpenAlex as the `mailto=` parameter (entry into their polite pool), to NCBI as `tool` / `email` headers, **and to Unpaywall as `email=`** for the post-dedup OA PDF resolver. Highly recommended — without it the resolver skips Unpaywall lookups entirely, which is the single biggest PDF coverage win for IEEE / ACM / Springer / Elsevier paywalled papers (typical lift 40-70%). | ### PDF download | Variable | Default | Effect | |---|---|---| | `THESISAGENTS_PDF_COOKIES_FILE` | unset | Path to a Netscape-format `cookies.txt`. When set, cookies whose `domain` field matches the host of an outbound PDF download are attached to that request only. **You** are responsible for compliance with the relevant publisher's terms of service. A startup warning fires when the env var is loaded. | ### Logging / debug | Variable | Default | Effect | |---|---|---| | `THESISAGENTS_LOG_LEVEL` | `INFO` | Set to `DEBUG` for verbose tracing — every HTTP request, every fetcher cache hit/miss, every rate-limit wait. `WARNING` quiets the normal per-paper progress line. | ### Qt / GUI These are read by Qt itself (not by ThesisAgents), but the GUI sets sensible defaults when they're absent. See the [GUI doc](gui.md) for the full HiDPI story. | Variable | Default | Effect | |---|---|---| | `QT_ENABLE_HIGHDPI_SCALING` | `1` (set by app.py if unset) | Enables Qt's HiDPI scaling. | | `QT_SCALE_FACTOR_ROUNDING_POLICY` | `PassThrough` (set by app.py if unset) | Lets fractional scale factors (125%, 150%) flow through unchanged. | | `QT_QPA_PLATFORM` | OS default | Set to `offscreen` for headless / CI runs of the GUI. | ## QSettings on-disk store When the GUI's **Settings → Save** is clicked, the values land in a per-OS persistent store **and** are mirrored into `os.environ` for the current process. Each subsequent launch of the GUI calls `apply_saved_env()` at startup, so the env vars are re-applied automatically. ### Storage locations | OS | Path | |---|---| | Windows | `HKEY_CURRENT_USER\Software\ThesisAgents\ThesisAgents` | | macOS | `~/Library/Preferences/com.ThesisAgents.ThesisAgents.plist` | | Linux | `~/.config/ThesisAgents/ThesisAgents.conf` | The QSettings organisation name is `ThesisAgents` and the application name is `ThesisAgents`. Tests inject a temporary storage path via `QSettings.setPath(...)` so they never touch the user's real store. ### Keys | QSettings key | Env var mirrored to | Type | |---|---|---| | `api/anthropic` | `ANTHROPIC_API_KEY` | string | | `api/semantic_scholar` | `THESISAGENTS_S2_API_KEY` | string | | `api/ncbi` | `THESISAGENTS_NCBI_API_KEY` | string | | `api/ieee` | `THESISAGENTS_IEEE_API_KEY` | string | | `api/springer` | `THESISAGENTS_SPRINGER_API_KEY` | string | | `api/crossref_plus` | `THESISAGENTS_CROSSREF_PLUS_TOKEN` | string | | `contact/email` | `THESISAGENTS_CONTACT_EMAIL` | string | | `pdf/cookies_file` | `THESISAGENTS_PDF_COOKIES_FILE` | string (absolute path) | | `ui/language` | (not mirrored) | string — a BCP-47 code from the 14 supported languages | ## CLI flags ↔ env var equivalents Many CLI flags have an environment-variable counterpart so unattended deployments don't need to pass long argument lists. | CLI flag | Env var | Notes | |---|---|---| | `--llm-model MODEL` | `THESISAGENTS_LLM_MODEL` | CLI wins if both are set. | That's the only one today; everything else is per-source plugin config above. The CLI explicitly does NOT read API keys via flags (no `--anthropic-key`) — passing a key on the command line writes it to shell history, which is a security footgun. ## CLI defaults Behaviour you don't see in `--help` because it's hard-coded: | Setting | Value | Where | |---|---|---| | Default page size (`--max`) | `25` | `thesisagents.core.constants.DEFAULT_PAGE_SIZE` | | Max results per source (`--max` ceiling) | `200` | `thesisagents.core.constants.MAX_RESULTS_PER_SOURCE` | | Default cache TTL | `86400` seconds (24 h) | `thesisagents.core.constants.CACHE_TTL_SECONDS` | | Default output dir | `./exports` | `thesisagents.cli._DEFAULT_OUT_DIR` | | Default export formats (`--query`) | `pptx, xlsx, bibtex` | `thesisagents.cli._DEFAULT_EXPORTS_SEARCH` | | Default export formats (`--paper`) | `pptx, bibtex` | `thesisagents.cli._DEFAULT_EXPORTS_SINGLE` | | Default slide language (`--lang`) | `en` | `thesisagents.exporters.i18n.DEFAULT_LANGUAGE` | | Default max slides per paper | `25` | `thesisagents.cli` (`--max-slides`) | | Paywall warning threshold | `0.30` (30%) | `thesisagents.cli.DEFAULT_PAYWALL_THRESHOLD` | | Top-tier-only filter | on (off via `--all-venues`) | `thesisagents.cli` | ## Per-source rate limits Defined in each plugin's `config.py`; enforced by a per-source token bucket in `thesisagents.fetchers.rate_limit`. The defaults match each upstream's published or observed soft limit: | Source | Rate | Jitter | Notes | |---|---|---|---| | `arxiv` | 1 req / 3 s | 0.5 s | Matches arXiv's API ToS. | | `semantic_scholar` | 1 req / s | 0.1 s | Anonymous limit; 10/s with API key. | | `openalex` | 10 req / s | 0.1 s | Polite pool (with `THESISAGENTS_CONTACT_EMAIL`). | | `pubmed` | 3 req / s | 0.1 s | Anonymous; 10/s with `THESISAGENTS_NCBI_API_KEY`. | | `acm` (via crossref) | 50 req / s | 0.05 s | Crossref public; Plus token raises further. | | `dblp` | 1 req / 2 s | 0.5 s | Conservative — DBLP is single-server. | | `crossref` | 50 req / s | 0.05 s | Polite pool with `mailto`. | | `openaire` | 2 req / s | 0.2 s | Conservative — OpenAIRE rate limits are not published. | | `europepmc` | 5 req / s | 0.2 s | Half of Europe PMC's stated 10 req/s ceiling. | | `doaj` | 2 req / s | 0.3 s | DOAJ asks for ≤ 2 req/s from a single client. | | `hal` | 2 req / s | 0.3 s | Polite — HAL publishes no per-second cap. | | `ieee` (API) | 10 req / s | 0.1 s | Per the IEEE Xplore API ToS. | | `ieee` (scrape) | 1 req / 5 s | 1.0 s | ToS-grey — extra-conservative. | | `springer` | 5 req / s | 0.2 s | Per the Springer Meta API ToS. | | `core` | 1 req / s | 0.3 s | Free-tier headroom; needs `THESISAGENTS_CORE_API_KEY`. | | `scholar` | 1 req / 10 s | 2.0 s | ToS forbids scraping — extra-conservative. | These are enforced by a decorator on the HTTP client; retries on 429 / 5xx also go through the bucket so a burst can't slip past the limit. ## Cache layout `thesisagents.core.cache` (used internally by fetchers) keys every raw response by `sha256(source + normalised_query + page)` and stores under `${XDG_CACHE_HOME:-~/.cache}/thesisagents/`. Override via: | Variable | Default | Effect | |---|---|---| | `THESISAGENTS_CACHE_DIR` | `~/.cache/thesisagents` (Linux), `~/Library/Caches/thesisagents` (macOS), `%LOCALAPPDATA%\thesisagents\Cache` (Windows) | Override the cache root. The autouse `_isolate_user_paths` test fixture redirects this to `tmp_path` so tests never write to your real cache. | Clear the cache by deleting the directory; ThesisAgents re-creates it on demand. ## Suppressing Scholar captchas with a persistent Chrome profile Google flags an IP after a few automated Scholar requests even with WebRunner's real-browser path. The reliable workaround is to seed a persistent Chrome profile with a real Google sign-in once; subsequent headless runs reuse the same session cookies, which Google trusts. **One-time setup:** ```powershell # 1. Pick a directory anywhere on disk $env:THESISAGENTS_CHROME_PROFILE_DIR = "D:\thesisagents-scholar-profile" # 2. Open Chrome visibly and trigger one Scholar request $env:THESISAGENTS_CHROME_HEADLESS = "0" thesisagents --query "any keywords" --source scholar --max 1 --out .\tmp\ # Chrome opens. Sign into your Google account, accept any consent # banners, complete any captcha. The window holds open for 60s. ``` **Every run after that:** ```powershell $env:THESISAGENTS_CHROME_PROFILE_DIR = "D:\thesisagents-scholar-profile" Remove-Item Env:\THESISAGENTS_CHROME_HEADLESS # back to headless thesisagents --query "..." --out .\exports\ ``` Chrome boots headless but loads the same profile dir, sends your authenticated Google session cookie, and Scholar serves real results instead of a captcha page. **Caveats:** - Only one Chrome process can hold the profile dir at a time. If you have a regular Chrome open on the same profile path, the WebRunner instance will fail to start. Use a dedicated path. - The session cookie is a real authentication credential. Treat the profile directory like a secret — back it up if you re-image the machine, restrict file permissions. - Cookie eventually expires (~1-2 months for Google). Re-do the interactive sign-in then. ## Settings the project explicitly does NOT have By design — listing them so a contributor doesn't accidentally add them. - **No `--anthropic-key` / `--ieee-key` / etc. CLI flag.** Keys on the command line land in shell history, a screen-sharing reveal, and process listings. Use env vars or the GUI. - **No global rate-limit override.** Per-source buckets are the enforcement boundary; a global override would let users break a single source's ToS by upping the wrong limit. - **No "skip robots.txt" flag.** The scrape-based plugins are off-by-default precisely because of robots / ToS concerns. - **No `--insecure` / `--allow-http` flag.** All egress is HTTPS, enforced by the project's transport wrapper. Plain-HTTP URLs are refused even after a redirect.