GLOSSARY

Terms used on this site

Definitions of terms used across the site: the 2-axis AI judgement (AI-built / AI-using), auxiliary scores, signature files, and adoption count. See the methodology page for the full formulas.

AXIS 2-axis sample judgement (the headline)

AI-built (is_ai_built)

Repos developed with AI coding tools. A repo counts as AI-built if it contains at least one signature file (CLAUDE.md, .cursorrules, AGENTS.md, .windsurfrules, .github/copilot-instructions.md, .aider.conf.yml, .clinerules). The judgement is applied to a monthly random sample of N=2,000 repos drawn from GH Archive.
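
The signature-file check can be sketched as a simple set lookup. This is a minimal illustration, not the site's implementation: the function name `is_ai_built` mirrors the flag name above, and the assumption that paths are repo-relative and matched exactly (case-sensitive) is ours.

```python
from typing import Iterable

# Signature files listed in the definition above.
SIGNATURE_FILES = frozenset({
    "CLAUDE.md",
    ".cursorrules",
    "AGENTS.md",
    ".windsurfrules",
    ".github/copilot-instructions.md",
    ".aider.conf.yml",
    ".clinerules",
})


def is_ai_built(repo_paths: Iterable[str]) -> bool:
    """Return True if any repo-relative path is a known signature file."""
    # Assumption: paths are relative to the repo root, use "/" separators,
    # and are compared exactly (case-sensitive).
    return any(path in SIGNATURE_FILES for path in repo_paths)
```

For example, `is_ai_built(["README.md", "CLAUDE.md"])` is true, while a repo containing only `README.md` and source files is not flagged.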

AI-using (is_ai_using)

Repos that depend on AI SDKs. A repo counts as AI-using if one of its package manifests (requirements.txt, pyproject.toml, package.json, go.mod, Cargo.toml, etc.) lists an AI SDK dependency (openai, anthropic, langchain, @anthropic-ai/sdk, etc.). This axis is independent of is_ai_built: a repo can satisfy either axis, both, or neither.
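
A manifest check along these lines might look as follows. This is a hedged sketch that handles only requirements.txt and package.json; the other manifest formats are omitted. The helper names, the `manifests` mapping, and the short SDK list are illustrative assumptions (the real list is longer).

```python
import json
import re

# SDK names taken from the definition above; the full list is longer ("etc.").
AI_SDK_PACKAGES = frozenset({"openai", "anthropic", "langchain", "@anthropic-ai/sdk"})


def deps_from_requirements(text: str) -> set[str]:
    """Extract lower-cased package names from requirements.txt content."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name ends at the first extras, version, or marker character.
        name = re.split(r"[\[<>=!~;\s@]", line, maxsplit=1)[0]
        if name:
            names.add(name.lower())
    return names


def deps_from_package_json(text: str) -> set[str]:
    """Extract dependency names from package.json content."""
    data = json.loads(text)
    names = set()
    for field in ("dependencies", "devDependencies"):
        names.update(data.get(field) or {})
    return names


def is_ai_using(manifests: dict[str, str]) -> bool:
    """manifests maps a manifest file name to its text content (hypothetical shape)."""
    deps: set[str] = set()
    if "requirements.txt" in manifests:
        deps |= deps_from_requirements(manifests["requirements.txt"])
    if "package.json" in manifests:
        deps |= deps_from_package_json(manifests["package.json"])
    return not deps.isdisjoint(AI_SDK_PACKAGES)
```

Matching on exact package names, rather than substrings, avoids flagging unrelated packages whose names merely contain an SDK name.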

AI-involved (either/both)

Repos that satisfy at least one of the two axes. This is the headline number on the home page: the size of the set of repos with any AI involvement.

Random sample / denominator

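A deterministic random sample of N=2,000 repos per month, drawn from GH Archive in BigQuery using FARM_FINGERPRINT. The 2-axis judgement is applied to each sampled repo via the GitHub GraphQL API. The resulting ratio is multiplied by the GitHub-wide active-repo total for the same month (the denominator) to estimate the absolute count. At N=2,000 the 95% margin of error is about ±1pt for ratios near 5%, and at most about ±2.2pt (at a ratio of 50%).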
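
The ratio-to-count estimate and its margin of error can be sketched as below. This uses the standard normal approximation for a sample proportion. The function name, argument names, and return shape are assumptions for illustration, not the site's code.

```python
import math
from typing import Iterable, Tuple


def estimate_ai_involved(
    flags: Iterable[Tuple[bool, bool]],
    active_repo_total: int,
    z: float = 1.96,  # z-value for a 95% confidence level
) -> dict:
    """flags holds one (is_ai_built, is_ai_using) pair per sampled repo."""
    flags = list(flags)
    n = len(flags)
    if n == 0:
        raise ValueError("sample is empty")
    # AI-involved means at least one axis is satisfied.
    involved = sum(1 for built, using in flags if built or using)
    p = involved / n
    # Normal-approximation margin of error for a sample proportion.
    margin = z * math.sqrt(p * (1 - p) / n)
    return {
        "ratio": p,
        "margin": margin,
        "estimated_count": round(p * active_repo_total),
    }
```

With 100 AI-involved repos in a sample of 2,000, the ratio is 5% with a margin of roughly ±0.96pt; at a ratio of 50% the margin widens to roughly ±2.2pt.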

SCORING Auxiliary scores (per-repo tracking only)

Auxiliary scores used for per-repo detail pages and the tracking threshold. The site-wide AI involvement ratios come from the 2-axis sample above; these scores are per-repo references.

AI relevance score

A 0-100 auxiliary score of how AI-related a repo is, based on AI keywords found in its name, description, topics, and README. A score of 40 or more triggers detailed tracking and a per-repo detail page.

Solo developer score

A 0-100 heuristic score. Signals: owner_type=User, a low follower count, a README, multiple pushes, and a homepage URL. A score of 60 or more marks the repo as a solo-developer candidate.

Web launch score

A 0-100 score. Signals: a homepage URL, deployment on Vercel or Netlify, and a README that contains a demo URL or landing-page keywords. A score of 50 or more marks the repo as a web-launch candidate.

Composite score

Each component is normalized so that its 2021 average equals 100, and the normalized components are then combined as a weighted sum. This score is auxiliary; the headline figures are the concrete counts.
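
The normalize-then-weight step can be illustrated as follows. The component names and weights are hypothetical; the actual components and weights are defined on the methodology page.

```python
from typing import Mapping


def composite_score(
    current: Mapping[str, float],
    baseline_2021_avg: Mapping[str, float],
    weights: Mapping[str, float],
) -> float:
    """Weighted sum of components, each indexed to its 2021 average = 100."""
    total = 0.0
    for name, weight in weights.items():
        base = baseline_2021_avg[name]
        if base == 0:
            raise ValueError(f"baseline for {name!r} is zero")
        # Index the component so that its 2021 average maps to 100.
        indexed = current[name] / base * 100.0
        total += weight * indexed
    return total
```

If the weights sum to 1 and every component equals its 2021 average, the composite is 100; doubling one component that carries weight 0.5 raises it to 150.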

Continuity rate

The fraction of AI-relevant repos created 3 months before the target month that received a push within the last 30 days. Ranges from 0.0 to 1.0; null when the cohort is empty.
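
A minimal sketch of the calculation, assuming the cohort (AI-relevant repos created 3 months before the target month) has already been selected upstream and each repo is represented by its last push date. The function and parameter names are illustrative.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def continuity_rate(
    last_push_dates: Iterable[Optional[date]],
    as_of: date,
    window_days: int = 30,
) -> Optional[float]:
    """Share of cohort repos whose last push falls within the window."""
    cohort = list(last_push_dates)
    if not cohort:
        return None  # empty cohort: the rate is null
    cutoff = as_of - timedelta(days=window_days)
    # A repo with no recorded push (None) counts as inactive.
    active = sum(1 for pushed in cohort if pushed is not None and pushed >= cutoff)
    return active / len(cohort)
```

Returning `None` for an empty cohort, instead of 0.0, keeps "no repos to measure" distinct from "no repo stayed active".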

Adoption / share

Per-category mention count and share. Computed by substring match on each repo's description + topics + AI summary corpus.
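
The substring-match counting can be sketched as below. Two points are assumptions on our part: a repo is counted once per category if any of that category's keywords appears in its corpus, and share is the category's count divided by the total count across all categories. The category names in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def adoption_by_category(
    corpora: Iterable[str],
    keywords_by_category: Mapping[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """corpora holds one description + topics + AI summary string per repo."""
    texts = [text.lower() for text in corpora]
    counts = {
        category: sum(
            1
            for text in texts
            # Case-insensitive substring match on the repo's corpus.
            if any(keyword.lower() in text for keyword in keywords)
        )
        for category, keywords in keywords_by_category.items()
    }
    total = sum(counts.values())
    return {
        category: {"count": count, "share": count / total if total else 0.0}
        for category, count in counts.items()
    }
```

For instance, with categories such as `framework: ["langchain"]` and `llm: ["openai", "claude"]`, a repo described as "LangChain agent with OpenAI" adds one mention to each.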

DETECTION Detection

Signature file

Config files generated/consumed by AI coding tools. Their presence in a repo signals tool usage: CLAUDE.md / AGENTS.md / .cursorrules / .windsurfrules / .github/copilot-instructions.md / .aider.conf.yml / .clinerules

AI keyword dictionary

The per-category keyword list, stored in the `ai_keywords` table. Examples: LangChain, OpenAI, pgvector, Claude. Short keywords that produce noisy matches (Lit, Bun) are deactivated.
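
Why short keywords are noisy under substring matching can be seen in a small sketch. The helper name and the sample texts are made up for illustration.

```python
def substring_hits(keyword: str, texts: list[str]) -> list[str]:
    """Return the texts that contain the keyword as a case-insensitive substring."""
    needle = keyword.lower()
    return [text for text in texts if needle in text.lower()]


# Hypothetical repo descriptions.
SAMPLE_TEXTS = [
    "Split PDFs into chunks",         # contains "lit" inside "Split"
    "Bundle size analyzer",           # contains "bun" inside "Bundle"
    "Web components built with Lit",  # a genuine mention of Lit
]
```

Here `substring_hits("lit", SAMPLE_TEXTS)` returns both the genuine Lit mention and the unrelated "Split PDFs into chunks", which is the kind of false positive that deactivating short keywords avoids.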

AI summary

A 500-1000 character summary of the README, generated by OpenAI gpt-5-mini and stored in both Japanese and English. Generation is kept within a $3/month budget cap.

Detected AI stack

On each repo detail page, the AI-related keywords found in the repo's description / topics / summary, grouped by category.

SOURCES Data sources

GitHub GraphQL / REST

GitHub's two API families. GraphQL for batch repo details + READMEs, REST Search for new-repo discovery, REST Code Search for site-wide signature counts.

GH Archive

Third-party project archiving public GitHub events (create / push / fork / watch). Available free as a public BigQuery dataset.

BigQuery

Google Cloud's data warehouse. Used here to query the GH Archive monthly tables, kept within the 1TiB/month free tier.

Monthly partition

Time-series tables like `repo_metric_snapshots` are partitioned by `RANGE (TO_DAYS(month))` for efficient month-scoped queries and easy old-month archival.
