AI Dev Impact Lab
FORMULAS & DETECTION · 377 tracked keywords
METHODOLOGY

Scoring formulas & detection rules

Every number on this site is computed by the formulas defined here — AI relevance scoring, AI coding tool detection, and adoption counts for all 13 categories. Reproducible by design.

DEFINITION 1. Two-axis AI definition

"Is this an AI project?" cannot be captured with a single criterion. We measure two orthogonal axes.

AI-built (is_ai_built)

Whether the repo's code was written with help from AI coding tools. The repo's purpose is irrelevant — could be a chatbot or a todo app.

Detection: a GitHub GraphQL query checks whether any of these files exists at HEAD:

  • CLAUDE.md → Claude Code
  • .cursorrules → Cursor
  • AGENTS.md → OpenAI Codex / generic
  • .windsurfrules → Windsurf
  • .github/copilot-instructions.md → GitHub Copilot
  • .aider.conf.yml → Aider
  • .clinerules → Cline

AI-using (is_ai_using)

Whether the repo's code embeds AI capabilities. If it depends on an AI SDK, it definitively integrates AI.

Detection: Check that a dependency manifest references an AI SDK:

  • Python: requirements.txt / pyproject.toml / Pipfile
  • JS/TS: package.json
  • Go: go.mod · Rust: Cargo.toml

AI SDKs detected: openai, anthropic, langchain, llama-index, huggingface, cohere, mistralai, replicate, qdrant, pinecone, chroma, weaviate, litellm, ollama …
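
The manifest check can be sketched as follows. This is a minimal illustration, not the site's implementation: it assumes the manifest text has already been fetched, covers only requirements.txt and package.json, uses a subset of the SDK names listed above, and assumes substring matching on normalized dependency names (the source does not say how a "reference" is matched). All function names are hypothetical.

```python
import json
import re

# Subset of the SDK names listed above; the site's full list is longer.
AI_SDKS = {"openai", "anthropic", "langchain", "llama-index", "cohere",
           "mistralai", "replicate", "litellm", "ollama"}


def _normalize(name: str) -> str:
    """Lowercase and map '_' / '.' to '-' so 'llama_index' matches 'llama-index'."""
    return re.sub(r"[_.]+", "-", name.strip().lower())


def requirements_deps(text: str) -> set[str]:
    """Extract package names from requirements.txt-style text."""
    deps = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):    # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            deps.add(_normalize(match.group(0)))
    return deps


def package_json_deps(text: str) -> set[str]:
    """Extract dependency names from package.json text."""
    data = json.loads(text)
    deps = set()
    for key in ("dependencies", "devDependencies"):
        deps.update(_normalize(name) for name in data.get(key, {}))
    return deps


def is_ai_using(deps: set[str]) -> bool:
    """Assumed rule: a dependency counts if its name contains an SDK name."""
    return any(sdk in dep for dep in deps for sdk in AI_SDKS)
```

Substring matching catches scoped packages such as `@anthropic-ai/sdk`, at the cost of occasional false positives on unrelated packages whose names happen to contain an SDK name.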

ESTIMATION 2. Sample-based GitHub-wide estimation

GitHub has 100M+ repositories — exhaustive detection is infeasible. We estimate global ratios via statistical sampling.

  1. Random sample of N=2,000 from BigQuery GH Archive's active set for the target month (deterministic via FARM_FINGERPRINT)
  2. Detect is_ai_built / is_ai_using on each sampled repo via GraphQL
  3. Sample ratios: built/N, using/N, plus both/either
  4. Estimated GitHub-wide count = sample ratio × GH-Archive monthly active total
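
The four steps above can be sketched in a few lines. This is an illustration under stated assumptions: SHA-256 stands in for BigQuery's FARM_FINGERPRINT (which has no Python standard-library equivalent), the function and class names are hypothetical, and the numbers in the example are invented, not real measurements.

```python
import hashlib
from dataclasses import dataclass


def deterministic_sample(repos: list[str], n: int, month: str) -> list[str]:
    """Order repos by a hash of (month, name) and keep the first n.

    The same inputs always yield the same sample, regardless of input order.
    SHA-256 is a stand-in for FARM_FINGERPRINT here.
    """
    def key(name: str) -> str:
        return hashlib.sha256(f"{month}:{name}".encode()).hexdigest()
    return sorted(repos, key=key)[:n]


@dataclass
class SampleCounts:
    n: int       # sample size
    built: int   # repos with is_ai_built
    using: int   # repos with is_ai_using
    both: int    # repos with both flags


def estimate_github_wide(sample: SampleCounts, monthly_active_total: int) -> dict[str, int]:
    """Scale each sample ratio by the monthly active total."""
    either = sample.built + sample.using - sample.both  # inclusion-exclusion
    counts = {"built": sample.built, "using": sample.using,
              "both": sample.both, "either": either}
    return {axis: round(count * monthly_active_total / sample.n)
            for axis, count in counts.items()}


# Invented numbers for illustration only.
sample = SampleCounts(n=2000, built=120, using=90, both=30)
estimates = estimate_github_wide(sample, monthly_active_total=5_000_000)
```

Keying the hash on the month keeps each month's sample reproducible while still drawing a different sample from month to month.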

With N=2,000, the 95% margin of error is about ±1 point for ratios near 5% and at most about ±2.2 points (worst case, a ratio of 50%). Samples are independent across months.
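
The precision figure can be checked with the usual normal-approximation (Wald) margin of error for a proportion. The choice of the Wald interval is an assumption; the source does not say which interval it uses.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion p.

    z = 1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)


# At N = 2,000 the half-width depends on the ratio being estimated:
near_five_percent = margin_of_error(0.05, 2000)  # about 0.0096, i.e. ~1 point
worst_case = margin_of_error(0.50, 2000)         # about 0.0219, i.e. ~2.2 points
```

The interval is narrowest for ratios far from 50%, which is why a small adoption ratio is estimated to within roughly one percentage point.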

AUX Auxiliary scores (reference)

Legacy auxiliary scores per tracked repo (0-100). The site's primary axis is the two-axis AI detection above; these are kept for repo-level filtering and detail pages.

AI relevance (0-100)

AI keyword density across the repo name, description, topics, and README. Repos scoring ≥40 are tracked.

Solo dev likelihood (0-100)

Heuristic: owner_type, low followers, has README, multiple pushes, has homepage.

Web launch (0-100)

homepage URL + deployed on Vercel/Netlify + landing keywords in README.

Continuity rate

Fraction of repos created 3 months ago that received a push within the last 30 days.
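
A minimal sketch of the continuity rate, assuming that "created 3 months ago" means created in the calendar month three months before the evaluation date (the source does not define the cohort window). The function names and input shape are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional


def month_minus(d: date, months: int) -> tuple[int, int]:
    """Return (year, month) of the calendar month `months` before d."""
    index = d.year * 12 + (d.month - 1) - months
    return index // 12, index % 12 + 1


def continuity_rate(repos: list[tuple[date, date]], as_of: date) -> Optional[float]:
    """repos holds (created_at, pushed_at) pairs; returns None for an empty cohort."""
    cohort_month = month_minus(as_of, 3)
    cohort = [(created, pushed) for created, pushed in repos
              if (created.year, created.month) == cohort_month]
    if not cohort:
        return None
    cutoff = as_of - timedelta(days=30)
    active = sum(1 for _, pushed in cohort if pushed >= cutoff)
    return active / len(cohort)
```

For example, with an evaluation date of 2025-06-15 the cohort is every repo created in March 2025, and a repo counts as active if its last push is on or after 2025-05-16.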

DETECT 3. AI coding tool detection

Via GitHub GraphQL `object()` we check whether the following files/directories exist at HEAD. Hits are recorded in `repo_ai_signals` and feed the AI coding tools ranking.

File / dir                      | Tool                   | Vendor
--------------------------------|------------------------|----------
CLAUDE.md / .claude/            | Claude Code            | Anthropic
AGENTS.md                       | OpenAI Codex / generic | OpenAI
.cursorrules / .cursor/rules/   | Cursor                 | Anysphere
.github/copilot-instructions.md | GitHub Copilot         | GitHub
.windsurfrules                  | Windsurf               | Codeium
.aider.conf.yml                 | Aider                  | Aider
.clinerules                     | Cline                  | Cline
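
The existence check can be sketched as one GraphQL query with an aliased `object(expression: "HEAD:<path>")` field per path. GitHub returns null for a path that does not exist, so any non-null alias is a hit. This is a sketch under assumptions: the alias scheme (`f0`, `f1`, ...) and function names are hypothetical, and the HTTP call is omitted.

```python
# Paths from the table above, mapped to the tool they signal.
SIGNAL_PATHS = {
    "CLAUDE.md": "Claude Code",
    ".claude": "Claude Code",
    "AGENTS.md": "OpenAI Codex / generic",
    ".cursorrules": "Cursor",
    ".cursor/rules": "Cursor",
    ".github/copilot-instructions.md": "GitHub Copilot",
    ".windsurfrules": "Windsurf",
    ".aider.conf.yml": "Aider",
    ".clinerules": "Cline",
}


def build_query() -> str:
    """Build a query with one aliased object() lookup per signal path.

    Owner and name are passed separately as GraphQL variables.
    """
    fields = "\n".join(
        f'    f{i}: object(expression: "HEAD:{path}") {{ __typename }}'
        for i, path in enumerate(SIGNAL_PATHS)
    )
    return (
        "query($owner: String!, $name: String!) {\n"
        "  repository(owner: $owner, name: $name) {\n"
        f"{fields}\n"
        "  }\n"
        "}"
    )


def detected_tools(repository: dict) -> set[str]:
    """Map the `repository` part of the response to the set of tools found."""
    tools = set()
    for i, tool in enumerate(SIGNAL_PATHS.values()):
        if repository.get(f"f{i}") is not None:  # null means the path is absent
            tools.add(tool)
    return tools
```

Under this sketch a repo would count as AI-built when `detected_tools(...)` is non-empty, and each detected tool would become one row in `repo_ai_signals`.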

ADOPTION 4. Adoption count for the 13 categories

For LLM provider / framework / vector DB / model categories, we substring-search a per-repo corpus (description + GitHub topics + AI-summarized README) against each category's keyword dictionary (`ai_keywords`, currently 377 active keywords). The count is the number of distinct repos that mention each keyword.
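
The counting rule can be sketched as follows. Case-insensitive matching is an assumption (the source only says "substring-search"), and the function name and input shape are hypothetical.

```python
def adoption_counts(corpora: dict[str, str], keywords: list[str]) -> dict[str, int]:
    """Count, per keyword, the number of distinct repos whose corpus contains it.

    corpora maps a repo identifier to its combined text
    (description + topics + summarized README).
    """
    counts = {keyword: 0 for keyword in keywords}
    for text in corpora.values():
        lowered = text.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1  # at most one hit per repo per keyword
    return counts


# Invented corpora for illustration only.
example = adoption_counts(
    {
        "a/x": "LangChain agent using OpenAI; openai client wrapper",
        "b/y": "Vector search with Qdrant",
        "c/z": "A todo app",
    },
    ["openai", "langchain", "qdrant", "pinecone"],
)
```

Because each repo contributes at most one to a keyword's count, repeated mentions inside one README do not inflate the adoption number.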

CADENCE 5. Update cadence

Job                      | Cadence                       | What
-------------------------|-------------------------------|--------------------------------------
ai-index:daily           | daily 01:10 JST               | discover new AI repos, refresh scores
ai-index:weekly          | Sun 03:10 JST                 | full repo re-scan
ai-index:monthly         | 1st 04:10 JST                 | GH Archive aggregation, monthly metrics
ai-index:generate-report | 1st 05:30 JST                 | monthly report draft + auto-publish
ai-index:summarize       | manual (within $3/mo budget)  | README AI summarization (gpt-5-mini)
