SHUR IQ Experiment Lab

Autoresearch Pipeline Status
Last updated: 2026-03-28 | Branch: feature/semantic-layer-sbpi
10
Total Experiments
2 complete, 3 planned, 1 starting, 1 blocked, 3 proposed
69.9%
Best Accuracy
Exp 2: TPE optimization (30 trials)
+46.3 ppt
Improvement
From 23.5% (KG-augmented baseline)
75%
Next Target
Exp 3: MOTPE Multi-Objective
Accuracy Trajectory: directional prediction accuracy across experiment iterations
Exp 0 (Baseline Methods, best: mean reversion): 47.1%
Exp 2 (Markovick TPE Optimization): 69.9%
Targets: 75% (next), 85% (goal)
Nightly Cycle Status 9-phase automated pipeline
Active — running nightly via weekly-prediction-cycle.py
Phase 1 ETL Load Oxigraph ingest
Phase 2 Accuracy Check Prior predictions
Phase 3 Predict Multi-signal
Phase 4 Attest Evidence quality
Phase 5 Insights SPARQL digest
Phase 6 KG Optimize TPE (Exp 2)
Phase 7 Event Impact Track A: BI Agent
Phase 8 Defensive BI Track B: Mitigations
Phase 9 Signal Weights Track C: Autoresearch
Legend: Required phase · Advisory phase
Latest Insight Digest 2026-03-28 10:24 — Weekly movers + predictive signals
DramaBox +4.0 | Tier 1 | 82.75 — $500M valuation signal, SE Asia fastest-growing, ONLY profitable pure-play
JioHotstar +3.95 | Tier 2 | 62.25 — IPL launch imminent, 300M subscriber leverage
COL/BeLive +3.15 | Tier 3 | 44.55 — FILMART launch converts to execution, SaaS provable
Disney +2.3 | Tier 1 | 76.55 — Locker Diaries #1, DramaBox Accelerator investment
ReelShort -2.05 | Tier 1 | 82.0 — Production head defection, ShortMax 3888% growth eroding position
Netflix -2.0 | Tier 2 | 60.8 — No production activity, mobile engagement gap widening
Amazon -2.6 | Tier 3 | 50.2 — ONLY major platform with zero microdrama strategy
KLIP -2.65 | Tier 4 | 22.35 — Structural squeeze from JioHotstar

Predictive Signals:
BULLISH: JioHotstar (+9.45), COL/BeLive (+7.25), Disney (+5.55), DramaBox (+5.25), GoodShort (+4.5)
BEARISH: Amazon (-5.8), Netflix (-5.0), ReelShort (-2.6)
Full Experiment Registry: all experiments (complete, active, planned, and proposed)
Experiment | Status | Methodology | Key Metric | Data Req. | Timeline
Exp 0: Baseline Methods | Complete | Persistence, naive momentum, mean reversion, KG-augmented | 47.1% accuracy | 3 weeks | W10-W11

Description

Four baseline prediction methods tested against the first 3 weeks of SBPI data. Established the performance floor that all subsequent experiments must beat.

Results

  • Persistence: 23.5% directional accuracy
  • Naive momentum: 23.5%
  • Mean reversion: 47.1% (best baseline)
  • KG-augmented: 23.5% (default parameters)
Exp 2: Markovick TPE Optimization | Complete | 12-param TPE via Optuna (Markovick et al. 2025) | 69.9% accuracy | 3 weeks | W11-W12

Description

Applied Markovick et al. (arXiv:2505.24478v1) methodology: treat the KG-to-prediction interface as a 12-parameter search space. Tree-structured Parzen Estimator optimization over 30 trials to maximize directional accuracy on historical data.
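The shape of this loop can be sketched without the real optimizer. The pipeline uses Optuna's TPE sampler over 12 parameters; below, a seeded random search stands in so the objective/search structure is visible. Parameter names come from best-config.json, but the ranges and the toy objective are illustrative assumptions, not the actual search space in kg_interface_optimizer.py.

```python
import random

# Illustrative subset of the 12-parameter space (ranges are assumptions).
SEARCH_SPACE = {
    "direction_threshold": (0.1, 2.0),
    "confidence_base": (0.3, 0.9),
    "mean_reversion_rate": (0.0, 0.5),
    "divergence_weight": (0.0, 0.3),
}

def evaluate(params, history):
    """Stand-in objective: directional accuracy on historical transitions.
    Each history item is (actually_moved, blended_signal)."""
    hits = sum(1 for actual, signal in history
               if (signal > params["direction_threshold"]) == actual)
    return hits / len(history)

def optimize(history, n_trials=30, seed=0):
    """Random-search stand-in for the 30-trial TPE loop."""
    rng = random.Random(seed)
    best_score, best_params = -1.0, None
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}
        score = evaluate(params, history)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

In the real loop, `evaluate` replays predictions against the historical transition pairs and TPE (not random sampling) proposes each trial's parameters.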

Key Findings

  • Best score: 0.6986 (trial 28 of 30)
  • direction_threshold shifted from 0.5 to 1.295 (+159%)
  • mean_reversion_rate increased from 0.1 to 0.257 (+157%)
  • New signals activated: divergence_weight (0.180), tier_proximity_weight (0.096)
  • anomaly_contributes flipped from False to True
Exp 1: Goodhart Guard | Planned | Overtuning detection + early stopping (Schneider et al. 2025) | Protective (degenerate rate) | 2 weeks | Week 0-1

Description

Implements early stopping and default-baseline comparison to detect overtuning in the nightly TPE loop. With 30 trials on 51 observations, the trials-per-data-point ratio of 0.59 exceeds the safe threshold of 0.3 identified by Schneider et al.
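A minimal sketch of the guard logic, under the assumption that it reduces to two checks: flag runs whose trials-per-data-point ratio exceeds the safe threshold, and fall back to the default-parameter baseline when no safely-truncated trial beats it. Function names are hypothetical; the actual implementation is not shown in this document.

```python
def overtuning_risk(n_trials: int, n_observations: int, safe_ratio: float = 0.3) -> bool:
    """Flag runs where trials-per-data-point exceeds the safe threshold.
    The current nightly setting (30 trials on 51 observations) gives 0.59."""
    return n_trials / n_observations > safe_ratio

def guarded_best(trial_scores, default_score, n_observations, safe_ratio=0.3):
    """Early-stop the trial sequence at the safe ratio, then keep the
    optimized config only if it beats the default-parameter baseline."""
    max_trials = int(safe_ratio * n_observations)
    kept = trial_scores[:max_trials]
    best = max(kept, default=default_score)
    return best if best > default_score else default_score
```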

Expected Outcome

  • Detect ~10% of nightly runs producing overtuned configs
  • Prevent 2-5 ppt accuracy drops on unseen weeks
  • Adds ~30 seconds to Phase 6
  • Must run before Experiments 3-5
Exp 3: MOTPE Multi-Objective | Planned | Multi-objective TPE: accuracy + Brier + MAE (Barker et al. 2025) | Target: 75%+ | 4 weeks | Week 1-2

Description

Replace single-objective TPE with Optuna MOTPESampler. Optimize jointly over directional accuracy (maximize), Brier score (minimize), and MAE (minimize). Produces a Pareto front of non-dominated configurations instead of a single best point.
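The core concept behind the Pareto front can be sketched independently of Optuna. A configuration survives only if no other configuration is at least as good on all three objectives (accuracy up, Brier down, MAE down) and strictly better on one:

```python
def dominates(a, b):
    """True if objectives a dominate b, where each is a tuple
    (accuracy, brier, mae): accuracy maximized, the rest minimized."""
    acc_a, brier_a, mae_a = a
    acc_b, brier_b, mae_b = b
    no_worse = acc_a >= acc_b and brier_a <= brier_b and mae_a <= mae_b
    strictly_better = acc_a > acc_b or brier_a < brier_b or mae_a < mae_b
    return no_worse and strictly_better

def pareto_front(configs):
    """Keep non-dominated configurations, as a multi-objective study would."""
    return [c for c in configs
            if not any(dominates(o["objectives"], c["objectives"])
                       for o in configs)]
```

In Optuna this corresponds to `create_study(directions=["maximize", "minimize", "minimize"])`, after which `study.best_trials` returns the non-dominated set rather than a single best point.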

Expected Outcome

  • Brier score improvement of 10-20% from ~0.25 baseline
  • Accuracy stable or +3-8%
  • Resists "predict stable everywhere" degenerate solution
  • ~40 lines code change in kg_interface_optimizer.py
Exp 4: Dimension Weight Optimization | Planned | TPE over dimension weights (Lu et al. 2025, Wakayama 2024) | +5-15% relative | 6 weeks | Week 4-6

Description

Current dimension weights (Distribution 0.25, Content 0.20, Narrative 0.20, Community 0.20, Monetization 0.15) are set by intuition. This experiment adds 4 free weight parameters to the TPE search space (12 → 16 params), with the 5th constrained to sum-to-1.
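The sum-to-1 constraint can be handled by letting the optimizer suggest 4 free weights and deriving the 5th, pruning trials where the derived weight is infeasible. This is one standard way to encode the constraint; the floor value and the pruning choice are assumptions, not decisions documented here.

```python
def dimension_weights(free_weights, floor=0.05):
    """Given 4 free weights (e.g. TPE-suggested), derive the 5th so the
    5-dimension vector sums to 1. Returns None when the derived weight
    falls below the floor, so the optimizer can prune the trial."""
    if len(free_weights) != 4:
        raise ValueError("expected 4 free weights")
    fifth = 1.0 - sum(free_weights)
    if fifth < floor:
        return None  # infeasible trial: prune rather than clip
    return list(free_weights) + [fifth]
```

With the current intuition-set weights as input, the derived fifth weight reproduces the existing Monetization value of 0.15.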

Expected Outcome

  • 5-15% relative accuracy improvement from static weight optimization
  • Additional 3-8% from covariate-dependent weights (Phase 2)
  • Requires MOTPE (Exp 3) to be active
  • ~80 lines across optimizer + sbpi_to_rdf.py
Exp 5: Temporal Decay Signal | Blocked | Exponential temporal decay (Gastinger et al. 2024) | +8-15% relative | 8 weeks | Week 8-10

Description

Add exponential temporal decay weighting: recent weeks contribute more to predictions. Introduces 2 new parameters (temporal_decay_rate, temporal_lookback). Blocked until 8+ weeks of data exist. Currently at ~4 weeks.
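A sketch of the weighting scheme, assuming the two new parameters map onto a decay rate and a lookback window; the default values shown are illustrative, since the experiment has not run yet.

```python
import math

def decay_weights(n_weeks, decay_rate=0.3, lookback=8):
    """Exponential decay weights over the most recent `lookback` weeks,
    normalized to sum to 1. Age 0 is the most recent week, so recent
    weeks contribute more to the prediction."""
    window = min(n_weeks, lookback)
    raw = [math.exp(-decay_rate * age) for age in range(window)]
    total = sum(raw)
    return [w / total for w in raw]
```

In the pipeline's terms, `decay_rate` and `lookback` correspond to the proposed temporal_decay_rate and temporal_lookback parameters, which would join the TPE search space.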

Expected Outcome

  • 8-15% relative accuracy improvement
  • Captures market momentum recency bias
  • Blocked until ~late April 2026 (W18+)
  • ~120 lines across prediction engine + optimizer
Exp 6: Cross-Vertical Transfer (K-Pop) | Starting | Warm-start from micro-drama params (Zeng et al. 2025) | 40-60% fewer trials | 4 weeks target | Week 10-12

Description

Transfer the 12 optimized parameters from the micro-drama vertical to warm-start a K-Pop vertical. Tests whether the SBPI methodology generalizes across entertainment domains. Adapted from BOLT (Zeng et al. 2025) multi-task Bayesian optimization.

Expected Outcome

  • 40-60% reduction in trials-to-convergence
  • 5-10% higher ceiling accuracy vs cold-start
  • Validates platform thesis (methodology transfers across verticals)
  • ~200 lines new code (cross_vertical_transfer.py)
Recursive Triple Expansion | Proposed | Automated KG growth via $7 slow-model extraction | Billion-node scaling | N/A | Future

Description

From Gemini session: use a locally hosted quantized model (Llama-3-70B class) at $7/report to extract entities and relationships from crawled sources 24/7. Scale from 96 nodes / 268 edges to billion-node territory. An "Ontological Referee" agent checks extractions against the existing graph for redundancy.

Projected Economics

  • $7 per internal briefing (local compute cost)
  • 2,000+ briefings/month at negligible marginal cost
  • 100 nodes per briefing extraction density
  • 12B nodes/year at full scale
Self-Consistency Validation | Proposed | KG vs parametric memory accuracy delta | "Ontological Premium" | N/A | Future

Description

From Gemini session: run thousands of "Self-Consistency" tests — ask the system to solve a problem using its KG vs. its parametric (LLM) memory. The delta in accuracy is the "Ontological Premium" — the measurable value of the curated knowledge graph over vanilla LLM output.

Investor Value

  • Proves: "ShurIQ users experience 90% fewer hallucinations than vanilla GPT-4" (hypothesis)
  • Quantifies the IP value of the knowledge graph directly
  • Enables licensing model: per-query or per-vertical access
  • Turns "consulting" narrative into "pre-computed intelligence" narrative
Ontological Referee Loop | Proposed | Redundancy detection + ontology quality scoring | Extraction precision | N/A | Future

Description

From Gemini session: a second, faster agent checks proposed KG extractions against the existing graph to identify: (a) redundant data (already known), (b) contradictions (conflicts with existing triples), (c) high-value bridge nodes (connect previously disconnected clusters). Only high-value nodes are baked into the permanent graph.

Quality Metrics

  • Extraction yield: high-fidelity facts per report
  • Ontology breadth: unique classes in schema
  • Inference premium: KG-augmented quality lift
  • Amortized extraction cost trending toward zero
Experiment 0: Baseline Results 4 methods tested on 3 weeks of SBPI data (W10-W12, 17 companies)
Method | Dir. Accuracy | MAE | Brier Score | Notes
Persistence | 23.5% | 1.803 | 0.250 | Predicts no change. Floor performance.
Naive Momentum | 23.5% | 1.803 | 0.279 | Extends last-week direction. No improvement over persistence.
Mean Reversion | 47.1% | 2.107 | 0.250 | Best baseline. Companies tend to revert toward tier mean.
KG-Augmented (defaults) | 23.5% | 1.803 | 0.250 | Default parameters leave significant accuracy on the table.
Experiment 2: TPE Optimization Results 30 Optuna TPE trials on 2 transition pairs (W10→W11, W11→W12). Best trial: #28.
69.9%
Best Accuracy
Trial 28 of 30
63.4%
Mean Across Trials
σ = 3.32%
57.5%
Worst Trial
Trial 4
2,588
KG Triples
In Oxigraph store
[Chart: per-trial accuracy across the 30 TPE trials, y-axis 55-70%; best trial (0.6986) highlighted against the other trials]
Optimized Configuration: 12 parameters from best-config.json, delta from Exp 0 defaults
Parameter | Exp 0 Default | Exp 2 Optimized | Delta | Interpretation
direction_threshold | 0.500 | 1.295 | +159% | Higher bar for calling a directional move. Reduces false positives.
confidence_base | 0.600 | 0.443 | -26% | Lower base confidence. System is more cautious by default.
magnitude_thresh_1 | 3.000 | 3.020 | +1% | Near-default. Magnitude thresholds were already reasonable.
magnitude_thresh_2 | 5.000 | 5.076 | +2% | Near-default.
consistency_thresh | 2.000 | 1.980 | -1% | Near-default.
magnitude_bonus_1 | 0.100 | 0.120 | +20% | Slightly rewards larger moves.
magnitude_bonus_2 | 0.100 | 0.136 | +36% | Larger bonus for big moves. System learns big moves are informative.
consistency_bonus | 0.050 | 0.040 | -20% | Consistency signal matters less than expected.
mean_reversion_rate | 0.100 | 0.257 | +157% | Strong mean reversion signal. Companies tend to revert toward tier means.
anomaly_contributes | False | True | changed | Anomaly signal activated. Dimension-composite gaps are predictive.
divergence_weight | 0.000 | 0.180 | new signal | Inter-dimension divergence is informative (18% weight).
tier_proximity_weight | 0.000 | 0.096 | new signal | Proximity to tier boundaries is predictive (9.6% weight).
Key Insight: The two largest parameter shifts (direction_threshold +159%, mean_reversion_rate +157%) point to the same conclusion: the micro-drama competitive landscape is dominated by reversion dynamics, not momentum. Companies overshoot in both directions and pull back. The optimizer also activated three previously dormant signals (anomaly, divergence, tier proximity), confirming that the knowledge graph structure contains predictive information that raw scoring misses.
9-Phase Nightly Prediction Cycle Orchestrated by weekly-prediction-cycle.py — all phases sequential, advisory phases non-blocking
Phase 1: ETL Load
Required — sbpi_to_rdf.py --all --validate

Loads new week's SBPI scoring data into the Oxigraph RDF store. Validates triples against the SBPI ontology (sbpi.ttl). Currently processing 2,588 triples across 17 companies, 5 dimensions, 3 weekly snapshots.
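The scoring-data-to-triples step can be sketched as a small serializer. The namespace, class, and property names below are placeholders for illustration; the real schema is defined in sbpi.ttl and the real ETL lives in sbpi_to_rdf.py.

```python
def score_to_turtle(company_slug, week, dimension, score,
                    base="https://shuriq.example/sbpi#"):
    """Serialize one SBPI dimension score as Turtle triples.
    All term names here are hypothetical stand-ins for the sbpi.ttl schema."""
    subject = f"<{base}{company_slug}-W{week}-{dimension}>"
    return "\n".join([
        f"{subject} a <{base}DimensionScore> ;",
        f"    <{base}company> <{base}{company_slug}> ;",
        f"    <{base}week> {week} ;",
        f"    <{base}dimension> \"{dimension}\" ;",
        f"    <{base}score> {score:.2f} .",
    ])
```

Each weekly run emits one such block per company-dimension pair, which the loader validates and pushes into the Oxigraph store.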

Phase 2: Prediction Accuracy Check
Optional — prediction_engine.py --report

Compares previous week's predictions against actual outcomes. Feeds accuracy metrics into the optimization loop. Skipped if no prior predictions exist.

Phase 3: Prediction Generation
Required — prediction_engine.py --generate

Multi-signal prediction engine using the 12 optimized parameters from best-config.json. Generates directional predictions (up/down/stable) with confidence scores and magnitude estimates for each company.
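One plausible reading of how the optimized parameters gate a prediction, assuming the engine thresholds a blended signal score. The blending itself happens upstream in prediction_engine.py and is not shown here; this sketch is an interpretation, not the actual engine logic.

```python
def predict_direction(signal, config):
    """Map a blended signal score to (direction, confidence) using the
    interface parameters: direction_threshold gates the call, and the
    magnitude thresholds/bonuses adjust confidence for larger moves."""
    t = config["direction_threshold"]
    if abs(signal) <= t:
        return "stable", config["confidence_base"]
    direction = "up" if signal > 0 else "down"
    confidence = config["confidence_base"]
    if abs(signal) > config["magnitude_thresh_1"]:
        confidence += config["magnitude_bonus_1"]
    if abs(signal) > config["magnitude_thresh_2"]:
        confidence += config["magnitude_bonus_2"]
    return direction, min(confidence, 1.0)

BEST_CONFIG = {  # values from Exp 2's best-config.json
    "direction_threshold": 1.295, "confidence_base": 0.443,
    "magnitude_thresh_1": 3.020, "magnitude_thresh_2": 5.076,
    "magnitude_bonus_1": 0.120, "magnitude_bonus_2": 0.136,
}
```

With the optimized threshold of 1.295, weak signals default to "stable", which matches the Key Insight that the optimizer raised the bar for directional calls.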

Phase 4: Attestation Upgrade
Required — attestation_upgrade.py --upgrade

Evaluates evidence quality backing each score. Upgrades attestation metadata based on source diversity, recency, and corroboration. Tracks the provenance chain from raw source to scored assertion.

Phase 5: Nightly Insights
Required — nightly-insights.py --schedule all --output file

Runs 7 SPARQL queries (weekly movers, tier transitions, dimension anomalies, distribution-community gaps, predictive signals, attestation coverage, platform vs pure-play) against the Oxigraph store. Produces a timestamped markdown insight digest.
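An illustrative "weekly movers" query in the spirit of this library, held here as a Python string constant. The property names (sbpi:company, sbpi:week, sbpi:score) and namespace are assumptions; the real terms are defined in sbpi.ttl and the production queries live in the .rq files.

```python
# Hypothetical weekly-movers query: rank companies by score delta
# between consecutive weekly snapshots in the Oxigraph store.
WEEKLY_MOVERS = """
PREFIX sbpi: <https://shuriq.example/sbpi#>
SELECT ?company ((?curr - ?prev) AS ?delta) WHERE {
  ?a sbpi:company ?company ; sbpi:week ?w1 ; sbpi:score ?curr .
  ?b sbpi:company ?company ; sbpi:week ?w2 ; sbpi:score ?prev .
  FILTER(?w1 = ?w2 + 1)
}
ORDER BY DESC(?delta)
"""
```

The other six queries in the library follow the same pattern: a SELECT over the weekly snapshots, filtered to the analytical angle (tier transitions, dimension anomalies, and so on).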

Phase 6: KG Interface Optimization (Exp 2)
Advisory — kg_interface_optimizer.py --nightly

Re-runs 30-trial TPE optimization against expanded historical data. Writes improved parameters to best-config.json if a better configuration is found. This is the core autoresearch loop from Markovick et al.

Phase 7: Event Impact Analysis (Track A)
Advisory — event_impact_analyzer.py --nightly

Per-company event impact reports. Researches news, deals, and app store movements. Scores impact across 5 SBPI dimensions. Classifies events as MATERIAL, MONITORING, or NOISE. Last run analyzed 22 companies with 3 material events detected.

Phase 8: Defensive BI Recommendations (Track B)
Advisory — defensive_bi_agent.py --nightly

Generates mitigation strategies for MATERIAL impact events from Track A. Filters for strategic relevance to prevent reactive noise. Only triggers when Track A identifies events worth defending against.

Phase 9: Signal Weight Optimization (Track C)
Advisory — signal_weight_optimizer.py --nightly

TPE autoresearch loop specifically for signal weighting in the BI agent output. Re-optimizes only when new accuracy labels are available. Prevents reactive noise from accumulating in the BI recommendations.

Data Flow Architecture From raw sources to scored predictions
SerpAPI / Manual Research
    |
    v
SBPI Scoring (5 dimensions x 17 companies)
    |                                           sbpi_to_rdf.py
    v
RDF Triples (sbpi.ttl ontology)  ---------->  Oxigraph Store (2,588 triples)
    |                                               |
    v                                               v
SPARQL Queries (7 query library)          KG Interface (12 params)
    |                                               |
    v                                               v
Insight Digest (nightly-insights.py)      Prediction Engine (multi-signal)
    |                                               |
    v                                               v
Markdown Reports                          TPE Optimization (30 trials/night)
    |                                               |
    v                                               v
insights/ directory                       best-config.json
    |                                               |
    +----------- Weekly Editorial ----------+-------+
                                            |
                                     Event Impact (SerpAPI)
                                            |
                                     Defensive BI Agent
The $7 Economics: The "slow model" cost structure means each internal autoresearch report costs approximately $7 in compute. At 2,000+ reports/month, the internal research pipeline runs for under $14,000/month while generating proprietary knowledge graph assets that compound in value. This decouples IP growth from the client revenue cycle.
Infrastructure Stack
Component | Technology | Role
RDF Store | Oxigraph (local, port 7878) | SPARQL endpoint for knowledge graph queries
Ontology | sbpi.ttl (Turtle/RDF) | 5-dimension scoring schema + attestation model
Optimizer | Optuna TPE (Python) | Tree-structured Parzen Estimator for parameter search
ETL | Python (sbpi_to_rdf.py) | Scoring data → RDF triples → Oxigraph
Research | SerpAPI + Claude CLI | Event research and impact scoring
Query Library | SPARQL (.rq files) | 7 analytical queries (movers, anomalies, signals, etc.)
Scheduler | Python (weekly-prediction-cycle.py) | 9-phase orchestrator
Reporting | Cloudflare Pages | Static editorial sites (sbpi-semantic-layer.pages.dev)
Gemini Session: Scaling Proposals From "Working With Gemini Session on ShurIQ IP and K-Pop Stack Ranking and Auto Research" — proposals for scaling from 96 nodes to billion-node territory

The Gemini brainstorming session identified three new experiment concepts that extend the current 5-experiment autoresearch expansion plan. These proposals target the "hyper-scale" thesis: proving that ShurIQ's knowledge graph, grown via automated research at $7/report, becomes a moat that compounds independent of client revenue.

Recursive Triple Expansion

Source: Gemini Session — "The $7 Flywheel" / Karpathy Auto-Research Method

Deploy a locally hosted quantized model (Llama-3-70B class) to crawl Common Crawl, Semantic Scholar, and industry-specific feeds 24/7. Each $7 processing run extracts entities, relationships, and ontological tags based on the ShurIQ schema. An "Ontological Referee" agent checks extractions against the existing graph for redundancy or contradictions before committing to the permanent store.

Target Scale: From 96 nodes / 268 edges (current, Issue No. 3) to 1B+ nodes
Economics: $7/report × 2,000/month = $14K/month for 200K new nodes/month
Variable: At 100 nodes/briefing extraction density, ~12B nodes in first year at full scale

Self-Consistency Validation

Source: Gemini Session — "Quantifying the IP for Licensing"

Run thousands of "Self-Consistency" tests: ask the system to solve a problem using its KG (non-parametric, curated) vs. its parametric memory (raw LLM). The delta in accuracy is the "Ontological Premium" — the measurable value of the curated knowledge graph over vanilla LLM output. This turns the knowledge graph from an abstract asset into a quantifiable competitive advantage.
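The metric itself is simple once the paired test results exist. A minimal sketch, assuming each test yields a correctness flag for the KG-grounded answer and one for the parametric answer:

```python
def ontological_premium(paired_results):
    """Accuracy delta between KG-grounded and parametric answers over the
    same question set. Each item is (kg_correct, parametric_correct),
    with 1 for correct and 0 for incorrect."""
    n = len(paired_results)
    kg_acc = sum(kg for kg, _ in paired_results) / n
    llm_acc = sum(llm for _, llm in paired_results) / n
    return kg_acc - llm_acc
```

The hard part of the experiment is not this arithmetic but generating thousands of gradeable question pairs per domain; the premium is only meaningful if both answer modes see identical questions.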

Hypothesis: "ShurIQ users experience 90% fewer hallucinations than vanilla GPT-4"
Verification: Karpathy-style self-consistency testing across domains
Output: "Inference Premium" metric — the measurable lift in report quality when using KG vs. raw LLM

Ontological Referee Loop

Source: Gemini Session — "The Stack Rank Weekly"

A two-agent quality gate for the Recursive Triple Expansion pipeline. Agent 1 (the "Slow Processor") extracts entities and relationships. Agent 2 (the "Referee") checks extractions against the existing graph across three dimensions: redundancy (already known), contradiction (conflicts with existing triples), and bridge value (connects previously disconnected clusters). Only high-value nodes that score above a bridge-value threshold get committed to the permanent graph.

KPIs: Extraction Yield (facts/report), Ontology Breadth (unique classes), Inference Premium (quality lift), Amortized Extraction Cost (trending → $0)
Gate Logic: Propose → Cross-Reference → Stack Rank by bridge value → Human approval for top nodes → Commit
Goal: Ensure billions of nodes are signal, not noise
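The gate's three-way classification can be sketched over a toy triple store. The bridge-value heuristic below (counting how many existing nodes the new triple touches) is a deliberately crude stand-in for real cluster analysis, and treating predicates as single-valued for contradiction checks is an assumption.

```python
def referee(candidate, graph, bridge_threshold=2):
    """Classify a candidate triple against the existing graph.
    `graph` is a set of (subject, predicate, object) tuples."""
    s, p, o = candidate
    if candidate in graph:
        return "redundant"
    # Contradiction: same subject+predicate but a different object,
    # assuming predicates are single-valued (a simplification).
    if any(gs == s and gp == p and go != o for gs, gp, go in graph):
        return "contradiction"
    # Bridge value stand-in: how many existing nodes does the new
    # triple connect to? Real cluster analysis would replace this.
    touched = {node for t in graph for node in (t[0], t[2]) if node in (s, o)}
    return "bridge" if len(touched) >= bridge_threshold else "low-value"
```

Only "bridge" candidates would pass to the stack-rank and human-approval steps of the gate; "low-value" extractions are held back so scale stays signal, not noise.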
The "Hyper-Scale" Variable Simulation From Gemini session economic modeling
Metric | Current (Issue 3) | Hyper-Scale Goal | Mechanism
Node Count | 96 nodes / 268 edges | 1B+ nodes | Automated crawling via $7 local slow-model extraction
Accuracy | 69.9% (directional) | 85%+ | Experiments 3-5: MOTPE, dimension weights, temporal decay
Verticals | 1 (micro-drama) | 10-20 verticals | Experiment 6: cross-vertical transfer (K-Pop next)
Internal Reports | ~3/week (manual) | 24,000+/year | $7 per report, fully automated pipeline
The L2 Thesis: Following Scott Galloway's L2 model — the stack ranking publication creates a "prestige intelligence engine" where 5-10 premium clients pay $1-5M/year, while 100+ brands pay $10-100K for the ranking and guidance. The knowledge graph IP compounds underneath this revenue model, creating a feedback loop where each client engagement deposits validated nodes into the permanent graph. By Year 5, the service revenue is healthy, but the calculated knowledge graph asset value reaches $600M under conservative valuation assumptions.
Experiment 6: K-Pop Vertical — Cross-Vertical Transfer Warm-starting from micro-drama optimized parameters
12
Parameters to Transfer
From micro-drama best-config.json
K-Pop
Target Vertical
AI Agents already done; K-Pop is next
40-60%
Trial Reduction Target
From 30 to 12-18 trials
+5-10%
Ceiling Accuracy Gain
vs. cold-start random initialization
Transfer Architecture

What Transfers (from Micro-Drama)

  • direction_threshold: 1.295 — bar for calling directional moves
  • confidence_base: 0.443 — default confidence calibration
  • mean_reversion_rate: 0.257 — reversion signal strength
  • divergence_weight: 0.180 — inter-dimension gap signal
  • tier_proximity_weight: 0.096 — boundary effects
  • anomaly_contributes: True — anomaly signal activation
  • + 6 magnitude/consistency thresholds and bonuses

What's New (K-Pop Specific)

  • K-Pop-specific edge types: Fandom metrics, comeback cycles, group/agency relationships
  • DART financial data: Korean financial disclosure system for agency revenue
  • Sentiment layer: Fan community sentiment from Weverse, Twitter/X, Naver
  • Dimension semantics differ: "Distribution" maps to multi-platform presence differently in K-Pop
  • Dimension weights will NOT transfer (0.25/0.20/0.20/0.20/0.15 are micro-drama specific)
Central Hypothesis
"Community Strength predicts Touring Revenue" — In K-Pop, the community dimension (fan engagement, fandom mobilization, social proof) should be the strongest predictor of revenue outcomes, unlike micro-drama where distribution power dominates. If the 12 interface parameters transfer while dimension weights require recalibration, it proves the SBPI methodology captures structural market dynamics that generalize across entertainment verticals.
Transfer Methodology Based on BOLT framework (Zeng et al. 2025, arXiv:2503.08131)
Step 1: Copy micro-drama best-config.json
Source config: 12 parameters optimized over 30 TPE trials on 3 weeks of data
Step 2: Generate 5-10 neighboring configurations
Gaussian perturbation, σ = 0.1 of each parameter's range
Step 3: Seed K-Pop Optuna study via enqueue_trial()
Warm-start the optimizer with the micro-drama trajectory, not just the best point
Step 4: Run TPE on K-Pop data
Measure trials-to-convergence vs. cold-start baseline
Step 5: Ablation study — which parameter subsets transferred
Test interface params vs. dimension weights separately. Publish transfer-log.json.
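The perturbation step can be sketched without Optuna. `perturb` below generates the neighboring configurations; in the pipeline, each resulting dict would be passed to `study.enqueue_trial()` before TPE takes over. The parameter ranges are illustrative assumptions, not the actual search-space bounds.

```python
import random

# Illustrative subset of the 12-parameter space (ranges are assumptions).
PARAM_RANGES = {
    "direction_threshold": (0.1, 2.0),
    "confidence_base": (0.3, 0.9),
    "mean_reversion_rate": (0.0, 0.5),
}

def perturb(source_config, n=8, sigma_frac=0.1, seed=0):
    """Gaussian perturbation around the micro-drama config, with
    sigma = 0.1 of each parameter's range, clipped back into range."""
    rng = random.Random(seed)
    neighbors = []
    for _ in range(n):
        cfg = {}
        for name, (lo, hi) in PARAM_RANGES.items():
            sigma = sigma_frac * (hi - lo)
            value = rng.gauss(source_config[name], sigma)
            cfg[name] = min(max(value, lo), hi)
        neighbors.append(cfg)
    return neighbors
```

Seeding the study with a cloud of neighbors rather than the single best point gives TPE an immediate local model of the transferred region, which is what the 40-60% trial-reduction target depends on.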
Success Criteria
Criterion | Threshold | Why It Matters
Trials-to-convergence reduction | ≥ 40% fewer trials than cold-start | Proves the optimizer "remembers" across verticals
Ceiling accuracy delta | ≥ 5% higher than cold-start ceiling | Warm-start reaches a better optimum, not just faster
Interface param stability | ≤ 20% drift from micro-drama values | Confirms structural parameters are domain-agnostic
Dimension weight divergence | Significant divergence expected | Confirms weights are domain-specific (validates the split)
Dependencies and Risks
Risk | Impact | Mitigation
Source config is overtuned | Propagates degeneracy to K-Pop | Exp 1 (Goodhart Guard) must clear source config first
K-Pop dimension semantics too different | Interface params don't transfer | Ablation study separates interface from dimension parameters
Insufficient K-Pop data | Can't evaluate predictions | Need 4+ weeks of K-Pop scoring data before starting
Experiments 1-4 not stable on source vertical | Transferring from a moving target | Sequential execution order: Exp 1 → 3 → 4 → 5 → 6