SHUR IQ Experiment Lab

Autoresearch Pipeline Status
Last updated: 2026-03-28 | Branch: feature/semantic-layer-sbpi
10
Total Experiments
2 complete, 3 planned, 1 starting, 1 blocked, 3 proposed
69.9%
Best Accuracy
Exp 2: TPE optimization (30 trials)
+46.3 ppt
Improvement
From 23.5% (KG-augmented baseline)
75%
Next Target
Exp 3: MOTPE Multi-Objective
Accuracy Trajectory: directional prediction accuracy across experiment iterations
Exp 0 (Baseline Methods, best: mean reversion): 47.1%
Exp 2 (Markovick TPE Optimization): 69.9%
Targets: 75% (next), 85% (goal)
Nightly Cycle Status 9-phase automated pipeline
Active — running nightly via weekly-prediction-cycle.py
Phase 1 ETL Load Oxigraph ingest
Phase 2 Accuracy Check Prior predictions
Phase 3 Predict Multi-signal
Phase 4 Attest Evidence quality
Phase 5 Insights SPARQL digest
Phase 6 KG Optimize TPE (Exp 2)
Phase 7 Event Impact Track A: BI Agent
Phase 8 Defensive BI Track B: Mitigations
Phase 9 Signal Weights Track C: Autoresearch
Legend: Required phase · Advisory phase
Latest Insight Digest 2026-03-28 10:24 — Weekly movers + predictive signals
DramaBox +4.0 | Tier 1 | 82.75 — $500M valuation signal, SE Asia fastest-growing, ONLY profitable pure-play
JioHotstar +3.95 | Tier 2 | 62.25 — IPL launch imminent, 300M subscriber leverage
COL/BeLive +3.15 | Tier 3 | 44.55 — FILMART launch converts to execution, SaaS provable
Disney +2.3 | Tier 1 | 76.55 — Locker Diaries #1, DramaBox Accelerator investment
ReelShort -2.05 | Tier 1 | 82.0 — Production head defection, ShortMax 3888% growth eroding position
Netflix -2.0 | Tier 2 | 60.8 — No production activity, mobile engagement gap widening
Amazon -2.6 | Tier 3 | 50.2 — ONLY major platform with zero microdrama strategy
KLIP -2.65 | Tier 4 | 22.35 — Structural squeeze from JioHotstar

Predictive Signals:
BULLISH: JioHotstar (+9.45), COL/BeLive (+7.25), Disney (+5.55), DramaBox (+5.25), GoodShort (+4.5)
BEARISH: Amazon (-5.8), Netflix (-5.0), ReelShort (-2.6)
Full Experiment Registry: all experiments (complete, active, planned, and proposed)
Experiment | Status | Methodology | Key Metric | Data Req. | Timeline
Exp 0: Baseline Methods | Complete | Persistence, naive momentum, mean reversion, KG-augmented | 47.1% accuracy | 3 weeks | W10-W11

Description

Four baseline prediction methods tested against the first 3 weeks of SBPI data. Established the performance floor that all subsequent experiments must beat.

Results

  • Persistence: 23.5% directional accuracy
  • Naive momentum: 23.5%
  • Mean reversion: 47.1% (best baseline)
  • KG-augmented: 23.5% (default parameters)
Exp 2: Markovick TPE Optimization | Complete | 12-param TPE via Optuna (Markovick et al. 2025) | 69.9% accuracy | 3 weeks | W11-W12

Description

Applied Markovick et al. (arXiv:2505.24478v1) methodology: treat the KG-to-prediction interface as a 12-parameter search space. Tree-structured Parzen Estimator optimization over 30 trials to maximize directional accuracy on historical data.
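The shape of this loop can be sketched without the real optimizer. The pipeline uses Optuna's TPE sampler over 12 parameters; below, a seeded random search stands in so the objective/search structure is visible. Parameter names come from best-config.json, but the ranges and the toy objective are illustrative assumptions, not the actual search space in kg_interface_optimizer.py.

```python
import random

# Illustrative subset of the 12-parameter space (ranges are assumptions).
SEARCH_SPACE = {
    "direction_threshold": (0.1, 2.0),
    "confidence_base": (0.3, 0.9),
    "mean_reversion_rate": (0.0, 0.5),
    "divergence_weight": (0.0, 0.3),
}

def evaluate(params, history):
    """Stand-in objective: directional accuracy on historical transitions.
    Each history item is (actually_moved, blended_signal)."""
    hits = sum(1 for actual, signal in history
               if (signal > params["direction_threshold"]) == actual)
    return hits / len(history)

def optimize(history, n_trials=30, seed=0):
    """Random-search stand-in for the 30-trial TPE loop."""
    rng = random.Random(seed)
    best_score, best_params = -1.0, None
    for _ in range(n_trials):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}
        score = evaluate(params, history)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params
```

In the real loop, `evaluate` replays predictions against the historical transition pairs and TPE (not random sampling) proposes each trial's parameters.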

Key Findings

  • Best score: 0.6986 (trial 28 of 30)
  • direction_threshold shifted from 0.5 to 1.295 (+159%)
  • mean_reversion_rate increased from 0.1 to 0.257 (+157%)
  • New signals activated: divergence_weight (0.180), tier_proximity_weight (0.096)
  • anomaly_contributes flipped from False to True
Exp 1: Goodhart Guard | Planned | Overtuning detection + early stopping (Schneider et al. 2025) | Protective (degenerate rate) | 2 weeks | Week 0-1

Description

Implements early stopping and default-baseline comparison to detect overtuning in the nightly TPE loop. With 30 trials on 51 observations, the trials-per-data-point ratio of 0.59 exceeds the safe threshold of 0.3 identified by Schneider et al.
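A minimal sketch of the guard logic, under the assumption that it reduces to two checks: flag runs whose trials-per-data-point ratio exceeds the safe threshold, and fall back to the default-parameter baseline when no safely-truncated trial beats it. Function names are hypothetical; the actual implementation is not shown in this document.

```python
def overtuning_risk(n_trials: int, n_observations: int, safe_ratio: float = 0.3) -> bool:
    """Flag runs where trials-per-data-point exceeds the safe threshold.
    The current nightly setting (30 trials on 51 observations) gives 0.59."""
    return n_trials / n_observations > safe_ratio

def guarded_best(trial_scores, default_score, n_observations, safe_ratio=0.3):
    """Early-stop the trial sequence at the safe ratio, then keep the
    optimized config only if it beats the default-parameter baseline."""
    max_trials = int(safe_ratio * n_observations)
    kept = trial_scores[:max_trials]
    best = max(kept, default=default_score)
    return best if best > default_score else default_score
```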

Expected Outcome

  • Detect ~10% of nightly runs producing overtuned configs
  • Prevent 2-5 ppt accuracy drops on unseen weeks
  • Adds ~30 seconds to Phase 6
  • Must run before Experiments 3-5
Exp 3: MOTPE Multi-Objective | Planned | Multi-objective TPE: accuracy + Brier + MAE (Barker et al. 2025) | Target: 75%+ | 4 weeks | Week 1-2

Description

Replace single-objective TPE with Optuna MOTPESampler. Optimize jointly over directional accuracy (maximize), Brier score (minimize), and MAE (minimize). Produces a Pareto front of non-dominated configurations instead of a single best point.
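The core concept behind the Pareto front can be sketched independently of Optuna. A configuration survives only if no other configuration is at least as good on all three objectives (accuracy up, Brier down, MAE down) and strictly better on one:

```python
def dominates(a, b):
    """True if objectives a dominate b, where each is a tuple
    (accuracy, brier, mae): accuracy maximized, the rest minimized."""
    acc_a, brier_a, mae_a = a
    acc_b, brier_b, mae_b = b
    no_worse = acc_a >= acc_b and brier_a <= brier_b and mae_a <= mae_b
    strictly_better = acc_a > acc_b or brier_a < brier_b or mae_a < mae_b
    return no_worse and strictly_better

def pareto_front(configs):
    """Keep non-dominated configurations, as a multi-objective study would."""
    return [c for c in configs
            if not any(dominates(o["objectives"], c["objectives"])
                       for o in configs)]
```

In Optuna this corresponds to `create_study(directions=["maximize", "minimize", "minimize"])`, after which `study.best_trials` returns the non-dominated set rather than a single best point.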

Expected Outcome

  • Brier score improvement of 10-20% from ~0.25 baseline
  • Accuracy stable or +3-8%
  • Resists "predict stable everywhere" degenerate solution
  • ~40 lines code change in kg_interface_optimizer.py
Exp 4: Dimension Weight Optimization | Planned | TPE over dimension weights (Lu et al. 2025, Wakayama 2024) | +5-15% relative | 6 weeks | Week 4-6

Description

Current dimension weights (Distribution 0.25, Content 0.20, Narrative 0.20, Community 0.20, Monetization 0.15) are set by intuition. This experiment adds 4 free weight parameters to the TPE search space (12 → 16 params), with the 5th constrained to sum-to-1.
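The sum-to-1 constraint can be handled by letting the optimizer suggest 4 free weights and deriving the 5th, pruning trials where the derived weight is infeasible. This is one standard way to encode the constraint; the floor value and the pruning choice are assumptions, not decisions documented here.

```python
def dimension_weights(free_weights, floor=0.05):
    """Given 4 free weights (e.g. TPE-suggested), derive the 5th so the
    5-dimension vector sums to 1. Returns None when the derived weight
    falls below the floor, so the optimizer can prune the trial."""
    if len(free_weights) != 4:
        raise ValueError("expected 4 free weights")
    fifth = 1.0 - sum(free_weights)
    if fifth < floor:
        return None  # infeasible trial: prune rather than clip
    return list(free_weights) + [fifth]
```

With the current intuition-set weights as input, the derived fifth weight reproduces the existing Monetization value of 0.15.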

Expected Outcome

  • 5-15% relative accuracy improvement from static weight optimization
  • Additional 3-8% from covariate-dependent weights (Phase 2)
  • Requires MOTPE (Exp 3) to be active
  • ~80 lines across optimizer + sbpi_to_rdf.py
Exp 5: Temporal Decay Signal | Blocked | Exponential temporal decay (Gastinger et al. 2024) | +8-15% relative | 8 weeks | Week 8-10

Description

Add exponential temporal decay weighting: recent weeks contribute more to predictions. Introduces 2 new parameters (temporal_decay_rate, temporal_lookback). Blocked until 8+ weeks of data exist. Currently at ~4 weeks.
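A sketch of the weighting scheme, assuming the two new parameters map onto a decay rate and a lookback window; the default values shown are illustrative, since the experiment has not run yet.

```python
import math

def decay_weights(n_weeks, decay_rate=0.3, lookback=8):
    """Exponential decay weights over the most recent `lookback` weeks,
    normalized to sum to 1. Age 0 is the most recent week, so recent
    weeks contribute more to the prediction."""
    window = min(n_weeks, lookback)
    raw = [math.exp(-decay_rate * age) for age in range(window)]
    total = sum(raw)
    return [w / total for w in raw]
```

In the pipeline's terms, `decay_rate` and `lookback` correspond to the proposed temporal_decay_rate and temporal_lookback parameters, which would join the TPE search space.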

Expected Outcome

  • 8-15% relative accuracy improvement
  • Captures market momentum recency bias
  • Blocked until ~late April 2026 (W18+)
  • ~120 lines across prediction engine + optimizer
Exp 6: Cross-Vertical Transfer (K-Pop) | Starting | Warm-start from micro-drama params (Zeng et al. 2025) | 40-60% fewer trials | 4 weeks target | Week 10-12

Description

Transfer the 12 optimized parameters from the micro-drama vertical to warm-start a K-Pop vertical. Tests whether the SBPI methodology generalizes across entertainment domains. Adapted from BOLT (Zeng et al. 2025) multi-task Bayesian optimization.

Expected Outcome

  • 40-60% reduction in trials-to-convergence
  • 5-10% higher ceiling accuracy vs cold-start
  • Validates platform thesis (methodology transfers across verticals)
  • ~200 lines new code (cross_vertical_transfer.py)
Recursive Triple Expansion | Proposed | Automated KG growth via $7 slow-model extraction | Billion-node scaling | N/A | Future

Description

From Gemini session: use a locally hosted quantized model (Llama-3-70B class) at $7/report to extract entities and relationships from crawled sources 24/7. Scale from 96 nodes / 268 edges to billion-node territory. An "Ontological Referee" agent checks extractions against the existing graph for redundancy.

Projected Economics

  • $7 per internal briefing (local compute cost)
  • 2,000+ briefings/month at negligible marginal cost
  • 100 nodes per briefing extraction density
  • 12B nodes/year at full scale
Self-Consistency Validation | Proposed | KG vs parametric memory accuracy delta | "Ontological Premium" | N/A | Future

Description

From Gemini session: run thousands of "Self-Consistency" tests — ask the system to solve a problem using its KG vs. its parametric (LLM) memory. The delta in accuracy is the "Ontological Premium" — the measurable value of the curated knowledge graph over vanilla LLM output.

Investor Value

  • Proves: "ShurIQ users experience 90% fewer hallucinations than vanilla GPT-4" (hypothesis)
  • Quantifies the IP value of the knowledge graph directly
  • Enables licensing model: per-query or per-vertical access
  • Turns "consulting" narrative into "pre-computed intelligence" narrative
Ontological Referee Loop | Proposed | Redundancy detection + ontology quality scoring | Extraction precision | N/A | Future

Description

From Gemini session: a second, faster agent checks proposed KG extractions against the existing graph to identify: (a) redundant data (already known), (b) contradictions (conflicts with existing triples), (c) high-value bridge nodes (connect previously disconnected clusters). Only high-value nodes are baked into the permanent graph.

Quality Metrics

  • Extraction yield: high-fidelity facts per report
  • Ontology breadth: unique classes in schema
  • Inference premium: KG-augmented quality lift
  • Amortized extraction cost trending toward zero
Experiment 0: Baseline Results 4 methods tested on 3 weeks of SBPI data (W10-W12, 17 companies)
Method | Dir. Accuracy | MAE | Brier Score | Notes
Persistence | 23.5% | 1.803 | 0.250 | Predicts no change. Floor performance.
Naive Momentum | 23.5% | 1.803 | 0.279 | Extends last-week direction. No improvement over persistence.
Mean Reversion | 47.1% | 2.107 | 0.250 | Best baseline. Companies tend to revert toward tier mean.
KG-Augmented (defaults) | 23.5% | 1.803 | 0.250 | Default parameters leave significant accuracy on the table.
Experiment 2: TPE Optimization Results 30 Optuna TPE trials on 2 transition pairs (W10→W11, W11→W12). Best trial: #28.
69.9%
Best Accuracy
Trial 28 of 30
63.4%
Mean Across Trials
σ = 3.32%
57.5%
Worst Trial
Trial 4
2,588
KG Triples
In Oxigraph store
[Chart: per-trial accuracy across the 30 TPE trials, y-axis 55-70%; best trial (0.6986) highlighted against the other trials]
Optimized Configuration: 12 parameters from best-config.json, delta from Exp 0 defaults
Parameter | Exp 0 Default | Exp 2 Optimized | Delta | Interpretation
direction_threshold | 0.500 | 1.295 | +159% | Higher bar for calling a directional move. Reduces false positives.
confidence_base | 0.600 | 0.443 | -26% | Lower base confidence. System is more cautious by default.
magnitude_thresh_1 | 3.000 | 3.020 | +1% | Near-default. Magnitude thresholds were already reasonable.
magnitude_thresh_2 | 5.000 | 5.076 | +2% | Near-default.
consistency_thresh | 2.000 | 1.980 | -1% | Near-default.
magnitude_bonus_1 | 0.100 | 0.120 | +20% | Slightly rewards larger moves.
magnitude_bonus_2 | 0.100 | 0.136 | +36% | Larger bonus for big moves. System learns big moves are informative.
consistency_bonus | 0.050 | 0.040 | -20% | Consistency signal matters less than expected.
mean_reversion_rate | 0.100 | 0.257 | +157% | Strong mean reversion signal. Companies tend to revert toward tier means.
anomaly_contributes | False | True | changed | Anomaly signal activated. Dimension-composite gaps are predictive.
divergence_weight | 0.000 | 0.180 | new signal | Inter-dimension divergence is informative (18% weight).
tier_proximity_weight | 0.000 | 0.096 | new signal | Proximity to tier boundaries is predictive (9.6% weight).
Key Insight: The two largest parameter shifts (direction_threshold +159%, mean_reversion_rate +157%) point to the same conclusion: the micro-drama competitive landscape is dominated by reversion dynamics, not momentum. Companies overshoot in both directions and pull back. The optimizer also activated three previously dormant signals (anomaly, divergence, tier proximity), confirming that the knowledge graph structure contains predictive information that raw scoring misses.
9-Phase Nightly Prediction Cycle Orchestrated by weekly-prediction-cycle.py — all phases sequential, advisory phases non-blocking
Phase 1: ETL Load
Required — sbpi_to_rdf.py --all --validate

Loads new week's SBPI scoring data into the Oxigraph RDF store. Validates triples against the SBPI ontology (sbpi.ttl). Currently processing 2,588 triples across 17 companies, 5 dimensions, 3 weekly snapshots.
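The scoring-data-to-triples step can be sketched as a small serializer. The namespace, class, and property names below are placeholders for illustration; the real schema is defined in sbpi.ttl and the real ETL lives in sbpi_to_rdf.py.

```python
def score_to_turtle(company_slug, week, dimension, score,
                    base="https://shuriq.example/sbpi#"):
    """Serialize one SBPI dimension score as Turtle triples.
    All term names here are hypothetical stand-ins for the sbpi.ttl schema."""
    subject = f"<{base}{company_slug}-W{week}-{dimension}>"
    return "\n".join([
        f"{subject} a <{base}DimensionScore> ;",
        f"    <{base}company> <{base}{company_slug}> ;",
        f"    <{base}week> {week} ;",
        f"    <{base}dimension> \"{dimension}\" ;",
        f"    <{base}score> {score:.2f} .",
    ])
```

Each weekly run emits one such block per company-dimension pair, which the loader validates and pushes into the Oxigraph store.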

Phase 2: Prediction Accuracy Check
Optional — prediction_engine.py --report

Compares previous week's predictions against actual outcomes. Feeds accuracy metrics into the optimization loop. Skipped if no prior predictions exist.

Phase 3: Prediction Generation
Required — prediction_engine.py --generate

Multi-signal prediction engine using the 12 optimized parameters from best-config.json. Generates directional predictions (up/down/stable) with confidence scores and magnitude estimates for each company.
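One plausible reading of how the optimized parameters gate a prediction, assuming the engine thresholds a blended signal score. The blending itself happens upstream in prediction_engine.py and is not shown here; this sketch is an interpretation, not the actual engine logic.

```python
def predict_direction(signal, config):
    """Map a blended signal score to (direction, confidence) using the
    interface parameters: direction_threshold gates the call, and the
    magnitude thresholds/bonuses adjust confidence for larger moves."""
    t = config["direction_threshold"]
    if abs(signal) <= t:
        return "stable", config["confidence_base"]
    direction = "up" if signal > 0 else "down"
    confidence = config["confidence_base"]
    if abs(signal) > config["magnitude_thresh_1"]:
        confidence += config["magnitude_bonus_1"]
    if abs(signal) > config["magnitude_thresh_2"]:
        confidence += config["magnitude_bonus_2"]
    return direction, min(confidence, 1.0)

BEST_CONFIG = {  # values from Exp 2's best-config.json
    "direction_threshold": 1.295, "confidence_base": 0.443,
    "magnitude_thresh_1": 3.020, "magnitude_thresh_2": 5.076,
    "magnitude_bonus_1": 0.120, "magnitude_bonus_2": 0.136,
}
```

With the optimized threshold of 1.295, weak signals default to "stable", which matches the Key Insight that the optimizer raised the bar for directional calls.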

Phase 4: Attestation Upgrade
Required — attestation_upgrade.py --upgrade

Evaluates evidence quality backing each score. Upgrades attestation metadata based on source diversity, recency, and corroboration. Tracks the provenance chain from raw source to scored assertion.

Phase 5: Nightly Insights
Required — nightly-insights.py --schedule all --output file

Runs 7 SPARQL queries (weekly movers, tier transitions, dimension anomalies, distribution-community gaps, predictive signals, attestation coverage, platform vs pure-play) against the Oxigraph store. Produces a timestamped markdown insight digest.
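An illustrative "weekly movers" query in the spirit of this library, held here as a Python string constant. The property names (sbpi:company, sbpi:week, sbpi:score) and namespace are assumptions; the real terms are defined in sbpi.ttl and the production queries live in the .rq files.

```python
# Hypothetical weekly-movers query: rank companies by score delta
# between consecutive weekly snapshots in the Oxigraph store.
WEEKLY_MOVERS = """
PREFIX sbpi: <https://shuriq.example/sbpi#>
SELECT ?company ((?curr - ?prev) AS ?delta) WHERE {
  ?a sbpi:company ?company ; sbpi:week ?w1 ; sbpi:score ?curr .
  ?b sbpi:company ?company ; sbpi:week ?w2 ; sbpi:score ?prev .
  FILTER(?w1 = ?w2 + 1)
}
ORDER BY DESC(?delta)
"""
```

The other six queries in the library follow the same pattern: a SELECT over the weekly snapshots, filtered to the analytical angle (tier transitions, dimension anomalies, and so on).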

Phase 6: KG Interface Optimization (Exp 2)
Advisory — kg_interface_optimizer.py --nightly

Re-runs 30-trial TPE optimization against expanded historical data. Writes improved parameters to best-config.json if a better configuration is found. This is the core autoresearch loop from Markovick et al.

Phase 7: Event Impact Analysis (Track A)
Advisory — event_impact_analyzer.py --nightly

Per-company event impact reports. Researches news, deals, and app store movements. Scores impact across 5 SBPI dimensions. Classifies events as MATERIAL, MONITORING, or NOISE. Last run analyzed 22 companies with 3 material events detected.

Phase 8: Defensive BI Recommendations (Track B)
Advisory — defensive_bi_agent.py --nightly

Generates mitigation strategies for MATERIAL impact events from Track A. Filters for strategic relevance to prevent reactive noise. Only triggers when Track A identifies events worth defending against.

Phase 9: Signal Weight Optimization (Track C)
Advisory — signal_weight_optimizer.py --nightly

TPE autoresearch loop specifically for signal weighting in the BI agent output. Re-optimizes only when new accuracy labels are available. Prevents reactive noise from accumulating in the BI recommendations.

Data Flow Architecture From raw sources to scored predictions
SerpAPI / Manual Research
    |
    v
SBPI Scoring (5 dimensions x 17 companies)
    |                                           sbpi_to_rdf.py
    v
RDF Triples (sbpi.ttl ontology)  ---------->  Oxigraph Store (2,588 triples)
    |                                               |
    v                                               v
SPARQL Queries (7 query library)          KG Interface (12 params)
    |                                               |
    v                                               v
Insight Digest (nightly-insights.py)      Prediction Engine (multi-signal)
    |                                               |
    v                                               v
Markdown Reports                          TPE Optimization (30 trials/night)
    |                                               |
    v                                               v
insights/ directory                       best-config.json
    |                                               |
    +----------- Weekly Editorial ----------+-------+
                                            |
                                     Event Impact (SerpAPI)
                                            |
                                     Defensive BI Agent
The $7 Economics: The "slow model" cost structure means each internal autoresearch report costs approximately $7 in compute. At 2,000+ reports/month, the internal research pipeline runs for under $14,000/month while generating proprietary knowledge graph assets that compound in value. This decouples IP growth from the client revenue cycle.
Infrastructure Stack
Component | Technology | Role
RDF Store | Oxigraph (local, port 7878) | SPARQL endpoint for knowledge graph queries
Ontology | sbpi.ttl (Turtle/RDF) | 5-dimension scoring schema + attestation model
Optimizer | Optuna TPE (Python) | Tree-structured Parzen Estimator for parameter search
ETL | Python (sbpi_to_rdf.py) | Scoring data → RDF triples → Oxigraph
Research | SerpAPI + Claude CLI | Event research and impact scoring
Query Library | SPARQL (.rq files) | 7 analytical queries (movers, anomalies, signals, etc.)
Scheduler | Python (weekly-prediction-cycle.py) | 9-phase orchestrator
Reporting | Cloudflare Pages | Static editorial sites (sbpi-semantic-layer.pages.dev)
Gemini Session: Scaling Proposals From "Working With Gemini Session on ShurIQ IP and K-Pop Stack Ranking and Auto Research" — proposals for scaling from 96 nodes to billion-node territory

The Gemini brainstorming session identified three new experiment concepts that extend the current 5-experiment autoresearch expansion plan. These proposals target the "hyper-scale" thesis: proving that ShurIQ's knowledge graph, grown via automated research at $7/report, becomes a moat that compounds independent of client revenue.

Recursive Triple Expansion

Source: Gemini Session — "The $7 Flywheel" / Karpathy Auto-Research Method

Deploy a locally hosted quantized model (Llama-3-70B class) to crawl Common Crawl, Semantic Scholar, and industry-specific feeds 24/7. Each $7 processing run extracts entities, relationships, and ontological tags based on the ShurIQ schema. An "Ontological Referee" agent checks extractions against the existing graph for redundancy or contradictions before committing to the permanent store.

Target Scale: From 96 nodes / 268 edges (current, Issue No. 3) to 1B+ nodes
Economics: $7/report × 2,000/month = $14K/month for 200K new nodes/month
Variable: At 100 nodes/briefing extraction density, ~12B nodes in first year at full scale

Self-Consistency Validation

Source: Gemini Session — "Quantifying the IP for Licensing"

Run thousands of "Self-Consistency" tests: ask the system to solve a problem using its KG (non-parametric, curated) vs. its parametric memory (raw LLM). The delta in accuracy is the "Ontological Premium" — the measurable value of the curated knowledge graph over vanilla LLM output. This turns the knowledge graph from an abstract asset into a quantifiable competitive advantage.
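The metric itself is simple once the paired test results exist. A minimal sketch, assuming each test yields a correctness flag for the KG-grounded answer and one for the parametric answer:

```python
def ontological_premium(paired_results):
    """Accuracy delta between KG-grounded and parametric answers over the
    same question set. Each item is (kg_correct, parametric_correct),
    with 1 for correct and 0 for incorrect."""
    n = len(paired_results)
    kg_acc = sum(kg for kg, _ in paired_results) / n
    llm_acc = sum(llm for _, llm in paired_results) / n
    return kg_acc - llm_acc
```

The hard part of the experiment is not this arithmetic but generating thousands of gradeable question pairs per domain; the premium is only meaningful if both answer modes see identical questions.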

Hypothesis: "ShurIQ users experience 90% fewer hallucinations than vanilla GPT-4"
Verification: Karpathy-style self-consistency testing across domains
Output: "Inference Premium" metric — the measurable lift in report quality when using KG vs. raw LLM

Ontological Referee Loop

Source: Gemini Session — "The Stack Rank Weekly"

A two-agent quality gate for the Recursive Triple Expansion pipeline. Agent 1 (the "Slow Processor") extracts entities and relationships. Agent 2 (the "Referee") checks extractions against the existing graph across three dimensions: redundancy (already known), contradiction (conflicts with existing triples), and bridge value (connects previously disconnected clusters). Only high-value nodes that score above a bridge-value threshold get committed to the permanent graph.

KPIs: Extraction Yield (facts/report), Ontology Breadth (unique classes), Inference Premium (quality lift), Amortized Extraction Cost (trending → $0)
Gate Logic: Propose → Cross-Reference → Stack Rank by bridge value → Human approval for top nodes → Commit
Goal: Ensure billions of nodes are signal, not noise
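The gate's three-way classification can be sketched over a toy triple store. The bridge-value heuristic below (counting how many existing nodes the new triple touches) is a deliberately crude stand-in for real cluster analysis, and treating predicates as single-valued for contradiction checks is an assumption.

```python
def referee(candidate, graph, bridge_threshold=2):
    """Classify a candidate triple against the existing graph.
    `graph` is a set of (subject, predicate, object) tuples."""
    s, p, o = candidate
    if candidate in graph:
        return "redundant"
    # Contradiction: same subject+predicate but a different object,
    # assuming predicates are single-valued (a simplification).
    if any(gs == s and gp == p and go != o for gs, gp, go in graph):
        return "contradiction"
    # Bridge value stand-in: how many existing nodes does the new
    # triple connect to? Real cluster analysis would replace this.
    touched = {node for t in graph for node in (t[0], t[2]) if node in (s, o)}
    return "bridge" if len(touched) >= bridge_threshold else "low-value"
```

Only "bridge" candidates would pass to the stack-rank and human-approval steps of the gate; "low-value" extractions are held back so scale stays signal, not noise.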
The "Hyper-Scale" Variable Simulation From Gemini session economic modeling
Metric | Current (Issue 3) | Hyper-Scale Goal | Mechanism
Node Count | 96 nodes / 268 edges | 1B+ nodes | Automated crawling via $7 local slow-model extraction
Accuracy | 69.9% (directional) | 85%+ | Experiments 3-5: MOTPE, dimension weights, temporal decay
Verticals | 1 (micro-drama) | 10-20 verticals | Experiment 6: cross-vertical transfer (K-Pop next)
Internal Reports | ~3/week (manual) | 24,000+/year | $7 per report, fully automated pipeline
The L2 Thesis: Following Scott Galloway's L2 model — the stack ranking publication creates a "prestige intelligence engine" where 5-10 premium clients pay $1-5M/year, while 100+ brands pay $10-100K for the ranking and guidance. The knowledge graph IP compounds underneath this revenue model, creating a feedback loop where each client engagement deposits validated nodes into the permanent graph. By Year 5, the service revenue is healthy, but the calculated knowledge graph asset value reaches $600M under conservative valuation assumptions.
Experiment 6: K-Pop Vertical — Cross-Vertical Transfer Warm-starting from micro-drama optimized parameters
12
Parameters to Transfer
From micro-drama best-config.json
K-Pop
Target Vertical
AI Agents already done; K-Pop is next
40-60%
Trial Reduction Target
From 30 to 12-18 trials
+5-10%
Ceiling Accuracy Gain
vs. cold-start random initialization
Transfer Architecture

What Transfers (from Micro-Drama)

  • direction_threshold: 1.295 — bar for calling directional moves
  • confidence_base: 0.443 — default confidence calibration
  • mean_reversion_rate: 0.257 — reversion signal strength
  • divergence_weight: 0.180 — inter-dimension gap signal
  • tier_proximity_weight: 0.096 — boundary effects
  • anomaly_contributes: True — anomaly signal activation
  • + 6 magnitude/consistency thresholds and bonuses

What's New (K-Pop Specific)

  • K-Pop-specific edge types: Fandom metrics, comeback cycles, group/agency relationships
  • DART financial data: Korean financial disclosure system for agency revenue
  • Sentiment layer: Fan community sentiment from Weverse, Twitter/X, Naver
  • Dimension semantics differ: "Distribution" maps to multi-platform presence differently in K-Pop
  • Dimension weights will NOT transfer (0.25/0.20/0.20/0.20/0.15 are micro-drama specific)
Central Hypothesis
"Community Strength predicts Touring Revenue" — In K-Pop, the community dimension (fan engagement, fandom mobilization, social proof) should be the strongest predictor of revenue outcomes, unlike micro-drama where distribution power dominates. If the 12 interface parameters transfer while dimension weights require recalibration, it proves the SBPI methodology captures structural market dynamics that generalize across entertainment verticals.
Transfer Methodology Based on BOLT framework (Zeng et al. 2025, arXiv:2503.08131)
Step 1: Copy micro-drama best-config.json
Source config: 12 parameters optimized over 30 TPE trials on 3 weeks of data
Step 2: Generate 5-10 neighboring configurations
Gaussian perturbation, σ = 0.1 of each parameter's range
Step 3: Seed K-Pop Optuna study via enqueue_trial()
Warm-start the optimizer with the micro-drama trajectory, not just the best point
Step 4: Run TPE on K-Pop data
Measure trials-to-convergence vs. cold-start baseline
Step 5: Ablation study — which parameter subsets transferred
Test interface params vs. dimension weights separately. Publish transfer-log.json.
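The perturbation step can be sketched without Optuna. `perturb` below generates the neighboring configurations; in the pipeline, each resulting dict would be passed to `study.enqueue_trial()` before TPE takes over. The parameter ranges are illustrative assumptions, not the actual search-space bounds.

```python
import random

# Illustrative subset of the 12-parameter space (ranges are assumptions).
PARAM_RANGES = {
    "direction_threshold": (0.1, 2.0),
    "confidence_base": (0.3, 0.9),
    "mean_reversion_rate": (0.0, 0.5),
}

def perturb(source_config, n=8, sigma_frac=0.1, seed=0):
    """Gaussian perturbation around the micro-drama config, with
    sigma = 0.1 of each parameter's range, clipped back into range."""
    rng = random.Random(seed)
    neighbors = []
    for _ in range(n):
        cfg = {}
        for name, (lo, hi) in PARAM_RANGES.items():
            sigma = sigma_frac * (hi - lo)
            value = rng.gauss(source_config[name], sigma)
            cfg[name] = min(max(value, lo), hi)
        neighbors.append(cfg)
    return neighbors
```

Seeding the study with a cloud of neighbors rather than the single best point gives TPE an immediate local model of the transferred region, which is what the 40-60% trial-reduction target depends on.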
Success Criteria
Criterion | Threshold | Why It Matters
Trials-to-convergence reduction | ≥ 40% fewer trials than cold-start | Proves the optimizer "remembers" across verticals
Ceiling accuracy delta | ≥ 5% higher than cold-start ceiling | Warm-start reaches a better optimum, not just faster
Interface param stability | ≤ 20% drift from micro-drama values | Confirms structural parameters are domain-agnostic
Dimension weight divergence | Significant divergence expected | Confirms weights are domain-specific (validates the split)
Dependencies and Risks
Risk | Impact | Mitigation
Source config is overtuned | Propagates degeneracy to K-Pop | Exp 1 (Goodhart Guard) must clear source config first
K-Pop dimension semantics too different | Interface params don't transfer | Ablation study separates interface from dimension parameters
Insufficient K-Pop data | Can't evaluate predictions | Need 4+ weeks of K-Pop scoring data before starting
Experiments 1-4 not stable on source vertical | Transferring from a moving target | Sequential execution order: Exp 1 → 3 → 4 → 5 → 6