RESEARCH

Methodology

How TruthProbe evaluates provider reliability

Our detection system combines multiple independent signals to assess whether an inference provider is serving the model you requested. No single signal is sufficient — we use a weighted ensemble approach for robust detection.

How it works in 30 seconds

✓ We purchase real accounts at each provider and send real requests weekly
✓ 3 independent detection signals; providers can't fake all of them simultaneously
✓ 100+ test requests per model per provider per week
✓ Zero financial relationship with any ranked provider

Multi-Signal Detection System

Each response is analyzed across three orthogonal dimensions. Signals are combined using a weighted ensemble to produce a final confidence score.

Signal weights: Text 45%, Timing 40%, Field 15%
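The weighted combination described above can be sketched as follows. This is a minimal illustration, not TruthProbe's actual implementation: the weights come from the text, but the function name `ensemble_score` and the assumption that each signal is already normalized to a 0-100 scale are hypothetical.

```python
# Weights as stated in the methodology. The 0-100 per-signal scale is an
# assumption made for this sketch.
WEIGHTS = {"text": 0.45, "timing": 0.40, "field": 0.15}


def ensemble_score(signals: dict[str, float]) -> float:
    """Combine per-signal scores (each assumed to be in 0..100) into one score."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    # Weighted sum; the weights add up to 1.0, so the result stays in 0..100.
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
```

For example, signal scores of 80 (text), 90 (timing), and 100 (field) combine to 0.45 × 80 + 0.40 × 90 + 0.15 × 100 = 87.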

Text Analysis (45% weight)

We evaluate the linguistic fingerprint of each response against known baselines for each model family.

- Vocabulary richness and token diversity metrics
- Reasoning depth scoring (chain-of-thought patterns, logical connectives)
- Stylistic markers unique to model families (sentence structure, hedging patterns)
- Semantic complexity relative to prompt difficulty
- Cross-reference against 10,000+ baseline responses per model
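As one illustration of a vocabulary richness metric, the sketch below computes a type-token ratio (distinct words divided by total words). The text does not say which diversity metric TruthProbe uses, so this choice, the regex tokenizer, and the function name `type_token_ratio` are all assumptions.

```python
import re


def type_token_ratio(text: str) -> float:
    """Share of distinct word tokens in a response (0.0 for empty text)."""
    # Simple word tokenizer (an assumption): lowercase letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

A raw type-token ratio falls as responses get longer, so in practice it would only be compared between responses of similar length or against a length-matched baseline.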

Timing Variance (40% weight)

Response latency distributions reveal model identity — each model architecture has a characteristic timing signature.

- Time-to-first-token (TTFT) distribution analysis
- Inter-token interval variance (coefficient of variation)
- Throughput characteristics per output length
- Comparison against known latency profiles for each model
- Detection of impossibly fast responses (model too small for claimed tier)
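Two of the timing measurements listed above can be sketched from the timestamps of a streamed response. This is a minimal sketch that assumes the client records a send time and one arrival time per token, in seconds; the function names are hypothetical.

```python
from statistics import mean, pstdev


def time_to_first_token(request_sent: float, first_token_at: float) -> float:
    """TTFT in seconds, given two timestamps taken from the same clock."""
    return first_token_at - request_sent


def inter_token_cv(arrival_times: list[float]) -> float:
    """Coefficient of variation (std / mean) of gaps between consecutive tokens."""
    gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three token timestamps")
    average_gap = mean(gaps)
    if average_gap <= 0:
        raise ValueError("timestamps must be increasing")
    # Population standard deviation of the gaps, normalized by their mean.
    return pstdev(gaps) / average_gap
```

Because the coefficient of variation is normalized by the mean gap, it describes how steady the token stream is rather than how fast it is.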

Model Field Verification (15% weight)

A basic but important signal — checking if the model identifier returned matches what was requested.

- Verify that the model field returned in the response matches the request
- Cross-reference with known model naming conventions
- Detect aliasing patterns (e.g., custom names mapping to different models)
- Low weight because providers can trivially spoof this field
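The field check above can be sketched as a normalized string comparison with an alias lookup. The alias table, the model names in it, and the function names are hypothetical placeholders; a real table would be built from each provider's documented naming conventions.

```python
from typing import Optional

# Hypothetical alias table mapping provider-specific names to canonical ones.
KNOWN_ALIASES = {"example-model-latest": "example-model-v2"}


def canonical_name(name: str) -> str:
    """Normalize case and whitespace, then resolve known aliases."""
    normalized = name.strip().lower()
    return KNOWN_ALIASES.get(normalized, normalized)


def model_field_matches(requested: str, returned: Optional[str]) -> bool:
    """True if the returned model identifier resolves to the requested one."""
    if returned is None:  # the provider omitted the field entirely
        return False
    return canonical_name(requested) == canonical_name(returned)
```

A match here only shows that the provider echoed a plausible name, which is why the signal carries the lowest weight.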

Score Interpretation

Provider reliability scores are computed weekly from aggregated signal data. Here is how to interpret them:

85+ Trusted — High confidence the provider serves claimed models faithfully. Minor discrepancies within normal variance.

70-84 Acceptable — Generally reliable, but occasional anomalies detected. May have specific models or time periods with lower confidence.

50-69 Concerning — Significant anomalies detected. Evidence suggests model substitution or inference degradation on some requests.

<50 Avoid — Strong evidence of systematic model substitution or inference degradation. Not recommended for production use.
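The score bands above translate directly into a lookup. In this sketch the function name `reliability_tier` is hypothetical, and two details are assumptions: scores lie on a 0-100 scale, and fractional scores fall into the band whose lower bound they meet (so 84.5 counts as Acceptable).

```python
def reliability_tier(score: float) -> str:
    """Map a provider reliability score (assumed 0..100) to its published tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        return "Trusted"
    if score >= 70:
        return "Acceptable"
    if score >= 50:
        return "Concerning"
    return "Avoid"
```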

Data Collection

TruthProbe uses two complementary approaches to gather detection data:

Passive Analysis

When users route traffic through TruthProbe, we analyze responses in real time, adding less than 50 ms of overhead. This provides continuous monitoring at scale.

Active Probes

We maintain paid accounts at each ranked provider and run standardized test suites weekly. Probe costs are funded entirely by TruthProbe — never passed to users. Each provider receives 100+ probe requests per model per week.

Update Frequency

Rankings are recalculated and published every Monday at 00:00 UTC. Active probes run continuously throughout the week. A provider's score reflects a rolling 4-week window of data, weighted toward the most recent week.
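A recency-weighted rolling window of this kind can be sketched as a weighted average over the last four weekly scores. The text only says that the most recent week is weighted more heavily; the specific weights below, and the names `RECENCY_WEIGHTS` and `rolling_score`, are illustrative assumptions.

```python
# Hypothetical weights, newest week first. The source does not publish the
# actual weighting, only that recent data counts for more.
RECENCY_WEIGHTS = (0.4, 0.3, 0.2, 0.1)


def rolling_score(weekly_scores: list[float]) -> float:
    """Weighted average of up to four weekly scores, ordered newest first."""
    window = weekly_scores[:4]
    if not window:
        raise ValueError("need at least one weekly score")
    weights = RECENCY_WEIGHTS[: len(window)]
    # Renormalize so providers with fewer than four weeks of data still get
    # a score on the same scale.
    return sum(w * s for w, s in zip(weights, window)) / sum(weights)
```

With these example weights, weekly scores of 90, 80, 70, and 60 (newest first) give 36 + 24 + 14 + 6 = 80, so a recent improvement pulls the score up faster than a plain four-week average of 75 would.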

Common Questions

Will my traffic data be used for testing?

No. User traffic is analyzed locally in real time for your personal audit dashboard. We never share or store your request content, and we never use it for ranking calculations.

What if a provider improves their service?

Scores use a rolling 4-week window weighted toward recent data. Genuine improvements are typically reflected in rankings within 1-2 weeks.

Why not just check the model name in the response?

Because providers can trivially set any string in that field. That's why model field verification is only 15% of our score — the other 85% comes from behavioral signals that are much harder to fake.

Limitations & Transparency

Text analysis is probabilistic — individual responses cannot be classified with certainty. Confidence emerges from aggregating many samples.

Timing analysis can be affected by network conditions and geographic routing. We account for this with regional baselines.

Providers can potentially adapt to known detection methods. We continuously update our detection signals.

New model releases require baseline calibration (typically 48-72 hours of data collection).

Scores represent aggregate behavior — a provider may perform differently on specific models or at different load levels.

TruthProbe has no financial relationship with any ranked provider. We purchase accounts at standard retail pricing. Our revenue comes exclusively from user subscriptions, not provider partnerships.