How TruthProbe evaluates provider reliability
Our detection system combines multiple independent signals to assess whether an inference provider is serving the model you requested. No single signal is sufficient — we use a weighted ensemble approach for robust detection.
Each response is analyzed across three orthogonal dimensions. Signals are combined using a weighted ensemble to produce a final confidence score.
Provider reliability scores are computed weekly from aggregated signal data. Here is how to interpret them:
85+ Trusted — High confidence the provider serves claimed models faithfully. Minor discrepancies within normal variance.
70-84 Acceptable — Generally reliable, but occasional anomalies detected. May have specific models or time periods with lower confidence.
50-69 Concerning — Significant anomalies detected. Evidence suggests model substitution or inference degradation on some requests.
<50 Avoid — Strong evidence of systematic model substitution or inference degradation. Not recommended for production use.
TruthProbe uses two complementary approaches to gather detection data:
Rankings are recalculated and published every Monday at 00:00 UTC. Active probes run continuously throughout the week. A provider's score reflects a rolling 4-week window of data, weighted toward the most recent week.
Will my traffic data be used for testing?
No. User traffic is analyzed locally in real-time for your personal audit dashboard. We never share, store, or use your request content for ranking calculations.
What if a provider improves their service?
Scores use a rolling 4-week window weighted toward recent data. Genuine improvements typically reflect in rankings within 1-2 weeks.
Why not just check the model name in the response?
Because providers can trivially set any string in that field. That's why model field verification is only 15% of our score — the other 85% comes from behavioral signals that are much harder to fake.
Text analysis is probabilistic — individual responses cannot be classified with certainty. Confidence emerges from aggregating many samples.
Timing analysis can be affected by network conditions and geographic routing. We account for this with regional baselines.
Providers can potentially adapt to known detection methods. We continuously update our detection signals.
New model releases require baseline calibration (typically 48-72 hours of data collection).
Scores represent aggregate behavior — a provider may perform differently on specific models or at different load levels.
TruthProbe has no financial relationship with any ranked provider. We purchase accounts at standard retail pricing. Our revenue comes exclusively from user subscriptions, not provider partnerships.