METACOGNITIVE BENCHMARK

Shoreline

Mapping where capability meets self-knowledge

Concrete: Phase 3 failure-aware depth.
Solid: Phase 2 verified depth (difficulty-weighted).
Sand: Phase 1 claim depth (confidence × difficulty).

anthropic/claude-sonnet-4.5

196 trials
0.5% invalid confidence
HOW THIS ISLAND IS FORMED

Each spoke is one category. Layer depth is computed from all trials in that category after normalizing difficulty (0-100 scale).

Sand = Phase 1 claimed depth. Solid = Phase 2 verified depth. Concrete = Phase 3 failure-aware depth (wrong + low confidence).

Aggregate gap metrics: Overconfidence = max(0, Sand - Solid), Underconfidence = max(0, Solid - Sand), Blind Spots = wrong + confident.
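
The gap formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function name `gap_metrics` is hypothetical, and the choice to average per-category gaps with equal weight (rather than weighting by trial count) is an assumption. It is used here because it reproduces the Overconfidence and Underconfidence values displayed for anthropic/claude-sonnet-4.5 from that model's per-category Sand and Solid depths.

```python
from statistics import mean

# (category, sand, solid) as reported for anthropic/claude-sonnet-4.5.
ROWS = [
    ("Multiplication", 2.1, 11.0),
    ("Modular Exp.", 21.4, 8.6),
    ("Boolean Circuits", 46.4, 77.4),
    ("Matrix Det.", 19.3, 10.9),
    ("Combinatorics", 58.0, 51.3),
    ("Random Gen.", 34.8, 51.1),
    ("Constrained Write", 13.9, 25.8),
    ("Sudoku Gen.", 27.1, 5.7),
    ("Distribution", 19.3, 4.7),
    ("Self-Referential", 27.1, 17.2),
    ("Counting", 50.3, 77.4),
]


def gap_metrics(rows):
    """Return (overconfidence, underconfidence) averaged over categories.

    Assumption: every category counts equally, regardless of trial count.
    """
    over = [max(0.0, sand - solid) for _, sand, solid in rows]
    under = [max(0.0, solid - sand) for _, sand, solid in rows]
    return mean(over), mean(under)


over, under = gap_metrics(ROWS)
print(f"Overconfidence {over:.1f}, Underconfidence {under:.1f}")
# prints: Overconfidence 6.7, Underconfidence 8.7
```

Because each category contributes either an overconfidence gap or an underconfidence gap (never both), a model can score nonzero on both aggregates when it overclaims in some categories and underclaims in others.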

Category           Trials   Sand  Solid  Concrete
Multiplication         20    2.1   11.0      11.0
Modular Exp.           22   21.4    8.6       8.6
Boolean Circuits       18   46.4   77.4       0.0
Matrix Det.            16   19.3   10.9      10.9
Combinatorics          20   58.0   51.3      34.9
Random Gen.            18   34.8   51.1      51.1
Constrained Write      12   13.9   25.8      25.8
Sudoku Gen.            14   27.1    5.7       5.7
Distribution           18   19.3    4.7       4.7
Self-Referential       12   27.1   17.2      17.2
Counting               26   50.3   77.4      77.4
[Island chart: one spoke per category, with Sand, Solid, and Concrete layers]
CONCRETE: 22.5 (failure-aware depth)
SOLID: 31.0 (verified depth)
SAND: 29.1 (claimed depth)
OVERCONFIDENCE: 6.7 (claimed beyond solid)
UNDERCONFIDENCE: 8.7 (solid beyond claimed)
BLIND SPOTS: 17.7 (wrong but confident)
TOTAL GAP: 33.1 (claim/perf misalignment + blind spots)
CAPABILITY IDX: 21.8 (normalized difficulty frontier)
DISCERNMENT: 72.1 (correctly detects success/failure)
CALIBRATION IDX: 80.4 (confidence vs realized accuracy)

openai/gpt-oss-120b

206 trials
29.1% invalid confidence

Category           Trials   Sand  Solid  Concrete
Multiplication         22   65.8    8.3       8.3
Modular Exp.           22   72.7    6.4       6.4
Boolean Circuits       24   68.1   50.8      50.8
Matrix Det.            16   63.4   10.9      10.9
Combinatorics          24   74.3   51.3      51.3
Random Gen.            18   72.0   11.5      11.5
Constrained Write      12   55.7    2.2       2.2
Sudoku Gen.            14   60.3    0.0       0.0
Distribution           18   74.3    4.3       4.3
Self-Referential       12   60.3    5.8       5.8
Counting               24   68.1   25.9      25.9
[Island chart: one spoke per category, with Sand, Solid, and Concrete layers]
CONCRETE: 16.1 (failure-aware depth)
SOLID: 16.1 (verified depth)
SAND: 66.8 (claimed depth)
OVERCONFIDENCE: 50.7 (claimed beyond solid)
UNDERCONFIDENCE: 0.0 (solid beyond claimed)
BLIND SPOTS: 24.4 (wrong but confident)
TOTAL GAP: 75.1 (claim/perf misalignment + blind spots)
CAPABILITY IDX: 14.2 (normalized difficulty frontier)
DISCERNMENT: 46.7 (correctly detects success/failure)
CALIBRATION IDX: 53.5 (confidence vs realized accuracy)
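
The Total Gap label ("claim/perf misalignment + blind spots") suggests it is the sum of Overconfidence, Underconfidence, and Blind Spots. The page does not state this formula outright, so the sketch below treats it as an inferred reading and simply checks it against the displayed values for both models. The `CARDS` dictionary and `total_gap` helper are hypothetical names introduced for illustration.

```python
# Aggregate values as displayed for each model (rounded to one decimal).
CARDS = {
    "anthropic/claude-sonnet-4.5": {
        "over": 6.7, "under": 8.7, "blind": 17.7, "total": 33.1,
    },
    "openai/gpt-oss-120b": {
        "over": 50.7, "under": 0.0, "blind": 24.4, "total": 75.1,
    },
}


def total_gap(card):
    """Assumed formula: overconfidence + underconfidence + blind spots."""
    return round(card["over"] + card["under"] + card["blind"], 1)


for model, card in CARDS.items():
    status = "matches" if total_gap(card) == card["total"] else "differs from"
    print(f"{model}: computed {total_gap(card)} {status} displayed {card['total']}")
```

Under this reading, most of the openai/gpt-oss-120b gap comes from overconfidence (50.7 of 75.1), while the anthropic/claude-sonnet-4.5 gap is dominated by blind spots (17.7 of 33.1).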
COMPARATIVE INSIGHT

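anthropic/claude-sonnet-4.5 leads openai/gpt-oss-120b on both raw performance and self-awareness: it has higher solid ground (Solid 31.0 vs. 16.1), tighter calibration (Calibration Idx 80.4 vs. 53.5), and better discernment (72.1 vs. 46.7).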