METACOGNITIVE BENCHMARK

Shoreline

Mapping where capability meets self-knowledge

Concrete: How well it catches its own mistakes.

Solid: What it actually gets right.

Sand: What it thinks it can do.

MODEL A / MODEL B

anthropic/claude-sonnet-4.5

196 trials

0.5% invalid confidence

HOW THIS ISLAND IS FORMED

Each spoke is one category. Layer depth is computed from all trials in that category after normalizing difficulty (0-100 scale).

Sand = what the model thinks it can do. Solid = what it actually gets right. Concrete = how well it catches its own mistakes.

Concrete is scaled from solid by mistake-awareness: concrete = solid × (caught mistakes / total mistakes).
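The scaling rule above can be sketched in a few lines. The function name is hypothetical, and the handling of a category with zero mistakes is an assumption: the page does not say what happens when there is nothing to catch, so this sketch leaves solid undiscounted in that case.

```python
def concrete_depth(solid: float, caught_mistakes: int, total_mistakes: int) -> float:
    """Scale solid depth by the share of mistakes the model caught itself."""
    if total_mistakes < 0 or not 0 <= caught_mistakes <= total_mistakes:
        raise ValueError("need 0 <= caught_mistakes <= total_mistakes")
    if total_mistakes == 0:
        # Assumption (not stated in the source): with no mistakes to catch,
        # nothing is discounted and concrete equals solid.
        return solid
    return solid * caught_mistakes / total_mistakes


# Hypothetical inputs, not taken from the benchmark's trial data.
print(concrete_depth(50.0, 3, 4))  # 37.5
```

Under this reading, a category where no mistake was caught collapses to a concrete depth of 0 regardless of how high solid is.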

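Aggregate gap metrics: Overconfidence = max(0, Sand - Solid), Underconfidence = max(0, Solid - Sand), Blind Spots = trials that are both wrong and confident. The headline Overconfidence and Underconfidence values appear to be per-category gaps averaged across categories, which is why a model can show both at once.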

Category	Trials	Sand	Solid	Concrete
Multiplication	20	2.1	11.0	8.3
Modular Exp.	22	21.4	8.6	7.7
Boolean Circuits	18	46.4	77.4	77.4
Matrix Det.	16	19.3	10.9	9.1
Combinatorics	20	58.0	51.3	5.7
Random Gen.	18	34.8	51.1	22.4
Constrained Write	12	13.9	25.8	22.0
Sudoku Gen.	14	27.1	5.7	5.2
Distribution	18	19.3	4.7	3.1
Self-Referential	12	27.1	17.2	6.1
Counting	26	50.3	77.4	51.6
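As a sanity check, the headline figures for this model can be reproduced from the table above. The sketch assumes an unweighted mean over the 11 categories and a per-category gap that is clamped at zero before averaging. That aggregation is inferred from the numbers, not documented on the page, and the variable names are illustrative.

```python
# (category, sand, solid, concrete) copied from the table above.
ROWS = [
    ("Multiplication", 2.1, 11.0, 8.3),
    ("Modular Exp.", 21.4, 8.6, 7.7),
    ("Boolean Circuits", 46.4, 77.4, 77.4),
    ("Matrix Det.", 19.3, 10.9, 9.1),
    ("Combinatorics", 58.0, 51.3, 5.7),
    ("Random Gen.", 34.8, 51.1, 22.4),
    ("Constrained Write", 13.9, 25.8, 22.0),
    ("Sudoku Gen.", 27.1, 5.7, 5.2),
    ("Distribution", 19.3, 4.7, 3.1),
    ("Self-Referential", 27.1, 17.2, 6.1),
    ("Counting", 50.3, 77.4, 51.6),
]


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation rule)."""
    values = list(values)
    return sum(values) / len(values)


sand = mean(r[1] for r in ROWS)
solid = mean(r[2] for r in ROWS)
concrete = mean(r[3] for r in ROWS)
# Gaps are clamped per category, then averaged.
over = mean(max(0.0, r[1] - r[2]) for r in ROWS)
under = mean(max(0.0, r[2] - r[1]) for r in ROWS)

print(f"sand={sand:.1f} solid={solid:.1f} concrete={concrete:.1f}")
print(f"overconfidence={over:.1f} underconfidence={under:.1f}")
```

Run on these rows, the sketch yields Sand 29.1, Solid 31.0, Concrete 19.9, Overconfidence 6.7 and Underconfidence 8.7, matching the reported values. Clamping before averaging matters: applied to the aggregate Sand and Solid directly, max(0, 29.1 - 31.0) would give an overconfidence of 0.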

Metric	Value	Meaning
Concrete	19.9	catches own mistakes
Solid	31.0	actually gets right
Sand	29.1	thinks it can do
Overconfidence	6.7	claimed beyond solid
Underconfidence	8.7	solid beyond claimed
Blind Spots	17.7	wrong but confident
Total Gap	33.1	claim/perf misalignment + blind spots
Capability Idx	21.8	normalized difficulty frontier
Discernment	72.1	correctly detects success/failure
Calibration Idx	80.4	confidence vs realized accuracy
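The Total Gap figure is described as claim/performance misalignment plus blind spots. A minimal sketch, assuming it is the plain sum of Overconfidence, Underconfidence and Blind Spots (the class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapSummary:
    overconfidence: float
    underconfidence: float
    blind_spots: float

    @property
    def total_gap(self) -> float:
        # Assumed rule: misalignment in both directions plus blind spots.
        return self.overconfidence + self.underconfidence + self.blind_spots


# Values reported above for anthropic/claude-sonnet-4.5.
model_a = GapSummary(overconfidence=6.7, underconfidence=8.7, blind_spots=17.7)
print(f"{model_a.total_gap:.1f}")  # 33.1
```

The sum agrees with the reported Total Gap of 33.1, which supports the assumed rule without confirming it.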

google/gemini-3-flash-preview

200 trials

Category	Trials	Sand	Solid	Concrete
Multiplication	22	1.4	13.9	13.9
Modular Exp.	20	54.0	10.8	9.8
Boolean Circuits	18	46.4	77.4	77.4
Matrix Det.	16	16.9	17.4	14.2
Combinatorics	24	65.8	59.9	0.0
Random Gen.	18	77.4	28.3	11.5
Constrained Write	12	58.0	25.8	0.0
Sudoku Gen.	14	54.2	4.4	0.9
Distribution	18	73.5	4.7	1.3
Self-Referential	12	58.0	11.6	2.2
Counting	26	65.8	77.4	0.0
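Since concrete = solid × (caught mistakes / total mistakes), dividing Concrete by Solid recovers an approximate mistake-awareness ratio for each category. The sketch below does this for the rows above. The function name is hypothetical, and the ratios are only approximate because the table values are rounded to one decimal.

```python
# (category, solid, concrete) copied from the table above.
ROWS = [
    ("Multiplication", 13.9, 13.9),
    ("Modular Exp.", 10.8, 9.8),
    ("Boolean Circuits", 77.4, 77.4),
    ("Matrix Det.", 17.4, 14.2),
    ("Combinatorics", 59.9, 0.0),
    ("Random Gen.", 28.3, 11.5),
    ("Constrained Write", 25.8, 0.0),
    ("Sudoku Gen.", 4.4, 0.9),
    ("Distribution", 4.7, 1.3),
    ("Self-Referential", 11.6, 2.2),
    ("Counting", 77.4, 0.0),
]


def implied_awareness(solid, concrete):
    """Invert concrete = solid * ratio; undefined when solid is zero."""
    if solid == 0:
        return None
    return concrete / solid


ratios = {name: implied_awareness(solid, concrete) for name, solid, concrete in ROWS}

for name, ratio in ratios.items():
    shown = "n/a" if ratio is None else f"{ratio:.2f}"
    print(f"{name:<18} {shown}")
```

A ratio of 0 (Combinatorics, Constrained Write, Counting) means none of the mistakes in that category were caught, even where solid is high. A ratio of 1 is ambiguous: either every mistake was caught or the category had none to catch.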

Metric	Value	Meaning
Concrete	11.9	catches own mistakes
Solid	30.1	actually gets right
Sand	51.9	thinks it can do
Overconfidence	26.9	claimed beyond solid
Underconfidence	5.0	solid beyond claimed
Blind Spots	37.7	wrong but confident
Total Gap	69.6	claim/perf misalignment + blind spots
Capability Idx	26.5	normalized difficulty frontier
Discernment	58.3	correctly detects success/failure
Calibration Idx	64.6	confidence vs realized accuracy

COMPARATIVE INSIGHT

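anthropic/claude-sonnet-4.5 leads on self-awareness, with tighter calibration (80.4 vs 64.6), stronger discernment (72.1 vs 58.3), and a smaller total gap (33.1 vs 69.6). Raw performance is close: it has slightly higher solid ground (31.0 vs 30.1), while google/gemini-3-flash-preview scores higher on the capability index (26.5 vs 21.8).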

TRIAL RESULTS