Shoreline
Mapping where capability meets self-knowledge
anthropic/claude-sonnet-4.5
HOW THIS ISLAND IS FORMED
Each spoke is one category. Layer depth is computed from all trials in that category after normalizing difficulty (0-100 scale).
Sand = Phase 1 claimed depth. Solid = Phase 2 verified depth. Concrete = Phase 3 failure-aware depth (wrong answers given with low confidence).
Aggregate gap metrics: Overconfidence = max(0, Sand - Solid), Underconfidence = max(0, Solid - Sand), Blind Spots = wrong answers given with high confidence.
| Category | Trials | Sand | Solid | Concrete |
|---|---|---|---|---|
| Multiplication | 20 | 2.1 | 11.0 | 11.0 |
| Modular Exp. | 22 | 21.4 | 8.6 | 8.6 |
| Boolean Circuits | 18 | 46.4 | 77.4 | 0.0 |
| Matrix Det. | 16 | 19.3 | 10.9 | 10.9 |
| Combinatorics | 20 | 58.0 | 51.3 | 34.9 |
| Random Gen. | 18 | 34.8 | 51.1 | 51.1 |
| Constrained Write | 12 | 13.9 | 25.8 | 25.8 |
| Sudoku Gen. | 14 | 27.1 | 5.7 | 5.7 |
| Distribution | 18 | 19.3 | 4.7 | 4.7 |
| Self-Referential | 12 | 27.1 | 17.2 | 17.2 |
| Counting | 26 | 50.3 | 77.4 | 77.4 |
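The Overconfidence and Underconfidence formulas from the legend can be applied to individual rows of the table above. The following is a minimal sketch: the function name `gap_metrics` and the row layout are illustrative assumptions, and the way Shoreline aggregates these per-category values into a single metric is not stated, so only the per-category step is shown.

```python
# Per-category gap metrics, following the legend's formulas.
# gap_metrics and the row layout are hypothetical, not part of Shoreline.

def gap_metrics(sand, solid):
    """Return (overconfidence, underconfidence) in 0-100 depth units."""
    overconfidence = max(0.0, sand - solid)   # claimed more than verified
    underconfidence = max(0.0, solid - sand)  # verified more than claimed
    return overconfidence, underconfidence

# (category, sand, solid) rows copied from the table above
rows = [
    ("Multiplication", 2.1, 11.0),
    ("Modular Exp.", 21.4, 8.6),
    ("Counting", 50.3, 77.4),
]

for category, sand, solid in rows:
    over, under = gap_metrics(sand, solid)
    print(f"{category}: overconfidence={over:.1f}, underconfidence={under:.1f}")
```

At most one of the two values is non-zero for any row: Modular Exp. comes out overconfident by 12.8, while Multiplication and Counting come out underconfident by 8.9 and 27.1.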
openai/gpt-oss-120b
| Category | Trials | Sand | Solid | Concrete |
|---|---|---|---|---|
| Multiplication | 22 | 65.8 | 8.3 | 8.3 |
| Modular Exp. | 22 | 72.7 | 6.4 | 6.4 |
| Boolean Circuits | 24 | 68.1 | 50.8 | 50.8 |
| Matrix Det. | 16 | 63.4 | 10.9 | 10.9 |
| Combinatorics | 24 | 74.3 | 51.3 | 51.3 |
| Random Gen. | 18 | 72.0 | 11.5 | 11.5 |
| Constrained Write | 12 | 55.7 | 2.2 | 2.2 |
| Sudoku Gen. | 14 | 60.3 | 0.0 | 0.0 |
| Distribution | 18 | 74.3 | 4.3 | 4.3 |
| Self-Referential | 12 | 60.3 | 5.8 | 5.8 |
| Counting | 24 | 68.1 | 25.9 | 25.9 |
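anthropic/claude-sonnet-4.5 leads on both raw performance and self-awareness: its Solid depth is higher in 9 of the 11 categories (tied in Matrix Det. and Combinatorics), and its gap between Sand and Solid is smaller in every category except Boolean Circuits.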
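One way to check this comparison against the two tables is to average Solid depth and the absolute Sand-Solid gap across categories. This is only a sketch under stated assumptions: the unweighted mean over categories and the helper name `summarize` are choices made here for illustration, and Shoreline may weight by trial count or aggregate differently.

```python
from statistics import mean

# (sand, solid) pairs in table order, copied from the two tables above.
claude_sonnet = [
    (2.1, 11.0), (21.4, 8.6), (46.4, 77.4), (19.3, 10.9),
    (58.0, 51.3), (34.8, 51.1), (13.9, 25.8), (27.1, 5.7),
    (19.3, 4.7), (27.1, 17.2), (50.3, 77.4),
]
gpt_oss = [
    (65.8, 8.3), (72.7, 6.4), (68.1, 50.8), (63.4, 10.9),
    (74.3, 51.3), (72.0, 11.5), (55.7, 2.2), (60.3, 0.0),
    (74.3, 4.3), (60.3, 5.8), (68.1, 25.9),
]

def summarize(rows):
    """Hypothetical summary: unweighted mean Solid and mean |Sand - Solid|."""
    mean_solid = mean(solid for _, solid in rows)
    mean_gap = mean(abs(sand - solid) for sand, solid in rows)
    return mean_solid, mean_gap

for name, rows in [("claude-sonnet-4.5", claude_sonnet), ("gpt-oss-120b", gpt_oss)]:
    mean_solid, mean_gap = summarize(rows)
    print(f"{name}: mean solid={mean_solid:.1f}, mean |sand - solid|={mean_gap:.1f}")
```

Under these assumptions, claude-sonnet-4.5 averages about 31.0 Solid depth with a mean gap of about 15.4, against about 16.1 and 50.7 for gpt-oss-120b.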