Making the most annoying jigsaw puzzle

Puzzles

Published

March 30, 2026

Before you scroll down

There are interactive copies of the puzzles in this post. They were designed to be as annoying as possible. That is the entire point. Please do not complain to me that they are annoying — they are working as intended.

Hover a piece to focus it (blue outline), then press R to rotate or F to flip — or use the buttons. Drag pieces onto the grid; a piece only lands if every edge it touches connects correctly. Each connector has a shape (triangle / square / semicircle, shallow or deep). Positive values stick OUT as bumps; negative values are matching holes. A connector's offset along its edge (nearer one end than the other) encodes its "flip side" — flipping a piece moves its bumps from one side of the edge to the other.

A few weeks ago I was in a shop and saw a jigsaw puzzle that looked, on purpose, deeply unpleasant to solve — uniform colour, repeating pattern, pieces that all looked roughly interchangeable. I stood there for a minute wondering how annoying you could actually make a jigsaw, if you set out to try. Not “tricky”, not “large”, but maximally annoying per piece. That question stuck with me, and this post is what happened when I took it seriously.

What makes a jigsaw annoying?

Before we can build the worst puzzle, we need to be precise about what “worst” means. Jigsaws have a lot of features that help you make progress, so let’s strip them away one at a time.

First, no picture. A printed image is a huge hint — you can glance at a piece and know roughly where it goes. We will make every piece look identical on the surface: same blank colour, no texture, no cat, not even a sky in varying shades of blue. Nothing to tell you that this piece belongs in the sky and that one belongs in the tree.

Second, double-sided pieces. A normal jigsaw has a “right way up” — a printed side and a blank side, which saves you trying pieces both ways up. If both sides are the same, you don’t even know which way a piece goes. Every piece now has twice as many orientations to try.

Third, no straight edges. No finding the corners first and getting the satisfaction of finishing the frame here. We will have no straight edges anywhere, including on the border of the puzzle itself. From the outside, every piece looks like every other piece.

Fourth, and most devious, locally ambiguous connections. In a normal jigsaw, two pieces fitting is strong evidence they belong together. We will design pieces so that the same piece can physically slot into several different neighbours. Two pieces clicking does not mean you are right.

Given all of that, the question is: how small a puzzle can we make that is still unreasonably hard?

Formalising a piece

You can get all the code at . The code is written in Rust.

Each piece has four edges: top, right, bottom, left. Each edge carries an integer. Positive values are “bumps”, negative values are the matching “holes”. Connector n fits connector −n and nothing else.

type Piece = [i32; 4];  // [top, right, bottom, left]

Rotating a piece 90° clockwise cycles the edges:

fn rotate_cw(p: Piece) -> Piece { [p[3], p[0], p[1], p[2]] }

Flipping is more interesting. We want flipping a piece over (like turning a double-sided tile) to produce a piece whose edges attach differently from the original. We’ll pair up the odd and even magnitudes — 1 with 2, 3 with 4, and so on — and say that flipping swaps each odd with its next even, preserving sign:

fn flip_value(v: i32) -> i32 {
    let a = v.abs();
    let new_a = if a % 2 == 1 { a + 1 } else { a - 1 };
    new_a * v.signum()
}

Why? Without this rule, flipping just produces another piece using the same alphabet of connector values, and the flipped copies are indistinguishable at the edge level. With the odd/even swap, flipping gives a genuinely different piece — same physical object, but attaches to neighbours through a different set of shapes. Pieces within a “pair” (here, {±1, ±2} is pair 1; {±3, ±4} is pair 3; and so on) look confusable to a solver but can’t all fit each other.

Each piece therefore has up to 8 distinct orientations: 4 rotations times 2 flips. Pieces with internal symmetry have fewer, which we discover by collecting them in a HashSet.

An arrangement and its score

A puzzle is a multiset of pieces. A valid arrangement fills a rows × cols grid so that every interior edge is consistent — wherever two pieces share an edge, if one side reads +v the other must read −v. Border edges are unconstrained; nothing sits outside the grid to fit against.

Now we need a metric. How annoying is a given puzzle to solve?

Normally a jigsaw has one solution; ours may have multiple. It’s not completely clear whether it’s better to enforce one solution, so users have a single target, or to allow multiple, so users have more solutions to stumble across. What we’re going to measure is: how many wrong turns does a solver take before finding a solution?

Counting symmetric copies, our puzzle will always have at least 8 solutions — you can rotate the answer and flip it over. So let’s run a backtracking search and count two things: the number of search-tree nodes explored (each being one placement attempt) and the number of valid arrangements found. Annoyance is the ratio:

annoyance = nodes / solutions

Roughly, the expected number of dead-end attempts between successive solutions. With at least 8 solutions baked in by symmetry, the denominator is never zero.

This isn’t a perfect model of a person solving a jigsaw. A human doesn’t enumerate all solutions — they just want one. And they don’t use row-major with no heuristics. But the ratio has a useful property: it rewards puzzles where valid placements are hard to find, rather than puzzles with one solution in a tiny search space. We’ll test the metric against a much smarter solver later and see how it holds up.

A first solver

The simplest thing that could work: fill positions in row-major order, try every unused piece in every orientation, check the top and left constraints (the only placed neighbours), recurse, backtrack. Around 40 lines of Rust. Every (piece, orientation) attempt increments nodes; every full-grid fill increments solutions.

Searching puzzle space with simulated annealing

We have a metric; we want to design puzzles that maximise it. The natural tool is simulated annealing: propose a random neighbour, accept if better, accept probabilistically if worse, cool the acceptance probability over time.

Except that the neighbourhood needs thought. If we change one arbitrary connector value, we’ll break the validity of the interior edge it sits on. We want to stay inside the space of valid puzzles throughout the search.

The trick is to mutate a whole edge atomically. Pick an edge — interior or border — uniformly at random:

Interior edge: update both sides to (+v, −v) in one step. Validity is preserved by construction.
Border edge: flip one side to ±v with a random sign. Borders have no matching constraint, so any value is fine.

What should v be? Here we can be devious. With probability p_reuse = 0.8 we sample from the set of connector magnitudes already present in the puzzle; with probability 0.2 we pick a fresh value from 1..=20:

fn choose_connector_type(grid: &[Piece], rng: &mut impl Rng,
                         p_reuse: f64, max_fresh: i32) -> i32 {
    let mut pool: Vec<i32> = grid.iter()
        .flat_map(|p| p.iter().copied())
        .map(|v| v.abs())
        .collect();
    pool.sort_unstable();
    pool.dedup();

    if !pool.is_empty() && rng.gen_range(0.0f64..1.0) < p_reuse {
        pool[rng.gen_range(0..pool.len())]
    } else {
        rng.gen_range(0..max_fresh as usize) as i32 + 1
    }
}

Reuse is what makes pieces look alike. Two pieces drawn from a small shared connector alphabet both become candidates at many positions, forcing the solver to try more placements before it discovers which one is right. This single parameter is the most important lever for pushing annoyance upward. Without it, SA tends to find puzzles with many distinct connector types — locally ambiguous but globally easy.

Cooling is geometric, T sweeping from 1000 down to 1 over the run. Accept if delta > 0 or if rand() < exp(delta/T). Standard.

First results

3×3, 50 runs, 5,000 SA iterations each, dumb solver, t_start=1000, t_end=1. Annoyance distribution:

min    = 142,566
median = 209,765
max    = 425,049

Those numbers include the factor of 8 from global symmetry in the denominator. A score of 425k means the solver explores roughly 3.4 million nodes per puzzle.

The best puzzle:

[  4,  7,  7,  3]  [ -4,  4,  7, -7]  [  3, 16,  4, -4]
[ -7,  3,  4, -3]  [ -7,  3,  3, -3]  [ -4, -4,  3, -3]
[ -4,  4, -5, -4]  [ -3,  7, -4, -4]  [ -3, -3, -3, -7]

Almost every connector value is from {3, 4, 7}, with a single 16 and a single −5 thrown in. SA has discovered — without being told — that a small connector alphabet with a handful of decoys produces lookalike pieces that defeat the row-major solver. That’s SA earning its keep.

A cap mistake

To scale up to 3×4 and 4×4 we added a node cap: if the solver explored more than N nodes, stop and return N as the score. The idea was to keep per-evaluation time bounded.

This was a mistake. In an 8-hour overnight run:

3×4 at cap 2,000,000: every run after the first saturated at exactly 2M. Standard deviation of 0.
4×4 at cap 500,000: min 250k, max 500k, stddev 8,151. Most runs pegged against the cap.

A cap that every candidate hits is useless as a search signal. SA cannot distinguish “quite annoying” from “catastrophically annoying” when both score the same. The landscape goes flat and the anneal has nothing to climb. Lesson: choose the objective so that the scores you are trying to distinguish are distinguishable.

The 3×3 run (no cap) came through fine. Best against the dumb solver: 695,841. That was our high-water mark for a while.

A cleverer solver

The dumb solver is wasteful. Row-major order commits at position 0 — the top-left corner, with only one placed neighbour to constrain it — before ever looking at positions where four neighbours would narrow the choice immediately. A good player doesn’t work that way.

An actual player uses minimum remaining values (MRV):

Look at every empty position that has at least one placed neighbour.
For each, count the (piece, orientation) placements that fit all four neighbour constraints.
If any count is zero, prune — this branch is dead.
Otherwise branch at the position with the fewest options. Forced moves (count = 1) fall out automatically as the minimum.

for pos in 0..self.n {
    if self.grid[pos].is_some() { continue; }
    if !is_constrained(pos) { continue; }

    let req = self.requirements(pos);
    let mut opts = Vec::new();
    for i in 0..self.n {
        if self.used[i] { continue; }
        for &o in &orientations[i] {
            if Solver::fits_clever(o, &req) { opts.push((i, o)); }
        }
    }
    if opts.len() < best_count {
        best_count = opts.len();
        best_pos   = Some(pos);
        best_opts  = opts;
        if best_count == 0 { break; }
    }
}

fits_clever is four Option<i32> comparisons — cheap, but called in the innermost loop, so the outer loop extracts the four neighbour requirements once per branch position rather than recomputing them per piece.

Unconstrained positions (no placed neighbours) always score maximum, so we skip them. Except on the opening move, when no position has any neighbours at all — that one gets special-cased, and we’ll come back to it.

Switching to MRV collapsed our old record. That 695,841-annoyance puzzle? Against the clever solver it scores a few thousand. All the lookalike placements the dumb solver waded through, the clever one prunes immediately. From here on, “annoyance” means annoyance against the MRV solver.

A digression: piece-level search

MRV asks “which position has the fewest candidate pieces?”. The dual question is “which piece has the fewest candidate (position, orientation) placements?”. If some unused piece has fewer options than the best position, we should branch on the piece instead.

It sounds like it ought to help. Does it?

Cheap lower bound first. Each unused piece has at least 8 placements in every position with no placed neighbours (all orientations are free). So the minimum piece count is at least 8 × unconstrained_count. If that’s already above the best position count, piece search cannot win — skip it.

Instrumented on a reference 4×4 puzzle, 20-second workload:

Total branches: 36,019,596
Piece search attempted: 691,968 (1.92% of branches)
Piece search won over position MRV: 174,600 (25% of attempts)
Net: 0.48% of branches used a piece-level finding.
Annoyance: 378,031 → 375,205 (0.75% reduction)

Almost useless. Why? Until the grid is nearly full, there’s almost always an empty position with no placed neighbours, so the 8 × unconstrained bound immediately rules piece search out. Only in the endgame does it become relevant, and even then the position heuristic usually finds the same forced move anyway.

I suspected this before running the experiment, and it’s nice when an intuition holds up. But “probably won’t help” is a weaker claim than “moves the score by 0.75%”. And the pre-check — that 8 × unconstrained lower bound — costs effectively nothing, which makes the negative result cheap to establish.

In reality, this is the biggest part of research people never write about — only a small proportion of ideas end up being worth using. It might be that other people are much better than me at choosing which ideas are worth trying!

The opening move

Here’s the change that actually mattered. Up to this point, when the grid was empty the opening move went at position 0 — top-left corner. A corner has two neighbours. The centre of the grid has four.

if !any_constrained {
    let middle = (rows / 2) * self.cols + self.cols / 2;
    best_pos = Some(middle);
    // ...
}

On the same reference 4×4 puzzle (annoyance 378,031 against the corner-first solver):

Corner-first: 378,031
Middle-first: 140,989

A 63% reduction, and evaluations per second roughly doubled. Starting in the middle means the four positions around the first piece all become constrained immediately, so MRV has four candidate positions to choose a tight branch from on move two — rather than the two a corner gives us. The tree narrows much faster.

There’s a caveat worth naming. The 378k puzzle was designed by SA optimising against corner-first, so of course it exploits corner-first’s weaknesses. The fair test is to re-run SA against middle-first and see what annoyance ceiling the new pipeline can reach.

Polish

SA uses stochastic single-edge mutations with a cooling schedule. When the run ends, the final puzzle is at a local optimum against that random process, but not necessarily against a deterministic single-edge neighbourhood. There are sometimes small improvements SA just never tried.

After every SA run, sweep every edge and try replacing it with every value in the puzzle’s palette — the set of connector values currently present, plus their negations and flips. Typically 4 to 16 candidates. Keep any change that raises annoyance. Iterate whole-grid sweeps until no pass finds anything.

fn collect_palette(grid: &[Piece]) -> Vec<i32> {
    let mut set: HashSet<i32> = HashSet::new();
    for piece in grid {
        for &v in piece {
            if v == 0 { continue; }
            set.insert(v);
            set.insert(-v);
            set.insert(flip_value(v));
            set.insert(-flip_value(v));
        }
    }
    let mut out: Vec<i32> = set.into_iter().collect();
    out.sort_unstable();
    out
}

Interior edges mutate both sides atomically (+v, −v); border edges are single-sided. Cost is low — around 400 evaluations per sweep, converging in one to three sweeps, so roughly 5% overhead on top of 20,000 SA iterations. The benefit:

Polish finds an improvement in 25.6% of runs (437 / 1,704).
Average delta across all runs: +138.
When polish does help, the wins can be dramatic. Top deltas observed: +10,726, +10,486, +10,108, +9,071, +4,858. Three of those turned middling 3k-range SA results into new-best 11k–19k records.

SA finds the structure. Polish closes the gap when SA has left the puzzle one edge off the local optimum.

When doing puzzle generation I find a polish phase is almost always required. It’s not always about making the puzzle harder — sometimes it’s about removing useless pieces, or trivial steps — but without this, users often look at puzzles and see they look obviously “unfinished”.

Results: 3×3 against MRV + polish

Overnight, 6 threads, 1,704 SA-plus-polish runs. The new best:

[  -2,   1,   1,  -2]  [   2,   1,   2,  -1]  [  -2,   2,   1,  -1]
[  -1,   2,   4,  -1]  [  -2,   2,   2,  -2]  [  -1,  -2,   1,  -2]
[  -4,   4,  -7,  -4]  [  -2,   1,   1,  -4]  [  -1,   2,   1,  -1]

Annoyance: 42,176. The puzzle uses three pair types: pair 1 (±1, ±2), pair 3 (±3, ±4), and pair 7 (±7, ±8). Of the 36 connector slots, 30 are pair 1 — almost the whole puzzle runs on the same two interchangeable shapes. Two interior edges use pair 3, and the bottom-left piece carries a genuine orphan on its bottom edge: −7, with no +7 and no +8 (which could be flipped into a +7) anywhere in the puzzle.

An orphan is a border connector with no potential partner anywhere in the puzzle. Perfectly legal — borders aren’t constrained — but a solver, human or otherwise, has to discover that fact by trying and failing. From the player’s point of view, the −4 on the same piece’s left edge looks similarly suspicious — there are +4s in the puzzle, and the player doesn’t know they’re all already committed interior — so it acts as a decoy even though it isn’t a true orphan. Nine pieces, almost identical at a glance, with a deliberate lie and a plausible-looking distraction along the border. That’s the worst 3×3 we’ve managed to design so far.

Results: 2×2

What about the smallest interesting puzzle? 2×2 evaluations finish in microseconds, so it was cheap to throw compute at: 6 threads, max_iter=5000, t_start=500, t_end=1, polish on every run. In 6.5 minutes we got through around 240,000 runs.

Best 2×2: annoyance 67.50.

[   8,   7,  -8,   6]  [  -8,   7,   7,  -7]
[   8,  -8,   7,   7]  [  -7,  -8,  -6,   8]

Same recipe as the 3×3 winners. One pair used almost everywhere — pair 7 (±7, ±8) for 14 of the 16 slots — plus a ±6 pair of decoys on the borders.

The shape of the distribution is the interesting bit. Six threads each, independently, reached exactly 67.00 — six new-best records at the same score. Only one thread, after 33,865 runs, pushed past it to 67.50. Everything else clustered in the 45–60 range.

A quick sanity check on the ratio. Annoyance = nodes / solutions, and 2×2 has at least 8 symmetric solutions, so:

67.00 = 536 search nodes across 8 solutions
67.50 = 540 search nodes across 8 solutions

A four-node gap. At this scale, six seeds converging on the same number — and 240,000 runs failing to beat it by more than 0.5 — is strong evidence that we’ve found the global optimum, or are within a handful of nodes of it.

Open questions

Is 42,176 near the ceiling for 3×3, or can a longer run with more polish push it further? Unclear.
Solver upgrades that might reshape the landscape: forward-checking before committing; degree-heuristic tie-breaking inside MRV; symmetry breaking (which shrinks both numerator and denominator of annoyance together and so doesn’t change the score, but would speed up evaluation).

If you tried any of the puzzles on this post and you’re now upset, please refer to the notice at the top of the post.