Week 40 · Phase 5 — The Builds
Hash maps, anagrams, and the difference between a fast game and an unplayable one.
You can download the official Scrabble word list (TWL or SOWPODS) and you'll get a text file with about 270,000 entries, one per line. Validating a player's move means: is this word in there? The naive approach — read the file once, then search the list line by line for each move — works, but it's slow enough that the program feels sluggish. With ten cross-words to validate per turn, every move chews through millions of string comparisons.
Hash maps fix that. A hash table, built once at startup, lets you check membership in roughly constant time regardless of how many words there are. The whole "is this a word?" question becomes a single hash and a single string compare. This is exactly the problem hash maps were invented for, and it's why Week 31 felt important.
Read the file once into a std::unordered_set<std::string>. That gives you O(1) average-case membership tests for the entire game.
#include &lt;cctype&gt;
#include &lt;fstream&gt;
#include &lt;stdexcept&gt;
#include &lt;string&gt;
#include &lt;unordered_set&gt;

class Dictionary {
public:
    explicit Dictionary(const std::string& path) {
        std::ifstream in(path);
        if (!in) throw std::runtime_error("can't open dictionary");
        std::string word;
        while (in >> word) {
            // std::toupper wants an unsigned char; casting avoids UB on negative chars
            for (auto& c : word)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            words_.insert(word);
        }
    }
    bool is_word(const std::string& w) const {
        return words_.count(w) > 0;
    }
private:
    std::unordered_set<std::string> words_;
};
Cost of construction: about 120 ms on a modern laptop. Cost of is_word("HELLO"): a few hundred nanoseconds. You can call it a million times and not notice. That's the win.
"Is HELLO a word?" is the easy question. The hard question — the one a Scrabble AI needs answered — is: given these 7 tiles on my rack, what valid words can I spell?
The naive way: try every permutation of every subset of the rack and look each one up. With 7 letters that's the sum over subset sizes k of 7!/(7−k)! ≈ 13,700 arrangements. Check each one against 270,000 words with a linear scan, then multiply by the dozens of board positions you'd want to score against, and you've blown the budget.
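To see where that 13,700 comes from, here is a small sketch (the function name is mine, not from the text) that sums the ordered arrangements of every non-empty subset of an n-tile rack:

```cpp
#include <cstdint>

// Count ordered arrangements of every non-empty subset of a rack of
// n distinct tiles: sum over k = 1..n of n!/(n-k)!.
constexpr std::uint64_t rack_arrangements(int n) {
    std::uint64_t total = 0;
    for (int k = 1; k <= n; ++k) {
        std::uint64_t perms = 1;                       // P(n, k) = n * (n-1) * ... * (n-k+1)
        for (int i = 0; i < k; ++i) perms *= static_cast<std::uint64_t>(n - i);
        total += perms;
    }
    return total;
}
// rack_arrangements(7) == 13699, i.e. the "about 13,700" above
```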
Better: turn the question into a lookup, not a search. The trick is the sorted-letter signature.
Two words are anagrams of each other iff they have the same multiset of letters. EAT, ATE, TEA all sort to AET. So: take every word in the dictionary, sort its letters, and use that as the key. Build a map from sorted-letter string → list of real words.
#include &lt;algorithm&gt;
#include &lt;string&gt;
#include &lt;unordered_map&gt;
#include &lt;vector&gt;

std::string signature(std::string w) {
    std::sort(w.begin(), w.end());    // EAT, ATE, TEA all become AET
    return w;
}

// build at load time:
std::unordered_map<std::string, std::vector<std::string>> anagrams;
for (const auto& w : words_) anagrams[signature(w)].push_back(w);

// lookup: "I have AEILNST. What words can I spell?"
// answer: every dictionary word whose signature is a SUBSET of AEILNST.
Now "what words contain exactly these letters?" is a single map lookup. anagrams["AET"] returns {"ATE", "EAT", "ETA", "TAE", "TEA"} in microseconds.
"What words can I spell with these 7 letters or any subset of them?" requires checking the 2⁷ = 128 subsets of your rack, each a tiny lookup — still milliseconds, not seconds.
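That subset walk can be sketched with a bitmask. This is a minimal illustration (the name spellable is mine, not from the text), assuming an anagrams map shaped like the one built above:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Enumerate all 2^n - 1 non-empty subsets of the rack with a bitmask,
// sort each subset into its signature, and look it up in the anagram
// map. A seen-set skips duplicate signatures when the rack repeats a letter.
std::vector<std::string> spellable(
    const std::string& rack,
    const std::unordered_map<std::string, std::vector<std::string>>& anagrams) {
    std::vector<std::string> out;
    std::unordered_set<std::string> seen;
    const int n = static_cast<int>(rack.size());
    for (int mask = 1; mask < (1 << n); ++mask) {
        std::string sig;
        for (int i = 0; i < n; ++i)
            if (mask & (1 << i)) sig += rack[i];
        std::sort(sig.begin(), sig.end());
        if (!seen.insert(sig).second) continue;        // already tried this multiset
        if (auto it = anagrams.find(sig); it != anagrams.end())
            out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}
```

For a 7-tile rack this is at most 127 map lookups, which is why the whole question stays in the milliseconds.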
A hash map turns "search through 270,000 strings" into "compute one hash, follow one pointer, compare one string." That's a couple of dozen CPU instructions — maybe 100 ns. Searching the whole list, even with early-exit on a sorted vector and binary search, is closer to 18 comparisons of 5–10 characters each, plus cache misses on every probe. The hash map wins by a factor of roughly 100×, and the win grows as the dictionary grows.
This is exactly the speedup Big-O analysis predicted in Week 30. O(1) versus O(log n) for binary search, versus O(n) for linear scan. For 270,000 words: 1 vs. 18 vs. 270,000 operations. The data structure choice is doing real work.
Scrabble has two blank tiles that the player assigns any letter. For move scoring this doesn't change much (blanks are worth 0 either way), but for "what can I spell?" it expands every signature: a rack of EILNRST + blank can spell any 8-letter word whose signature is EILNRST plus exactly one extra letter.
The clean way to handle this: when computing subsets of the rack, treat each blank as a wildcard that adds the alphabet's 26 letters as candidates. The combinatorics stays modest because blanks are rare.
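One way to sketch that wildcard expansion (the helper name and the '?' convention are mine, not from the text): replace each blank with every letter A–Z, producing a set of concrete racks to feed into the normal subset lookup. With at most two blanks this tops out at 26 × 26 = 676 racks.

```cpp
#include <string>
#include <vector>

// Expand every '?' blank in the rack into all 26 letters, recursively,
// returning the list of fully concrete racks. Each concrete rack can
// then be handed to the ordinary subset/signature lookup.
std::vector<std::string> expand_blanks(const std::string& rack) {
    auto pos = rack.find('?');
    if (pos == std::string::npos) return {rack};       // no blanks left
    std::vector<std::string> out;
    for (char c = 'A'; c <= 'Z'; ++c) {
        std::string next = rack;
        next[pos] = c;                                 // try this letter for the blank
        for (auto& r : expand_blanks(next)) out.push_back(std::move(r));
    }
    return out;
}
```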
There's a better, scarier data structure for Scrabble: the GADDAG (a kind of generalised trie). It supports "what words can I make that pass through this anchor square, fitting these constraints?" in linear time. Real Scrabble engines (like Quackle) use it. It's beautiful and complicated; a properly tuned GADDAG can find the highest-scoring move on any board in milliseconds.
For our learning project, the hash-map approach is more than fast enough. The point isn't to beat Quackle — the point is to feel why a data structure changes whether a problem is "easy" or "impossible."
The same problem can take a microsecond or a minute, depending on how you store the data.
Hash-keyed lookup of pre-computed signatures is one of the deepest tricks in computing. Transposition tables in chess engines (next week's territory): hash a board position, look up "have I already analysed this?". Caching in LLM inference: hash a prompt prefix, reuse the attention computation. Vector databases for retrieval-augmented generation: a high-dimensional analogue of "find the nearest signature." Same principle every time — turn a search into a lookup by computing the right key.
Try it:
- Test is_word against a few inputs — including some that should fail.
- Look up anagrams["AEINRST"] — you should see at least RETAINS, STAINER, RETSINA, etc. There are about 8 anagrams for that signature; it's a famously rich set.
- Write best_anagrams(const std::string& rack) that returns every dictionary word formable from any subset of the rack, sorted by score. This is the core of a (very simple) Scrabble AI.
- If construction feels slow, call reserve() first.

Two projects down. Now the giant. Chess has 32 pieces, 6 piece types, 64 squares, three special moves (castling, en passant, promotion), and a search space larger than the number of atoms in the observable universe. The OOP design of the chess engine is real engineering, and the AI in Week 43 is the first place we'll meet alpha-beta pruning — the optimisation that makes Minimax practical for any non-trivial game.
Week 41 is Project 3: The Chess Engine — modelling the pieces.
Answer: C. That's the whole reason we load the dictionary into a hash set instead of searching it linearly.
Answer: B. EAT, ATE, TEA all sort to AET. Use AET as the key, store all real words at that key. Lookup is O(1).
Answer: C. Big enough that the data structure choice matters. With a hash set, lookup is still microseconds.