How to build a search engine (and what I learned from zero)
I had no idea how search engines worked, so I looked it up and built one. Here's the research, the design, and the code.
I wanted to build a search engine. Not Google. Just something that could take a bunch of documents and return the ones that match a query. I had no idea where to start. So I did what everyone does: I googled it.
Step one: figure out what I was even building
My first search was something like "how do search engines work." That gave me the usual high-level stuff: crawl the web, index the pages, rank them when someone searches. Okay, but that's the product. I needed to know the mechanism. What data structure makes "find all documents that contain this word" fast? So I searched again: "how to build a search engine from scratch."
That’s when I found a good entry point: Search Engine from Scratch — Part 1: The Inverted Index on DEV. The author explains that the core idea is an inverted index. Instead of storing "document A has words X, Y, Z" (which would force you to scan every document for every query), you store "word X appears in documents A, B, C." So when someone searches for X, you do one lookup and you get the list of documents. That made sense. I bookmarked it and kept digging.
I also hit Wikipedia’s Inverted index page. It’s dry but it confirmed the same idea: a mapping from terms to postings (document ID plus optional stuff like position or frequency). So my mental model became: build that map, then at query time look up each term and combine the results. The rest was details.
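To make that mental model concrete, here are the two views side by side in JavaScript (the documents and IDs are invented for illustration):

```js
// Forward view: each document lists its words. Answering "which docs
// contain 'dog'?" means scanning every document.
const docs = new Map([
  [1, ["the", "cat", "sat"]],
  [2, ["the", "dog", "ran"]],
]);

// Inverted view: each word lists the documents that contain it.
// Now the same question is a single lookup.
const inverted = new Map();
for (const [docId, words] of docs) {
  for (const word of words) {
    if (!inverted.has(word)) inverted.set(word, []);
    inverted.get(word).push(docId);
  }
}

console.log(inverted.get("dog")); // [2]
console.log(inverted.get("the")); // [1, 2]
```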
What I decided to build (and what I skipped)
I wasn’t trying to crawl the whole web. I wanted to keep the problem small so I could finish. So I scoped it to:
- Input: A set of text documents (e.g. text files, or maybe later a few hundred web pages).
- Output: For a given query string, a list of documents that contain the query terms, ideally in some order of "relevance."
I explicitly skipped (for the first version):
- Crawling. I’d assume documents were already on disk or in memory.
- Fancy ranking. I’d start with "does the document contain the word?" and maybe a simple score.
- Stemming, stop words, and fancy NLP. I could add those once the basics worked.
So the pipeline in my head was: get documents, turn each into a list of words, build the inverted index (word to list of doc IDs), then for a query: split the query into words, look each up in the index, combine the results, maybe score them, return.
Here’s that pipeline as a flowchart:
```mermaid
flowchart TD
  A[Load documents] --> B[Tokenize each doc]
  B --> C[Build inverted index]
  C --> D[(Index: word to doc IDs)]
  D --> E[User query]
  E --> F[Tokenize query]
  F --> G[Lookup terms in index]
  G --> H[Combine result sets]
  H --> I[Score and sort]
  I --> J[Return ranked results]
```
Breaking it down into steps
I wrote this on a whiteboard (well, in a note file):
- Document loading. Read the documents. Each document gets an ID. I’d store the raw text or at least the list of words per document so I could use it later for ranking.
- Tokenization. For each document, turn the raw string into a list of tokens (words). That meant splitting on spaces, maybe stripping punctuation, maybe lowercasing so "Hello" and "hello" match. I’d keep it simple at first: split on non-alphanumeric, lowercase, throw away empty strings.
- Building the index. For each token, I need to record "this document contains this token." So the index is a map:
token -> Set<docId> or token -> [docId1, docId2, ...]. While I was at it, I could store how many times the token appeared in each document (term frequency). That would be useful for ranking later.
- Query processing. User types "foo bar." I tokenize the query the same way I tokenized the documents. So I get ["foo", "bar"].
- Retrieval. For each token, look up the set of document IDs in the index. If I want "documents that contain ALL terms," I intersect those sets. If I want "documents that contain ANY term," I union them. I started with intersection (AND) because it’s what people usually mean for multi-word search.
- Ranking (simple version). I had a list of doc IDs. I could return them in any order, but "relevance" is better. A simple heuristic: count how many query terms appear in the document, and how often. More matches = higher score. That’s basically term frequency. I read a bit about TF-IDF (term frequency times inverse document frequency) and decided I’d add that next; for v1 I’d just use term count or TF.
That was the plan. Now I had to code it.
Implementation: tokenization first
I needed a function that takes a string and returns a list of normalized words. Same function for documents and for the query. Here’s what I wrote (in JavaScript, since that’s what I use most):
```js
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}
```
So "Hello, world!" becomes ["hello", "world"]. No stemming, no stop-word removal. Just: lowercase, strip punctuation (replace with space so we don’t glue words together), split on whitespace, drop empties. Good enough for a first pass.
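A quick sanity check of that function (the same tokenize repeated here so the snippet runs on its own; the inputs are invented, and the edge cases are the interesting part):

```js
function tokenize(text) {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, ' ')   // punctuation becomes a space
    .split(/\s+/)               // split on runs of whitespace
    .filter(Boolean);           // drop empty strings
}

console.log(tokenize("Hello, world!")); // ["hello", "world"]
console.log(tokenize("don't stop"));    // ["don", "t", "stop"] (apostrophes split words)
console.log(tokenize("?!... "));        // [] (pure punctuation yields nothing)
```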
Building the inverted index
I needed a data structure. In code I thought of it as:
- index: a Map where the key is a token (string) and the value is something like a list of { docId, count }, so I could use count for ranking later.
- Optionally I’d keep a separate array or map of documents (docId to content or to token list) so I could show snippets or titles.
I iterated over each document, tokenized it, then for each token I updated the index: "this docId has this token, and here’s the count." I used a Map for the index and for each token I had a Map of docId to count (or a list of postings). Something like:
```js
const index = new Map(); // token -> Map(docId -> count)

function addToIndex(docId, text) {
  const tokens = tokenize(text);
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  for (const [token, count] of counts) {
    if (!index.has(token)) {
      index.set(token, new Map());
    }
    index.get(token).set(docId, count);
  }
}
```
So for each document I first count how many times each token appears in that doc (to avoid storing duplicate doc IDs and to have TF for later). Then I push those (docId, count) pairs into the index under each token. After processing all documents, index.get("foo") gives me a Map of docId to how many times "foo" appeared in that doc. That’s the inverted index.
The structure looks like this: each term points to the documents that contain it (and optionally how often), not the other way around.
```mermaid
flowchart LR
  foo[term: foo] --> d1[doc 1]
  foo --> d2[doc 2]
  bar[term: bar] --> d1
  bar --> d3[doc 3]
  baz[term: baz] --> d2
  baz --> d3
```
So a query for "foo" does one lookup and gets back doc 1 and doc 2. A query for "foo bar" means we look up both, then intersect: only doc 1 has both.
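Running the indexing code on three tiny invented documents (contents chosen to match the diagram) produces exactly that shape:

```js
function tokenize(text) {
  return text.toLowerCase().replace(/[^\w\s]/g, ' ').split(/\s+/).filter(Boolean);
}

const index = new Map(); // token -> Map(docId -> count)

function addToIndex(docId, text) {
  const counts = new Map();
  for (const token of tokenize(text)) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  for (const [token, count] of counts) {
    if (!index.has(token)) index.set(token, new Map());
    index.get(token).set(docId, count);
  }
}

addToIndex(1, "foo bar");
addToIndex(2, "foo baz");
addToIndex(3, "bar baz");

console.log([...index.get("foo").keys()]); // [1, 2]
console.log([...index.get("bar").keys()]); // [1, 3]
console.log(index.get("foo").get(1));      // 1 ("foo" appears once in doc 1)
```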
Running a query
Query string comes in. I tokenize it the same way. Then for each token I get the set (or map) of doc IDs from the index. For an AND query I need documents that contain every token. So I take the intersection of those sets. In code:
```js
function search(query, index, mode = 'and') {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  const docSets = terms.map(term => {
    const postings = index.get(term);
    return postings ? new Set(postings.keys()) : new Set();
  });
  let resultSet;
  if (mode === 'and') {
    resultSet = docSets.reduce((a, b) => new Set([...a].filter(x => b.has(x))));
  } else {
    resultSet = docSets.reduce((a, b) => new Set([...a, ...b]));
  }
  return Array.from(resultSet);
}
```
So I get one Set per term (the doc IDs that contain that term). For AND I intersect them; for OR I union. The result is a list of doc IDs. No ranking yet, just filtering.
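With the index built over a small invented corpus, you can see AND narrow the result set and OR widen it (the supporting functions are repeated so the snippet runs on its own):

```js
function tokenize(text) {
  return text.toLowerCase().replace(/[^\w\s]/g, ' ').split(/\s+/).filter(Boolean);
}

const index = new Map(); // token -> Map(docId -> count)

function addToIndex(docId, text) {
  const counts = new Map();
  for (const token of tokenize(text)) counts.set(token, (counts.get(token) ?? 0) + 1);
  for (const [token, count] of counts) {
    if (!index.has(token)) index.set(token, new Map());
    index.get(token).set(docId, count);
  }
}

function search(query, index, mode = 'and') {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  const docSets = terms.map(term => {
    const postings = index.get(term);
    return postings ? new Set(postings.keys()) : new Set();
  });
  const resultSet = mode === 'and'
    ? docSets.reduce((a, b) => new Set([...a].filter(x => b.has(x))))
    : docSets.reduce((a, b) => new Set([...a, ...b]));
  return Array.from(resultSet);
}

addToIndex(1, "foo bar");
addToIndex(2, "foo baz");
addToIndex(3, "bar baz");

console.log(search("foo bar", index));       // [1]        only doc 1 has both
console.log(search("foo bar", index, 'or')); // [1, 2, 3]  any doc with either
console.log(search("quux", index));          // []         unknown term
```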
Adding a simple rank
I had doc IDs. I wanted to sort them so the "most relevant" came first. A simple idea: score each doc by how many query terms it contains and how often. So for each doc in the result set, I’d sum up the term frequencies (the counts I stored in the index) for each query term. Higher total = more relevant.
```js
function scoreDocument(docId, terms, index) {
  let score = 0;
  for (const term of terms) {
    const postings = index.get(term);
    if (postings && postings.has(docId)) {
      score += postings.get(docId);
    }
  }
  return score;
}

function searchWithRanking(query, index, mode = 'and') {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  const docIds = search(query, index, mode);
  const scored = docIds.map(docId => ({
    docId,
    score: scoreDocument(docId, terms, index)
  }));
  return scored.sort((a, b) => b.score - a.score).map(x => x.docId);
}
```
So I reuse the same search to get the candidate set, then score each candidate by summing TF for the query terms, then sort by score descending. That gave me a minimal but usable ranking.
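To watch the scoring step in isolation, here is scoreDocument run against a hand-built index (the postings counts are invented):

```js
function scoreDocument(docId, terms, index) {
  let score = 0;
  for (const term of terms) {
    const postings = index.get(term);
    if (postings && postings.has(docId)) {
      score += postings.get(docId);
    }
  }
  return score;
}

// Invented postings: token -> Map(docId -> count)
const index = new Map([
  ["foo", new Map([[1, 3], [2, 1]])],
  ["bar", new Map([[1, 1], [3, 2]])],
]);

console.log(scoreDocument(1, ["foo", "bar"], index)); // 4 (3 + 1)
console.log(scoreDocument(2, ["foo", "bar"], index)); // 1 (only "foo")
console.log(scoreDocument(3, ["foo", "bar"], index)); // 2 (only "bar")
```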
What I ran into (and what I’d do next)
- Empty or single-term queries. If the user types nothing or just punctuation, tokenize returns [] and I had to return an empty result instead of blowing up. Small thing but easy to miss.
- Terms that don’t exist in the index. When I looked up a term and it wasn’t in the index, I had to treat it as "no documents" (empty set). So for AND queries, one missing term makes the whole result empty. That’s correct but worth testing.
- Same tokenization for query and docs. If I tokenize documents with one strategy and the query with another, I can get no matches even when the word is there. So I made sure the same tokenize function was used everywhere.
- Scale. My index was in memory. For a lot of documents I’d need to think about persistence (write the index to disk, load on startup), maybe compression, and maybe something like an LRU cache for hot queries. For the prototype I didn’t bother.
- Better ranking. TF-IDF would downweight terms that appear in almost every document (like "the") and upweight terms that are specific to a few docs. I’d add that next: compute document frequency (how many docs contain each term), then score = TF * log(N / DF) or similar, and use that for sorting.
- Crawling. To turn this into a "web" search engine I’d need a crawler: start from a URL, fetch the page, parse HTML, extract text, add to index, follow links, respect robots.txt and rate limits. That’s a whole other project. The index and query part would stay the same; only the "document loading" step would change.
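A minimal sketch of that TF-IDF scoring, assuming the index stores per-document counts as in the earlier code (the function name, corpus, and N here are invented for illustration):

```js
// index: token -> Map(docId -> count), as built earlier.
// DF falls out for free: postings.size is the number of docs containing the term.
function tfIdfScore(docId, terms, index, totalDocs) {
  let score = 0;
  for (const term of terms) {
    const postings = index.get(term);
    if (!postings || !postings.has(docId)) continue;
    const tf = postings.get(docId);          // how often the term is in this doc
    const df = postings.size;                // how many docs contain the term
    score += tf * Math.log(totalDocs / df);  // rare terms count for more
  }
  return score;
}

// A term that appears in every document scores zero, since log(N / N) = 0.
const index = new Map([
  ["the", new Map([[1, 5], [2, 2]])], // in both docs
  ["cat", new Map([[1, 2]])],         // only in doc 1
]);
console.log(tfIdfScore(1, ["the"], index, 2)); // 0
console.log(tfIdfScore(1, ["cat"], index, 2)); // 2 * ln(2), about 1.386
```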
Putting it together
I wired it up so I could load a few text files from a folder, build the index once, and then run queries in a loop. The flow was: read all files, assign each a docId, call addToIndex(docId, fileContent) for each, then for each query string call searchWithRanking(query, index) and print the doc IDs (or titles) in order. That was my v1.
So the full picture: I started by googling, found the inverted index idea (with the DEV article and Wikipedia as anchors), scoped the problem to "index and search a fixed set of documents," then implemented tokenization, index build, query (AND/OR), and a simple TF-based rank. No crawler, no TF-IDF yet, but it was a real search engine in the sense that you could ask "which documents contain these words?" and get back a ranked list. If you’re starting from zero like I was, that’s the path I’d recommend: one good reference (like the inverted index post), then break the problem into load, tokenize, index, query, rank, and implement one step at a time.