For weeks, my feed has been screaming "vector databases". Every VC, every "AI influencer", every SaaS company is suddenly an expert on Pinecone, Weaviate, or Chroma. It felt like the early days of "Big Data" all over again—a solution looking for a problem, wrapped in a lot of hype.
My default setting is skepticism. I've seen too many tech waves in India promise a revolution and deliver a slightly better dashboard. But this one felt different. The claims were specific: build apps that *understand* user intent, not just keywords. So I blocked out a week, brewed some ridiculously strong coffee, and went down the rabbit hole.
Turns out, focusing on the database is like admiring the bookshelf instead of reading the books. The real story isn't the storage; it's how we translate messy, human concepts into cold, hard math in the first place. The database is just the enabler. The embedding model is the magic.
What I Explored
My starting point was a simple, selfish problem: I want to search my own newsletter archive. Not with Ctrl+F, but by asking a question. If I search for "AI infra costs," I want to find the post where I talked about "the brutal price of GPU inference," even if the exact phrase "AI infra costs" never appeared.
Keyword search fails here. A traditional database using a LIKE '%infra costs%' query is dumb. It's a glorified string-matcher. It has zero understanding that "costs," "price," and "budget" are related. It doesn't know that "inference" is a part of "AI infra."
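To make the failure concrete, here's a toy version of that string-matching search. The two archive snippets are invented examples, but the behaviour is exactly what a `LIKE '%...%'` query gives you:

```python
# Naive keyword search: a glorified string matcher
archive = [
    "The brutal price of GPU inference is eating startup margins.",
    "Bangalore's metro expansion is finally picking up pace.",
]

query = "AI infra costs"

# Substring match, case-insensitive -- the moral equivalent of SQL LIKE
hits = [doc for doc in archive if query.lower() in doc.lower()]
print(hits)  # [] -- zero results, even though the first doc is exactly on topic
```

The first document is *about* AI infra costs, but since none of the query's literal words appear in it, keyword search returns nothing.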
This is where semantic search comes in. The core idea is to stop treating words as strings and start treating them as points in a conceptual space.
Step 1: Turning Words into Numbers (Embeddings)
This is the part everyone glosses over, but it's the entire foundation. You take a piece of text—a sentence, a paragraph, a whole document—and feed it into a special kind of neural network called an embedding model.
The model's job is to read the text and output a list of numbers. This list is a vector. For a model like all-MiniLM-L6-v2 (a popular, lightweight one), this vector has 384 dimensions (384 numbers).
Think of it like this: imagine a giant, multi-dimensional map. The embedding model is the cartographer. It places related concepts near each other on this map.
- The vector for "king" would be close to the vector for "queen."
- The vector for "Bangalore traffic" would be close to "gridlock on Outer Ring Road."
I played around with this myself using the sentence-transformers library in Python. It's surprisingly simple to get started.
```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# My sample sentences
sentences = [
    "The cost of running AI models in production is high.",
    "Startups are struggling with GPU inference budgets.",
    "India's tech ecosystem is booming in Bangalore."
]

# Generate the embeddings
embeddings = model.encode(sentences)

# Let's see what we got
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding shape:", embedding.shape)  # This will be (384,)
    print("-" * 20)
```

Running this, you get a 384-element array of floating-point numbers for each sentence. This is the "meaning" of the sentence, captured mathematically.
Step 2: The Search Problem
Okay, so I have a vector for my query ("AI infra costs") and a vector for every paragraph in my newsletter archive. How do I find the most relevant paragraphs?
The answer is math: vector similarity. You calculate the "distance" between the query vector and all the other vectors. The ones with the smallest distance (the "nearest neighbors") are the most semantically similar. A common way to do this is with Cosine Similarity, which measures the angle between two vectors. A smaller angle means they're pointing in a similar "conceptual direction."
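The math is genuinely simple: dot product divided by the product of the lengths. Here's a minimal sketch with made-up 3-dimensional vectors standing in for real 384-dimensional embeddings (the numbers are invented for illustration, not model output):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" -- pretend each encodes the meaning of a sentence
costs = np.array([0.9, 0.1, 0.0])      # "cost of running AI models"
budgets = np.array([0.8, 0.3, 0.1])    # "GPU inference budgets"
bangalore = np.array([0.0, 0.2, 0.9])  # "Bangalore tech ecosystem"

print(cosine_similarity(costs, budgets))    # high: ~0.96, similar direction
print(cosine_similarity(costs, bangalore))  # low: ~0.02, nearly orthogonal
```

The two cost-related vectors point in almost the same direction, while the Bangalore one points somewhere else entirely. That angle, not any shared keyword, is what the search ranks on.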
Step 3: The Scaling Nightmare (and Why Vector DBs Exist)
This is where I hit the wall. Calculating the cosine similarity between my query vector and a few dozen other vectors is instant. But what about a million? Or a hundred million?
If you have 1,000,000 documents, a single search would require 1,000,000 distance calculations. That's way too slow for a real-time application. You can't just for loop your way out of this one.
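You can see the shape of the problem in code. This brute-force scan (here over 100,000 random unit vectors, to keep memory sane) does one dot product per document for every single query:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend archive: 100,000 random unit vectors of dimension 384
n_docs, dim = 100_000, 384
docs = rng.standard_normal((n_docs, dim)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

# Brute force: one dot product per document, O(n_docs * dim) work per query
scores = docs @ query
top5 = np.argsort(scores)[-5:][::-1]  # indices of the 5 nearest neighbors
print(top5)
```

NumPy makes this tolerable at 100k, but the cost still grows linearly with the corpus. At hundreds of millions of documents, with many queries per second, the linear scan is exactly what you can't afford.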
*This* is the problem that vector databases solve.
They are specialized databases built for one job: finding the approximate nearest neighbors (ANN) for a given vector, incredibly fast. They don't store your text; they store the *vectors*. They use clever indexing algorithms like HNSW (Hierarchical Navigable Small World) to create a sort of searchable map of your vector space. Instead of checking every single point, they can navigate this map efficiently to find the closest matches without scanning the entire dataset.
So the workflow is:
- Ingestion: Take all your documents, run them through an embedding model, and store the resulting vectors (and a reference to the original document) in a vector DB like Chroma, Weaviate, or Pinecone. This is a one-time (or ongoing) process.
- Query: Take the user's search query, run it through the *same* embedding model to get a query vector.
- Search: Hand that query vector to the database and say, "Give me the top 5 vectors closest to this one."
- Retrieve: The database returns the IDs of the most similar documents. You then fetch the original text from a regular database (like Postgres or even a text file) to show the user.
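The whole workflow fits in a few lines. This is a toy sketch: the class below is a stand-in for a real vector DB (a real one would use an ANN index like HNSW instead of a brute-force scan), and the hand-made 3-dimensional vectors stand in for `model.encode(text)` output. The document IDs are invented:

```python
import numpy as np

class ToyVectorDB:
    """Stand-in for Chroma/Weaviate/Pinecone: stores IDs + vectors, searches by similarity."""
    def __init__(self):
        self.ids, self.vectors = [], []

    def add(self, doc_id: str, vector: np.ndarray):
        # Normalize on ingest so the dot product equals cosine similarity
        self.ids.append(doc_id)
        self.vectors.append(vector / np.linalg.norm(vector))

    def query(self, vector: np.ndarray, top_k: int = 5):
        matrix = np.stack(self.vectors)
        scores = matrix @ (vector / np.linalg.norm(vector))
        return [self.ids[i] for i in np.argsort(scores)[::-1][:top_k]]

# Toy "embeddings" -- in practice, these come from the embedding model
toy_embeddings = {
    "post-12-gpu-costs": np.array([0.9, 0.1, 0.0]),
    "post-07-hiring": np.array([0.1, 0.9, 0.2]),
    "post-03-bangalore": np.array([0.0, 0.2, 0.9]),
}

# 1. Ingestion: embed every document, store vector + ID
db = ToyVectorDB()
for doc_id, vec in toy_embeddings.items():
    db.add(doc_id, vec)

# 2-3. Query + search: embed the query with the SAME model, ask for neighbors
query_vec = np.array([0.8, 0.2, 0.1])  # pretend: encode("AI infra costs")
top = db.query(query_vec, top_k=2)

# 4. Retrieve: map the returned IDs back to the original text elsewhere
print(top)  # ['post-12-gpu-costs', ...]
```

Note what the DB holds: vectors and IDs, nothing else. The original posts live wherever they already live.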
The vector DB is a high-performance index, not the source of truth for your content. It's a critical piece of infrastructure, but it's not where the "intelligence" comes from. The intelligence is baked into the vectors by the embedding model.
What This Means
After getting my head around the mechanics, I started thinking about the second-order effects. What does this actually change for someone building in India?
First, this is a game-changer for vernacular India. Keyword search is a disaster for Indic languages. Transliterations (kaise vs kese), synonyms, and dialectical differences make it impossible to build a good search experience with string matching. Semantic search blows past this. If you have an embedding model that understands Hindi, a user searching for "GST kaise file karein" can find a document that explains the "Goods and Services Tax filing process" in Hinglish, even if the exact keywords don't match. This unlocks huge amounts of content for Tier-2 and Tier-3 audiences. Every government portal, every e-commerce site, every content platform needs this, yesterday.
Second, the moat is not your vector database. Using Pinecone is not a competitive advantage. It's a dependency. It's a commodity. Your real moat is twofold:
- Your proprietary data: The unique dataset you create embeddings from.
- Your choice of (or fine-tuned) embedding model: A generic model is good, but a model fine-tuned on your specific domain's language (e.g., Indian legal jargon, or medical terminology) will be exponentially better. The startups that win will be the ones that master the data and the models, not the ones who are best at configuring a managed database.
Third, this democratizes "AI-powered features." For years, building a good recommendation engine or a conceptual search required a team of PhDs. Now, a single developer in Bangalore can spin up a ChromaDB instance, pull a model from Hugging Face, and build a surprisingly powerful semantic search for their app in a weekend. This lowers the barrier to entry for intelligence. Think about a small D2C brand on Shopify. They can now have product search that understands user intent. A user searching for "something comfortable to wear at home" might find a pair of cotton pajamas, even if the product description just says "loungewear set." This was previously only available to Amazon or Flipkart.
The caveat? Cost. Generating embeddings (especially using paid APIs like OpenAI's) and running a managed vector DB isn't free. For a bootstrapped Indian startup, every API call adds up. The tension between using a powerful-but-expensive proprietary model versus a good-enough-and-free open-source one will define the architecture of many early-stage products here.
What I'm Doing With This
Theory is cheap. The only way to really learn is to build.
My immediate next step is to actually build the semantic search for my own newsletter archive. I'm not going to over-engineer it. I'm taking the simplest path possible to get a feel for the end-to-end process.
My stack will be:
- Embedding Model: sentence-transformers with the all-MiniLM-L6-v2 model. It's open-source, runs on my laptop's CPU, and is surprisingly good for English text. No API keys, no costs.
- Vector Database: ChromaDB. I'll run it locally in-memory or using a Docker container. It's perfect for small-to-medium projects and the developer experience seems straightforward. I don't need a managed service that can handle a billion vectors yet. I have a few hundred paragraphs.
- Application: A simple Streamlit or Flask front-end where I can type in a query and see the top 3 most relevant posts from my archive.
My goal is to document the process: the code, the setup, the little "gotchas" I run into. How good is the out-of-the-box model? What kind of queries work well, and which ones fail? How do I structure my text before embedding it—by sentence, by paragraph? These are the practical questions that you only answer by getting your hands dirty.
Beyond that, I'm already thinking about the next level. What about fine-tuning an embedding model on my specific writing style? Or exploring multimodal models that can embed both the text and the images from my posts? That feels like where the real leverage is—moving from a generic understanding of language to a specific, domain-aware intelligence.
But first, I need to make the damn thing work. I'll report back.