Hybrid RAG with Semantic Chunking
Scaling RAG Hybrid Search on 780K Pages
Piotr Chlebek · 2026-4-(work in progress)
Abstract: (work in progress)
Keywords: (work in progress)
(work in progress)
Architecting a Private, Vertical Search Engine for a Library of 6,000 Documents
The Library I Always Wanted: Fast, Local, and Under Control
This project was born from my love for non-fiction books and knowledge-dense industry articles. For a long time, I dreamed of building a system that could "understand" and process my large collection of documents on topics I care about.
This text was born from a mix of three sparks. First, I wanted to expand on ideas I won't have time to cover during my presentation: AI Tinkerers Gdańsk Meetup – April 23rd. Second, writing is pure joy for me and the best way to organize my thoughts. But the most important catalyst was my colleagues in the industry: it was their encouragement and curiosity about my struggles with new technologies that finally got me to sit down at the keyboard. Since they felt these experiences were worth writing about, I couldn't let them down.
The 5 Million Prompts
After processing ~6k PDFs, I was left with 780,000 pages. For each page I ran several metadata-extraction prompts, so the workload adds up to about 5 million LLM queries (780,000 pages × roughly 6 prompts per page).
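For a sense of scale, here is the back-of-envelope arithmetic as a tiny sketch. The prompt names are hypothetical placeholders I made up for illustration; only the page count comes from the pipeline above.

```python
# Back-of-envelope query count. The prompt names are hypothetical
# placeholders; only the page count comes from the pipeline above.
metadata_prompts = ["summary", "keywords", "entities",
                    "topics", "questions", "language"]  # ~6 per page (assumed)

pages = 780_000
total_queries = pages * len(metadata_prompts)
print(f"{total_queries:,} LLM queries")  # 4,680,000 -> "about 5 million"
```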
Processing 5 million prompts in the cloud is not cheap. Even with modern features like prompt caching or batch pricing (discounted rates in exchange for slower turnaround), the cost was still too high for me. To save money, I decided to run everything on my own local hardware instead. As a major bonus, keeping everything local ensures total privacy for my data. You can read how I did it with vLLM here: Serving the LLM locally with vLLM.
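To see why the cloud route stings, here is a rough cost estimate. Every number below except the query count is an assumption on my side; swap in your own model's rates.

```python
# Assumption-heavy cloud-cost estimate; token counts and per-token rates
# are illustrative placeholders, not quotes from any vendor.
queries = 5_000_000
input_tokens = 800     # assumption: one page of text plus the instruction
output_tokens = 150    # assumption: short structured metadata answer

price_in = 2.50 / 1_000_000    # $/input token (assumed mid-tier model rate)
price_out = 10.00 / 1_000_000  # $/output token (assumed)

cost = queries * (input_tokens * price_in + output_tokens * price_out)
print(f"~${cost:,.0f}")  # ~$17,500 at these assumed rates
```

Even halving every assumption leaves a bill in the thousands of dollars, which is exactly the gap a one-time hardware investment can close.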
Of course, there are challenges. Local AI usually has higher latency (slower response times) than big cloud providers. Because of this, I have to be careful about how I use LLMs for certain real-time tasks, such as reranking search results.
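As an illustration of where that latency bites, below is a minimal LLM-as-reranker sketch against a local vLLM server (vLLM exposes an OpenAI-compatible API). The endpoint, model name, and scoring prompt are my assumptions for the sketch, not this project's actual code.

```python
# Minimal LLM-as-reranker sketch against a local vLLM server.
# The endpoint, model name, and prompt wording are assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def rerank(query: str, passages: list[str],
           model: str = "my-local-model") -> list[str]:
    """Score each passage 0-10 with the LLM and return passages best-first."""
    scored = []
    for passage in passages:
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            max_tokens=4,  # only a short numeric score is needed
            messages=[{
                "role": "user",
                "content": (
                    "Rate from 0 to 10 how well the passage answers the query. "
                    "Reply with a single number.\n"
                    f"Query: {query}\nPassage: {passage}"
                ),
            }],
        )
        latency = time.perf_counter() - t0  # the real-time budget to watch
        try:
            score = float(resp.choices[0].message.content.strip())
        except (TypeError, ValueError):
            score = 0.0  # unparsable reply: rank it last
        scored.append((score, latency, passage))
    return [p for _, _, p in sorted(scored, key=lambda s: s[0], reverse=True)]
```

One call per passage keeps the sketch readable but multiplies round-trips; in practice you would rerank only the top few candidates from the hybrid retriever, or batch the scoring, to stay inside a real-time budget.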
In this post:
- (work in progress)
Related Posts:
- AI Tinkerers Gdańsk Meetup – April 23rd
- RAG isn't about AI; it's about engineering
- LLM Wiki vs. RAG Knowledge Base
- Serving the LLM locally with vLLM
Image source: OpenAI DALL-E 3 (04.2026).




