Case study · 2026

DeepScholar

AI-powered RAG copilot that converts academic PDFs into a grounded, citable knowledge base.

Overview

DeepScholar is a research assistant that ingests academic PDFs and exposes them as a semantic knowledge base. Users submit questions and receive grounded answers backed by exact source passages, shown as inline, expandable citations. The architecture decouples ingestion from retrieval, keeping each stage independently testable and scalable.

Problem

Manual literature review is slow and hard to scale across dozens of papers
Keyword search misses semantic relationships between concepts
General-purpose LLMs hallucinate citations and fabricate references
Outputs lack verifiable sourcing, making them unreliable for academic work

Architecture

FastAPI handles PDF ingestion through a multi-stage pipeline: PyMuPDF extracts raw text, a sentence-aware chunker splits content with configurable overlap, and OpenAI's embedding API generates vector representations, which are stored in PostgreSQL with pgvector on Supabase. At query time, the RAG pipeline runs a cosine similarity search over the indexed embeddings and constructs a prompt that instructs the model to answer strictly from the retrieved context and to emit structured JSON citations for every claim.
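The chunking and retrieval steps described above can be illustrated with a minimal, dependency-free sketch. It is not the project's actual code: the function names, the regex-based sentence splitter, the `max_chars` and `overlap` parameters, and the in-memory list standing in for the pgvector table are all assumptions made for illustration. In pgvector itself, the same ranking would typically be expressed in SQL with the cosine distance operator `<=>`.

```python
import math
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: list[str], max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars.

    Each new chunk starts with the last `overlap` sentences of the previous
    one, so context that straddles a boundary appears in both chunks.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the chunk ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


if __name__ == "__main__":
    text = "Attention weighs tokens. It scales quadratically! Can it be sparse?"
    print(chunk_sentences(split_sentences(text), max_chars=50, overlap=1))
    # Toy 2-d vectors stand in for real embedding vectors.
    index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    print(top_k([1.0, 0.1], index, k=2))
```

Because chunks are built from whole sentences, a chunk can run slightly past `max_chars` when the carried-over sentences are long; a production chunker would likely count tokens rather than characters.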

Technical Highlights

Magic-byte validation rejects non-PDF payloads before any processing begins
Sentence-aware chunking with configurable overlap preserves cross-boundary context
RAG prompt contract: model must cite retrieved passages or explicitly decline, never fabricate
Citations emitted as structured JSON: source filename, chunk_id, and verbatim passage
pgvector cosine similarity search with IVFFlat indexing for sub-100 ms P99 retrieval
Supabase schema designed for multi-document workspaces and per-user isolation
Async ingestion pipeline keeps the API responsive under concurrent uploads
Deterministic document fingerprinting prevents duplicate vector entries on re-upload
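Three of the highlights above (magic-byte validation, document fingerprinting, and structured citations) can be sketched with the Python standard library alone. This is a hedged illustration, not the project's implementation: the helper names, the choice of SHA-256 over the raw bytes as the fingerprint, and the JSON keys `citations`, `source`, and `passage` are assumptions; only `chunk_id` is named in the list above.

```python
import hashlib
import json

# A PDF file begins with the ASCII header "%PDF-" followed by a version number.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(payload: bytes) -> bool:
    """Strict magic-byte check: reject anything that does not start with %PDF-."""
    return payload.startswith(PDF_MAGIC)


def fingerprint(payload: bytes) -> str:
    """Deterministic fingerprint (assumption: SHA-256 of the raw file bytes)."""
    return hashlib.sha256(payload).hexdigest()


def validate_citations(raw_json: str, retrieved: dict[str, str]) -> list[dict]:
    """Keep only citations that point at a retrieved chunk and quote it verbatim.

    `retrieved` maps chunk_id -> chunk text for the chunks placed in the prompt.
    """
    data = json.loads(raw_json)
    valid = []
    for citation in data.get("citations", []):
        chunk_text = retrieved.get(citation.get("chunk_id"))
        passage = citation.get("passage", "")
        if chunk_text is not None and passage and passage in chunk_text:
            valid.append(citation)
    return valid


if __name__ == "__main__":
    upload = b"%PDF-1.7\n(fake body for illustration)"
    seen_fingerprints: set[str] = set()

    if looks_like_pdf(upload):
        fp = fingerprint(upload)
        if fp in seen_fingerprints:
            print("duplicate upload, skipping embedding")
        else:
            seen_fingerprints.add(fp)
            print("new document:", fp[:12])

    retrieved = {"c1": "Transformers rely on attention."}
    model_output = json.dumps({
        "answer": "They rely on attention.",
        "citations": [
            {"source": "paper.pdf", "chunk_id": "c1", "passage": "rely on attention"},
            {"source": "paper.pdf", "chunk_id": "c9", "passage": "not in context"},
        ],
    })
    print(validate_citations(model_output, retrieved))
```

Checking each citation against the retrieved chunks on the server side means a citation that references an unknown `chunk_id`, or quotes text that is not in the chunk, can be dropped before it reaches the UI, rather than relying on the prompt alone.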

Frontend

Drag-and-drop PDF upload with live ingestion progress feedback
Chat interface streams answers with inline expandable citation cards
Each citation surfaces the source passage and document origin
Single-workspace UI optimized for focused, distraction-free research sessions

Tech stack

Next.js 14
TypeScript
Tailwind CSS
FastAPI
Python
OpenAI
PostgreSQL
pgvector
Supabase
PyMuPDF