Case study · 2026

DeepScholar

AI-powered RAG copilot that converts academic PDFs into a grounded, citable knowledge base.

Overview

DeepScholar is a research assistant that ingests academic PDFs and exposes them as a semantic knowledge base. Users ask questions and receive grounded answers backed by exact source passages, shown as inline, expandable citations. The architecture decouples ingestion from retrieval, so each stage can be tested and scaled independently.

Problem

  • Manual literature review is slow and hard to scale across dozens of papers
  • Keyword search misses semantic relationships between concepts
  • General-purpose LLMs hallucinate citations and fabricate references
  • Outputs lack verifiable sourcing, making them unreliable for academic work

Architecture

FastAPI handles PDF ingestion through a multi-stage pipeline: PyMuPDF extracts raw text, a sentence-aware chunker splits the content with configurable overlap, and OpenAI's embedding API generates vector representations that are stored in PostgreSQL with pgvector on Supabase. At query time, the RAG pipeline runs a cosine similarity search over the indexed embeddings and builds a prompt that instructs the model to answer strictly from the retrieved context and to emit a structured JSON citation for every claim.
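The chunking and retrieval steps described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the regex-based sentence splitter, and the `max_chars` and `overlap` parameters are assumptions, and the in-memory `top_k` search only stands in for the similarity query that pgvector would run inside PostgreSQL. In the real pipeline the vectors would come from OpenAI's embedding API rather than being written by hand.

```python
import math
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: list[str], max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of each chunk are repeated at the start of
    the next one, so context that spans a chunk boundary is not lost.
    """
    chunks: list[str] = []
    current: list[str] = []
    fresh = 0  # sentences in `current` that are not carried-over overlap
    for sentence in sentences:
        candidate_len = len(" ".join(current + [sentence]))
        if fresh > 0 and candidate_len > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap > 0 else []
            fresh = 0
        current.append(sentence)
        fresh += 1
    if fresh > 0:
        chunks.append(" ".join(current))
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the chunk ids of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]
```

Splitting on sentence boundaries first, and only then packing by size, is what keeps a chunk from ending mid-sentence. The overlap trades some duplicated storage for better recall on passages that straddle two chunks.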

Technical Highlights

  • Magic-byte validation rejects non-PDF payloads before any processing begins
  • Sentence-aware chunking with configurable overlap preserves cross-boundary context
  • RAG prompt contract: the model must cite retrieved passages or explicitly decline to answer; it may not fabricate sources
  • Citations emitted as structured JSON: source filename, chunk_id, and verbatim passage
  • pgvector cosine similarity search with IVFFlat indexing for sub-100 ms P99 retrieval
  • Supabase schema designed for multi-document workspaces and per-user isolation
  • Async ingestion pipeline keeps the API responsive under concurrent uploads
  • Deterministic document fingerprinting prevents duplicate vector entries on re-upload
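
Three of the highlights above (magic-byte validation, document fingerprinting, and the structured citation contract) can be illustrated with a short Python sketch. It is a set of assumptions rather than the project's implementation: the function names are hypothetical, SHA-256 over the raw bytes is just one possible fingerprint, and the `answer`/`citations` envelope and the `source` and `passage` field names are guesses at a shape that carries the source filename, `chunk_id`, and verbatim passage.

```python
import hashlib
import json

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header
REQUIRED_FIELDS = {"source", "chunk_id", "passage"}  # assumed field names


def looks_like_pdf(payload: bytes) -> bool:
    """Strict magic-byte check: the payload must start with the PDF header."""
    return payload.startswith(PDF_MAGIC)


def fingerprint(payload: bytes) -> str:
    """Deterministic content hash; identical bytes always give the same value,
    so a re-upload can be detected before any new vectors are written."""
    return hashlib.sha256(payload).hexdigest()


def parse_citations(model_output: str, retrieved: dict[str, str]) -> list[dict]:
    """Parse the model's JSON and reject citations that are not grounded.

    `retrieved` maps chunk_id to the chunk text that was placed in the prompt.
    A citation passes only if it names a retrieved chunk and its passage
    appears verbatim in that chunk.
    """
    data = json.loads(model_output)
    citations = data.get("citations", [])
    for citation in citations:
        missing = REQUIRED_FIELDS - set(citation)
        if missing:
            raise ValueError(f"citation is missing fields: {sorted(missing)}")
        chunk_text = retrieved.get(citation["chunk_id"])
        if chunk_text is None or citation["passage"] not in chunk_text:
            raise ValueError(f"ungrounded citation for chunk {citation['chunk_id']!r}")
    return citations
```

Checking each passage against the retrieved chunk on the server side means the citation contract does not rest on the prompt alone: a fabricated or paraphrased quote fails validation even if the model ignores its instructions.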

Frontend

  • Drag-and-drop PDF upload with live ingestion progress feedback
  • Chat interface streams answers with inline expandable citation cards
  • Each citation surfaces the source passage and document origin
  • Single-workspace UI optimized for focused, distraction-free research sessions

Tech stack

Next.js 14 · TypeScript · Tailwind CSS · FastAPI · Python · OpenAI · PostgreSQL · pgvector · Supabase · PyMuPDF