Scientific Chatbot - Quick Start Guide

Get your scientific chatbot API running in minutes!


📋 Prerequisites

  • Python 3.9+ or Docker
  • 4GB+ RAM recommended
  • 10GB+ disk space for documents and embeddings

🚀 Quick Start (3 Methods)

Method 3: Docker (Manual)

# 1. Build image
docker build -t scientific-chatbot-api .

# 2. Run container
docker run -d \
  -p 8000:8000 \
  -v "$(pwd)/data:/app/data" \
  --name chatbot-api \
  scientific-chatbot-api

# The API is now available at http://localhost:8000

📚 First Steps

1. Check API Health

# Using cURL
curl http://localhost:8000/health

# Using Python client
python api_client.py health
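
A freshly started container may need a moment before it answers, so a script that runs right after `docker run` can poll the health endpoint instead of calling it once. The sketch below is an illustration, not part of the project: `wait_for_health` and `http_health_check` are hypothetical helper names, and it assumes only that `/health` returns HTTP 200 once the API is ready.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/health"  # endpoint from the cURL example


def http_health_check(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_health(
    check: Callable[[], bool] = http_health_check,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Call `check` until it succeeds or `attempts` checks have been made."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return False
```

Calling `wait_for_health()` with the defaults polls for roughly 30 seconds; passing a custom `check` callable makes the loop easy to test without a running server.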

2. Upload Your First Document

# Using cURL
curl -X POST http://localhost:8000/ingest/document \
  -F "file=@your_paper.pdf"

# Using Python client
python api_client.py ingest your_paper.pdf

# With metadata
python api_client.py ingest your_paper.pdf \
  --metadata '{"author":"Smith et al.","journal":"Nature"}'
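
Hand-written JSON inside shell quotes is easy to get wrong. As a rough sketch, the same command line can be generated from a Python dict; `build_ingest_command` is a hypothetical helper, and the only things taken from this guide are the `ingest` subcommand and the `--metadata` flag shown above.

```python
import json
import shlex
from typing import List


def build_ingest_command(path: str, metadata: dict) -> List[str]:
    """Build the argument list for `python api_client.py ingest ... --metadata ...`."""
    return [
        "python", "api_client.py", "ingest", path,
        "--metadata", json.dumps(metadata),  # json.dumps guarantees valid JSON
    ]


command = build_ingest_command(
    "your_paper.pdf",
    {"author": "Smith et al.", "journal": "Nature"},
)

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The list form can also be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting altogether.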

3. Search the Knowledge Base

# Using cURL
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query":"chronic pain treatment","k":5}'

# Using Python client
python api_client.py search "chronic pain treatment"

4. Ask a Question

# Using Python client
python api_client.py qa "What are effective treatments for chronic pain?"

5. View Statistics

# Using Python client
python api_client.py stats

📁 Project Structure

scientific-chatbot/
├── api.py                      # FastAPI server
├── rag_pipeline.py             # RAG implementation
├── api_client.py               # CLI client
├── requirements.txt            # Python dependencies
├── README.md                   # Main documentation
├── API_DOCUMENTATION.md        # API reference
├── QUICKSTART.md               # This file
├── Dockerfile                  # Docker configuration
├── docker-compose.yml          # Docker Compose setup
├── data/
│   ├── chroma_db/              # Vector database
│   ├── uploads/                # Uploaded documents
│   ├── temp/                   # Temporary files
│   └── exports/                # Exported data
├── logs/
│   └── api.log                 # API logs
└── tests/
    └── test_api.py             # Test suite

💡 Common Use Cases

Use Case 1: Build a Research Library

# Ingest entire directory of PDFs
python api_client.py ingest ./papers --recursive

# Or specific file types
python api_client.py ingest ./documents \
  --extensions .pdf,.txt,.md
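
To preview which files a directory ingest would pick up, the selection can be approximated in a few lines of Python. This is only a sketch of what `--recursive` and `--extensions` plausibly do; the actual matching rules in `api_client.py` may differ (for example in case sensitivity), and `collect_files` is a hypothetical name.

```python
from pathlib import Path
from typing import Iterable, List


def collect_files(
    root: str,
    extensions: Iterable[str] = (".pdf",),
    recursive: bool = False,
) -> List[Path]:
    """Return files under `root` whose suffix is in `extensions`, sorted by path."""
    wanted = {ext.lower() for ext in extensions}  # compare suffixes case-insensitively
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() in wanted
    )
```

For example, `collect_files("./documents", [".pdf", ".txt", ".md"], recursive=True)` lists every matching file below `./documents`, which is handy for a dry run before a large ingest.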

Use Case 2: Search for Specific Topics

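# search_example.py
import requests

response = requests.post(
    "http://localhost:8000/search",
    json={
        "query": "cognitive behavioral therapy for pain",
        "k": 5,  # number of results to return
        "filter_metadata": {"document_type": "research_paper"},
    },
    timeout=30,
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body

# Print the source file, relevance score, and a short preview of each hit
for result in response.json():
    print(f"Source: {result['filename']}")
    print(f"Score: {result['relevance_score']:.3f}")
    print(f"Preview: {result['content'][:200]}...\n")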
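
Weak matches can be dropped on the client before they are displayed. The sketch below assumes that a higher `relevance_score` means a better match and uses an arbitrary cutoff of 0.5; both are assumptions, not documented behavior, and `filter_by_score` is a hypothetical helper. The sample values are made up purely to show the shape of the data.

```python
from typing import Dict, List


def filter_by_score(results: List[Dict], min_score: float = 0.5) -> List[Dict]:
    """Keep results at or above `min_score`, best match first."""
    kept = [r for r in results if r["relevance_score"] >= min_score]
    return sorted(kept, key=lambda r: r["relevance_score"], reverse=True)


# Illustrative records using the field names from the search example.
sample_results = [
    {"filename": "a.pdf", "relevance_score": 0.42, "content": "..."},
    {"filename": "b.pdf", "relevance_score": 0.91, "content": "..."},
    {"filename": "c.pdf", "relevance_score": 0.67, "content": "..."},
]

for hit in filter_by_score(sample_results):
    print(f"{hit['filename']}: {hit['relevance_score']:.2f}")
```

In a real script, `sample_results` would be replaced by `response.json()` from the search request.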

Use Case 3: Question Answering System

# qa_example.py
import requests

response = requests.post(
    "http://localhost:8000/qa",
    json={
        "question": "What is the recommended first-line treatment for chronic back pain?",
        "k_context": 5,  # number of chunks used as context
        "return_sources": True,
    },
    timeout=60,
)
response.raise_for_status()  # fail on HTTP errors instead of parsing an error body

qa_result = response.json()
print(f"Question: {qa_result['question']}")
print(f"Confidence: {qa_result['confidence']:.2f}")
print("\nSources:")
for source in qa_result['sources']:
    print(f"  [{source['source_number']}] {source['filename']}")

Use Case 4: Batch Processing

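# batch_ingest.py
import requests
from pathlib import Path

INGEST_URL = "http://localhost:8000/ingest/document"

# Upload every PDF in ./papers, one request per file, in a stable order
for file_path in sorted(Path("./papers").glob("*.pdf")):
    with open(file_path, "rb") as f:
        response = requests.post(
            INGEST_URL,
            files={"file": (file_path.name, f)},
            timeout=300,
        )
    # Report failures and continue so one bad file does not stop the batch
    if response.ok:
        print(f"{file_path.name}: {response.json()['status']}")
    else:
        print(f"{file_path.name}: failed (HTTP {response.status_code})")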

🔧 Configuration

Environment Variables

Create a .env file:

# Embedding Model
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Vector Store
CHROMA_PERSIST_DIR=./data/chroma_db

# Chunking
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Retrieval
K_RETRIEVE=5

# API Server
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=False
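
How the server reads these values is not shown in this guide. As a minimal sketch, assuming the variables are already present in the process environment (for instance via Docker's `--env-file` option or a loader such as python-dotenv), they could be parsed and validated as follows. `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the file above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    embedding_model: str
    chroma_persist_dir: str
    chunk_size: int
    chunk_overlap: int
    k_retrieve: int
    api_host: str
    api_port: int
    api_reload: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from `env`, falling back to the defaults in this guide."""
    settings = Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chroma_db"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        k_retrieve=int(env.get("K_RETRIEVE", "5")),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        # Environment values are strings, so "False" must be parsed explicitly.
        api_reload=env.get("API_RELOAD", "False").strip().lower() in {"1", "true", "yes"},
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
    return settings
```

Validating once at startup surfaces a mistyped value, such as a non-numeric `API_PORT`, immediately rather than during the first request.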

Model Selection

For medical/clinical documents:

EMBEDDING_MODEL=dmis-lab/biobert-base-cased-v1.1

For scientific papers:

EMBEDDING_MODEL=allenai/scibert_scivocab_uncased

For general purpose (default):

EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=api --cov=rag_pipeline

# Run specific test
pytest tests/test_api.py::test_search_basic -v

# Run only fast tests (skip slow/performance tests)
pytest tests/ -v -m "not slow"

📊 Monitoring

View Logs

# Docker Compose
docker-compose logs -f api

# Docker (manual container from Method 3)
docker logs -f chatbot-api

# Local
tail -f logs/api.log

Check Statistics

# Total documents and chunks
python api_client.py stats

# List all documents
python api_client.py list

# Filter by type
python api_client.py list --type research_paper --limit 20

🔍 Troubleshooting

Problem: API won’t start

Solution:

# Check if port 8000 is already in use
lsof -i :8000  # Linux/Mac
netstat -ano | findstr :8000  # Windows

# Use a different port
API_PORT=8080 python api.py
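
Where `lsof` or `netstat` is not available, the same check can be done from Python on any platform. This is a small sketch using only the standard library; `port_in_use` is a hypothetical helper, and it reports only whether something accepts TCP connections on the port, not which process owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8000)` returns `True` before the API has been started, another program already holds the port and a different `API_PORT` is needed.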

Problem: Model download is slow

Solution:

# Pre-download the model
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"

Problem: Out of memory during ingestion

Solution:

  • Reduce chunk size: CHUNK_SIZE=500
  • Process documents one at a time instead of in a batch
  • Increase system swap space
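
To see what changing CHUNK_SIZE does, the sketch below splits text with a simple character-based sliding window. This is an assumption made for illustration: the real splitter in `rag_pipeline.py` may work on tokens or sentences instead, and `chunk_text` is a hypothetical function.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


document = "x" * 2500  # stand-in for the extracted text of one document
print(len(chunk_text(document, chunk_size=1000, chunk_overlap=200)))
print(len(chunk_text(document, chunk_size=500, chunk_overlap=200)))
```

Under these assumptions, a smaller CHUNK_SIZE yields more chunks, each of which is smaller. Note that with the overlap left at 200, a larger share of every chunk is repeated text, so lowering CHUNK_OVERLAP along with CHUNK_SIZE may be worth considering.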

Problem: Search returns no results

Solution:

# Check if documents are ingested
python api_client.py stats

# Verify vector store
ls -lh data/chroma_db/

# Try broader search query
python api_client.py search "pain" -k 10

🎯 Next Steps

  1. Read the full documentation: README.md and API_DOCUMENTATION.md
  2. Explore the API: Visit http://localhost:8000/docs
  3. Customize the system: Edit the .env file for your needs
  4. Integrate with frontend: Use the API with React, R Shiny, etc.
  5. Add LLM integration: Connect Claude API for answer generation

📞 Support

  • Documentation: See README.md and API_DOCUMENTATION.md
  • API Reference: http://localhost:8000/docs (when running)
  • Tests: Run pytest tests/ to verify setup

🚀 Production Checklist

Before deploying to production:


Version: 1.0.0
Last Updated: November 2024

Happy building! 🎉