Scientific Chatbot - Quick Start Guide

Get your scientific chatbot API running in minutes!

📋 Prerequisites

Python 3.9+ or Docker
4GB+ RAM recommended
10GB+ disk space for documents and embeddings

🚀 Quick Start (3 Methods)

Method 1: Local Python (Recommended for Development)

# 1. Clone/download the project
cd scientific-chatbot

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start the API server
python api.py

# 5. Open your browser
# Visit: http://localhost:8000/docs

That’s it! The API is now running.

Method 2: Docker (Recommended for Production)

# 1. Build and run with Docker Compose
docker-compose up -d

# 2. Check status
docker-compose ps

# 3. View logs
docker-compose logs -f api

# API is now available at http://localhost:8000

Method 3: Docker (Manual)

# 1. Build image
docker build -t scientific-chatbot-api .

# 2. Run container
docker run -d \
  -p 8000:8000 \
  -v "$(pwd)/data:/app/data" \
  --name chatbot-api \
  scientific-chatbot-api

# API is now running

📚 First Steps

1. Check API Health

# Using cURL
curl http://localhost:8000/health

# Using Python client
python api_client.py health
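
Right after startup the server may need a few seconds before /health responds, especially while the embedding model loads. A minimal sketch of a readiness poll, using only the standard library; the helper names probe_health and wait_until_healthy are hypothetical and are not part of api_client.py, and the sketch assumes /health returns HTTP 200 when the service is ready:

```python
import time
import urllib.error
import urllib.request


def probe_health(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200 (assumed 'ready' signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_healthy(probe=probe_health, attempts=30, delay=1.0, sleep=time.sleep):
    """Call `probe` up to `attempts` times; return True as soon as it succeeds."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # Pause between attempts, but not after the last one
    return False
```

Calling wait_until_healthy() with the defaults polls http://localhost:8000/health roughly once per second for up to 30 attempts. The probe and sleep arguments are injectable so the loop can be exercised without a running server.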

2. Upload Your First Document

# Using cURL
curl -X POST http://localhost:8000/ingest/document \
  -F "file=@your_paper.pdf"

# Using Python client
python api_client.py ingest your_paper.pdf

# With metadata
python api_client.py ingest your_paper.pdf \
  --metadata '{"author":"Smith et al.","journal":"Nature"}'

3. Search the Knowledge Base

# Using cURL
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query":"chronic pain treatment","k":5}'

# Using Python client
python api_client.py search "chronic pain treatment"

4. Ask a Question

# Using Python client
python api_client.py qa "What are effective treatments for chronic pain?"

5. View Statistics

# Using Python client
python api_client.py stats

📁 Project Structure

scientific-chatbot/
├── api.py                      # FastAPI server
├── rag_pipeline.py             # RAG implementation
├── api_client.py               # CLI client
├── requirements.txt            # Python dependencies
├── README.md                   # Main documentation
├── API_DOCUMENTATION.md        # API reference
├── QUICKSTART.md              # This file
├── Dockerfile                  # Docker configuration
├── docker-compose.yml          # Docker Compose setup
├── data/
│   ├── chroma_db/             # Vector database
│   ├── uploads/               # Uploaded documents
│   ├── temp/                  # Temporary files
│   └── exports/               # Exported data
├── logs/
│   └── api.log                # API logs
└── tests/
    └── test_api.py            # Test suite

💡 Common Use Cases

Use Case 1: Build a Research Library

# Ingest entire directory of PDFs
python api_client.py ingest ./papers --recursive

# Or specific file types
python api_client.py ingest ./documents \
  --extensions .pdf,.txt,.md
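
To preview which files a directory ingest would pick up, the selection can be approximated in a few lines. This is a sketch under the assumption that --recursive walks subdirectories and --extensions matches file suffixes case-insensitively; the actual behavior of api_client.py may differ, and collect_documents is a hypothetical helper:

```python
from pathlib import Path


def collect_documents(root, extensions=(".pdf", ".txt", ".md"), recursive=True):
    """Return sorted file paths under `root` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    root = Path(root)
    # rglob descends into subdirectories; glob stays at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() in wanted
    )
```

For example, collect_documents("./papers", extensions=(".pdf",)) lists every PDF below ./papers, which makes it easy to check the file count before starting a long ingest.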

Use Case 2: Search for Specific Topics

# search_example.py
import requests

response = requests.post(
    "http://localhost:8000/search",
    json={
        "query": "cognitive behavioral therapy for pain",
        "k": 5,
        "filter_metadata": {"document_type": "research_paper"}
    }
)
response.raise_for_status()  # Fail early on a 4xx/5xx response instead of parsing an error body

for result in response.json():
    print(f"Source: {result['filename']}")
    print(f"Score: {result['relevance_score']:.3f}")
    print(f"Preview: {result['content'][:200]}...\n")
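
Because documents are split into chunks, one paper can appear several times in the same result list. The sketch below keeps only the strongest hit per file. It assumes each result carries the filename and relevance_score fields used above and that a higher relevance_score means a better match; best_hit_per_document is a hypothetical helper, not part of the API:

```python
def best_hit_per_document(results):
    """Keep only the highest-scoring hit for each filename, best first."""
    best = {}
    for hit in results:
        name = hit["filename"]
        # Replace the stored hit only when this one scores strictly higher
        if name not in best or hit["relevance_score"] > best[name]["relevance_score"]:
            best[name] = hit
    return sorted(best.values(), key=lambda hit: hit["relevance_score"], reverse=True)
```

Passing response.json() from the search example through this function yields at most one entry per source document.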

Use Case 3: Question Answering System

# qa_example.py
import requests

response = requests.post(
    "http://localhost:8000/qa",
    json={
        "question": "What is the recommended first-line treatment for chronic back pain?",
        "k_context": 5,
        "return_sources": True
    }
)
response.raise_for_status()  # Fail early on a 4xx/5xx response

qa_result = response.json()
print(f"Question: {qa_result['question']}")
print(f"Confidence: {qa_result['confidence']:.2f}")
print("\nSources:")
for source in qa_result['sources']:
    print(f"  [{source['source_number']}] {source['filename']}")

Use Case 4: Batch Processing

# batch_ingest.py
import requests
from pathlib import Path

files_to_upload = Path('./papers').glob('*.pdf')

for file_path in files_to_upload:
    with open(file_path, 'rb') as f:
        files = {'file': (file_path.name, f)}
        response = requests.post(
            "http://localhost:8000/ingest/document",
            files=files
        )
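        if response.ok:
            print(f"{file_path.name}: {response.json()['status']}")
        else:
            # Report the failure and continue with the next file
            print(f"{file_path.name}: failed (HTTP {response.status_code})")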
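
Long batch runs tend to hit occasional transient failures such as timeouts or a briefly busy server. One common way to handle them is a retry with exponential backoff. The sketch below is generic: upload is any callable you supply that takes a path and raises on failure (for example, a small function wrapping the requests.post call above), and upload_with_retry is a hypothetical helper name:

```python
import time


def upload_with_retry(upload, path, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call upload(path); on an exception, wait and retry with doubling delays."""
    for attempt in range(retries):
        try:
            return upload(path)
        except Exception:
            # Broad catch for brevity; narrow it to network errors in real code
            if attempt == retries - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

With the defaults, a file is attempted up to three times, waiting 1 second and then 2 seconds between attempts.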

🔧 Configuration

Environment Variables

Create a .env file:

# Embedding Model
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Vector Store
CHROMA_PERSIST_DIR=./data/chroma_db

# Chunking
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Retrieval
K_RETRIEVE=5

# API Server
API_HOST=0.0.0.0
API_PORT=8000
API_RELOAD=False
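
How api.py reads these values is not shown in this guide. As an illustration only, the sketch below parses the same variables from the process environment with the defaults listed above; Settings, as_bool, and load_settings are hypothetical names. Note the explicit boolean parsing: in Python, bool("False") evaluates to True, so API_RELOAD needs a string comparison.

```python
import os
from dataclasses import dataclass


def as_bool(value):
    """Interpret common truthy strings; everything else counts as False."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    embedding_model: str
    chroma_persist_dir: str
    chunk_size: int
    chunk_overlap: int
    k_retrieve: int
    api_host: str
    api_port: int
    api_reload: bool


def load_settings(env=None):
    """Build Settings from a mapping (default: os.environ) using this guide's defaults."""
    env = os.environ if env is None else env
    settings = Settings(
        embedding_model=env.get("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chroma_db"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        k_retrieve=int(env.get("K_RETRIEVE", "5")),
        api_host=env.get("API_HOST", "0.0.0.0"),
        api_port=int(env.get("API_PORT", "8000")),
        api_reload=as_bool(env.get("API_RELOAD", "False")),
    )
    # An overlap as large as the chunk itself would never advance through the text
    if settings.chunk_overlap >= settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return settings
```

Accepting an optional mapping keeps the loader easy to test, and failing fast on an invalid overlap catches a misconfigured .env before any documents are ingested.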

Model Selection

For medical/clinical documents:

EMBEDDING_MODEL=dmis-lab/biobert-base-cased-v1.1

For scientific papers:

EMBEDDING_MODEL=allenai/scibert_scivocab_uncased

For general purpose (default):

EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=api --cov=rag_pipeline

# Run specific test
pytest tests/test_api.py::test_search_basic -v

# Run only fast tests (skip slow/performance tests)
pytest tests/ -v -m "not slow"

📊 Monitoring

View Logs

# Docker
docker-compose logs -f api

# Local
tail -f logs/api.log

Check Statistics

# Total documents and chunks
python api_client.py stats

# List all documents
python api_client.py list

# Filter by type
python api_client.py list --type research_paper --limit 20

🔍 Troubleshooting

Problem: API won’t start

Solution:

# Check if port 8000 is already in use
lsof -i :8000  # Linux/Mac
netstat -ano | findstr :8000  # Windows

# Use a different port
API_PORT=8080 python api.py
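
If lsof or netstat is not available, the same check can be done from Python on any platform. A minimal sketch using the standard socket module; port_in_use is a hypothetical helper, and it only detects listeners that accept TCP connections on the given host:

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

If port_in_use(8000) returns True before you start the server, another process already holds the port and a different API_PORT is needed.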

Problem: Model download is slow

Solution:

# Pre-download the model
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"

Problem: Out of memory during ingestion

Solution:

Reduce chunk size: CHUNK_SIZE=500
Process documents one at a time instead of in batches
Increase system swap space
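
To get a feel for what CHUNK_SIZE and CHUNK_OVERLAP control, here is a simplified sliding-window splitter. It is an illustration only: it assumes both values are measured in characters, and the splitter in rag_pipeline.py may work differently (for example, by splitting on sentence or paragraph boundaries). chunk_text is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # How far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks
```

In this sketch, halving CHUNK_SIZE while keeping the overlap fixed produces smaller but more numerous chunks, so the number of embeddings stored per document goes up even though each individual chunk is cheaper to process.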

Problem: Search returns no results

Solution:

# Check if documents are ingested
python api_client.py stats

# Verify vector store
ls -lh data/chroma_db/

# Try broader search query
python api_client.py search "pain" -k 10

🎯 Next Steps

Read the full documentation: README.md and API_DOCUMENTATION.md
Explore the API: Visit http://localhost:8000/docs
Customize the system: Edit your .env file (see Configuration) for your needs
Integrate with frontend: Use the API with React, R Shiny, etc.
Add LLM integration: Connect Claude API for answer generation

📞 Support

Documentation: See README.md and API_DOCUMENTATION.md
API Reference: http://localhost:8000/docs (when running)
Tests: Run pytest tests/ to verify setup
