# PSE2 - Pitchbook Extraction Webapplication

A microservices platform for processing pitchbook PDFs using OCR and entity extraction services. It combines SpaCy NLP with GPT-based (ExxetaGPT) extraction of KPIs from pitchbooks.

## Quick Start

### 1. Environment Setup

Create a `.env` file in the project root:

```
# Database
DATABASE_URL=url
POSTGRES_USER=admin
POSTGRES_PASSWORD=password

# API Key (required for ExxetaGPT service)
API_KEY=your_exxeta_jwt_token_here
```
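
Before starting the stack, it can help to confirm that the `.env` file defines every key from the sample above. The following is a minimal sketch, not part of the project: the function name `missing_env_keys` is hypothetical, and it assumes plain `KEY=value` lines without quoting or `export` prefixes.

```bash
# Sketch: print each key from the sample .env that is missing or empty.
# `missing_env_keys` is a hypothetical helper, not shipped with the project.
missing_env_keys() {
  local env_file="$1"
  local key
  for key in DATABASE_URL POSTGRES_USER POSTGRES_PASSWORD API_KEY; do
    # A key counts as set when a line "KEY=<non-empty value>" exists.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "$key"
    fi
  done
}
```

Running `missing_env_keys .env` prints nothing when all four keys have a value, and one key name per line otherwise.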

### 2. Start Application

```bash
# Build and start all services
docker-compose up --build

# Run in background
docker-compose up --build -d

# Stop services
docker-compose down
```

### 3. Access Application

- **Frontend:** http://localhost:8080
- **API:** http://localhost:5050
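
When the services run in the background, the two URLs may take a moment to start answering. A small retry helper can wait for them. This is a sketch under assumptions: `wait_until` is a hypothetical name, and the `curl` probe in the usage note only checks that something answers on the port, not that the application is fully ready.

```bash
# Sketch: run a probe command repeatedly until it succeeds.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
# `wait_until` is a hypothetical helper, not part of the project.
wait_until() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Any exit status of 0 from the probe counts as success.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 30 2 curl -sS -o /dev/null http://localhost:5050` retries for up to about a minute and returns a non-zero status if the API port never responds.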

## Services Overview

| Service | Port | Purpose |
|---------|------|---------|
| **Frontend** | 8080 | React UI for file upload and results display |
| **Coordinator** | 5050 | Main API, file storage, database management |
| **OCR** | 5051 | PDF text extraction using OCRmyPDF |
| **ExxetaGPT** | 5053 | AI entity extraction using GPT-4o-mini |
| **SpaCy** | 5052 | NLP entity extraction using custom model |
| **Validate** | 5054 | Merges and validates results from both extractors |
| **Database** | 5432 | PostgreSQL for data persistence |

## Usage Flow

1. Upload a PDF via the web interface
2. The OCR service extracts text from the PDF
3. The ExxetaGPT and SpaCy services each extract KPI entities from the text
4. The Validate service merges and validates the results
5. View the extracted KPIs and the original PDF side by side

## Troubleshooting

**Services won't start:**
```bash
# Check logs
docker-compose logs
```
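
The combined log output of all services can be long. One way to narrow it down is to keep only lines that look like failures. The sketch below is an illustration: `filter_log_errors` is a hypothetical helper, and the keyword list is an assumption about what the services print, so adjust it as needed.

```bash
# Sketch: keep only log lines that look like failures (case-insensitive).
# `filter_log_errors` and its keyword list are assumptions, not project code.
filter_log_errors() {
  grep -iE 'error|exception|traceback|refused'
}
```

It reads from standard input, so it can be combined with the log command, for example `docker-compose logs --no-color --tail=200 | filter_log_errors`.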

**ExxetaGPT errors:**
- Ensure `API_KEY` is set in `.env` file
- Check API key validity and network access

**Database connection issues:**
- Wait for database health check to pass
- Verify `DATABASE_URL` format in `.env`
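
The expected `DATABASE_URL` format is not spelled out here. Assuming the Coordinator accepts a standard PostgreSQL connection URI of the form `postgresql://user:password@host:port/dbname`, a rough shape check could look like the sketch below. The helper name `looks_like_postgres_url` is hypothetical, and the host `db` and database name `pitchbooks` in the usage note are placeholders.

```bash
# Sketch: rough shape check for a PostgreSQL connection URI.
# Accepts postgres:// or postgresql://, user:password@host, optional :port, /dbname.
# `looks_like_postgres_url` is a hypothetical helper; it does not test connectivity.
looks_like_postgres_url() {
  printf '%s\n' "$1" |
    grep -Eq '^postgres(ql)?://[^:@/]+:[^@/]+@[^:/@]+(:[0-9]+)?/[^/?]+'
}
```

For example, `looks_like_postgres_url 'postgresql://admin:password@db:5432/pitchbooks'` succeeds, while the placeholder value `url` from the sample `.env` fails. If the project uses a driver-specific scheme instead, the pattern would need to be adapted.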