# Project Setup Guide

This guide describes how to set up and run the project. Follow the steps in the order given.

## Prerequisites

- Conda is installed on your system.
- Docker is installed and running.
- NVIDIA drivers are installed.

## Requirements

- Running all Compose services, including GROBID and LLaMA7B, requires up to 48 GB of VRAM to generate LLaMA2 embeddings. If you use LLaMA only for text generation and not for embeddings, you can run the `data_service` scripts first, then shut down GROBID and load the LLaMA model. In that case, about 28 GB of VRAM is enough to load LLaMA7B.

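To see which of the two modes a machine can handle, the installed VRAM can be compared against the figures above. The following is a minimal sketch, not part of the project: the helper names are hypothetical, it assumes `nvidia-smi` is on the `PATH`, it sums memory across all visible GPUs (which only makes sense if the models can be spread over several GPUs), and it treats the GB figures above as GiB.

```python
import subprocess

# Approximate thresholds taken from the requirements above (treated as GiB here).
VRAM_ALL_SERVICES_GB = 48
VRAM_LLAMA_ONLY_GB = 28


def parse_total_vram_mib(nvidia_smi_output):
    """Sum the per-GPU memory values (MiB), one value per output line."""
    return sum(int(line) for line in nvidia_smi_output.splitlines() if line.strip())


def query_total_vram_mib():
    """Ask nvidia-smi for the total memory of every visible GPU (assumption: summing is valid)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_vram_mib(result.stdout)


def supported_mode(total_mib):
    """Classify the available VRAM against the two thresholds above."""
    total_gb = total_mib / 1024
    if total_gb >= VRAM_ALL_SERVICES_GB:
        return "all services"
    if total_gb >= VRAM_LLAMA_ONLY_GB:
        return "LLaMA text generation only"
    return "insufficient"
```

On a machine with NVIDIA drivers installed, `supported_mode(query_total_vram_mib())` gives a rough first answer; actual memory use also depends on what else is running on the GPU.
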
## Installation Steps

## 1. Create and Activate Conda Environment

Create a new Conda environment from the provided `environment.yml` file:

```bash
conda env create -f environment.yml
```

Activate the newly created environment:

```bash
conda activate chatbot_env
```

## 2. Train Rasa Model

Navigate to the `chatbot` directory and train the Rasa model:

```bash
cd chatbot
rasa train
cd ..
```

## 3. Start Docker Services

Start all required services using Docker Compose:

```bash
docker-compose up
```

### Note on Using OpenAI Models:

If you want to use OpenAI models for embedding generation or text generation, you need to provide an API key in the following files:

- `backend/app.py`
- `data_service/data_indexer.py`
- `model_service/openai_models.py`

Insert your OpenAI API key at the appropriate places in these files.

If you do not plan to use OpenAI models, you must adjust the configuration template (`config1` template) so that it does not use GPT, and adjust the indexing script so that it does not use the ada embedding model. This means switching the settings to alternative models or disabling the features that rely on OpenAI models.

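If you would rather not paste the key directly into source files, one option is to read it from an environment variable at the places where the key is needed. This is a minimal sketch under assumptions: the project itself does not ship this helper, and both the function name `load_openai_api_key` and the variable name `OPENAI_API_KEY` are chosen here for illustration.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper: the variable name is an assumption, not project config.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before starting the service.")
    return key
```

Failing early with a clear message makes a missing key visible at startup instead of at the first API call.
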
### Note on Using LLaMA Models:

If you want to use Hugging Face Transformers models such as LLaMA2, download the model and save it in `model_service/models`.

## 4. Data Indexing

### 4.1 Create and Activate Conda Environment

Create a new Conda environment from the provided `data_service.yml` file:

```bash
conda env create -f data_service.yml
```

Activate the newly created environment:

```bash
conda activate data_service
```

### 4.2 Run the Indexing Script

After all services are up and running, navigate to the `/data_service` directory:

```bash
cd data_service
```

Run the `data_indexing.py` script to index your data:

```bash
python data_indexing.py
```

If the `/data_service/data` directory is empty, you need to download the necessary documents manually and place them in the directories outlined below:

```
/data_service/data
│   ├── modulhandbuch-ib.pdf
│   └── Juni_2023_SPO_Bachelor.pdf
├── papers
│   ├── Wolf
│   │   ├── paper_title.pdf
│   │   └── ...
│   ├── Hummel
│   │   ├── paper_title.pdf
│   │   └── ...
│   └── ...
└── other_documents
    └── ...
```

### Notes on Data Structure

- For papers, the structure should be `/data_service/data/paper/{AUTHOR}/{PAPER_NAME}.pdf`.
- Follow the same structure for other documents, such as module handbooks and study regulations.

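Before indexing, it can help to list what the papers directory actually contains. The sketch below is not part of the project: `find_papers` is a hypothetical helper, and because the tree above shows `papers` while the path pattern uses `paper`, the directory name is left as a parameter rather than assumed.

```python
from pathlib import Path


def find_papers(data_dir, papers_dirname="papers"):
    """Map each author directory name to the sorted PDF file names inside it.

    Only files matching {data_dir}/{papers_dirname}/{AUTHOR}/*.pdf are reported;
    PDFs placed directly in the papers directory are ignored.
    """
    papers_root = Path(data_dir) / papers_dirname
    if not papers_root.is_dir():
        return {}
    return {
        author_dir.name: sorted(pdf.name for pdf in author_dir.glob("*.pdf"))
        for author_dir in sorted(papers_root.iterdir())
        if author_dir.is_dir()
    }
```

For example, `find_papers("data_service/data")` returns a dictionary keyed by author short name; an empty dictionary means the papers directory is missing or contains no author folders.
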
### Author Mapping in `expert_search.py` and `reader.py`

The `expert_search.py` and `reader.py` scripts contain an author mapping that associates short names with full names. This mapping is crucial for correctly identifying authors in the data processing steps.

#### Current Author Mapping

The current mapping is as follows:

```python
AUTHOR_MAPPING = {
    "Wolf": "Prof. Dr. Ivo Wolf",
    "Hummel": "Prof. Dr. Oliver Hummel",
    "Fimmel": "Prof. Dr. Elena Fimmel",
    "Eckert": "Prof. Dr. rer. nat. Kai Eckert",
    "Fischer": "Prof. Dr. Jörn Fischer",
    "Gröschel": "Prof. Dr. Michael Gröschel",
    "Gumbel": "Prof. Dr. Markus Gumbel",
    "Nagel": "Prof. Dr. Till Nagel",
    "Specht": "Prof. Dr. Thomas Specht",
    "Steinberger": "Prof. Dr. Jessica Steinberger",
    "Dietrich": "Prof. Dr. Gabriele Roth-Dietrich",
    "Dopatka": "Prof. Dr. rer. nat. Frank Dopatka",
    "Kraus": "Prof. Dr. Stefan Kraus",
    "Leuchter": "Prof. Dr.-Ing. Sandro Leuchter",
    "Paulus": "Prof. Dr. Sachar Paulus",
}
```

#### Updating the Author Mapping

- If new authors are added to the `data/paper` directory, update the `AUTHOR_MAPPING` in both `expert_search.py` and `reader.py` to reflect these changes.
- Ensure that the short name used in the directory structure matches the key used in the `AUTHOR_MAPPING`.

**Note:** Keeping the author mapping up to date is essential for the accuracy of the expert search and data processing functionality.

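A quick consistency check can catch author directories that were added without a matching mapping entry. This is a minimal sketch under assumptions: `unmapped_author_dirs` is a hypothetical helper, not part of the project, and the mapping here is only a two-entry excerpt of the one shown above.

```python
from pathlib import Path

# Excerpt only; the full mapping lives in expert_search.py and reader.py.
AUTHOR_MAPPING = {
    "Wolf": "Prof. Dr. Ivo Wolf",
    "Hummel": "Prof. Dr. Oliver Hummel",
}


def unmapped_author_dirs(papers_dir, mapping):
    """Return the sorted names of author directories that have no key in the mapping."""
    return sorted(
        entry.name
        for entry in Path(papers_dir).iterdir()
        if entry.is_dir() and entry.name not in mapping
    )
```

An empty result means every author directory has a key. Because the comparison is exact, a directory named `Groeschel` would be reported even though `Gröschel` is mapped.
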
### Running the Web Crawler

The project includes web crawlers for collecting data from specific sources. Follow these steps to run them:

#### 1. Crawling Available URLs

To collect all crawlable URLs from the HS Mannheim domain, run:

```bash
scrapy runspider hsma_url_crawler.py
```

This command runs the `hsma_url_crawler.py` script, which gathers URLs from the specified domain.

#### 2. Crawling Content from URLs

After gathering the URLs, you can crawl their content:

- First, make sure you have executed `hsma_url_crawler.py` as described above.
- Then, run the content crawler with the following command:

```bash
scrapy runspider hsma_content_crawler.py
```

- This command runs the `hsma_content_crawler.py` script, which collects content from the list of URLs obtained in the previous step.

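Because the content crawler depends on the URL crawler's output, the two commands can be chained so that the second runs only if the first succeeds. The following is a sketch, not project code: the function names are hypothetical, and `cwd` stands in for the directory that holds the crawler scripts, which this guide does not specify.

```python
import subprocess

# Order matters: the content crawler reads the URLs gathered by the URL crawler.
CRAWLER_SCRIPTS = ["hsma_url_crawler.py", "hsma_content_crawler.py"]


def crawl_commands(scripts=CRAWLER_SCRIPTS):
    """Build one `scrapy runspider` command per script, in execution order."""
    return [["scrapy", "runspider", script] for script in scripts]


def run_crawlers(cwd="."):
    """Run the crawlers in order; check=True stops at the first failing command."""
    for command in crawl_commands():
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_crawlers()` from the directory containing both scripts is equivalent to running the two commands above one after the other.
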
#### 3. Post-Crawling Steps

After crawling, move the generated `url_texts.json` file into the `data_service/data` directory and rename it to `crawled_hsma_web.json`:

```bash
mv url_texts.json /path/to/data_service/data/crawled_hsma_web.json
```

Replace `/path/to/data_service/data` with the actual path to your `data_service/data` directory.

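The same move-and-rename step can be done from Python, for example at the end of a crawl script. This is a minimal sketch: `install_crawl_output` is a hypothetical helper, and like `mv` it expects the target directory to exist already.

```python
import shutil
from pathlib import Path


def install_crawl_output(crawl_output, data_dir, target_name="crawled_hsma_web.json"):
    """Move the crawler output into data_dir under the name the indexer expects."""
    source = Path(crawl_output)
    if not source.is_file():
        raise FileNotFoundError(f"Crawler output not found: {source}")
    target = Path(data_dir) / target_name
    # shutil.move renames when possible and falls back to copy-and-delete across filesystems.
    shutil.move(str(source), str(target))
    return target
```

A call such as `install_crawl_output("url_texts.json", "/path/to/data_service/data")` mirrors the `mv` command above, with the same placeholder path.
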

---