# Project Setup Guide
This guide outlines the steps to set up and run the project. Please follow the instructions in the order they are provided.
## Prerequisites
- Ensure you have Conda installed on your system.
- Docker should be installed and running.
- NVIDIA drivers should be installed.
## Requirements
- If you run all Compose services, including GROBID and LLaMA 7B, you need up to 48 GB of VRAM to generate LLaMA 2 embeddings. If you use LLaMA only for text generation and not for generating embeddings, you can run the `data_service` scripts first, then shut down GROBID and load the LLaMA model. In that case, about 28 GB of VRAM is enough to load LLaMA 7B.
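
To check a machine against these numbers before starting the services, a small script can read the installed VRAM from `nvidia-smi`. This is a minimal sketch, not part of the project: the function names are hypothetical, it sums the memory of all visible GPUs (an assumption, since whether the model can be spread across several GPUs depends on how it is loaded), and it treats the GB figures above as GiB.

```python
import subprocess

# Thresholds taken from the requirements above (treated as GiB here).
VRAM_ALL_SERVICES_GB = 48
VRAM_LLAMA_ONLY_GB = 28


def parse_total_vram_gb(nvidia_smi_output: str) -> float:
    """Sum the per-GPU totals (MiB, one value per line) and convert to GiB."""
    mib_values = [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]
    return sum(mib_values) / 1024


def query_total_vram_gb() -> float:
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_vram_gb(result.stdout)


def describe_setup(total_gb: float) -> str:
    """Map the available VRAM to the setups described in the requirements."""
    if total_gb >= VRAM_ALL_SERVICES_GB:
        return "all services"
    if total_gb >= VRAM_LLAMA_ONLY_GB:
        return "LLaMA for generation only"
    return "insufficient for local LLaMA"
```

Calling `describe_setup(query_total_vram_gb())` on the target machine then tells you which of the two setups is feasible.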
## Installation Steps
## 1. Create and Activate Conda Environment
Create a new Conda environment using the provided `environment.yml` file:
```bash
conda env create -f environment.yml
```
Activate the newly created environment:
```bash
conda activate chatbot_env
```
## 2. Train Rasa Model
Navigate to the chatbot directory and train the Rasa model:
```bash
cd chatbot
rasa train
cd ..
```
## 3. Start Docker Services
Start all required services using Docker Compose:
```bash
docker-compose up
```
### Note on Using OpenAI Models:
If you want to use OpenAI models for embedding generation or text generation, you need to provide an API key in the following files:
- `backend/app.py`
- `data_service/data_indexer.py`
- `model_service/openai_models.py`
To provide the API key, insert your OpenAI API key in the appropriate places in these files.
If you do not plan to use OpenAI models, you must adjust the configuration template (`config1` template) so that it does not use GPT, and change the indexing script so that it does not use the ada embedding model. This involves switching the settings to alternative models or disabling the features that rely on OpenAI models.
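
As an alternative to pasting the key into three files, the key could be read from an environment variable. The sketch below is an assumption about how the files could be adapted, not how they currently work: `load_openai_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is only the conventional variable name. For services started through Docker Compose, the variable would also have to be passed into the containers.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before starting the services.")
    return key
```

Each of the three files could then call `load_openai_api_key()` where the key is currently hard-coded, which also keeps the key out of version control.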
### Note on Using LLaMA Models:
If you want to use Hugging Face Transformers models such as LLaMA 2, download the model and save it in `model_service/models`.
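
One way to fetch a model is `snapshot_download` from the `huggingface_hub` package. The following is a sketch under assumptions: the folder name inside `model_service/models` (here the last part of the repository id) may differ from what the model service expects, and the helper names are hypothetical. LLaMA 2 repositories on the Hub are gated, so you need approved access and a logged-in Hugging Face account.

```python
from pathlib import Path

MODELS_DIR = Path("model_service/models")


def local_model_dir(repo_id: str, models_dir: Path = MODELS_DIR) -> Path:
    """Map a Hub repository id such as 'org/name' to models_dir/name (assumed layout)."""
    return models_dir / repo_id.split("/")[-1]


def download_model(repo_id: str, models_dir: Path = MODELS_DIR) -> Path:
    """Download all files of a Hub repository into the local models directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, models_dir)
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target
```

For example, `download_model("meta-llama/Llama-2-7b-hf")` would place the files in `model_service/models/Llama-2-7b-hf`.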
## 4. Data Indexing
### 4.1 Create and Activate Conda Environment
Create a new Conda environment using the provided `data_service.yml` file:
```bash
conda env create -f data_service.yml
```
Activate the newly created environment:
```bash
conda activate data_service
```
### 4.2 Run the Indexing Script
After all services are up and running, navigate to the `/data_service` directory:
```bash
cd data_service
```
Run the `data_indexing.py` script to index your data:
```bash
python data_indexing.py
```
If the `/data_service/data` directory is empty, you need to manually download the necessary documents and place them in the appropriate directories as outlined below:
```
/data_service/data
│   ├── modulhandbuch-ib.pdf
│   └── Juni_2023_SPO_Bachelor.pdf
├── papers
│   ├── Wolf
│   │   ├── paper_title.pdf
│   │   └── ...
│   ├── Hummel
│   │   ├── paper_title.pdf
│   │   └── ...
│   └── ...
└── other_documents
    └── ...
```
### Notes on Data Structure
- For papers, the structure should be `/data_service/data/paper/{AUTHOR}/{PAPER_NAME}.pdf`.
- Make sure to follow the same structure for other documents like module handbooks and study regulations.
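
To verify the paper layout before indexing, the author folders and their PDFs can be listed with a few lines of Python. This is a minimal sketch with a hypothetical helper name. The name of the papers folder is a parameter, so pass whichever name your checkout actually uses.

```python
from pathlib import Path


def list_papers(data_dir: Path, papers_dirname: str = "papers") -> dict[str, list[str]]:
    """Return {author folder: [paper names]} for data_dir/<papers_dirname>/<AUTHOR>/*.pdf."""
    papers_root = data_dir / papers_dirname
    result: dict[str, list[str]] = {}
    if not papers_root.is_dir():
        return result
    for author_dir in sorted(p for p in papers_root.iterdir() if p.is_dir()):
        # Only PDF files directly inside the author folder count as papers.
        result[author_dir.name] = sorted(pdf.stem for pdf in author_dir.glob("*.pdf"))
    return result
```

An author folder that maps to an empty list, or an empty result overall, indicates that files are missing or placed at the wrong level.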
### Author Mapping in `expert_search.py` and `reader.py`
In the `expert_search.py` and `reader.py` scripts, there is an author mapping that associates short names with full names. This is crucial for correctly identifying authors in the data processing steps.
#### Current Author Mapping
The current mapping is as follows:
```python
AUTHOR_MAPPING = {
    "Wolf": "Prof. Dr. Ivo Wolf",
    "Hummel": "Prof. Dr. Oliver Hummel",
    "Fimmel": "Prof. Dr. Elena Fimmel",
    "Eckert": "Prof. Dr. rer. nat. Kai Eckert",
    "Fischer": "Prof. Dr. Jörn Fischer",
    "Gröschel": "Prof. Dr. Michael Gröschel",
    "Gumbel": "Prof. Dr. Markus Gumbel",
    "Nagel": "Prof. Dr. Till Nagel",
    "Specht": "Prof. Dr. Thomas Specht",
    "Steinberger": "Prof. Dr. Jessica Steinberger",
    "Dietrich": "Prof. Dr. Gabriele Roth-Dietrich",
    "Dopatka": "Prof. Dr. rer. nat. Frank Dopatka",
    "Kraus": "Prof. Dr. Stefan Kraus",
    "Leuchter": "Prof. Dr.-Ing. Sandro Leuchter",
    "Paulus": "Prof. Dr. Sachar Paulus",
}
```
#### Updating the Author Mapping
- If new authors are added to the `data/paper` directory, you will need to update the `AUTHOR_MAPPING` in both `expert_search.py` and `reader.py` to reflect these changes.
- Ensure that the short name used in the directory structure matches the key used in the `AUTHOR_MAPPING`.
**Note:** Keeping the author mapping updated is essential for the accuracy of the expert search and data processing functionalities.
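
A quick consistency check can catch author folders that have no entry in the mapping. The helper below is a hypothetical sketch, not part of `expert_search.py` or `reader.py`. It takes the papers directory and a mapping dictionary as arguments.

```python
from pathlib import Path


def find_unmapped_authors(papers_root: Path, author_mapping: dict[str, str]) -> list[str]:
    """Return author folder names that are missing as keys in the mapping."""
    author_dirs = {p.name for p in papers_root.iterdir() if p.is_dir()}
    return sorted(author_dirs - set(author_mapping))
```

Running it against the `AUTHOR_MAPPING` of both scripts should return an empty list. Any name it returns needs a new entry in the mapping.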
### Running the Web Crawler
The project includes web crawlers for collecting data from specific sources. Follow these steps to run the crawlers:
#### 1. Crawling Available URLs
To collect all crawlable URLs from the HS Mannheim domain, use the following command:
```bash
scrapy runspider hsma_url_crawler.py
```
This command runs the `hsma_url_crawler.py` script, which gathers URLs from the specified domain.
#### 2. Crawling Content from URLs
After gathering URLs, you can crawl the content from these URLs:
- First, make sure you have executed `hsma_url_crawler.py` as described above.
- Then, run the content crawler with the following command:
```bash
scrapy runspider hsma_content_crawler.py
```
- This command runs the `hsma_content_crawler.py` script, which collects content from the list of URLs obtained in the previous step.
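
Because the content crawler depends on the output of the URL crawler, the two commands can also be chained in a small wrapper that stops if the first one fails. This is a sketch with hypothetical helper names. It assumes `scrapy` is on the `PATH` and that both scripts are in the working directory passed as `cwd`.

```python
import subprocess

# Order matters: the content crawler reads the URLs gathered by the URL crawler.
CRAWLER_SCRIPTS = ["hsma_url_crawler.py", "hsma_content_crawler.py"]


def crawl_commands(scripts=CRAWLER_SCRIPTS):
    """Build one 'scrapy runspider' command per crawler script, in order."""
    return [["scrapy", "runspider", script] for script in scripts]


def run_crawlers(cwd="."):
    """Run the crawlers one after another; check=True aborts on the first failure."""
    for command in crawl_commands():
        subprocess.run(command, cwd=cwd, check=True)
```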
#### 3. Post-Crawling Steps
After crawling, move the generated `url_texts.json` file into the `data_service/data` directory and rename it to `crawled_hsma_web.json`:
```bash
mv url_texts.json /path/to/data_service/data/crawled_hsma_web.json
```
Replace `/path/to/data_service/data` with the actual path to your `data_service/data` directory.
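
The same step can be done from Python, with a check that the crawler output is valid JSON before it is moved. This is a minimal sketch: `install_crawl_output` is a hypothetical helper, and it makes no assumption about the structure of the JSON beyond it being parseable.

```python
import json
import shutil
from pathlib import Path


def install_crawl_output(
    source: Path, data_dir: Path, target_name: str = "crawled_hsma_web.json"
) -> Path:
    """Validate the crawler output and move it into the data directory."""
    with source.open(encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError if the file is not valid JSON
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / target_name
    shutil.move(str(source), str(target))
    return target
```

For example, `install_crawl_output(Path("url_texts.json"), Path("/path/to/data_service/data"))` performs the move and rename shown above.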
---