Project Setup Guide
This guide outlines the steps to set up and run the project. Please follow the instructions in the order they are provided.
Prerequisites
- Ensure you have Conda installed on your system.
- Docker should be installed and running.
- NVIDIA drivers should be installed.
Requirements
- Running all Compose services, including GROBID and LLaMA7B, requires up to 48 GB of VRAM when generating LLaMA2 embeddings. If you use LLaMA only for text generation and not for generating embeddings, you can run the data_service scripts first, then shut down GROBID and load the LLaMA model. In that case, about 28 GB of VRAM is enough to load LLaMA7B.
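To check a machine against these two figures before starting the services, a minimal sketch along the following lines may help. It reads the total memory per GPU from nvidia-smi and compares it with the thresholds above. The function names are hypothetical, GB is treated as GiB, and each GPU is evaluated on its own, which assumes the models are not split across several devices.

```python
import subprocess

# Thresholds taken from the requirements above, converted to MiB
# (assumption: the guide's "GB" is treated as GiB here).
FULL_STACK_MIB = 48 * 1024       # all services, including LLaMA2 embeddings
GENERATION_ONLY_MIB = 28 * 1024  # LLaMA7B for text generation only


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one memory value in MiB per non-empty line."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU (needs NVIDIA drivers)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


def supported_mode(total_mib: int) -> str:
    """Classify a single GPU against the two thresholds from this guide."""
    if total_mib >= FULL_STACK_MIB:
        return "full"
    if total_mib >= GENERATION_ONLY_MIB:
        return "generation-only"
    return "insufficient"
```

On a machine with NVIDIA drivers, `[supported_mode(m) for m in query_vram_mib()]` gives one classification per GPU.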
Installation Steps
1. Create and Activate Conda Environment
Create a new Conda environment using the provided environment.yml file:
conda env create -f environment.yml
Activate the newly created environment:
conda activate chatbot_env
2. Train Rasa Model
Navigate to the chatbot directory and train the Rasa model:
cd chatbot
rasa train
cd ..
3. Start Docker Services
Start all required services using Docker Compose:
docker-compose up
Note on Using OpenAI Models:
If you want to use OpenAI models for embedding generation or text generation, you need to provide an API key in the following files:
backend/app.py
data_service/data_indexer.py
model_service/openai_models.py
To provide the API key, insert your OpenAI API key in the appropriate places in these files.
If you do not plan to use OpenAI models, you must adjust the configuration template (config1 template) so that it does not use GPT, and adjust the indexing script so that it does not use the ada embedding model. This involves switching the settings to alternative models or disabling features that rely on OpenAI models.
Note on Using LLaMA Models:
If you want to use Hugging Face Transformers models such as LLaMA2, you have to download the model and save it in model_service/models.
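One possible way to fetch a model into model_service/models is the `snapshot_download` function from the huggingface_hub package, sketched below. This is not necessarily how the project expects the files to be laid out: the per-model subfolder, the helper names, and the example repository id `meta-llama/Llama-2-7b-hf` are assumptions. Access to that repository is gated, so an authenticated Hub login is needed.

```python
from pathlib import Path

# Target directory named in this guide.
MODELS_DIR = Path("model_service/models")


def local_model_dir(repo_id: str, models_dir: Path = MODELS_DIR) -> Path:
    """Map a Hub repo id such as 'org/name' to a folder below models_dir.

    Assumption: one subfolder per model, named after the repo.
    """
    return models_dir / repo_id.split("/")[-1]


def download_model(repo_id: str, models_dir: Path = MODELS_DIR) -> Path:
    """Download a model snapshot from the Hugging Face Hub into models_dir."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(repo_id, models_dir)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

For example, `download_model("meta-llama/Llama-2-7b-hf")` would place the files under model_service/models/Llama-2-7b-hf, relative to the current working directory.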
4. Data Indexing
4.1 Create and Activate Conda Environment
Create a new Conda environment using the provided data_service.yml file:
conda env create -f data_service.yml
Activate the newly created environment:
conda activate data_service
4.2 Run the Indexing Script
After all services are up and running, navigate to the /data_service directory:
cd data_service
Run the data_indexing.py script to index your data:
python data_indexing.py
If the /data_service/data directory is empty, you need to manually download the necessary documents and place them in the appropriate directories as outlined below:
/data_service/data
│   ├── modulhandbuch-ib.pdf
│   └── Juni_2023_SPO_Bachelor.pdf
├── papers
│   ├── Wolf
│   │   ├── paper_title.pdf
│   │   └── ...
│   ├── Hummel
│   │   ├── paper_title.pdf
│   │   └── ...
│   └── ...
└── other_documents
    └── ...
Notes on Data Structure
- For papers, the structure should be /data_service/data/paper/{AUTHOR}/{PAPER_NAME}.pdf.
- Make sure to follow the same structure for other documents like module handbooks and study regulations.
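To see which papers the layout above would expose, a small sketch such as the following can list the PDF files per author directory. The function name is hypothetical and this is not the project's own indexing code. The name of the papers directory is a parameter because the tree listing and the path pattern in this guide spell it differently (`papers` and `paper`).

```python
from pathlib import Path


def list_papers(data_dir: Path, papers_dirname: str = "paper") -> dict[str, list[str]]:
    """Return {author_short_name: [PDF file names]} for the expected layout."""
    papers: dict[str, list[str]] = {}
    root = data_dir / papers_dirname
    if not root.is_dir():
        # Nothing to list if the papers directory does not exist.
        return papers
    for author_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only files ending in .pdf directly inside the author folder count.
        papers[author_dir.name] = sorted(f.name for f in author_dir.glob("*.pdf"))
    return papers
```

An author folder that maps to an empty list is a hint that its PDFs are missing or sit one level too deep.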
Author Mapping in expert_search.py and reader.py
In the expert_search.py and reader.py scripts, there is an author mapping that associates short names with full names. This is crucial for correctly identifying authors in the data processing steps.
Current Author Mapping
The current mapping is as follows:
AUTHOR_MAPPING = {
    "Wolf": "Prof. Dr. Ivo Wolf",
    "Hummel": "Prof. Dr. Oliver Hummel",
    "Fimmel": "Prof. Dr. Elena Fimmel",
    "Eckert": "Prof. Dr. rer. nat. Kai Eckert",
    "Fischer": "Prof. Dr. Jörn Fischer",
    "Gröschel": "Prof. Dr. Michael Gröschel",
    "Gumbel": "Prof. Dr. Markus Gumbel",
    "Nagel": "Prof. Dr. Till Nagel",
    "Specht": "Prof. Dr. Thomas Specht",
    "Steinberger": "Prof. Dr. Jessica Steinberger",
    "Dietrich": "Prof. Dr. Gabriele Roth-Dietrich",
    "Dopatka": "Prof. Dr. rer. nat. Frank Dopatka",
    "Kraus": "Prof. Dr. Stefan Kraus",
    "Leuchter": "Prof. Dr.-Ing. Sandro Leuchter",
    "Paulus": "Prof. Dr. Sachar Paulus",
}
Updating the Author Mapping
- If new authors are added to the data/paper directory, you will need to update the AUTHOR_MAPPING in both expert_search.py and reader.py to reflect these changes.
- Ensure that the short name used in the directory structure matches the key used in the AUTHOR_MAPPING.
Note: Keeping the author mapping updated is essential for the accuracy of the expert search and data processing functionalities.
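A quick consistency check can catch author folders that were added without a matching key. The sketch below is a hypothetical helper, not part of expert_search.py or reader.py. It takes the mapping as an argument, so a copy of AUTHOR_MAPPING from either script can be passed in.

```python
from pathlib import Path


def unmapped_authors(papers_dir: Path, author_mapping: dict[str, str]) -> list[str]:
    """Return author directory names that have no key in author_mapping."""
    if not papers_dir.is_dir():
        return []
    return sorted(
        d.name
        for d in papers_dir.iterdir()
        # Compare folder names to mapping keys exactly, including case and umlauts.
        if d.is_dir() and d.name not in author_mapping
    )
```

An empty result means every author folder has a key. Any returned name needs a new entry in both scripts.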
Running the Web Crawler
The project includes web crawlers for collecting data from specific sources. Follow these steps to run the crawlers:
1. Crawling Available URLs
To collect all crawlable URLs from the HS Mannheim domain, use the following command:
scrapy runspider hsma_url_crawler.py
This command runs the hsma_url_crawler.py script, which gathers URLs from the specified domain.
2. Crawling Content from URLs
After gathering URLs, you can crawl the content from these URLs:
- First, make sure you have executed hsma_url_crawler.py as described above.
- Then, run the content crawler with the following command:
scrapy runspider hsma_content_crawler.py
- This command runs the hsma_content_crawler.py script, which collects content from the list of URLs obtained in the previous step.
3. Post-Crawling Steps
After crawling, move the generated url_texts.json file into the /data directory and rename it to crawled_hsma_web.json.
mv url_texts.json /path/to/data_service/data/crawled_hsma_web.json
Replace /path/to/data_service/data with the actual path to your data_service/data directory.
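The same move can be scripted in Python with a sanity check added. The sketch below first confirms that the crawler output parses as JSON, then moves it to the target name. The function name is hypothetical, and the validation step is an addition that the mv command above does not perform.

```python
import json
import shutil
from pathlib import Path


def install_crawl_output(source: Path, data_dir: Path) -> Path:
    """Validate the crawler output as JSON, then move it into data_dir."""
    with source.open(encoding="utf-8") as fh:
        # Raises json.JSONDecodeError if the file is truncated or malformed.
        json.load(fh)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "crawled_hsma_web.json"
    shutil.move(str(source), str(target))
    return target
```

For example, `install_crawl_output(Path("url_texts.json"), Path("/path/to/data_service/data"))`, with the path replaced as described above.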