Ali Babaoglu fabe6c7048 initial 2023-11-15 14:28:48 +01:00
.vscode initial 2023-11-15 14:28:48 +01:00
backend initial 2023-11-15 14:28:48 +01:00
chatbot initial 2023-11-15 14:28:48 +01:00
data_service initial 2023-11-15 14:28:48 +01:00
model_service initial 2023-11-15 14:28:48 +01:00
.gitignore initial 2023-11-15 14:28:48 +01:00
README.md initial 2023-11-15 14:28:48 +01:00
chatbot_env.yml initial 2023-11-15 14:28:48 +01:00
data_service.yml initial 2023-11-15 14:28:48 +01:00
db_rebuild.sh initial 2023-11-15 14:28:48 +01:00
docker-compose.yml initial 2023-11-15 14:28:48 +01:00
full_rebuild.sh initial 2023-11-15 14:28:48 +01:00

README.md

Project Setup Guide

This guide outlines the steps to set up and run the project. Please follow the instructions in the order they are provided.

Prerequisites

  • Ensure you have Conda installed on your system.
  • Docker should be installed and running.
  • NVIDIA drivers should be installed.

Requirements

  • Running all Compose services, including GROBID and LLaMA7B, requires up to 48 GB of VRAM when generating LLaMA2 embeddings. If you use LLaMA only for text generation and not for embeddings, you can run the data_service scripts first, then shut down GROBID and load the LLaMA model. In that case, about 28 GB of VRAM is enough to load LLaMA7B.
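
The two VRAM figures above can be checked before starting the services. The following is a minimal sketch, not part of the project: the helper names are hypothetical, and summing memory across all GPUs is an assumption that only holds if the model can be spread over several devices. It relies on the `nvidia-smi` CLI that ships with the NVIDIA drivers.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def total_vram_gb() -> float:
    """Return the total VRAM of all visible GPUs in GB (summed, an assumption)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return sum(parse_vram_mib(output)) / 1024


def enough_vram(available_gb: float, with_llama_embeddings: bool) -> bool:
    """Compare against the figures from this README: 48 GB or about 28 GB."""
    required_gb = 48 if with_llama_embeddings else 28
    return available_gb >= required_gb
```

Calling `enough_vram(total_vram_gb(), with_llama_embeddings=True)` on the target machine indicates whether the full setup is likely to fit.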

Installation Steps

1. Create and Activate Conda Environment

Create a new Conda environment using the provided chatbot_env.yml file:

conda env create -f chatbot_env.yml

Activate the newly created environment:

conda activate chatbot_env

2. Train Rasa Model

Navigate to the chatbot directory and train the Rasa model:

cd chatbot
rasa train
cd ..

3. Start Docker Services

Start all required services using Docker Compose:

docker-compose up

Note on Using OpenAI Models:

If you want to use OpenAI models for embedding generation or text generation, you need to provide an API key in the following files:

  • backend/app.py
  • data_service/data_indexer.py
  • model_service/openai_models.py

To provide the API key, insert your OpenAI API key in the appropriate places in these files.
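
As an alternative to writing the key directly into the source files, it could be read from an environment variable. This is only a sketch under assumptions: the helper `load_openai_api_key` is hypothetical and does not exist in the repository, and the three files listed above would have to be edited to call it. `OPENAI_API_KEY` is the variable name the official OpenAI Python client also looks for.

```python
import os


def load_openai_api_key(env=None) -> str:
    """Return the OpenAI API key from the environment or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the helper testable.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```

With this approach the key stays out of version control, and the services fail early with a clear message when it is missing.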

If you do not plan to use OpenAI models, you must adjust the configuration template (config1 template) so that it does not use GPT, and change the indexing script so that it does not use the ada embedding model. This involves switching the settings to alternative models or disabling features that rely on OpenAI models.

Note on Using LLaMA Models:

If you want to use Hugging Face Transformers models such as LLaMA2, download the model and save it in model_service/models.

4. Data Indexing

4.1 Create and Activate Conda Environment

Create a new Conda environment using the provided data_service.yml file:

conda env create -f data_service.yml

Activate the newly created environment:

conda activate data_service

4.2 Run the Indexing Script

After all services are up and running, navigate to the /data_service directory:

cd data_service

Run the data_indexing.py script to index your data:

python data_indexing.py

If the /data_service/data directory is empty, you need to manually download the necessary documents and place them in the appropriate directories as outlined below:

/data_service/data
│   ├── modulhandbuch-ib.pdf
│   └── Juni_2023_SPO_Bachelor.pdf
├── papers
│   ├── Wolf
│   │   ├── paper_title.pdf
│   │   └── ...
│   ├── Hummel
│   │   ├── paper_title.pdf
│   │   └── ...
│   └── ...
└── other_documents
    └── ...

Notes on Data Structure

  • For papers, the structure should be /data_service/data/paper/{AUTHOR}/{PAPER_NAME}.pdf.
  • Make sure to follow the same structure for other documents like module handbooks and study regulations.
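
To verify the layout before indexing, the papers can be grouped by their author directory. This is a minimal sketch with a hypothetical helper name; the papers directory is passed in as a parameter, so the sketch does not depend on its exact name or location.

```python
from pathlib import Path


def papers_by_author(papers_dir) -> dict[str, list[str]]:
    """Map each author directory name to the PDF names (without suffix) inside it.

    Only files matching {AUTHOR}/{PAPER_NAME}.pdf are picked up; PDFs placed
    directly in papers_dir or nested deeper are ignored.
    """
    result: dict[str, list[str]] = {}
    for pdf in sorted(Path(papers_dir).glob("*/*.pdf")):
        result.setdefault(pdf.parent.name, []).append(pdf.stem)
    return result
```

An empty result suggests that the directory is empty or that the files do not follow the expected structure.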

Author Mapping in expert_search.py and reader.py

In the expert_search.py and reader.py scripts, there is an author mapping that associates short names with full names. This is crucial for correctly identifying authors in the data processing steps.

Current Author Mapping

The current mapping is as follows:

AUTHOR_MAPPING = {
    "Wolf": "Prof. Dr. Ivo Wolf",
    "Hummel": "Prof. Dr. Oliver Hummel",
    "Fimmel": "Prof. Dr. Elena Fimmel",
    "Eckert": "Prof. Dr. rer. nat. Kai Eckert",
    "Fischer": "Prof. Dr. Jörn Fischer",
    "Gröschel": "Prof. Dr. Michael Gröschel",
    "Gumbel": "Prof. Dr. Markus Gumbel",
    "Nagel": "Prof. Dr. Till Nagel",
    "Specht": "Prof. Dr. Thomas Specht",
    "Steinberger": "Prof. Dr. Jessica Steinberger",
    "Dietrich": "Prof. Dr. Gabriele Roth-Dietrich",
    "Dopatka": "Prof. Dr. rer. nat. Frank Dopatka",
    "Kraus": "Prof. Dr. Stefan Kraus",
    "Leuchter": "Prof. Dr.-Ing. Sandro Leuchter",
    "Paulus": "Prof. Dr. Sachar Paulus",
}

Updating the Author Mapping

  • If new authors are added to the data/paper directory, you will need to update the AUTHOR_MAPPING in both expert_search.py and reader.py to reflect these changes.
  • Ensure that the short name used in the directory structure matches the key used in the AUTHOR_MAPPING.

Note: Keeping the author mapping updated is essential for the accuracy of the expert search and data processing functionalities.
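
A quick consistency check can reveal author directories that have no entry in the mapping. The sketch below is an illustration only: `unmapped_author_dirs` is a hypothetical helper that is not part of expert_search.py or reader.py, and both the papers directory and the mapping are passed in by the caller.

```python
from pathlib import Path


def unmapped_author_dirs(papers_dir, author_mapping) -> list[str]:
    """Return the names of author directories that are missing from the mapping."""
    dir_names = [p.name for p in Path(papers_dir).iterdir() if p.is_dir()]
    return sorted(name for name in dir_names if name not in author_mapping)
```

Each returned name needs a new key in AUTHOR_MAPPING in both scripts. The comparison is exact, so a directory for Prof. Dr. Gabriele Roth-Dietrich must be named Dietrich to match the current mapping.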

Running the Web Crawler

The project includes web crawlers for collecting data from specific sources. Follow these steps to run the crawlers:

1. Crawling Available URLs

To collect all available URLs from the HS Mannheim domain, use the following command:

scrapy runspider hsma_url_crawler.py

This command runs the hsma_url_crawler.py script, which gathers URLs from the specified domain.

2. Crawling Content from URLs

After gathering URLs, you can crawl the content from these URLs:

  • First, make sure you have executed hsma_url_crawler.py as described above.

  • Then, run the content crawler with the following command:

    scrapy runspider hsma_content_crawler.py
    
  • This command runs the hsma_content_crawler.py script, which collects content from the list of URLs obtained in the previous step.

3. Post-Crawling Steps

After crawling, move the generated url_texts.json file into the data_service/data directory and rename it to crawled_hsma_web.json.

mv url_texts.json /path/to/data_service/data/crawled_hsma_web.json

Replace /path/to/data_service/data with the actual path to your data_service/data directory.
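
The same move-and-rename step can be done from Python, for example as part of a rebuild script. This is a minimal sketch: `install_crawl_output` is a hypothetical helper, and both paths are supplied by the caller.

```python
import shutil
from pathlib import Path


def install_crawl_output(source, data_dir) -> Path:
    """Move the crawler output into data_dir as crawled_hsma_web.json."""
    source = Path(source)
    if not source.is_file():
        raise FileNotFoundError(f"crawler output not found: {source}")
    target = Path(data_dir) / "crawled_hsma_web.json"
    # Create the data directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target
```

For example, `install_crawl_output("url_texts.json", "/path/to/data_service/data")` mirrors the mv command above.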