ANLP_WS24_CA2
This repository contains the necessary scripts, data, and notebooks for analyzing and modeling the linguistic and structural features of humor in jokes. The project focuses on leveraging NLP techniques to analyze humor in text data, and aims to predict the humor score numerically using regression models.
Objective
Leverage advanced NLP techniques (CNN, BERT, and Transformer) to analyze text data and build an application that predicts humor ratings.
Research Question
- Can deep neural networks predict humor ratings with an RMSE less than or equal to the baseline of 0.8609?
Data Source
The data is sourced from SemEval-2021 Task 7 (HaHackathon), which provides a dataset of humor and offense ratings for jokes. Each joke is annotated with a humor rating on a scale from 0 to 4.
- Training data: HaHackathon, https://homepages.inf.ed.ac.uk/s1573290/data.html (associated paper: https://aclanthology.org/2021.semeval-1.9.pdf)
- Test data: Since no separate test set was available, the training data was split into train, validation, and test sets.
Data Embeddings
- GloVe (6B tokens): https://nlp.stanford.edu/projects/glove/
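For reference, a minimal sketch of loading these vectors into a lookup dictionary; the file name glove.6B.100d.txt (the 100-dimensional variant) is an assumption, not necessarily the variant used in this repository:

```python
import numpy as np

def load_glove_vectors(path="glove.6B.100d.txt"):
    """Read pre-trained GloVe vectors into a {word: numpy vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

glove = load_glove_vectors()
print(len(glove), glove["joke"].shape)  # roughly 400k entries, each of shape (100,)
```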
Preprocessing Steps
1. Load and clean data: The dataset is loaded and all rows with missing humor_rating values are removed. In addition, the target variable for the humor rating is extracted.
2. Text embeddings: Pre-trained GloVe embeddings are loaded and converted into a matrix that can be used for modeling.
3. Data splitting: The dataset is split into training, validation, and test sets to train and evaluate the models later.
4. Ensemble data indices: Various methods for creating data indices are provided to prepare the training data for ensemble methods. A sketch of these steps follows below.
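A compact sketch of steps 1-4 under a few assumptions: the raw file is at data/hack.csv with text and humor_rating columns, GloVe vectors come from glove.6B.100d.txt, the split is 70/15/15, and the ensemble uses five bootstrap resamples. None of these specifics are taken from the repository code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load and clean: drop rows with a missing humor_rating and extract the target.
df = pd.read_csv("data/hack.csv").dropna(subset=["humor_rating"])
texts, y = df["text"].tolist(), df["humor_rating"].to_numpy()

# 2. Text embeddings: map a simple whitespace vocabulary onto GloVe vectors,
#    yielding a matrix that can initialise an embedding layer.
glove = {line.split()[0]: np.asarray(line.split()[1:], dtype="float32")
         for line in open("glove.6B.100d.txt", encoding="utf-8")}
vocab = {w: i + 1 for i, w in enumerate(sorted({t for s in texts for t in s.lower().split()}))}
embedding_matrix = np.zeros((len(vocab) + 1, 100))
for word, idx in vocab.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]

# 3. Data splitting: train / validation / test (70 / 15 / 15).
X_train, X_tmp, y_train, y_tmp = train_test_split(texts, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# 4. Ensemble data indices: bootstrap resamples of the training set.
rng = np.random.default_rng(42)
ensemble_indices = [rng.choice(len(X_train), size=len(X_train), replace=True)
                    for _ in range(5)]
```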
Repository Structure
- data/: Contains the dataset hack.csv with the raw joke data.
- histories/: Stored training histories.
- BERT.py, CNN.py, Transformer.py: Model implementations.
- Datasets.py, dataset_helper.py: Dataset and preprocessing helpers.
- EarlyStopping.py: Early-stopping utility used during training.
- ml_helper.py, ml_history.py, ml_plots.py, ml_train.py: Training, history, and plotting helpers.
- Notebooks (data_exploration.ipynb, model_compare_types.ipynb, model_comparison*.ipynb, model_evaluation.ipynb):
  - Used for data analysis and visualization.
  - Represent and compare the models.
- requirements.txt: Required Python packages.
Getting Started
Install Requirements
Run the following command to install the required dependencies:
pip install -r requirements.txt
Preprocess Data
Data preprocessing is carried out automatically when the models are executed.
Workflow
1. Preprocessing
The text data is cleaned and transformed into formats suitable for analysis. The preprocessing steps include tokenization, stopword removal, and lemmatization.
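One possible implementation of these cleaning steps with NLTK (an illustrative sketch; the repository may use a different toolkit or a different set of steps):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources
# (newer NLTK releases may additionally need "punkt_tab").
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords and non-alphabetic tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    return [LEMMATIZER.lemmatize(t) for t in tokens if t.isalpha() and t not in STOPWORDS]

print(preprocess("Why did the chicken cross the road?"))  # ['chicken', 'cross', 'road']
```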
2. Model Training
Various machine learning models, including Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), BERT, and Transformers, are trained to predict the humor rating of jokes based on their linguistic features.
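Purely as an illustration of the regression setup (the repository's CNN.py, BERT.py, and Transformer.py hold the actual architectures, and the framework choice here is an assumption), a minimal PyTorch CNN regressor over GloVe-initialised embeddings might look like this:

```python
import torch
import torch.nn as nn

class CnnRegressor(nn.Module):
    """1-D CNN over frozen GloVe embeddings with a single regression output."""
    def __init__(self, embedding_matrix: torch.Tensor, n_filters: int = 128, kernel_size: int = 5):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.conv = nn.Conv1d(embedding_matrix.size(1), n_filters, kernel_size)
        self.head = nn.Sequential(nn.ReLU(), nn.AdaptiveMaxPool1d(1), nn.Flatten(),
                                  nn.Linear(n_filters, 1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        return self.head(self.conv(x)).squeeze(-1)     # (batch,) predicted humor rating

# Smoke test with a random matrix standing in for the GloVe embedding matrix.
emb = torch.randn(1000, 100)
model = CnnRegressor(emb)
scores = model(torch.randint(0, 1000, (8, 50)))  # batch of 8 sequences, length 50
loss = nn.MSELoss()(scores, torch.rand(8) * 4)   # ratings lie in [0, 4]
loss.backward()
print(scores.shape, loss.item())
```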
3. Model Evaluation
The trained models are evaluated to determine their performance in predicting humor ratings. Metrics such as RMSE (Root Mean Squared Error) and R² scores are used to assess the models.
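For example, both metrics can be computed from held-out predictions with scikit-learn (the arrays below are placeholders, not real model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder values; in practice y_true is the test split and y_pred the model output.
y_true = np.array([2.1, 3.0, 0.5, 1.8])
y_pred = np.array([2.4, 2.7, 1.0, 1.5])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # lower is better; the baseline is 0.8609
r2 = r2_score(y_true, y_pred)                       # 1.0 would be a perfect fit
print(f"RMSE={rmse:.4f}  R2={r2:.4f}")
```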
4. Classification and Regression
While the primary goal of the project is to predict the numerical humor rating (regression task), we also experiment with classification models for humor detection (e.g., humor vs. non-humor).
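As a small illustration of how the two task views relate, the numeric rating can be paired with a binary humor label. The column names and the fallback labelling rule below are assumptions; the HaHackathon data also provides its own binary humor flag.

```python
import pandas as pd

df = pd.read_csv("data/hack.csv")

# Regression target: the numeric humor rating (0-4) where annotated.
y_reg = df["humor_rating"].dropna()

# Classification target: binary humor vs. non-humor. Prefer the dataset's own flag if
# present; otherwise assume rows without a rating are non-humorous (a layout assumption).
if "is_humor" in df.columns:
    y_clf = df["is_humor"].astype(int)
else:
    y_clf = df["humor_rating"].notna().astype(int)

print(len(y_reg), y_clf.value_counts().to_dict())
```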