# Master MDS: Use NLP techniques to analyse texts or to build an application. Document your approach.
 
 

ANLP_WS24_CA2

This repository contains the necessary scripts, data, and notebooks for analyzing and modeling the linguistic and structural features of humor in jokes. The project focuses on leveraging NLP techniques to analyze humor in text data, and aims to predict the humor score numerically using regression models.


Objective

Leverage advanced NLP techniques (LSTM, CNN, BERT, and Transformer) to analyze text data and build an application that predicts humor ratings.

Research Question

...

Data Source

The data is sourced from SemEval-2021 Task 7 (HaHackathon), which provides a dataset of humor and offense ratings for jokes. Each joke is annotated with a humor rating on a scale from 0 to 4.

Data Embeddings

Preprocessing Steps

1. Load and clean the data: The dataset is loaded, all rows with missing humor_rating values are removed, and the target variable for the humor rating is extracted.

2. Text embeddings: Pre-trained GloVe embeddings are loaded and converted into a matrix that can be used for modeling.

3. Data split: The dataset is split into training, test, and validation sets so that the models can be trained and evaluated later.

4. Ensemble data indices: Several methods for creating data indices are provided to prepare the training data for ensemble methods (a minimal code sketch of these steps follows this list).
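The sketch below illustrates these steps under stated assumptions: the file name data/hack.csv matches the repository's data folder, while the column names (text, humor_rating), the GloVe file, the embedding dimension, the split ratios, and the number of bootstrap resamples are illustrative placeholders rather than the project's actual configuration.

```python
# Minimal preprocessing sketch; file, column, and parameter names are assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Load and clean: drop rows without a humor rating, extract the target.
df = pd.read_csv("data/hack.csv").dropna(subset=["humor_rating"])
texts = df["text"].tolist()
y = df["humor_rating"].to_numpy(dtype=np.float32)

# 2. GloVe embeddings: map each vocabulary word to its pre-trained vector.
embedding_dim = 100
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vector = line.split()
        glove[word] = np.asarray(vector, dtype=np.float32)

vocab = {w: i + 1 for i, w in enumerate(sorted({t for text in texts for t in text.lower().split()}))}
embedding_matrix = np.zeros((len(vocab) + 1, embedding_dim), dtype=np.float32)
for word, index in vocab.items():
    embedding_matrix[index] = glove.get(word, embedding_matrix[index])

# 3. Split into training, validation, and test sets (80/10/10 here).
X_train, X_tmp, y_train, y_tmp = train_test_split(texts, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

# 4. Ensemble data indices: bootstrap resamples of the training set.
rng = np.random.default_rng(42)
bootstrap_indices = [rng.choice(len(X_train), size=len(X_train), replace=True) for _ in range(5)]
```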


Repository Structure

  • data/: Contains the dataset hack.csv, which includes the raw joke data.
  • Model code: BERT.py, CNN.py, LSTM.py, and Transformer.py implement the individual models; Datasets.py, EarlyStopping.py, dataset_helper.py, ml_helper.py, ml_history.py, ml_train.py, and ml_plots.py provide shared data handling, training, and plotting utilities.
  • Notebooks:
    • data_exploration.ipynb: data analysis and visualization.
    • model_comparison.ipynb and model_evaluation.ipynb: comparison and evaluation of the trained models.

Getting Started

Install Requirements

Run the following command to install the required dependencies:

pip install -r requirements.txt

Preprocess Data

Preprocessing is carried out automatically when the model scripts are executed.

Workflow

1. Preprocessing

The text data is cleaned and transformed into formats suitable for analysis. The preprocessing steps include tokenization, stopword removal, and lemmatization.
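As an illustration of these cleaning steps, the following sketch uses NLTK for stopword removal and lemmatization with a simple regex tokenizer; the actual project code may use different tools and rules.

```python
# Illustrative text-cleaning sketch (tokenization, stopword removal, lemmatization).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> list[str]:
    """Lowercase, tokenize into alphabetic words, drop stopwords, lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatizer.lemmatize(token) for token in tokens if token not in STOPWORDS]

print(clean_text("Why did the chicken cross the road? To get to the other side!"))
# -> ['chicken', 'cross', 'road', 'get', 'side']
```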

2. Model Training

Various machine learning models, including Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), BERT, and Transformers, are trained to predict the humor rating of jokes based on their linguistic features.
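As a rough sketch of how one of these regressors could be wired up (assuming a PyTorch implementation; the class name, layer sizes, and the reuse of the GloVe matrix from the preprocessing step are illustrative assumptions, not the project's actual code):

```python
# Hypothetical LSTM regressor over pre-trained GloVe embeddings (PyTorch assumed).
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128):
        super().__init__()
        weights = torch.as_tensor(embedding_matrix, dtype=torch.float32)
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=True, padding_idx=0)
        self.lstm = nn.LSTM(weights.shape[1], hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # single continuous humor rating

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)           # (batch, seq_len, emb_dim)
        _, (last_hidden, _) = self.lstm(embedded)      # (1, batch, hidden_size)
        return self.head(last_hidden[-1]).squeeze(-1)  # (batch,)

# Training would minimize the mean squared error against the humor ratings, e.g.
# loss = nn.MSELoss()(model(batch_token_ids), batch_ratings)
```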

3. Model Evaluation

The trained models are evaluated to determine their performance in predicting humor ratings. Metrics such as Mean Squared Error (MSE) and R² scores are used to assess the models.
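For reference, both metrics can be computed with scikit-learn; the numbers below are made-up predictions for illustration, not project results.

```python
# MSE and R² on hypothetical predictions (illustrative values only).
from sklearn.metrics import mean_squared_error, r2_score

y_true = [1.8, 2.4, 0.5, 3.1]  # annotated humor ratings (0–4 scale)
y_pred = [2.0, 2.1, 0.9, 2.8]  # model predictions

print("MSE:", mean_squared_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))
```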

4. Classification and Regression

While the primary goal of the project is to predict the numerical humor rating (a regression task), we also experiment with classification models for humor detection (e.g., humor vs. non-humor).
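One simple way to derive binary labels from the rating, assuming only the numerical rating is available and using an arbitrary cutoff, would be:

```python
# Hypothetical binarization of humor ratings; the 0.0 threshold is an assumption.
import numpy as np

ratings = np.array([0.0, 1.2, 3.5, 0.0, 2.1])  # example humor ratings
labels = (ratings > 0.0).astype(int)           # 1 = humorous, 0 = not humorous
print(labels)                                  # [0 1 1 0 1]
```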


Research References

Key Papers in Humor Detection

  1. Humor Recognition Using Deep Learning (https://aclanthology.org/N18-2018.pdf)

  2. Adversarial Training Methods for Semi-Supervised Text Classification (https://arxiv.org/pdf/1605.07725)

  3. Humor Detection: A Transformer Gets the Last Laugh (https://aclanthology.org/D19-1372/)


Summary

Master MDS: Use NLP techniques to analyse texts or to build an application. Document your approach.

Data

https://competitions.codalab.org/competitions/27446

https://aclanthology.org/2021.semeval-1.9.pdf

Not Prioritised (Pun data)