From 3a738f277f88efa84af6c9bdb79bdadbfee3a838 Mon Sep 17 00:00:00 2001
From: klara
Date: Sun, 16 Feb 2025 19:35:29 +0100
Subject: [PATCH] update

---
 README.md | 42 ++++++++----------------------------------
 1 file changed, 8 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 76430cb..0a93c9c 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ Leverage advanced NLP techniques (LSTM, CNN, BERT, and Transformer) to analyze t
 The data is sourced from the SemEval-2021 Task 7:
 It contains a dataset of humor and offense ratings for jokes. The jokes are annotated with a humor rating on a scale from 0 to 4.
 
-- Traindata: HaHackathon.https://homepages.inf.ed.ac.uk/s1573290/data.html
+- Traindata: HaHackathon.https://homepages.inf.ed.ac.uk/s1573290/data.html -> associated paper: https://aclanthology.org/2021.semeval-1.9.pdf#:~:text=HaHackathon%20is%20the%20first%20shared%20task%20to%20combine,its%20average%20ratings%20for%20both%20humor%20and%20offense
 - Testdata: Since no test data was available, the traindata also was used as test data and divided into test, train and validation data
 
 
@@ -24,14 +24,13 @@ It contains a dataset of humor and offense ratings for jokes. The jokes are anno
 
 ### Preprocessing Steps
 
-**1. Daten laden und bereinigen:** Der Datensatz wird geladen und alle Zeilen mit fehlenden humor_rating-Werten werden entfernt. Außerdem wird die Zielvariable für die Humorbewertung extrahiert.
+**1. Load and clean data:** The data set is loaded and all rows with missing humor_rating values are removed. In addition, the target variable for the humor rating is extracted.
 
-**2. Text-Embeddings:** Vortrainierte GloVe-Embeddings werden geladen und in eine Matrix umgewandelt, die für die Modellierung genutzt werden kann.
+**2. text embeddings:** Pre-trained GloVe embeddings are loaded and converted into a matrix that can be used for modeling.
 
-**3. Datenaufteilung:** Der Datensatz wird in Trainings-, Test- und Validierungsdaten aufgeteilt, um die Modelle später zu trainieren und zu evaluieren.
-
-**4. Ensemble-Datenindizes:** Verschiedene Methoden zur Erstellung von Datenindizes werden bereitgestellt, um die Trainingsdaten für Ensemble-Methoden aufzubereiten.
+**3. data splitting:** The data set is split into training, test and validation data to train and evaluate the models later.
 
+**4. ensemble data indices:** Various methods for creating data indices are provided to prepare the training data for ensemble methods.
 
 ---
 
@@ -63,7 +62,7 @@ The text data is cleaned and transformed into formats suitable for analysis. The
 Various machine learning models, including Convolutional Neural Networks (CNNs), Long Short-Term Memory Networks (LSTMs), BERT, and Transformers, are trained to predict the humor rating of jokes based on their linguistic features.
 
 ### 3. Model Evaluation
-The trained models are evaluated to determine their performance in predicting humor ratings. Metrics such as Mean Squared Error (MSE) and R² scores are used to assess the models.
+The trained models are evaluated to determine their performance in predicting humor ratings. Metrics such as RNSE (Root Mean Squared Error) and R² scores are used to assess the models.
 
 ### 4. Classification and Regression
 While the primary goal of the project is to predict the numerical humor rating (regression task), we also experiment with classification models for humor detection (e.g., humor vs. non-humor)
@@ -80,34 +79,9 @@ While the primary goal of the project is to predict the numerical humor rating (
 
 3. **Humor Detection: A Transformer Gets the Last Laugh** (https://aclanthology.org/D19-1372/)
 
+
+
 ---
 
 ## Summary
 
-
-
-# Master MDS Use NLP techniques to analyse texts or to build an application. Document your approach.
-
-
-
-
-
-
-## Data
-
-
-https://competitions.codalab.org/competitions/27446
-
-https://aclanthology.org/2021.semeval-1.9.pdf#:~:text=HaHackathon%20is%20the%20first%20shared%20task%20to%20combine,its%20average%20ratings%20for%20both%20humor%20and%20offense.
-
-
-- Hackathon: https://homepages.inf.ed.ac.uk/s1573290/data.html
-
-
-
-#### Not Prioritised (Pun data)
-- Challenge https://alt.qcri.org/semeval2017/task7/
-- Pun Annotated Amazon (joke not included ...): https://github.com/amazon-science/expunations/tree/main/data
-
-
-