ANLP_WS24_CA1

Master MDS

Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

Data Source

https://github.com/taivop/joke-dataset/tree/master

File	Jokes	Tokens
reddit_jokes.json	195K jokes	7.40M tokens
stupidstuff.json	3.77K jokes	396K tokens
wocka.json	10.0K jokes	1.11M tokens
TOTAL	208K jokes	8.91M tokens

*.csv Files

created with: token_normal.ipynb
done:
- Tokenization
- Stopword removal
- Lowercasing
- Filtering to tokens that consist solely of alphabetic characters
- Lemmatization
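
The steps listed above could be sketched roughly as follows. This is not the code from token_normal.ipynb, only a minimal illustration using the standard library. The stopword list is a tiny hypothetical one, and lemmatization is left as an optional callable because it needs a lexicon (the notebook presumably uses an NLP library for that).

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "to", "did", "why", "and", "of", "it"}

def normalize(text, stopwords=STOPWORDS, lemmatize=None):
    """Tokenize, lowercase, keep alphabetic tokens, drop stopwords, optionally lemmatize."""
    # Tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercasing.
    tokens = [t.lower() for t in tokens]
    # Keep only tokens that consist solely of alphabetic characters.
    tokens = [t for t in tokens if t.isalpha()]
    # Stopword removal.
    tokens = [t for t in tokens if t not in stopwords]
    # Lemmatization: delegated to a caller-supplied function (assumption).
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return tokens

print(normalize("Why did the chicken cross the road? To get to the other side!"))
```

Note that this simple tokenizer splits contractions such as "don't" into "don" and "t"; a library tokenizer would handle them differently.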

Process

Tokenization
(Normalization)
Feature Extraction
Feature analysis
Prediction

Features

N Grams
- (paper: Computationally recognizing wordplay in jokes)
structural patterns
- (paper: Human Centric Features)
- Questions -> Answer
- Oneliner
- Wordplay
- Dialog
- Knock-Knock Jokes
embeddings
length
punctuation
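
The length and punctuation features could be computed along these lines. The feature names and the choice of punctuation marks are assumptions made for this sketch, not a fixed part of the project.

```python
def surface_features(joke):
    """Return simple length and punctuation counts for one joke text."""
    words = joke.split()  # whitespace tokenization is enough for length counts
    return {
        "n_chars": len(joke),
        "n_words": len(words),
        "n_question_marks": joke.count("?"),
        "n_exclamation_marks": joke.count("!"),
        "n_quotes": joke.count('"'),
    }

print(surface_features("Knock knock. Who's there?"))
```

The resulting dictionaries can be turned into a table (one row per joke) and joined with the ratings for the feature analysis step.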

TODOS:

1. Feature extraction and correlation
- 1a: Structural pattern
  - maybe 2 people?
  - look at structual_pattern.ipynb
  - data: structural pattern -> Sentencization
  - Paper research on structural patterns
- 1b: Extended length analysis
  - small task
  - look at token_normal.ipynb
  - distribution normalization
  - Paper research on structural patterns
  - if applicable: report table of contents, ...
- 1c: N-Grams
  - data: csv files
- 1d: Embeddings
  - data: csv files
  - word2vec? (paper: Human Centric Features)
2. Machine Learning / logistic regression
- (coming soon...)

Topic presentations (graded) (5 min)

Focus:

What is your overall idea?
What kind of data will you use and where do you get the data?
Your approach, which techniques will you use?
Expected results.

Open Questions:

How to evaluate similarity?
How to find structural patterns? (like phrases, setups, punchlines, or wordplay)
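
One possible starting point for the structural-pattern question is a set of rule-based heuristics. The sketch below is only an assumption about how such rules might look; the regular expressions and the label names are illustrative, and wordplay is not covered because it cannot be detected reliably from surface form alone.

```python
import re

def structural_pattern(joke):
    """Assign a coarse structural label to a joke using surface heuristics."""
    text = joke.strip()
    # Knock-knock jokes start with the fixed opener.
    if re.match(r"knock[\s,!.-]*knock", text, flags=re.IGNORECASE):
        return "knock-knock"
    # Dialog: at least two non-empty lines that start with a short speaker label.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    speaker_lines = [ln for ln in lines if re.match(r"\s*[\w .'-]{1,20}:\s", ln)]
    if len(speaker_lines) >= 2:
        return "dialog"
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    # Question -> answer: the first sentence is a question and more text follows.
    if len(sentences) >= 2 and sentences[0].endswith("?"):
        return "question-answer"
    if len(sentences) == 1:
        return "one-liner"
    return "other"

print(structural_pattern("Why did the chicken cross the road? To get to the other side."))
```

The rules are checked in order, so a knock-knock joke that also contains a question is labeled "knock-knock". A manual check on a sample of jokes would be needed to see how well such rules hold up.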

Possible Hypotheses:

Similar jokes share more common n-grams, phrases, or structural patterns.
Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
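
To test the first hypothesis, shared n-grams between two jokes need to be quantified. A minimal sketch, assuming Jaccard overlap of n-gram sets as the measure (other measures are equally possible):

```python
def ngrams(tokens, n=2):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(tokens_a, tokens_b, n=2):
    """Jaccard overlap of the n-gram sets of two token lists (0.0 to 1.0)."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a and not b:
        return 0.0  # neither text is long enough to form an n-gram
    return len(a & b) / len(a | b)

joke_a = "why did the chicken cross the road".split()
joke_b = "why did the duck cross the road".split()
print(ngram_jaccard(joke_a, joke_b, n=2))  # 4 shared bigrams out of 8 distinct -> 0.5
```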

Other ideas:

The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.
Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
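
The length idea could be checked with a correlation coefficient between joke length and rating. The sketch below implements Pearson correlation with the standard library; the numbers in the example are made up purely to show the call and are not results from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 by convention here
    return cov / (sd_x * sd_y)

# Made-up toy values (NOT from the joke dataset): word counts and ratings.
lengths = [5, 12, 30, 80]
ratings = [4.5, 4.0, 3.0, 2.0]
print(pearson(lengths, ratings))  # negative value = longer jokes rated lower in this toy data
```

Since joke lengths and Reddit scores are likely to be heavily skewed, a rank-based measure such as Spearman correlation may be the more robust choice on the real data.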

Possible Tools / Techniques

Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization.
Feature Extraction: Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
Similarity: Cosine similarity for finding similar jokes.
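
TF-IDF weighting and cosine similarity can be combined to compare jokes. The following is a small standard-library sketch operating on already tokenized jokes; the smoothed IDF formula is one common variant and is an assumption here, and in practice a library implementation would probably be used instead.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dict: term -> weight)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    ["chicken", "cross", "road"],
    ["chicken", "cross", "street"],
    ["banker", "lost", "interest"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first pair shares two of three terms and scores above zero, while the second pair shares no terms and scores exactly zero.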

Research

Humor Detection

Humor Detection: A Transformer Gets the Last Laugh

https://arxiv.org/abs/1909.00252

Computationally recognizing wordplay in jokes (N - Grams)

https://www.researchgate.net/publication/229000046_Computationally_recognizing_wordplay_in_jokes

Word2Vec combined with K-NN Human Centric Features

https://www.researchgate.net/publication/301446045_Humor_Recognition_and_Humor_Anchor_Extraction
