ANLP_WS24_CA1
Master MDS
Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.
Data Source
https://github.com/taivop/joke-dataset/tree/master
File | Jokes | Tokens |
---|---|---|
reddit_jokes.json | 195K jokes | 7.40M tokens |
stupidstuff.json | 3.77K jokes | 396K tokens |
wocka.json | 10.0K jokes | 1.11M tokens |
TOTAL | 208K jokes | 8.91M tokens |
*.csv Files
- created with: token_normal.ipynb
- preprocessing steps already applied:
  - Tokenization
  - Stopword removal
  - Lowercasing
  - Filtering to tokens that consist solely of alphabetic characters
  - Lemmatization
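
The steps listed above could look roughly like the following. This is only a minimal sketch, not the code from token_normal.ipynb: the names `preprocess` and `STOPWORDS` and the tiny stopword list are assumptions for illustration. Lemmatization is left out here because it needs an external lexicon (for example NLTK's `WordNetLemmatizer`).

```python
import re

# Tiny illustrative stopword list (an assumption; the notebook's real list is not shown here).
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "it", "did", "why", "what"}


def preprocess(text, stopwords=STOPWORDS):
    """Tokenize, lowercase, keep purely alphabetic tokens, drop stopwords."""
    # Lowercase first, then split into word-like runs and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # str.isalpha() removes punctuation, numbers and mixed tokens such as "b2b".
    return [tok for tok in tokens if tok.isalpha() and tok not in stopwords]


if __name__ == "__main__":
    print(preprocess("Why did the chicken cross the road? To get to the other side!"))
```

A lemmatizer could be applied to the returned list as a last step.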
Process
- Tokenization
- (Normalization)
- Feature Extraction
- Feature analysis
- Prediction
Features
- N-grams (paper: Computationally recognizing wordplay in jokes)
- Structural patterns (paper: Centric Features)
  - Question -> Answer
  - One-liner
  - Wordplay
  - Dialog
  - Knock-Knock Jokes
- Embeddings
- Length
- Punctuation
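
A few of these features (n-grams, length, punctuation) can be computed with the standard library alone. The sketch below is an assumption about how that might look; `ngrams` and `surface_features` are hypothetical helper names, and whitespace splitting stands in for the real tokenizer.

```python
import string
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def surface_features(joke):
    """Compute simple length, punctuation and bigram features for one joke."""
    tokens = joke.lower().split()  # naive whitespace tokenization (assumption)
    return {
        "n_chars": len(joke),
        "n_tokens": len(tokens),
        "n_punct": sum(ch in string.punctuation for ch in joke),
        "n_question_marks": joke.count("?"),
        "n_exclamations": joke.count("!"),
        "bigrams": Counter(ngrams(tokens, 2)),
    }


if __name__ == "__main__":
    feats = surface_features("Knock knock. Who is there?")
    print(feats["n_tokens"], feats["n_punct"], feats["bigrams"].most_common(2))
```

Because punctuation stays attached to the whitespace tokens here, n-gram counts on the preprocessed csv files would differ from this toy output.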
TODOS:
- Feature extraction and correlation
  - 1a: Structural patterns
    - maybe 2 people?
    - look at structual_pattern.ipynb
    - data: structural patterns -> Sentencization
    - paper research on structural patterns
  - 1b: Extended length analysis
    - small task
    - look at token_normal.ipynb
    - distribution normalization
    - paper research on structural patterns
    - if applicable: report table of contents, ...
  - 1c: N-grams
    - data: csv files
  - 1d: Embeddings
    - data: csv files
    - word2vec? (paper: Centric Features)
- Machine Learning / logistic regression
  - (coming soon...)
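
Until the machine learning part is written, here is a minimal sketch of what a logistic regression on extracted features could look like, using plain batch gradient descent and no external libraries. Everything in it is an assumption: the function names, the learning rate, and especially the toy data (one scaled length feature and made-up binary labels, not values from the dataset).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


if __name__ == "__main__":
    # Made-up toy data: one scaled length feature, label 1 = "highly rated".
    X = [[0.1], [0.2], [0.8], [0.9]]
    y = [1, 1, 0, 0]
    w, b = train_logreg(X, y)
    print(predict(w, b, [0.15]), predict(w, b, [0.85]))
```

In the project itself a library implementation would probably replace the hand-written training loop.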
Topic presentations (graded) (5 min)
Focus:
- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach, which techniques will you use?
- Expected results.
Open Questions:
- How to evaluate similarity?
- How to find structural patterns? (like phrases, setups, punchlines, or wordplay)
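
One possible starting point for the structural-pattern question is a handful of hand-written rules. The sketch below is a rough heuristic under stated assumptions, not a validated method: the function name `structural_pattern`, the label names, and the regular expressions are all made up for illustration, and wordplay is not covered because it cannot be found from surface structure alone.

```python
import re


def structural_pattern(joke):
    """Assign a coarse structural label to a joke using simple rules."""
    text = joke.strip()
    lower = text.lower()
    if lower.startswith("knock knock") or lower.startswith("knock, knock"):
        return "knock-knock"
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Dialog: at least two lines that look like "Speaker: utterance".
    speaker_lines = sum(bool(re.match(r"^\s*[\w .'-]{1,30}:\s+\S", ln)) for ln in lines)
    if speaker_lines >= 2:
        return "dialog"
    # Question -> answer: a question mark that is followed by more text.
    if re.search(r"\?\s*\S", text):
        return "question-answer"
    # One-liner: the text contains a single sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) == 1:
        return "one-liner"
    return "other"


if __name__ == "__main__":
    print(structural_pattern("Why did the chicken cross the road? To get to the other side."))
```

The order of the rules matters: a knock-knock joke also contains a question, so it is checked first.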
Possible Hypotheses:
- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
Other ideas:
- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as shortness may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
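
The length hypothesis could be checked with a correlation coefficient between joke length and rating. The following is a minimal sketch of the Pearson correlation in plain Python. The function name `pearson` is an assumption, and the numbers in the example are made up for illustration, not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up values: joke lengths in tokens and invented ratings.
    lengths = [5, 12, 20, 40]
    ratings = [4.5, 4.0, 3.0, 1.5]
    print(pearson(lengths, ratings))  # a negative value would support the hypothesis
```

If the length distribution is heavily skewed, a rank-based measure such as Spearman's correlation may be the more robust choice.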
Possible Tools / Techniques
- Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization.
- Feature Extraction: Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- Similarity: Cosine similarity for finding similar jokes.
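
To make the TF-IDF and cosine similarity idea concrete, here is a small self-contained sketch. It uses one common smoothed IDF variant, which is an assumption and not necessarily the weighting the project will end up with. The names `tfidf_vectors` and `cosine` are hypothetical, and the three token lists are invented examples.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = [
        ["chicken", "cross", "road"],
        ["chicken", "cross", "street"],
        ["banker", "lost", "interest"],
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Jokes that share no terms get a similarity of exactly 0, which is one reason n-gram or embedding features may be needed for paraphrased jokes.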
Research
Humor Detection
Humor Detection: A Transformer Gets the Last Laugh
Computationally recognizing wordplay in jokes (N-grams)
Word2Vec combined with K-NN Human Centric Features