# ANLP_WS24_CA1

# Master MDS

Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

# Data Source

https://github.com/taivop/joke-dataset/tree/master

| File              | Jokes          | Tokens           |
|-------------------|----------------|------------------|
| reddit_jokes.json | 195K jokes     | 7.40M tokens     |
| stupidstuff.json  | 3.77K jokes    | 396K tokens      |
| wocka.json        | 10.0K jokes    | 1.11M tokens     |
| __TOTAL__         | __208K jokes__ | __8.91M tokens__ |

## *.csv Files

- created with: token_normal.ipynb
- done:
  - Tokenization
  - Stopwords removed
  - Lowercasing
  - Kept only tokens consisting solely of alphabetic characters
  - Lemmatization

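The normalization steps listed above could be sketched roughly like this (standard library only; the stopword set and the lemma lookup table are tiny stand-ins for real resources such as NLTK's English stopword list and its WordNetLemmatizer):

```python
import re

# Tiny stand-in stopword set (a real run would use e.g. NLTK's English stopwords).
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it"}

# Toy lemma lookup standing in for a real lemmatizer (e.g. WordNetLemmatizer).
LEMMAS = {"chickens": "chicken", "roads": "road", "crossing": "cross"}

def preprocess(text: str) -> list[str]:
    """Tokenize, lowercase, keep alphabetic tokens, drop stopwords, lemmatize."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())   # tokenization + lowercasing + alphabetic filter
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]           # (approximate) lemmatization

print(preprocess("Why did the chickens cross 2 roads?"))
# -> ['why', 'did', 'chicken', 'cross', 'road']
```

This is the pipeline that would produce the *.csv files; the real notebook presumably uses a full stopword list and a proper lemmatizer.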
# Process

- Tokenization
- (Normalization)
- Feature Extraction
- Feature analysis
- Prediction

# Features

- N-grams
  - (paper: Computationally recognizing wordplay in jokes)
- Structural patterns
  - (paper: Centric Features)
  - Question -> Answer
  - One-liner
  - Wordplay
  - Dialog
  - Knock-knock jokes
- Embeddings
- Length
- Punctuation

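Two of these features could be sketched as follows: plain n-gram counts, plus a heuristic structural tag for the joke types above (the tagging rules here are illustrative guesses, not rules taken from the project):

```python
import re
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def joke_type(text: str) -> str:
    """Very rough structural tag; these heuristics are illustrative only."""
    if re.match(r"(?i)knock,? knock", text.strip()):
        return "knock-knock"
    if "?" in text:
        return "question-answer"
    if len(text.split()) <= 15 and "\n" not in text:
        return "one-liner"
    return "other"

tokens = "why did the chicken cross the road".split()
bigrams = Counter(ngrams(tokens, 2))
print(bigrams[("the", "chicken")])             # -> 1
print(joke_type("Knock knock. Who's there?"))  # -> knock-knock
```

The n-gram counts per joke can then be stacked into a document-term matrix for the prediction step.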
# TODOS:

- 1. __Feature extraction and correlation__
  - 1a: Structural patterns
    - maybe 2 people?
    - look at structual_pattern.ipynb
    - data: structural patterns -> sentencization
    - paper research on structural patterns
  - 1b: extended length analysis
    - small task
    - look at token_normal.ipynb
    - distribution normalization
    - paper research on structural patterns
    - if applicable: report table of contents, ...
  - 1c: N-grams
    - data: csv files
  - 1d: Embeddings
    - data: csv files
    - word2vec? (paper: Centric Features)
- 2. __Machine Learning / logistic regression__
  - (coming soon...)

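For step 2, a minimal logistic-regression sketch trained with plain gradient descent (no external libraries) on two simple features, token count and punctuation count; the toy jokes and "highly rated" labels are invented purely for illustration:

```python
import math

def features(joke: str) -> list[float]:
    """Simple features: bias term, scaled token count, punctuation count."""
    n_tokens = len(joke.split())
    n_punct = sum(joke.count(c) for c in "?!.,")
    return [1.0, n_tokens / 10.0, float(n_punct)]

def train_logreg(X, y, lr=0.1, epochs=500):
    """Plain gradient-descent logistic regression."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

def predict(w, x):
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5

# Toy training set: label 1 = "highly rated" (entirely made up for illustration).
jokes = [
    ("Why did the chicken cross the road? To get to the other side.", 1),
    ("Knock knock. Who's there?", 1),
    ("A very long rambling story with no punchline at all whatsoever and then some more words", 0),
    ("Another long meandering joke that keeps going and going without ever landing anywhere funny", 0),
]
X = [features(j) for j, _ in jokes]
y = [label for _, label in jokes]
w = train_logreg(X, y)
print([predict(w, x) for x in X])
```

In practice the same setup would take the extracted n-gram, structural, length, and punctuation features as input, with ratings binned into classes.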
# Topic presentations (graded) (5 min)

## Focus:

- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach: which techniques will you use?
- Expected results.

## Open Questions:

- How to evaluate similarity?
- How to find structural patterns (like phrases, setups, punchlines, or wordplay)?

## Possible Hypotheses:

- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.

Other ideas:

- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as shortness may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).

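The length-vs-rating hypothesis could be checked with a Pearson correlation, implemented here from the definition; the (length, rating) pairs are made up just to exercise the function:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up (length in tokens, average rating) pairs purely for illustration.
lengths = [5, 12, 30, 80, 150]
ratings = [4.5, 4.0, 3.5, 2.5, 2.0]
r = pearson(lengths, ratings)
print(round(r, 2))  # -> -0.95
```

A strongly negative `r` on the real data would support the hypothesis; on the actual dataset one would also check significance, not just the coefficient.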
## Possible Tools / Techniques

- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.

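TF-IDF plus cosine similarity can be sketched with the standard library alone (the smoothed IDF here mirrors scikit-learn's variant as an assumption; in practice `TfidfVectorizer` would handle all of this):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for tokenized documents, with smoothed IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "why did the chicken cross the road".split(),
    "why did the duck cross the road".split(),
    "i hate mondays".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # -> True
```

The near-duplicate chicken/duck jokes score far higher than the unrelated one, which is exactly the behaviour needed for the "find similar jokes" question above.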
## Research

### Humor Detection

Humor Detection: A Transformer Gets the Last Laugh

- https://arxiv.org/abs/1909.00252

Computationally recognizing wordplay in jokes (N-grams)

- https://www.researchgate.net/publication/229000046_Computationally_recognizing_wordplay_in_jokes

Humor Recognition and Humor Anchor Extraction (Word2Vec combined with K-NN Human Centric Features)

- https://www.researchgate.net/publication/301446045_Humor_Recognition_and_Humor_Anchor_Extraction