# ANLP_WS24_CA1
# Master MDS
Use the NLP techniques covered so far (n-gram models, basic machine learning, no neural networks) to analyse texts or to build an application. Document your approach.
# Data Source
https://github.com/taivop/joke-dataset/tree/master

| File              | Jokes          | Tokens           |
|-------------------|----------------|------------------|
| reddit_jokes.json | 195K jokes     | 7.40M tokens     |
| stupidstuff.json  | 3.77K jokes    | 396K tokens      |
| wocka.json        | 10.0K jokes    | 1.11M tokens     |
| __TOTAL__         | __208K jokes__ | __8.91M tokens__ |
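
The counts in the table can be checked with a short script. The following is a minimal sketch, assuming each file is a top-level JSON array of objects with a `body` field and, in some files, a `title` field. The helper names and local file paths are hypothetical, and whitespace splitting is only one possible tokenization, so the token counts it produces may differ from the table.

```python
import json
from pathlib import Path


def joke_text(joke: dict) -> str:
    """Join the optional title and the body into one string."""
    parts = [joke.get("title", ""), joke.get("body", "")]
    return " ".join(p for p in parts if p)


def count_jokes_and_tokens(jokes: list) -> tuple:
    """Return (number of jokes, number of whitespace-separated tokens)."""
    n_tokens = sum(len(joke_text(j).split()) for j in jokes)
    return len(jokes), n_tokens


def load_jokes(path: str) -> list:
    """Load one dataset file; assumes a top-level JSON array."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical local paths; files that are not present are skipped.
for name in ["reddit_jokes.json", "stupidstuff.json", "wocka.json"]:
    if Path(name).exists():
        print(name, count_jokes_and_tokens(load_jokes(name)))
```
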
# Topic presentations (graded) (5 min)
## Focus:
- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach, which techniques will you use?
- Expected results.
## Open Questions:
- How to evaluate similarity?
- How to find structural patterns (such as phrases, setups, punchlines, or wordplay)?
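
One possible answer to the similarity question is n-gram overlap. The sketch below computes the Jaccard similarity between the bigram sets of two jokes. The tokenizer regex and the function names are illustrative assumptions, not a fixed part of the project.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list, n: int) -> set:
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Share of distinct n-grams the two texts have in common (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(tokenize(a), n), ngrams(tokenize(b), n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # neither text is long enough to contain an n-gram
    return len(grams_a & grams_b) / len(union)


# Two jokes that differ in a single word share 4 of their 8 distinct bigrams.
print(jaccard_ngram_similarity(
    "Why did the chicken cross the road",
    "Why did the duck cross the road",
))  # 0.5
```

Jaccard overlap ignores how often an n-gram occurs and treats all n-grams as equally important, so it is best seen as a simple baseline next to weighted measures such as TF-IDF with cosine similarity.
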
## Possible Hypotheses:
- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.

Other ideas:

- The length of a joke (measured in words or characters) is inversely correlated with its average rating, since brevity may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
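
The length hypothesis can be tested with a plain correlation between a length feature and the rating. The following is a minimal sketch: the feature names are illustrative, the Pearson coefficient is implemented by hand to avoid dependencies, and the toy jokes and ratings are invented for the example rather than taken from the dataset.

```python
import math


def joke_features(text: str) -> dict:
    """Simple surface features of a joke."""
    return {
        "n_chars": float(len(text)),
        "n_words": float(len(text.split())),
        "n_question_marks": float(text.count("?")),
        "n_exclamation_marks": float(text.count("!")),
    }


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy data, invented for illustration only (not taken from the dataset).
toy = [
    ("Short one.", 4.0),
    ("A somewhat longer joke here.", 3.0),
    ("This joke goes on for quite a few more words than the others.", 1.0),
]
lengths = [joke_features(text)["n_words"] for text, _ in toy]
ratings = [rating for _, rating in toy]
print(pearson(lengths, ratings))  # a negative value would support the hypothesis
```

A correlation on the real data only shows an association. Ratings in the different source files may be on different scales, so each file would likely need to be analysed separately.
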
## Possible Tools / Techniques
- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.
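
The three steps above can be chained into a small pipeline. The sketch below uses only the standard library: a tiny illustrative stopword list (an assumption, a real one would be much longer), no stemming, raw term counts with a smoothed IDF, and cosine similarity over sparse dictionaries. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "in", "did", "do"}


def preprocess(text: str) -> list:
    """Lowercase, tokenize, and drop stopwords (no stemming in this sketch)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs: list) -> list:
    """Sparse TF-IDF vectors (term -> weight) from raw counts and smoothed IDF."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_idx: int, vectors: list) -> int:
    """Index of the vector closest to vectors[query_idx], excluding itself."""
    scores = [
        (cosine(vectors[query_idx], v), i)
        for i, v in enumerate(vectors)
        if i != query_idx
    ]
    return max(scores)[1]


# Example jokes chosen for illustration.
jokes = [
    "Why did the chicken cross the road?",
    "Why did the duck cross the road?",
    "I told my wife she was drawing her eyebrows too high.",
]
vectors = tfidf_vectors([preprocess(j) for j in jokes])
print(most_similar(0, vectors))  # index of the joke most similar to the first one
```

For the full dataset, comparing every joke with every other joke this way is quadratic in the number of jokes, so a sparse-matrix implementation or a restriction to candidate pairs would probably be needed.
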
## Research
### Humor Detection
- Humor Detection: A Transformer Gets the Last Laugh
  - https://arxiv.org/abs/1909.00252
- Computationally recognizing wordplay in jokes (n-grams)
  - https://www.researchgate.net/publication/229000046_Computationally_recognizing_wordplay_in_jokes
- Humor Recognition and Humor Anchor Extraction (Word2Vec combined with k-NN and human-centric features)
  - https://www.researchgate.net/publication/301446045_Humor_Recognition_and_Humor_Anchor_Extraction