# ANLP_WS24_CA1

# Master MDS

Use the NLP techniques you have learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

# Data Source

https://github.com/taivop/joke-dataset/tree/master

| File               | Jokes          | Tokens           |
|--------------------|----------------|------------------|
| reddit_jokes.json  | 195K jokes     | 7.40M tokens     |
| stupidstuff.json   | 3.77K jokes    | 396K tokens      |
| wocka.json         | 10.0K jokes    | 1.11M tokens     |
| __TOTAL__          | __208K jokes__ | __8.91M tokens__ |

# Topic presentations (graded) (5 min)

## Focus:

- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach: which techniques will you use?
- Expected results.

## Open Questions:

- How to evaluate similarity?
- How to find structural patterns (like phrases, setups, punchlines, or wordplay)?

## Possible Hypotheses:

- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.

Other ideas:

- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).

## Possible Tools / Techniques

- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.
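
The techniques listed above can be sketched end to end in a few lines. The following is a minimal illustration using only the Python standard library, not the project's actual implementation. The function names (`tokenize`, `ngrams`, `tfidf_vectors`, `cosine`), the regex tokenizer, the smoothed IDF formula, and the three toy jokes are all assumptions made for this example; none of them come from the dataset or the course material.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (simple regex, an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Relative term frequency times a smoothed IDF (one common variant).
        vec = {
            term: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy jokes written for this example; they are NOT taken from the dataset.
    jokes = [
        "Why did the chicken cross the road? To get to the other side.",
        "Why did the duck cross the road? To prove he was not a chicken.",
        "I told my computer a joke about UDP, but it did not get it.",
    ]
    vectors = tfidf_vectors(jokes)
    print("bigrams of joke 0:", ngrams(tokenize(jokes[0]), 2)[:3])
    print("sim(0, 1):", round(cosine(vectors[0], vectors[1]), 3))
    print("sim(0, 2):", round(cosine(vectors[0], vectors[2]), 3))
```

On the toy input, the two "cross the road" jokes should score noticeably higher than the unrelated pair, which is the behaviour the first hypothesis relies on. For the full 208K-joke corpus, a library vectorizer with sparse matrices would likely be more practical than plain dicts.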