# ANLP_WS24_CA1

# Master MDS

Use NLP techniques you have learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

# Data Source

https://github.com/taivop/joke-dataset/tree/master

| File | Jokes | Tokens |
|--------------------|----------------|------------------|
| reddit_jokes.json | 195K jokes | 7.40M tokens |
| stupidstuff.json | 3.77K jokes | 396K tokens |
| wocka.json | 10.0K jokes | 1.11M tokens |
| __TOTAL__ | __208K jokes__ | __8.91M tokens__ |

## *.csv Files

- created with: token_normal.ipynb
- done:
  - Tokenization
  - Stopwords removed
  - Lowercasing
  - Kept only tokens consisting solely of alphabetic characters
  - Lemmatization

# Process

- Tokenization
- (Normalization)
- Feature extraction
- Feature analysis
- Prediction

# Features

- N-grams (paper: Computationally recognizing wordplay in jokes)
- Structural patterns (paper: Human Centric Features)
  - Question -> Answer
  - One-liner
  - Wordplay
  - Dialog
  - Knock-knock jokes
- Embeddings
- Length
- Punctuation

# TODOS:

- 1. __Feature extraction and correlation__
  - 1a: Structural patterns (maybe 2 people?)
    - look at structual_pattern.ipynb
    - data: structural patterns -> sentencization
    - paper research on structural patterns
  - 1b: Extended length analysis (small task)
    - look at token_normal.ipynb
    - distribution normalization
    - paper research on structural patterns
    - if applicable: report table of contents, ...
  - 1c: N-grams
    - data: csv files
  - 1d: Embeddings
    - data: csv files
    - word2vec? (paper: Human Centric Features)
- 2. __Machine learning / logistic regression__
  - (coming soon...)

# Topic presentations (graded) (5 min)

## Focus:

- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach: which techniques will you use?
- Expected results.

## Open Questions:

- How to evaluate similarity?
- How to find structural patterns?
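The n-gram extraction step (TODO 1c) can be sketched as below. This is a minimal illustration, not the code from the notebooks; the function name `extract_ngrams` and the toy token list are made up for the example, and the input is assumed to be an already-normalized token list from the *.csv files (lowercased, stopwords removed, lemmatized).

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy example: a joke after the normalization pipeline described above.
tokens = ["chicken", "cross", "road", "get", "other", "side"]
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
```

Counting n-grams per joke (or per joke class) like this gives the raw frequencies that later feed into correlation analysis or a Bag-of-Words / TF-IDF representation.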
  (like phrases, setups, punchlines, or wordplay)

## Possible Hypotheses:

- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.

Other ideas:

- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as shortness may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).

## Possible Tools / Techniques

- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.

## Research

### Humor Detection

- Humor Detection: A Transformer Gets the Last Laugh - https://arxiv.org/abs/1909.00252
- Computationally recognizing wordplay in jokes (N-grams) - https://www.researchgate.net/publication/229000046_Computationally_recognizing_wordplay_in_jokes
- Word2Vec combined with K-NN, Human Centric Features - https://www.researchgate.net/publication/301446045_Humor_Recognition_and_Humor_Anchor_Extraction
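The similarity technique listed above (TF-IDF plus cosine similarity) can be sketched without external dependencies; in practice, scikit-learn's `TfidfVectorizer` and `cosine_similarity` would likely do this job. All function names and toy jokes below are illustrative assumptions, not project code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus of pre-tokenized, normalized jokes.
jokes = [
    ["chicken", "cross", "road"],
    ["chicken", "cross", "street"],
    ["programmer", "walk", "bar"],
]
vecs = tfidf_vectors(jokes)
```

With this setup, `cosine(vecs[0], vecs[1])` is high because the first two jokes share the terms "chicken" and "cross", while `cosine(vecs[0], vecs[2])` is zero since the jokes have no terms in common; ranking all pairs by this score is one way to answer the "How to evaluate similarity?" question above.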