# Master MDS

Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

Felix Jan Michael Mucha bd1a5c62f9 extended data exploration		2024-11-20 11:52:27 +01:00
data	init project, first data exploration	2024-11-19 14:42:25 +01:00
LICENSE	Initial commit	2024-11-08 10:04:58 +01:00
README.md	extended data exploration	2024-11-20 11:52:27 +01:00
data_explo_reddit.ipynb	extended data exploration	2024-11-20 11:52:27 +01:00
data_explo_stuff.ipynb	extended data exploration	2024-11-20 11:52:27 +01:00
data_explo_wocka.ipynb	extended data exploration	2024-11-20 11:52:27 +01:00

README.md

ANLP_WS24_CA1

Data Source

https://github.com/taivop/joke-dataset/tree/master

File	Jokes	Tokens
reddit_jokes.json	195K jokes	7.40M tokens
stupidstuff.json	3.77K jokes	396K tokens
wocka.json	10.0K jokes	1.11M tokens
TOTAL	208K jokes	8.91M tokens

Topic presentations (graded) (5 min)

Focus:

What is your overall idea?
What kind of data will you use and where do you get the data?
What is your approach, and which techniques will you use?
Expected results.

Open Questions:

How to evaluate similarity?
How to find structural patterns? (like phrases, setups, punchlines, or wordplay)

Possible Hypotheses:

Similar jokes share more common n-grams, phrases, or structural patterns.
Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
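One way to make the first hypothesis testable is to score two jokes by the overlap of their n-gram sets. The sketch below uses Jaccard overlap on word bigrams; the tokenizer, the function names, and the three toy jokes are assumptions made for illustration and are not taken from the dataset or the notebooks.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=2):
    """Jaccard overlap of the n-gram sets of two texts, in [0, 1]."""
    grams_a = ngrams(tokenize(a), n)
    grams_b = ngrams(tokenize(b), n)
    if not grams_a and not grams_b:
        return 0.0  # both texts too short to contain any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Toy jokes written for this sketch, not taken from the dataset.
joke_a = "Why did the chicken cross the road? To get to the other side."
joke_b = "Why did the duck cross the road? To prove he was not a chicken."
joke_c = "I told my computer a joke, but it did not laugh."

print(ngram_jaccard(joke_a, joke_b))  # shares several bigrams with joke_a
print(ngram_jaccard(joke_a, joke_c))  # shares no bigram with joke_a
```

Jaccard overlap ignores how often an n-gram occurs, so it mainly suits short texts such as one-liners; longer jokes may need a frequency-weighted measure.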

Other ideas:

The length of a joke (measured in words or characters) is negatively correlated with its average rating, since brevity may enhance comedic impact.
Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
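The length hypothesis could be checked with a rank correlation between joke length and rating. The following is a minimal sketch of Spearman's rho in plain Python (Pearson correlation computed on average ranks). The `(text, rating)` pairs are made-up values chosen only to exercise the code; they say nothing about the real data, and how ratings are stored in each JSON file still has to be checked.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical (text, rating) pairs, invented for this sketch.
jokes = [
    ("Short and sweet.", 4.0),
    ("A slightly longer joke with a few more words.", 3.5),
    ("This one goes on for quite a while before it finally gets to the point.", 2.0),
]
lengths = [len(text.split()) for text, _ in jokes]
ratings = [rating for _, rating in jokes]
print(spearman(lengths, ratings))
```

A rank correlation is used here instead of Pearson on the raw values because ratings and lengths are likely to be skewed; that choice is an assumption, not something the data exploration has confirmed.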

Possible Tools / Techniques

Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization.
Feature Extraction: Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
Similarity: Cosine similarity for finding similar jokes.
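The three steps above can be sketched end to end with the standard library alone: tokenization with stopword removal, TF-IDF weights over unigrams and bigrams, and cosine similarity to find the closest joke. The stopword list, the smoothed IDF variant, the function names, and the toy jokes are all assumptions for illustration; stemming and lemmatization are left out, and a library implementation may be preferable for the full dataset.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "it", "did", "my", "i"}


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def features(tokens):
    """Bag of unigrams plus bigrams with raw counts."""
    bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return Counter(tokens + bigrams)


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    counts = [features(preprocess(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # One common smoothed variant: idf = ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Return (index, score) of the document closest to the query."""
    scores = [
        (i, cosine(vectors[query_index], v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scores, key=lambda pair: pair[1])


# Toy jokes written for this sketch, not taken from the dataset.
docs = [
    "Why did the chicken cross the road? To get to the other side.",
    "Why did the duck cross the road? To prove he was not a chicken.",
    "I told my computer a joke, but it did not laugh.",
]
vectors = tfidf_vectors(docs)
print(most_similar(0, vectors))
```

Sparse dictionaries keep the sketch readable, but comparing every joke with every other joke is quadratic in the number of jokes, so a run over all three files would probably need sampling or a vectorized implementation.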