extended data exploration

2024-11-20 11:52:27 +01:00 · bd1a5c62f9
parent 1998cc29d7
commit bd1a5c62f9
5 changed files with 1398 additions and 459 deletions
--- a/README.md
+++ b/README.md
@@ -1,10 +1,45 @@
 # ANLP_WS24_CA1
 # Master MDS
-Use NLP techniques you learned so far (N-gram models, basic machine learn-
-ing, no neural nets) to analyse texts or to build an application. Document
+Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document
 your approach.
 # Data Source
 https://github.com/taivop/joke-dataset/tree/master
 | File               | Jokes      | Tokens      |
 |--------------------|------------|-------------|
 | reddit_jokes.json  | 195K jokes | 7.40M tokens|
 | stupidstuff.json   | 3.77K jokes| 396K tokens |
 | wocka.json         | 10.0K jokes| 1.11M tokens|
 | __TOTAL__              | __208K jokes__ | __8.91M tokens__|
 # Topic presentations (graded) (5 min)
 ## Focus:
 - What is your overall idea?
 - What kind of data will you use and where do you get the data?
 - Your approach, which techniques will you use?
 - Expected results.
 ## Open Questions:
 - How to evaluate similarity?
 - How to find structural patterns? (like phrases, setups, punchlines, or wordplay)
 ## Possible Hypotheses:
 - Similar jokes share more common n-grams, phrases, or structural patterns.
 - Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
 Other ideas:
 - The length of a joke (measured in words or characters) is inversely correlated with its average rating, since brevity may enhance comedic impact.
 - Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
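
The hypothesis that similar jokes share more n-grams could be checked with a simple set-overlap score. The following is a minimal sketch, not the project's implementation: the helper names (`tokenize`, `ngrams`, `ngram_jaccard`) and the example strings are assumptions made up for illustration, and Jaccard overlap is only one of several possible measures.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(text_a, text_b, n=2):
    """Jaccard overlap of two texts' n-gram sets: shared n-grams divided by all n-grams."""
    a = ngrams(tokenize(text_a), n)
    b = ngrams(tokenize(text_b), n)
    if not a and not b:
        return 0.0  # Neither text is long enough to form an n-gram.
    return len(a & b) / len(a | b)


# Toy strings made up for illustration; they are not taken from the dataset.
joke_1 = "Why did the chicken cross the road? To get to the other side."
joke_2 = "Why did the duck cross the road? To get to the other pond."
joke_3 = "I told my computer a joke, but it did not laugh."

print(ngram_jaccard(joke_1, joke_2))  # shares several bigrams with joke_1
print(ngram_jaccard(joke_1, joke_3))  # shares no bigrams with joke_1
```

A score of 1.0 means the two texts have identical n-gram sets, and 0.0 means they share none. Larger `n` makes the measure stricter, which may help separate near-duplicate jokes from jokes that merely share common words.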
 ## Possible Tools / Techniques
 - __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
 - __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
 - __Similarity:__ Cosine similarity for finding similar jokes.
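
The techniques listed above can be chained into a small pipeline: preprocess, build TF-IDF vectors, then compare them with cosine similarity. The sketch below uses only the Python standard library and is an illustration under assumptions, not the notebooks' code: the stopword list is a tiny made-up sample, stemming/lemmatization is omitted, the IDF uses one common smoothed variant, and all function names and example jokes are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real run would use a fuller one).
STOPWORDS = {"a", "an", "the", "to", "of", "and", "is", "it", "did", "do"}


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords (no stemming in this sketch)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, vectors):
    """Index of the document most similar to the query (needs at least two documents)."""
    scores = [
        (cosine_similarity(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors) if i != query_index
    ]
    return max(scores)[1]


# Toy jokes made up for illustration; they are not taken from the dataset.
jokes = [
    "Why did the chicken cross the road? To get to the other side.",
    "Why did the duck cross the road? To get to the other pond.",
    "I told my computer a joke, but it did not laugh.",
]
vectors = tfidf_vectors(jokes)
print(most_similar(0, vectors))
```

Sparse dictionaries keep the sketch short and avoid building a full vocabulary matrix. For the full dataset of roughly 208K jokes, a pairwise comparison like this would likely be too slow, so a vectorized library implementation would be the more practical choice.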
--- a/data_explo_reddit.ipynb
+++ b/data_explo_reddit.ipynb
--- a/data_explo_stuff.ipynb
+++ b/data_explo_stuff.ipynb
--- a/data_explo_wocka.ipynb
+++ b/data_explo_wocka.ipynb
--- a/data_exploration.ipynb
+++ b/data_exploration.ipynb