extended data exploration

2024-11-20 11:52:27 +01:00 · bd1a5c62f9
parent 1998cc29d7
commit bd1a5c62f9
5 changed files with 1398 additions and 459 deletions
--- a/README.md
+++ b/README.md
@@ -1,10 +1,45 @@
 # ANLP_WS24_CA1

 # Master MDS
-Use NLP techniques you learned so far (N-gram models, basic machine learn-
-ing, no neural nets) to analyse texts or to build an application. Document
+Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document
 your approach.


 # Data Source
 https://github.com/taivop/joke-dataset/tree/master
+
+| File              | Jokes    | Tokens    |
+|-------------------|----------|-----------|
+| reddit_jokes.json | 195K     | 7.40M     |
+| stupidstuff.json  | 3.77K    | 396K      |
+| wocka.json        | 10.0K    | 1.11M     |
+| __TOTAL__         | __208K__ | __8.91M__ |
+
+
+# Topic presentations (graded) (5 min)
+## Focus:
+- What is your overall idea?
+- What kind of data will you use and where do you get the data?
+- Your approach, which techniques will you use?
+- Expected results.
+
+## Open Questions:
+- How do we evaluate similarity between jokes?
+- How do we find structural patterns (such as phrases, setups, punchlines, or wordplay)?
+
+
+## Possible Hypotheses:
+- Similar jokes have more n-grams, phrases, or structural patterns in common.
+- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
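
One way to make the first hypothesis testable is to measure how many word n-grams two jokes share. The snippet below is a minimal sketch, not the project's method: the helper names `word_ngrams` and `ngram_jaccard`, the regex tokenizer, and the choice of Jaccard overlap are all assumptions made for illustration.

```python
import re


def word_ngrams(text, n=2):
    """Return the set of word n-grams of a lowercased text (hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a, b, n=2):
    """Jaccard overlap of the n-gram sets of two texts, in [0, 1]."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        # Texts shorter than n tokens have no n-grams to compare.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented example strings, not taken from the dataset.
joke_a = "why did the chicken cross the road"
joke_b = "why did the duck cross the road"
overlap = ngram_jaccard(joke_a, joke_b, n=2)
```

A score near 1 means the two jokes are built from nearly the same bigrams, while a score of 0 means they share none. Whether this matches human judgments of "similar" is exactly what the open question above asks.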
+
+Other ideas:
+- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.
+
+- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
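
The length hypothesis could be checked with a plain Pearson correlation between word count and rating. The sketch below assumes each joke is a dict with a text field and a numeric rating field. The key names `body` and `score` are assumptions about the JSON layout and may differ per file, and the three sample jokes are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one variable is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def length_rating_correlation(jokes, text_key="body", rating_key="score"):
    """Correlate joke length in words with rating (key names are assumptions)."""
    lengths = [len(joke[text_key].split()) for joke in jokes]
    ratings = [joke[rating_key] for joke in jokes]
    return pearson(lengths, ratings)


# Invented toy data: shorter jokes were given higher scores on purpose.
sample = [
    {"body": "Short.", "score": 3},
    {"body": "Slightly longer.", "score": 2},
    {"body": "Even longer joke.", "score": 1},
]
r = length_rating_correlation(sample)
```

A clearly negative `r` on the real data would support the hypothesis. Since ratings such as Reddit scores tend to be heavily skewed, a rank-based measure like Spearman correlation may be the more robust choice.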
+
+## Possible Tools / Techniques
+
+- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
+- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
+
+- __Similarity:__ Cosine similarity for finding similar jokes.
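
To illustrate how the feature extraction and similarity steps fit together, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. In practice a library such as scikit-learn would likely be used instead. The function names, the regex tokenizer, and the smoothed IDF formula are assumptions chosen for this example, and the sample jokes are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split a text into word tokens (simplistic, no stemming)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms occurring in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(docs, query_index):
    """Index of the document most similar to docs[query_index], excluding itself."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Invented example jokes, not taken from the dataset.
docs = [
    "why did the chicken cross the road",
    "why did the duck cross the road",
    "a horse walks into a bar",
]
best = most_similar(docs, 0)
```

Comparing every joke against every other joke scales quadratically, so for the roughly 208K jokes in the full dataset a sparse matrix representation or a sampled subset would probably be needed.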
--- a/data_explo_reddit.ipynb
+++ b/data_explo_reddit.ipynb
--- a/data_explo_stuff.ipynb
+++ b/data_explo_stuff.ipynb
--- a/data_explo_wocka.ipynb
+++ b/data_explo_wocka.ipynb
--- a/data_exploration.ipynb
+++ b/data_exploration.ipynb