extended data exploration

main
Felix Jan Michael Mucha 2024-11-20 11:52:27 +01:00
parent 1998cc29d7
commit bd1a5c62f9
5 changed files with 1398 additions and 459 deletions


@@ -1,10 +1,45 @@
# ANLP_WS24_CA1
# Master MDS
Use the NLP techniques you have learned so far (n-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.
# Data Source
https://github.com/taivop/joke-dataset/tree/master
| File | Jokes | Tokens |
|--------------------|------------|-------------|
| reddit_jokes.json | 195K jokes | 7.40M tokens|
| stupidstuff.json | 3.77K jokes| 396K tokens |
| wocka.json | 10.0K jokes| 1.11M tokens|
| __TOTAL__ | __208K jokes__ | __8.91M tokens__|
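
Counts like those in the table could be reproduced roughly as follows. This is only a sketch: it assumes each file is a JSON array of objects with a `body` field and an optional `title` field, and it splits on whitespace, which may not match the tokenizer behind the numbers above. The helper names `count_jokes_and_tokens` and `count_file` are hypothetical.

```python
import json
from pathlib import Path


def count_jokes_and_tokens(jokes):
    """Return (number of jokes, number of whitespace-separated tokens)."""
    n_tokens = 0
    for joke in jokes:
        # Assumption: the text lives in optional "title" and "body" fields.
        text = f"{joke.get('title', '')} {joke.get('body', '')}"
        n_tokens += len(text.split())
    return len(jokes), n_tokens


def count_file(path):
    """Load one dataset file (assumed to be a JSON array) and count it."""
    with Path(path).open(encoding="utf-8") as fh:
        return count_jokes_and_tokens(json.load(fh))
```

Calling `count_file("reddit_jokes.json")` on a local copy of the dataset would give one row of the table, up to differences in tokenization.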
# Topic presentations (graded) (5 min)
## Focus:
- What is your overall idea?
- What kind of data will you use and where do you get the data?
- What is your approach, and which techniques will you use?
- What results do you expect?
## Open Questions:
- How to evaluate similarity?
- How to find structural patterns? (like phrases, setups, punchlines, or wordplay)
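
One possible answer to the similarity question is n-gram overlap. The sketch below scores two jokes by the Jaccard index of their bigram sets. The regex tokenizer and the function names are assumptions made for illustration, not a method this project has settled on.

```python
import re


def tokenize(text):
    """Lowercase and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(text_a, text_b, n=2):
    """Jaccard overlap of the two texts' n-gram sets, between 0.0 and 1.0."""
    a, b = ngrams(tokenize(text_a), n), ngrams(tokenize(text_b), n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score of 1.0 means the two texts share all of their bigrams, and 0.0 means they share none. Larger `n` rewards longer shared phrases, which may help surface recurring setups.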
## Possible Hypotheses:
- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
Other ideas:
- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.
- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
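
The length hypothesis could be checked with a correlation coefficient. The sketch below computes the Pearson correlation between word count and rating. It assumes the jokes are available as `(text, rating)` pairs, which is a hypothetical input format, and a rank-based measure such as Spearman may suit skewed ratings better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_rating_correlation(jokes):
    """Correlate word count with rating; jokes is a list of (text, rating)."""
    lengths = [len(text.split()) for text, _ in jokes]
    ratings = [rating for _, rating in jokes]
    return pearson(lengths, ratings)
```

A clearly negative value would be consistent with the hypothesis that shorter jokes are rated higher. A value near zero would suggest that length alone explains little.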
## Possible Tools / Techniques
- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.
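
The three steps above can be chained into a small pipeline. The following is a standard-library sketch, not the project's implementation. The stopword list is a tiny illustrative assumption, stemming and lemmatization are omitted, and the smoothed idf formula is one common variant among several.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real list would be larger.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "it", "did"}


def preprocess(text):
    """Tokenize, lowercase, and drop stopwords (no stemming in this sketch)."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf, log((1 + N) / (1 + df)) + 1, so no weight is zero.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, docs):
    """Return (index, score) of the document closest to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores, key=lambda pair: pair[1])
```

`most_similar` needs at least two documents. For a corpus of this size, a library implementation with sparse matrices would likely be faster than these per-document dictionaries.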

File diff suppressed because one or more lines are too long
