extended data exploration
parent
1998cc29d7
commit
bd1a5c62f9
39
README.md
@ -1,10 +1,45 @@
# ANLP_WS24_CA1

# Master MDS
Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

# Data Source
https://github.com/taivop/joke-dataset/tree/master

| File              | Jokes          | Tokens           |
|-------------------|----------------|------------------|
| reddit_jokes.json | 195K jokes     | 7.40M tokens     |
| stupidstuff.json  | 3.77K jokes    | 396K tokens      |
| wocka.json        | 10.0K jokes    | 1.11M tokens     |
| __TOTAL__         | __208K jokes__ | __8.91M tokens__ |

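The per-file counts in the table could be checked with a small script along the following lines. This is only a sketch: it assumes each JSON file holds a list of records with a `body` field and an optional `title` field, and it uses a simple regex tokenizer, so its token counts will not necessarily match the figures above. The helper names `count_tokens`, `dataset_stats` and `load_jokes` are made up for illustration.

```python
import json
import re
from pathlib import Path

# Assumption: a token is either a run of word characters or a single
# punctuation mark. The dataset authors may have tokenized differently.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_tokens(text):
    """Count word and punctuation tokens in a string."""
    return len(TOKEN_RE.findall(text))


def dataset_stats(jokes):
    """Return (number of jokes, total tokens) for a list of joke records."""
    total = 0
    for joke in jokes:
        # Assumption: "title" may be missing, so fall back to an empty string.
        text = f"{joke.get('title', '')} {joke.get('body', '')}"
        total += count_tokens(text)
    return len(jokes), total


def load_jokes(path):
    """Load one dataset file, assumed to be a JSON list of joke records."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

With the dataset checked out locally, `dataset_stats(load_jokes("reddit_jokes.json"))` would return the joke and token counts for that file.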
# Topic presentations (graded) (5 min)
## Focus:
- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach, which techniques will you use?
- Expected results.

## Open Questions:
- How to evaluate similarity?
- How to find structural patterns (such as phrases, setups, punchlines, or wordplay)?

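One possible starting point for the similarity question is n-gram overlap. The sketch below scores two jokes by the Jaccard similarity of their word n-gram sets. The tokenizer, the default of bigrams, and the function names are assumptions made for illustration, not a decided approach.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram_similarity(a, b, n=2):
    """Jaccard similarity of the n-gram sets of two texts, in [0, 1]."""
    grams_a = ngrams(tokenize(a), n)
    grams_b = ngrams(tokenize(b), n)
    if not grams_a and not grams_b:
        # Both texts are shorter than n tokens: treat them as not comparable.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A score of 1.0 means the two jokes contain exactly the same n-grams, and 0.0 means they share none. Because the measure ignores word frequency and meaning, it mainly catches near-duplicates and shared phrasing.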
## Possible Hypotheses:
- Similar jokes share more n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.

Other ideas:
- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.

- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).

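The length hypothesis could be tested with a plain correlation coefficient. The sketch below extracts a few surface features from a joke and computes the Pearson correlation between two numeric lists. The feature names are assumptions, sentiment is left out because it needs an external lexicon or library, and which field holds the rating depends on the dataset file.

```python
import math
import re


def extract_features(text):
    """Return simple surface features of a joke (hypothetical feature set)."""
    tokens = re.findall(r"\w+", text)
    return {
        "n_chars": len(text),
        "n_words": len(tokens),
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A clearly negative value of `pearson(lengths, ratings)` would be consistent with the hypothesis that shorter jokes are rated higher. It would not show that brevity causes the higher rating.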
## Possible Tools / Techniques

- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.
- __Similarity:__ Cosine similarity for finding similar jokes.
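
The techniques listed above can be combined into a small pipeline. The sketch below uses only the standard library: it tokenizes and removes stopwords, builds TF-IDF vectors with a smoothed IDF variant, and compares two jokes with cosine similarity. The tiny stopword list and all function names are placeholders chosen for illustration, and stemming/lemmatization is omitted; a real project would more likely rely on an NLP library for these steps.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "it"}


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

To find the jokes most similar to a given one, compute `cosine_similarity` between its vector and every other vector and sort by score. On the full dataset this pairwise comparison is expensive, so a sparse matrix representation would be the more practical choice.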