
# ANLP_WS24_CA1

# Master MDS
Use the NLP techniques you have learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.


# Data Source
https://github.com/taivop/joke-dataset/tree/master

| File               | Jokes      | Tokens      |
|--------------------|------------|-------------|
| reddit_jokes.json  | 195K jokes | 7.40M tokens|
| stupidstuff.json   | 3.77K jokes| 396K tokens |
| wocka.json         | 10.0K jokes| 1.11M tokens|
| __TOTAL__              | __208K jokes__ | __8.91M tokens__|

## *.csv Files
- created with: token_normal.ipynb
- done:
    - Tokenization
    - Stopword removal
    - Lowercasing
    - Filtering to tokens that consist solely of alphabetic characters
    - Lemmatization
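
The steps listed above could look roughly like the following sketch. It is not the code from token_normal.ipynb: the stopword set is a tiny hypothetical sample, and `naive_lemma` is only a stand-in for real lemmatization (a real run would likely use a library lemmatizer instead).

```python
import re

# Hypothetical, tiny stopword sample; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "why", "did"}


def naive_lemma(token: str) -> str:
    """Stand-in for real lemmatization: strips a plain plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    # Tokenization: words and single punctuation marks become separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercasing
    tokens = [t.lower() for t in tokens]
    # Keep only tokens that consist solely of alphabetic characters
    tokens = [t for t in tokens if t.isalpha()]
    # Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Lemmatization (naive stand-in, see above)
    return [naive_lemma(t) for t in tokens]


print(preprocess("Why did the chickens cross 2 roads?"))
```

The order of the steps matters: lowercasing happens before the stopword lookup so that "Why" and "why" are treated the same.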


# Process 
- Tokenization
- (Normalization)
- Feature Extraction
- Feature analysis
- Prediction

# Features

- N-grams
    - (paper: Computationally recognizing wordplay in jokes)
- structural patterns
    - (paper: Centric Features)

    - Question -> Answer
    - One-liner
    - Wordplay
    - Dialog
    - Knock-knock jokes

- embeddings
- length
- punctuation
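
As a rough illustration of the length, punctuation, and structural-pattern features above, the sketch below computes a few surface features per joke. The regular expressions are assumptions made for this example, not the rules used in structual_pattern.ipynb, and wordplay is left out because it cannot be detected reliably by surface rules like these.

```python
import re

# Assumed heuristic: a dialog line starts with a speaker label ("A:") or a quote/dash.
DIALOG_LINE = re.compile(r"""\s*(\w+:|["'-])""")


def surface_features(joke: str) -> dict:
    """Length, punctuation, and rough structural flags for one joke."""
    lines = [ln for ln in joke.splitlines() if ln.strip()]
    return {
        "n_chars": len(joke),
        "n_words": len(joke.split()),
        "n_question_marks": joke.count("?"),
        "n_exclamation_marks": joke.count("!"),
        # Question -> Answer: a question mark followed by more text
        "is_question_answer": bool(re.search(r"\?\s*\S", joke)),
        # One-liner: one non-empty line with at most one sentence-ending mark
        "is_one_liner": len(lines) == 1 and len(re.findall(r"[.!?]+", joke)) <= 1,
        # Dialog: at least two lines that look like speaker turns
        "is_dialog": sum(bool(DIALOG_LINE.match(ln)) for ln in lines) >= 2,
        # Knock-knock: the phrase appears anywhere, ignoring case
        "is_knock_knock": bool(re.search(r"knock[\s,-]*knock", joke, re.IGNORECASE)),
    }


features = surface_features("Knock knock.\nWho's there?\nBoo.\nBoo who?")
print(features["is_knock_knock"], features["n_question_marks"])
```

The flags are not mutually exclusive, so a joke can be both a knock-knock joke and a question-answer joke.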

# TODOS:
- 1. __Feature extraction and correlation__
    - 1a: Structural patterns
        - maybe 2 people?
        - look at structual_pattern.ipynb
        - data: structural pattern -> Sentencization
        - Paper research on structural patterns
    - 1b: Extended length analysis
        - small task
        - look at token_normal.ipynb
        - distribution normalization
        - Paper research on structural patterns
        - if applicable: report table of contents, ...

    - 1c: N-grams
        - data: csv files
    - 1d: Embeddings
        - data: csv files
        - word2vec? (paper: Centric Features)

   
- 2. Machine Learning / logistic regression
    - (coming soon...)


# Topic presentations (graded) (5 min)
## Focus:
- What is your overall idea?
- What kind of data will you use and where do you get the data?
- Your approach, which techniques will you use?
- Expected results.

## Open Questions:
- How to evaluate similarity?
- How to find structural patterns? (like phrases, setups, punchlines, or wordplay)


## Possible Hypotheses:
- Similar jokes share more common n-grams, phrases, or structural patterns.
- Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
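
One possible way to make "share more common n-grams" measurable is the Jaccard overlap of the n-gram sets of two jokes. The sketch below assumes the jokes are already tokenized; the function names are hypothetical and the choice of Jaccard is an assumption, not a decision taken in this project.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_jaccard(tokens_a, tokens_b, n=2):
    """Shared distinct n-grams divided by all distinct n-grams (0.0 to 1.0)."""
    a, b = set(ngrams(tokens_a, n)), set(ngrams(tokens_b, n))
    if not a and not b:
        return 0.0  # both jokes are shorter than n tokens
    return len(a & b) / len(a | b)


score = ngram_jaccard(["chicken", "cross", "road"], ["duck", "cross", "road"])
print(round(score, 3))
```

A value of 1.0 means both jokes have exactly the same set of n-grams, and 0.0 means they share none.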

Other ideas:
- The length of a joke (measured in words or characters) is inversely correlated with its average rating, as brevity may enhance comedic impact.

- Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
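
The length hypothesis could be checked with a correlation coefficient between joke length and rating. The sketch below computes Pearson's r by hand; the numbers are invented toy values, not taken from the dataset, and Pearson's r only captures linear relationships.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Invented toy values, not taken from the dataset.
lengths_in_words = [5, 10, 15, 20]
ratings = [4.0, 3.0, 2.0, 1.0]
print(pearson(lengths_in_words, ratings))
```

A negative r would be consistent with the hypothesis that shorter jokes are rated higher; a value near zero would not support it.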

## Possible Tools / Techniques

- __Text Preprocessing:__ Tokenization, stopword removal, stemming/lemmatization.
- __Feature Extraction:__ Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.

- __Similarity:__ Cosine similarity for finding similar jokes.
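
To show how TF-IDF and cosine similarity fit together for finding similar jokes, here is a minimal sketch in plain Python. It assumes already tokenized jokes and uses one common smoothed IDF variant; the function names are hypothetical, and a library vectorizer would be the more practical choice on the full dataset.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists -> list of sparse {term: weight} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many jokes does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (one common variant): log((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_idx, vectors):
    """Index of the joke closest to vectors[query_idx] (needs >= 2 jokes)."""
    scores = [(cosine(vectors[query_idx], v), i)
              for i, v in enumerate(vectors) if i != query_idx]
    return max(scores)[1]


docs = [["chicken", "cross", "road"],
        ["duck", "cross", "road"],
        ["knock", "knock", "door"]]
vectors = tfidf_vectors(docs)
print(most_similar(0, vectors))
```

Replacing the single tokens with bigrams or trigrams before vectorizing would give an n-gram based similarity using the same two functions.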


## Research

### Humor Detection
Humor Detection: A Transformer Gets the Last Laugh
- https://arxiv.org/abs/1909.00252


Computationally recognizing wordplay in jokes (N-grams)
- https://www.researchgate.net/publication/229000046_Computationally_recognizing_wordplay_in_jokes

Word2Vec combined with K-NN, Human Centric Features
- https://www.researchgate.net/publication/301446045_Humor_Recognition_and_Humor_Anchor_Extraction