added Tokenization, Normalization, Sentencization

main
Felix Jan Michael Mucha 2024-11-21 19:29:23 +01:00
parent 98af5c91cb
commit df9d72f49a
5 changed files with 11802 additions and 0 deletions


@@ -15,6 +15,65 @@ https://github.com/taivop/joke-dataset/tree/master
| wocka.json | 10.0K jokes | 1.11M tokens |
| __TOTAL__ | __208K jokes__ | __8.91M tokens__ |
## *.csv Files
- created with: token_normal.ipynb
- preprocessing already applied (a minimal sketch follows below):
  - Tokenization
  - Stopwords removed
  - Lowercased
  - Only tokens consisting solely of alphabetic characters are kept
  - Lemmatization
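
A minimal sketch of these preprocessing steps, assuming NLTK as the tokenizer/lemmatizer backend (the actual implementation lives in token_normal.ipynb and may differ in details):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time resource downloads (tokenizer models, stopword list, WordNet)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> list[str]:
    """Tokenize, lowercase, keep purely alphabetic tokens, drop stopwords, lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(normalize("Why did the chicken cross the road? To get to the other side!"))
# e.g. ['chicken', 'cross', 'road', 'get', 'side']
```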
# Process
- Tokenization
- (Normalization)
- Feature Extraction
- Feature Analysis
- Prediction (an end-to-end skeleton is sketched after this list)
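
The steps above can be wired together end to end; this is only an illustrative scikit-learn skeleton, not the project's final setup, and the file name `jokes_preprocessed.csv` as well as the `text`/`label` columns are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# assumed layout of the preprocessed *.csv files: one text column, one label column
df = pd.read_csv("jokes_preprocessed.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    # tokenization + lowercasing happen inside the vectorizer;
    # unigrams and bigrams double as the n-gram features
    ("ngrams", CountVectorizer(ngram_range=(1, 2), lowercase=True)),
    # prediction step (cf. TODO 2: logistic regression)
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print("accuracy:", pipeline.score(X_test, y_test))
```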
# Features
- N-grams
  - (paper: Computationally recognizing wordplay in jokes)
- structural patterns (heuristics sketched after this list)
  - (paper: Centric Features)
  - Question -> Answer
  - Oneliner
  - Wordplay
  - Dialog
  - Knock-Knock Jokes
- embeddings
- length
- punctuation
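
The non-embedding features (length, punctuation, simple structural patterns such as Question -> Answer, Oneliner, Knock-Knock) can be prototyped as plain Python heuristics; the rules below are rough assumptions, not the logic in structual_pattern.ipynb:

```python
import re
import string

def extract_features(text: str) -> dict:
    """Toy extractor for length, punctuation and a few structural-pattern features."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive sentencization
    return {
        "n_chars": len(text),
        "n_tokens": len(text.split()),
        "n_punct": sum(ch in string.punctuation for ch in text),
        # Question -> Answer: first sentence ends with '?' and more text follows
        "question_answer": len(sentences) > 1 and sentences[0].endswith("?"),
        # Oneliner: the whole joke fits into a single sentence
        "oneliner": len(sentences) == 1,
        # Knock-Knock jokes almost always open with this fixed phrase
        "knock_knock": text.lower().startswith("knock knock"),
    }

print(extract_features("Knock knock. Who's there? Boo. Boo who? Don't cry!"))
```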
# TODOS:
- 1. __Feature extraction and correlation__
- 1a: Structual pattern
- maybe 2 people?
- look at structual_pattern.ipynb
- data: structual pattern -> Sentencization
- Paper Research on strucutal patterns
- 1b: extented length analysis
- small task
- look at token_normal.ipynb
- distribution normalization
- Paper Research on strucutal patterns
- ggf. Bericht Inhaltsverzeichnis,...
- 1c: N-Grams
- data: csv files
- 1d: Embeddings
- data: csv files
- word2vec? (paper: Centric Features)
- 2. Machine Learning / logistic regression
- (coming soon...)
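
For TODOs 1d and 2, one option in line with the word2vec idea from the Centric Features reference is to average word vectors per joke and feed the result to the planned logistic regression; this gensim-based sketch is an assumption about the eventual approach, shown on two toy token lists:

```python
import numpy as np
from gensim.models import Word2Vec

# token lists as produced by the preprocessing step (see token_normal.ipynb)
tokenized_jokes = [
    ["chicken", "cross", "road", "get", "side"],
    ["knock", "knock", "boo", "cry"],
]

# tiny word2vec model; a real run would train on all ~208K jokes
w2v = Word2Vec(sentences=tokenized_jokes, vector_size=100, window=5, min_count=1)

def joke_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of all in-vocabulary tokens (zeros if none)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# embedding feature matrix, ready for the logistic regression from TODO 2
X = np.vstack([joke_vector(toks) for toks in tokenized_jokes])
print(X.shape)  # (2, 100) for these toy examples
```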
# Topic presentations (graded) (5 min)
## Focus:

Diffs for the remaining changed files (including token_normal.ipynb, 553 lines, new file mode 100644) are suppressed because one or more lines are too long.