added Tokenization, Normalization, Sentencization
parent 98af5c91cb
commit df9d72f49a
59 README.md
@@ -15,6 +15,65 @@ https://github.com/taivop/joke-dataset/tree/master
| wocka.json | 10.0K jokes | 1.11M tokens |
| __TOTAL__ | __208K jokes__ | __8.91M tokens__ |

## *.csv Files
- created with: token_normal.ipynb
- done:
  - Tokenization
  - Stopword removal
  - Lowercasing
  - Filtering to tokens that consist solely of alphabetic characters
  - Lemmatization

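The preprocessing steps listed for the *.csv files could look roughly like the sketch below. It is not taken from token_normal.ipynb: the function names, the tiny stopword set, and the identity lemmatizer are all assumptions made for illustration, and only the standard library is used. A real run would plug in a full stopword list and an actual lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not the notebook's list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "did", "why"}


def identity_lemma(token):
    """Placeholder lemmatizer (assumption): returns the token unchanged."""
    return token


def preprocess(text, stopwords=STOPWORDS, lemmatize=identity_lemma):
    """Tokenize, lowercase, keep alphabetic tokens, drop stopwords, lemmatize."""
    # Tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Lowercasing happens before stopword removal so matching is case-insensitive.
    tokens = [t.lower() for t in tokens]
    # Keep only tokens that consist solely of alphabetic characters.
    tokens = [t for t in tokens if t.isalpha()]
    # Stopword removal.
    tokens = [t for t in tokens if t not in stopwords]
    # Lemmatization hook: swap in a real lemmatizer here.
    return [lemmatize(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("Why did the chicken cross the road? To get 2 eggs!"))
```

Passing the lemmatizer as a parameter keeps the sketch dependency-free while leaving room for whichever library the notebook actually uses.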
# Process
- Tokenization
- (Normalization)
- Feature Extraction
- Feature analysis
- Prediction

# Features

- N-grams
  - (paper: Computationally recognizing wordplay in jokes)
- Structural patterns
  - (paper: Centric Features)

  - Question -> Answer
  - One-liner
  - Wordplay
  - Dialog
  - Knock-knock jokes

- Embeddings
- Length
- Punctuation

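As a rough illustration of three of the features above (n-grams, length, punctuation), the following standard-library sketch shows one way they might be computed. The function names and the exact set of counts are assumptions, not the project's actual feature definitions.

```python
import string
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def surface_features(text):
    """Length and punctuation counts computed on the raw joke text."""
    tokens = text.split()  # whitespace tokens, used only for the length count
    punct = Counter(ch for ch in text if ch in string.punctuation)
    return {
        "n_chars": len(text),
        "n_tokens": len(tokens),
        "n_punct": sum(punct.values()),
        "n_question_marks": punct["?"],
        "n_exclamation_marks": punct["!"],
    }


if __name__ == "__main__":
    print(ngrams(["knock", "knock", "who"], 2))
    print(surface_features("Knock, knock! Who's there?"))
```

Counting n-grams over the whole dataset would then be a matter of feeding the tuples into a `Counter`.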
# TODOS:
- 1. __Feature extraction and correlation__
  - 1a: Structural patterns
    - maybe 2 people?
    - look at structual_pattern.ipynb
    - data: structural patterns -> Sentencization
    - Paper research on structural patterns
  - 1b: Extended length analysis
    - small task
    - look at token_normal.ipynb
    - distribution normalization
    - Paper research on structural patterns
    - if applicable: report table of contents, ...

  - 1c: N-grams
    - data: csv files
  - 1d: Embeddings
    - data: csv files
    - word2vec? (paper: Centric Features)


- 2. Machine Learning / logistic regression
  - (coming soon...)

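For TODO 1a, sentencization followed by a coarse structure label might be sketched as below. This is a heuristic illustration under stated assumptions, not the content of structual_pattern.ipynb: the regular expressions, the label names, and the order of the checks are all hypothetical, and wordplay is not detected at all because it needs more than surface cues.

```python
import re


def sentencize(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def structural_pattern(text):
    """Assign one coarse, heuristic structure label to a joke."""
    sentences = sentencize(text)
    if not sentences:
        return "other"
    # "Knock, knock" / "knock knock" / "knock-knock" anywhere in the text.
    if re.search(r"\bknock[,\s-]+knock\b", text, flags=re.IGNORECASE):
        return "knock-knock"
    if len(sentences) == 1:
        return "one-liner"
    # A leading question followed by more text is treated as question -> answer.
    if sentences[0].endswith("?"):
        return "question-answer"
    # Two or more lines starting with a speaker tag such as "Doctor:" suggest a dialog.
    if len(re.findall(r"^\s*\w[\w ]*:", text, flags=re.MULTILINE)) >= 2:
        return "dialog"
    return "other"


if __name__ == "__main__":
    print(structural_pattern("Why did the chicken cross the road? To get to the other side."))
```

The splitter will mis-handle abbreviations such as "Dr." and the labels are mutually exclusive by construction, so this is best read as a starting point for the paper research item rather than a finished feature.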
# Topic presentations (graded) (5 min)
## Focus:
File diff suppressed because one or more lines are too long