added Tokenization, Normalization, Sentencization

main
Felix Jan Michael Mucha 2024-11-21 19:29:23 +01:00
parent 98af5c91cb
commit df9d72f49a
5 changed files with 11802 additions and 0 deletions


@@ -15,6 +15,65 @@ https://github.com/taivop/joke-dataset/tree/master
| wocka.json | 10.0K jokes | 1.11M tokens |
| __TOTAL__ | __208K jokes__ | __8.91M tokens__ |
## *.csv Files
- created with: token_normal.ipynb
- preprocessing already applied (a minimal sketch follows below):
  - Tokenization
  - Stopwords removed
  - Lowercased
  - Only tokens consisting solely of alphabetic characters are kept
  - Lemmatization
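
A minimal sketch of these preprocessing steps, assuming NLTK as the tokenizer/lemmatizer backend (the actual implementation lives in token_normal.ipynb and may differ in details):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# one-time resource downloads (tokenizer models, stopword list, WordNet)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> list[str]:
    """Tokenize, lowercase, keep purely alphabetic tokens, drop stopwords, lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(normalize("Why did the chicken cross the road? To get to the other side!"))
# e.g. ['chicken', 'cross', 'road', 'get', 'side']
```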
# Process
- Tokenization
- (Normalization)
- Feature Extraction
- Feature Analysis
- Prediction (an end-to-end skeleton is sketched after this list)
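
The steps above can be wired together end to end; this is only an illustrative scikit-learn skeleton, not the project's final setup, and the file name `jokes_preprocessed.csv` as well as the `text`/`label` columns are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# assumed layout of the preprocessed *.csv files: one text column, one label column
df = pd.read_csv("jokes_preprocessed.csv")  # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    # tokenization + lowercasing happen inside the vectorizer;
    # unigrams and bigrams double as the n-gram features
    ("ngrams", CountVectorizer(ngram_range=(1, 2), lowercase=True)),
    # prediction step (cf. TODO 2: logistic regression)
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print("accuracy:", pipeline.score(X_test, y_test))
```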
# Features
- N-grams
  - (paper: Computationally recognizing wordplay in jokes)
- structural patterns (heuristics sketched after this list)
  - (paper: Centric Features)
  - Question -> Answer
  - Oneliner
  - Wordplay
  - Dialog
  - Knock-Knock Jokes
- embeddings
- length
- punctuation
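
The non-embedding features (length, punctuation, simple structural patterns such as Question -> Answer, Oneliner, Knock-Knock) can be prototyped as plain Python heuristics; the rules below are rough assumptions, not the logic in structual_pattern.ipynb:

```python
import re
import string

def extract_features(text: str) -> dict:
    """Toy extractor for length, punctuation and a few structural-pattern features."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive sentencization
    return {
        "n_chars": len(text),
        "n_tokens": len(text.split()),
        "n_punct": sum(ch in string.punctuation for ch in text),
        # Question -> Answer: first sentence ends with '?' and more text follows
        "question_answer": len(sentences) > 1 and sentences[0].endswith("?"),
        # Oneliner: the whole joke fits into a single sentence
        "oneliner": len(sentences) == 1,
        # Knock-Knock jokes almost always open with this fixed phrase
        "knock_knock": text.lower().startswith("knock knock"),
    }

print(extract_features("Knock knock. Who's there? Boo. Boo who? Don't cry!"))
```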
# TODOS:
- 1. __Feature extraction and correlation__
- 1a: Structual pattern
- maybe 2 people?
- look at structual_pattern.ipynb
- data: structual pattern -> Sentencization
- Paper Research on strucutal patterns
- 1b: extented length analysis
- small task
- look at token_normal.ipynb
- distribution normalization
- Paper Research on strucutal patterns
- ggf. Bericht Inhaltsverzeichnis,...
- 1c: N-Grams
- data: csv files
- 1d: Embeddings
- data: csv files
- word2vec? (paper: Centric Features)
- 2. Machine Learning / logistic regression
- (coming soon...)
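
For TODOs 1d and 2, one option in line with the word2vec idea from the Centric Features reference is to average word vectors per joke and feed the result to the planned logistic regression; this gensim-based sketch is an assumption about the eventual approach, shown on two toy token lists:

```python
import numpy as np
from gensim.models import Word2Vec

# token lists as produced by the preprocessing step (see token_normal.ipynb)
tokenized_jokes = [
    ["chicken", "cross", "road", "get", "side"],
    ["knock", "knock", "boo", "cry"],
]

# tiny word2vec model; a real run would train on all ~208K jokes
w2v = Word2Vec(sentences=tokenized_jokes, vector_size=100, window=5, min_count=1)

def joke_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of all in-vocabulary tokens (zeros if none)."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

# embedding feature matrix, ready for the logistic regression from TODO 2
X = np.vstack([joke_vector(toks) for toks in tokenized_jokes])
print(X.shape)  # (2, 100) for these toy examples
```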
# Topic presentations (graded) (5 min)
## Focus:

Diffs for the remaining changed files (including token_normal.ipynb, 553 lines, new file mode 100644) are suppressed because one or more lines are too long.