ANLP_WS24_CA1

Master MDS

Use NLP techniques you learned so far (N-gram models, basic machine learning, no neural nets) to analyse texts or to build an application. Document your approach.

Data Source

https://github.com/taivop/joke-dataset/tree/master

| File              | Jokes | Tokens |
|-------------------|-------|--------|
| reddit_jokes.json | 195K  | 7.40M  |
| stupidstuff.json  | 3.77K | 396K   |
| wocka.json        | 10.0K | 1.11M  |
| TOTAL             | 208K  | 8.91M  |
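
A minimal sketch of how the three JSON files could be loaded into one common shape. It assumes each file holds a JSON list of objects, that the joke text sits in `body` (optionally preceded by `title`), and that the label is called `score` or `rating`; these field names are assumptions and should be checked against the dataset. `normalise_record` and `load_jokes` are hypothetical helper names.

```python
import json
from pathlib import Path


def normalise_record(record: dict) -> dict:
    """Map one raw record to a common {"text", "score"} shape."""
    # Assumed field names: "title"/"body" for the text, "score"/"rating" for the label.
    title = record.get("title") or ""
    body = record.get("body") or ""
    text = f"{title} {body}".strip()
    score = record.get("score", record.get("rating"))  # None if the file has no label
    return {"text": text, "score": score}


def load_jokes(path: str) -> list:
    """Read one JSON file (assumed to be a list of objects) and normalise every record."""
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    return [normalise_record(record) for record in records]
```

Files without a label field simply get `score = None`, so they can still be used for the structural and n-gram analysis.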

*.csv Files

  • created with: token_normal.ipynb
  • done:
    • Tokenization
    • Stopword removal
    • Lowercasing
    • Filtering to tokens that consist solely of alphabetic characters
    • Lemmatization
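
The steps above can be sketched as follows. This is not the code from token_normal.ipynb, only an illustration under stated assumptions: the stopword set is a tiny made-up sample, lowercasing is applied before stopword removal so that capitalised stopwords are caught, and lemmatization is passed in as a function (for example a wrapper around a library lemmatizer) and defaults to the identity.

```python
import re
from typing import Callable

# Tiny illustrative stopword set (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "did", "do", "why"}


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda token: token) -> list:
    tokens = re.findall(r"\w+|[^\w\s]", text)           # tokenization: words and punctuation marks
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t.isalpha()]          # keep purely alphabetic tokens
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [lemmatize(t) for t in tokens]                # lemmatization (identity by default)


print(preprocess("Why did the chicken cross the road? 2 reasons!"))
# -> ['chicken', 'cross', 'road', 'reasons']
```

Note that dropping punctuation and stopwords removes exactly the signals that the structural, length, and punctuation features rely on, so those features are better computed on the raw text.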

Process

  • Tokenization
  • (Normalization)
  • Feature Extraction
  • Feature analysis
  • Prediction

Features

  • N-grams

    • (paper: Computationally recognizing wordplay in jokes)
  • structural patterns

    • (paper: Centric Features)

    • Questions -> Answer

    • Oneliner

    • Wordplay

    • Dialog

    • Knock-Knock Jokes

  • embeddings

  • length

  • punctuation
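
A rough sketch of how the length, punctuation, and some of the structural features could be computed from the raw joke text. The regular expressions are simple heuristics chosen for illustration, not rules taken from the cited papers, and `surface_features` is a hypothetical name. Wordplay is not covered, since it cannot be detected by surface patterns alone.

```python
import re


def surface_features(joke: str) -> dict:
    words = joke.split()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", joke.strip()) if s]
    return {
        "n_chars": len(joke),
        "n_words": len(words),
        "n_sentences": len(sentences),
        "n_question_marks": joke.count("?"),
        "n_exclamation_marks": joke.count("!"),
        # Knock-knock jokes: "knock knock", "Knock, knock", "knock-knock", ...
        "is_knock_knock": bool(re.search(r"\bknock[\s,\-]+knock\b", joke, re.IGNORECASE)),
        # Question -> answer: first sentence is a question and something follows it.
        "is_question_answer": len(sentences) >= 2 and sentences[0].endswith("?"),
        "is_one_liner": len(sentences) == 1,
        # Dialog: at least two quoted utterances.
        "is_dialog": len(re.findall(r'"[^"]+"', joke)) >= 2,
    }


features = surface_features("Why did the chicken cross the road? To get to the other side.")
print(features["is_question_answer"], features["n_words"])
# -> True 13
```

The categories are not mutually exclusive here (a knock-knock joke is also a dialog), which keeps each flag usable as an independent feature.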

TODOS:

    1. Feature extraction and correlation
    • 1a: Structural pattern

      • maybe 2 people?
      • look at structual_pattern.ipynb
      • data: structural pattern -> Sentencization
      • Paper research on structural patterns
    • 1b: extended length analysis

      • small task
      • look at token_normal.ipynb
      • distribution normalization
      • Paper research on structural patterns
      • if applicable: report table of contents, ...
    • 1c: N-Grams

      • data: csv files
    • 1d: Embeddings

      • data: csv files
      • word2vec? (paper: Centric Features)
    2. Machine Learning / logistic regression
    • (coming soon...)

Topic presentations (graded) (5 min)

Focus:

  • What is your overall idea?
  • What kind of data will you use and where do you get the data?
  • Your approach, which techniques will you use?
  • Expected results.

Open Questions:

  • How to evaluate similarity?
  • How to find structural patterns? (like phrases, setups, punchlines, or wordplay)

Possible Hypotheses:

  • Similar jokes share more common n-grams, phrases, or structural patterns.
  • Basic features like word frequency, sentiment, length, or punctuation can predict joke ratings.
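
One possible way to make "share more common n-grams" measurable is the Jaccard overlap of the two jokes' n-gram sets. The sketch below assumes jokes are already tokenized; the function names are hypothetical, and Jaccard is only one candidate measure.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(tokens_a, tokens_b, n=2):
    """Jaccard overlap |A & B| / |A | B| of two n-gram sets, in [0, 1]."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a and not b:
        return 0.0  # both jokes shorter than n tokens: define overlap as 0
    return len(a & b) / len(a | b)


joke_a = "why did the chicken cross the road".split()
joke_b = "why did the duck cross the road".split()
print(ngram_jaccard(joke_a, joke_b, n=2))
# -> 0.5
```

Here the two jokes share 4 of 8 distinct bigrams, giving 0.5.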

Other ideas:

  • The length of a joke (measured in words or characters) is inversely correlated with its average rating, since brevity may enhance comedic impact.

  • Highly rated jokes follow certain structural patterns (e.g., setups, punchlines, or wordplay).
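
The length hypothesis could be checked with a correlation coefficient between joke length and rating. Below is a minimal sketch of the Pearson coefficient in plain Python; the numbers in the usage example are made-up toy values that only demonstrate the call and are not results from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values (not real data): ratings fall linearly as length grows.
lengths = [5, 10, 20, 40]
ratings = [4.5, 4.0, 3.0, 1.0]
r = pearson(lengths, ratings)  # close to -1 for this perfectly linear toy example
```

A negative `r` on the real data would be consistent with the hypothesis. Since joke lengths are likely skewed, a rank-based coefficient such as Spearman's may be the more robust choice.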

Possible Tools / Techniques

  • Text Preprocessing: Tokenization, stopword removal, stemming/lemmatization.

  • Feature Extraction: Bag-of-Words, n-grams (bigram/trigram analysis), TF-IDF.

  • Similarity: Cosine similarity for finding similar jokes.
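
A small sketch of TF-IDF weighting followed by cosine similarity, written in plain Python so that each step is visible. It assumes jokes are given as token lists and uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`; other IDF variants are equally valid, and a library implementation would normally replace this in practice.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists -> list of sparse {term: weight} dicts."""
    n_docs = len(docs)
    # Document frequency: in how many jokes does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty joke is similar to nothing
    return dot / (norm_u * norm_v)


docs = [
    ["chicken", "cross", "road"],
    ["chicken", "cross", "street"],
    ["banker", "lost", "interest"],
]
vecs = tfidf_vectors(docs)
# The first two jokes share terms, so they score higher than the first and third.
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```

To find the most similar jokes, one would compute `cosine` between a query joke and all others and sort by the score; for the full dataset a sparse-matrix implementation is needed for speed.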

Research

Humor Detection

Humor Detection: A Transformer Gets the Last Laugh

Computationally recognizing wordplay in jokes (N-grams)

Word2Vec combined with K-NN Human Centric Features