Text Tokenizer & Stemmer

Analyze text with tokenization, stemming, and word statistics

How to Use
  • Input Text: Enter or paste any text you want to analyze in the text area.
  • Options:
    • Convert to lowercase: Converts all tokens to lowercase for case-insensitive analysis.
    • Remove punctuation: Strips punctuation marks from tokens.
    • Remove numbers: Filters out tokens that are purely numeric.
    • Apply Porter Stemmer: Reduces each token to its stem by stripping common suffixes (e.g., "running" becomes "run").
  • Tokenization: The text is split into individual words (tokens) based on whitespace and punctuation.
  • Statistics: View the total token count, the number of unique tokens, the average token length, and the token frequency distribution.
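
The options and statistics described above can be sketched in a few lines of Python. This is a minimal illustration and not the tool's actual implementation: the function names (tokenize, token_stats), the parameter names, and the regular expression are assumptions made for the example. In particular, "Remove punctuation" is modeled here by dropping tokens that consist only of punctuation.

```python
import re
from collections import Counter


def tokenize(text, lowercase=True, remove_punctuation=True, remove_numbers=False):
    """Split text on whitespace and punctuation, then apply the options.

    Hypothetical helper: the option names mirror the checkboxes described
    above, but the exact behavior of the real tool is assumed.
    """
    # Runs of word characters become tokens; each punctuation mark is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_punctuation:
        # Keep only tokens that contain at least one word character.
        tokens = [t for t in tokens if re.search(r"\w", t)]
    if remove_numbers:
        # Drop tokens made up entirely of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


def token_stats(tokens):
    """Return total, unique, average length, and frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        "total_tokens": total,
        "unique_tokens": len(counts),
        "average_length": sum(len(t) for t in tokens) / total if total else 0.0,
        "frequencies": counts.most_common(),  # most frequent tokens first
    }


sample = "The cat saw 2 cats. The END!"
stats = token_stats(tokenize(sample, remove_numbers=True))
print(stats)
```

Because this pattern treats an apostrophe as punctuation, a contraction such as "don't" is split into several tokens; a different pattern would be needed to keep it whole.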

About the Porter Stemmer

The Porter stemming algorithm (or 'Porter stemmer') removes the more common morphological and inflectional endings from English words, reducing each word to a stem:

  • running → run
  • flies → fli
  • happily → happili
  • connection → connect
  • nationalizing → nation

Note: A stem is not always a valid English word (e.g., "fli", "happili"). The purpose of stemming is to group related word forms together for analysis, not to produce dictionary words.
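
To show how the suffix rules operate, here is a partial sketch covering only the first two steps of the Porter algorithm (steps 1a and 1b, which handle plurals and the "-ed" / "-ing" endings). It is not a complete stemmer and not the code behind this tool. The function names are hypothetical, and the input is assumed to be a single lowercase word. Examples such as "happili", "connect", and "nation" depend on later steps that are omitted here.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions (Porter's 'm')."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(word):
    return (len(word) >= 2 and word[-1] == word[-2]
            and is_consonant(word, len(word) - 1))


def ends_cvc(word):
    """Consonant-vowel-consonant ending, where the last letter is not w, x, or y."""
    n = len(word)
    return (n >= 3 and is_consonant(word, n - 3)
            and not is_consonant(word, n - 2)
            and is_consonant(word, n - 1)
            and word[-1] not in "wxy")


def step_1a(word):
    """Plural endings: sses -> ss, ies -> i, ss -> ss, s -> (removed)."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def step_1b(word):
    """Endings eed, ed, ing, followed by a small cleanup of the stem."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if not has_vowel(stem):
                return word  # e.g. "bled" and "sing" are left alone
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"  # e.g. "sized" -> "size"
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                return stem[:-1]  # e.g. "running" -> "run"
            if measure(stem) == 1 and ends_cvc(stem):
                return stem + "e"  # e.g. "hoping" -> "hope"
            return stem
    return word


def stem_step1(word):
    """Apply steps 1a and 1b only (hypothetical helper, not a full stemmer)."""
    return step_1b(step_1a(word))


for w in ["running", "flies", "caresses", "hoping", "agreed"]:
    print(w, "->", stem_step1(w))
```

For real analysis, a complete and tested implementation of the algorithm is a better choice than extending this sketch.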