| prepare_and_tokenize | Split Text on Spaces |
| prepare_text | Prepare Text for Tokenization |
| remove_control_characters | Remove Non-Character Characters |
| remove_diacritics | Remove Diacritical Marks on Characters |
| remove_replacement_characters | Remove the Unicode Replacement Character |
| space_cjk | Add Spaces Around CJK Ideographs |
| space_punctuation | Add Spaces Around Punctuation |
| squish_whitespace | Remove Extra Whitespace |
| tokenize_space | Break Text at Spaces |
| validate_utf8 | Clean Up Text to UTF-8 |
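
To make the one-line descriptions above more concrete, here is a minimal Python sketch of how a few of these steps might compose. It is an illustration under stated assumptions, not the package's implementation: the function names are borrowed from the table, but their bodies, signatures, and the order in which they are applied are hypothetical, and the real functions may handle more cases (for example control characters and CJK ideographs, which are omitted here).

```python
import re
import unicodedata


def remove_replacement_characters(text: str) -> str:
    # Drop U+FFFD, the Unicode replacement character.
    return text.replace("\ufffd", "")


def remove_diacritics(text: str) -> str:
    # Decompose characters (NFD), then drop combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def space_punctuation(text: str) -> str:
    # Surround every punctuation character (Unicode category P*) with spaces.
    return "".join(
        f" {ch} " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def squish_whitespace(text: str) -> str:
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def tokenize_space(text: str) -> list[str]:
    # Break text at single spaces; an empty string yields no tokens.
    return text.split(" ") if text else []


def prepare_and_tokenize(text: str) -> list[str]:
    # Assumed ordering: clean, add spacing, squish, then split.
    text = remove_replacement_characters(text)
    text = remove_diacritics(text)
    text = space_punctuation(text)
    text = squish_whitespace(text)
    return tokenize_space(text)


if __name__ == "__main__":
    print(prepare_and_tokenize("Café,  déjà\ufffd vu!"))
```

In this sketch, whitespace is squished after punctuation is spaced out, so the extra spaces introduced around punctuation never produce empty tokens.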