Recently I've combined various functions which I've been using in other projects (e.g. my personal PKM toolchain) and published them as a new library https://thi.ng/text-analysis for better re-use:
- customizable, composable & extensible tokenization (transducer based)
- ngram generation
- Porter-stemming & stopword removal
- vocabulary (bi-directional index) creation
- dense & sparse multi-hot vector encoding/decoding
- histograms (incl. sorted versions)
- tf-idf (term frequency & inverse document frequency), multiple strategies
- k-means clustering (with k-means++ initialization & customizable distance metrics)
- similarity/distance functions (dense & sparse versions)
- central terms extraction
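
To make a few of these concepts concrete, here is a minimal plain-TypeScript sketch of tokenization, a bi-directional vocabulary, and dense/sparse multi-hot encoding. It deliberately does not use the library: all names (`tokenize`, `buildVocab`, `encodeDense`, `encodeSparse`, `decodeSparse`) are hypothetical and only illustrate the ideas, so check the project readme for the actual API.

```typescript
// Illustration only: hypothetical helpers, NOT the thi.ng/text-analysis API.

// Bi-directional index: token -> ID and ID -> token
interface Vocab {
  toID: Map<string, number>;
  toToken: string[];
}

// Naive tokenizer: lowercase, split on non-alphanumerics, drop empty tokens
const tokenize = (src: string): string[] =>
  src.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);

// Assign consecutive IDs to tokens in order of first appearance
const buildVocab = (docs: string[][]): Vocab => {
  const toID = new Map<string, number>();
  const toToken: string[] = [];
  for (const doc of docs) {
    for (const t of doc) {
      if (!toID.has(t)) {
        toID.set(t, toToken.length);
        toToken.push(t);
      }
    }
  }
  return { toID, toToken };
};

// Dense multi-hot vector: one slot per vocab entry, 1 if the token is present
const encodeDense = (vocab: Vocab, doc: string[]): number[] => {
  const vec = new Array<number>(vocab.toToken.length).fill(0);
  for (const t of doc) {
    const id = vocab.toID.get(t);
    if (id !== undefined) vec[id] = 1;
  }
  return vec;
};

// Sparse multi-hot vector: sorted, unique IDs of the tokens present
const encodeSparse = (vocab: Vocab, doc: string[]): number[] =>
  [...new Set(
    doc
      .map((t) => vocab.toID.get(t))
      .filter((id): id is number => id !== undefined)
  )].sort((a, b) => a - b);

// Reverse lookup: IDs back to tokens
const decodeSparse = (vocab: Vocab, ids: number[]): string[] =>
  ids.map((id) => vocab.toToken[id]);

const docs = ["WebGL shader DSL", "shader AST & codegen", "k-means clustering"]
  .map(tokenize);
const vocab = buildVocab(docs);
const dense = encodeDense(vocab, docs[1]);   // [0, 1, 0, 1, 1, 0, 0, 0]
const sparse = encodeSparse(vocab, docs[1]); // [1, 3, 4]
```

The sparse form only stores the IDs of present terms, which keeps vectors small when the vocabulary is large and each document uses only a few of its terms.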
The attached code example (also in the project readme) uses this package to create a clustering of all ~210 #ThingUmbrella packages, based on their assigned tags/keywords...
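
For readers unfamiliar with the idea, the following is a rough, self-contained sketch of clustering multi-hot tag vectors with plain k-means. It is not the attached example and does not use the library's API: the function names are hypothetical, the tag vectors are made-up toy data, and the initial centroids are picked by hand (instead of k-means++) to keep the result deterministic.

```typescript
// Illustration only: hypothetical helpers and toy data, NOT the library's API.
type Vec = number[];

// Squared Euclidean distance (one of many possible distance metrics)
const distSq = (a: Vec, b: Vec): number =>
  a.reduce((acc, x, i) => acc + (x - b[i]) ** 2, 0);

// Component-wise mean of a non-empty list of vectors
const mean = (vecs: Vec[]): Vec =>
  vecs[0].map((_, i) => vecs.reduce((acc, v) => acc + v[i], 0) / vecs.length);

// Plain k-means. `initial` holds the sample indices used as starting centroids
// (assumption: chosen by the caller, no k-means++ here).
// Returns the cluster ID assigned to each sample.
const kmeans = (samples: Vec[], initial: number[], maxIter = 32): number[] => {
  let centroids = initial.map((i) => samples[i].slice());
  let assignments = new Array<number>(samples.length).fill(-1);
  for (let iter = 0; iter < maxIter; iter++) {
    // Assignment step: nearest centroid for each sample
    const next = samples.map((s) => {
      let best = 0;
      for (let c = 1; c < centroids.length; c++) {
        if (distSq(s, centroids[c]) < distSq(s, centroids[best])) best = c;
      }
      return best;
    });
    // Converged once no sample changes cluster
    if (next.every((c, i) => c === assignments[i])) break;
    assignments = next;
    // Update step: move each centroid to the mean of its members
    centroids = centroids.map((old, c) => {
      const members = samples.filter((_, i) => assignments[i] === c);
      return members.length ? mean(members) : old;
    });
  }
  return assignments;
};

// Toy multi-hot vectors over the tags [webgl, shader, audio, dsp]
const tagVectors: Vec[] = [
  [1, 1, 0, 0], // hypothetical package A
  [0, 1, 0, 0], // hypothetical package B
  [0, 0, 1, 1], // hypothetical package C
  [0, 0, 1, 0], // hypothetical package D
];
const clusters = kmeans(tagVectors, [0, 2]); // [0, 0, 1, 1]
```

Packages sharing tags end up in the same cluster because their vectors are closer to each other than to vectors with disjoint tags.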
The library is not intended to be a full-blown NLP solution, but I keep running into these functions/concepts quite often, and maybe you'll find them useful too...