If you're working with text data and building a Natural Language Processing (NLP) model, one important task you will be confronted with is extracting features from the text.
This usually means transforming the text into a numerical format that machine learning algorithms can understand.
You might be familiar with using word counts or TF-IDF to extract features from text data.
Scikit-learn provides classes out-of-the-box for both of these to transform text data samples into a feature matrix that can be fed to machine learning algorithms.
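To make "feature matrix" concrete, here is a minimal sketch of word-count features using only the standard library. The two documents are made up, and this is a simplified stand-in for the idea behind CountVectorizer, not how Scikit-learn implements it (the real class also handles tokenization details and returns a sparse matrix).

```python
from collections import Counter

# Two made-up documents, for illustration only.
docs = ["the red hat", "the dog and the hat"]

# Vocabulary: every distinct word, sorted, one column per word.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one count per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['and', 'dog', 'hat', 'red', 'the']
print(matrix)  # [[0, 0, 1, 1, 1], [1, 1, 1, 0, 2]]
```

Each row is one text sample and each column is one feature, which is the same layout that the Scikit-learn vectorizer classes produce.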
These transformers can be easily incorporated into a Scikit-learn pipeline, which might look something like this:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    text_clf = Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', SGDClassifier()),
    ])

Substitute any other chosen algorithm for SGDClassifier.
Then you can just pass the list of raw text documents, along with corresponding training labels, to the pipeline.
Using word counts as features can often be useful in training a machine learning model, but sometimes word counts do not provide enough information.
With certain problems like translation from one language to another, information about the meaning of words and their context is helpful.
In this post we will look at representing text documents with word vectors, which are vectors of numbers that represent the meaning of a word.
Then we will write a custom Scikit-learn transformer class for the word vector features - similar to TfidfVectorizer or CountVectorizer - which can be plugged into a pipeline.
Word vectors, or word embeddings, are vectors of numbers that provide information about the meaning of a word, as well as its context.
You can get the semantic similarity of two words by comparing their word vectors.
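A common way to compare two vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors rather than real embeddings, and the helper name cosine_similarity is my own, so treat it as an illustration of the idea only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors", for illustration only.
hat = [0.9, 0.1, 0.3]
cap = [0.8, 0.2, 0.4]
dog = [0.1, 0.9, 0.2]

sim_hat_cap = cosine_similarity(hat, cap)
sim_hat_dog = cosine_similarity(hat, dog)

# Vectors pointing in similar directions score closer to 1.
print(sim_hat_cap > sim_hat_dog)  # True
```

Spacy's Token, Doc and Span objects also have a similarity() method, so in practice you rarely need to write this by hand.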
Even if you're not familiar with word vectors, you may have heard of a couple of popular algorithms for obtaining vectors for various words, such as word2vec and GloVe.
There are pre-trained models that you can download to access word vectors, and if you are using Spacy, GloVe vectors are made available in the larger models.
With Spacy you can easily get vectors of words, as well as sentences.
I'm assuming at least some familiarity with Spacy in this post.
Note that a small Spacy model (one ending in sm, such as en_core_web_sm) will not have built-in vectors, so you will need a larger model to use them.

    python -m spacy download en_core_web_lg

Vectors are made available in Spacy Token, Doc and Span objects.

    import spacy

    nlp = spacy.load("en_core_web_lg")

Each vector is a one-dimensional NumPy array of 32-bit floats.
For example, take the word hat.
First you could check if the word has a vector.

    hat = nlp("hat")
    hat.has_vector

    True

If it has a vector, you can retrieve it from the vector attribute.

    hat.vector

    array([ 0.25681 , -0.35552 , -0.18733 , -0.16592 , -0.68094 ,
            0.60802 ,  0.16501 ,  0.17907 ,  0.17855 ,  1.2894  ,
           -0.46481 , -0.22667 ,  0.035198, -0.45087 ,  0.71845 ,
           ...
           -0.94376 , -0.10265 ,  0.4415  ,  0.37775 , -0.24274 ,
           -0.42695 ,  0.18544 ,  0.16044 , -0.63395 , -0.074032,
           -0.038969,  0.30813 , -0.069243,  0.13493 ,  0.37585 ],
          dtype=float32)

The full vector has 300 dimensions.

    hat.vector.shape

    (300,)

A vector for a sentence is similar and has the same shape.

    sent = nlp("He wore a red shirt with gray pants.").vector
    sent

    array([ 8.16512257e-02, -8.81854445e-02, -1.21790558e-01, -7.65599236e-02,
            8.34635943e-02,  5.33326678e-02, -1.63263362e-02, -3.44585180e-01,
           -1.27936229e-01,  1.74646115e+00, -1.88558996e-01,  6.99177757e-02,
           ...
            1.32453769e-01, -1.40210897e-01, -5.84307760e-02,  3.93804982e-02,
            1.89477772e-01, -1.38648778e-01, -1.60174996e-01,  2.84267794e-02,
            2.16686666e-01,  1.05772227e-01,  1.48718446e-01,  9.56766680e-02],
          dtype=float32)

The sentence vector has the same shape as the word vector because it is the average of the vectors of the words in the sentence.
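The averaging can be sketched with plain NumPy. The vectors below are made up and only four-dimensional, standing in for the 300-dimensional ones from the model, so this shows the arithmetic rather than Spacy's internals.

```python
import numpy as np

# Made-up 4-dimensional vectors, one row per word in the "sentence".
word_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
], dtype=np.float32)

# Average over the word axis: one value per dimension.
sentence_vector = word_vectors.mean(axis=0)

print(sentence_vector)        # [2. 2. 1. 2.]
print(sentence_vector.shape)  # (4,)
```

However many words go in, the result has as many dimensions as a single word vector.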
Ultimately the goal is to turn a list of text samples into a feature matrix, where there is a row for each text sample, and a column for each feature.
A word vector is initially a one-dimensional array of shape (300,), but each row of the feature matrix needs to be a two-dimensional array of shape (1, 300).
So the first step is to reshape the word vector.

    sent = sent.reshape(1,-1)
    sent.shape

    (1, 300)

Then the rows are all concatenated together to create the full feature matrix.
Say you have a corpus like the one below, with the goal of classifying the sentences as either talking about some item of clothing or not.

    corpus = [
        "I went outside yesterday and picked some flowers.",
        "She wore a red hat with a dress to the party.",
        "I think he was wearing athletic clothes and sneakers of some sort.",
        "I took my dog for a walk at the park.",
        "I found a hot pink hat on sale over the weekend.",
        "The dog has brown fur with white spots."
    ]
    labels = [0,1,1,0,1,0]

In just a few steps, we can create the feature matrix from these data samples: get the vector for each document, reshape it into a row, and join the rows into one array with numpy.concatenate.

    import numpy as np

    data_list = [nlp(doc).vector.reshape(1,-1) for doc in corpus]
    data = np.concatenate(data_list)
    data

    array([[ 0.08162278,  0.15696655, -0.32472467, ...,  0.01618122,
             0.01810523,  0.2212121 ],
           [ 0.1315948 , -0.0819225 , -0.08803785, ..., -0.01854067,
             0.09653309,  0.1096675 ],
           [ 0.07139538,  0.09503647, -0.14292692, ...,  0.01818248,
             0.10714766,  0.07863422],
           [ 0.14246173,  0.18372808, -0.18847175, ...,  0.174818  ,
            -0.07943812,  0.20305632],
           [ 0.08148216,  0.09574908, -0.13909541, ..., -0.10646044,
            -0.03817916,  0.22827934],
           [-0.09829144, -0.02671766, -0.07231866, ..., -0.00786566,
             0.00078378,  0.12298879]], dtype=float32)

At this point the data is in the correct input format for many Scikit-learn algorithms.
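As a side note, numpy.stack should give the same matrix without the explicit reshape. The sketch below checks that on made-up vectors (short ones, standing in for nlp(doc).vector), so it runs without a Spacy model.

```python
import numpy as np

# Three made-up 1-D vectors standing in for nlp(doc).vector.
vectors = [np.arange(5, dtype=np.float32) + i for i in range(3)]

# Approach used in this post: reshape each vector to (1, n), then concatenate.
by_concat = np.concatenate([v.reshape(1, -1) for v in vectors])

# Alternative: stack the 1-D vectors along a new first axis.
by_stack = np.stack(vectors)

print(by_concat.shape)                      # (3, 5)
print(np.array_equal(by_concat, by_stack))  # True
```

Either way you end up with one row per text sample and one column per vector dimension.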
Now we will package up this code into a reusable class that can be used in a pipeline.
We can write a custom transformer class to be used just like Scikit-learn's TfidfVectorizer or CountVectorizer that we saw earlier.

    import numpy as np
    import spacy
    from sklearn.base import BaseEstimator, TransformerMixin

    class WordVectorTransformer(TransformerMixin,BaseEstimator):
        def __init__(self, model="en_core_web_lg"):
            self.model = model

        def fit(self,X,y=None):
            return self

        def transform(self,X):
            nlp = spacy.load(self.model)
            return np.concatenate([nlp(doc).vector.reshape(1,-1) for doc in X])

The class needs a fit and a transform method. This transformer initializes the Spacy model that we're using, and then I have pretty much copied and pasted the code from earlier to create the feature matrix from the raw text samples.
One important thing to keep in mind is that the parameters that you pass to __init__ should not be altered or changed.
In this case, I just passed the name of the Spacy model to be used, en_core_web_lg, and then the model is actually loaded in the transform method.
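At first (before reading the docs more thoroughly...) I tried to load the model in __init__ and assigned that to self.model, but that won't work if you are using GridSearchCV with multiprocessing.
This has to do with cloning: GridSearchCV clones the estimator, and a clone is a fresh instance built from the constructor parameters, so __init__ should only store the parameters it is given.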
The Scikit-learn developer documentation has coding guidelines for building components properly.
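To see what cloning does, here is a small sketch with a hypothetical transformer (LazyLookupTransformer and its table_name parameter are made up for this illustration, and it uses a plain function in place of a Spacy model). It stores its parameter untouched in __init__ and builds its resource in fit.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone

class LazyLookupTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: __init__ stores parameters, nothing else."""

    def __init__(self, table_name="lengths"):
        # Store the argument unchanged, under the same attribute name.
        self.table_name = table_name

    def fit(self, X, y=None):
        # Build the resource here; the trailing underscore marks it as
        # state created during fit rather than a constructor parameter.
        self.table_ = {"lengths": len}[self.table_name]
        return self

    def transform(self, X):
        return [[self.table_(doc)] for doc in X]

original = LazyLookupTransformer(table_name="lengths").fit(["a", "bb"])
copy = clone(original)

print(copy.get_params())                     # {'table_name': 'lengths'}
print(hasattr(copy, "table_"))               # False: fitted state is not cloned
print(original.transform(["hat", "shirt"]))  # [[3], [5]]
```

The clone carries over only what get_params() reports, which is why anything expensive or hard to copy is better created outside __init__.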
So now the transformer is ready to use.

    transformer = WordVectorTransformer()
    transformer.fit_transform(corpus)

The transformer can also be used in a pipeline.

    text_clf = Pipeline([
        ('vect', WordVectorTransformer()),
        ('clf', SGDClassifier()),
    ])

This is the exact same pipeline as we saw earlier in the post, only with WordVectorTransformer instead of TfidfVectorizer.
Then call .fit() with the data samples and labels, and otherwise go about your training and testing process as usual.

    text_clf.fit(corpus,labels)

Let me know if you have questions or comments!
Write them below or feel free to reach out on Twitter @LVNGD.