AI FAQ: One-hot encoding
What is one-hot encoding?

This is part of an FAQ for AI concepts.

Turning words into digits

One of the simplest techniques for turning words into vectors is a method known as one-hot encoding. With one-hot encoding, each word is encoded as a binary sequence N digits long, where N is the number of unique words being encoded.

For example, the sentence “This is my boom stick” would have five encodings, laid out as in the following table:

Word    this  is  my  boom  stick
this     1    0   0    0     0
is       0    1   0    0     0
my       0    0   1    0     0
boom     0    0   0    1     0
stick    0    0   0    0     1

The vectors for the above encodings would be <1, 0, 0, 0, 0>, <0, 1, 0, 0, 0>, and so on. If the sentence were changed to “This is my boom boom stick”, the number of encodings and vectors would not change, since the repeated “boom” maps to the same vector.
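As a concrete illustration, here is a minimal Python sketch of this scheme (the function name and structure are my own, not from any particular library):

```python
def one_hot_encode(sentence):
    """Map each unique word in a sentence to a one-hot vector."""
    words = sentence.lower().split()
    # Build the vocabulary in first-seen order; duplicate words collapse.
    vocab = list(dict.fromkeys(words))
    n = len(vocab)
    # Each word's vector has a single 1 at its vocabulary index.
    return {word: [1 if i == j else 0 for j in range(n)]
            for i, word in enumerate(vocab)}

encodings = one_hot_encode("This is my boom boom stick")
for word, vector in encodings.items():
    print(word, vector)
# Prints five vectors: the repeated "boom" yields a single encoding.
```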

There are a number of issues with this method. For instance, in a large dataset the vectors take up significant space while being filled mostly with zeroes (N-1 of them per vector, to be precise). These N-dimensional vectors end up forming a very sparse matrix, i.e., one that is mostly zeroes. This is not an efficient way to store the information, and it makes any computations done with the matrix equally inefficient.
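One common way to sidestep the storage cost, sketched below under the assumption of a fixed vocabulary, is to store only the index of the single 1 rather than the full N-digit vector:

```python
# Instead of N digits per word, store just the position of the 1.
vocab = ["this", "is", "my", "boom", "stick"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def dense_vector(word):
    """Expand a stored index back into the full one-hot vector on demand."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(word_to_index["boom"])   # 3 -- one integer instead of N digits
print(dense_vector("boom"))    # [0, 0, 0, 1, 0]
```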

The biggest issue, though, is that one-hot encoding does not provide any information on how words relate to one another. Some words, like medicine and doctor, are more closely related than, say, octopus and chocolate. A good encoding method would provide some way to determine whether words are related, and in what way.
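This limitation is easy to demonstrate: the dot product, a standard similarity measure, is zero between any two distinct one-hot vectors, so every pair of different words looks equally unrelated. A quick sketch (the example vocabulary is illustrative):

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

medicine = [1, 0, 0, 0]
doctor   = [0, 1, 0, 0]
octopus  = [0, 0, 1, 0]

# Every distinct pair scores 0: related and unrelated words
# are indistinguishable under this encoding.
print(dot(medicine, doctor))   # 0
print(dot(medicine, octopus))  # 0
```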

For more methods of turning words into vectors, see the entry on word vectors.