Embedding of categorical variables for Deep Learning model — explained

Piotr Janusz
Published in The Startup · Nov 9, 2019 · 5 min read


Deep Learning models handle numerical data very well; after all, a Neural Network (NN) is a function approximator. This means that no matter how bumpy the function is, there is a NN which solves f(x) for any given x (or a close approximation). Things get a bit more complicated when you try to feed in non-numerical data, a.k.a. categories. Quick examples are colors, job names, or even geographical places. These are strings, which a NN simply can’t handle; it’s like meat for a vegetarian, NO CAN DO.

This is when embedding comes in handy.

Let’s begin with a short reminder on what the Categorical type is in the context of other data types.

In the world of data we generally divide data into the types below:

  • Numerical
  • Categorical
  • other types related to the type of the task the data is associated with e.g. Time Series, Text, Genomic, etc.

Numerical data are, obviously, numerical values for which mathematical operations make sense, and we can further split this type into Continuous and Discrete. Examples of Continuous variables are temperature, height, width, etc., so anything that can fall anywhere within a range, like a width of 1.7 or an average of 32.5. Discrete values are whole numbers like 4, 22, or 41, such as the number of oranges in a basket or students in a class (you can’t have 22.5 students).

Categorical values represent data without numerical relation or meaning, like hair color, type of game, player position, etc. Note that a categorical variable can be represented by numbers, e.g. if colors are converted into numbers like 1-green, 2-yellow, but this does not change the fact that there is no numerical relation, so it should still be considered categorical.
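As a quick illustration, here is a minimal sketch using pandas (the column values are made up for the example):

```python
import pandas as pd

# Hypothetical column of colors; the values are purely illustrative.
colors = pd.Series(["green", "yellow", "green", "red"], dtype="category")

# .cat.codes assigns an integer to each category (alphabetical by default),
# but these integers carry no numerical meaning: "red" is not "greater" than "green".
print(colors.cat.codes.tolist())  # [0, 2, 0, 1]
```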

To be precise, I have to mention that there is also a mix of Numerical and Categorical, a variable type called Ordinal. Examples of it are all sorts of ratings, like a movie rating, which is still a category but has an ordering attached to it (e.g. 5 stars is the highest and it’s more than 4 stars).

Why Embedding?

There are mainly two reasons. The first one I’ve mentioned already: categories are problematic for a NN, as they are not numbers and there is no simple numerical relation between them. The second one is related to the first. I just wrote that there is no numerical relation between categories, meaning you can’t say one is greater or smaller than the other, but this does not mean there is no relation at all. In fact, between categorical classes there are multidimensional relations which we would want to “reveal” by embedding them. Let me explain what I mean right below.

How does it work exactly?

Category Embedding is a process of creating a single vector representation for each individual category. This means that, assuming we have a type called Furniture with 3 classes, Chair, Table, and Wardrobe, we assign a vector of length (n) to each of the categories. Let’s assume the following vectors:

  1. Chair == [0.3, -0.7, 0.1]
  2. Table == [0.2, -0.5, 0.3]
  3. Wardrobe == [0.5, -0.1, 0.6]

Our (n) here is equal to 3, hence 3 numbers in each vector. One interpretation of the above vectors may be: the first number in each vector (0.3, 0.2 and 0.5) represents its place in the house, where Chair and Table are usually placed closer to each other than to Wardrobe; the second (-0.7, -0.5 and -0.1) may represent the material the furniture is made of, meaning more Chairs and Tables in this dataset are made of plastic vs. Wardrobes made of wood; and the last one (0.1, 0.3 and 0.6) represents the size of the furniture type, and no wonder a Wardrobe is bigger than a Chair or Table. Note that I shouldn’t even use the word ‘bigger’ but rather ‘further away in the vector space’. Also, embeddings are learned based on the data you have in the dataset, meaning if you change the data your embeddings will most likely change, e.g. more wooden furniture, or even all of them wooden, means the second value in the vector will be pretty similar for all classes.
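To make “further away in the vector space” concrete, here is a minimal NumPy sketch that measures Euclidean distances between the made-up vectors above:

```python
import numpy as np

# The illustrative embedding vectors from the text.
emb = {
    "Chair":    np.array([0.3, -0.7, 0.1]),
    "Table":    np.array([0.2, -0.5, 0.3]),
    "Wardrobe": np.array([0.5, -0.1, 0.6]),
}

# Euclidean distance between each pair of classes.
print(np.linalg.norm(emb["Chair"] - emb["Table"]))     # ~0.30 -> closest pair
print(np.linalg.norm(emb["Chair"] - emb["Wardrobe"]))  # ~0.81
print(np.linalg.norm(emb["Table"] - emb["Wardrobe"]))  # ~0.58
```

Chair and Table end up closest to each other, which matches the intuition above.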

How should you choose the value of (n), i.e. how long should your embedding vector be? A common rule of thumb is the number of unique classes divided by two, but no greater than 50:

min(unique_classes/2, 50) 

This means that if you had a single-country sports dataset with 21 unique club names in it, then the vector length representing a unique club should be 10; BUT if you have a world sports dataset with 1021 unique club names in it, the vector length should be no more than 50.
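As a minimal sketch of this rule (the helper name embedding_size is my own, and I use integer division so that 21 classes give 10, as in the example):

```python
def embedding_size(unique_classes: int) -> int:
    """Rule of thumb: half the number of unique classes, capped at 50."""
    return min(unique_classes // 2, 50)

print(embedding_size(21))    # 10 -> single-country dataset
print(embedding_size(1021))  # 50 -> world dataset, capped at 50
```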

How does a NN learn embeddings?

The concept is as follows: you take your categorical variable, translate it into inputs by one-hot encoding (technically integers are used, because it’s faster) and let the NN train the same way it normally would, by backpropagation. This way it learns weights which become the embedded representation of each individual category.
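A minimal NumPy sketch of why the one-hot view and the integer-lookup view are equivalent (the weight values here are made up):

```python
import numpy as np

# Hypothetical 3 x 3 embedding weight matrix: one row per Furniture class.
W = np.array([[0.3, -0.7, 0.1],   # Chair
              [0.2, -0.5, 0.3],   # Table
              [0.5, -0.1, 0.6]])  # Wardrobe

one_hot_table = np.array([0, 1, 0])  # one-hot encoding of "Table"

# Multiplying a one-hot vector by W just selects one row of W...
print(one_hot_table @ W)             # [ 0.2 -0.5  0.3]

# ...which is why frameworks skip the multiplication and index the row directly.
print(W[1])                          # [ 0.2 -0.5  0.3]
```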

Deep Learning categorical embedding

In the picture above, blue, green and yellow represent our categories; note that if you had more than three unique classes in Furniture, then you would have more inputs. The green neurons connected to the categories represent the embedding, and here we decided on a 3-dimensional vector, but with more classes, let’s say 22, according to our equation min(unique_classes/2, 50) we would get min(22/2, 50), that’s 11 green neurons right after the embedding input. Hopefully you can tell from the picture above that the embedding vector is actually a weight vector associated with each individual class input.

A visualization of a Keras model for the “Furniture dataset problem” may look like this:
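In code, a minimal Keras sketch of such a model might look like the following. This is my own illustration under the assumptions of this post (a single Furniture input with 3 classes, a 3-dimensional embedding as in the example above, and a made-up single regression target), not the exact model from the figure:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Integer-encoded Furniture category: 0 = Chair, 1 = Table, 2 = Wardrobe.
furniture_in = layers.Input(shape=(1,), dtype="int32", name="furniture")

# Embedding layer: 3 classes, each mapped to a 3-dimensional vector,
# matching the illustrative vectors earlier in the post.
x = layers.Embedding(input_dim=3, output_dim=3, name="furniture_embedding")(furniture_in)
x = layers.Flatten()(x)

# A small dense head; the target (e.g. a price) is a made-up example.
x = layers.Dense(16, activation="relu")(x)
out = layers.Dense(1, name="target")(x)

model = Model(inputs=furniture_in, outputs=out)
model.compile(optimizer="adam", loss="mse")
model.summary()

# After training, the learned embedding vectors are simply the layer's weights:
# model.get_layer("furniture_embedding").get_weights()[0]  # shape (3, 3)
```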

Hope this helps you understand Categorical Embedding for Deep Learning.

Thanks for reading.

Feel free to connect with me on LinkedIn.
