Encoding Categorical Features

Categorical features such as gender or location can't be fed directly into most models, because most models require numerical inputs

  • This means that we have to represent our categorical features numerically
  • One of the most popular encoding methods is called one-hot encoding

One-hot encoding

  • The process of one-hot encoding involves transforming a categorical value into a vector representation using 1s and 0s
  • For example, if you have three classes: "USA", "MEX", and "CAN", you can represent "USA" as [1, 0, 0]
    • the first position, a 1, represents a "yes" for USA,
    • the second position, a 0, represents a "no" for MEX,
    • the third position, a 0, represents a "no" for CAN
  • Notice that a one-hot vector will always have one value of 1, and the rest of the values will be 0.
  • If you have a column of values such as ["USA", "USA", "MEX", "CAN", "USA"], you could represent it as three new columns
    • "USA": [1, 1, 0, 0, 1]
    • "MEX": [0, 0, 1, 0, 0]
    • "CAN": [0, 0, 0, 1, 0]
  • Note: You would retain the same amount of information by dropping any one of these columns, because three mutually exclusive categories leave only two degrees of freedom: the dropped column is 1 exactly where the other two are both 0
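
The bullets above can be sketched by hand with NumPy. This is only an illustration of the idea, not how pandas implements it; the names `categories`, `index`, `one_hot`, and `recovered_can` are made up for this example, and the category order is fixed by hand to match the text.

```python
import numpy as np

# Example column from the text above
values = ["USA", "USA", "MEX", "CAN", "USA"]

# Fix the category order explicitly so each position has a known meaning
categories = ["USA", "MEX", "CAN"]
index = {cat: i for i, cat in enumerate(categories)}

# Start from all zeros, then set a single 1 per row
one_hot = np.zeros((len(values), len(categories)), dtype=int)
for row, value in enumerate(values):
    one_hot[row, index[value]] = 1

# Every row has exactly one 1, so each row sums to 1
row_sums = one_hot.sum(axis=1)

# Drop the "CAN" column and rebuild it from the remaining two:
# it is 1 exactly where both "USA" and "MEX" are 0
reduced = one_hot[:, :-1]
recovered_can = 1 - reduced.sum(axis=1)
```

Reading `one_hot` column by column gives the three new columns listed above, and `recovered_can` matches the dropped column, which is why no information is lost.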
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create some data

In [2]:
data = {
    'State': ['FL', 'FL', 'MA', 'MA', 'MA'],
    'Education': ['HS', 'BA', 'BA', 'MA', 'PHD'],
    'Salary': [25000, 70000, 63000, 105000, 130000]
}

df = pd.DataFrame(data)
df
Out[2]:
  State Education  Salary
0    FL        HS   25000
1    FL        BA   70000
2    MA        BA   63000
3    MA        MA  105000
4    MA       PHD  130000

As is, you can't train a model with this data because you have text fields

  • You need to preprocess this data by encoding the categorical features "State" and "Education"
In [3]:
pd.get_dummies(df)
Out[3]:
   Salary  State_FL  State_MA  Education_BA  Education_HS  Education_MA  Education_PHD
0   25000         1         0             0             1             0              0
1   70000         1         0             1             0             0              0
2   63000         0         1             1             0             0              0
3  105000         0         1             0             0             1              0
4  130000         0         1             0             0             0              1
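
Depending on your pandas version, get_dummies may return True/False columns instead of the 1/0 values shown above (recent releases default to a boolean dtype). A minimal sketch, assuming you want integer indicators, is to pass dtype=int. The columns argument is also shown, which limits the encoding to the columns you name. The variable names `encoded` and `state_only` are just for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['FL', 'FL', 'MA', 'MA', 'MA'],
    'Education': ['HS', 'BA', 'BA', 'MA', 'PHD'],
    'Salary': [25000, 70000, 63000, 105000, 130000]
})

# dtype=int requests 0/1 indicator columns regardless of the pandas default
encoded = pd.get_dummies(df, dtype=int)

# columns=[...] encodes only the listed columns and leaves the others as they are
state_only = pd.get_dummies(df, columns=['State'], dtype=int)
```

Here `state_only` still contains the original text column Education, so it would need further encoding before training.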

You can also drop the first category of each categorical column (here State_FL and Education_BA) to reduce the size of your data while retaining the same amount of information

In [4]:
pd.get_dummies(df, drop_first=True)
Out[4]:
   Salary  State_MA  Education_HS  Education_MA  Education_PHD
0   25000         0             1             0              0
1   70000         0             0             0              0
2   63000         1             0             0              0
3  105000         1             0             1              0
4  130000         1             0             0              1

Now all your data is represented numerically and can be input into a model for training
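
As a rough illustration of that last step, the sketch below fits an ordinary least-squares model to the encoded data using NumPy. The choice of Salary as the target and of least squares as the model are assumptions made for this example; the notebook itself does not specify a model.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'State': ['FL', 'FL', 'MA', 'MA', 'MA'],
    'Education': ['HS', 'BA', 'BA', 'MA', 'PHD'],
    'Salary': [25000, 70000, 63000, 105000, 130000]
})

# Encode the text columns; dtype=int gives 0/1 indicators
encoded = pd.get_dummies(df, drop_first=True, dtype=int)

# Assumption for this sketch: Salary is the target, everything else is a feature
y = encoded['Salary'].to_numpy(dtype=float)
X = encoded.drop(columns='Salary').to_numpy(dtype=float)

# Add an intercept column of ones, then solve ordinary least squares
X = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ coef
```

With only five rows and five parameters the fit reproduces the salaries exactly, so this shows the mechanics of passing encoded data to a model and is not a meaningful model in itself.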