Encoding Categorical Features¶

Categorical features such as gender or location can't be fed directly into most models, because most models require numerical inputs¶

This means that we have to represent our categorical features numerically.
One of the most popular encoding methods is called one-hot encoding.

One-hot encoding¶

The process of one-hot encoding involves transforming a categorical value into a vector representation using 1s and 0s.
For example, if you have three classes: "USA", "MEX", and "CAN", you can represent "USA" as [1, 0, 0]:
- the first position, a 1, represents a "yes" for USA,
- the second position, a 0, represents a "no" for MEX,
- the third position, a 0, represents a "no" for CAN.
Notice that a one-hot vector will always have one value of 1, and the rest of the values will be 0.
If you have a column of values such as ["USA", "USA", "MEX", "CAN", "USA"], you could represent it as three new columns:
- "USA": [1, 1, 0, 0, 1]
- "MEX": [0, 0, 1, 0, 0]
- "CAN": [0, 0, 0, 1, 0]
Note: You would retain the same amount of information after dropping any one of these columns, because three mutually exclusive classes leave only two degrees of freedom: the dropped column is 1 exactly when the other two are both 0.
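
The column representation above can be sketched by hand before reaching for a library helper. This is a minimal illustration, not the only way to build the columns; the variable names (`countries`, `classes`, `one_hot`, `reduced`, `recovered_can`) are made up for this example, and dropping "CAN" is an arbitrary choice.

```python
import pandas as pd

countries = pd.Series(["USA", "USA", "MEX", "CAN", "USA"])
classes = ["USA", "MEX", "CAN"]

# One indicator column per class: 1 where the value matches, 0 otherwise.
one_hot = pd.DataFrame({c: (countries == c).astype(int) for c in classes})
print(one_hot)

# Drop one column ("CAN" here, an arbitrary choice for illustration).
reduced = one_hot.drop(columns="CAN")

# The dropped column can be rebuilt: a row is "CAN" exactly when
# every remaining indicator in that row is 0.
recovered_can = 1 - reduced.sum(axis=1)
print(recovered_can.tolist())
```

Because the rebuilt column matches the one that was dropped, no information is lost by removing it.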

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Create some data¶

data = {
    'State': ['FL', 'FL', 'MA', 'MA', 'MA'],
    'Education': ['HS', 'BA', 'BA', 'MA', 'PHD'],
    'Salary': [25000, 70000, 63000, 105000, 130000]
}

df = pd.DataFrame(data)
df

As is, you can't train a model with this data because you have text fields¶

You need to preprocess this data by encoding the categorical features "State" and "Education"

pd.get_dummies(df)
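
Depending on your pandas version, the indicator columns may be displayed as True/False rather than 1/0 (pandas 2.0 and later return boolean columns by default). The sketch below names the columns to encode explicitly and requests integer output; it rebuilds the same DataFrame so that it runs on its own, and the name `encoded` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["FL", "FL", "MA", "MA", "MA"],
    "Education": ["HS", "BA", "BA", "MA", "PHD"],
    "Salary": [25000, 70000, 63000, 105000, 130000],
})

# Encode only the listed columns and emit 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=["State", "Education"], dtype=int)

# Untouched columns come first, followed by one indicator per category.
print(encoded.columns.tolist())
print(encoded)
```

Listing the columns explicitly also guards against accidentally encoding a text column that was never meant to be treated as categorical.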

You can also drop the first category of each categorical column to reduce the size of your data while retaining the same amount of information¶

pd.get_dummies(df, drop_first=True)

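Output of pd.get_dummies(df), with every category kept (with drop_first=True, the State_FL and Education_BA columns would be absent):

	Salary	State_FL	State_MA	Education_BA	Education_HS	Education_MA	Education_PHD
0	25000	1	0	0	1	0	0
1	70000	1	0	1	0	0	0
2	63000	0	1	1	0	0	0
3	105000	0	1	0	0	1	0
4	130000	0	1	0	0	0	1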
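
To see exactly which columns `drop_first=True` removes, one option is to compare the two encodings side by side. This is a small sketch that rebuilds the same DataFrame so it runs on its own; `full`, `reduced`, and `dropped` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["FL", "FL", "MA", "MA", "MA"],
    "Education": ["HS", "BA", "BA", "MA", "PHD"],
    "Salary": [25000, 70000, 63000, 105000, 130000],
})

full = pd.get_dummies(df, dtype=int)
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

# Columns present in the full encoding but missing from the reduced one.
dropped = [c for c in full.columns if c not in reduced.columns]
print(dropped)
print(reduced)
```

The dropped category in each column is the first one in sorted order, so a row with all zeros in the remaining `State_*` or `Education_*` columns belongs to that category.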

Now all your data is represented numerically and can be input into a model for training¶
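
As a quick sanity check, you can confirm that every column is numeric and then split the encoded frame into a feature matrix and a target vector. The sketch below assumes, purely for illustration, that "Salary" is the value to predict; the names `encoded`, `all_numeric`, `X`, and `y` are not from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["FL", "FL", "MA", "MA", "MA"],
    "Education": ["HS", "BA", "BA", "MA", "PHD"],
    "Salary": [25000, 70000, 63000, 105000, 130000],
})

encoded = pd.get_dummies(df, drop_first=True, dtype=int)

# True only if no text (object) columns remain after encoding.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in encoded.dtypes)
print(all_numeric)

# Assumption for illustration: "Salary" is the target, the rest are features.
X = encoded.drop(columns="Salary").to_numpy(dtype=float)
y = encoded["Salary"].to_numpy(dtype=float)
print(X.shape, y.shape)
```

Arrays shaped like `X` and `y` are the form that most modeling libraries expect as training input.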