Label Encoding

There are different types of categorical variables:

  • Nominal
    • Nominal categorical variables have no real ordering to them, for example: Florida, Washington, Nebraska
  • Ordinal
    • Ordinal categorical variables do have an intrinsic ordering, for example: Small, Medium, Large
    • This could be represented as [1, 2, 3].
      • However, it is somewhat arbitrary that we chose to space the categories exactly 1 apart.
    • It can be unclear how to encode a sequence such as education level: ["elementary", "middle", "high", "bachelors", "masters", "phd"]
      • We could represent this as [1, 2, 3, 4, 5, 6]
      • But it may not make sense to have the difference between elementary and high (3-1=2) equal to the difference between high and masters (5-3=2)
      • The above is a case where label encoding may not be the best idea, even though the category is ordinal
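
The spacing concern above can be made concrete with a small sketch. It encodes the education levels twice: once with evenly spaced codes and once with a hand-picked, uneven spacing. The uneven values (rough cumulative years of schooling) and the names `even_codes`, `uneven_codes`, and `education_df` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

education = pd.Series(["elementary", "high", "masters", "phd", "middle", "bachelors"])

# Evenly spaced codes: every level is exactly 1 apart.
even_codes = {"elementary": 1, "middle": 2, "high": 3,
              "bachelors": 4, "masters": 5, "phd": 6}

# Hypothetical, hand-picked spacing (rough cumulative years of schooling).
uneven_codes = {"elementary": 5, "middle": 8, "high": 12,
                "bachelors": 16, "masters": 18, "phd": 22}

education_df = pd.DataFrame({
    "education": education,
    "even": education.map(even_codes),
    "uneven": education.map(uneven_codes),
})
print(education_df)
```

With the uneven codes, the gap from elementary to high (12-5=7) no longer equals the gap from high to masters (18-12=6). Which spacing is appropriate depends on the problem.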
In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
In [2]:
clothes_data = {
    'item':['shirt', 'shirt', 'shorts', 'shoes', 'shoes', 'pants'],
    'size':['small', 'small', 'medium', 'large', 'large', 'medium'],
    'cost':[10, 12, 20, 30, 25, 18]
}

clothes_df = pd.DataFrame(clothes_data)
clothes_df
Out[2]:
     item    size  cost
0   shirt   small    10
1   shirt   small    12
2  shorts  medium    20
3   shoes   large    30
4   shoes   large    25
5   pants  medium    18

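Here, it could be useful to label encode the "size" column (as opposed to one-hot encoding it), since small is closer to medium than it is to large. Note that LabelEncoder assigns its integer codes in sorted (alphabetical) order of the labels, so the codes come out as large=0, medium=1, small=2: the ordering is preserved here, but reversed, and only because of how the words happen to sort.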

In [3]:
le = LabelEncoder()
le.fit(clothes_df['size'])
print(le.classes_)
['large' 'medium' 'small']
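
The classes printed above are in alphabetical order rather than size order. The following minimal sketch, separate from the notebook's own cells, fits an encoder on labels listed in size order to check that the input order makes no difference. The names `demo_sizes`, `demo_le`, and `demo_codes` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical input listed in size order, not alphabetical order.
demo_sizes = ['small', 'medium', 'large']

demo_le = LabelEncoder()
demo_le.fit(demo_sizes)

# classes_ holds the sorted unique labels; a label's code is its index here.
print(demo_le.classes_)

# Encode the labels to integers.
demo_codes = demo_le.transform(demo_sizes)
print(demo_codes)

# inverse_transform maps the integer codes back to the original labels.
print(demo_le.inverse_transform(demo_codes))
```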
In [4]:
clothes_df['size_LE'] = le.transform(clothes_df['size'])
clothes_df
Out[4]:
     item    size  cost  size_LE
0   shirt   small    10        2
1   shirt   small    12        2
2  shorts  medium    20        1
3   shoes   large    30        0
4   shoes   large    25        0
5   pants  medium    18        1
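
If the codes should instead increase with size (small < medium < large), one option is to state the order explicitly. The sketch below uses an ordered pandas `Categorical` rather than `LabelEncoder`; it is an alternative technique, not what this notebook does, and the names `sizes`, `size_order`, `size_cat`, and `size_codes` are hypothetical.

```python
import pandas as pd

sizes = pd.Series(['small', 'small', 'medium', 'large', 'large', 'medium'])

# Explicit order, smallest to largest (chosen by hand).
size_order = ['small', 'medium', 'large']

# An ordered categorical remembers the order we supplied.
size_cat = pd.Categorical(sizes, categories=size_order, ordered=True)

# .codes gives each value's position in size_order.
size_codes = pd.Series(size_cat.codes, name='size_ordered')
print(size_codes.tolist())
```

scikit-learn's `OrdinalEncoder` also accepts an explicit `categories` list and may be a better fit inside a preprocessing pipeline.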
In [5]:
clothes_df.drop('size', axis=1)
Out[5]:
     item  cost  size_LE
0   shirt    10        2
1   shirt    12        2
2  shorts    20        1
3   shoes    30        0
4   shoes    25        0
5   pants    18        1