Ensembling

What is ensembling?

  • Model ensembling is using a combination of models in order to increase performance beyond any of the individual models

  • An ensemble is an type of supervised algorithm where we give a weight to the output of each model under consideration. In practice, we tune the weights in order to maximize performance on a validation set. We also set a restriction on the weights making them sum to one.

Why use ensembling?

  • Different types of models have different strengths and weakness

  • Some models are better at seperating certain types of noise from signal

  • Some models are better at avoiding overfitting/underfitting to certain types of data

  • Having multiple models will likely reduce the variance in our predictions, reducing overfitting.

Example of ensembling two models with the Boston housing data set

  • Ensemble of a support vector machine and a random forest model
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
%matplotlib inline

Load data and split into training, validation, and test sets

In [2]:
np.random.seed(1)

X, y = load_boston(return_X_y=True)

# 50% train, 25% val, 25% test
num_train = int(round(len(X)*0.6, 0))
num_val = int(round(len(X)*0.2, 0))

random_indices = np.random.permutation(len(X))

train_indices = random_indices[:num_train]
val_indices = random_indices[num_train:(num_train+num_val)]
test_indices = random_indices[(num_train+num_val):]

X_train = X[train_indices]
y_train = y[train_indices]
X_val = X[val_indices]
y_val = y[val_indices]
X_test = X[test_indices]
y_test = y[test_indices]

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_val shape:', X_val.shape)
print('y_val shape:', y_val.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
X_train shape: (304, 13)
y_train shape: (304,)
X_val shape: (101, 13)
y_val shape: (101,)
X_test shape: (101, 13)
y_test shape: (101,)
In [3]:
pd.DataFrame(X_train).head()
Out[3]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.04932 33.0 2.18 0.0 0.472 6.849 70.3 3.1827 7.0 222.0 18.4 396.90 7.53
1 0.02543 55.0 3.78 0.0 0.484 6.696 56.4 5.7321 5.0 370.0 17.6 396.90 7.18
2 0.22927 0.0 6.91 0.0 0.448 6.030 85.5 5.6894 3.0 233.0 17.9 392.74 18.80
3 0.05789 12.5 6.07 0.0 0.409 5.878 21.4 6.4980 4.0 345.0 18.9 396.21 8.10
4 3.67822 0.0 18.10 0.0 0.770 5.362 96.2 2.1036 24.0 666.0 20.2 380.79 10.19

Scale inputs

  • based on the mean and standard deviation of training data
In [4]:
X_train = (X_train - X_train.mean(axis=0))/X_train.std(axis=0)
X_val = (X_val - X_train.mean(axis=0))/X_train.std(axis=0)
X_test = (X_test - X_train.mean(axis=0))/X_train.std(axis=0)
In [5]:
pd.DataFrame(X_train).head()
Out[5]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 -0.414926 0.988782 -1.318844 -0.265372 -0.733757 0.747471 0.066200 -0.267409 -0.312360 -1.124586 -0.072537 0.459241 -0.685655
1 -0.417600 1.971393 -1.085768 -0.265372 -0.624920 0.529175 -0.430490 0.993954 -0.538448 -0.252488 -0.451911 0.459241 -0.734783
2 -0.394787 -0.485135 -0.629813 -0.265372 -0.951432 -0.421059 0.609343 0.972827 -0.764537 -1.059768 -0.309645 0.416836 0.896268
3 -0.413967 0.073167 -0.752178 -0.265372 -1.305153 -0.637929 -1.681147 1.372897 -0.651493 -0.399802 0.164572 0.452207 -0.605646
4 -0.008794 -0.485135 1.000262 -0.265372 1.969037 -1.374146 0.991687 -0.801313 1.609396 1.491708 0.781055 0.295026 -0.312282

Define & train models

In [6]:
svm = SVR(kernel='linear', epsilon=8, C=0.0003)
rf = RandomForestRegressor(n_estimators=64, max_depth=2, max_features=3)

svm.fit(X_train, y_train)
rf.fit(X_train, y_train)

print('done training')
done training

Evaluate performance

In [7]:
y_vpred_svm = svm.predict(X_val)
y_vpred_rf = rf.predict(X_val)

mse = []
for i in np.arange(0, 1.1, 0.1):
    # start with 100% svm
    ensemble_pred = y_vpred_svm*(1-i) + y_vpred_rf*i
    mse.append(mean_squared_error(y_val, ensemble_pred))
    
plt.plot(np.arange(0, 1.1, 0.1), mse)
#plt.ylim(0)
plt.xlabel('Weight for random forest model')
plt.ylabel('MSE')
plt.show()

We can see from the plot above and the table below that the MSE is minimized on the validation set when we give the the random forest model's prediction a weight of 0.6 and the support vector machine's prediction a weight of 0.4

In [8]:
ensemble_scores = pd.DataFrame({
    'RF Weight':np.arange(0, 1.1, 0.1),
    'MSE':mse
})
ensemble_scores
Out[8]:
MSE RF Weight
0 96.374916 0.0
1 86.958227 0.1
2 79.163296 0.2
3 72.990123 0.3
4 68.438706 0.4
5 65.509048 0.5
6 64.201146 0.6
7 64.515002 0.7
8 66.450615 0.8
9 70.007986 0.9
10 75.187114 1.0

Evaluate on test set

In [9]:
rf_weight = ensemble_scores['RF Weight'].iloc[ensemble_scores['MSE'].idxmin()]

y_pred_svm = svm.predict(X_test)
y_pred_rf = rf.predict(X_test)

y_pred_ensemble = (1-rf_weight)*y_pred_svm + rf_weight*y_pred_rf

print('SVM MSE on test set:', mean_squared_error(y_test, y_pred_svm))
print('RF MSE on test set:', mean_squared_error(y_test, y_pred_rf))
print('Ensemble MSE on test set:', mean_squared_error(y_test, y_pred_ensemble))
SVM MSE on test set: 95.3476855616
RF MSE on test set: 86.999334848
Ensemble MSE on test set: 70.746944016