Ensembling¶

What is ensembling?¶

Model ensembling means combining several models in order to increase performance beyond that of any individual model.
An ensemble is a type of supervised learning algorithm in which we assign a weight to the output of each model under consideration. In practice, we tune the weights to maximize performance on a validation set, subject to the constraint that the weights sum to one.
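
As a minimal sketch of this definition, the snippet below combines the predictions of any number of models using weights that sum to one, and picks the weights with a coarse grid search on validation MSE. The function names `weighted_ensemble` and `tune_weights`, the grid step, and the synthetic data are assumptions made for illustration; they are not part of the original notebook.

```python
import itertools
import numpy as np

def weighted_ensemble(preds, weights):
    """Weighted average of model predictions.

    preds: array of shape (n_models, n_samples)
    weights: array of shape (n_models,), expected to sum to one
    """
    preds = np.asarray(preds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to one")
    return weights @ preds

def tune_weights(preds, y_true, steps=10):
    """Grid search over weights of the form k/steps that sum to one."""
    preds = np.asarray(preds, dtype=float)
    n_models = preds.shape[0]
    best_w, best_mse = None, np.inf
    for combo in itertools.product(range(steps + 1), repeat=n_models):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to one
        w = np.array(combo) / steps
        mse = np.mean((y_true - weighted_ensemble(preds, w)) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse

# Synthetic validation data (hypothetical): three noisy "models" of one target
rng = np.random.default_rng(0)
y_val_demo = rng.normal(size=200)
preds_demo = np.stack(
    [y_val_demo + rng.normal(scale=s, size=200) for s in (0.5, 0.8, 1.0)]
)

best_w, best_mse = tune_weights(preds_demo, y_val_demo)
print("best weights:", best_w, "validation MSE:", best_mse)
```

Because the grid includes the corners where a single model gets all the weight, the selected ensemble can never score worse on the validation set than the best individual model. The cost of the search grows quickly with the number of models, so this approach only suits small ensembles.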

Why use ensembling?¶

Different types of models have different strengths and weaknesses.
Some models are better at separating certain types of noise from signal.
Some models are better at avoiding overfitting or underfitting on certain types of data.
Combining multiple models will likely reduce the variance of our predictions, which reduces overfitting.
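
The variance claim can be illustrated with a small simulation. The sketch below assumes two hypothetical predictors whose errors are independent, zero-mean, and of equal variance. Real models usually have correlated errors, so the reduction seen in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = np.zeros(n)

# Two hypothetical predictors with independent, zero-mean errors (std = 1)
pred_a = truth + rng.normal(scale=1.0, size=n)
pred_b = truth + rng.normal(scale=1.0, size=n)

# Equal-weight ensemble of the two predictors
pred_avg = 0.5 * pred_a + 0.5 * pred_b

var_a = pred_a.var()
var_b = pred_b.var()
var_avg = pred_avg.var()
print("variance of A:", var_a)
print("variance of B:", var_b)
print("variance of average:", var_avg)  # roughly half of the individual variances
```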

Example of ensembling two models with the Boston housing data set¶

Ensemble of a support vector machine and a random forest model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this notebook needs an older version
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
%matplotlib inline

Load data and split into training, validation, and test sets¶

np.random.seed(1)

X, y = load_boston(return_X_y=True)

# 60% train, 20% val, 20% test
num_train = int(round(len(X)*0.6, 0))
num_val = int(round(len(X)*0.2, 0))

random_indices = np.random.permutation(len(X))

train_indices = random_indices[:num_train]
val_indices = random_indices[num_train:(num_train+num_val)]
test_indices = random_indices[(num_train+num_val):]

X_train = X[train_indices]
y_train = y[train_indices]
X_val = X[val_indices]
y_val = y[val_indices]
X_test = X[test_indices]
y_test = y[test_indices]

print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_val shape:', X_val.shape)
print('y_val shape:', y_val.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)

X_train shape: (304, 13)
y_train shape: (304,)
X_val shape: (101, 13)
y_val shape: (101,)
X_test shape: (101, 13)
y_test shape: (101,)

pd.DataFrame(X_train).head()

Scale inputs¶

Scaling is based on the mean and standard deviation of the training data.

# Compute the statistics once, before X_train is overwritten
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)
X_train = (X_train - train_mean)/train_std
X_val = (X_val - train_mean)/train_std
X_test = (X_test - train_mean)/train_std

pd.DataFrame(X_train).head()
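
A minimal sketch of the same scaling step on synthetic data, with the training mean and standard deviation held in their own variables so that every split is transformed with identical statistics. The helper name `standardize_with_train_stats` and the random data are assumptions for illustration, and the helper assumes no feature column is constant.

```python
import numpy as np

def standardize_with_train_stats(X_train, *others):
    """Scale all arrays using the mean and std of X_train only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # assumes no constant columns (std > 0)
    scaled = [(X_train - mean) / std]
    scaled += [(X - mean) / std for X in others]
    return scaled

rng = np.random.default_rng(0)
Xtr_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
Xva_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

Xtr_scaled, Xva_scaled = standardize_with_train_stats(Xtr_demo, Xva_demo)
print(Xtr_scaled.mean(axis=0).round(3), Xtr_scaled.std(axis=0).round(3))
print(Xva_scaled.mean(axis=0).round(3), Xva_scaled.std(axis=0).round(3))
```

The training split ends up with mean 0 and standard deviation 1 by construction. The validation split is only approximately centered, which is expected because it is scaled with statistics it did not contribute to.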

Define & train models¶

svm = SVR(kernel='linear', epsilon=8, C=0.0003)
rf = RandomForestRegressor(n_estimators=64, max_depth=2, max_features=3)

svm.fit(X_train, y_train)
rf.fit(X_train, y_train)

print('done training')

done training

Evaluate performance¶

y_vpred_svm = svm.predict(X_val)
y_vpred_rf = rf.predict(X_val)

rf_weights = np.arange(0, 1.1, 0.1)

mse = []
for w in rf_weights:
    # w is the random forest weight; w = 0 means 100% svm
    ensemble_pred = y_vpred_svm*(1-w) + y_vpred_rf*w
    mse.append(mean_squared_error(y_val, ensemble_pred))

plt.plot(rf_weights, mse)
#plt.ylim(0)
plt.xlabel('Weight for random forest model')
plt.ylabel('MSE')
plt.show()

We can see from the plot above and the table below that the MSE on the validation set is minimized when we give the random forest model's prediction a weight of 0.6 and the support vector machine's prediction a weight of 0.4¶

ensemble_scores = pd.DataFrame({
    'RF Weight':np.arange(0, 1.1, 0.1),
    'MSE':mse
})
ensemble_scores
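
For two models, the grid search can be cross-checked against a closed-form solution. Writing the ensemble as a + w(b - a), where a and b are the two models' validation predictions and y is the target, the MSE is a quadratic in w that is minimized at w* = sum((y - a)(b - a)) / sum((b - a)^2), clipped to the interval [0, 1]. The sketch below applies this to synthetic predictions; the function names and data are hypothetical and not part of the original notebook.

```python
import numpy as np

def optimal_second_weight(y_true, pred_a, pred_b):
    """Weight w in [0, 1] on pred_b minimizing the MSE of (1 - w)*pred_a + w*pred_b."""
    diff = pred_b - pred_a
    denom = np.sum(diff ** 2)
    if denom == 0:
        return 0.0  # identical predictions: every weight gives the same MSE
    w = np.sum((y_true - pred_a) * diff) / denom
    return float(np.clip(w, 0.0, 1.0))

def ensemble_mse(y_true, pred_a, pred_b, w):
    """MSE of the two-model ensemble with weight w on pred_b."""
    return np.mean((y_true - ((1 - w) * pred_a + w * pred_b)) ** 2)

# Synthetic target and two noisy predictors (hypothetical data)
rng = np.random.default_rng(0)
y_demo = rng.normal(size=500)
pred_a_demo = y_demo + rng.normal(scale=1.0, size=500)
pred_b_demo = y_demo + rng.normal(scale=0.8, size=500)

w_star = optimal_second_weight(y_demo, pred_a_demo, pred_b_demo)
grid = np.linspace(0, 1, 11)
grid_mse = [ensemble_mse(y_demo, pred_a_demo, pred_b_demo, w) for w in grid]
best_grid_w = grid[int(np.argmin(grid_mse))]
print("closed-form weight:", w_star)
print("best grid weight:", best_grid_w)
```

Because the MSE is a symmetric quadratic in w, the best grid point is the one nearest to the closed-form weight, so the two answers should differ by at most half a grid step.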

Evaluate on test set¶

# idxmin returns an index label, so look the weight up with .loc rather than .iloc
rf_weight = ensemble_scores.loc[ensemble_scores['MSE'].idxmin(), 'RF Weight']

y_pred_svm = svm.predict(X_test)
y_pred_rf = rf.predict(X_test)

y_pred_ensemble = (1-rf_weight)*y_pred_svm + rf_weight*y_pred_rf

print('SVM MSE on test set:', mean_squared_error(y_test, y_pred_svm))
print('RF MSE on test set:', mean_squared_error(y_test, y_pred_rf))
print('Ensemble MSE on test set:', mean_squared_error(y_test, y_pred_ensemble))

SVM MSE on test set: 95.3476855616
RF MSE on test set: 86.999334848
Ensemble MSE on test set: 70.746944016

Output of pd.DataFrame(X_train).head() before scaling:

	0	1	2	4	5	6	7	8	9	10	11	12
0	0.04932	33.0	2.18	0.472	6.849	70.3	3.1827	7.0	222.0	18.4	396.90	7.53
1	0.02543	55.0	3.78	0.484	6.696	56.4	5.7321	5.0	370.0	17.6	396.90	7.18
2	0.22927	0.0	6.91	0.448	6.030	85.5	5.6894	3.0	233.0	17.9	392.74	18.80
3	0.05789	12.5	6.07	0.409	5.878	21.4	6.4980	4.0	345.0	18.9	396.21	8.10
4	3.67822	0.0	18.10	0.770	5.362	96.2	2.1036	24.0	666.0	20.2	380.79	10.19

Output of pd.DataFrame(X_train).head() after scaling:

	0	1	2	3	4	5	6	7	8	9	10	11	12
0	-0.414926	0.988782	-1.318844	-0.265372	-0.733757	0.747471	0.066200	-0.267409	-0.312360	-1.124586	-0.072537	0.459241	-0.685655
1	-0.417600	1.971393	-1.085768	-0.265372	-0.624920	0.529175	-0.430490	0.993954	-0.538448	-0.252488	-0.451911	0.459241	-0.734783
2	-0.394787	-0.485135	-0.629813	-0.265372	-0.951432	-0.421059	0.609343	0.972827	-0.764537	-1.059768	-0.309645	0.416836	0.896268
3	-0.413967	0.073167	-0.752178	-0.265372	-1.305153	-0.637929	-1.681147	1.372897	-0.651493	-0.399802	0.164572	0.452207	-0.605646
4	-0.008794	-0.485135	1.000262	-0.265372	1.969037	-1.374146	0.991687	-0.801313	1.609396	1.491708	0.781055	0.295026	-0.312282

Output of ensemble_scores (validation MSE for each random forest weight):

	MSE	RF Weight
0	96.374916	0.0
1	86.958227	0.1
2	79.163296	0.2
3	72.990123	0.3
4	68.438706	0.4
5	65.509048	0.5
6	64.201146	0.6
7	64.515002	0.7
8	66.450615	0.8
9	70.007986	0.9
10	75.187114	1.0