Appendix: Gradient Boosting Example

KLDIV · Apr 6, 2021

In an earlier post, we discussed the IEBAE framework using a simple regression example. However, we did not tune the hyperparameters of the gradient boosting model there. In this post, we expand on the gradient boosting example using the standard KFold cross-validation approach.

We import all necessary packages (mltools can be downloaded from https://github.com/kailaix/mltools), hold out a test set, and set up 10-fold cross-validation:

from mltools import *

# Hold out 20% of the rows as a test set; the remaining 80% is used for cross-validation.
train_idx, test_idx = train_test_split(range(df.shape[0]), test_size=0.2)
train_idx = np.array(train_idx)
test_idx = np.array(test_idx)
kf = KFold(n_splits=10)
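
If mltools is not available, the explicit imports below should work as a substitute. This is only a sketch: it assumes mltools simply re-exports these packages, and that df is a pandas DataFrame whose last column is the regression target (the construction of df is not shown in this post).

# Sketch: explicit imports in place of `from mltools import *`.
# `df` is assumed to be a pandas DataFrame whose last column is the target;
# how it is built is not shown in the original post.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor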

The metric is the root-mean-square error (RMSE):

def metric(y1, y2):
    # Root-mean-square error between two vectors
    return np.sqrt(np.mean((y1 - y2)**2))
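
As a quick sanity check (illustrative values only, not from the dataset used in this post):

# Two illustrative vectors differing in one entry by 1:
# RMSE = sqrt((0 + 0 + 1) / 3) ≈ 0.577
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])
print(metric(y_true, y_pred))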

The score function in the cross-validation approach is

def get_score(lr=1.0):
    # Evaluate a scaled gradient-boosting pipeline at a given learning rate
    # using 10-fold cross-validation on the training set.
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('gbr', GradientBoostingRegressor(learning_rate=lr))
    ])
    t = []  # per-fold training RMSE
    v = []  # per-fold validation RMSE
    for tid, vid in kf.split(range(len(train_idx))):
        pipe.fit(df.iloc[train_idx[tid], :-1], df.iloc[train_idx[tid], -1])
        y1 = pipe.predict(df.iloc[train_idx[tid], :-1])
        y2 = pipe.predict(df.iloc[train_idx[vid], :-1])
        t.append(metric(y1, df.iloc[train_idx[tid], -1]))
        v.append(metric(y2, df.iloc[train_idx[vid], -1]))
    return np.mean(t), np.std(t), np.mean(v), np.std(v)
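
Each call returns the mean and standard deviation of the training and validation RMSE over the 10 folds. For example, a single (hypothetical) call at one learning rate might look like:

# Hypothetical single call; unpacks (mean train RMSE, std train RMSE,
# mean val RMSE, std val RMSE)
t_mean, t_std, v_mean, v_std = get_score(lr=0.02)
print(f"train: {t_mean:.4f} ± {t_std:.4f}, val: {v_mean:.4f} ± {v_std:.4f}")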

The driver code is

# Sweep 50 learning rates between 1e-5 and 0.04
lrs = np.linspace(1e-5, 0.04, 50)
vs = np.zeros((len(lrs), 4))
for k, lr in enumerate(lrs):
    vs[k, 0], vs[k, 1], vs[k, 2], vs[k, 3] = get_score(lr)

We can visualize the training and validation errors:

plt.plot(lrs, vs[:,0], label = "Training")
plt.fill_between(lrs, vs[:,0]-vs[:,1], vs[:,0]+vs[:,1], alpha = 0.5)
plt.plot(lrs, vs[:,2], label = "Validation")
plt.fill_between(lrs, vs[:,2]-vs[:,3], vs[:,2]+vs[:,3], alpha = 0.5)
plt.legend()

We see that we need a small learning rate so that the model does not overfit. We use a learning rate of 0.02 here:

# Refit on the full training split with the chosen learning rate and
# evaluate on the held-out test split.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('gbr', GradientBoostingRegressor(learning_rate=0.02))
])
X_train, y_train = df.iloc[train_idx, :-1], df.iloc[train_idx, -1]
X_test, y_test = df.iloc[test_idx, :-1], df.iloc[test_idx, -1]
pipe.fit(X_train, y_train)
y1 = pipe.predict(X_train)
y2 = pipe.predict(X_test)
train_score = metric(y_train, y1)
test_score = metric(y_test, y2)
print(train_score, test_score)

The training and testing scores are 0.0581 and 0.0607, respectively.
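
The same sweep can also be expressed with scikit-learn's built-in GridSearchCV. The sketch below is one possible equivalent, not the approach used above; it assumes the same df and train_idx, and uses scikit-learn's negated-RMSE scoring convention.

# Alternative sketch using GridSearchCV (not the approach used above).
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('gbr', GradientBoostingRegressor())
])
grid = GridSearchCV(
    pipe,
    param_grid={'gbr__learning_rate': np.linspace(1e-5, 0.04, 50)},
    scoring='neg_root_mean_squared_error',  # scikit-learn reports negated RMSE
    cv=KFold(n_splits=10),
)
grid.fit(df.iloc[train_idx, :-1], df.iloc[train_idx, -1])
print(grid.best_params_, -grid.best_score_)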
