In this article, we illustrate the IEBAE framework (pronounced like “eBay”: I/O, Exploration, Benchmark, Analysis, Evaluation) with a linear model example.
We use the Graduate Admission 2 dataset, which can be obtained from https://www.kaggle.com/mohansacharya/graduate-admissions. Our ultimate goal is to predict “Chance of Admit” from other variables.
We import some useful packages:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
I/O
We can take a quick look at the dataset in tabular form with describe:
df = pd.read_csv("Admission_Predict_Ver1.1.csv", index_col = "Serial No.")
df.describe().transpose()
We can save the dataset in HDF5 format:
df.to_hdf("data.h5", "linear_model/admission")
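If needed, the table can later be loaded back from the HDF5 file; a minimal sketch reusing the key from the call above:
# Reload the saved table using the same key
df = pd.read_hdf("data.h5", "linear_model/admission")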
Exploration
The most straightforward data exploration technique is the pairplot:
sns.pairplot(data = df)
We can see that the Chance of Admit is highly correlated with the GRE, TOEFL, and CGPA scores.
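To quantify this observation, we can also inspect the pairwise correlation matrix; a small sketch using seaborn's heatmap (all columns in this dataset are numeric):
# Pearson correlations between the columns, annotated on a heatmap
sns.heatmap(df.corr(), annot = True, fmt = ".2f")
plt.show()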
Benchmark
Linear models are the de facto baseline models for regression problems. Therefore, we first construct a linear model to predict the Chance of Admit. Before our machine learning analysis, we need to
(1) determine the appropriate metric for evaluating the performance of our models;
(2) design a train-validate-test split strategy.
For the first question, we use the root mean squared error (RMSE) between the predicted and true values.
def metric(y1, y2):
    return np.sqrt(np.mean((y1 - y2)**2))
For the second question, we can use standard KFold cross-validation for model selection:
kf = KFold(n_splits=10)
For model testing, we use a standard 80%-20% train-test split:
train_idx, test_idx = train_test_split(range(0, df.shape[0]), test_size = 0.2)
Because linear models have a strong statistical interpretation, we use the linear regression tools from statsmodels, which provide more interpretable metrics.
sx = StandardScaler()
sy = StandardScaler()
X = sx.fit_transform(df.iloc[train_idx,:7])
y = sy.fit_transform(df.iloc[train_idx, 7:8])
X = pd.DataFrame(X, index = df.index[train_idx], columns = df.columns[:7])
X = sm.add_constant(X)  # the intercept is essentially zero after standardization, so this can be omitted
res = sm.OLS(y, X).fit()
res.summary()
We can see that University Rating and SOP do not show statistical significance for predicting chance of admit.
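This can be checked directly from the fitted results object, for example by looking at the coefficient p-values; a small sketch assuming res is the OLS fit above:
# p-values of the fitted coefficients; larger values indicate weaker significance
print(res.pvalues.sort_values(ascending = False))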
y1 = res.predict(sm.add_constant(sx.transform(df.iloc[train_idx,:7])))
y2 = res.predict(sm.add_constant(sx.transform(df.iloc[test_idx,:7])))
# the scaler expects 2D input, so reshape before inverting and flatten afterwards
y1 = sy.inverse_transform(y1.reshape(-1, 1)).ravel()
y2 = sy.inverse_transform(y2.reshape(-1, 1)).ravel()
sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[train_idx, 7], "Prediction": y1}))
sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[test_idx, 7], "Prediction": y2}))
The training and testing scores are 0.0603 and 0.0566, respectively:
train_score = metric(y1, df.iloc[train_idx, 7])
test_score = metric(y2, df.iloc[test_idx, 7])
Analysis
We now consider a slightly more advanced model than linear regression: ridge regression. Because ridge regression has one hyperparameter α, we need to do cross-validation.
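The code below refers to X_train, y_train, X_test, and y_test; a minimal sketch of how they can be built from the earlier train/test split, assuming as before that the first seven columns are the features and the eighth is the Chance of Admit:
# Feature/target split based on the indices from train_test_split above
X_train = df.iloc[train_idx, :7]
y_train = df.iloc[train_idx, 7]
X_test = df.iloc[test_idx, :7]
y_test = df.iloc[test_idx, 7]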
def get_score(alpha = 1.0):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("ridge regression", Ridge(alpha = alpha))
    ])
    kf = KFold(n_splits=10)
    t_mse = []
    v_mse = []
    for tid, vid in kf.split(range(X_train.shape[0])):
        xt = X_train.iloc[tid, :]
        yt = y_train.iloc[tid]
        xv = X_train.iloc[vid, :]
        yv = y_train.iloc[vid]
        pipe.fit(xt, yt)
        yhat = pipe.predict(xv)
        t_mse.append(np.sqrt(np.mean((yt - pipe.predict(xt))**2)))
        v_mse.append(np.sqrt(np.mean((yhat - yv)**2)))
    # report the cross-validated validation error, which is what we use to pick alpha
    return np.mean(v_mse), np.std(v_mse)
We train with different hyperparameters, and collect the results:
ms = []
ss = []
a = np.linspace(0.0001, 1000.0, 50)
for alpha in a:
    m, s = get_score(alpha)
    ms.append(m)
    ss.append(s)
ms = np.array(ms)
ss = np.array(ss)
plt.plot(a, ms, "g")
plt.fill_between(a, ms - ss, ms + ss, alpha = 0.5)
plt.xscale("log")
In this case, the smaller α, the better. We consider α = 10:
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge regression", Ridge(alpha = 10.0))
])
pipe.fit(X_train, y_train)
yhat = pipe.predict(X_test)
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': yhat}))
The training and testing scores are 0.0599 and 0.0590, respectively.
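These scores can be reproduced with the metric function from the Benchmark section; a small sketch reusing the fitted pipeline (the variable names are illustrative):
# RMSE on the training and testing sets for the ridge pipeline
ridge_train_score = metric(pipe.predict(X_train), y_train)
ridge_test_score = metric(yhat, y_test)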
The next option is gradient boosting regression:
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('gbr', GradientBoostingRegressor())
])
pipe.fit(X_train, y_train)
y1 = pipe.predict(X_train)
y2 = pipe.predict(X_test)
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_train, 'Prediction': y1}))
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': y2}))
The training and testing scores are 0.0405 and 0.0594, respectively.
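As before, these numbers come from the same RMSE metric applied to the predictions above; a small sketch (variable names are illustrative):
# RMSE of the gradient boosting pipeline on the training and testing sets
gbr_train_score = metric(y1, y_train)
gbr_test_score = metric(y2, y_test)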
Evaluation
The evaluation step aggregates the evaluation metrics and compares the different models using a horizontal bar chart (barh):
dat = np.array([
    [0.06031010091513192, 0.05659089069370097],
    [0.05891919602076211, 0.05989847557550637],
    [0.0405, 0.0594]  # gradient boosting scores reported above
])
d = pd.DataFrame(dat, index = ['Linear Regression', 'Ridge Regression', 'Gradient Boosting'], columns = ['Training', 'Testing'])
d.plot.barh()
Although gradient boosting has a small training error, its testing error is large, indicating a certain degree of overfitting. Ridge regression has the smallest discrepancy between training and testing error, making it the most generalizable model among the three.