Data Analysis with Linear Models

KLDIV
Apr 6, 2021 · 4 min read

In this article, we illustrate the IEBAE framework (pronounced “eBay”: I/O, Exploration, Benchmark, Analysis, Evaluation) with a linear-model example.

We use the Graduate Admission 2 dataset, which can be obtained from https://www.kaggle.com/mohansacharya/graduate-admissions. Our ultimate goal is to predict “Chance of Admit” from other variables.

We import some useful packages:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

I/O

We can glimpse the dataset in tabular form with describe:

df = pd.read_csv("Admission_Predict_Ver1.1.csv", index_col = "Serial No.")
df.describe().transpose()

We can save the dataset in HDF5 format:

df.to_hdf("data.h5", "linear_model/admission")
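The dataset can then be reloaded later without re-parsing the CSV (a small sketch; it assumes the same "data.h5" file and key used above, and that PyTables is installed):

# Reload the saved dataset from the HDF5 file written above
df = pd.read_hdf("data.h5", "linear_model/admission")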

Exploration

A straightforward data exploration technique is a pairplot:

sns.pairplot(data = df)

We can see that the Chance of Admit is highly correlated with the GRE score, TOEFL score, and CGPA.
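To put numbers behind this observation, we can also inspect the correlation matrix directly (a quick sketch; it assumes the target is the last column of the dataframe):

# Pairwise correlations, sorted by correlation with the target column
corr = df.corr()
target_col = df.columns[-1]  # the "Chance of Admit" column
print(corr[target_col].sort_values(ascending=False))
sns.heatmap(corr, annot=True, cmap="coolwarm")  # optional: visualize the full matrix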

Benchmark

Linear models are the de facto baseline models for regression problems. Therefore, we first construct a linear model to predict the Chance of Admit. Before our machine learning analysis, we need to

(1) determine the appropriate metric for evaluating the performance of our models;

(2) design a train-validate-test split strategy.

For the first question, we use the root mean squared error (RMSE) between the predictions and the true values.

def metric(y1, y2):
    return np.sqrt(np.mean((y1 - y2)**2))
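As a tiny worked example (not from the original article): for predictions [0.70, 0.80] against true values [0.72, 0.76], the RMSE is sqrt((0.02**2 + 0.04**2)/2) ≈ 0.0316.

# Quick check of the metric on a toy pair of vectors
print(metric(np.array([0.70, 0.80]), np.array([0.72, 0.76])))  # ≈ 0.0316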

For the second question, we can use standard KFold cross-validation for model selection:

kf = KFold(n_splits=10)

For model testing, we use a standard 80%/20% train-test split:

train_idx, test_idx = train_test_split(range(0, df.shape[0]), test_size = 0.2)
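The split is random, so the exact scores below will vary from run to run; if reproducibility matters, one can fix the seed (an optional tweak, not in the original):

# Optional: fix random_state so the split, and hence the reported scores, are reproducible
train_idx, test_idx = train_test_split(range(0, df.shape[0]), test_size = 0.2, random_state = 42)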

Because linear models have a strong statistical interpretation, we use the linear regression tools from statsmodels, which provides more interpretable metrics.

sx = StandardScaler()   # scaler for the features
sy = StandardScaler()   # scaler for the target
X = sx.fit_transform(df.iloc[train_idx, :7])
y = sy.fit_transform(df.iloc[train_idx, 7:8])
X = pd.DataFrame(X, index = df.index[train_idx], columns = df.columns[:7])
X = sm.add_constant(X)  # add an intercept column (can be omitted since the data are standardized)
res = sm.OLS(y, X).fit()
res.summary()

We can see that University Rating and SOP do not show statistical significance for predicting the Chance of Admit.
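The same conclusion can be read off programmatically from the fitted results (a sketch; res.pvalues is indexed by the columns of X):

# Per-coefficient p-values; entries above 0.05 are not significant at the 5% level
print(res.pvalues.sort_values(ascending=False))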

# Predictions are in the standardized target scale, so map them back with sy.
y1 = res.predict(sm.add_constant(sx.transform(df.iloc[train_idx, :7])))
y2 = res.predict(sm.add_constant(sx.transform(df.iloc[test_idx, :7])))
y1 = sy.inverse_transform(np.asarray(y1).reshape(-1, 1)).ravel()
y2 = sy.inverse_transform(np.asarray(y2).reshape(-1, 1)).ravel()
sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[train_idx, 7], "Prediction": y1}))
sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[test_idx, 7], "Prediction": y2}))

The training and testing scores are 0.0603, 0.0566 respectively.

train_score = metric(y1, df.iloc[train_idx, 7])
test_score = metric(y2, df.iloc[test_idx, 7])

Analysis

We now consider a slightly more advanced model (compared with linear regression): ridge regression. Because ridge regression has one hyperparameter α, we use cross-validation to select it.

# First build the (unscaled) training and testing sets used below.
X_train = df.iloc[train_idx, :7]
y_train = df.iloc[train_idx, 7]
X_test = df.iloc[test_idx, :7]
y_test = df.iloc[test_idx, 7]

def get_score(alpha = 1.0):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("ridge regression", Ridge(alpha = alpha))
    ])
    kf = KFold(n_splits=10)
    t_mse = []  # training RMSE per fold
    v_mse = []  # validation RMSE per fold
    for tid, vid in kf.split(range(X_train.shape[0])):
        xt = X_train.iloc[tid, :]
        yt = y_train.iloc[tid]
        xv = X_train.iloc[vid, :]
        yv = y_train.iloc[vid]
        pipe.fit(xt, yt)
        yhat = pipe.predict(xv)
        t_mse.append(np.sqrt(np.mean((yt - pipe.predict(xt))**2)))
        v_mse.append(np.sqrt(np.mean((yhat - yv)**2)))
    # return the cross-validation (validation-fold) error, which is what we use to pick α
    return np.mean(v_mse), np.std(v_mse)

We train with different hyperparameters, and collect the results:

ms = []
ss = []
a = np.linspace(0.0001, 1000.0, 50)
for alpha in a:
    m, s = get_score(alpha)
    ms.append(m)
    ss.append(s)
ms = np.array(ms)
ss = np.array(ss)
plt.plot(a, ms, "g")
plt.fill_between(a, ms - ss, ms + ss, alpha = 0.5)
plt.xscale("log")
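If we wanted to pick α programmatically rather than by eye, one option (a sketch, not in the original) is to take the α with the smallest mean cross-validation error:

# α with the lowest mean validation RMSE across the folds
best_alpha = a[np.argmin(ms)]
print(best_alpha)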

In this case, the smaller α is, the better. We proceed with α = 10:

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge regression", Ridge(alpha = 10.0))
])
pipe.fit(X_train, y_train)
yhat = pipe.predict(X_test)
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': yhat}))

The training and testing scores are 0.0599 and 0.0590, respectively.
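For reference, these numbers can be reproduced with the metric defined earlier (a sketch reusing the split and pipeline above):

# Training and testing RMSE of the fitted ridge pipeline
ridge_train = metric(pipe.predict(X_train), y_train)
ridge_test = metric(yhat, y_test)
print(ridge_train, ridge_test)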

The next option is gradient boosting regression:

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('gbr', GradientBoostingRegressor())
])
pipe.fit(X_train, y_train)
y1 = pipe.predict(X_train)
y2 = pipe.predict(X_test)
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_train, 'Prediction': y1}))
sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': y2}))

The training and testing scores are 0.0405 and 0.0594, respectively.
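The gradient-boosting scores follow in the same way (a short sketch):

# Training and testing RMSE of the gradient boosting pipeline
gbr_train = metric(y1, y_train)
gbr_test = metric(y2, y_test)
print(gbr_train, gbr_test)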

Evaluation

The evaluation step aggregates the evaluation metrics and presents the models side by side with a horizontal bar chart (barh):

dat = np.array([
    [0.06031010091513192, 0.05659089069370097],  # linear regression
    [0.05891919602076211, 0.05989847557550637],  # ridge regression
    [0.0405, 0.0594]                              # gradient boosting (rounded values reported above)
])
d = pd.DataFrame(dat,
                 index = ['Linear Regression', 'Ridge Regression', 'Gradient Boosting'],
                 columns = ['Training', 'Testing'])
d.plot.barh()

Although gradient boosting has a small training error, the testing error is large, indicating a certain degree of overfitting. The ridge regression has the smallest discrepancy between training and testing error, making it the most generalizable model among the three.
