# Data Analysis with Linear Models

In this article, we illustrate the IEBAE framework (pronounced "eBay": I/O, Exploration, Benchmark, Analysis, Evaluation) with a linear model example.

We use the Graduate Admission 2 dataset, available at https://www.kaggle.com/mohansacharya/graduate-admissions. Our ultimate goal is to predict "Chance of Admit" from the other variables.

We import some useful packages:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
```

# I/O

We can take a first glance at the dataset in tabular form with `describe`:

```python
df = pd.read_csv("Admission_Predict_Ver1.1.csv", index_col="Serial No.")
df.describe().transpose()
```

We can save the dataset in HDF5 format:

```python
df.to_hdf("data.h5", key="linear_model/admission")
```
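The dataset can later be reloaded with `pd.read_hdf` (this assumes the optional `tables` dependency is installed):

```python
# Reload the saved dataset from the HDF5 store
df = pd.read_hdf("data.h5", "linear_model/admission")
```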

# Exploration

**The most straightforward data exploration technique is `pairplot`.**

```python
sns.pairplot(data=df)
```

We can see that the Chance of Admit is highly correlated with the GRE, TOEFL, and CGPA scores.
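To quantify this, here is a quick sketch that ranks each feature's correlation with the target (assumed here to be the last column of the frame):

```python
# Correlation of every column with "Chance of Admit" (the last column)
corr = df.corr().iloc[:, -1].sort_values(ascending=False)
print(corr)
```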

# Benchmark

**Linear models are the de facto baseline models for regression problems.** Therefore, we first construct a linear model to predict the Chance of Admit. Before the machine learning analysis, we need to

(1) determine an appropriate metric for evaluating the performance of our models;

(2) design a train-validate-test split strategy.

For the first question, we use the root mean squared error (RMSE) between the predicted and true values:

```python
def metric(y1, y2):
    return np.sqrt(np.mean((y1 - y2)**2))
```

For the second question, we can use standard K-fold cross-validation for model selection:

```python
kf = KFold(n_splits=10)
```

For model testing, we use a standard 80%-20% train-test split:

```python
train_idx, test_idx = train_test_split(np.arange(df.shape[0]), test_size=0.2)
```

Because linear models have a strong statistical interpretation, we use the linear regression tools from `statsmodels`, which provide more interpretable diagnostics:

```python
sx = StandardScaler()
sy = StandardScaler()
X = sx.fit_transform(df.iloc[train_idx, :7])
y = sy.fit_transform(df.iloc[train_idx, 7:8])
X = pd.DataFrame(X, index=df.index[train_idx], columns=df.columns[:7])
X = sm.add_constant(X)  # adds the intercept column (can be omitted since the data are standardized)
res = sm.OLS(y, X).fit()
res.summary()
```

From the summary, University Rating and SOP are not statistically significant predictors of the Chance of Admit.
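We can confirm this directly from the fitted results: `statsmodels` exposes the per-coefficient p-values, so a sketch like the following lists the predictors that fail the usual 0.05 significance test:

```python
# Coefficients whose p-values exceed the 0.05 threshold
print(res.pvalues[res.pvalues > 0.05])
```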

We then evaluate the model on the training and test sets:

```python
y1 = res.predict(sm.add_constant(sx.transform(df.iloc[train_idx, :7])))
y2 = res.predict(sm.add_constant(sx.transform(df.iloc[test_idx, :7])))

# Map the standardized predictions back to the original scale
y1 = sy.inverse_transform(y1.reshape(-1, 1)).ravel()
y2 = sy.inverse_transform(y2.reshape(-1, 1)).ravel()

sns.jointplot(x="True", y="Prediction", data=pd.DataFrame({"True": df.iloc[train_idx, 7], "Prediction": y1}))
sns.jointplot(x="True", y="Prediction", data=pd.DataFrame({"True": df.iloc[test_idx, 7], "Prediction": y2}))
```

The training and testing scores are 0.0603 and 0.0566, respectively:

```python
train_score = metric(y1, df.iloc[train_idx, 7])
test_score = metric(y2, df.iloc[test_idx, 7])
```

# Analysis

We now consider a slightly more advanced model than linear regression: *ridge regression*. Because ridge regression has one hyperparameter α, we need cross-validation to select it.
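For reference, scikit-learn's `Ridge` minimizes the ℓ2-penalized least-squares objective, so α trades data fit against coefficient shrinkage:

$$\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2$$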

We write a helper that scores one value of α with 10-fold cross-validation; note that the statistics returned for model selection are computed on the validation folds:

```python
# Materialize the train/test splits defined above
X_train = df.iloc[train_idx, :7]
y_train = df.iloc[train_idx, 7]
X_test = df.iloc[test_idx, :7]
y_test = df.iloc[test_idx, 7]

def get_score(alpha=1.0):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("ridge regression", Ridge(alpha=alpha))
    ])
    kf = KFold(n_splits=10)
    t_mse = []
    v_mse = []
    for tid, vid in kf.split(X_train):
        xt = X_train.iloc[tid, :]
        yt = y_train.iloc[tid]
        xv = X_train.iloc[vid, :]
        yv = y_train.iloc[vid]
        pipe.fit(xt, yt)
        yhat = pipe.predict(xv)
        t_mse.append(np.sqrt(np.mean((yt - pipe.predict(xt))**2)))
        v_mse.append(np.sqrt(np.mean((yhat - yv)**2)))
    # return the validation error statistics used for model selection
    return np.mean(v_mse), np.std(v_mse)
```

We train with different hyperparameters and collect the results:

```python
ms = []
ss = []
a = np.linspace(0.0001, 1000.0, 50)
for alpha in a:
    m, s = get_score(alpha)
    ms.append(m)
    ss.append(s)
ms = np.array(ms)
ss = np.array(ss)

plt.plot(a, ms, "g")
plt.fill_between(a, ms - ss, ms + ss, alpha=0.5)
plt.xscale("log")
```
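Since the x-axis is drawn on a log scale, one could instead sample the α grid uniformly in log space (a minor variation on the sweep above):

```python
# Log-uniform grid of alphas, evenly spaced on the log-scaled axis
a = np.logspace(-4, 3, 50)
```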

In this case, the smaller α, the better. We take α = 10:

```python
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge regression", Ridge(alpha=10.0))
])
pipe.fit(X_train, y_train)
yhat = pipe.predict(X_test)

sns.jointplot(x="True", y="Prediction", data=pd.DataFrame({"True": y_test, "Prediction": yhat}))
```
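The scores below come from the same `metric` helper as before (the analogous two lines apply to the gradient boosting model that follows):

```python
train_score = metric(pipe.predict(X_train), y_train)
test_score = metric(yhat, y_test)
```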

The training and testing scores are 0.0599 and 0.0590, respectively.

The next option is *gradient boosting regression*:

```python
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor())
])
pipe.fit(X_train, y_train)
y1 = pipe.predict(X_train)
y2 = pipe.predict(X_test)

sns.jointplot(x="True", y="Prediction", data=pd.DataFrame({"True": y_train, "Prediction": y1}))
sns.jointplot(x="True", y="Prediction", data=pd.DataFrame({"True": y_test, "Prediction": y2}))
```

The training and testing scores are 0.0405 and 0.0594, respectively.

# Evaluation

**The evaluation step aggregates all the evaluation metrics and compares the models with a horizontal bar chart (`barh`).**

```python
dat = np.array([
    [0.06031010091513192, 0.05659089069370097],
    [0.05891919602076211, 0.05989847557550637],
    [0.0405, 0.0594]  # gradient boosting scores reported above
])
d = pd.DataFrame(dat,
                 index=["Linear Regression", "Ridge Regression", "Gradient Boosting"],
                 columns=["Training", "Testing"])
d.plot.barh()
```

Although gradient boosting has the smallest training error, its testing error is large, indicating a certain degree of overfitting. Ridge regression has the smallest discrepancy between training and testing error, making it the most generalizable of the three models.