# Data Analysis with Linear Models

In this article, we illustrate the IEBAE (pronounced as “eBay”, I/O, Exploration, Benchmark, Analysis, Evaluation) framework with a linear model example.

We import some useful packages:

`import pandas as pdimport matplotlib.pyplot as plt import seaborn as snsimport numpy as npimport statsmodels.api as smfrom sklearn.preprocessing import StandardScalerfrom sklearn.model_selection import train_test_split, KFoldfrom sklearn.pipeline import Pipelinefrom sklearn.linear_model import Ridgefrom sklearn.ensemble import GradientBoostingRegressor`

# I/O

We can glimpse at the dataset using a tabular form with `describe`

`df = pd.read_csv(“Admission_Predict_Ver1.1.csv”, index_col = “Serial No.”)df.describe().transpose()`

We can save the dataset using HDFS

`df.to_hdf("data.h5", "linear_model/admission")`

# Exploration

The straight forward data exploration technique is `pairplot` .

`sns.pairplot(data = df)`

We can see that the chance of admit is highly correlated with GRE/TOFLE, and CGPA score.

# Benchmark

Linear models are the de facto baseline models for regression problems. Therefore, we first construct a linear model to predict the Chance of Admit. Before our machine learning analysis, we need to

(1) determine the appropriate metric for evaluating the performance of our models;

(2) design a train-validate-test split strategy.

For the first question, we use the mean squared error between the prediction and true values.

`def metric(y1, y2):    return np.sqrt(np.mean((y1-y2)**2))`

For the second question, we can use the standard KFold cross-validation for model selection

`kf = KFold(n_splits=10)`

For model testing, we use the standard 80%-20% train test split

`train_idx, test_idx = train_test_split(range(0, df.shape), test_size = 0.2)`

Because linear models have strong statistical interpretation, we use the linear regression tools from `statsmodels` , which provides more interpretable metrics.

`sx = StandardScaler()sy = StandardScaler()X = sx.fit_transform(df.iloc[train_idx,:7])y = sy.fit_transform(df.iloc[train_idx,7:8])X = pd.DataFrame(X, index = df.index[train_idx], columns = df.columns[:7])X = sm.add_constant(X) # can be omittedres = sm.OLS(y, X).fit()res.summary()`

We can see that University Rating and SOP do not show statistical significance for predicting chance of admit.

`y1 = res.predict(sm.add_constant(sx.transform(df.iloc[train_idx,:7])))y2 = res.predict(sm.add_constant(sx.transform(df.iloc[test_idx,:7])))y1 = sy.inverse_transform(y1)y2 = sy.inverse_transform(y2)sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[train_idx, 7], "Prediction": y1}))sns.jointplot(x = "True", y = "Prediction", data = pd.DataFrame({"True": df.iloc[test_idx, 7], "Prediction": y2}))`

The training and testing scores are 0.0603, 0.0566 respectively.

`train_score = metric(y1, df.iloc[train_idx, 7])test_score = metric(y2, df.iloc[test_idx, 7])`

# Analysis

We now consider a slightly more advanced (compared with linear regression) model: ridge regression. Because in ridge regression, we have one hyperparameter α, we need to do cross-validation.

`def get_score(alpha = 1.0):    pipe = Pipeline([        ("scale", StandardScaler()),        ("ridge regression", Ridge(alpha = alpha))    ])    kf = KFold(n_splits=10)    t_mse = []    v_mse = []    for tid, vid in kf.split(range(X_train.shape)):        xt = X_train.iloc[tid, :]        yt = y_train.iloc[tid]        xv = X_train.iloc[vid, :]        yv = y_train.iloc[vid]        pipe.fit(xt, yt)        yhat = pipe.predict(xv)        t_mse.append(np.sqrt(np.mean((yt - pipe.predict(xt))**2)))        v_mse.append(np.sqrt(np.mean((yhat - yv)**2)))    return np.mean(t_mse), np.std(t_mse)`

We train with different hyperparameters, and collect the results:

`ms = []ss = []a = np.linspace(0.0001, 1000.0, 50)for alpha in a:    m, s = get_score(alpha)    ms.append(m)    ss.append(s)ms = np.array(ms)ss = np.array(ss)plt.plot(a, ms, "g")plt.fill_between(a, ms - ss, ms + ss, alpha = 0.5)plt.xscale("log")`

In this case, the smaller α, the better. We consider α=10

`pipe = Pipeline([        ("scale", StandardScaler()),        ("ridge regression", Ridge(alpha = 10.0))    ])pipe.fit(X_train, y_train)yhat = pipe.predict(X_test)sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': yhat}))`

The training and testing scores are 0.0599 and 0.0590, respectively.

The next option is the Gradient Boosting Regression

`pipe = Pipeline([    ('scale', StandardScaler()),    ('gbr', GradientBoostingRegressor())])pipe.fit(X_train, y_train)y1 = pipe.predict(X_train)y2 = pipe.predict(X_test)sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_train, 'Prediction': y1}))sns.jointplot(x = 'True', y = 'Prediction', data = pd.DataFrame({'True': y_test, 'Prediction': y2}))`

The training and testing scores are 0.0405 and 0.0594, respectively.

# Evaluation

The evaluation step aggregates all the evaluation metrics and presents different models using `hbar` .

`dat = np.array([    [0.06031010091513192, 0.05659089069370097],    [0.05891919602076211, 0.05989847557550637]])d = pd.DataFrame(dat, index = ['Linear Regression', 'Ridge Regression'], columns = ['Training', 'Testing'])d.plot.barh()`

Although gradient boosting has a small training error, the testing error is large, indicating a certain degree of overfitting. The ridge regression has the smallest discrepancy between training and testing error, making it the most generalizable model among the three.