Logistic Regression

Logistic regression is a standard benchmark method for classification. It serves as one crucial step in the IEBAE (I/O, Exploration, Benchmark, Analysis, Evaluation) framework. In this note, we illustrate how to perform logistic regression using the heart attack dataset: https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility

Step 1: Import necessary packages

from mltools import *

Step 2: Data I/O. The data has already been saved to an HDF5 file.

# HDFS is assumed to be an HDF5 store wrapper exported by mltools
with HDFS("data.h5") as store:
    df = store.get("linear_model/heart")
X = df.drop('target', axis=1)  # feature columns
y = df.target                  # binary label

Step 3: Explore the dataset. We draw a kernel density estimate (kdeplot) for each feature and for the target.

# One KDE panel per feature
fig, axs = plt.subplots(ncols=len(X.columns), figsize=(100, 5))
for k, c in enumerate(X.columns):
    kdeplot(X[c], ax=axs[k])
# Distribution of the target in a separate figure
plt.figure()
kdeplot(y)

We see that “sex” and “oldpeak” are highly correlated with the target.
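A density plot of a single feature shows its marginal distribution; the association with the target can also be quantified numerically. The sketch below runs on synthetic data, not on the heart dataset, and computes the Pearson correlation between each feature and a binary target with `np.corrcoef`. The names `informative`, `noise`, and `target_correlation` are hypothetical. On the DataFrame above, `X.corrwith(y)` should give the analogous per-column numbers.

```python
import numpy as np

# Synthetic stand-in data (assumption): one feature shifted by the label,
# one feature that is pure noise.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(size=n)
noise = rng.normal(size=n)

def target_correlation(feature, target):
    """Pearson correlation between a feature and a 0/1 target."""
    return np.corrcoef(feature, target)[0, 1]

corr_informative = target_correlation(informative, y)
corr_noise = target_correlation(noise, y)
print(f"informative: {corr_informative:+.3f}, noise: {corr_noise:+.3f}")
```

A feature whose distribution shifts with the label gets a correlation clearly away from zero, while the noise feature stays close to zero.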

Step 4: Training

# Split row positions into training and test sets
train_idx, test_idx = train_test_split(range(len(y)))
X = add_constant(X)  # add the intercept column once
y_train = y.iloc[train_idx]
X_train = X.iloc[train_idx, :]
y_test = y.iloc[test_idx]
X_test = X.iloc[test_idx, :]
lg = Logit(y_train, X_train)
model = lg.fit()
model.summary()

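The LLR p-value is very small, so we reject the null hypothesis that none of the factors explains heart attacks, that is, that the model fits no better than an intercept-only model. The pseudo R-squared plays a role similar to R-squared in linear regression: statsmodels reports McFadden's version, 1 - llf/llnull, which compares the log-likelihood of the fitted model with that of the intercept-only model. It is not a fraction of variance explained.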
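To make the two summary quantities concrete, here is a minimal sketch that computes the likelihood-ratio statistic, its p-value, and McFadden's pseudo R-squared by hand. It uses synthetic data with a single feature and a small Newton-Raphson fit written for this illustration (`fit_logit` is a hypothetical helper, not part of statsmodels or mltools). In statsmodels the corresponding result attributes are `llr`, `llr_pvalue`, and `prsquared`.

```python
import math
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return beta, loglik

# Synthetic data (an assumption for illustration, not the heart dataset).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = (rng.random(n) < true_p).astype(float)

X_full = np.column_stack([np.ones(n), x])  # intercept + one feature
X_null = np.ones((n, 1))                   # intercept only

_, ll_full = fit_logit(X_full, y)
_, ll_null = fit_logit(X_null, y)

# Likelihood-ratio statistic: twice the gain in log-likelihood over the null model.
llr = 2.0 * (ll_full - ll_null)
# With one extra parameter the statistic is chi-square(1) under the null,
# whose survival function is erfc(sqrt(x / 2)).
llr_pvalue = math.erfc(math.sqrt(llr / 2.0))
# McFadden's pseudo R-squared.
pseudo_r2 = 1.0 - ll_full / ll_null

print(f"LLR = {llr:.2f}, p-value = {llr_pvalue:.3g}, pseudo R2 = {pseudo_r2:.3f}")
```

With more than one feature, the chi-square degrees of freedom equal the number of non-intercept coefficients, so the closed-form `erfc` shortcut used here no longer applies.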

To better understand the effect of the different factors, we use coefficient_importance to visualize the z score of each coefficient.

df = pd.DataFrame({
    'value': model.params,   # fitted coefficients
    'score': model.tvalues   # z statistics (coefficient / standard error)
})
coefficient_importance(df)

We see that CA has the best explanatory power.
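The z score used for this ranking is each coefficient divided by its standard error; in statsmodels, `tvalues` equals `params / bse`. The sketch below shows where those standard errors come from, using synthetic data and a hand-written Newton-Raphson fit. The feature names `strong` and `weak` and the helper functions are hypothetical, chosen only for this illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fisher_information(X, beta):
    """X' W X with W = diag(p * (1 - p)), the information matrix of the logit model."""
    p = sigmoid(X @ beta)
    return X.T @ (X * (p * (1.0 - p))[:, None])

# Synthetic data (assumption): one strong and one weak predictor.
rng = np.random.default_rng(1)
n = 1000
strong = rng.normal(size=n)
weak = rng.normal(size=n)
logits = 0.2 + 2.0 * strong + 0.1 * weak
y = (rng.random(n) < sigmoid(logits)).astype(float)
X = np.column_stack([np.ones(n), strong, weak])

# Newton-Raphson iterations for the maximum-likelihood estimate.
beta = np.zeros(X.shape[1])
for _ in range(25):
    score = X.T @ (y - sigmoid(X @ beta))
    beta = beta + np.linalg.solve(fisher_information(X, beta), score)

# Standard errors are the square roots of the diagonal of the inverse information.
covariance = np.linalg.inv(fisher_information(X, beta))
std_err = np.sqrt(np.diag(covariance))
z_scores = beta / std_err

for name, z in zip(["const", "strong", "weak"], z_scores):
    print(f"{name:>6}: z = {z:+.2f}")
```

A large absolute z score means the coefficient is estimated far from zero relative to its uncertainty, which is what the importance plot ranks.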

Step 5: Cross-validation

Cross-validation has two uses: selecting the hyperparameters of a family of models, or estimating the mean and variance of performance on unseen data from a limited dataset. Here we consider the second use.

kf = KFold(10)
train_score = []
val_score = []
for tid, vid in kf.split(range(len(y_train))):
    x0 = X_train.iloc[tid, :]
    y0 = y_train.iloc[tid]
    # Fit a separate model per fold so that `model` from Step 4 is kept for the test set
    fold_model = Logit(y0, x0).fit()
    score_tr = binary_classification_score(fold_model, x0, y0)
    score_val = binary_classification_score(fold_model, X_train.iloc[vid, :], y_train.iloc[vid])
    train_score.append(score_tr)
    val_score.append(score_val)
train_score = pd.DataFrame(train_score)
val_score = pd.DataFrame(val_score)
binary_classification_performance(train_score, val_score)
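For readers who want to see what the fold loop does underneath, the following is a minimal sketch of unshuffled k-fold splitting followed by a mean and standard deviation summary. `kfold_indices` is a hypothetical helper that mimics contiguous folds; the labels are synthetic, and the "model" is just a majority-class predictor standing in for the logistic regression.

```python
import numpy as np

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    indices = np.arange(n_samples)
    # The first n_samples % n_splits folds receive one extra sample.
    fold_sizes = np.full(n_splits, n_samples // n_splits)
    fold_sizes[: n_samples % n_splits] += 1
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = np.concatenate([indices[:start], indices[start + size:]])
        yield train_idx, val_idx
        start += size

# Synthetic binary labels (assumption for illustration).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)

scores = []
for train_idx, val_idx in kfold_indices(len(y), n_splits=10):
    majority = int(y[train_idx].mean() >= 0.5)        # stand-in "model"
    scores.append(np.mean(y[val_idx] == majority))    # validation accuracy

scores = np.array(scores)
print(f"validation accuracy: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```

The spread across folds is the estimate of how much the score would vary on unseen data; a large gap between training and validation scores would point to overfitting.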

On the test set:

binary_classification_score(model, X_test, y_test)

The output is:

Accuracy     0.802632
AUC          0.887955
Log Loss    -6.816979
F1           0.835165
Precision    0.775510
Recall       0.904762
dtype: float64

The accuracy, AUC, F1, precision, and recall on the test set are similar to those on the training and validation sets, which indicates little risk of overfitting.
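As a quick consistency check on how these metrics relate to each other, the sketch below recomputes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are back-computed from the reported ratios and are an assumption, not something stated in the output above; they correspond to a test set of 76 rows.

```python
# Confusion counts inferred from the reported ratios (assumption, not from the source).
tp, fp, fn, tn = 38, 11, 4, 23

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)            # share of predicted positives that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy   {accuracy:.6f}")   # 0.802632
print(f"Precision  {precision:.6f}")  # 0.775510
print(f"Recall     {recall:.6f}")     # 0.904762
print(f"F1         {f1:.6f}")         # 0.835165
```

Recall being higher than precision means the model misses few actual positives (4 under these counts) at the cost of more false alarms (11).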