Logistic regression is a standard benchmark method for classification; it serves as the Benchmark step in the IEBAE (I/O, Exploration, Benchmark, Analysis, Evaluation) framework. In this note, we illustrate how to perform logistic regression on the heart attack dataset: https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility
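As a quick refresher (an addition here, not part of the original note's code), logistic regression models P(y = 1 | x) as the sigmoid of a linear combination of the features; the names below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# toy design matrix: intercept column plus one feature
X = np.array([[1.0, -2.0],
              [1.0,  0.0],
              [1.0,  2.0]])
beta = np.array([0.0, 1.0])  # intercept 0, slope 1
p = sigmoid(X @ beta)        # predicted P(y = 1 | x) for each row
```

With a zero intercept and unit slope, the middle row (feature value 0) gets probability exactly 0.5, and the probabilities increase monotonically with the feature.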
Step 1: Import necessary packages
from mltools import *
Step 2: Data I/O. I have already saved the data to an HDF5 file.
with HDFS("data.h5") as store:
    df = store.get("linear_model/heart")

X = df.drop('target', axis=1)
y = df.target
Step 3: Explore the dataset. We use kdeplot to inspect the distribution of each feature and of the target.
fig, axs = subplots(ncols=len(X.columns), figsize=(100, 5))
for k, c in enumerate(X.columns):
    kdeplot(X[c], ax=axs[k])
plt.figure()
kdeplot(y)
The density plots suggest that “sex” and “oldpeak” are the features most strongly associated with the target.
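To put a number on that impression, one option (an addition here, not from the note) is the point-biserial correlation, which for a 0/1 target is just the Pearson correlation between each feature and the target; a sketch on synthetic stand-in data rather than the Kaggle set:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # synthetic 0/1 target
features = {
    "informative": y + 0.5 * rng.normal(size=n),  # depends on the target
    "noise": rng.normal(size=n),                  # independent of the target
}

# Pearson correlation with a binary target == point-biserial correlation
corr = {name: np.corrcoef(x, y)[0, 1] for name, x in features.items()}
```

Features whose correlation is near zero are weak candidates; this complements, but does not replace, looking at the density plots.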
Step 4: Training
train_idx, test_idx = train_test_split(range(len(y)))
X = add_constant(X)  # add the intercept column once, before splitting
y_train = y.iloc[train_idx]
X_train = X.iloc[train_idx, :]
y_test = y.iloc[test_idx]
X_test = X.iloc[test_idx, :]

lg = Logit(y_train, X_train)  # X_train already includes the constant
model = lg.fit()
model.summary()
The LLR p-value is very small, so we reject the null hypothesis that none of the factors explains heart attacks. The pseudo R-squared (McFadden's, by default) plays a role analogous to R-squared in linear regression.
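Concretely, the pseudo R-squared reported by statsmodels is McFadden's 1 − llf/llnull, where llnull is the log-likelihood of an intercept-only model; a minimal numpy sketch with made-up probabilities:

```python
import numpy as np

def bernoulli_loglik(y, p):
    # log-likelihood of labels y under predicted probabilities p
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 0, 1, 1, 1])
p_model = np.array([0.2, 0.3, 0.7, 0.8, 0.9])  # hypothetical fitted probabilities
p_null = np.full_like(p_model, y.mean())       # intercept-only model: the base rate

llf = bernoulli_loglik(y, p_model)
llnull = bernoulli_loglik(y, p_null)
pseudo_r2 = 1 - llf / llnull  # McFadden's pseudo R-squared
```

A value of 0 means the model is no better than predicting the base rate; values approaching 1 mean near-perfect probabilities, but it is not a proportion of variance explained.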
To better understand the effect of each factor, we use coefficient_importance to visualize the z-scores.
coef_df = pd.DataFrame({  # new name, to avoid shadowing the data frame df from Step 2
    'value': model.params,
    'score': model.tvalues
})
coefficient_importance(coef_df)
We see that “ca” has the strongest explanatory power.
Step 5: Cross-validation
Cross-validation has two uses: selecting hyperparameters within a family of models, and estimating the mean and variance of performance on unseen data when the dataset is limited. Here we consider the second use.
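For intuition about what KFold does, here is a simplified re-implementation (not sklearn's actual code; the real KFold also supports shuffling): it partitions the indices into K blocks and uses each block once as the validation set:

```python
import numpy as np

def kfold_indices(n, k):
    # yields (train_idx, val_idx); every sample appears in exactly one val set
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(10, 5))
```

Averaging the per-fold validation scores estimates mean performance on unseen data, and their spread estimates its variance.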
kf = KFold(10)
train_score = []
val_score = []
for tid, vid in kf.split(range(len(y_train))):
    x0 = X_train.iloc[tid, :]
    y0 = y_train.iloc[tid]
    lg = Logit(y0, x0)
    model = lg.fit()
    score_tr = binary_classification_score(model, x0, y0)
    score_val = binary_classification_score(model, X_train.iloc[vid, :], y_train.iloc[vid])
    train_score.append(score_tr)
    val_score.append(score_val)
train_score = pd.DataFrame(train_score)
val_score = pd.DataFrame(val_score)

binary_classification_performance(train_score, val_score)
On the test set:
binary_classification_score(model, X_test, y_test)
The output is
Accuracy 0.802632
AUC 0.887955
Log Loss -6.816979
F1 0.835165
Precision 0.775510
Recall 0.904762
dtype: float64
We see that accuracy, AUC, F1, precision, and recall are similar to those on the training and validation folds, indicating little risk of overfitting.
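For reference, the headline numbers above come straight from the confusion matrix; a self-contained sketch with toy predictions (not the model's actual output):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

accuracy = np.mean(y_pred == y_true)
precision = tp / (tp + fp)  # of predicted positives, how many are real
recall = tp / (tp + fn)     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

A recall noticeably higher than precision, as in the table above, means the model catches most heart-attack cases at the cost of some false alarms.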