# Logistic Regression

Logistic regression is the benchmark method for classification. It serves as one crucial step in the IEBAE (I/O, Exploration, Benchmark, Analysis, Evaluation) framework. In this note, we illustrate how to perform logistic regression using the heart attack dataset: https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility

Step 1: Import necessary packages

`from mltools import *`

Step 2: Data I/O. I have already saved all the data to an HDF5 file.

```python
with HDFS("data.h5") as store:
    df = store.get("linear_model/heart")

X = df.drop('target', axis=1)
y = df.target
```

Step 3: Explore the dataset. We use `kdeplot` to plot a kernel density estimate of each feature and of the target.

```python
fig, axs = subplots(ncols=len(X.columns), figsize=(100, 5))
for k, c in enumerate(X.columns):
    kdeplot(X[c], ax=axs[k])

plt.figure()
kdeplot(y)
```

We see that “sex” and “oldpeak” are highly correlated with the target.
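A numeric check can complement the density plots. The following is a minimal sketch that computes the Pearson correlation between each feature and a binary target in plain Python. The values and the names `feature_a` and `feature_b` are made up for illustration and are not taken from the heart dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up illustrative values, NOT taken from the heart dataset.
target = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "feature_a": [0, 0, 1, 1, 1, 1, 0, 1],
    "feature_b": [0.1, 0.4, 0.0, 2.3, 1.8, 3.0, 0.6, 1.2],
}

# Correlation of every feature with the target, strongest first.
corr = {name: pearson(vals, target) for name, vals in features.items()}
for name, r in sorted(corr.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {r:+.3f}")
```

With a pandas DataFrame the same idea is usually a one-liner along the lines of `df.corr()['target']`, assuming all columns are numeric.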

Step 4: Training

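```python
# Split the row positions into a training set and a test set.
train_idx, test_idx = train_test_split(range(len(y)))

# Add the intercept column, then slice features and target by position.
X = add_constant(X)
y_train = y.iloc[train_idx]
X_train = X.iloc[train_idx, :]
y_test = y.iloc[test_idx]
X_test = X.iloc[test_idx, :]

# Fit the logistic regression on the training rows and show the summary table.
lg = Logit(y_train.astype("category"), add_constant(X_train))
model = lg.fit()
model.summary()
```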

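The LLR p-value is very small, so we reject the null hypothesis that all coefficients other than the intercept are zero, i.e. that none of the factors helps explain heart attacks. The pseudo R squared plays a role similar to R squared in linear regression, but it is computed from log-likelihoods rather than from sums of squares, so the two are not directly comparable.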
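To make these two summary statistics concrete, here is a minimal sketch of how a likelihood-ratio statistic and McFadden's pseudo R squared are computed from log-likelihoods. It assumes that `Logit` is the statsmodels class, which reports McFadden's version as `prsquared`. The labels and probabilities are made up for illustration and are not the output of a fitted model.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

# Made-up labels and predicted probabilities, for illustration only.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_model = [0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4]

# The null model has an intercept only, so it predicts the base rate for every row.
base_rate = sum(y) / len(y)
p_null = [base_rate] * len(y)

ll_model = log_likelihood(y, p_model)
ll_null = log_likelihood(y, p_null)

pseudo_r2 = 1 - ll_model / ll_null   # McFadden's pseudo R squared
llr = 2 * (ll_model - ll_null)       # likelihood-ratio statistic

print(f"pseudo R2 = {pseudo_r2:.3f}, LLR = {llr:.3f}")
```

The LLR p-value is then the upper tail probability of this statistic under a chi-squared distribution whose degrees of freedom equal the number of non-intercept coefficients.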

To better understand the effect of the different factors, we use `coefficient_importance` to visualize the z score of each coefficient.

```python
df = pd.DataFrame({
    'value': model.params,
    'score': model.tvalues
})
coefficient_importance(df)
```

We see that CA has the largest z score in magnitude, and by that measure the best explanatory power.
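As a reminder of what is being ranked, a z score is the coefficient estimate divided by its standard error. The sketch below uses hypothetical coefficients and standard errors under placeholder names (`x1`, `x2`, `x3`). They are not values from the fitted model above.

```python
# Hypothetical estimates and standard errors, for illustration only.
params = {"const": 3.1, "x1": -1.5, "x2": 0.02, "x3": -0.8}
std_err = {"const": 2.5, "x1": 0.5, "x2": 0.01, "x3": 0.2}

# z score = estimate / standard error.
z = {name: params[name] / std_err[name] for name in params}

# Rank the non-intercept terms by the magnitude of their z score.
ranked = sorted((n for n in z if n != "const"),
                key=lambda n: abs(z[n]), reverse=True)
for name in ranked:
    print(f"{name}: z = {z[name]:+.2f}")
```

Note that the ranking depends on the standard error as well as on the size of the coefficient: here `x2` has the smallest coefficient but still a z score of about 2.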

Step 5: Cross-validation

Cross-validation has two uses: selecting the hyperparameters of a family of models, or estimating the mean and variance of performance on unseen data when the dataset is limited. Here we consider the second use.

```python
kf = KFold(10)
train_score = []
val_score = []
for tid, vid in kf.split(range(len(y_train))):
    # Fit on the training part of this fold.
    x0 = X_train.iloc[tid, :]
    y0 = y_train.iloc[tid]
    lg = Logit(y0, x0)
    model = lg.fit()  # rebound on every fold; after the loop it is the last fold's fit
    # Score on the rows used for fitting and on the held-out rows.
    score_tr = binary_classification_score(model, x0, y0)
    score_val = binary_classification_score(model, X_train.iloc[vid, :], y_train.iloc[vid])
    train_score.append(score_tr)
    val_score.append(score_val)

# Collect the per-fold scores into one table each and compare them.
train_score = pd.DataFrame(train_score)
val_score = pd.DataFrame(val_score)
binary_classification_performance(train_score, val_score)
```
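The splitting that `KFold(10)` performs can be sketched without any library. The helper `kfold_indices` below is a hypothetical stand-in and assumes contiguous, unshuffled folds, which is the default behaviour of scikit-learn's `KFold` if that is the class in use here.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Small example: 10 samples split into 3 folds.
folds = list(kfold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Every sample lands in exactly one validation fold, so each fold's validation score is computed on rows the model did not see during fitting. The spread of those scores across folds is what gives the variance estimate.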

On the test set, using the model fitted in the last fold:

`binary_classification_score(model, X_test, y_test)`

The output is

```
Accuracy     0.802632
AUC          0.887955
Log Loss    -6.816979
F1           0.835165
Precision    0.775510
Recall       0.904762
dtype: float64
```

We see that the accuracy, AUC, F1, precision, and recall are similar to those on the training and validation sets, which indicates little risk of overfitting.
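As a sanity check on how these metrics relate to one another, the sketch below recomputes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are not given in the output above. They are an inferred assumption: one set of counts on a 76-sample test set that reproduces the reported figures.

```python
# Confusion-matrix counts inferred from the reported metrics (an assumption,
# not values printed by the original code).
tp, fp, fn, tn = 38, 11, 4, 23

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of correct predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall = tp / (tp + fn)                      # found among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy   {accuracy:.6f}")
print(f"Precision  {precision:.6f}")
print(f"Recall     {recall:.6f}")
print(f"F1         {f1:.6f}")
```

With these counts the recall is higher than the precision: the model misses few actual positives (4 false negatives) at the cost of more false alarms (11 false positives).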