# Logistic Regression

Logistic regression is the standard benchmark method for classification, and it serves as the Benchmark step of the IEBAE (I/O, Exploration, Benchmark, Analysis, Evaluation) framework. In this note, we illustrate how to perform logistic regression on the heart attack dataset: https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility
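As a quick reminder, logistic regression models the probability of the positive class as a linear function of the features passed through the logistic (sigmoid) function:

$$
P(y = 1 \mid x) = \sigma(x^\top \beta) = \frac{1}{1 + e^{-x^\top \beta}},
$$

where the coefficients $\beta$ are estimated by maximizing the log-likelihood.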

**Step 1:** Import the necessary packages

`from mltools import *`
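`mltools` is a personal helper package that presumably re-exports the names used below. If you do not have it, most of them can be imported directly from the underlying libraries; the following is a sketch, and the exact contents of `mltools` are an assumption:

```python
# Assumed equivalents of what `mltools` re-exports (hypothetical; adjust as needed)
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import subplots
from seaborn import kdeplot
from sklearn.model_selection import train_test_split, KFold
from statsmodels.api import Logit, add_constant

# HDFS, binary_classification_score, coefficient_importance and
# binary_classification_performance appear to be custom mltools helpers.
```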

**Step 2:** Data I/O. I have already saved all the data to an HDF5 file.

```python
with HDFS("data.h5") as store:
    df = store.get("linear_model/heart")

X = df.drop('target', axis=1)
y = df.target
```
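`HDFS` is an mltools context manager; with plain pandas, an equivalent read looks roughly like this (a sketch, assuming the same file name and key):

```python
import pandas as pd

# Equivalent read using pandas' built-in HDF5 store (assumes the same file/key)
with pd.HDFStore("data.h5") as store:
    df = store.get("linear_model/heart")
```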

**Step 3:** Explore the dataset. We use `kdeplot`.

```python
fig, axs = subplots(ncols=len(X.columns), figsize=(100, 5))
for k, c in enumerate(X.columns):
    kdeplot(X[c], ax=axs[k])

plt.figure()
kdeplot(y)
```

We see that “sex” and “oldpeak” are highly correlated with the target.
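The marginal densities above do not condition on the outcome; to see how a single feature separates the two classes, one option is to plot per-class densities with seaborn (a sketch; it assumes `df` still holds the raw frame loaded in Step 2):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Density of oldpeak, drawn separately for each value of the target
sns.kdeplot(data=df, x='oldpeak', hue='target', common_norm=False)
plt.show()
```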

**Step 4:** Training

```python
train_idx, test_idx = train_test_split(range(len(y)))

X = add_constant(X)  # add the intercept column once, before splitting

y_train = y.iloc[train_idx]
X_train = X.iloc[train_idx, :]
y_test = y.iloc[test_idx]
X_test = X.iloc[test_idx, :]

lg = Logit(y_train, X_train)  # X_train already contains the constant column
model = lg.fit()
model.summary()
```

The LLR p-value is very small, so we reject the null hypothesis that none of the factors explains heart attacks. The pseudo R-squared plays a role analogous to R-squared in linear regression.
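Assuming `Logit` is the statsmodels implementation, the pseudo R-squared in the summary is McFadden's:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}},
$$

where $\hat{L}_{\text{model}}$ and $\hat{L}_{\text{null}}$ are the maximized likelihoods of the fitted model and of an intercept-only model.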

To better understand the effect of the different factors, we use `coefficient_importance` to visualize the z-scores:

```python
df = pd.DataFrame({
    'value': model.params,
    'score': model.tvalues
})
coefficient_importance(df)
```

We see that `ca` has the strongest explanatory power.
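`coefficient_importance` is an mltools helper; a rough plain-matplotlib stand-in (a sketch, the real helper may differ) would sort the coefficients by the magnitude of their z-score and draw a bar chart:

```python
import matplotlib.pyplot as plt

# Sort coefficients by |z-score| and plot them
ranked = df.reindex(df['score'].abs().sort_values(ascending=False).index)
plt.bar(ranked.index, ranked['score'])
plt.ylabel('z-score')
plt.xticks(rotation=45)
plt.show()
```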

**Step 5:** Cross-validation

Cross-validation has two main uses: selecting the hyperparameters of a family of models, or estimating the mean/variance of performance on unseen data when the dataset is limited. Here we consider the second use.

```python
kf = KFold(10)
train_score = []
val_score = []

for tid, vid in kf.split(range(len(y_train))):
    x0 = X_train.iloc[tid, :]
    y0 = y_train.iloc[tid]
    lg = Logit(y0, x0)
    model = lg.fit()
    score_tr = binary_classification_score(model, X_train.iloc[tid, :], y_train.iloc[tid])
    score_val = binary_classification_score(model, X_train.iloc[vid, :], y_train.iloc[vid])
    train_score.append(score_tr)
    val_score.append(score_val)

train_score = pd.DataFrame(train_score)
val_score = pd.DataFrame(val_score)
binary_classification_performance(train_score, val_score)
```
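`binary_classification_performance` is another mltools helper; presumably it summarizes the per-fold scores. A minimal stand-in (an assumption about what it does) is to compare per-metric means and standard deviations across folds:

```python
# Compare mean/std of each metric across the 10 folds (rough stand-in,
# assuming binary_classification_performance reports something similar)
summary = pd.concat(
    {'train': train_score.agg(['mean', 'std']),
     'validation': val_score.agg(['mean', 'std'])}
)
print(summary)
```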

On the test set:

`binary_classification_score(model, X_test, y_test)`

The output is

```
Accuracy     0.802632
AUC          0.887955
Log Loss    -6.816979
F1           0.835165
Precision    0.775510
Recall       0.904762
dtype: float64
```

The accuracy, AUC, F1, precision, and recall on the test set are similar to those on the training/validation sets, indicating little risk of overfitting.
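For completeness, `binary_classification_score` is an mltools helper as well; a plausible sklearn-based equivalent is sketched below (assuming `model` is a fitted statsmodels Logit whose `.predict` returns probabilities; the exact conventions, e.g. the sign of the log loss, clearly differ):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, roc_auc_score, log_loss,
                             f1_score, precision_score, recall_score)

def binary_scores(model, X, y, threshold=0.5):
    """Hypothetical stand-in for mltools' binary_classification_score."""
    prob = model.predict(X)                  # predicted P(y = 1)
    pred = (prob >= threshold).astype(int)   # hard labels for threshold-based metrics
    return pd.Series({
        'Accuracy':  accuracy_score(y, pred),
        'AUC':       roc_auc_score(y, prob),
        'Log Loss':  log_loss(y, prob),
        'F1':        f1_score(y, pred),
        'Precision': precision_score(y, pred),
        'Recall':    recall_score(y, pred),
    })
```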