Logistic Regression in R

06.15.2021

Intro

Logistic regression adapts a regression model to predict a binary response, such as yes or no, by modeling the probability of one of the two outcomes. This is helpful when we want to solve a classification problem that requires deciding between two classes. In this article, we will learn how to get started with logistic regression in R.
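
As a rough illustration of the idea, the sketch below uses base R's plogis function (the logistic function) to turn a linear score into a probability, and then applies a 0.5 cutoff to get a yes/no answer. The intercept, slope, inputs, and cutoff are made-up values chosen for illustration, not taken from any fitted model.

```r
# Hypothetical coefficients, chosen only for illustration
intercept <- -1
slope <- 0.5

x <- c(-4, 0, 4, 6)

# Linear predictor (the log-odds)
log_odds <- intercept + slope * x

# plogis() is the logistic function 1 / (1 + exp(-z)),
# so every value lands strictly between 0 and 1
prob <- plogis(log_odds)

# Turn probabilities into a binary answer with a 0.5 cutoff
answer <- ifelse(prob > 0.5, "yes", "no")

data.frame(x, log_odds, prob, answer)
```

Logistic regression estimates the intercept and slope from data instead of fixing them by hand as we did here.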

Data

For this tutorial, we will use the German Credit data, which comes from the UCI Machine Learning Repository and ships with the caret package. The goal is to predict Class, which tells us whether an observation is a Good or Bad credit risk.

# install.packages('caret', dependencies = TRUE)
library(caret)
## Loading required package: lattice

## Loading required package: ggplot2
data(GermanCredit)

str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
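
Before modeling, it can help to check how the two classes are distributed, since the share of the larger class is the accuracy we would get by always predicting that class. A minimal sketch, assuming the caret package is installed (the variable name counts is our own):

```r
library(caret)
data(GermanCredit)

# Count the observations in each class
counts <- table(GermanCredit$Class)
counts

# Share of each class; the larger share is the accuracy of a model
# that always predicts the majority class
prop.table(counts)
```

Any accuracy we report later is best read against this baseline.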

Basic Logistic Regression in R

We can fit a logistic regression model in R by using the glm function. We need to pass three arguments to this function. First is the R formula Class ~ ., which states that we want to predict Class using all the other columns as features. Then we pass our data set, GermanCredit. Finally, we set family = binomial, which tells glm to fit a basic logistic regression (the binomial family uses the logit link by default).

set.seed(1)

model = glm(
    Class ~ .,
    data = GermanCredit,
    family = binomial
)

summary(model)
## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6116  -0.7095   0.3752   0.6994   2.3410  
## 
## Coefficients: (13 not defined because of singularities)
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                             8.341e+00  1.409e+00   5.918 3.25e-09
## Duration                               -2.786e-02  9.296e-03  -2.997 0.002724
## Amount                                 -1.283e-04  4.444e-05  -2.887 0.003894
## InstallmentRatePercentage              -3.301e-01  8.828e-02  -3.739 0.000185
## ResidenceDuration                      -4.776e-03  8.641e-02  -0.055 0.955920
## Age                                     1.454e-02  9.222e-03   1.576 0.114982
## NumberExistingCredits                  -2.721e-01  1.895e-01  -1.436 0.151109
## NumberPeopleMaintenance                -2.647e-01  2.492e-01  -1.062 0.288249
## Telephone                              -3.000e-01  2.013e-01  -1.491 0.136060
## ForeignWorker                          -1.392e+00  6.258e-01  -2.225 0.026095
## CheckingAccountStatus.lt.0             -1.712e+00  2.322e-01  -7.373 1.66e-13
## CheckingAccountStatus.0.to.200         -1.337e+00  2.325e-01  -5.752 8.83e-09
## CheckingAccountStatus.gt.200           -7.462e-01  3.831e-01  -1.948 0.051419
## CheckingAccountStatus.none                     NA         NA      NA       NA
## CreditHistory.NoCredit.AllPaid         -1.436e+00  4.399e-01  -3.264 0.001099
## CreditHistory.ThisBank.AllPaid         -1.579e+00  4.381e-01  -3.605 0.000312
## CreditHistory.PaidDuly                 -8.497e-01  2.587e-01  -3.284 0.001022
## CreditHistory.Delay                    -5.826e-01  3.345e-01  -1.742 0.081540
## CreditHistory.Critical                         NA         NA      NA       NA
## Purpose.NewCar                         -1.489e+00  7.764e-01  -1.918 0.055163
## Purpose.UsedCar                         1.777e-01  8.081e-01   0.220 0.825966
## Purpose.Furniture.Equipment            -6.972e-01  7.844e-01  -0.889 0.374123
## Purpose.Radio.Television               -5.972e-01  7.841e-01  -0.762 0.446249
## Purpose.DomesticAppliance              -9.660e-01  1.077e+00  -0.897 0.369646
## Purpose.Repairs                        -1.272e+00  9.264e-01  -1.373 0.169598
## Purpose.Education                      -1.525e+00  8.453e-01  -1.804 0.071192
## Purpose.Vacation                               NA         NA      NA       NA
## Purpose.Retraining                      5.706e-01  1.431e+00   0.399 0.690107
## Purpose.Business                       -7.487e-01  7.998e-01  -0.936 0.349202
## Purpose.Other                                  NA         NA      NA       NA
## SavingsAccountBonds.lt.100             -9.467e-01  2.625e-01  -3.607 0.000310
## SavingsAccountBonds.100.to.500         -5.889e-01  3.493e-01  -1.686 0.091805
## SavingsAccountBonds.500.to.1000        -5.706e-01  4.492e-01  -1.270 0.203940
## SavingsAccountBonds.gt.1000             3.925e-01  5.644e-01   0.695 0.486765
## SavingsAccountBonds.Unknown                    NA         NA      NA       NA
## EmploymentDuration.lt.1                 6.691e-02  4.270e-01   0.157 0.875475
## EmploymentDuration.1.to.4               1.828e-01  4.105e-01   0.445 0.656049
## EmploymentDuration.4.to.7               8.310e-01  4.455e-01   1.866 0.062110
## EmploymentDuration.gt.7                 2.766e-01  4.134e-01   0.669 0.503410
## EmploymentDuration.Unemployed                  NA         NA      NA       NA
## Personal.Male.Divorced.Seperated       -3.671e-01  4.537e-01  -0.809 0.418448
## Personal.Female.NotSingle              -9.162e-02  3.118e-01  -0.294 0.768908
## Personal.Male.Single                    4.490e-01  3.152e-01   1.424 0.154345
## Personal.Male.Married.Widowed                  NA         NA      NA       NA
## Personal.Female.Single                         NA         NA      NA       NA
## OtherDebtorsGuarantors.None            -9.786e-01  4.243e-01  -2.307 0.021072
## OtherDebtorsGuarantors.CoApplicant     -1.415e+00  5.685e-01  -2.488 0.012834
## OtherDebtorsGuarantors.Guarantor               NA         NA      NA       NA
## Property.RealEstate                     7.304e-01  4.245e-01   1.721 0.085308
## Property.Insurance                      4.490e-01  4.130e-01   1.087 0.277005
## Property.CarOther                       5.359e-01  4.017e-01   1.334 0.182211
## Property.Unknown                               NA         NA      NA       NA
## OtherInstallmentPlans.Bank             -6.463e-01  2.391e-01  -2.703 0.006871
## OtherInstallmentPlans.Stores           -5.231e-01  3.754e-01  -1.393 0.163501
## OtherInstallmentPlans.None                     NA         NA      NA       NA
## Housing.Rent                           -6.839e-01  4.770e-01  -1.434 0.151657
## Housing.Own                            -2.402e-01  4.503e-01  -0.534 0.593687
## Housing.ForFree                                NA         NA      NA       NA
## Job.UnemployedUnskilled                 4.795e-01  6.623e-01   0.724 0.469086
## Job.UnskilledResident                  -5.666e-02  3.501e-01  -0.162 0.871450
## Job.SkilledEmployee                    -7.524e-02  2.845e-01  -0.264 0.791419
## Job.Management.SelfEmp.HighlyQualified         NA         NA      NA       NA
##                                           
## (Intercept)                            ***
## Duration                               ** 
## Amount                                 ** 
## InstallmentRatePercentage              ***
## ResidenceDuration                         
## Age                                       
## NumberExistingCredits                     
## NumberPeopleMaintenance                   
## Telephone                                 
## ForeignWorker                          *  
## CheckingAccountStatus.lt.0             ***
## CheckingAccountStatus.0.to.200         ***
## CheckingAccountStatus.gt.200           .  
## CheckingAccountStatus.none                
## CreditHistory.NoCredit.AllPaid         ** 
## CreditHistory.ThisBank.AllPaid         ***
## CreditHistory.PaidDuly                 ** 
## CreditHistory.Delay                    .  
## CreditHistory.Critical                    
## Purpose.NewCar                         .  
## Purpose.UsedCar                           
## Purpose.Furniture.Equipment               
## Purpose.Radio.Television                  
## Purpose.DomesticAppliance                 
## Purpose.Repairs                           
## Purpose.Education                      .  
## Purpose.Vacation                          
## Purpose.Retraining                        
## Purpose.Business                          
## Purpose.Other                             
## SavingsAccountBonds.lt.100             ***
## SavingsAccountBonds.100.to.500         .  
## SavingsAccountBonds.500.to.1000           
## SavingsAccountBonds.gt.1000               
## SavingsAccountBonds.Unknown               
## EmploymentDuration.lt.1                   
## EmploymentDuration.1.to.4                 
## EmploymentDuration.4.to.7              .  
## EmploymentDuration.gt.7                   
## EmploymentDuration.Unemployed             
## Personal.Male.Divorced.Seperated          
## Personal.Female.NotSingle                 
## Personal.Male.Single                      
## Personal.Male.Married.Widowed             
## Personal.Female.Single                    
## OtherDebtorsGuarantors.None            *  
## OtherDebtorsGuarantors.CoApplicant     *  
## OtherDebtorsGuarantors.Guarantor          
## Property.RealEstate                    .  
## Property.Insurance                        
## Property.CarOther                         
## Property.Unknown                          
## OtherInstallmentPlans.Bank             ** 
## OtherInstallmentPlans.Stores              
## OtherInstallmentPlans.None                
## Housing.Rent                              
## Housing.Own                               
## Housing.ForFree                           
## Job.UnemployedUnskilled                   
## Job.UnskilledResident                     
## Job.SkilledEmployee                       
## Job.Management.SelfEmp.HighlyQualified    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.73  on 999  degrees of freedom
## Residual deviance:  895.82  on 951  degrees of freedom
## AIC: 993.82
## 
## Number of Fisher Scoring iterations: 5
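
To read a coefficient table like this one, it helps to know two things: when the response is a factor, glm treats the first level as failure and models the probability of the other level, and the estimates are on the log-odds scale. The sketch below shows this on made-up data (the duration variable and the true coefficients 2 and -0.05 are assumptions for illustration, not the GermanCredit values).

```r
set.seed(1)

# Made-up data: a factor outcome with levels "Bad" (first) and "Good" (second)
n <- 500
duration <- runif(n, 6, 60)
p_good <- plogis(2 - 0.05 * duration)
class <- factor(ifelse(runif(n) < p_good, "Good", "Bad"),
                levels = c("Bad", "Good"))

fit <- glm(class ~ duration, family = binomial)

# Coefficients are log-odds of the second level ("Good")
coef(fit)

# Exponentiate to get odds ratios: the multiplicative change in the
# odds of "Good" for a one-unit increase in the predictor
exp(coef(fit))
```

A negative estimate corresponds to an odds ratio below 1, meaning that larger values of the predictor make Good less likely.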

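The summary for the GermanCredit model is long because there are many features. Note the 13 coefficients reported as NA ("not defined because of singularities"): GermanCredit stores every level of each categorical variable as its own 0/1 column, so at least one column in each group is a linear combination of the others and the intercept, and glm cannot estimate a separate coefficient for it.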
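
The NA coefficients in a summary like this can be reproduced on a small scale. The sketch below uses made-up data with a single three-level factor, encoded as one 0/1 column per level in the same way GermanCredit is, and shows that glm reports one of the three as NA. The variable names and the data are hypothetical.

```r
set.seed(1)

# Made-up data: a three-level factor and a binary outcome
n <- 200
housing <- factor(sample(c("Rent", "Own", "ForFree"), n, replace = TRUE))
y <- rbinom(n, size = 1, prob = 0.6)

# One 0/1 column per level (no level dropped)
dummies <- as.data.frame(model.matrix(~ housing - 1))
toy <- cbind(y = y, dummies)

fit <- glm(y ~ ., data = toy, family = binomial)

# The three dummy columns always sum to 1, which duplicates the
# intercept, so one coefficient cannot be estimated and is NA
coef(fit)
```

Dropping one column per group, or letting R build the dummy columns from a factor, avoids the redundancy.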

Basic Caret Logistic Regression Model

Now that we know how to fit a basic logistic regression model, let’s see how to do the same with the caret package. This package has many features that help us train and build models.

We can fit a logistic regression model using the train function. Again we pass our formula, Class ~ ., and our data set, data = GermanCredit. For the third argument, method = 'glm', we tell caret to fit the model with glm.

set.seed(1)

model.glm <- train(
  Class ~ .,
  data = GermanCredit,
  method = 'glm'
)
model.glm
## Generalized Linear Model 
## 
## 1000 samples
##   61 predictor
##    2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
## Resampling results:
## 
##   Accuracy  Kappa   
##   0.744237  0.358288

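Caret gives us back two metrics to evaluate the fit of our model, Accuracy and Kappa. We can see that we got an accuracy of 0.744237, or about 74%. Because we did not specify a resampling scheme, caret estimated these metrics with its default of 25 bootstrap resamples, as the output shows.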
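
To make the two metrics concrete, here is a minimal base R sketch that computes accuracy and Cohen's kappa from a confusion matrix. The counts are made up for illustration and are not results from the GermanCredit model.

```r
# Made-up confusion matrix counts, for illustration only
cm <- matrix(
  c(40,  10,
    20, 130),
  nrow = 2, byrow = TRUE,
  dimnames = list(predicted = c("Bad", "Good"), actual = c("Bad", "Good"))
)

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement
kappa_stat <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = kappa_stat)
```

Kappa is lower than accuracy here because part of the agreement would be expected by chance alone, which matters when one class is much more common than the other.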

Cross Validation

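In practice, we don’t normally evaluate a model on the same data we used to train it. It is common to use a resampling strategy like k-fold cross-validation, which splits the data into k parts, trains the model on k - 1 of them, and evaluates it on the part that was held out, repeating until every part has been held out once. The averaged results estimate how the model will perform on new data and, for models with tuning parameters, let caret pick the best settings. Caret makes this easy with the trainControl function.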

We will use 10-fold cross-validation in this tutorial. To do this we pass two arguments to trainControl: method = "cv" and number = 10 (for 10-fold). We store the result in a variable to pass to our train function later.

set.seed(1)
ctrl <- trainControl(
  method = "cv",
  number = 10
)
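
To show roughly what 10-fold cross-validation does behind the scenes, the sketch below runs it by hand with base R on made-up data. It is a simplification under stated assumptions: the data, the 0.5 cutoff, and the purely random fold assignment are our own choices, and caret's own fold assignment may differ in detail.

```r
set.seed(1)

# Made-up data: one predictor and a binary factor outcome
n <- 300
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(1.5 * x), "Good", "Bad"),
            levels = c("Bad", "Good"))
toy <- data.frame(x = x, y = y)

# Randomly assign each row to one of k folds of equal size
k <- 10
fold <- sample(rep(1:k, length.out = n))

fold_accuracy <- sapply(1:k, function(i) {
  train_rows <- toy[fold != i, ]
  test_rows  <- toy[fold == i, ]

  # Fit on the other k - 1 folds
  fit <- glm(y ~ x, data = train_rows, family = binomial)

  # Predict the held-out fold; probabilities refer to the level "Good"
  prob <- predict(fit, newdata = test_rows, type = "response")
  pred <- ifelse(prob > 0.5, "Good", "Bad")

  mean(pred == test_rows$y)
})

# The cross-validated estimate is the average over the held-out folds
mean(fold_accuracy)
```

Each row is used for testing exactly once, so the averaged accuracy never scores a row with a model that was trained on it.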

Now, we can retrain our model and pass the trainControl result to the trControl parameter. Notice that our call has added trControl = ctrl.

set.seed(1)

model.glm2 <- train(
  Class ~ .,
  data = GermanCredit,
  method = 'glm',
  trControl = ctrl
)
model.glm2
## Generalized Linear Model 
## 
## 1000 samples
##   61 predictor
##    2 classes: 'Bad', 'Good' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.754     0.3753967

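The cross-validated accuracy of 0.754 is slightly higher than the bootstrap estimate of about 0.744. Both numbers estimate the performance of the same logistic regression model; only the resampling method used to measure it has changed.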

Predicting With The Model

Now that we have a model, let’s see how we can make predictions with it. We first create a copy of our data set that contains only the features. In real life we would expect to have new observations instead. We can pass these data and our trained model to the predict function, which returns Good or Bad for each observation. The rank-deficiency warning in the output stems from the redundant dummy columns in the data.

test.features = subset(GermanCredit, select=-c(Class))

predictions = predict(model.glm2, newdata = test.features)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
predictions[1:10]
##  [1] Good Bad  Good Good Bad  Good Good Good Good Bad 
## Levels: Bad Good
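
Predicting on the same rows we trained on tends to overstate how well the model works. As a hedged sketch of one alternative, the code below holds out part of the data before training and predicts only on those unseen rows. It assumes the caret package is installed; the 80/20 split and the variable names are arbitrary choices of ours, and the same rank-deficiency warnings should be expected.

```r
library(caret)
data(GermanCredit)

set.seed(1)

# Hold out 20% of the rows, sampled within each level of Class
train_idx <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
train_set <- GermanCredit[train_idx, ]
test_set  <- GermanCredit[-train_idx, ]

ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Class ~ ., data = train_set, method = "glm", trControl = ctrl)

# Predict rows the model has not seen during training
test_pred <- predict(fit, newdata = test_set)

# Cross-tabulate predictions against the true labels
table(predicted = test_pred, actual = test_set$Class)

# Share of held-out rows predicted correctly
mean(test_pred == test_set$Class)
```

The held-out accuracy gives a second check on the cross-validated estimate reported by train.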