Logistic regression adapts a regression model to predict a binary response, i.e. yes or no. This is helpful when we want to solve a classification problem and need to decide between two classes. In this article, we will learn how to get started with logistic regression in R.
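As a minimal sketch of how a regression output becomes a yes/no answer: the model produces a number on the log-odds scale, the logistic function maps it to a probability between 0 and 1, and a cutoff turns that probability into a class. The log-odds values and the 0.5 cutoff below are assumptions chosen for illustration, not values from this article.

```r
# Hypothetical log-odds (linear predictor values) for three observations
log_odds <- c(-2, 0, 2)

# plogis() is base R's logistic function: 1 / (1 + exp(-x))
prob <- plogis(log_odds)

# Turn probabilities into a yes/no answer with an assumed 0.5 cutoff
label <- ifelse(prob > 0.5, "yes", "no")

print(data.frame(log_odds, prob, label))
```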
For this tutorial, we will use the German Credit data, which comes from
the UCI Machine Learning Repository and ships with the caret package.
The goal here will be to predict Class, which tells us whether an
observation is Good or Bad.
# install.packages('caret', dependencies = TRUE)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(GermanCredit)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "Bad","Good": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
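Before fitting anything, it can help to check how the two classes are balanced, because an accuracy figure is only meaningful relative to the share of the larger class. The variable names class_counts and class_share are introduced here for illustration.

```r
library(caret)
data(GermanCredit)

# Count observations per class, then convert the counts to proportions
class_counts <- table(GermanCredit$Class)
class_share <- prop.table(class_counts)

print(class_counts)
print(round(class_share, 2))
```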
We can fit a logistic regression model in R by using the glm function.
We need to pass three arguments to this function. The first is the R
formula Class ~ ., which states that we want to predict Class by using
all the other columns as features. The second is our data set,
GermanCredit. Finally, we tell glm to use the binomial family, which
fits a standard logistic regression (the binomial family uses the logit
link by default). Because Class is a factor, glm treats its first level,
Bad, as the failure case and models the probability of Good.
set.seed(1)
model <- glm(
  Class ~ .,
  data = GermanCredit,
  family = binomial
)
summary(model)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6116 -0.7095 0.3752 0.6994 2.3410
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.341e+00 1.409e+00 5.918 3.25e-09
## Duration -2.786e-02 9.296e-03 -2.997 0.002724
## Amount -1.283e-04 4.444e-05 -2.887 0.003894
## InstallmentRatePercentage -3.301e-01 8.828e-02 -3.739 0.000185
## ResidenceDuration -4.776e-03 8.641e-02 -0.055 0.955920
## Age 1.454e-02 9.222e-03 1.576 0.114982
## NumberExistingCredits -2.721e-01 1.895e-01 -1.436 0.151109
## NumberPeopleMaintenance -2.647e-01 2.492e-01 -1.062 0.288249
## Telephone -3.000e-01 2.013e-01 -1.491 0.136060
## ForeignWorker -1.392e+00 6.258e-01 -2.225 0.026095
## CheckingAccountStatus.lt.0 -1.712e+00 2.322e-01 -7.373 1.66e-13
## CheckingAccountStatus.0.to.200 -1.337e+00 2.325e-01 -5.752 8.83e-09
## CheckingAccountStatus.gt.200 -7.462e-01 3.831e-01 -1.948 0.051419
## CheckingAccountStatus.none NA NA NA NA
## CreditHistory.NoCredit.AllPaid -1.436e+00 4.399e-01 -3.264 0.001099
## CreditHistory.ThisBank.AllPaid -1.579e+00 4.381e-01 -3.605 0.000312
## CreditHistory.PaidDuly -8.497e-01 2.587e-01 -3.284 0.001022
## CreditHistory.Delay -5.826e-01 3.345e-01 -1.742 0.081540
## CreditHistory.Critical NA NA NA NA
## Purpose.NewCar -1.489e+00 7.764e-01 -1.918 0.055163
## Purpose.UsedCar 1.777e-01 8.081e-01 0.220 0.825966
## Purpose.Furniture.Equipment -6.972e-01 7.844e-01 -0.889 0.374123
## Purpose.Radio.Television -5.972e-01 7.841e-01 -0.762 0.446249
## Purpose.DomesticAppliance -9.660e-01 1.077e+00 -0.897 0.369646
## Purpose.Repairs -1.272e+00 9.264e-01 -1.373 0.169598
## Purpose.Education -1.525e+00 8.453e-01 -1.804 0.071192
## Purpose.Vacation NA NA NA NA
## Purpose.Retraining 5.706e-01 1.431e+00 0.399 0.690107
## Purpose.Business -7.487e-01 7.998e-01 -0.936 0.349202
## Purpose.Other NA NA NA NA
## SavingsAccountBonds.lt.100 -9.467e-01 2.625e-01 -3.607 0.000310
## SavingsAccountBonds.100.to.500 -5.889e-01 3.493e-01 -1.686 0.091805
## SavingsAccountBonds.500.to.1000 -5.706e-01 4.492e-01 -1.270 0.203940
## SavingsAccountBonds.gt.1000 3.925e-01 5.644e-01 0.695 0.486765
## SavingsAccountBonds.Unknown NA NA NA NA
## EmploymentDuration.lt.1 6.691e-02 4.270e-01 0.157 0.875475
## EmploymentDuration.1.to.4 1.828e-01 4.105e-01 0.445 0.656049
## EmploymentDuration.4.to.7 8.310e-01 4.455e-01 1.866 0.062110
## EmploymentDuration.gt.7 2.766e-01 4.134e-01 0.669 0.503410
## EmploymentDuration.Unemployed NA NA NA NA
## Personal.Male.Divorced.Seperated -3.671e-01 4.537e-01 -0.809 0.418448
## Personal.Female.NotSingle -9.162e-02 3.118e-01 -0.294 0.768908
## Personal.Male.Single 4.490e-01 3.152e-01 1.424 0.154345
## Personal.Male.Married.Widowed NA NA NA NA
## Personal.Female.Single NA NA NA NA
## OtherDebtorsGuarantors.None -9.786e-01 4.243e-01 -2.307 0.021072
## OtherDebtorsGuarantors.CoApplicant -1.415e+00 5.685e-01 -2.488 0.012834
## OtherDebtorsGuarantors.Guarantor NA NA NA NA
## Property.RealEstate 7.304e-01 4.245e-01 1.721 0.085308
## Property.Insurance 4.490e-01 4.130e-01 1.087 0.277005
## Property.CarOther 5.359e-01 4.017e-01 1.334 0.182211
## Property.Unknown NA NA NA NA
## OtherInstallmentPlans.Bank -6.463e-01 2.391e-01 -2.703 0.006871
## OtherInstallmentPlans.Stores -5.231e-01 3.754e-01 -1.393 0.163501
## OtherInstallmentPlans.None NA NA NA NA
## Housing.Rent -6.839e-01 4.770e-01 -1.434 0.151657
## Housing.Own -2.402e-01 4.503e-01 -0.534 0.593687
## Housing.ForFree NA NA NA NA
## Job.UnemployedUnskilled 4.795e-01 6.623e-01 0.724 0.469086
## Job.UnskilledResident -5.666e-02 3.501e-01 -0.162 0.871450
## Job.SkilledEmployee -7.524e-02 2.845e-01 -0.264 0.791419
## Job.Management.SelfEmp.HighlyQualified NA NA NA NA
##
## (Intercept) ***
## Duration **
## Amount **
## InstallmentRatePercentage ***
## ResidenceDuration
## Age
## NumberExistingCredits
## NumberPeopleMaintenance
## Telephone
## ForeignWorker *
## CheckingAccountStatus.lt.0 ***
## CheckingAccountStatus.0.to.200 ***
## CheckingAccountStatus.gt.200 .
## CheckingAccountStatus.none
## CreditHistory.NoCredit.AllPaid **
## CreditHistory.ThisBank.AllPaid ***
## CreditHistory.PaidDuly **
## CreditHistory.Delay .
## CreditHistory.Critical
## Purpose.NewCar .
## Purpose.UsedCar
## Purpose.Furniture.Equipment
## Purpose.Radio.Television
## Purpose.DomesticAppliance
## Purpose.Repairs
## Purpose.Education .
## Purpose.Vacation
## Purpose.Retraining
## Purpose.Business
## Purpose.Other
## SavingsAccountBonds.lt.100 ***
## SavingsAccountBonds.100.to.500 .
## SavingsAccountBonds.500.to.1000
## SavingsAccountBonds.gt.1000
## SavingsAccountBonds.Unknown
## EmploymentDuration.lt.1
## EmploymentDuration.1.to.4
## EmploymentDuration.4.to.7 .
## EmploymentDuration.gt.7
## EmploymentDuration.Unemployed
## Personal.Male.Divorced.Seperated
## Personal.Female.NotSingle
## Personal.Male.Single
## Personal.Male.Married.Widowed
## Personal.Female.Single
## OtherDebtorsGuarantors.None *
## OtherDebtorsGuarantors.CoApplicant *
## OtherDebtorsGuarantors.Guarantor
## Property.RealEstate .
## Property.Insurance
## Property.CarOther
## Property.Unknown
## OtherInstallmentPlans.Bank **
## OtherInstallmentPlans.Stores
## OtherInstallmentPlans.None
## Housing.Rent
## Housing.Own
## Housing.ForFree
## Job.UnemployedUnskilled
## Job.UnskilledResident
## Job.SkilledEmployee
## Job.Management.SelfEmp.HighlyQualified
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.73 on 999 degrees of freedom
## Residual deviance: 895.82 on 951 degrees of freedom
## AIC: 993.82
##
## Number of Fisher Scoring iterations: 5
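The estimates in a glm summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. The sketch below refits a deliberately small model with only Duration and Amount (a choice made here for brevity, not part of the original walkthrough), so its numbers will differ from the full summary.

```r
library(caret)
data(GermanCredit)

# A deliberately small model (assumption: two predictors chosen for brevity)
small_model <- glm(Class ~ Duration + Amount,
                   data = GermanCredit,
                   family = binomial)

# Coefficients are log-odds of the second factor level, "Good"
log_odds_coef <- coef(small_model)

# exp() turns them into odds ratios: the multiplicative change in the odds
# of "Good" for a one-unit increase in the predictor
odds_ratios <- exp(log_odds_coef)
print(odds_ratios)
```

An odds ratio below 1 means the odds of Good fall as the predictor increases; above 1 means they rise.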
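The summary above is long because the data set has many features. The rows filled with NA are coefficients that glm could not estimate, which the summary reports as "13 not defined because of singularities".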
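The singularities reported in the summary are most likely caused by how the categorical variables are encoded: each one is stored as one indicator column per level, so within a group the columns sum to 1 and duplicate the intercept. The sketch below checks this for one group, CheckingAccountStatus, and counts the undefined coefficients; the names checking_cols, indicator_sums, and n_undefined are introduced for illustration.

```r
library(caret)
data(GermanCredit)

# All indicator columns that encode the checking account status
checking_cols <- grep("^CheckingAccountStatus", names(GermanCredit), value = TRUE)

# If each row has exactly one indicator set, the columns always sum to 1,
# which duplicates the information already carried by the intercept
indicator_sums <- rowSums(GermanCredit[, checking_cols])
print(table(indicator_sums))

# Count the coefficients glm could not estimate in the full model
full_model <- glm(Class ~ ., data = GermanCredit, family = binomial)
n_undefined <- sum(is.na(coef(full_model)))
print(n_undefined)
```

Dropping one indicator column per group before fitting is a common way to avoid the NA rows, at the cost of choosing a reference level by hand.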
Now that we know how to fit a basic logistic regression model, let’s see how to do the same with the caret package. This package has many features that help us train and build models.
We can fit a logistic regression model using the train function. Again
we pass our formula, Class ~ ., and our data set, data = GermanCredit.
For the third parameter, we tell caret to use the glm method.
set.seed(1)
model.glm <- train(
  Class ~ .,
  data = GermanCredit,
  method = 'glm'
)
model.glm
## Generalized Linear Model
##
## 1000 samples
## 61 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.744237 0.358288
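Caret gives us back two metrics to evaluate our model, Accuracy and Kappa, both estimated by resampling (here, 25 bootstrap repetitions). We can see that we got an accuracy of 0.744237, or about 74%. Kappa measures how well the predictions agree with the true labels after correcting for the agreement we would expect by chance.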
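To make the two metrics concrete, here is a small worked sketch that computes accuracy and Cohen's kappa by hand from a confusion table. The counts are hypothetical and are not results from the model above.

```r
# Hypothetical confusion counts (not results from the model in this article).
# matrix() fills column by column: first the "actual Bad" column, then "actual Good".
conf <- matrix(c(150, 150,
                 100, 600),
               nrow = 2,
               dimnames = list(predicted = c("Bad", "Good"),
                               actual = c("Bad", "Good")))

n <- sum(conf)

# Observed agreement: share of predictions on the diagonal (this is accuracy)
observed <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible
cohen_kappa <- (observed - expected) / (1 - expected)

print(c(accuracy = observed, expected = expected, kappa = cohen_kappa))
```

With these counts the accuracy is 0.75, the chance agreement is 0.6, and kappa is 0.375, which shows how a seemingly good accuracy can correspond to a modest kappa when one class dominates.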
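In practice, we don’t normally evaluate a model on the same data it was
trained on. It is common to use a data partitioning strategy like k-fold
cross-validation, which splits the data into k parts, trains the model
on k - 1 of them, and evaluates it on the remaining part, repeating this
so that each part is held out once. Averaging the results gives a more
honest estimate of performance on unseen data, and when a model has
tuning parameters caret uses these results to pick the best setting.
Caret makes this easy with the trainControl function.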
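To see what a 10-fold split looks like, the sketch below uses caret's createFolds function to partition the row indices by hand. This is only an illustration of the idea; train does the splitting for us when we pass a trainControl object. The names folds, holdout_rows, and fitting_rows are introduced here.

```r
library(caret)
data(GermanCredit)

set.seed(1)
# Split the row indices into 10 folds, stratified on the outcome
folds <- createFolds(GermanCredit$Class, k = 10, list = TRUE, returnTrain = FALSE)

# Each fold is held out once; the other nine are used for fitting
fold_sizes <- lengths(folds)
print(fold_sizes)

# Rows used for evaluation and for fitting in the first round
holdout_rows <- folds[[1]]
fitting_rows <- setdiff(seq_len(nrow(GermanCredit)), holdout_rows)
```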
We will use 10-fold cross-validation in this tutorial. To do this we
need to pass two parameters to trainControl: method = "cv" and
number = 10 (for 10-fold). We store this result in a variable to pass to
our train function later.
set.seed(1)
ctrl <- trainControl(
  method = "cv",
  number = 10
)
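A common variant is repeated k-fold cross-validation, which reruns the whole 10-fold procedure several times with different random splits and averages the results. A minimal sketch, assuming three repeats (an arbitrary choice for illustration) and a hypothetical variable name ctrl_repeated:

```r
library(caret)

# Repeated k-fold: 10 folds, repeated 3 times with different random splits
# (repeats = 3 is an arbitrary choice for illustration)
ctrl_repeated <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3
)
```

This object can be passed to train through trControl in the same way as ctrl; it costs roughly three times as many model fits in exchange for a less noisy estimate.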
Now we can retrain our model and pass the trainControl result to the
trControl parameter. Notice that our call now includes
trControl = ctrl.
set.seed(1)
model.glm2 <- train(
  Class ~ .,
  data = GermanCredit,
  method = 'glm',
  trControl = ctrl
)
model.glm2
## Generalized Linear Model
##
## 1000 samples
## 61 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
## Resampling results:
##
## Accuracy Kappa
## 0.754 0.3753967
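The cross-validated accuracy estimate, 0.754, is slightly higher than the bootstrap estimate of about 0.744. The model itself has not changed; only the way its accuracy is estimated has, so the difference reflects the resampling method rather than a better fit.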
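A single averaged accuracy hides how much the estimate varies from fold to fold. The sketch below refits the cross-validated model and inspects the per-fold results stored in the resample element of the object returned by train. The name cv_model is introduced here, and warnings are silenced on the assumption that they only concern the rank-deficient fit.

```r
library(caret)
data(GermanCredit)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

# Rank-deficiency warnings may be emitted because of the redundant
# indicator columns, so they are silenced here for readability
cv_model <- suppressWarnings(
  train(Class ~ ., data = GermanCredit, method = "glm", trControl = ctrl)
)

# One row per held-out fold, with its Accuracy and Kappa
fold_results <- cv_model$resample
print(fold_results)

# Mean and spread of accuracy across the folds
print(mean(fold_results$Accuracy))
print(sd(fold_results$Accuracy))
```

If the standard deviation across folds is larger than the gap between two estimates, that gap is probably not meaningful.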
Now that we have a model, let’s see how we can make predictions with it.
We first create a subset of our data set that contains only the
features. In real life we would expect to have new observations
instead. We can pass these data and our trained model above to the
predict function, and this will return Good or Bad for each
observation. The call may warn about a rank-deficient fit; this happens
because some coefficients could not be estimated (the 13 singularities
in the glm summary).
test.features <- subset(GermanCredit, select = -c(Class))
predictions <- predict(model.glm2, newdata = test.features)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
predictions[1:10]
## [1] Good Bad Good Good Bad Good Good Good Good Bad
## Levels: Bad Good
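Predicting on the rows the model was trained on gives an optimistic picture. A more realistic check, sketched below, holds out part of the data, trains on the rest, and compares predictions with the true labels on the held-out rows. The 80/20 split and all variable names are assumptions made for this illustration.

```r
library(caret)
data(GermanCredit)

set.seed(1)
# Hold out 20% of the rows, stratified on Class (the 80/20 split is an assumption)
train_rows <- createDataPartition(GermanCredit$Class, p = 0.8, list = FALSE)
train_data <- GermanCredit[train_rows, ]
test_data  <- GermanCredit[-train_rows, ]

# Warnings about the rank-deficient fit are silenced for readability
holdout_model <- suppressWarnings(
  train(Class ~ ., data = train_data, method = "glm",
        trControl = trainControl(method = "cv", number = 10))
)

# Predicted classes and class probabilities for rows the model has not seen
test_pred <- suppressWarnings(predict(holdout_model, newdata = test_data))
test_prob <- suppressWarnings(
  predict(holdout_model, newdata = test_data, type = "prob")
)

# Cross-tabulate predictions against the true labels and compute accuracy
confusion <- table(predicted = test_pred, actual = test_data$Class)
holdout_accuracy <- mean(test_pred == test_data$Class)

print(confusion)
print(holdout_accuracy)
print(head(test_prob))
```

Asking for type = "prob" returns one probability column per class, which is useful when a cutoff other than 0.5 is wanted.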