Logistic regression is a predictive algorithm that, like linear regression, uses independent variables to predict a dependent variable; the difference is that the dependent variable must be categorical.
Linear and logistic regression are the most basic and most commonly used forms of regression. The essential difference between the two is that logistic regression is used when the dependent variable is binary, whereas linear regression is used when the dependent variable is continuous and the regression line is assumed to be linear.
There are three types of logistic regression, depending on the nature of the categorical response variable: binary, multinomial, and ordinal.
Here we will discuss binary logistic regression.
The multiple binary logistic regression model is the following:
\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \dots + \beta_kX_k\]
In the above equation, \(p\) is the probability that an observation is in the specified category of the binary \(Y\) variable (always between 0 and 1).
The \(X_i\) are explanatory variables, which can be discrete, continuous, or a combination of both.
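Solving the equation above for \(p\) gives the sigmoid (inverse-logit) transform \(p = \frac{1}{1+e^{-\eta}}\), where \(\eta\) is the linear predictor. A minimal R sketch with made-up coefficients, just to illustrate the mapping from log-odds to probabilities:

# Sigmoid: map a linear predictor (log-odds) to a probability in (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical coefficients for a one-predictor model (illustration only)
beta0 <- -1.5; beta1 <- 0.8
x <- c(0, 1, 2, 3)
eta <- beta0 + beta1 * x    # log-odds scale
round(inv_logit(eta), 3)    # probability scale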
To evaluate the performance of a logistic regression model, we consider a few metrics; here we will use the confusion matrix and accuracy.
Using the Titanic dataset, we perform binary logistic regression. The data can be downloaded from Kaggle (https://www.kaggle.com/c/titanic).
Feature description:
- PassengerId: unique identifier for each passenger
- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: passenger name
- Sex: sex of the passenger
- Age: age in years
- SibSp: number of siblings/spouses aboard
- Parch: number of parents/children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
The goal of this analysis is to predict whether a passenger survived the sinking of the Titanic.
The train data has 891 rows and 12 columns; the test data has 418 rows and 11 columns. The 'Survived' column is not present in the test data, so we use the model to predict survival for the test data.
train <- read.csv("/Users/subhalaxmirout/DATA 621/train.csv")
test <- read.csv("/Users/subhalaxmirout/DATA 621/test.csv")
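A quick sanity check that the dimensions match the counts quoted above:

dim(train)  # 891 rows, 12 columns
dim(test)   # 418 rows, 11 columns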
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
There are many missing/NA values present in the dataset, so let's clean the data by replacing NAs in Age (and Fare) with their median values.
# Impute missing Age in the training set (177 NAs) with the median
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
# Impute missing Age and Fare in the test set with their medians
test$Age[is.na(test$Age)] <- median(test$Age, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
visdat::vis_miss(test)
Let's see the data distribution of Age and Fare.
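The distribution plots themselves are not reproduced here; a minimal base-R sketch would be:

# Histograms of Age and Fare in the training data
par(mfrow = c(1, 2))
hist(train$Age, main = "Age", xlab = "Age")
hist(train$Fare, main = "Fare", xlab = "Fare")
par(mfrow = c(1, 1))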
Features that are assumed to be insignificant for predicting survival, such as 'PassengerId', 'Ticket', 'Embarked', 'Cabin', and 'Name', will be excluded. The features left for analysis are: Sex, Age, Pclass, SibSp, Parch, Fare, and Survived.
train <- train %>% dplyr::select(c(-PassengerId,-Name,-Ticket,-Embarked,-Cabin))
test <- test %>% dplyr::select(c(-PassengerId,-Name,-Ticket,-Embarked,-Cabin))
Convert the Sex column to 2 or 1: if female then 2, else 1.
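The recoding step itself is not shown in the source; a sketch consistent with the 1/2 coding visible in the prediction table further below:

# Recode Sex as numeric: female -> 2, male -> 1
train$Sex <- ifelse(train$Sex == "female", 2, 1)
test$Sex <- ifelse(test$Sex == "female", 2, 1)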
Apply binomial logistic regression.
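The fitting call is not shown in the source, but it can be read off the Call: line of the output below; assuming the model object is named model1:

# Fit a binomial logistic regression on all remaining predictors
model1 <- glm(Survived ~ ., family = binomial, data = train)
summary(model1)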
##
## Call:
## glm(formula = Survived ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7091 -0.6000 -0.4237 0.6221 2.4039
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.580705 0.505500 -1.149 0.25065
## Pclass -1.087436 0.139402 -7.801 6.15e-15 ***
## Sex 2.760875 0.198793 13.888 < 2e-16 ***
## Age -0.039398 0.007801 -5.051 4.40e-07 ***
## SibSp -0.348785 0.108990 -3.200 0.00137 **
## Parch -0.106709 0.117230 -0.910 0.36269
## Fare 0.002846 0.002359 1.207 0.22759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 789.18 on 884 degrees of freedom
## AIC: 803.18
##
## Number of Fisher Scoring iterations: 5
Create another model, excluding Parch and Fare due to their high p-values.
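The refitting call can again be read off the Call: line below; the name model2 matches the prediction code later on:

# Refit, dropping the two insignificant predictors
model2 <- glm(Survived ~ . - Parch - Fare, family = binomial, data = train)
summary(model2)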
##
## Call:
## glm(formula = Survived ~ . - Parch - Fare, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6817 -0.6029 -0.4159 0.6161 2.4327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.301929 0.454477 -0.664 0.506471
## Pclass -1.175654 0.120073 -9.791 < 2e-16 ***
## Sex 2.739477 0.193984 14.122 < 2e-16 ***
## Age -0.039553 0.007761 -5.096 3.47e-07 ***
## SibSp -0.354433 0.103392 -3.428 0.000608 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 791.23 on 886 degrees of freedom
## AIC: 801.23
##
## Number of Fisher Scoring iterations: 5
train$prediction <- predict(model2, newdata = train, type = "response")  # fitted probabilities (predict.glm takes newdata, not data)
train$prediction <- ifelse(train$prediction >=0.6, 1, 0)
table(factor(train$prediction), factor(train$Survived))
##
## 0 1
## 0 504 123
## 1 45 219
\[Accuracy = \frac{TP+TN}{TP+TN+FP+FN}\]
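The calculation itself is omitted in the source; one way to reproduce the printed value is as the share of correct classifications, which equals (TP + TN) / (TP + TN + FP + FN):

# Proportion of correct predictions, as a percentage
round(mean(train$prediction == train$Survived) * 100, 2)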
## [1] 81.14
d <- predict(model2, newdata = test, type = "response")  # predicted probabilities; without type = "response", predict() returns log-odds, which must not be compared to the 0.6 probability cutoff
test$Survived <- ifelse(d >= 0.6, 1, 0)
kable(test[1:20,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), latex_options = "scale_down")

| Pclass | Sex | Age | SibSp | Parch | Fare | Survived |
|---|---|---|---|---|---|---|
| 3 | 1 | 34.5 | 0 | 0 | 7.8292 | 0 |
| 3 | 2 | 47.0 | 1 | 0 | 7.0000 | 0 |
| 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 0 |
| 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 0 |
| 3 | 2 | 22.0 | 1 | 1 | 12.2875 | 0 |
| 3 | 1 | 14.0 | 0 | 0 | 9.2250 | 0 |
| 3 | 2 | 30.0 | 0 | 0 | 7.6292 | 0 |
| 2 | 1 | 26.0 | 1 | 1 | 29.0000 | 0 |
| 3 | 2 | 18.0 | 0 | 0 | 7.2292 | 1 |
| 3 | 1 | 21.0 | 2 | 0 | 24.1500 | 0 |
| 3 | 1 | 27.0 | 0 | 0 | 7.8958 | 0 |
| 1 | 1 | 46.0 | 0 | 0 | 26.0000 | 0 |
| 1 | 2 | 23.0 | 1 | 0 | 82.2667 | 1 |
| 2 | 1 | 63.0 | 1 | 0 | 26.0000 | 0 |
| 1 | 2 | 47.0 | 1 | 0 | 61.1750 | 1 |
| 2 | 2 | 24.0 | 1 | 0 | 27.7208 | 1 |
| 2 | 1 | 35.0 | 0 | 0 | 12.3500 | 0 |
| 3 | 1 | 21.0 | 0 | 0 | 7.2250 | 0 |
| 3 | 2 | 27.0 | 1 | 0 | 7.9250 | 0 |
| 3 | 2 | 45.0 | 0 | 0 | 7.2250 | 0 |
From the logistic regression analysis, we learn that Sex, Pclass, Age, and SibSp are significant predictors of survival: being female (Sex = 2) strongly increases the odds of survival, while a higher passenger class number (i.e., lower class), greater age, and more siblings/spouses aboard all decrease them. Parch and Fare add little; dropping them yields a slightly lower AIC (801.23 vs. 803.18), and the reduced model classifies about 81% of the training passengers correctly.