Chapter 6: Exercise 3

Using the iris dataset: a) Combine Setosa and Versicolor into group “0” and label Virginica as “1”. Create a new variable called iris$Group with the 0 and 1 labels. b) Build a logistic regression model, using any available data, that predicts whether an observation is a Virginica (value of 1 in the Group variable). c) Calculate the probability of a new plant being a Virginica for the following parameters: Sepal.Width = 5, Petal.Length = 10, Petal.Width = 7, Sepal.Length = 9.

Answer:

##printing the iris data set
print(iris)
##creating a new variable with 0 or 1 labels
iris <- as.data.frame(iris)
iris$Group <- gsub("setosa", "0", iris$Species, fixed = TRUE)
iris$Group <- gsub("versicolor", "0", iris$Group, fixed = TRUE)
iris$Group <- gsub("virginica", "1", iris$Group, fixed = TRUE)
iris$Group <- as.numeric(iris$Group)
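
The chained gsub() calls work, but the same 0/1 label can be built in a single step; an equivalent sketch:

##one-step alternative: the logical test is coerced to 1 (virginica) or 0
iris$Group <- as.integer(iris$Species == "virginica")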

##creating test and train samples
set.seed(1234)
index <- sample(1:nrow(iris),size = 0.8*nrow(iris))
iris_train <- iris[index, ]
iris_test <- iris[-index,]
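##sanity-check sketch: confirm both classes (0 and 1) appear in the training sample
table(iris_train$Group)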
##building a logistic regression model for Virginica
my_logit <- glm(Group~Sepal.Length+Sepal.Width+
                  Petal.Length+Petal.Width,data=iris_train,family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(my_logit)
## 
## Call:
## glm(formula = Group ~ Sepal.Length + Sepal.Width + Petal.Length + 
##     Petal.Width, family = "binomial", data = iris_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.83672  -0.00455   0.00000   0.00479   1.54003  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -35.587     20.103  -1.770   0.0767 .
## Sepal.Length   -2.557      2.882  -0.887   0.3750  
## Sepal.Width    -3.453      4.835  -0.714   0.4751  
## Petal.Length    8.368      4.400   1.902   0.0572 .
## Petal.Width    11.911      8.626   1.381   0.1673  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 154.1123  on 119  degrees of freedom
## Residual deviance:   9.6082  on 115  degrees of freedom
## AIC: 19.608
## 
## Number of Fisher Scoring iterations: 11

Strictly speaking, none of the coefficients is significant at the 5% level; “Petal.Length” comes closest (p ≈ 0.057). The warning “fitted probabilities numerically 0 or 1 occurred” indicates quasi-complete separation: Virginica is almost perfectly separable from the other two species in the training data, which inflates the standard errors and makes these p-values unreliable.
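
Note that the held-out sample iris_test created above is never actually used. A minimal sketch of how the model could be evaluated on it, using a 0.5 probability cutoff:

##evaluating the model on the held-out test sample
test_probs <- predict(my_logit, iris_test, type="response")
##confusion table: rows = actual class, columns = predicted class at 0.5 cutoff
table(actual=iris_test$Group, predicted=ifelse(test_probs > 0.5, 1, 0))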

##creating a data frame for new parameters
new_plant <- data.frame(Sepal.Length=9,Sepal.Width=5,Petal.Length=10,Petal.Width=7)

##predicting the probability of new plant being Virginica
predict(my_logit, new_plant, type="response")
## 1 
## 1

Therefore, the predicted probability of the new plant being Virginica is essentially 1 (100%). Note, however, that these measurements lie far outside the range of the training data, so this prediction is a strong extrapolation.
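
The same number can be recovered by hand: logistic regression predicts p = 1/(1 + exp(-eta)), where eta is the linear predictor. A sketch using the fitted coefficients:

##computing the probability manually from the coefficients
##coefficient order: intercept, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
eta <- sum(coef(my_logit) * c(1, 9, 5, 10, 7))
plogis(eta) #equivalent to 1/(1 + exp(-eta))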

Chapter 6: Exercise 4

Using the kyphosis dataset: a) Convert the kyphosis$Kyphosis variable to numeric, assigning a 1 to “present” and a 0 to “absent”. b) Build a logistic regression using all other variables and estimate the probability of an observation having a “present” kyphosis. What can you say about the coefficients? Are they significant? c) Calculate the probability of kyphosis being “present” for the following observation: Age=50, Start=10, Number=5.

Answer:

##importing the kyphosis dataset (it ships with the rpart package)
#install.packages("rpart")
library(rpart)
print(kyphosis)

##creating a new variable with 0 or 1 labels
kyphosis <- as.data.frame(kyphosis)
kyphosis$Kyphosis <- gsub("absent", "0", kyphosis$Kyphosis, fixed = TRUE)
kyphosis$Kyphosis <- gsub("present", "1", kyphosis$Kyphosis, fixed = TRUE)
kyphosis$Kyphosis <- as.numeric(kyphosis$Kyphosis)
##normalizing the data with a min-max UDF
normal <- function(var){
  my_normal <- (var-min(var))/(max(var)-min(var))
  return(my_normal)
} #closing the normal UDF

##calling the function
kyphosis$age_norm <- normal(kyphosis$Age)
kyphosis$number_norm <- normal(kyphosis$Number)
kyphosis$start_norm <- normal(kyphosis$Start)
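
A quick sanity check that the min-max normalization behaves as intended, i.e. that each normalized column now spans [0, 1] (a sketch):

##each normalized column should range from exactly 0 to 1
sapply(kyphosis[, c("age_norm", "number_norm", "start_norm")], range)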

##creating test and train samples
set.seed(1234) #seed added so the split is reproducible
index_kyphosis <- sample(1:nrow(kyphosis), size = 0.8*nrow(kyphosis))
kyphosis_train <- kyphosis[index_kyphosis, ]
kyphosis_test <- kyphosis[-index_kyphosis,]
##building a logistic regression model for kyphosis
kyphosis_logit <- glm(Kyphosis~age_norm+number_norm+
                        start_norm,data=kyphosis_train,family = "binomial")
summary(kyphosis_logit)
## 
## Call:
## glm(formula = Kyphosis ~ age_norm + number_norm + start_norm, 
##     family = "binomial", data = kyphosis_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5298  -0.5216  -0.3308  -0.1150   2.1818  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -1.243      1.418  -0.877  0.38067   
## age_norm       2.475      1.624   1.523  0.12766   
## number_norm    4.302      2.471   1.741  0.08163 . 
## start_norm    -4.211      1.369  -3.077  0.00209 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69.697  on 63  degrees of freedom
## Residual deviance: 47.135  on 60  degrees of freedom
## AIC: 55.135
## 
## Number of Fisher Scoring iterations: 6
##refitting without normalizing, on the full dataset this time
kypho_logit_wn <- glm(Kyphosis~Age+Number+
                        Start,data=kyphosis,family = "binomial")
summary(kypho_logit_wn)
## 
## Call:
## glm(formula = Kyphosis ~ Age + Number + Start, family = "binomial", 
##     data = kyphosis)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3124  -0.5484  -0.3632  -0.1659   2.1613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.036934   1.449575  -1.405  0.15996   
## Age          0.010930   0.006446   1.696  0.08996 . 
## Number       0.410601   0.224861   1.826  0.06785 . 
## Start       -0.206510   0.067699  -3.050  0.00229 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.234  on 80  degrees of freedom
## Residual deviance: 61.380  on 77  degrees of freedom
## AIC: 69.38
## 
## Number of Fisher Scoring iterations: 5

In both models, only Start is significant, with a p-value below 0.05 (p ≈ 0.002); its negative coefficient means the estimated probability of kyphosis decreases as Start increases. Age and Number are at best marginally significant, and normalizing the predictors rescales the coefficients without changing the substantive conclusion.
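
For interpretation, exponentiating the coefficients of the un-normalized model turns them into odds ratios, and the so-far-unused kyphosis_test sample allows a quick hold-out check of the normalized model; a sketch, using a 0.5 cutoff:

##odds ratios: multiplicative change in the odds of kyphosis per unit increase
exp(coef(kypho_logit_wn))
##evaluating the normalized model on the held-out test sample
test_probs_k <- predict(kyphosis_logit, kyphosis_test, type="response")
table(actual=kyphosis_test$Kyphosis, predicted=ifelse(test_probs_k > 0.5, 1, 0))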

##creating a data frame for the new observation
new_kyphosis <- data.frame(Age=50,Number=5,Start=10)

##predicting the probability of kyphosis being "present"
predict(kypho_logit_wn, new_kyphosis, type="response")
##         1 
## 0.1820524

The probability of kyphosis being present for this observation is only about 18.2%.

Chapter 6: Exercise 5

Using all the single-variable regressions from Exercise 1, test whether the variable pairs are homoscedastic or heteroscedastic. Plot your findings using the plot(x=my_x_variable, y=my_y_variable, type='p') function. Use my_data$variable_name to define the x and y variables in the function.

Answer: Based on the scatter plots below, the first pair (Sepal.Length vs. Sepal.Width) and the third pair (Sepal.Length vs. Petal.Width) look heteroscedastic, while the second pair (Sepal.Length vs. Petal.Length) looks homoscedastic. The plots are drawn with ggplot2 here; equivalent base-R plot() calls, as the exercise literally asks for, are sketched at the end.

##plotting to check if variable pairs are homoscedastic or heteroscedastic
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species))+geom_point()

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species))+geom_point()

ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Width, color=Species))+geom_point()
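
The exercise asks for base-R plot() calls rather than ggplot2; those, together with a formal Breusch-Pagan test of each single-variable regression (a sketch, assuming the lmtest package is available), would look like this:

##base-R versions of the same scatter plots
plot(x=iris$Sepal.Length, y=iris$Sepal.Width, type='p')
plot(x=iris$Sepal.Length, y=iris$Petal.Length, type='p')
plot(x=iris$Sepal.Length, y=iris$Petal.Width, type='p')

##Breusch-Pagan test: a small p-value suggests heteroscedasticity
#install.packages("lmtest")
library(lmtest)
bptest(lm(Sepal.Width~Sepal.Length, data=iris))
bptest(lm(Petal.Length~Sepal.Length, data=iris))
bptest(lm(Petal.Width~Sepal.Length, data=iris))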