Using the iris dataset: a) Combine the Setosa and Versicolor into group “0” and label the Virginica to “1”. Create a new variable called iris$Group with the 0 and 1 labels. b) Build a logistic regression model using any available data that will predict the observations being Virginica (value of 1 in Group variable). c) Calculate the probability of a new plant being a Virginica for the following parameters: Sepal.Width = 5 Petal.Length = 10 Petal.Width = 7 Sepal.Length = 9
Answer:
##printing iris data set
print(iris)
##creating a new variable with 0 or 1 labels
iris <- as.data.frame(iris)
iris$Group <- gsub("setosa", "0", iris$Species, fixed=T)
iris$Group <- gsub("versicolor", "0", iris$Group, fixed=T)
iris$Group <- gsub("virginica", "1", iris$Group, fixed=T)
iris$Group <- as.numeric(iris$Group)
##creating test and train samples
set.seed(1234)
index <- sample(1:nrow(iris),size = 0.8*nrow(iris))
iris_train <- iris[index, ]
iris_test <- iris[-index,]
##building a logistic regression model for Virginica
my_logit <- glm(Group~Sepal.Length+Sepal.Width+
Petal.Length+Petal.Width,data=iris_train,family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(my_logit)
##
## Call:
## glm(formula = Group ~ Sepal.Length + Sepal.Width + Petal.Length +
## Petal.Width, family = "binomial", data = iris_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.83672 -0.00455 0.00000 0.00479 1.54003
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.587 20.103 -1.770 0.0767 .
## Sepal.Length -2.557 2.882 -0.887 0.3750
## Sepal.Width -3.453 4.835 -0.714 0.4751
## Petal.Length 8.368 4.400 1.902 0.0572 .
## Petal.Width 11.911 8.626 1.381 0.1673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.1123 on 119 degrees of freedom
## Residual deviance: 9.6082 on 115 degrees of freedom
## AIC: 19.608
##
## Number of Fisher Scoring iterations: 11
Based on the p-values, the only significant variable is “Petal.Length”.
##creating a data frame for new parameters
new_plant <- data.frame(Sepal.Length=9,Sepal.Width=5,Petal.Length=10,Petal.Width=7)
##predicting the probability of new plant being Virginica
predict(my_logit, new_plant, type="response")
## 1
## 1
Therefore, the probability of the new plant being Virginica is 100%.
Using the kyphosis dataset: a) Convert the kyphosis$kyphosis variable to numeric, assign a 1 to present and a 0 to absent. b) Build a logistic regression using all other variables and estimate the probability of the observation having a “present” hyphosis. What can you say about the coefficients? Are they significant? c) Calculate the probability of kyphosis being “present” for the following observation: Age=50, Star=10, Number=5.
Answer:
##importing kyphosis dataset
#install.packages("rpart")
library(rpart)
print(kyphosis)
##creating a new variable with 0 or 1 labels
kyphosis <- as.data.frame(kyphosis)
kyphosis$Kyphosis <- gsub("absent", "0", kyphosis$Kyphosis, fixed=T)
kyphosis$Kyphosis <- gsub("present", "1", kyphosis$Kyphosis, fixed=T)
kyphosis$Kyphosis <- as.numeric(kyphosis$Kyphosis)
#normalizing the data
normal <- function(var){
my_normal <- (var-min(var))/(max(var)-min(var))
return(my_normal)
} #Closing the normal UDF
##calling the function
kyphosis$age_norm <- normal(kyphosis$Age)
kyphosis$number_norm <- normal(kyphosis$Number)
kyphosis$start_norm <- normal(kyphosis$Start)
##creating test and train samples
index_kyphosis <- sample(1:nrow(kyphosis), size = 0.8*nrow(kyphosis))
kyphosis_train <- kyphosis[index_kyphosis, ]
kyphosis_test <- kyphosis[-index_kyphosis,]
##building a logistic regression model for Virginica
kyphosis_logit <- glm(Kyphosis~age_norm+number_norm+
start_norm,data=kyphosis_train,family = "binomial")
summary(kyphosis_logit)
##
## Call:
## glm(formula = Kyphosis ~ age_norm + number_norm + start_norm,
## family = "binomial", data = kyphosis_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5298 -0.5216 -0.3308 -0.1150 2.1818
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.243 1.418 -0.877 0.38067
## age_norm 2.475 1.624 1.523 0.12766
## number_norm 4.302 2.471 1.741 0.08163 .
## start_norm -4.211 1.369 -3.077 0.00209 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 69.697 on 63 degrees of freedom
## Residual deviance: 47.135 on 60 degrees of freedom
## AIC: 55.135
##
## Number of Fisher Scoring iterations: 6
##without normalizing the data
kypho_logit_wn <- glm(Kyphosis~Age+Number+
Start,data=kyphosis,family = "binomial")
summary(kypho_logit_wn)
##
## Call:
## glm(formula = Kyphosis ~ Age + Number + Start, family = "binomial",
## data = kyphosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3124 -0.5484 -0.3632 -0.1659 2.1613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.036934 1.449575 -1.405 0.15996
## Age 0.010930 0.006446 1.696 0.08996 .
## Number 0.410601 0.224861 1.826 0.06785 .
## Start -0.206510 0.067699 -3.050 0.00229 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.234 on 80 degrees of freedom
## Residual deviance: 61.380 on 77 degrees of freedom
## AIC: 69.38
##
## Number of Fisher Scoring iterations: 5
Based on the logistic regression model, only start is significant since the p-value is below 0.05.
##creating a data frame for new observations
new_kyphosis <- data.frame(Age=50,Number=5,Start=10)
##predicting the probability of kyphosis being "present"
predict(kypho_logit_wn, new_kyphosis, type="response")
## 1
## 0.1820524
The probability of the kyphosis being present is only 18%.
Using all the single variable regressions from Exercise 1, test if the variable pairs are homocedastic or heterocedastic. Plot your findings. Using the plot (x=my_x_variable, y=my_y_variable, type=‘p’) function. Use my_data$variable_name to define x and y variables in the function.
Answer: Based on the plots, plots 1 and 3 are heteroscedastic while plot 2 is homoscedastic.
##plotting to check if variable pairs are homoscedastic or heteroscedastic
library(ggplot2)
ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species))+geom_point()
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Length, color=Species))+geom_point()
ggplot(data=iris, aes(x=Sepal.Length, y=Petal.Width, color=Species))+geom_point()