Title: Week_10_Data_Dive
Output: HTML document
#loading the necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#loading the data
data(diamonds)
#creating a binary column from the cut variable: 1 for Fair or Ideal, 0 for all other cuts
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)
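Before modelling, it can help to confirm that the new column was coded as intended. The sketch below repeats the recoding step so that it runs on its own and cross-tabulates the original cut levels against cut_binary; it is only a sanity check, not part of the model.

```r
library(ggplot2)  # provides the diamonds data set

data(diamonds)
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)

# Rows are the original cut levels, columns are the new 0/1 coding.
# Only the Fair and Ideal rows should have counts in the "1" column.
cut_check <- table(diamonds$cut, diamonds$cut_binary)
cut_check

# Share of diamonds coded as 1
mean(diamonds$cut_binary)
```

Note that this coding groups the lowest grade (Fair) with the highest grade (Ideal), so cut_binary equal to 1 does not mean "better cut".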
Building a logistic regression model for the binary column cut_binary
logistic_model <- glm(cut_binary ~ carat + depth + price, data = diamonds, family = "binomial")
summary(logistic_model)
##
## Call:
## glm(formula = cut_binary ~ carat + depth + price, family = "binomial",
## data = diamonds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.507e+00 3.952e-01 -19.00 <2e-16 ***
## carat -1.574e+00 5.156e-02 -30.52 <2e-16 ***
## depth 1.292e-01 6.424e-03 20.11 <2e-16 ***
## price 1.242e-04 5.976e-06 20.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 73697 on 53939 degrees of freedom
## Residual deviance: 71980 on 53936 degrees of freedom
## AIC: 71988
##
## Number of Fisher Scoring iterations: 4
Intercept (-7.507e+00): This is the log-odds of cut_binary being 1 when all predictors (carat, depth, and price) are zero. No diamond in the data has zero carat, depth, and price, so the intercept anchors the model rather than describing a realistic diamond.
Carat (-1.574e+00): For every one-unit increase in carat, holding the other predictors constant, the log-odds of cut_binary being 1 decrease by 1.574. The negative sign means the probability that cut_binary is 1 falls as carat increases.
Depth (1.292e-01): For every one-unit increase in depth, holding the other predictors constant, the log-odds of cut_binary being 1 increase by 0.1292. The positive sign means the probability that cut_binary is 1 rises as depth increases.
Price (1.242e-04): For every one-dollar increase in price, holding the other predictors constant, the log-odds of cut_binary being 1 increase by 0.0001242. The coefficient is small because price is measured in single dollars; the positive sign means the probability that cut_binary is 1 rises as price increases.
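Log-odds are hard to read directly, so one common step is to exponentiate the coefficients into odds ratios and to look at a predicted probability for a realistic diamond. The sketch below refits the same model so that it runs on its own; the "typical" diamond at the sample means is an illustrative assumption, not something taken from the analysis above.

```r
library(ggplot2)  # provides the diamonds data set

data(diamonds)
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)

logistic_model <- glm(cut_binary ~ carat + depth + price,
                      data = diamonds, family = "binomial")

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coef(logistic_model))
odds_ratios

# A one-dollar step is tiny, so also express the price effect per 1000 dollars
exp(1000 * coef(logistic_model)["price"])

# Predicted probability for a hypothetical diamond at the sample means
typical <- data.frame(carat = mean(diamonds$carat),
                      depth = mean(diamonds$depth),
                      price = mean(diamonds$price))
p_typical <- predict(logistic_model, newdata = typical, type = "response")
p_typical
```

An odds ratio below 1 corresponds to a negative coefficient and an odds ratio above 1 to a positive one.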
Using the standard error of the carat coefficient, let’s build a 95% confidence interval (C.I.).
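#point estimate of the carat coefficient (log-odds scale)
coef_carat <- coef(logistic_model)["carat"]
#standard error copied, rounded, from the summary table above
se_carat <- 0.05156
#with about 54,000 residual degrees of freedom the t quantile is practically the normal quantile (1.96)
conf_int_carat <- coef_carat + qt(c(0.025, 0.975), df = logistic_model$df.residual) * se_carat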
conf_int_carat
## [1] -1.674604 -1.472488
Interpretation:
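The 95% confidence interval for the coefficient of carat is
[-1.674604, -1.472488] on the log-odds scale. We are 95% confident that
the true coefficient of carat in the population falls within this
interval. Because the whole interval lies below zero, the negative
association between carat and cut_binary is statistically significant at
the 5% level.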
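As a cross-check on the hand-built interval, the standard error can be read from the fitted model instead of being typed in, and confint.default() from the stats package returns the same kind of Wald interval based on the normal quantile. The sketch below refits the model so that it runs on its own. Plain confint() on a glm uses profile likelihood instead, so its limits may differ slightly.

```r
library(ggplot2)  # provides the diamonds data set

data(diamonds)
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)

logistic_model <- glm(cut_binary ~ carat + depth + price,
                      data = diamonds, family = "binomial")

# Standard error taken from the coefficient table rather than hard-coded
se_carat <- summary(logistic_model)$coefficients["carat", "Std. Error"]
se_carat

# Wald 95% interval for carat on the log-odds scale
wald_ci <- confint.default(logistic_model, parm = "carat", level = 0.95)
wald_ci

# The same interval expressed as an odds ratio
exp(wald_ci)
```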
Let us visualize the model using the carat variable
ggplot(diamonds, aes(x = carat, y = cut_binary)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
Because cut_binary takes only the values 0 and 1, the points form two
horizontal bands, and most of them are crowded at lower carat weights.
The relationship appears non-linear there, which suggests that a
transformation of the carat variable, such as the logarithm, may be
needed to better capture the relationship with cut_binary in the logistic
regression model. Logistic regression assumes linearity on the log-odds
scale, so a transformation that brings the relationship closer to linear
on that scale can improve the model’s fit and interpretability.
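A raw 0/1 scatterplot makes it hard to judge linearity, so one possible check is to bin carat and look at the empirical log-odds in each bin. The sketch below assumes ten equal-sized bins, which is an arbitrary choice, and repeats the recoding step so that it runs on its own.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

data(diamonds)
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)

# Split carat into 10 equal-sized bins and summarise each bin
binned <- diamonds %>%
  mutate(carat_bin = ntile(carat, 10)) %>%
  group_by(carat_bin) %>%
  summarise(mean_carat = mean(carat),
            prop = mean(cut_binary),
            n = n()) %>%
  # qlogis() converts a proportion to log-odds; it is infinite for 0 or 1
  mutate(empirical_logit = qlogis(prop))

binned
```

If the empirical log-odds fall roughly on a straight line against mean_carat, the untransformed variable may be adequate; clear curvature would support trying log(carat).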
ggplot(diamonds, aes(x = log(carat), y = cut_binary)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"))
## `geom_smooth()` using formula = 'y ~ x'
Interpretation:
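The fitted logistic regression curve shows how the probability that
cut_binary equals 1 changes with log(carat). Taking logs spreads out the
crowded low-carat values, which makes the trend easier to see, although a
plot of a 0/1 response cannot on its own confirm that the relationship is
now linear on the log-odds scale. Two cautions apply when reading the
slope. First, cut_binary flags Fair or Ideal cuts, so the curve describes
the probability of belonging to that combined group, not of being a
high-quality diamond. Second, the model above estimated a negative carat
coefficient after adjusting for depth and price, so the direction of the
slope here should be checked against the curve rather than assumed to be
positive.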
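Whether the log transformation actually helps can be checked numerically rather than by eye. A minimal sketch, assuming the same three predictors as the model above, fits one version with carat and one with log(carat) and compares their AIC values (lower is better); the names model_raw and model_log are introduced here only for illustration.

```r
library(ggplot2)  # provides the diamonds data set

data(diamonds)
diamonds$cut_binary <- ifelse(diamonds$cut %in% c("Fair", "Ideal"), 1, 0)

# Same model as above, with carat on its original scale
model_raw <- glm(cut_binary ~ carat + depth + price,
                 data = diamonds, family = "binomial")

# Alternative with carat on the log scale
model_log <- glm(cut_binary ~ log(carat) + depth + price,
                 data = diamonds, family = "binomial")

# Side-by-side comparison of model fit
aic_table <- AIC(model_raw, model_log)
aic_table

# Sign of the log(carat) coefficient gives the direction of the association
coef(model_log)["log(carat)"]
```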