library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
library(ggplot2)
\(\widehat{weight} = 120.07 - 1.93 \times parity\)
The estimated birth weight of a baby who is not first born is 1.93 ounces lower than that of a first-born baby.
First born: 120.07
Not first born: 120.07 - 1.93 * 1 = 118.14
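The two predictions above can be reproduced in R. This is a minimal sketch: the helper name `predict_weight` is hypothetical, and it assumes `parity` is coded 0 for a first-born baby and 1 otherwise, as the arithmetic above implies.

```r
# Hypothetical helper: predicted birth weight (ounces) from the fitted line.
# Assumes parity is coded 0 = first born, 1 = not first born.
predict_weight <- function(parity) {
  120.07 - 1.93 * parity
}

# Evaluate both groups at once; the vector names carry through the arithmetic.
weights <- predict_weight(c(first_born = 0, not_first_born = 1))
weights
```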
\(\widehat{days} = 18.93 - 9.11 \times eth + 3.10 \times sex + 2.15 \times lrn\)
\(\beta_{eth}\): Holding the other variables constant, students who are not aboriginal are estimated to be absent 9.11 fewer days on average than aboriginal students.
\(\beta_{sex}\): Holding the other variables constant, males are estimated to be absent 3.10 more days on average than females.
\(\beta_{lrn}\): Holding the other variables constant, slow learners are estimated to be absent 2.15 more days on average than average learners.
predicted <- 18.93 - 9.11 * 0 + 3.10 * 1 + 2.15 * 1  # eth = 0, sex = 1, lrn = 1
observed <- 2
observed - predicted
## [1] -22.18
The negative residual shows that the model overestimated this student's number of days absent by 22.18 days.
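The same residual can be computed for any student by wrapping the fitted equation in a function. A sketch only: `predict_absent` and `resid_first` are hypothetical names, and the 0/1 coding in the comments is an assumption inferred from the coefficient interpretations above.

```r
# Hypothetical helper: predicted days absent from the fitted model.
# Assumed coding: eth (0 = aboriginal, 1 = not aboriginal),
# sex (0 = female, 1 = male), lrn (0 = average learner, 1 = slow learner).
predict_absent <- function(eth, sex, lrn) {
  18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
}

# Residual = observed - predicted for a student who missed 2 days.
resid_first <- 2 - predict_absent(eth = 0, sex = 1, lrn = 1)
resid_first
```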
R_squared <- 1 - (240.57 / 264.17)
R_squared
## [1] 0.08933641
adjR_squared <- 1 - ((240.57 / 264.17) * (146 - 1) / (146 - 3 - 1))
adjR_squared
## [1] 0.07009704
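The adjusted \(R^2\) calculation generalizes to any model. The sketch below uses a hypothetical helper, `adj_r_squared`; its argument names are assumptions, with 240.57 taken to be the residual variance, 264.17 the outcome variance, `n` the number of observations, and `k` the number of predictors.

```r
# Hypothetical helper: adjusted R-squared from the two variances,
# the sample size n, and the number of predictors k.
adj_r_squared <- function(var_resid, var_outcome, n, k) {
  1 - (var_resid / var_outcome) * (n - 1) / (n - k - 1)
}

adj <- adj_r_squared(240.57, 264.17, n = 146, k = 3)
round(adj, 4)
```

The penalty factor \((n - 1) / (n - k - 1)\) is greater than 1 whenever \(k \geq 1\), so the adjusted value is always below the unadjusted one.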
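Which, if any, variable should be removed from the model first?
Learner status (lrn). Under backward elimination with adjusted \(R^2\), the variable to drop first is the one whose removal gives the highest adjusted \(R^2\), provided that value exceeds the full model's 0.0701; here that is the "No learner status" model.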
O-rings tend to be damaged more often at lower temperatures. That is, as the temperature increases, the chance of damage becomes smaller.
Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and a logistic regression model was fit to these data. A summary of this model is given below. Describe the key components of this summary table in words.
Write out the logistic model using the point estimates of the model parameters. \(logit(p_i) = \log_e\left(\frac{p_i}{1-p_i}\right)\)
\(logit(\hat{p}_i) = 11.6630 - 0.2162 \times temperature\)
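One way to read the slope is on the odds scale: exponentiating the coefficient gives the multiplicative change in the odds of damage for each one-degree increase in temperature. A brief sketch using the point estimate from the model (the variable names are illustrative only):

```r
# Temperature slope from the fitted logistic model (log-odds scale).
b1 <- -0.2162

# Multiplicative change in the odds of damage per 1 degree Fahrenheit increase.
odds_ratio <- exp(b1)
odds_ratio  # below 1, so the odds fall as temperature rises
```

The result is roughly 0.81, meaning the estimated odds of damage drop by about 19% for each additional degree.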
\(H_0: \beta_1 = 0\): Temperature has no effect on the probability that an O-ring is damaged.
\(H_A: \beta_1 \neq 0\): Temperature affects the probability that an O-ring is damaged.
The p-value is approximately 0, so we reject the null hypothesis that temperature is unrelated to O-ring damage.
The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as:
\(\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162 \times temperature\)
where \(\hat{p}\) is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit.
#51
exp(11.6630 - 0.2162 * 51)/(1 + exp(11.6630 - 0.2162 * 51))
## [1] 0.6540297
#53
exp(11.6630 - 0.2162 * 53)/(1 + exp(11.6630 - 0.2162 * 53))
## [1] 0.5509228
#55
exp(11.6630 - 0.2162 * 55)/(1 + exp(11.6630 - 0.2162 * 55))
## [1] 0.4432456
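The three calculations can also be done in a single vectorized call. This sketch uses `plogis()` from base R's stats package, which computes the inverse logit \(e^x / (1 + e^x)\); the names `temps` and `p_hat` are illustrative.

```r
# Temperatures of interest, in degrees Fahrenheit.
temps <- c(51, 53, 55)

# plogis() is the inverse logit, so this matches the manual formula above.
p_hat <- plogis(11.6630 - 0.2162 * temps)
round(p_hat, 4)
```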
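# Temperatures (degrees Fahrenheit) at which to evaluate the fitted model
temp <- c(51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71)
# Model-estimated probability of O-ring damage at temperature x
hat_prob <- function(x) {
  exp(11.6630 - 0.2162 * x) / (1 + exp(11.6630 - 0.2162 * x))
}
prob_damage <- hat_prob(temp)
df <- data.frame(temp, prob_damage)
# Plot the estimated probabilities and connect them with a line
ggplot(df, aes(x = temp, y = prob_damage)) + geom_point() + geom_line()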
Logistic regression requires that each data point be independent of all other data points. If the observations are related to one another, the model will tend to overstate the significance of those observations. Separately, logistic regression relies on having an adequate number of observations for each combination of the independent variables, since small samples can lead to wildly inaccurate parameter estimates. In our case the sample size is only 23, which is small, and the sample size by itself does not establish that the observations are independent.