library("DATA606")
## Loading required package: shiny
## Loading required package: openintro
## Warning: package 'openintro' was built under R version 4.1.2
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## Loading required package: OIdata
## Loading required package: RCurl
## Warning: package 'RCurl' was built under R version 4.1.2
## Loading required package: maps
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.2
## Loading required package: markdown
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 4th Edition. You can read this by typing
## vignette('os4') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following objects are masked from 'package:openintro':
##
## calc_streak, present, qqnormsim
## The following object is masked from 'package:utils':
##
## demo
Baby weights, Part I. (9.1, p. 350) The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.
The variability within the smokers and non-smokers are about equal and the distributions are symmetric. With these conditions satisfied, it is reasonable to apply the model. (Note that we don’t need to check linearity since the predictor has only two levels.)
Intercept = 123.05; slope=-8.94
y = mx + b
babyweight = 123.05−8.94(smoke)
The slope is the estimated difference in average birth weight between babies born to mothers who smoke and babies born to mothers who do not: babies of smokers are predicted to weigh 8.94 oz less. The predicted birth weight is 114.11 oz for babies of smokers and 123.05 oz for babies of non-smokers.
# smoker
smoke_baby_wgt <-123.05-8.94*1
smoke_baby_wgt
## [1] 114.11
# non-smoker
non_smoke_baby_wgt <- 123.05-8.94*0
non_smoke_baby_wgt
## [1] 123.05
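The same two predictions can be produced in one call with a small helper. This is only a sketch: the name `predict_bwt` is hypothetical and simply wraps the fitted equation with the coefficients reported above.

```r
# Hypothetical helper (not part of the original analysis): evaluates the
# fitted line babyweight = 123.05 - 8.94 * smoke for a vector of 0/1 values.
predict_bwt <- function(smoke, intercept = 123.05, slope = -8.94) {
  intercept + slope * smoke
}

# Named input so the output is labelled by group
pred <- predict_bwt(c(non_smoker = 0, smoker = 1))
pred

# The difference between the two predictions equals the slope
diff_pred <- unname(pred["smoker"] - pred["non_smoker"])
diff_pred
```

Because the predictor is a 0/1 indicator, the intercept is the predicted weight for non-smokers and the slope is the gap between the two groups.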
Because the p-value is approximately 0, we reject the null hypothesis H0: β1 = 0 in favor of HA: β1 ≠ 0. There is a statistically significant relationship between birth weight and the mother's smoking status: babies born to smokers weigh less on average.
Absenteeism, Part I. (9.4, p. 352) Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.
The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).
y=18.93−9.11∗eth+3.10∗sex+2.15∗lrn
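eth: all else held constant, non-aboriginal students are predicted to be absent 9.11 fewer days on average than aboriginal students.
sex: all else held constant, male students are predicted to be absent 3.10 more days on average than female students.
lrn: all else held constant, slow learners are predicted to be absent 2.15 more days on average than average learners.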
The residual for the first observation is -22.18: this student (aboriginal, male, slow learner) missed 2 days, while the model predicts 24.18 days.
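# first observation: aboriginal (eth = 0), male (sex = 1), slow learner (lrn = 1)
eth <- 0
sex <- 1
lrn <- 1
days_missed <- 2
# predicted days absent from the fitted model
prediction <- 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
# residual = observed - predicted
first_observ <- days_missed - prediction
first_observ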
## [1] -22.18
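The interpretation of each coefficient can be checked numerically by switching one indicator from 0 to 1 while holding the others fixed. The sketch below assumes only the fitted equation above; `predict_days` is a hypothetical helper name.

```r
# Hypothetical helper wrapping the fitted equation
# days = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
predict_days <- function(eth, sex, lrn) {
  18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
}

# Changing one indicator from 0 to 1, holding the others fixed,
# shifts the prediction by that variable's coefficient.
eth_effect <- predict_days(1, 0, 0) - predict_days(0, 0, 0)
sex_effect <- predict_days(0, 1, 0) - predict_days(0, 0, 0)
lrn_effect <- predict_days(0, 0, 1) - predict_days(0, 0, 0)
round(c(eth = eth_effect, sex = sex_effect, lrn = lrn_effect), 2)

# Residual for a student with eth = 0, sex = 1, lrn = 1 who missed 2 days
resid_first <- 2 - predict_days(0, 1, 1)
resid_first
```

A negative residual means the model overestimates the number of days this student was absent.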
The r-squared is 0.0893 and the adjusted r-squared is 0.0701.
# variances of the residuals and of the outcome (days absent)
residual <- 240.57
outcome <- 264.17
n <- 146  # number of students
k <- 3    # number of predictors
# R-squared
r_squ <- 1 - (residual / outcome)
r_squ
## [1] 0.08933641
# adjusted R-squared
adju_r2 <- 1 - (residual / outcome) * ((n - 1) / (n - k - 1))
adju_r2
## [1] 0.07009704
Absenteeism, Part II. (9.8, p. 357) The exercise above considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.
Which, if any, variable should be removed from the model first?
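Ethnicity should not be removed: the adjusted R-squared of the model without eth is negative, far below the full model's 0.0701, so dropping eth makes the model worse. In backward elimination we remove the variable whose removal gives the highest adjusted R-squared, provided it exceeds that of the full model. In the table this is the model without learner status, so lrn should be removed first.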
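The decision rule for one step of backward elimination by adjusted R-squared can be sketched as a small function. The function name `backward_step` and the numbers in the usage lines are made up for illustration; they are not the values from the exercise's table.

```r
# One step of backward elimination by adjusted R-squared.
# `full` is the adjusted R-squared of the current model; `dropped` is a named
# vector giving the adjusted R-squared after removing each variable in turn.
backward_step <- function(full, dropped) {
  best <- which.max(dropped)
  # Remove a variable only if doing so improves on the current model
  if (dropped[best] > full) names(dropped)[best] else NA_character_
}

# Made-up numbers for illustration only (not the exercise's table)
backward_step(full = 0.30, dropped = c(x1 = 0.10, x2 = 0.31, x3 = 0.28))
backward_step(full = 0.30, dropped = c(x1 = 0.10, x2 = 0.29, x3 = 0.28))
```

In the first call removing `x2` raises the adjusted R-squared, so `x2` is dropped; in the second no removal helps, so the function returns `NA` and the current model is kept.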
Challenger disaster, Part I. (9.16, p. 380) On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch. Temp gives the temperature in Fahrenheit, Damaged represents the number of damaged O-rings, and Undamaged represents the number of O-rings that were not damaged.
Each column above represents a different shuttle mission and records the launch temperature along with the numbers of damaged and undamaged O-rings. Damaged O-rings occur mostly at the lower temperatures; as the temperature increases, the number of damaged O-rings decreases.
With a p-value of approximately 0, the temperature coefficient is statistically significant. Its negative sign means that higher launch temperatures are associated with lower odds of O-ring damage.
\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162 \times \text{Temperature}\]
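Based on the model, I do think the concerns about the O-rings are justified. The low p-value gives strong evidence that the probability of damage depends on temperature, and the negative coefficient means that damage becomes more likely as the temperature drops.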
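The size of the temperature effect is easier to read on the odds scale. The short sketch below uses only the fitted coefficient of -0.2162; the variable names are illustrative.

```r
# Temperature coefficient from the fitted logistic model
b_temp <- -0.2162

# Multiplicative change in the odds of damage per 1 degree F increase
odds_ratio <- exp(b_temp)
odds_ratio

# Multiplicative change in the odds for a launch that is 10 degrees F colder
odds_ratio_10_colder <- exp(b_temp * -10)
odds_ratio_10_colder
```

Each additional degree multiplies the odds of damage by about 0.81, so a launch 10 degrees colder has odds of damage several times higher.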
Challenger disaster, Part II. (9.18, p. 381) Exercise above introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.
\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162 \times \text{Temperature}\]
where \(\hat{p}\) is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:
\[\begin{align*} &\hat{p}_{57} = 0.341 && \hat{p}_{59} = 0.251 && \hat{p}_{61} = 0.179 && \hat{p}_{63} = 0.124 \\ &\hat{p}_{65} = 0.084 && \hat{p}_{67} = 0.056 && \hat{p}_{69} = 0.037 && \hat{p}_{71} = 0.024 \end{align*}\]
The probability of O-ring damage at 51 degrees Fahrenheit is 65.40%, at 53 degrees Fahrenheit is 55.09% and at 55 degrees Fahrenheit is 44.32%.
# temperature 51
p_51 <- exp(11.663 - 51 * 0.2162) / (1 + exp(11.663 - 51 * 0.2162))
p_51
## [1] 0.6540297
# temperature 53
p_53 <- exp(11.663 - 53 * 0.2162) / (1 + exp(11.663 - 53 * 0.2162))
p_53
## [1] 0.5509228
# temperature 55
p_55 <- exp(11.663 - 55 * 0.2162) / (1 + exp(11.663 - 55 * 0.2162))
p_55
## [1] 0.4432456
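One way to check that the coefficients have been entered correctly is to reproduce the probabilities provided in the exercise for 57 to 71 degrees. The sketch below uses base R's `plogis()` for the inverse logit; `p_damage` is a hypothetical helper name.

```r
# Inverse logit of the fitted model; plogis() is base R's logistic CDF
p_damage <- function(temp) plogis(11.6630 - 0.2162 * temp)

# Probabilities provided in the exercise for 57, 59, ..., 71 degrees F
temps <- seq(57, 71, by = 2)
given <- c(0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024)

# Side-by-side comparison of model output and provided values
check <- data.frame(temp = temps, model = round(p_damage(temps), 3), given = given)
check
```

If the `model` and `given` columns agree, the same function can be trusted for the temperatures that are not provided.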
Plot
temp2 <-c(seq(51, 71, 2))
prob <-exp(11.6630-0.2162*temp2) / (1 + exp(11.6630-0.2162*temp2))
plot(data.frame(temp2, prob), type = "b", pch = 15)
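My assumptions are that the observations are independent of each other and that the log-odds of damage are linear in temperature. Independence is questionable because O-rings on the same mission share the same launch conditions, and with only 23 missions it is hard to verify these assumptions; other variables may also matter.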