7.6 Husbands and wives, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife’s age plotted against her husband’s age, and the plot on the right shows wife’s height plotted against husband’s height.
#Very strong positive correlation
#weak positive relationship
#The first plot, because the data points are clustered together: if we drew a line through the data, they would all lie very close to it.
#It does not: converting cm to inches is a linear rescaling, which leaves the correlation unchanged. Only rounding the converted values could shift it, and then only slightly.
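A quick way to see how a unit conversion affects correlation is to try it on simulated heights. This is only a sketch: the values below are randomly generated and hypothetical, not the survey data, and the variable names are made up for the illustration.

```r
# Simulated heights in cm (hypothetical values, not the survey data)
set.seed(1)
husband_cm <- rnorm(170, mean = 175, sd = 7)
wife_cm <- 0.4 * husband_cm + rnorm(170, mean = 92, sd = 6)

# Exact conversion: 1 inch = 2.54 cm
husband_in <- husband_cm / 2.54
wife_in <- wife_cm / 2.54

r_cm <- cor(husband_cm, wife_cm)                       # correlation in cm
r_in <- cor(husband_in, wife_in)                       # correlation in inches
r_rounded <- cor(round(husband_in), round(wife_in))    # inches rounded to whole numbers

all.equal(r_cm, r_in)    # the linear rescaling leaves r unchanged
abs(r_cm - r_rounded)    # rounding changes r only slightly
```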
7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the
#Moderate positive correlation
#Strong Positive correlation
#Diameter is preferable because its correlation is stronger, so a linear regression model based on diameter is likely to be a closer fit.
7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made (a) $5,000 more than women?
#Perfect positive linear relationship (correlation = 1), since men's salary = women's salary + 5000
#positive linear relation
#positive linear relation
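Part (a) can be checked numerically. The sketch below uses a handful of made-up salaries (hypothetical, chosen only for illustration) and adds a constant $5,000 to each.

```r
# Hypothetical salaries for women in one type of position
women <- c(42000, 48000, 55000, 61000, 70000)

# Part (a): men always make $5,000 more
men <- women + 5000

# A constant shift is an exact linear relationship with positive slope,
# so the correlation equals 1 (up to floating-point error)
cor(women, men)
```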
7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
#Weak positive linear relationship
#Response variable: amount of carbohydrates, Explanatory variable: number of calories in food items
#To predict the amount of carbs in a Starbucks food item based on calories.
#Linearity: There is a positive linear trend
#Nearly normal residuals: The residual distribution is skewed left.
#Constant variability: The data fit better at lower values, so the variability isn't really constant.
#Independent observations: They should be independent, but because they all come from one menu they may not be fully independent.
#Because of the heteroscedasticity in the data sample, fitting a least squares line isn't the best route.
7.30 The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats. (a) Write out the linear model.
#Heart Weight = 4.034(Body Weight)-.357
# when a cat's body weight is 0 kg, the expected heart weight is -0.357 g (not meaningful in context; the intercept only sets the height of the line)
# for every 1 kg increase in body weight we expect a 4.034 g increase in heart weight
#Body weight is able to explain 64.66% of the variance in heart weight
R2=.6466
CV <- round(sqrt(R2),3)
print(CV)
## [1] 0.804
7.36 Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.
#The relationship between cans of beer and BAC is strong, positive, and linear. For each additional can of beer, BAC is expected to increase by 0.0180.
#BAC=0.0180∗beers-.0127
c) Do the data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.
#H0: the slope is 0 (no association between cans of beer and BAC). HA: the slope is greater than 0 (more cans are associated with higher BAC). The p-value is less than .05, so the result is statistically significant. We reject the null hypothesis and conclude there is strong evidence that drinking more beers is associated with an increase in blood alcohol.
d) The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate R2 and interpret it in context.
r=0.89
R2 <- round(r^2,3)
print(R2)
## [1] 0.792
# Yes
7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head circumference = 3.91 + 0.78 × gestational age (a) What is the predicted head circumference for a baby whose gestational age is 28 weeks?
#head circumference = 3.91 + 0.78*age
HC <- 3.91 + 0.78*(28)
HC
## [1] 25.75
t <- (.78-0)/.35
t
## [1] 2.228571
2* pt(-abs(t), 23)
## [1] 0.03590217
Strong evidence, since the p-value of .0359 is less than .05.
8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.
#weight = -1.93*parity + 120.07
#first born = 120.07 oz others = 118.14
#no because p value of .1052 > .05.
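The two predicted weights can be reproduced directly from the fitted equation. A minimal sketch, using the coefficients quoted in the answer above; the variable names are made up for the illustration.

```r
# Coefficients from the fitted line: weight = 120.07 - 1.93 * parity
intercept <- 120.07
slope <- -1.93

# parity is 0 for a first born and 1 otherwise
parity <- c(first_born = 0, other = 1)

# Predicted average birth weight in ounces for each group
predicted_oz <- intercept + slope * parity
predicted_oz
```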
8.4 Absenteeism. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.
The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner) (a) Write the equation of the regression line.
#Absenteeism = 18.93 - 9.11 *(ethnic background) + 3.10 *(sex) + 2.15 *(learner status)
# Holding the other variables constant: non-aboriginal students are predicted to miss 9.11 fewer days than aboriginal students, males 3.10 more days than females, and slow learners 2.15 more days than average learners.
Absent <- 18.93 - 9.11 * (0) + 3.10 * (1) + 2.15 * (1)
Residual <- 2 - Absent
Residual
## [1] -22.18
R2.Ab <- 1 - (240.57)/(264.17)
R2.Ab.adj <- 1 - (240.57/264.17)*((146-1)/(146-3-1))
paste("R^2", round(R2.Ab, 4))
## [1] "R^2 0.0893"
paste("R^2 Adj.: ", round(R2.Ab.adj, 4))
## [1] "R^2 Adj.: 0.0701"
8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree, however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet. a) Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.
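t <- 1.96 # normal approximation; the t critical value for df = 31 - 2 - 1 = 28 is slightly larger (about 2.05)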
se <- .13
b1 <- .34
CIlower <- (b1 - t * se)
CIupper <- (b1 + t * se)
print (CIlower)
## [1] 0.0852
print (CIupper)
## [1] 0.5948
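The interval above uses 1.96 as the critical value. With 31 trees and 2 predictors, a t critical value on 28 degrees of freedom may be more appropriate. The sketch below assumes that choice; it widens the interval slightly, to roughly 0.07 to 0.61, and the conclusion (the interval excludes 0) is the same.

```r
n <- 31      # trees in the sample
k <- 2       # predictors: height and diameter
b1 <- 0.34   # estimated coefficient of height
se <- 0.13   # its standard error

# t critical value for a 95% interval with df = n - k - 1 = 28
t_star <- qt(0.975, df = n - k - 1)

# Lower and upper bounds of the confidence interval
ci <- b1 + c(-1, 1) * t_star * se
round(ci, 3)
```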
#vol= −57.99+0.34∗ht+4.71∗dia
#vol= −57.99+0.34∗79+4.71∗11.3
#vol = 22.093
#Actual volume is 24.2 so the model underestimates the volume
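The prediction and residual can be checked in code. A small sketch: `predict_volume` is a hypothetical helper written for this illustration, using the coefficients from the equation above.

```r
# Fitted model: vol = -57.99 + 0.34 * ht + 4.71 * dia
predict_volume <- function(height_ft, diameter_in) {
  -57.99 + 0.34 * height_ft + 4.71 * diameter_in
}

# Tree with height 79 ft and diameter 11.3 in
predicted <- predict_volume(79, 11.3)

# Residual = observed - predicted; a positive value means the model underestimates
residual <- 24.2 - predicted
predicted
residual
```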
8.8 Absenteeism, Part II Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process. Which, if any, variable should be removed from the model first?
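#Ethnicity should not be removed: the model without ethnicity has the lowest adjusted R2, so dropping it hurts the fit the most. Backward elimination removes the variable whose removal gives the highest adjusted R2, and only if that value exceeds the adjusted R2 of the full model.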
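One step of backward elimination by adjusted R2 can be sketched as a small function. Everything here is illustrative: `choose_removal` is a hypothetical helper, and the numbers are placeholders rather than the values from the exercise table, which should be substituted in.

```r
# One backward-elimination step using adjusted R^2.
# adj_r2_without: named vector giving the adjusted R^2 of the model
# fitted without each variable.
# Returns the name of the entry to act on, or NA if no removal beats the full model.
choose_removal <- function(adj_r2_full, adj_r2_without) {
  best <- which.max(adj_r2_without)
  if (adj_r2_without[[best]] > adj_r2_full) {
    names(adj_r2_without)[best]
  } else {
    NA_character_
  }
}

# Hypothetical values for illustration only (not from the exercise table)
candidates <- c(no_x1 = 0.10, no_x2 = 0.31, no_x3 = 0.27)
choose_removal(adj_r2_full = 0.30, adj_r2_without = candidates)
```

With these placeholder numbers, dropping `x2` is the only removal that raises adjusted R2 above the full model, so it is the one selected.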
8.10 Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?
#Ethnicity should be added because it has the highest R2 adj value and lowest p value meaning it is most significant.
8.12 Movie lovers, Part II Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?
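# Adjusted R2 is the better choice here: the system is judged on how well it predicts what subscribers will watch and rate highly, and adjusted R2 selection targets prediction accuracy while accounting for the numerous variables. The p-value approach fits better when the goal is to identify which predictors are statistically significant.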
8.14 GPA and IQ A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.
# With the exception of the initial values, the data points fit very well on the regression line. Also, the lack of correlation between the residuals and fitted values suggests there is independence in the data.
#There is good evidence to support using a regression model for these data.
8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data. (a) The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as log(p̂/(1−p̂)) = 11.6630 − 0.2162 × Temperature, where p̂ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:
x = -0.2162
int = 11.6630
temp = 51
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))
paste("PI at 51:", round(pi*100, 3), "%")
## [1] "PI at 51: 65.403 %"
x = -0.2162
int = 11.6630
temp = 53
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))
paste("PI at 53:", round(pi*100, 3), "%")
## [1] "PI at 53: 55.092 %"
x = -0.2162
int = 11.6630
temp = 55
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))
paste("PI at 55:", round(pi*100, 3), "%")
## [1] "PI at 55: 44.325 %"
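The three calculations above repeat the same steps, so they can be collapsed into one vectorized call. A sketch: `damage_prob` is a hypothetical helper name, and `plogis()` from base R computes the inverse logit exp(y) / (1 + exp(y)).

```r
# Model-estimated probability of O-ring damage at a given temperature (Fahrenheit)
damage_prob <- function(temp_f, intercept = 11.6630, slope = -0.2162) {
  plogis(intercept + slope * temp_f)   # inverse logit of the linear predictor
}

# Probabilities (in percent) at 51, 53, and 55 degrees
round(100 * damage_prob(c(51, 53, 55)), 3)
```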
b1 <- -0.2162
int <- 11.6630
Temp <- seq(51, 71, 2)
# Linear predictor (log-odds of damage) at each temperature
yi <- int + Temp * b1
# Inverse logit gives the model-estimated probability of damage
PFail <- exp(yi) / (1 + exp(yi))
plot(Temp, PFail, type = "o")
# Two conditions must be met: each predictor xi must be linearly related to logit(pi) when all other predictors are held constant, and each outcome Yi must be independent of the other outcomes. The first can't be verified well with such a limited data set, but the second is reasonable because each launch in the data set should be independent of the others.