7.6 Husbands and wives, Part I.
Strong positive linear correlation.
The heights look like they could be weakly positively correlated, but the relationship may be non-linear, or there may be no real correlation at all.
The age plot shows the stronger correlation, because the points fall much more tightly along a straight line than in the height plot.
No, it would not change: converting the units is just a linear rescaling of the same measurements, and correlation is unaffected by a change of units.
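As a quick illustration (a sketch using simulated, made-up heights rather than the exercise data), correlation is unchanged by a change of units:
set.seed(1)
husband_in <- rnorm(20, mean = 70, sd = 3)                 # hypothetical heights in inches
wife_in <- 0.6 * husband_in + rnorm(20, mean = 22, sd = 2) # hypothetical related heights
cor(husband_in, wife_in)                 # correlation with heights in inches
cor(husband_in * 2.54, wife_in * 2.54)   # same value after converting both to centimeters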
7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.
Looks like positive correlation but perhaps non-linear.
Positive linear correlation.
I would rather use the diameter measurements, because the scatterplots indicate that diameter has the stronger (and more clearly linear) association with volume, so it would help us better predict the volume.
7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made:
salaryMen = salaryWomen + 5000. This is an exact positive linear relationship (every man's salary is exactly $5,000 above the corresponding woman's), so the correlation would be exactly 1.
salaryMen = salaryWomen + 0.25 * salaryWomen = 1.25 * salaryWomen. Again an exact positive linear relationship, so the correlation would be exactly 1.
salaryMen = salaryWomen - 0.15 * salaryWomen = 0.85 * salaryWomen. The slope (0.85) is still positive, so even though men earn less, this is still an exact positive linear relationship and the correlation would be exactly 1.
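A minimal sketch with made-up salaries (hypothetical values, not from the exercise) confirming that any exact linear relationship with a positive slope has a correlation of exactly 1:
salaryWomen <- c(40000, 52000, 61000, 75000, 90000)   # hypothetical salaries
cor(salaryWomen, salaryWomen + 5000)    # (a) exactly 1
cor(salaryWomen, 1.25 * salaryWomen)    # (b) exactly 1
cor(salaryWomen, 0.85 * salaryWomen)    # (c) exactly 1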
7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
Positive linear correlation.
The response variable is the amount of carbohydrates (in grams) in a menu item, and the explanatory variable is its calorie content (Calories).
Because we are trying to predict the amount of carbs based on the calorie count.
The scatterplot shows a reasonably linear trend, and the residuals appear nearly normal based on the residual plot and histogram. However, the residual plot does not show constant variability, so not all of the conditions for fitting a least squares line are satisfied.
7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg).
heartweight = -0.357 + 4.034 * bodywt
The intercept is -0.357, which means that a cat with a body weight of 0 kg would be predicted to have a heart weight of -0.357 g. This does not make sense in reality, and no cat actually weighs 0 kg; the intercept simply serves to anchor the regression line.
The slope is 4.034, which indicates that for each additional 1 kg of body weight, a cat's heart weight is expected to increase by 4.034 g.
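For example (a sketch with a made-up body weight, not a value from the exercise), the fitted equation predicts the heart weight of a 3 kg cat as:
bodywt <- 3                          # hypothetical body weight in kg
heartwt <- -0.357 + 4.034 * bodywt   # predicted heart weight in g
heartwt                              # about 11.7 g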
R-squared is 64.66%, which means that 64.66% of the variability in cats' heart weights is explained by their body weights.
sqrt(0.6466) # the correlation is the square root of R-squared (positive, because the slope is positive)
## [1] 0.8041144
7.36 Beer and blood alcohol content. Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.
Positive linear correlation.
BAC = -0.0127 + 0.0180 * beers
The slope is 0.0180, which indicates that for each additional beer consumed, the predicted BAC increases by 0.0180 grams per deciliter. The intercept is -0.0127, which says that someone who drank 0 beers would be predicted to have a BAC of -0.0127; that is impossible, and the intercept simply anchors the regression line.
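For example (a sketch with a made-up number of beers, not a case from the study), the fitted equation predicts the BAC after 5 beers as:
beers <- 5                        # hypothetical number of beers
bac <- -0.0127 + 0.0180 * beers   # predicted BAC in grams per deciliter
bac                               # about 0.077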
Ho: slope = 0 (no association between beers and BAC). Ha: slope ≠ 0 (there is an association between beers and BAC).
The p-value is reported as approximately zero, so we reject the null hypothesis and conclude that there is a statistically significant association between the number of beers consumed and BAC.
0.89^2 # R-squared is the square of the correlation coefficient, r = 0.89
## [1] 0.7921
The R-squared is 0.7921, which indicates that the number of beers consumed explains 79.21% of the variability in BAC.
Probably not as strong: bar patrons vary much more in weight and drinking habits than the study volunteers, the drinks themselves vary in size and alcohol content, and people drink over different lengths of time, all of which would add variability around the line.
7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head_circumference= 3.91 + 0.78*gestational_age
hc <- 3.91+0.78*28 # predicted head circumference (cm) for a gestational age of 28 weeks
hc
## [1] 25.75
The predicted head circumference is 25.75 cm.
Ho: slope = 0 (no association). Ha: slope ≠ 0 (there is an association between gestational age and head circumference).
qt(0.975, 23) # two-sided critical value at the 5% level, with n - 2 = 23 degrees of freedom
## [1] 2.068658
tscore <- (0.78-0)/0.35 # slope estimate divided by its standard error
tscore
## [1] 2.228571
The t-score (about 2.23) exceeds the critical value (about 2.07), so we reject the null hypothesis and conclude that gestational age is significantly associated with head circumference.
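Equivalently (a quick sketch, not required by the exercise), the two-sided p-value for this t-score is below 0.05:
2 * pt(tscore, df = 23, lower.tail = FALSE)   # roughly 0.036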
8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   120.07       0.60  199.94   0.0000
parity         -1.93       1.19   -1.62   0.1052
birthweight = 120.07 - 1.93 * parity
The slope is -1.93, which indicates that babies who are not first born (parity = 1) are expected to weigh 1.93 ounces less, on average, than first-born babies.
# First borns (parity = 0)
bw1 <- 120.07-1.93*0
bw1
## [1] 120.07
# Others (parity = 1)
bw2 <- 120.07-1.93*1
bw2
## [1] 118.14
We expect a first born to be 120.07 ounces and the others to be 118.14 ounces.
Ho: slope = 0 (no association). Ha: slope ≠ 0 (there is an association). The p-value is 0.1052, which is above 0.05, so we fail to reject the null hypothesis: the data do not provide convincing evidence of an association between average birth weight and parity.
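As a quick arithmetic check (a sketch, not required by the exercise), the t value reported in the table is just the estimate divided by its standard error:
-1.93 / 1.19   # about -1.62, matching the table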
8.4 Absenteeism, Part I. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

    eth sex lrn days
1     0   1   1    2
2     0   1   1   11
146   1   0   0   37

The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).

            Estimate Std. Error t value Pr(>|t|)
(Intercept)    18.93       2.57    7.37   0.0000
eth            -9.11       2.60   -3.51   0.0000
sex             3.10       2.64    1.18   0.2411
lrn             2.15       2.65    0.81   0.4177
daysabsent = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
Holding the other variables constant: the slope for ethnic background is -9.11, so students who are not aboriginal (eth = 1) are expected to be absent 9.11 fewer days on average; the slope for sex is 3.10, so male students (sex = 1) are expected to be absent 3.10 more days; and the slope for learner status is 2.15, so slow learners (lrn = 1) are expected to be absent 2.15 more days.
# first observation: eth = 0, sex = 1, lrn = 1, days absent = 2
ada <- 2 # actual days absent
eda <- 18.93-9.11*0+3.10*1+2.15*1 # expected days absent
residual <- ada-eda
residual
## [1] -22.18
n <- 146       # number of students
k <- 3         # number of predictors
var1 <- 240.57 # variance of the residuals
var2 <- 264.17 # variance of the number of days absent
rsquared <- 1-(var1/var2)
rsquared
## [1] 0.08933641
adjusted <- 1-(var1/var2)*((n-1)/(n-k-1))
adjusted
## [1] 0.07009704
So about 8.9% of the variability in days absent is explained by the model, and the adjusted R-squared is about 0.070, which matches the full-model adjusted R-squared used in Exercise 8.8.
8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree; however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree's volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -57.99       8.64   -6.71     0.00
height          0.34       0.13    2.61     0.01
diameter        4.71       0.26   17.82     0.00
n <- 31
k <- 2          # two predictors: height and diameter
df <- n-k-1     # 31 - 2 - 1 = 28 degrees of freedom
t <- qt(0.975, df)
se <- 0.13
slope <- 0.34
lower <- slope-(t*se)
lower
## [1] 0.07370707
upper <- slope+(t*se)
upper
## [1] 0.6062929
We are 95% confident that the coefficient of height is between about 0.074 and 0.606. This means that, all else being constant, for each additional foot of height we would expect the volume of the tree to increase by somewhere between about 0.074 and 0.606 cubic feet, on average.
##height=79, diameter=11.3, volume=24.2
av <- 24.2 #actual volume
av
## [1] 24.2
ev <- -57.99+0.34*79+4.71*11.3 #expected volume
ev
## [1] 22.093
av-ev
## [1] 2.107
The model underestimates the volume of this tree by 2.107 cubic feet.
8.8 Absenteeism, Part II. Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

  Model              Adjusted R-squared
1 Full model                     0.0701
2 No ethnicity                  -0.0033
3 No sex                         0.0676
4 No learner status              0.0723

Which, if any, variable should be removed from the model first?
Removing learner status (lrn) increases the adjusted R-squared from 0.0701 (full model) to 0.0723, meaning the model describes days absent at least as well without it, so learner status should be removed first.
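A minimal sketch of this backward-elimination step, using the adjusted R-squared values from the table above:
models <- c("Full model" = 0.0701, "No ethnicity" = -0.0033,
            "No sex" = 0.0676, "No learner status" = 0.0723)
names(which.max(models))   # the model without learner status has the highest adjusted R-squared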
8.10 Absenteeism, Part III. Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R-squared of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

  variable             ethnicity     sex  learner status
  p-value                 0.0007  0.3142          0.5870
  adjusted R-squared      0.0714  0.0001          0
Based on this table, we should add the ethnicity variable first: it is the only predictor with a statistically significant p-value (0.0007), and its single-variable model has the highest adjusted R-squared (0.0714).
8.12 Movie lovers, Part II. Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?
The adjusted R-squared approach would be more suitable here: the company cares about how well the model predicts which movies subscribers will actually watch and rate highly, not about whether each individual variable is statistically significant, and adjusted R-squared selection targets predictive fit.
8.14 GPA and IQ. A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.
Based on the plots, most of the observations fall close to the line on the normal quantile plot, with only a few outliers, and the residuals show no clear trend when plotted against the fitted values. Overall, the regression model seems appropriate for these data.
8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.
damage51 <- 11.6630-0.2162*51                 # predicted log-odds of damage at 51 degrees F
pdamaged51 <- exp(damage51)/(1+exp(damage51)) # convert log-odds to a probability
pdamaged51
## [1] 0.6540297
damage53 <- 11.6630-0.2162*53
pdamaged53 <-exp(damage53)/(1+exp(damage53))
pdamaged53
## [1] 0.5509228
damage55 <- 11.6630-0.2162*55
pdamaged55 <-exp(damage55)/(1+exp(damage55))
pdamaged55
## [1] 0.4432456
# model-predicted probabilities of damage at 51-71 degrees F
# (first three computed above, the remaining eight taken from the exercise)
model <- c(pdamaged51, pdamaged53, pdamaged55, 0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024)
tempmod <- c(51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71)
# observed data for the 23 shuttle missions: launch temperature (degrees F),
# number of damaged O-rings, and number of undamaged O-rings per mission
temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
undamaged <- c(1,5,5,5,6,6,6,6,6,6,5,6,5,6,6,6,6,5,6,6,6,6,6)
probdamage <- damaged/(damaged+undamaged)   # observed proportion of O-rings damaged per mission
library(ggplot2)
mydata <- data.frame(temperature, damaged, undamaged, probdamage)
# weight each point by the number of O-rings in the mission so the logistic smooth is fit to the counts
ggplot(mydata, aes(x = temperature, y = probdamage, weight = damaged + undamaged)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = binomial), formula = y ~ x)
plot(tempmod, model, type="o", col="green")   # model-estimated probability of damage versus temperature
Each predictor x_i must be linearly related to logit(p_i) when all other predictors are held constant; with only one predictor and so few missions, this is hard to assess from the plots above. The observations must also be independent, which seems reasonable here because each shuttle launch should be independent of the others.
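As a cross-check (a sketch, not part of the exercise), the logistic model can be refit directly from the grouped counts defined above; assuming those counts were transcribed correctly from Exercise 8.16, the estimates should come out close to the intercept 11.6630 and slope -0.2162 used above:
fit <- glm(cbind(damaged, undamaged) ~ temperature, family = binomial, data = mydata)
coef(fit)   # intercept and slope for temperature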