Assignment 7

7.6 Husbands and wives, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife’s age plotted against her husband’s age, and the plot on the right shows wife’s height plotted against husband’s height.

Describe the relationship between husbands’ and wives’ ages.

#Very strong positive correlation

Describe the relationship between husbands’ and wives’ heights.

#weak positive relationship

Which plot shows a stronger correlation? Explain your reasoning.

#The first plot because the data points are clustered together and if we were to draw a line through the data they'd all be very close to the line.

Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands’ and wives’ heights?

#No. Converting centimeters to inches is a linear rescaling (dividing by 2.54), and correlation does not depend on the units of measurement.
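The effect of a unit conversion on correlation can be checked numerically. The sketch below uses simulated heights (hypothetical values, not the survey sample; the means, standard deviations, and variable names are assumptions made for illustration) and compares the correlation in centimeters, in exactly converted inches, and in inches rounded to whole numbers.

```r
set.seed(1)  # reproducible simulated data, not the survey sample

# Hypothetical heights in centimeters for 170 couples
husband_cm <- rnorm(170, mean = 175, sd = 7)
wife_cm <- 110 + 0.3 * husband_cm + rnorm(170, mean = 0, sd = 6)

r_cm <- cor(husband_cm, wife_cm)
# Exact conversion: divide both variables by 2.54
r_in <- cor(husband_cm / 2.54, wife_cm / 2.54)
# Conversion followed by rounding to whole inches
r_in_rounded <- cor(round(husband_cm / 2.54), round(wife_cm / 2.54))

all.equal(r_cm, r_in)   # TRUE: the exact conversion leaves r unchanged
c(r_cm, r_in_rounded)   # compare r before and after rounding
```

The exact conversion is a pure rescaling, so the two correlations agree. Rounding adds a small amount of noise, so the rounded value can differ a little from the original.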

7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.

Describe the relationship between volume and height of these trees.

#Moderate positive correlation

Describe the relationship between volume and diameter of these trees.

#Strong positive correlation

Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

#Diameter is preferable because its correlation with volume is stronger, so a simple linear regression model based on diameter is likely to fit more closely.

7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made (a) $5,000 more than women?

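#Correlation = 1: men's salary = women's salary + 5000 is an exact linear relationship with a positive slope.

25% more than women?

#Correlation = 1: men's salary = 1.25 * women's salary is an exact linear relationship with a positive slope.

15% less than women?

#Correlation = 1: men's salary = 0.85 * women's salary is still an exact linear relationship with a positive slope.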
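These three salary scenarios can be verified with a small numerical sketch. The women's salaries below are made-up values used only for illustration; any set of salaries that is not constant would behave the same way.

```r
# Hypothetical women's salaries for one type of position (made-up values)
women <- c(42000, 48000, 55000, 61000, 70000)

r_a <- cor(women + 5000, women)   # (a) men make $5,000 more
r_b <- cor(1.25 * women, women)   # (b) men make 25% more
r_c <- cor(0.85 * women, women)   # (c) men make 15% less

c(r_a, r_b, r_c)
```

Each correlation equals 1 up to floating-point error, because adding a constant or multiplying by a positive constant keeps every point exactly on an upward-sloping line.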

7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

#Weak positive linear relationship

In this scenario, what are the explanatory and response variables?

#Response variable: amount of carbohydrates, Explanatory variable: number of calories in food items

Why might we want to fit a regression line to these data?

#To predict the amount of carbs in a Starbucks food item based on calories.

Do these data meet the conditions required for fitting a least squares line?

#Linearity: There is a positive linear trend.
#Nearly normal residuals: The residual distribution is skewed left.
#Constant variability: The line fits better at lower calorie counts, so the variability is not really constant.
#Independent observations: They should be independent, but because they all come from one menu they may not be fully independent.
#Because of the heteroscedasticity in the sample, fitting a least squares line isn't the best route.

7.30 The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats. (a) Write out the linear model.

#Heart Weight = 4.034(Body Weight)-.357

Interpret the intercept.

# When a cat's body weight is 0 kg, the expected heart weight is -.357 g. This has no practical meaning; the intercept only adjusts the height of the line.

Interpret the slope.

# For every additional 1 kg of body weight, we expect heart weight to be 4.034 g higher on average.

Interpret R2.

#Body weight is able to explain 64.66% of the variance in heart weight

Calculate the correlation coefficient.

R2=.6466
r <- round(sqrt(R2),3)  # positive root because the slope is positive
print(r)

## [1] 0.804

7.36 Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.

Describe the relationship between the number of cans of beer and BAC.

#There is a strong, positive, linear relationship between cans of beer and BAC. For every additional can of beer, BAC increases by .0180 on average.

Write the equation of the regression line. Interpret the slope and intercept in context.

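#BAC = 0.0180 * beers - 0.0127
#Slope: each additional can of beer is associated with an average increase of 0.0180 in BAC.
#Intercept: a student who drank no beer is predicted to have a BAC of -0.0127, which is not meaningful because BAC cannot be negative.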

(c) Do the data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.

#H0: the slope is 0 (no linear association between cans of beer and BAC). HA: the slope is not 0. The p-value is less than .05, so the result is statistically significant: we reject the null hypothesis and conclude there is strong evidence that drinking more beers is associated with an increase in blood alcohol.

(d) The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate R2 and interpret it in context.

r=0.89
R2 <- round(r^2,3)
print(R2)

## [1] 0.792

#About 79% of the variability in BAC is explained by the number of cans of beer.

Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC. Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?

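# No. In a bar the number of drinks is self-reported rather than randomly assigned, and drink size, alcohol content, and time since drinking all vary, so the relationship would likely be weaker.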

7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head circumference = 3.91 + 0.78 × gestational age. (a) What is the predicted head circumference for a baby whose gestational age is 28 weeks?

#head circumference = 3.91 + 0.78*age
HC <- 3.91 + 0.78*(28)
HC

## [1] 25.75

The standard error for the coefficient of gestational age is 0.35, which is associated with df = 23. Does the model provide strong evidence that gestational age is significantly associated with head circumference?

t <- (.78-0)/.35
t

## [1] 2.228571

2* pt(-abs(t), 23)

## [1] 0.03590217

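#Yes. The p-value of .0359 is less than .05, so the association between gestational age and head circumference is statistically significant at the 5% level.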

8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

Write the equation of the regression line.

#weight = -1.93*parity + 120.07

Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

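#Slope: babies who are not first born are predicted to weigh 1.93 oz less, on average, than first borns.
#first born = 120.07 oz, others = 120.07 - 1.93 = 118.14 oz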
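The two predicted birth weights can also be computed directly from the fitted line. This is a minimal sketch that uses the coefficients from the equation written above; the names `b0`, `b1`, and `predicted_weight` are chosen here for illustration.

```r
# Coefficients taken from the regression line: weight = 120.07 - 1.93 * parity
b0 <- 120.07
b1 <- -1.93

# parity is 0 for first borns and 1 otherwise
parity <- c(first_born = 0, other = 1)
predicted_weight <- b0 + b1 * parity
predicted_weight
```

Because parity is an indicator variable, the intercept is the prediction for first borns, and the slope is the difference between the two groups.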

Is there a statistically significant relationship between the average birth weight and parity?

#No, because the p-value of .1052 is greater than .05.

8.4 Absenteeism. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner) (a) Write the equation of the regression line.

#Absenteeism = 18.93 - 9.11 *(ethnic background) + 3.10 *(sex) + 2.15 *(learner status)

Interpret each one of the slopes in this context.

# Holding the other variables constant: non-aboriginal students are predicted to miss 9.11 fewer days than aboriginal students, males 3.10 more days than females, and slow learners 2.15 more days than average learners.

Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.

Absent <- 18.93 - 9.11 * (0) + 3.10 * (1) + 2.15 * (1)
Residual <- 2 - Absent
Residual

## [1] -22.18

The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.

R2.Ab <- 1 - (240.57)/(264.17)
R2.Ab.adj <- 1 - (240.57/264.17)*((146-1)/(146-3-1))
paste("R^2", round(R2.Ab, 4))

## [1] "R^2 0.0893"

paste("R^2 Adj.: ", round(R2.Ab.adj, 4))

## [1] "R^2 Adj.:  0.0701"
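The adjusted R2 calculation can be wrapped in a small reusable function. This is a sketch; the function name `adj_r2` and its argument names are assumptions introduced here, with `n` the number of observations and `k` the number of predictors.

```r
# Adjusted R^2 from the residual variance and the outcome variance.
# n = number of observations, k = number of predictors in the model.
adj_r2 <- function(var_resid, var_y, n, k) {
  1 - (var_resid / var_y) * (n - 1) / (n - k - 1)
}

r2_adj_absent <- adj_r2(240.57, 264.17, n = 146, k = 3)
round(r2_adj_absent, 4)
```

The penalty factor (n - 1) / (n - k - 1) is always at least 1, so the adjusted value can never exceed the unadjusted R2.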

8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree; however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet. (a) Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.

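t <- qt(0.975, df = 31 - 2 - 1)  # t* with df = n - k - 1 = 28, not the normal 1.96
se <- .13
b1 <- .34
CIlower <- (b1 - t * se)
CIupper <- (b1 + t * se)
print (round(CIlower, 3))

## [1] 0.074

print (round(CIupper, 3))

## [1] 0.606

#We are 95% confident that, holding diameter constant, each additional foot of height is associated with an average increase of 0.074 to 0.606 cubic feet in volume.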

One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.

#vol= −57.99+0.34∗ht+4.71∗dia
#vol= −57.99+0.34∗79+4.71∗11.3
#vol = 22.093
#Actual volume is 24.2 so the model underestimates the volume by 24.2 - 22.093 = 2.107 cubic feet
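The same prediction and residual can be computed in code. This is a minimal sketch using the coefficients from the model written above; the helper name `predict_volume` is hypothetical.

```r
# Fitted model from above: vol = -57.99 + 0.34 * height + 4.71 * diameter
predict_volume <- function(height_ft, diameter_in) {
  -57.99 + 0.34 * height_ft + 4.71 * diameter_in
}

predicted <- predict_volume(height_ft = 79, diameter_in = 11.3)
residual <- 24.2 - predicted  # observed minus predicted

round(c(predicted = predicted, residual = residual), 3)
```

A positive residual means the observed volume is larger than the predicted volume, that is, the model underestimates this tree.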

8.8 Absenteeism, Part II Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process. Which, if any, variable should be removed from the model first?

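#Not ethnicity: the model without ethnicity has the lowest adjusted R2, which means dropping it hurts the fit the most. In backward elimination we remove the variable whose removal gives the highest adjusted R2, and only if that value is higher than the full model's.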

8.10 Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

#Ethnicity should be added first because it has the highest adjusted R2 and the lowest p-value, meaning it is the most significant single predictor.

8.12 Movie lovers, Part II Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?

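# Adjusted R2 is the better choice here because the goal is accurate prediction (recommending movies that subscribers will actually watch and rate highly), not identifying which predictors are statistically significant. The p-value approach is more useful when the aim is to understand which variables are significant.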

8.14 GPA and IQ A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.

# With the exception of the initial values, the data points fit the regression line very well. Also, the lack of a pattern between the residuals and fitted values suggests independence in the data.
#There is good evidence to support using a regression model for these data.

8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data. (a) The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as log(p̂/(1 − p̂)) = 11.6630 − 0.2162 × Temperature, where p̂ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:

x = -0.2162
int = 11.6630
temp = 51
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))

paste("PI at 51:", round(pi*100, 3), "%")

## [1] "PI at 51: 65.403 %"

x = -0.2162
int = 11.6630
temp = 53
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))

paste("PI at 53:", round(pi*100, 3), "%")

## [1] "PI at 53: 55.092 %"

x = -0.2162
int = 11.6630
temp = 55
yi = int + temp * x
pi = (exp(yi)) / (1 + exp(yi))

paste("PI at 55:", round(pi*100, 3), "%")

## [1] "PI at 55: 44.325 %"
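The three calculations above repeat the same steps, so they can be collapsed into one vectorized call. The sketch below uses base R's `plogis()`, which is the inverse logit, with the same coefficients; the names `temps` and `p_hat` are chosen here for illustration.

```r
# Temperatures (degrees Fahrenheit) from part (a)
temps <- c(51, 53, 55)

# plogis(x) = exp(x) / (1 + exp(x)), applied to the linear predictor
p_hat <- plogis(11.6630 - 0.2162 * temps)

round(100 * p_hat, 3)  # percentages: 65.403 55.092 44.325
```

This also avoids assigning to `pi`, which masks R's built-in constant of the same name.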

Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

b1 <- -0.2162
int <- 11.6630
Temp <- seq(51, 71, 2)

# Linear predictor (log-odds) at each temperature, vectorized instead of looping
yi <- int + Temp * b1
# Inverse logit gives the model-estimated probability of O-ring damage
PFail <- exp(yi) / (1 + exp(yi))

plot(Temp, PFail, type = "o")

Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity

# Two conditions are required: each predictor xi must be linearly related to logit(pi) when all other predictors are held constant, and each outcome Yi must be independent of the other outcomes. The first can't be verified well with such a limited data set. The second is plausible because each launch in the data set should be independent of the others.

James Lunga

3/15/2021