if (Sys.info()["sysname"] == "Windows") {
setwd("~/Masters/DATA606/Week7/Homework")
} else {
setwd("~/Documents/Masters/DATA606/Week7/Homework")
}
require(ggplot2)
## Loading required package: ggplot2
Answer:
The explanatory variable is the calorie count, and the response variable is the amount of carbohydrates.
Answer:
We can predict the number of carbohydrates in an item from the number of calories it is listed as having. If we were on a diet and wanted to limit not just calorie intake but also carbohydrate intake, we could use this prediction to make a reasonable estimate of the number of carbohydrates in a given item.
Answer:
The variability of the data around the line increases with larger values of calories, which violates the constant variability condition and indicates that a linear regression model is not appropriate for these data.
Answer:
First, calculate the slope:
sx <- 10.37
sy <- 9.41
r <- 0.67
(b1 <- r * sy/sx)
## [1] 0.6079749
Next, the least squares regression line passes through the point \((\overline{x}, \overline{y})\), so \(\overline{y} = b_0 + b_1 \overline{x}\). Therefore, \(b_0 = \overline{y} - b_1 \overline{x}\).
y_bar <- 171.14
x_bar <- 107.2
(b0 <- y_bar - b1 * x_bar)
## [1] 105.9651
Therefore, the linear regression line is represented with this equation:
\[\text{height} = 105.97 + 0.61 \times \text{shoulder girth}\]
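The two steps above can be bundled into a small helper. This is only a sketch: the function name `ls_line_from_summary` and the variable `line_coefs` are hypothetical, and the helper assumes nothing beyond the summary statistics quoted in this answer.

```r
# Hypothetical helper: least squares line from summary statistics only.
ls_line_from_summary <- function(x_bar, y_bar, sx, sy, r) {
  b1 <- r * sy / sx          # slope: correlation scaled by the ratio of SDs
  b0 <- y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
  c(intercept = b0, slope = b1)
}

# Shoulder girth (x) and height (y) summary statistics from this answer
line_coefs <- ls_line_from_summary(x_bar = 107.2, y_bar = 171.14,
                                   sx = 10.37, sy = 9.41, r = 0.67)
round(line_coefs, 2)
```

Returning a named vector keeps the intercept and slope together, so later answers can reuse them without retyping rounded values.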
Answer:
The slope (b1): For each additional centimeter in shoulder girth, the model predicts an additional 0.61 cm in height.
The intercept (b0): When the shoulder girth is zero, the expected height is 105.97 cm. However, a shoulder girth of zero is not a realistic measurement, so the intercept has no practical meaning here.
Answer:
The r-squared value is the square of the correlation coefficient, so r-squared is:
0.67^2
## [1] 0.4489
About 44.9% of the variability in height is accounted for by the model, i.e. explained by the shoulder girth.
Answer:
The expected height is:
b0 + b1 * 100
## [1] 166.7626
Answer:
A residual is the expected value subtracted from the observed, which in this case is:
160 - (b0 + b1 * 100)
## [1] -6.762581
A negative value means that the model overestimates the height of this person.
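The prediction and residual calculations can be written as two small functions. This is a minimal sketch: the names `predict_height` and `height_residual` are hypothetical, and the coefficients are recomputed from the summary statistics given earlier.

```r
# Coefficients recomputed from the summary statistics used above
slope_hg     <- 0.67 * 9.41 / 10.37
intercept_hg <- 171.14 - slope_hg * 107.2

# Hypothetical helpers: predicted height and residual (observed - predicted)
predict_height  <- function(girth) intercept_hg + slope_hg * girth
height_residual <- function(observed, girth) observed - predict_height(girth)

predict_height(100)        # expected height at a shoulder girth of 100 cm
height_residual(160, 100)  # negative: the model overestimates this person
```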
Answer:
No, the data appear to be for adults since the mean shoulder girth is 107.2 cm and the standard deviation is 10.37 cm. A measurement of 56 cm is about 5 standard deviations below the mean, which is well outside the range of the observed data, so a prediction at this value would be an extrapolation.
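The "about 5 standard deviations" figure comes from a z-score. A quick check, using the hypothetical variable names `girth_mean`, `girth_sd`, and `z_child`:

```r
girth_mean <- 107.2   # mean shoulder girth (cm) from the summary statistics
girth_sd   <- 10.37   # standard deviation of shoulder girth (cm)

# Number of standard deviations between 56 cm and the mean
(z_child <- (56 - girth_mean) / girth_sd)
```

A z-score this far from zero suggests the value lies outside the range the model was fit on, although the summary statistics alone do not give the actual minimum of the data.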
Answer:
\[\text{heart weight (g)} = -0.357 + 4.034 \times \text{body weight (kg)}\]
Answer:
The expected heart weight for a body weight of zero kilograms is -0.357 grams.
Answer:
For each additional kilogram in body weight, we expect heart weight to increase by 4.034 grams.
Answer:
Body weight (kg) explains 64.66% of the variability in heart weight (g).
Answer:
The correlation coefficient is the square root of the r-squared value, taken with the same sign as the slope. Since the slope is positive, the correlation coefficient is:
sqrt(0.6466)
## [1] 0.8041144
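Because r-squared discards the sign of the correlation, the sign has to be recovered from the slope. A small sketch of that step, with the hypothetical names `r_squared_cats`, `slope_cats`, and `r_cats`:

```r
r_squared_cats <- 0.6466   # r-squared reported for the model
slope_cats     <- 4.034    # estimated slope for body weight (kg)

# The correlation takes the sign of the slope
(r_cats <- sign(slope_cats) * sqrt(r_squared_cats))
```

With a negative slope the same expression would return a negative correlation, which `sqrt()` alone cannot do.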
Answer:
Using the information given:
\[\overline{y} = b_0 + b_1 \overline{x}\]
\[\frac{\overline{y} - b_0}{\overline{x}} = b_1\]
y_bar <- 3.9983
x_bar <- -0.0883
b0 <- 4.01
((y_bar - b0)/x_bar)
## [1] 0.1325028
Using the table provided:
t_value <- 4.13
se <- 0.0322
t_value * se
## [1] 0.132986
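The two estimates of the slope differ slightly because each starts from rounded inputs. A short comparison, using the hypothetical names `slope_from_means` and `slope_from_table`:

```r
# Slope from the point of means and the intercept
slope_from_means <- (3.9983 - 4.01) / -0.0883

# Slope from the table: estimate = t value * standard error
slope_from_table <- 4.13 * 0.0322

# The gap between the two is small relative to the standard error (0.0322)
abs(slope_from_means - slope_from_table)
```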
Answer:
Yes. The p-value for the slope is approximately zero, which means that if the null hypothesis (slope = 0) were true, we would almost never see a point estimate as extreme as the one observed. In addition, the 95% confidence interval for the slope lies entirely above zero, confirming that there is convincing evidence that the slope is positive:
# A simple linear regression has n - 2 residual degrees of freedom
t_95 <- qt(p = 0.975, df = (463 - 2))
slope <- t_value * se
slope - t_95 * se
## [1] 0.06970903
slope + t_95 * se
## [1] 0.196263
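The p-value itself can be recovered from the t statistic. This is a sketch that assumes 463 observations and n - 2 residual degrees of freedom for a simple regression; the names `t_stat`, `df_resid`, `p_one_sided`, and `p_two_sided` are hypothetical.

```r
t_stat   <- 4.13      # t value for the slope, from the table
df_resid <- 463 - 2   # assumed: n = 463, two estimated coefficients

# Upper-tail probability for the one-sided test H_A: slope > 0
p_one_sided <- pt(t_stat, df = df_resid, lower.tail = FALSE)

# Two-sided p-value, the kind usually reported in regression output
p_two_sided <- 2 * p_one_sided

c(one_sided = p_one_sided, two_sided = p_two_sided)
```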
Answer:
When fitting a least squares line, we generally require:
Linearity: Given the p-values indicated, and that the graph of the actual data points has not been provided, we can assume the data appear to be linear.
Nearly normal residuals: The data appear to have a left skew based on the histogram and normal probability plot, which is a cause for concern when fitting the data to a linear model.
Constant variability: The residual plots appear to show that the data have the same amount of variability regardless of the x value.
Independent observations: It appears there is a time-series plot provided as one of the residual graphs. There does not appear to be a trend in this data.
The data provided appear to meet all the conditions except the requirement for nearly normal residuals; therefore, we should investigate this issue or proceed with caution.
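For reference, the four conditions can be checked with a set of base R plots. The exercise's data are not available here, so this sketch runs on simulated data; the variables `sim_x`, `sim_y`, `sim_fit`, and `sim_res`, and all simulated values, are assumptions for illustration only.

```r
set.seed(606)

# Simulated data (NOT the exercise's data): a weak positive linear trend
n_sim <- 200
sim_x <- rnorm(n_sim)
sim_y <- 4 + 0.13 * sim_x + rnorm(n_sim, sd = 0.5)

sim_fit <- lm(sim_y ~ sim_x)
sim_res <- resid(sim_fit)

op <- par(mfrow = c(2, 2))
# Linearity: scatterplot with the fitted line
plot(sim_x, sim_y, main = "Linearity")
abline(sim_fit)
# Nearly normal residuals: normal probability plot
qqnorm(sim_res, main = "Nearly normal residuals")
qqline(sim_res)
# Constant variability: residuals against fitted values
plot(fitted(sim_fit), sim_res, main = "Constant variability")
abline(h = 0, lty = 2)
# Independent observations: residuals in order of collection
plot(sim_res, type = "b", main = "Independence (order of collection)")
abline(h = 0, lty = 2)
par(op)
```

The last panel is only informative when the row order reflects the order in which the observations were collected.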