sigma_time = 113
mean_time = 129
sigma_dist = 99
mean_dist = 108
correl = .636
(beta = correl * sigma_time / sigma_dist)
## [1] 0.7259394
The linear model is
\[T - \mu_{time} = \beta (D - \mu_{dist})\]
Rearranging the equation, we see that the constant term is the intercept:
(intercept = -mean_dist * beta + mean_time)
## [1] 50.59855
Thus, we conclude the equation of the regression line is:
\[T = .7259D + 50.59855\]
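The slope and intercept steps above can be bundled into one small helper. This is only a sketch: the function name `line_from_summary` and its argument names are hypothetical and not part of the original solution. It assumes `x` is the explanatory variable (distance) and `y` is the response (time).

```r
# Hypothetical helper: least squares line from summary statistics.
# Assumes x is the explanatory variable and y is the response.
line_from_summary <- function(mean_x, sd_x, mean_y, sd_y, r) {
  slope <- r * sd_y / sd_x               # b1 = r * s_y / s_x
  intercept <- mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
  c(intercept = intercept, slope = slope)
}

# Summary statistics from this exercise (distance -> travel time)
coefs <- line_from_summary(mean_x = 108, sd_x = 99,
                           mean_y = 129, sd_y = 113, r = 0.636)
round(coefs, 4)
```

The same helper could be reused for the girth and height exercise by swapping in those summary statistics.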
The slope means that each additional mile of distance is associated with an average increase of 0.7259 minutes in travel time. The intercept is the predicted travel time when the train travels zero distance. Since a trip of zero miles is not meaningful, the intercept should not be given a physical interpretation.
\(R^2\) is the square of the correlation, here 40.45%. We interpret \(R^2\) to mean that 40.45% of the variation in travel time is explained by the regression line.
(r_squared = correl^2)
## [1] 0.404496
\[ T = .7259 ( 103) + 50.59855\]
( time = .7259 * 103 + 50.59855)
## [1] 125.3663
Based on the regression line, the estimated travel time from Santa Barbara to Los Angeles is 125.37 minutes.
(residual = 168 - time)
## [1] 42.63375
The residual is positive, so the model underestimates the travel time by 42.6 minutes.
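The prediction and residual steps follow the same pattern in every exercise here, so they can be sketched as two tiny functions. The names `predict_line` and `residual_of` are hypothetical and introduced only for illustration. The coefficients are the rounded values used in the calculation above.

```r
# Hypothetical helpers, named for illustration only.
predict_line <- function(x, intercept, slope) intercept + slope * x
residual_of <- function(observed, predicted) observed - predicted  # e = y - y_hat

# Distance of 103 miles, observed travel time of 168 minutes
pred <- predict_line(103, intercept = 50.59855, slope = 0.7259)
res <- residual_of(168, pred)
c(predicted = pred, residual = res)
# A positive residual means the line underestimates the observed value;
# a negative residual means it overestimates.
```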
There is a positive relationship between the number of calories and the amount of carbohydrates in Starbucks food items. As the number of calories increases, the amount of carbohydrates tends to increase.
The explanatory variable is the number of calories. The response variable is the amount of carbohydrates.
The regression line summarizes how carbohydrate content changes with calories, so it can be used to predict the amount of carbohydrates in a food item from its calorie count. This is useful for nutritional menu planning.
The conditions for fitting a regression line are linearity (which appears to hold), nearly normal residuals (which also appears to hold, based on the histogram), constant variability (not satisfied, since the variation is greater for higher-calorie items), and independent observations (which appears plausible).
Let the correlation \(c = .67\) and the standard deviation of girth be \(\sigma_G = 10.37\) and standard deviation of height be \(\sigma_H=9.41\).
\[ b = \frac{\sigma_H}{\sigma_G}c\]
The equation of the line satisfies:
\[ b( G - \bar{G}) = H - \bar{H}\]
\[bG - b\bar{G} + \bar{H} = H\]
Solving for these coefficients gives:
sigma_h = 9.41
sigma_g = 10.37
mean_h = 171.14
mean_g = 107.2
c = .67
(b = (sigma_h / sigma_g) * c)
## [1] 0.6079749
(intercept = -b *mean_g + mean_h )
## [1] 105.9651
We conclude that the regression line equation is:
\[ H = .6079749 G + 105.9651 \]
The slope means that each additional 1 cm of girth is associated with an average increase of 0.608 cm in height. The intercept is the predicted height of a person whose girth is zero. Since people don’t have zero girth, \(G=0\) is outside the range of sensible values and the intercept has no physical interpretation.
\(R^2\) is equal to the square of the correlation, so \(R^2=.4489\). Thus, 44.9% of the variation in height is explained by the least squares line.
(r_squared = c^2 )
## [1] 0.4489
( predicted_height = .6079749 * 100 + 105.9651 )
## [1] 166.7626
(residual = 160 - predicted_height )
## [1] -6.76259
The residual is negative, so the model overestimates this person’s height by about 6.76 cm.
( baby_height = .6079749 * 56 + 105.9651 )
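## [1] 140.0117
A girth of 56 cm lies almost five standard deviations below the mean girth of 107.2 cm, so this prediction is an extrapolation and should not be trusted.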
The linear model is \(y = 4.034 x - 0.357\) where \(x\) is the body weight in kg and \(y\) is the heart weight in grams.
The intercept means that if a cat’s body weight is 0 kg, the predicted heart weight is -0.357 grams. Neither value is sensible, so we ought to treat the intercept only as a parameter that positions the best-fit line within its intended range.
For each additional kilogram of body weight, a cat’s heart weight is expected to be 4.034 grams heavier on average.
The \(R^2\) means that 64.66% of the variation in heart weight is explained by the least squares line.
The correlation coefficient is 0.8041144, the positive square root of \(R^2\), since the slope is positive.
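As a quick check, the correlation can be recovered from \(R^2\) by taking the square root and attaching the sign of the slope. The variable names below are chosen for illustration; the two numeric inputs are the values reported for the cats model.

```r
r_squared <- 0.6466   # reported R^2 for the heart weight model
slope <- 4.034        # fitted slope; its sign gives the sign of r

# r has the same sign as the slope, and r^2 equals R^2
(r <- sign(slope) * sqrt(r_squared))
```

Taking the sign from the slope matters when the relationship is negative, since `sqrt()` alone always returns a non-negative value.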
t_stat = 4.13
std_error = 0.0322
(slope = t_stat * std_error )
## [1] 0.132986
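The slope can be recovered this way under the assumption that the reported t-statistic tests the usual null hypothesis of a zero slope. As a short worked step, with \(b_1\) for the slope estimate and \(SE_{b_1}\) for its standard error (notation introduced here, not in the original):

\[ t = \frac{b_1 - 0}{SE_{b_1}} \quad\Longrightarrow\quad b_1 = t \cdot SE_{b_1} = 4.13 \times 0.0322 \approx 0.133 \]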
Alternatively, we can use the slope of the line connecting the intercept and the point defined by the means, since the least squares line passes through both.
x_avg = -0.0883
y_avg = 3.9983
x_int = 0
y_int = 4.010
(slope = (y_avg - y_int)/ (x_avg - x_int) )
## [1] 0.1325028
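This second approach works because a least squares line always passes through the point of means \((\bar{x}, \bar{y})\). A brief sketch, with \(b_0\) and \(b_1\) as notation introduced here for the intercept and slope:

\[ \bar{y} = b_0 + b_1 \bar{x} \quad\Longrightarrow\quad b_1 = \frac{\bar{y} - b_0}{\bar{x}} = \frac{3.9983 - 4.010}{-0.0883} \approx 0.1325 \]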
The two answers differ slightly because of rounding, so we use 0.132 as the slope in any further calculations.
The slope is statistically significant, based on the large t-statistic of 4.13 and the small p-value.
The four diagnostic plots suggest some skew in the residuals.
The requirements for a least squares line are:
Linearity - this appears to be satisfied, although the data form a dispersed cloud with only a slight pattern.
Near normality of the residuals - somewhat satisfied, based on the histogram of residuals.
Constant variability - not fully satisfied: there is more variability at low values of X than at high values, i.e. professors with high beauty scores show less variation than those with low beauty scores.
Independent observations - supported by the lack of trend in the plot of residuals against the order of data collection.
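The checks listed above can be sketched in base R. The data below are simulated purely for illustration and are NOT the course evaluation data; the variable names `x`, `y`, `fit`, and `res` are likewise assumptions.

```r
set.seed(1)  # simulated data for illustration only
x <- rnorm(200)
y <- 4 + 0.13 * x + rnorm(200, sd = 0.5)

fit <- lm(y ~ x)   # least squares fit
res <- resid(fit)  # residuals used by every diagnostic below

# Constant variability: compare residual spread for low vs. high x
tapply(res, x > median(x), sd)

# Near normality: histogram of residuals
hist(res, main = "Residuals")

# Linearity and constant variability: residuals against fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Independence: residuals in order of data collection
plot(res, xlab = "Order of collection", ylab = "Residuals")
```

With real data, `x` and `y` would be replaced by the observed columns, and the plots would still need to be judged by eye.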