The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
The explanatory variable is calories and the response variable is carbohydrates (grams).
We fit a regression line to see how well calories predict carbohydrates compared with a baseline such as simply using the mean. The residuals, the differences between observed and predicted values, indicate whether there is a real relationship between the explanatory and response variables. If the residuals are large for most observed values, the explanatory variable does not explain the response effectively, and the model is not a good approximation for predicting the outcome we want.
Even though the relationship appears linear in the scatterplot, and the residuals are nearly normal, we do not satisfy the criteria for constant variability. A simple linear model is inadequate for modeling these data.
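As a sketch of how these conditions could be checked in R, assuming the data were loaded into a data frame called starbucks with columns calories and carb (both names are assumptions for illustration, not taken from the original output):
# Sketch only: assumes a data frame `starbucks` with columns `calories` and `carb`
m <- lm(carb ~ calories, data = starbucks)
summary(m)                                   # slope, intercept, and R-squared

plot(starbucks$calories, resid(m),           # residuals vs. explanatory variable
     xlab = "Calories", ylab = "Residuals")  # a fan shape indicates non-constant variability
abline(h = 0, lty = 2)
hist(resid(m))                               # check for nearly normal residuals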
Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
responseMean = 171.14      # mean height (cm)
responseSD = 9.41          # standard deviation of height (cm)
R = 0.67                   # correlation between height and shoulder girth
explanatoryMean = 107.2    # mean shoulder girth (cm)
explanatorySD = 10.37      # standard deviation of shoulder girth (cm)
slope = (responseSD/explanatorySD)*R               # b1 = R * s_y / s_x
intercept = responseMean - slope*explanatoryMean   # b0 = ybar - b1 * xbar
# Evaluate the regression line y = intercept + slope * x
regressionFunction = function(x, slope, intercept){
  y = (x * slope) + intercept
  return(y)
}
girth = 1:150                                                # shoulder girths (cm) to evaluate
predictedHeight = sapply(girth, regressionFunction, slope, intercept)
plot(girth, predictedHeight, type = 'l',
     xlab = 'Shoulder girth (cm)', ylab = 'Height (cm)')
Equation:
height = 105.9650878 + 0.6079749 * shoulder girth
The slope is 0.608, which means that for every 10 cm increase in shoulder girth, the model predicts an additional 6.08 cm of height (or, for every 1 cm increase in shoulder girth, an additional 0.608 cm of height). We must remember that not all values make sense when plugged into this linear regression equation. For example, a shoulder girth of 0 cm (which cannot happen) would predict a height of 105.965 cm, which does not make sense.
R.squared <- R^2
paste("R squared: ", round(R.squared,3))
## [1] "R squared: 0.449"
This means that 44.9% of the variation in height in these data is explained by the linear model, i.e. by shoulder girth.
student.shoulder <- 100 # in cm
student.height <- 105.9650878 + 0.6079749 * student.shoulder
paste("The estimated height of a student with a shoulder girth of 100 cm is: ", round(student.height,3), "cm.")
## [1] "The estimated height of a student with a shoulder girth of 100 cm is: 166.763 cm."
# residual = height(observed) - height(predicted)
residual <- 160 - student.height   # the student's observed height is 160 cm
paste("The residual is: ", round(residual,3))
## [1] "The residual is: -6.763"
Residual = observed - predicted, which in this case is -6.763: the student's actual height of 160 cm is about 6.76 cm less than the height the line predicts.
The original data only cover shoulder girths of roughly 80 to 140 cm, which is a limitation. The variability appears constant, and the correlation is strong and linear. While predicting outside that range would be inappropriate extrapolation, I would do it in the absence of any other model.
The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
Heart Weight (g) = -0.357 + 4.034 * Body Weight (kg)
The intercept means that a cat with a body weight of 0 kg would have an average heart weight of -0.357 grams. This is obviously a theoretical value that serves only to anchor the linear model; it has no meaningful interpretation on its own.
The slope means that for each additional kilogram of body weight, the average heart weight of a cat increases by 4.034 grams.
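For example, plugging a hypothetical 3 kg cat into the equation (the 3 kg value is chosen here only for illustration and is not from the dataset):
cat.body <- 3                            # hypothetical body weight in kg
cat.heart <- -0.357 + 4.034 * cat.body   # predicted heart weight in g
paste("Estimated heart weight: ", round(cat.heart, 3), "g")
# expected output: "Estimated heart weight:  11.745 g"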
Body weight (in kg) explains 64.66% of the variability in the heart weight (in g) of the cat.
paste("Correlation coefficient: ", round(sqrt(.6466),3))
## [1] "Correlation coefficient: 0.804"
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. Researchers at the University of Texas at Austin collected data on teaching evaluation score (a higher score means a better evaluation) and standardized beauty score (a score of 0 means average, a negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and a regression output for predicting teaching evaluation score from beauty score is also provided.
# The regression line passes through the point of means, so b1 = (ybar - b0) / xbar
prof.B1 <- (3.9983 - 4.010)/(-0.0883)
paste("B1 or the slope: ", round(prof.B1, 3))
## [1] "B1 or the slope: 0.133"
Since the slope is positive, the relationship is positive. If we set up a hypothesis test with H0: B1 = 0 and HA: B1 > 0, then based on the summary table the p-value is nearly 0. That p-value is for a two-sided test, so it will be even closer to 0 for the one-sided test. We reject the null hypothesis: there is convincing evidence that the relationship between teaching evaluation score and beauty score is positive.
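As a sketch of how the one-sided test could be carried out if the raw data were available, assuming a data frame evals with columns score and bty_avg (both names are assumptions, not from the output shown):
m.prof <- lm(score ~ bty_avg, data = evals)    # assumed data frame and column names
two.sided.p <- summary(m.prof)$coefficients["bty_avg", "Pr(>|t|)"]
one.sided.p <- two.sided.p / 2                 # halve the two-sided p-value because HA is one-sided
one.sided.p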
- Linearity: Based on the scatterplot, there may be a weak linear relationship. There is no evident pattern in the residual plot.
- Nearly normal residuals: The histogram of the residuals exhibits a left skew, and the points drift away from the normal probability line at each end.
- Constant variability: Based on the residual plot, there appears to be constant variability in the data.
- Independent observations: The observations are not a time series and can be assumed to be independent (unless there is evidence that students copied each other's evaluations).