See https://data606.net/assignments/homework/ for more information.

Chapter 7 - Introduction to Linear Regression

Practice: 7.23, 7.25, 7.29, 7.39
Graded: 7.24, 7.26, 7.30, 7.40
Refer to “Getting Started with R” in https://data606.net/post/
The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
paste0("Calories is the explanatory variable and Carb (in grams) is the response variable. The explanatory variable is Calories on the x axis, and the response variable is Carbohydrates on the y axis.")## [1] "Calories is the explanatory variable and Carb (in grams) is the response variable. The explanatory variable is Calories on the x axis, and the response variable is Carbohydrates on the y axis."
paste0("The goal is to predict carbs based on increasing count of calories")## [1] "The goal is to predict carbs based on increasing count of calories"
The following are the conditions for fitting the least squares line:

- Linearity: The data should show a linear trend. From the scatterplot in part (a), the data show a weak to moderate linear relationship.
- Nearly normal residuals: The histogram of the residuals shows a distribution that is slightly skewed to the left.
- Constant variability: The residual plot shows that the residuals do not have constant variability. The data fit the linear model much better at lower calorie counts; the residuals are much larger at higher calorie counts.
- Independent observations: The menu items are presumably independent observations, although they all come from a single chain (Starbucks).
The constant variability condition is not satisfied and linearity is only barely met, while the other two conditions are reasonably well satisfied. Hence, a least squares line is not a good fit for these data. The diagnostic plots behind this assessment could be reproduced as sketched below.
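A minimal sketch of those checks, assuming the starbucks data frame from the openintro package (the column names calories and carb are an assumption and may differ by package version):

library(openintro)

# Fit the least squares line: carbs explained by calories
fit <- lm(carb ~ calories, data = starbucks)

par(mfrow = c(1, 3))
# Linearity: scatterplot with the fitted line
plot(carb ~ calories, data = starbucks, xlab = "Calories", ylab = "Carbs (grams)")
abline(fit)
# Constant variability: residuals against the explanatory variable
plot(fit$residuals ~ starbucks$calories, xlab = "Calories", ylab = "Residuals")
abline(h = 0, lty = 2)
# Nearly normal residuals: histogram of the residuals
hist(fit$residuals, main = "", xlab = "Residuals")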
Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
# Shoulder girth is the explanatory variable
shoulder_mean <- 107.20 # in cm
shoulder_SD <- 10.37 # in cm
# Height is the response
height_mean <- 171.14 # in cm
height_SD <- 9.41 # in cm
# R for correlation
R <- 0.67
# Calculate the slope (also known as B1)
B1 <- R * (height_SD/shoulder_SD)
# To calculate B0, use the point (x, y) = (107.20, 171.14): the means of the two variables, which always lie on the regression line.
# Rearranging y = B0 + B1*x gives B0 = y - B1*x
B0 <- height_mean - B1 * shoulder_mean
# With all of the pieces placed together, the equation for the regression line is:
paste("Regression line equation => Height =", round(B0,3), "+",round(B1,3),"* Shoulder")## [1] "Regression line equation => Height = 105.965 + 0.608 * Shoulder"
paste("Positive Linear Slope would mean the increase in height with each increase in shoulder girth. The intercept represents the height in cm when the shoulder girth = 0. The slope is 0.608 which means for every 1 cm increase in shoulder girth, there will be an additional .608 cm to the height.")## [1] "Positive Linear Slope would mean the increase in height with each increase in shoulder girth. The intercept represents the height in cm when the shoulder girth = 0. The slope is 0.608 which means for every 1 cm increase in shoulder girth, there will be an additional .608 cm to the height."
R.squared <- R^2
paste("R squared: ", round(R.squared,3))## [1] "R squared: 0.449"
This means that 44.9% of the variation in height in these data is explained by the linear model, i.e., by shoulder girth.
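The hand calculation above could be cross-checked by fitting the model directly. A minimal sketch, assuming the bdims data from the openintro package (the column names sho_gi and hgt are an assumption and may differ by package version):

library(openintro)

# Height regressed on shoulder girth; the coefficients should land close to B0 and B1 above
fit_height <- lm(hgt ~ sho_gi, data = bdims)
coef(fit_height)
summary(fit_height)$r.squared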
A randomly selected student has a shoulder girth of 100 cm. Predict the height of this student using the model.
shoulderGirth <- 100
predictedHeight <- B0 + (B1*shoulderGirth)
paste("The predicted height:",predictedHeight)## [1] "The predicted height: 166.762580520733"
residual <- 160 - predictedHeight # the student's observed height is 160 cm
paste("The residual is: ", round(residual,3))## [1] "The residual is: -6.763"
paste("The residual is the difference of the observed response and the response we would predict based on the model fit. The negative residual value in this case means that this model overestimated the actual value.")## [1] "The residual is the difference of the observed response and the response we would predict based on the model fit. The negative residual value in this case means that this model overestimated the actual value."
A one-year-old child has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?
paste("As per excercise 7.15, The original data set had a response variable values between ~80 and 135 cm. A measure of 56 is outside the sample and we would require extrapolation and would not be appropriate. There may be many confounding factors such as malnourishment, developmental delays, congenital diseases, etc. that may not be accounted for in this current linear regression model. Also, tt would not be appropriate to use this model to calculate the height of a 1 year old. It would be extrapolation, since it is outside of the model's range.")## [1] "As per excercise 7.15, The original data set had a response variable values between ~80 and 135 cm. A measure of 56 is outside the sample and we would require extrapolation and would not be appropriate. There may be many confounding factors such as malnourishment, developmental delays, congenital diseases, etc. that may not be accounted for in this current linear regression model. Also, tt would not be appropriate to use this model to calculate the height of a 1 year old. It would be extrapolation, since it is outside of the model's range."
The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
The linear model equation is Heart_Weight = B0 + B1 * Body_Weight, where B0 is the intercept and B1 is the slope. Therefore, reading from the table, the linear regression model here is: Heart Weight (g) = -0.357 + 4.034 * Body Weight (kg)
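The regression table could be reproduced by fitting the model directly. A minimal sketch, assuming the cats data from the MASS package, with body weight Bwt (kg) and heart weight Hwt (g):

library(MASS)

# 144 domestic cats; the estimates should match the table (about -0.357 and 4.034)
fit_cats <- lm(Hwt ~ Bwt, data = cats)
summary(fit_cats)$coefficients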
The intercept tells us that for a body weight of zero the predicted heart weight is -0.357 g; in this context the intercept has no practical meaning. The slope tells us there is a positive relationship between body weight and heart weight: if body weight increases by 1 kg, the predicted heart weight increases by 4.034 g. R$^2$ shows that 64.66% of the variability in heart weight is explained by body weight.

r_squared <- 0.6466
cor <- sqrt(r_squared)
paste("The correlation coefficient of R squared",cor,"indicates the strong (because it's greater than 0.5) positive correlation")Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching e???ectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors.24 The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.
Alternatively, the slope may be computed using just the information provided in the model summary table.
prof.B1 <- (3.9983 - 4.010)/(-0.0883)
paste("B1 or the slope: ", round(prof.B1, 3))## [1] "B1 or the slope: 0.133"
The slope can be calculated from the linear model equation y = B0 + B1 * x. The problem provides enough information to solve for B1 (the slope): substitute x with the average standardized beauty score (the explanatory variable), -0.0883, and y with the average teaching evaluation score (the response), 3.9983, since the point of averages always lies on the least squares line. The y-intercept is also known from the table: 4.010. The equation becomes 3.9983 = 4.010 + B1 * (-0.0883), which we solve for B1 to find the slope.
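Written out:

$$ b_1 = \frac{\bar{y} - b_0}{\bar{x}} = \frac{3.9983 - 4.010}{-0.0883} \approx 0.133 $$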
Explain your reasoning.
paste("The value of the slope is positive.")## [1] "The value of the slope is positive."
Conditions required for linear regression: