This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.
In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1, x_2, \ldots, x_n\}\) and \(Y = \{y_1, y_2, \ldots, y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and the sum of squared errors \(\sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\), called \(\hat{\beta_0}\) and \(\hat{\beta_1}\), we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
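As a concrete illustration of the definitions above, here is a minimal sketch (using a small made-up data set, not one of the project files) that computes \(\hat{\beta_0}\) and \(\hat{\beta_1}\) directly from the least squares formulas and checks them against lm:
# Small made-up example, not one of the project data sets
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Closed-form least squares estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)
# lm() gives the same coefficients
coef(lm(y ~ x))
# Residuals are actual minus predicted; for a least squares fit they sum to (essentially) zero
y_hat <- beta0_hat + beta1_hat * x
e <- y - y_hat
sum(e)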
Please work through this document before continuing with this project.
There are four data sets in the Project3 Data folder: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data are some of the groups from this data set. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
setwd("/Users/justinpark/Desktop/2744203/Project3 Data")
prob1 <- read.csv("problem1.csv")
prob2 <- read.csv("problem2.csv")
prob3 <- read.csv("problem3.csv")
prob4 <- read.csv("problem4.csv")
bdims <- read.csv("bdims.csv")
For each of the four data sets, do the following:
1. Make a scatter plot of x and y (x on the horizontal axis).
2. Using lm, report the coefficients, and plot the regression line over the scatter plot.
3. Plot the residuals against x. Record any observations you have in this document.
4. Calculate the correlation between x and y.
5. What does the correlation tell you about the relationship between x and y?
6. Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code (see the sketch after this list). Hint: the z-score of a vector of data, x, is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(x\).
7. Using lm, calculate the regression coefficients of the standardized data. What do you notice?
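Here is a minimal sketch of the standardization step (tasks 6 and 7) for problem 1, assuming prob1 has numeric columns x and y as in the code below; the name prob1_std is just an illustrative choice:
# z-scores in one line each: scale() subtracts the mean and divides by the standard deviation
prob1_std <- data.frame(x = as.numeric(scale(prob1$x)),
                        y = as.numeric(scale(prob1$y)))
# Regression on the standardized data: the intercept is essentially 0
# and the slope equals the correlation between x and y
coef(lm(y ~ x, data = prob1_std))
cor(prob1_std$x, prob1_std$y)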
# first problem
prob1 %>%
filter(group ==1) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum = lm(y ~ x, data = prob1)
prob.res = resid(sum)
summary(sum)
##
## Call:
## lm(formula = y ~ x, data = prob1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.2072 -7.7558 0.3587 7.9145 23.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.52478 1.48347 9.117 1.12e-13 ***
## x -0.80463 0.04565 -17.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
## F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
# Correlation Calculation
prob1 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.0000000 -0.8998508
## y NA -0.8998508 1.0000000
The correlation is -0.8999, which indicates a fairly strong negative linear relationship between x and y. The summary statistics show that \(R^2\) is 0.8097, meaning the regression explains 80.97% of the variation in y, so the line is a fairly good fit.
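As a quick check that these two numbers agree, in simple linear regression \(R^2\) is the square of the correlation, so squaring -0.8999 should reproduce 0.8097 (reusing the objects already defined above):
# R-squared equals the squared correlation for simple linear regression
cor(prob1$x, prob1$y)^2      # approximately 0.8097
summary(sum)$r.squared       # same value from the fitted model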
# Residuals
plot(prob1$x, prob.res,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
# second problem
prob2 %>%
filter(group == 5) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum2 = lm(y ~ x, data = prob2)
prob.res2 = resid(sum2)
summary(sum2)
##
## Call:
## lm(formula = y ~ x, data = prob2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.967 -3.344 1.423 4.608 5.726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.967e+00 1.951e+00 4.597 0.000127 ***
## x 5.912e-17 3.716e-02 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared: 1.033e-31, Adjusted R-squared: -0.04348
## F-statistic: 2.375e-30 on 1 and 23 DF, p-value: 1
# Correlation Calculation
prob2 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.000000e+00 2.961786e-16
## y NA 2.961786e-16 1.000000e+00
The correlation is \(2.961786 \times 10^{-16}\), essentially 0, which indicates no linear relationship between x and y. The summary statistics show that \(R^2\) is \(1.033 \times 10^{-31}\), meaning the regression explains essentially none of the variation in y, so the linear model is not a good fit at all.
# Residuals
plot(prob2$x, prob.res2,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
Since the graph is a parabola, the correlation is almost 0: a linear model simply cannot fit a parabola, even though y clearly depends on x.
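A small made-up illustration of this point (synthetic data, not the project file): for a symmetric parabola the correlation with x, and hence the fitted slope, is essentially zero even though y is completely determined by x.
# Made-up symmetric parabola: y depends on x exactly,
# yet the linear correlation and fitted slope are essentially zero
x_par <- seq(-10, 10, by = 0.5)
y_par <- x_par^2
cor(x_par, y_par)
coef(lm(y_par ~ x_par))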
# third problem
prob3 %>%
filter(group == 12) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum3 = lm(y ~ x, data = prob3)
prob.res3 = resid(sum3)
summary(sum3)
## Warning in summary.lm(sum3): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x, data = prob3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
## x 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
# Correlation Calculation
prob3 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1 1
## y NA 1 1
The correlation is 1.00, which indicates a perfect positive linear relationship between x and y. The summary statistics show that \(R^2\) is also equal to 1 (which makes sense, since \(1^2 = 1\)), meaning the regression explains 100% of the variation in y, so the linear model fits the data perfectly.
# Residuals
plot(prob3$x, prob.res3,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
At first glance all of the data points lie exactly on the fitted line, which explains the correlation and \(R^2\) of 1. However, the residual plot shows that there are still tiny differences between the predicted and actual values, on the order of \(10^{-16}\), i.e., rounding error rather than genuine scatter.
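One way to confirm that these leftover differences are just rounding error rather than real lack of fit is to look at the largest residual in absolute value (reusing prob.res3 from above):
# Largest absolute residual: on the order of 1e-16, i.e. floating-point
# rounding error rather than genuine scatter around the line
max(abs(prob.res3))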
# fourth problem
prob4 %>%
filter(group == 19) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum4 = lm(y ~ x, data = prob4)
prob.res4 = resid(sum4)
summary(sum4)
##
## Call:
## lm(formula = y ~ x, data = prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06170 -0.13830 0.08706 0.34014 0.85016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76533 0.09579 7.990 1.32e-10 ***
## x 0.18807 0.04557 4.127 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
## F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
# Correlation Calculation
prob4 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.0000000 0.4967192
## y NA 0.4967192 1.0000000
The correlation is 0.4967, which indicates a weak-to-moderate positive relationship between x and y. The summary statistics show that \(R^2\) is 0.2467, meaning the regression explains only 24.67% of the variation in y, so the linear model is not a very good fit.
# Residuals
plot(prob4$x, prob.res4,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
Anthropological researchers collected body measurements from 507 physically active individuals (247 men and 260 women). The data are contained in the file bdims.csv. A description of the variables can be found here.
Using R, perform the following tasks; give the most complete, specific answers you can given the data.
reg <- bdims %>%
dplyr::select(sho_gi, hgt)
ggplot(data = reg, aes(x = sho_gi, y = hgt)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(
x = "Shoulder Girth",
y = "Height")
## `geom_smooth()` using formula 'y ~ x'
The scatter plot with the fitted line shows a positive association: overall, as shoulder girth increases, height increases as well.
The slope of the regression line would decrease significantly.
Fitting the regression of height on shoulder girth with lm:
lm(hgt ~ sho_gi, data = reg)
##
## Call:
## lm(formula = hgt ~ sho_gi, data = reg)
##
## Coefficients:
## (Intercept) sho_gi
## 105.8325 0.6036
Intercept: if the shoulder girth of an individual were 0 cm, their expected height would be 105.8325 cm.
Slope: as the shoulder girth increases by 1 cm, the expected height increases by 0.6036 cm.
corr <- cor(reg$sho_gi, reg$hgt, method = c("pearson", "kendall", "spearman"))
corr^2
## [1] 0.4432035
0.6036*(100)+105.8325
## [1] 166.1925
No. Since the data were collected from physically active men and women, a one-year-old cannot be considered even remotely similar to them, so it would not be appropriate to use the model to predict the child's height; this would be extrapolation far outside the range of the data. If the model were used anyway, the predicted height would be 139.6341 cm, which essentially claims that the infant is already well over a meter tall.
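The same plug-in calculations can be done with predict(); below is a minimal sketch, assuming the one-year-old in the question has a shoulder girth of 56 cm (the value that reproduces the 139.6341 cm figure above); the name bdims_fit is just an illustrative choice:
# Fit the model once and predict heights for new shoulder girths
bdims_fit <- lm(hgt ~ sho_gi, data = reg)
predict(bdims_fit, newdata = data.frame(sho_gi = c(100, 56)))
# 100 cm girth -> about 166.19 cm; 56 cm girth (assumed for the infant) -> about 139.63 cm,
# the latter being extrapolation far outside the range of the data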
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
summary(lm(annual_murders_per_mil ~ perc_pov, data = murders))
##
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1663 -2.5613 -0.9552 2.8887 12.3475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.901 7.789 -3.839 0.0012 **
## perc_pov 2.559 0.390 6.562 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.6889
## F-statistic: 43.06 on 1 and 18 DF, p-value: 3.638e-06
Just referring to the plot and the summary above:
\(\hat{y} = 2.559x - 29.901\)
At a poverty percentage of 0%, the expected annual murders per million is -29.901.
As the poverty percentage increases by 1 percentage point, the expected annual murders per million increases by 2.559.
70.52% of the variation in the murder rate is explained by the regression on poverty percentage, which indicates that the model accounts for a decent portion of the variability.
Since the slope is positive, the correlation is \(r = \sqrt{0.7052} = 0.8398\).
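A short sketch checking this directly against cor() (reusing the murders data plotted above); the name murders_fit is just an illustrative choice:
# Correlation computed two ways: from R-squared (with the sign of the slope) and directly
murders_fit <- lm(annual_murders_per_mil ~ perc_pov, data = murders)
sqrt(summary(murders_fit)$r.squared) * sign(coef(murders_fit)["perc_pov"])
cor(murders$perc_pov, murders$annual_murders_per_mil)   # both approximately 0.8398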