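# install.packages("openintro")
library(tidyverse)  # read_csv, ggplot, and %>% used below
library(openintro)  # provides the murders data used in the last section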
This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.
In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1,x_2,\ldots,x_n\}\) and \(Y = \{y_1, y_2,\ldots,y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and \(\sigma^2 = \sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
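The definitions above can be checked numerically. The following is a minimal sketch on simulated data (the data and the names `b0`, `b1`, `y_hat`, and `e` are assumptions for illustration, not part of the project files). It computes the least squares estimates from the closed-form formulas, compares them with `lm`, and confirms that the residuals sum to essentially zero.

```r
# Simulated data for illustration only (not one of the project files)
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Predicted values and residuals, as defined above
y_hat <- b0 + b1 * x
e <- y - y_hat

# Compare with lm; the two sets of coefficients should agree up to rounding
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(sum(e))  # essentially zero
```

Writing the formulas out by hand is only meant to show what `lm` is doing; in practice `lm` is the more convenient route.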
Please work through this document before continuing with this project.
There are four data sets in the Project3 data folder in the Files section: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data are some of the groups from this data. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
setwd("C:/Users/tycho/Desktop/DATA101")
problem1 <- read_csv("problem1.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## group = col_double(),
## x = col_double(),
## y = col_double()
## )
view(problem1)
problem1 %>%
ggplot(aes(x, y)) +
geom_point()
(summary(lm(y ~ x, data = problem1)))
##
## Call:
## lm(formula = y ~ x, data = problem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.2072 -7.7558 0.3587 7.9145 23.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.52478 1.48347 9.117 1.12e-13 ***
## x -0.80463 0.04565 -17.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
## F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
problem1 %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
## 3:
model <- (summary(lm(y ~ x, data = problem1)))
res <- resid(model)
res
## 1 2 3 4 5
## 9.373492e+00 1.510940e+01 5.189054e+00 -1.148329e+01 -1.303100e+00
## 6 7 8 9 10
## 2.138027e+01 -3.665295e+00 8.983401e+00 1.505603e+01 1.027429e+01
## 11 12 13 14 15
## 3.587066e-01 3.006344e+00 -1.019769e+01 5.864940e+00 -1.006422e+01
## 16 17 18 19 20
## -9.743549e+00 -1.147619e+01 -7.360259e+00 1.338352e+01 -4.110787e+00
## 21 22 23 24 25
## -3.861121e-01 -5.139230e+00 7.009039e+00 1.147444e+01 -2.036012e+00
## 26 27 28 29 30
## 7.844302e+00 -4.134491e+00 1.334100e+01 7.765650e+00 1.258522e+01
## 31 32 33 34 35
## 7.984753e+00 3.160118e+00 -2.211238e+01 -7.529306e+00 6.271544e+00
## 36 37 38 39 40
## -4.899415e+00 -2.609127e+00 -4.409198e+00 1.982493e+01 -7.982239e+00
## 41 42 43 44 45
## -4.767796e+00 6.729302e-01 3.397794e+00 2.694373e+00 6.589179e-04
## 46 47 48 49 50
## -1.774166e+01 4.938532e+00 6.927625e+00 -5.332332e+00 8.510232e+00
## 51 52 53 54 55
## -8.489171e+00 5.058549e+00 -7.378020e+00 -2.040812e+01 1.420959e+01
## 56 57 58 59 60
## 8.802078e+00 2.109028e+00 1.867097e+01 -1.889653e+01 1.963526e+00
## 61 62 63 64 65
## -2.820716e+01 1.801657e+01 -9.958345e+00 -1.507428e+01 -9.797854e+00
## 66 67 68 69 70
## -1.349673e+01 -2.330287e+01 -1.118427e+01 -1.467768e+00 1.152775e+01
## 71 72 73 74 75
## 5.436499e+00 -1.428566e+01 -7.006722e+00 2.375299e+01 5.507030e+00
x <- problem1
y <- problem1
cor(y,x, method=c("pearson"), use = "complete.obs")
## Warning in cor(y, x, method = c("pearson"), use = "complete.obs"): the standard
## deviation is zero
## group x y
## group NA NA NA
## x NA 1.0000000 -0.8998508
## y NA -0.8998508 1.0000000
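The warning and the NA entries appear because `cor` was given whole data frames, and the constant `group` column has a standard deviation of zero. A minimal sketch on a made-up data frame (the frame `df` and its values are hypothetical) shows the difference between correlating two columns and correlating the whole frame:

```r
# Hypothetical data frame with a constant group column, like the project files
df <- data.frame(
  group = rep(1, 5),
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Correlation between the two columns of interest: a single number
r <- cor(df$x, df$y, method = "pearson")

# Correlation of the whole frame: the constant column produces NA entries
m <- suppressWarnings(cor(df))

print(r)
print(m)
```

Passing the two columns directly avoids the warning and returns the single value that the question asks for.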
Yes. The scatter about the line is small, so there is a strong negative linear relationship (r is about -0.90).
mean(problem1$x)
## [1] 15.13373
sd(problem1$x)
## [1] 28.95073
problem1 %>%
mutate(zscore = (x - mean(x))/sd(x))
mean(problem1$y)
## [1] 1.347782
sd(problem1$y)
## [1] 25.8871
problem1 %>%
mutate(zscore = (y - mean(y))/sd(y))
reg1 <- lm(y ~ x, problem1)
summary(reg1)
##
## Call:
## lm(formula = y ~ x, data = problem1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.2072 -7.7558 0.3587 7.9145 23.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.52478 1.48347 9.117 1.12e-13 ***
## x -0.80463 0.04565 -17.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
## F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
b = problem1
mean(b$y)
## [1] 1.347782
sd(b$y)
## [1] 25.8871
a = problem1
mean(a$x)
## [1] 15.13373
sd(a$x)
## [1] 28.95073
cor(b$y, a$x, method="pearson")
## [1] -0.8998508
\(R^2 = (-0.9)^2 = 0.81\)
An \(R^2\) value of 0.81 indicates that 81% of the variance in y is explained by the linear relationship with x.
setwd("C:/Users/tycho/Desktop/DATA101")
problem2 <- read_csv("problem2.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## group = col_double(),
## x = col_double(),
## y = col_double()
## )
view(problem2)
problem2 %>%
ggplot(aes(x, y)) +
geom_point()
(summary(lm(y ~ x, data = problem2)))
##
## Call:
## lm(formula = y ~ x, data = problem2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.967 -3.344 1.423 4.608 5.726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.967e+00 1.951e+00 4.597 0.000127 ***
## x 5.912e-17 3.716e-02 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared: 1.033e-31, Adjusted R-squared: -0.04348
## F-statistic: 2.375e-30 on 1 and 23 DF, p-value: 1
problem2 %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
model2 <- (summary(lm(y ~ x, data = problem2)))
res2 <- resid(model2)
res2
x <- problem2
y <- problem2
cor(y,x, method=c("pearson"), use = "complete.obs")
## Warning in cor(y, x, method = c("pearson"), use = "complete.obs"): the standard
## deviation is zero
## group x y
## group NA NA NA
## x NA 1.000000e+00 2.961786e-16
## y NA 2.961786e-16 1.000000e+00
No. A correlation of essentially zero (about 3e-16) indicates no linear relationship between x and y.
mean(problem2$x)
## [1] 45
sd(problem2$x)
## [1] 27.59925
problem2 %>%
mutate(zscore = (x - mean(x))/sd(x))
mean(problem2$y)
## [1] 8.96741
sd(problem2$y)
## [1] 4.918331
problem2 %>%
mutate(zscore = (y - mean(y))/sd(y))
reg2 <- lm(y ~ x, problem2)
summary(reg2)
##
## Call:
## lm(formula = y ~ x, data = problem2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.967 -3.344 1.423 4.608 5.726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.967e+00 1.951e+00 4.597 0.000127 ***
## x 5.912e-17 3.716e-02 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared: 1.033e-31, Adjusted R-squared: -0.04348
## F-statistic: 2.375e-30 on 1 and 23 DF, p-value: 1
b = problem2
mean(b$y)
## [1] 8.96741
sd(b$y)
## [1] 4.918331
a = problem2
mean(a$x)
## [1] 45
sd(a$x)
## [1] 27.59925
cor(b$y, a$x, method="pearson")
## [1] 2.961786e-16
\(R^2 = (2.96 \times 10^{-16})^2 \approx 8.76 \times 10^{-32}\), which is effectively zero.
setwd("C:/Users/tycho/Desktop/DATA101")
problem3 <- read_csv("problem3.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## group = col_double(),
## x = col_double(),
## y = col_double()
## )
view(problem3)
(summary(lm(y ~ x, data = problem3)))
## Warning in summary.lm(lm(y ~ x, data = problem3)): essentially perfect fit:
## summary may be unreliable
##
## Call:
## lm(formula = y ~ x, data = problem3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
## x 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
problem3 %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
model3 <- (summary(lm(y ~ x, data = problem3)))
## Warning in summary.lm(lm(y ~ x, data = problem3)): essentially perfect fit:
## summary may be unreliable
res3 <- resid(model3)
res3
x <- problem3
y <- problem3
cor(y,x, method=c("pearson"), use = "complete.obs")
## Warning in cor(y, x, method = c("pearson"), use = "complete.obs"): the standard
## deviation is zero
## group x y
## group NA NA NA
## x NA 1 1
## y NA 1 1
The calculation indicates a perfect linear relationship between x and y: r = 1, so there is no scatter about the line.
mean(problem3$x)
## [1] 0.4641447
sd(problem3$x)
## [1] 0.3105528
problem3 %>%
mutate(zscore = (x - mean(x))/sd(x))
mean(problem3$y)
## [1] 0.4641447
sd(problem3$y)
## [1] 0.3105528
problem3 %>%
mutate(zscore = (y - mean(y))/sd(y))
reg3 <- lm(y ~ x, problem3)
summary(reg3)
## Warning in summary.lm(reg3): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x, data = problem3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
## x 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
b = problem3
mean(b$y)
## [1] 0.4641447
sd(b$y)
## [1] 0.3105528
a = problem3
mean(a$x)
## [1] 0.4641447
sd(a$x)
## [1] 0.3105528
cor(b$y, a$x, method="pearson")
## [1] 1
\(R^2 = 1^2 = 1\)
setwd("C:/Users/tycho/Desktop/DATA101")
problem4 <- read_csv("problem4.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## group = col_double(),
## x = col_double(),
## y = col_double()
## )
view(problem4)
(summary(lm(y ~ x, data = problem4)))
##
## Call:
## lm(formula = y ~ x, data = problem4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06170 -0.13830 0.08706 0.34014 0.85016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76533 0.09579 7.990 1.32e-10 ***
## x 0.18807 0.04557 4.127 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
## F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
problem4 %>%
ggplot(aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
## `geom_smooth()` using formula 'y ~ x'
model4 <- (summary(lm(y ~ x, data = problem4)))
res4 <- resid(model4)
res4
x <- problem4
y <- problem4
cor(y,x, method=c("pearson"), use = "complete.obs")
## Warning in cor(y, x, method = c("pearson"), use = "complete.obs"): the standard
## deviation is zero
## group x y
## group NA NA NA
## x NA 1.0000000 0.4967192
## y NA 0.4967192 1.0000000
The data suggest a linear relationship, but it is not a particularly strong one (r is about 0.50).
mean(problem4$x)
## [1] 1.460012
sd(problem4$x)
## [1] 1.526316
problem4 %>%
mutate(zscore = (x - mean(x))/sd(x))
mean(problem4$y)
## [1] 1.039912
sd(problem4$y)
## [1] 0.5778975
problem4 %>%
mutate(zscore = (y - mean(y))/sd(y))
reg4 <- lm(y ~ x, problem4)
summary(reg4)
##
## Call:
## lm(formula = y ~ x, data = problem4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06170 -0.13830 0.08706 0.34014 0.85016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76533 0.09579 7.990 1.32e-10 ***
## x 0.18807 0.04557 4.127 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
## F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
b = problem4
mean(b$y)
## [1] 1.039912
sd(b$y)
## [1] 0.5778975
a = problem4
mean(a$x)
## [1] 1.460012
sd(a$x)
## [1] 1.526316
cor(b$y, a$x, method="pearson")
## [1] 0.4967192
\(R^2 = 0.497^2 = 0.247\)
An \(R^2\) value of 0.247 indicates that 24.7% of the variance in y is explained by the linear relationship with x.
Create a scatter plot of x and y (x on the horizontal axis.)
Using lm report the coefficients, and plot the regression line over the scatter plot.
Calculate the residuals … x. Record any observations you have in this document.
Calculate the correlation between x and y.
Does this calculation indicate a strong linear relationship between x and y?
Standardize the x and y data. This means calculating the z-score for each \(x_i\) and \(y_i\); it is possible to do this with one line of R code.
Hint: the z-score of a vector of data, x, is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\) where \(s_x\) is the standard deviation of \(x\).
Using lm calculate the regression coefficients. What do you notice?
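As a rough illustration of what to look for, the sketch below standardizes simulated data (the data and the names `zx`, `zy`, and `fit_z` are assumptions, not the project files) and fits a regression on the z-scores. Under these assumptions the intercept comes out essentially zero and the slope equals the correlation coefficient.

```r
# Simulated data for illustration only
set.seed(2)
x <- rnorm(40, mean = 10, sd = 3)
y <- 5 - 0.8 * x + rnorm(40, sd = 2)

# z-scores in one line each: (value - mean) / standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Regression on the standardized data
fit_z <- lm(zy ~ zx)

print(coef(fit_z))  # intercept near 0, slope equal to the correlation
print(cor(x, y))
```

Base R's `scale()` performs the same centering and scaling, so `as.numeric(scale(x))` should give the same vector as `zx`.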
Calculate Total Variation and \(R^2\)
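One possible way to organize this calculation is sketched below on simulated data (the data and the names `sst`, `sse`, and `r2` are assumptions for illustration). Total variation is taken to be the sum of squared deviations of y from its mean, and \(R^2\) is one minus the share of that variation left in the residuals.

```r
# Simulated data for illustration only
set.seed(3)
x <- runif(60, min = 0, max = 20)
y <- 1 + 0.5 * x + rnorm(60, sd = 2)
fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)  # total variation in y
sse <- sum(resid(fit)^2)     # variation left unexplained by the line
r2 <- 1 - sse / sst          # proportion of variation explained

print(r2)
print(summary(fit)$r.squared)  # same value as reported by lm
print(cor(x, y)^2)             # same value again for simple regression
```

For a regression with a single predictor, all three printed values should agree, which is why squaring the correlation is a valid shortcut here.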
Assess your results.
Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women.) The data are contained in the file bdims.csv. A description of the variables can be found here
Using R, perform the following tasks. Give the most complete and specific answers you can, given the data.
setwd("C:/Users/tycho/Desktop/DATA101")
bdims <- read_csv("bdims.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double()
## )
## i Use `spec()` for the full column specifications.
view(bdims)
bdims %>%
ggplot() +
geom_point(mapping = aes(x = sho_gi, y = hgt)) +
geom_smooth(mapping = aes(x = sho_gi, y = hgt), method = lm)
## `geom_smooth()` using formula 'y ~ x'
gh <- lm(hgt ~ sho_gi, bdims)
summary(gh)
##
## Call:
## lm(formula = hgt ~ sho_gi, data = bdims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.2297 -4.7976 -0.1142 4.7885 21.0979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.83246 3.27245 32.34 <2e-16 ***
## sho_gi 0.60364 0.03011 20.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.026 on 505 degrees of freedom
## Multiple R-squared: 0.4432, Adjusted R-squared: 0.4421
## F-statistic: 402 on 1 and 505 DF, p-value: < 2.2e-16
Fitted line: hgt = 0.60364*sho_gi + 105.83246. The median residual is close to 0, and the minimum and maximum residuals are nearly the same in magnitude. R-squared is on the low side (44.32%).
How would the relationship change if shoulder girth were measured in inches while the units of height remained in centimeters? Every girth value would be divided by 2.54, so the points would be compressed toward the left of the plot. The slope would be multiplied by 2.54 (about 1.53 cm of height per inch of girth), while the intercept, the correlation, and \(R^2\) would be unchanged, so the strength of the relationship stays the same.
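One way to check how a change of units affects the fit is to refit the model after converting the predictor. The sketch below uses simulated measurements (the data and the names `girth_cm`, `girth_in`, and `height_cm` are assumptions, not the bdims data) and compares the two fits.

```r
# Simulated measurements for illustration only (not the bdims data)
set.seed(4)
girth_cm <- rnorm(100, mean = 108, sd = 10)
height_cm <- 106 + 0.6 * girth_cm + rnorm(100, sd = 7)

# Same girths expressed in inches
girth_in <- girth_cm / 2.54

fit_cm <- lm(height_cm ~ girth_cm)
fit_in <- lm(height_cm ~ girth_in)

print(coef(fit_cm))
print(coef(fit_in))  # same intercept, slope multiplied by 2.54
print(c(cor(girth_cm, height_cm), cor(girth_in, height_cm)))  # identical
```

Rescaling the predictor changes the slope by the conversion factor but leaves the intercept and the correlation alone.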
Write the equation of the regression line for predicting height. How would you find the coefficients without using lm?
h = bdims
mean(h$hgt)
## [1] 171.1438
sd(h$hgt)
## [1] 9.407205
s = bdims
mean(s$sho_gi)
## [1] 108.1951
sd(s$sho_gi)
## [1] 10.37483
cor(s$sho_gi, h$hgt, method="pearson")
## [1] 0.6657353
slope <- 0.67*9.41/10.37
slope
## [1] 0.6079749
intercept <- 171.14-0.6079749*108.2
intercept
## [1] 105.3571
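The same calculation can be repeated with the unrounded summary statistics printed above, as a sketch of how rounding affects the result. The variable names (`r`, `s_y`, `s_x`, `y_bar`, `x_bar`) are assumptions for illustration; the numbers are copied from the output above.

```r
# Summary statistics copied from the output above
r     <- 0.6657353  # correlation between sho_gi and hgt
s_y   <- 9.407205   # standard deviation of hgt
s_x   <- 10.37483   # standard deviation of sho_gi
y_bar <- 171.1438   # mean of hgt
x_bar <- 108.1951   # mean of sho_gi

# slope = r * s_y / s_x; intercept = y_bar - slope * x_bar
slope <- r * s_y / s_x
intercept <- y_bar - slope * x_bar

print(slope)      # close to the 0.60364 reported by lm
print(intercept)  # close to the 105.83246 reported by lm
```

Keeping the extra digits brings the hand calculation into agreement with the `lm` coefficients.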
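Equation: Height = 0.604*girth + 105.83 (coefficients from lm; the hand calculation with rounded inputs gives the slightly different values 0.608 and 105.36).
Interpret the slope and intercept in this context. Slope: for each additional centimeter of shoulder girth, predicted height rises by about 0.604 cm. Intercept: the predicted height (105.83 cm) at a shoulder girth of 0 cm, which has no practical meaning because no such girth occurs in the data.
Calculate \(R^2\) of the regression line for predicting height from shoulder girth. \(R^2 = 0.6657^2 \approx 0.443\), which matches the Multiple R-squared of 0.4432 reported by lm.
A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using this model. 0.604*100 + 105.83 = 166.23 cm.
If the selected student is actually 160 cm tall, calculate the residual and explain its meaning. 160 - 166.23 = -6.23 (the model overestimates the student's height by 6.23 cm).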
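The prediction and residual can also be computed in R. The sketch below uses the unrounded coefficients reported by `summary(gh)`; the helper `predict_height` and the variable names are hypothetical, introduced only for this illustration.

```r
# Coefficients as reported by summary(gh)
b0 <- 105.83246  # intercept
b1 <- 0.60364    # slope for sho_gi

# Hypothetical helper: predicted height (cm) for a given shoulder girth (cm)
predict_height <- function(girth_cm) {
  b0 + b1 * girth_cm
}

pred <- predict_height(100)
residual_160 <- 160 - pred  # actual minus predicted

print(pred)
print(residual_160)  # negative: the model overestimates this student's height
```

With the fitted object available, `predict(gh, newdata = data.frame(sho_gi = 100))` should give the same prediction without copying coefficients by hand.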
A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child? No. A girth of 56 cm is far below the smallest shoulder girth in the data (85.9 cm), so the prediction would be an extrapolation outside the range the model was fitted on.
m=bdims
min(m$sho_gi)
## [1] 85.9
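A simple range check makes the extrapolation concern concrete. This is a sketch only: the helper `within_observed_range` is hypothetical, and the girth values are simulated, with the smallest one set to the 85.9 cm minimum reported above.

```r
# Hypothetical helper: TRUE when new_x lies inside the observed range of x
within_observed_range <- function(new_x, x) {
  new_x >= min(x) & new_x <= max(x)
}

# Simulated girths for illustration; the minimum is set to 85.9 cm
set.seed(5)
girth <- c(85.9, runif(99, min = 90, max = 130))

print(within_observed_range(56, girth))   # 56 cm is below every observed girth
print(within_observed_range(100, girth))  # 100 cm lies inside the observed range
```

A value that fails this check is one the model has no data to support, which is the situation for the one year old.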
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
summary(lm(annual_murders_per_mil~ perc_pov, data = murders))
##
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1663 -2.5613 -0.9552 2.8887 12.3475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.901 7.789 -3.839 0.0012 **
## perc_pov 2.559 0.390 6.562 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.6889
## F-statistic: 43.06 on 1 and 18 DF, p-value: 3.638e-06
Just referring to the plot and the summary above:
y = murders
mean(y$annual_murders_per_mil)
## [1] 20.57
sd(y$annual_murders_per_mil)
## [1] 9.881407
x = murders
mean(x$perc_pov)
## [1] 19.72
sd(x$perc_pov)
## [1] 3.242254
cor(y$annual_murders_per_mil, x$perc_pov, method="pearson")
## [1] 0.8397782
Equation: y = 2.56*x-29.91
intercept <- 20.57-2.56*19.72
intercept
## [1] -29.9132
slope <- 0.84*9.88/3.24
slope
## [1] 2.561481
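Interpret \(R^2\): \(R^2 = 0.84^2 = 0.7056\), so about 70.5% of the variation in annual murders per million is explained by the linear relationship with percent poverty.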
Calculate the correlation coefficient
cor(y$annual_murders_per_mil, x$perc_pov)
## [1] 0.8397782