#install.packages("openintro")

Introduction

This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.

In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.

Summary and Exercises

Single Variable Regression

We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1,x_2,\ldots,x_n\}\) and \(Y = \{y_1, y_2,\ldots,y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and \(\sigma^2 = \sum_{i=1}^{n}\epsilon_i ^2\) is minimized. This is called least squares regression.

Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.

The residuals, \(e_i\), are the differences between the actual values and the predicted values. \[e_i = y_i - \hat{y_i}\]

Note: the residuals are estimates of the \(\epsilon_i\) in the model.
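The definitions above can be tried out on a small example. The following is a minimal sketch, assuming simulated data (the seed and the names sim_x, sim_y, b0_hat, b1_hat are illustrative and not part of the project files). It uses the standard closed-form least squares estimates, slope = r * s_y / s_x and intercept = mean(y) - slope * mean(x), and compares them with lm.

```r
# Simulated data (an assumption for illustration; not one of the project files)
set.seed(1)
sim_x <- runif(50, 0, 10)
sim_y <- 2 + 3 * sim_x + rnorm(50, sd = 1.5)

# Closed-form least squares estimates
b1_hat <- cor(sim_x, sim_y) * sd(sim_y) / sd(sim_x)
b0_hat <- mean(sim_y) - b1_hat * mean(sim_x)

# Predicted values and residuals, as defined above
y_hat <- b0_hat + b1_hat * sim_x
e <- sim_y - y_hat

# lm() should return the same two coefficients
fit <- lm(sim_y ~ sim_x)
coef(fit)
c(b0_hat, b1_hat)

# The residuals sum to zero, up to floating-point error
sum(e)
```

The two coefficient vectors should agree, which is one way to see what lm computes for a single predictor.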

Examining Residual variance.

Please work through this document before continuing with this project.

Section I Working with Simulated scatter plots.

There are four data sets in the Project3 data folder in the files section: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data are some of the groups from this data. You will need to load the files into your project using read_csv.

You will also need to read in the file `bdims.csv` to do the problems in Section II.

Problem 1:

1:

library(tidyverse)  # provides read_csv, %>%, and ggplot
setwd("C:/Users/tycho/Desktop/DATA101")
problem1 <- read_csv("problem1.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   group = col_double(),
##   x = col_double(),
##   y = col_double()
## )

view(problem1)

problem1 %>%
  ggplot(aes(x, y)) +
  geom_point()

2:

(summary(lm(y ~ x, data = problem1)))

## 
## Call:
## lm(formula = y ~ x, data = problem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.2072  -7.7558   0.3587   7.9145  23.7530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.52478    1.48347   9.117 1.12e-13 ***
## x           -0.80463    0.04565 -17.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.8071 
## F-statistic: 310.7 on 1 and 73 DF,  p-value: < 2.2e-16

problem1 %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

## `geom_smooth()` using formula 'y ~ x'

3:

model <- (summary(lm(y ~ x, data = problem1)))
res <- resid(model)
res

##             1             2             3             4             5 
##  9.373492e+00  1.510940e+01  5.189054e+00 -1.148329e+01 -1.303100e+00 
##             6             7             8             9            10 
##  2.138027e+01 -3.665295e+00  8.983401e+00  1.505603e+01  1.027429e+01 
##            11            12            13            14            15 
##  3.587066e-01  3.006344e+00 -1.019769e+01  5.864940e+00 -1.006422e+01 
##            16            17            18            19            20 
## -9.743549e+00 -1.147619e+01 -7.360259e+00  1.338352e+01 -4.110787e+00 
##            21            22            23            24            25 
## -3.861121e-01 -5.139230e+00  7.009039e+00  1.147444e+01 -2.036012e+00 
##            26            27            28            29            30 
##  7.844302e+00 -4.134491e+00  1.334100e+01  7.765650e+00  1.258522e+01 
##            31            32            33            34            35 
##  7.984753e+00  3.160118e+00 -2.211238e+01 -7.529306e+00  6.271544e+00 
##            36            37            38            39            40 
## -4.899415e+00 -2.609127e+00 -4.409198e+00  1.982493e+01 -7.982239e+00 
##            41            42            43            44            45 
## -4.767796e+00  6.729302e-01  3.397794e+00  2.694373e+00  6.589179e-04 
##            46            47            48            49            50 
## -1.774166e+01  4.938532e+00  6.927625e+00 -5.332332e+00  8.510232e+00 
##            51            52            53            54            55 
## -8.489171e+00  5.058549e+00 -7.378020e+00 -2.040812e+01  1.420959e+01 
##            56            57            58            59            60 
##  8.802078e+00  2.109028e+00  1.867097e+01 -1.889653e+01  1.963526e+00 
##            61            62            63            64            65 
## -2.820716e+01  1.801657e+01 -9.958345e+00 -1.507428e+01 -9.797854e+00 
##            66            67            68            69            70 
## -1.349673e+01 -2.330287e+01 -1.118427e+01 -1.467768e+00  1.152775e+01 
##            71            72            73            74            75 
##  5.436499e+00 -1.428566e+01 -7.006722e+00  2.375299e+01  5.507030e+00
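Step 3 asks for a plot of the residuals versus x, not only the printed values. A minimal sketch of such a plot follows, assuming a simulated stand-in for the data (the data frame sim and its parameters are hypothetical; with the real file, problem1 would take its place). Base R graphics are used so the sketch needs no extra packages.

```r
# Simulated stand-in for problem1 (an assumption; the real data come from read_csv)
set.seed(2)
sim <- data.frame(x = runif(75, -30, 60))
sim$y <- 13.5 - 0.8 * sim$x + rnorm(75, sd = 11)

# Keep the lm object itself and take the residuals from it
fit <- lm(y ~ x, data = sim)
sim$res <- resid(fit)

# Residuals versus x, with a dashed reference line at zero
plot(sim$x, sim$res, xlab = "x", ylab = "residual")
abline(h = 0, lty = 2)
```

A residual plot with no visible pattern and roughly constant spread around zero supports the straight-line model; curvature or a funnel shape suggests it does not fit well.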

4:

cor(problem1$x, problem1$y, method = "pearson")

## [1] -0.8998508

5:

Yes. The correlation is about -0.90, close to -1, and the scatter about the line is small, so there is a strong negative linear relationship.

6:

mean(problem1$x)

## [1] 15.13373

sd(problem1$x)

## [1] 28.95073

problem1 %>%
  mutate(zscore = (x - mean(x))/sd(x))

mean(problem1$y)

## [1] 1.347782

sd(problem1$y)

## [1] 25.8871

problem1 %>%
  mutate(zscore = (y - mean(y))/sd(y))

7:

reg1 <- lm(y ~ x, problem1)
summary(reg1)

## 
## Call:
## lm(formula = y ~ x, data = problem1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.2072  -7.7558   0.3587   7.9145  23.7530 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.52478    1.48347   9.117 1.12e-13 ***
## x           -0.80463    0.04565 -17.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.8071 
## F-statistic: 310.7 on 1 and 73 DF,  p-value: < 2.2e-16
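Step 7 is meant to be run on the normalized data from step 6, which is where the thing to notice appears. The sketch below assumes simulated data (the data frame sim and the column names zx and zy are hypothetical) and fits lm to the z-scores.

```r
# Simulated data (an assumption for illustration)
set.seed(3)
sim <- data.frame(x = rnorm(60, mean = 15, sd = 29))
sim$y <- 13.5 - 0.8 * sim$x + rnorm(60, sd = 11)

# z-scores: subtract the mean, then divide by the standard deviation
sim$zx <- (sim$x - mean(sim$x)) / sd(sim$x)
sim$zy <- (sim$y - mean(sim$y)) / sd(sim$y)

# Regression on the standardized variables
z_fit <- lm(zy ~ zx, data = sim)
coef(z_fit)

# Compare the slope with the correlation of the original variables
cor(sim$x, sim$y)
```

After standardizing, both standard deviations are 1 and both means are 0, so the slope r * s_y / s_x reduces to r and the intercept to 0. For Problem 1 the standardized slope should therefore equal the correlation, about -0.90.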

8:

b = problem1
mean(b$y)

## [1] 1.347782

sd(b$y)

## [1] 25.8871

a = problem1
mean(a$x)

## [1] 15.13373

sd(a$x)

## [1] 28.95073

cor(b$y, a$x, method="pearson")

## [1] -0.8998508

R^2 = (-0.9)^2 = 0.81

9:

An R^2 value of 0.81 indicates that 81% of the variance of y is explained by the linear relationship with x.
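Step 8 also asks for the total variation. As a hedged sketch, it can be recovered from numbers already reported for Problem 1: the total variation is (n - 1) * s_y^2 and the unexplained variation is (n - 2) times the squared residual standard error. The variable names below are illustrative.

```r
# Values reported in the Problem 1 output above
n   <- 75        # 73 residual degrees of freedom + 2 estimated coefficients
s_y <- 25.8871   # sd(problem1$y)
rse <- 11.37     # residual standard error from summary(lm(...))

sst <- (n - 1) * s_y^2   # total variation: sum of (y_i - mean(y))^2
sse <- (n - 2) * rse^2   # unexplained variation: sum of squared residuals
r2  <- 1 - sse / sst     # share of the total variation explained by the line

c(total_variation = sst, unexplained = sse, r_squared = r2)
```

The result should agree with the Multiple R-squared of 0.8097 in the summary, up to the rounding of the inputs.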

Problem 2:

1:

setwd("C:/Users/tycho/Desktop/DATA101")
problem2 <- read_csv("problem2.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   group = col_double(),
##   x = col_double(),
##   y = col_double()
## )

view(problem2)

problem2 %>%
  ggplot(aes(x, y)) +
  geom_point()

2:

(summary(lm(y ~ x, data = problem2)))

## 
## Call:
## lm(formula = y ~ x, data = problem2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.967 -3.344  1.423  4.608  5.726 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.967e+00  1.951e+00   4.597 0.000127 ***
## x           5.912e-17  3.716e-02   0.000 1.000000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared:  1.033e-31,  Adjusted R-squared:  -0.04348 
## F-statistic: 2.375e-30 on 1 and 23 DF,  p-value: 1

problem2 %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

## `geom_smooth()` using formula 'y ~ x'

3:

model2 <- (summary(lm(y ~ x, data = problem2)))
res2 <- resid(model2)  # model2, not model: model holds the Problem 1 fit
res2

4:

cor(problem2$x, problem2$y, method = "pearson")

## [1] 2.961786e-16

5:

No. The correlation is essentially 0, so the calculation does not indicate a linear relationship between x and y.

6:

mean(problem2$x)

## [1] 45

sd(problem2$x)

## [1] 27.59925

problem2 %>%
  mutate(zscore = (x - mean(x))/sd(x))

mean(problem2$y)

## [1] 8.96741

sd(problem2$y)

## [1] 4.918331

problem2 %>%
  mutate(zscore = (y - mean(y))/sd(y))

7:

reg2 <- lm(y ~ x, problem2)
summary(reg2)

## 
## Call:
## lm(formula = y ~ x, data = problem2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.967 -3.344  1.423  4.608  5.726 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.967e+00  1.951e+00   4.597 0.000127 ***
## x           5.912e-17  3.716e-02   0.000 1.000000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared:  1.033e-31,  Adjusted R-squared:  -0.04348 
## F-statistic: 2.375e-30 on 1 and 23 DF,  p-value: 1

8:

b = problem2
mean(b$y)

## [1] 8.96741

sd(b$y)

## [1] 4.918331

a = problem2
mean(a$x)

## [1] 45

sd(a$x)

## [1] 27.59925

cor(b$y, a$x, method="pearson")

## [1] 2.961786e-16

R^2 = (2.96e-16)^2 = 8.76e-32

9:

An R^2 value of essentially 0 means the regression line explains none of the variance of y. There is no linear relationship between x and y, although this alone does not rule out a nonlinear one.
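A correlation near zero only speaks to linear association. The sketch below uses hypothetical data (x_sym and y_sym are made up for illustration and are not the Problem 2 data) in which y is completely determined by x, yet the correlation and R^2 are essentially zero.

```r
# Hypothetical data: y depends on x exactly, but not linearly
x_sym <- -5:5
y_sym <- x_sym^2

# Correlation and R^2 measure only the linear part of the relationship
r_sym  <- cor(x_sym, y_sym)
r2_sym <- summary(lm(y_sym ~ x_sym))$r.squared

c(correlation = r_sym, r_squared = r2_sym)
```

This is why the scatter plot and the residual plot should be read alongside the correlation.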

Problem 3:

1:

setwd("C:/Users/tycho/Desktop/DATA101")
problem3 <- read_csv("problem3.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   group = col_double(),
##   x = col_double(),
##   y = col_double()
## )

view(problem3)

2:

(summary(lm(y ~ x, data = problem3)))

## Warning in summary.lm(lm(y ~ x, data = problem3)): essentially perfect fit:
## summary may be unreliable

## 
## Call:
## lm(formula = y ~ x, data = problem3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.956e-16 -2.600e-19  8.510e-18  1.609e-17  5.663e-17 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.256e-16  1.896e-17 -6.626e+00 2.76e-08 ***
## x            1.000e+00  3.405e-17  2.937e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.625e+32 on 1 and 48 DF,  p-value: < 2.2e-16

problem3 %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

## `geom_smooth()` using formula 'y ~ x'

3:

model3 <- (summary(lm(y ~ x, data = problem3)))

## Warning in summary.lm(lm(y ~ x, data = problem3)): essentially perfect fit:
## summary may be unreliable

res3 <- resid(model3)  # model3, not model: model holds the Problem 1 fit
res3

4:

cor(problem3$x, problem3$y, method = "pearson")

## [1] 1

5:

Yes. The correlation is 1, so the points fall on a straight line with no spread about it.

6:

mean(problem3$x)

## [1] 0.4641447

sd(problem3$x)

## [1] 0.3105528

problem3 %>%
  mutate(zscore = (x - mean(x))/sd(x))

mean(problem3$y)

## [1] 0.4641447

sd(problem3$y)

## [1] 0.3105528

problem3 %>%
  mutate(zscore = (y - mean(y))/sd(y))

7:

reg3 <- lm(y ~ x, problem3)
summary(reg3)

## Warning in summary.lm(reg3): essentially perfect fit: summary may be unreliable

## 
## Call:
## lm(formula = y ~ x, data = problem3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.956e-16 -2.600e-19  8.510e-18  1.609e-17  5.663e-17 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.256e-16  1.896e-17 -6.626e+00 2.76e-08 ***
## x            1.000e+00  3.405e-17  2.937e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.625e+32 on 1 and 48 DF,  p-value: < 2.2e-16

8:

b = problem3
mean(b$y)

## [1] 0.4641447

sd(b$y)

## [1] 0.3105528

a = problem3
mean(a$x)

## [1] 0.4641447

sd(a$x)

## [1] 0.3105528

cor(b$y, a$x, method="pearson")

## [1] 1

R^2 = 1^2 = 1

9:

An R^2 value of 1 means every point lies exactly on the regression line, so 100% of the variance of y is explained by the linear relationship with x.

Problem 4:

1:

setwd("C:/Users/tycho/Desktop/DATA101")
problem4 <- read_csv("problem4.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   group = col_double(),
##   x = col_double(),
##   y = col_double()
## )

view(problem4)

2:

(summary(lm(y ~ x, data = problem4)))

## 
## Call:
## lm(formula = y ~ x, data = problem4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06170 -0.13830  0.08706  0.34014  0.85016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.76533    0.09579   7.990 1.32e-10 ***
## x            0.18807    0.04557   4.127 0.000133 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared:  0.2467, Adjusted R-squared:  0.2322 
## F-statistic: 17.03 on 1 and 52 DF,  p-value: 0.0001332

problem4 %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red")

## `geom_smooth()` using formula 'y ~ x'

3:

model4 <- (summary(lm(y ~ x, data = problem4)))
res4 <- resid(model4)  # model4, not model: model holds the Problem 1 fit
res4

4:

cor(problem4$x, problem4$y, method = "pearson")

## [1] 0.4967192

5:

The data show a positive linear relationship (correlation of about 0.50), but it is only a moderate one.

6:

mean(problem4$x)

## [1] 1.460012

sd(problem4$x)

## [1] 1.526316

problem4 %>%
  mutate(zscore = (x - mean(x))/sd(x))

mean(problem4$y)

## [1] 1.039912

sd(problem4$y)

## [1] 0.5778975

problem4 %>%
  mutate(zscore = (y - mean(y))/sd(y))

7:

reg4 <- lm(y ~ x, problem4)
summary(reg4)

## 
## Call:
## lm(formula = y ~ x, data = problem4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06170 -0.13830  0.08706  0.34014  0.85016 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.76533    0.09579   7.990 1.32e-10 ***
## x            0.18807    0.04557   4.127 0.000133 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared:  0.2467, Adjusted R-squared:  0.2322 
## F-statistic: 17.03 on 1 and 52 DF,  p-value: 0.0001332

8:

b = problem4
mean(b$y)

## [1] 1.039912

sd(b$y)

## [1] 0.5778975

a = problem4
mean(a$x)

## [1] 1.460012

sd(a$x)

## [1] 1.526316

cor(b$y, a$x, method="pearson")

## [1] 0.4967192

R^2 = 0.497^2 = 0.247

9:

An R^2 value of 0.247 indicates that 24.7% of the variance of y is explained by the linear relationship with x.

For each of the four files:

Plot the scatter plot of x and y (x on the horizontal axis).
Fit a regression model using lm, report the coefficients, and plot the regression line over the scatter plot.
Plot the residuals versus x. Record any observations you have in this document.
Calculate the correlation between x and y.
Does the calculation indicate a strong linear relationship between x and y?
Normalize the x and y data. This means calculating the z-score for each \(x_i\) and \(y_i\); it is possible to do this with one line of R code.

Hint: the z-score of a vector of data, \(x\), is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\) where \(s_x\) is the standard deviation of \(x\).

Using lm, calculate the regression coefficients. What do you notice?
Calculate Total Variation and \(R^2\).
Assess your results.

Section II Body measurement study

Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women.) The data are contained in the file bdims.csv. A description of the variables can be found here

Using R, perform the following tasks. Give the most complete and specific answers you can, given the data.

setwd("C:/Users/tycho/Desktop/DATA101")
bdims <- read_csv("bdims.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.

view(bdims)

Describe the relationship between shoulder girth and height.

bdims %>% 
  ggplot() +
    geom_point(mapping = aes(x = sho_gi, y = hgt)) +
    geom_smooth(mapping = aes(x = sho_gi, y = hgt), method = lm)

## `geom_smooth()` using formula 'y ~ x'

gh <- lm(hgt ~ sho_gi, bdims)
summary(gh)

## 
## Call:
## lm(formula = hgt ~ sho_gi, data = bdims)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.2297  -4.7976  -0.1142   4.7885  21.0979 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.83246    3.27245   32.34   <2e-16 ***
## sho_gi        0.60364    0.03011   20.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.026 on 505 degrees of freedom
## Multiple R-squared:  0.4432, Adjusted R-squared:  0.4421 
## F-statistic:   402 on 1 and 505 DF,  p-value: < 2.2e-16

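The fitted line is hgt = 105.83246 + 0.60364*sho_gi, so height increases with shoulder girth. The median residual is close to 0, and the minimum and maximum residuals are nearly the same in magnitude, so the residuals are roughly symmetric about the line. R-squared is on the low side (44.32%), so the linear relationship is only moderately strong.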

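How would the relationship change if shoulder girth were measured in inches while the units of height remained in centimeters? Each shoulder girth value would be divided by 2.54, so the points would move to the left and the horizontal axis would be compressed. The slope would be multiplied by 2.54 (about 1.53 cm of height per inch of girth), while the intercept, the correlation, and R^2 would stay the same, so the direction and strength of the relationship do not change.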
Write the equation of the regression line for predicting height. How would you find the coefficients without using lm?
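Before the hand calculation, the unit-change question above can be checked numerically. This is a minimal sketch on simulated measurements (girth_cm, height_cm, and their parameters are hypothetical stand-ins for the bdims columns): converting the predictor from centimeters to inches rescales the slope by 2.54 and leaves the intercept and the correlation unchanged.

```r
# Simulated measurements (an assumption; not the bdims data)
set.seed(4)
girth_cm  <- rnorm(100, mean = 108, sd = 10)
height_cm <- 106 + 0.6 * girth_cm + rnorm(100, sd = 7)
girth_in  <- girth_cm / 2.54   # the same girths expressed in inches

fit_cm <- lm(height_cm ~ girth_cm)
fit_in <- lm(height_cm ~ girth_in)

# Slope per inch is 2.54 times the slope per centimeter; intercepts match
coef(fit_cm)
coef(fit_in)

# The correlation does not depend on the units of either variable
cor(girth_cm, height_cm)
cor(girth_in, height_cm)
```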

h = bdims
mean(h$hgt)

## [1] 171.1438

sd(h$hgt)

## [1] 9.407205

s = bdims
mean(s$sho_gi)

## [1] 108.1951

sd(s$sho_gi)

## [1] 10.37483

cor(s$sho_gi, h$hgt, method="pearson")

## [1] 0.6657353

slope <- 0.67*9.41/10.37
slope

## [1] 0.6079749

intercept <- 171.14-0.6079749*108.2
intercept

## [1] 105.3571
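The hand calculation above rounds the correlation to 0.67 before multiplying, which pushes the slope up to 0.608 and the intercept down to 105.36. As a sketch, carrying the reported summary statistics at full precision should reproduce the lm coefficients; the variable names are illustrative.

```r
# Summary statistics reported above for bdims
r      <- 0.6657353   # cor(sho_gi, hgt)
sd_hgt <- 9.407205    # sd(hgt)
sd_sho <- 10.37483    # sd(sho_gi)
m_hgt  <- 171.1438    # mean(hgt)
m_sho  <- 108.1951    # mean(sho_gi)

# Least squares coefficients from summary statistics alone
slope     <- r * sd_hgt / sd_sho
intercept <- m_hgt - slope * m_sho

round(c(intercept = intercept, slope = slope), 4)
```

These values can be compared with the Estimate column of the lm summary (105.83246 and 0.60364).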

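Equation: Height = 105.83 + 0.6036*girth (the lm coefficients). Without lm, slope = r * sd(hgt) / sd(sho_gi) and intercept = mean(hgt) - slope * mean(sho_gi). The rounded inputs used above give 0.608 and 105.36; unrounded inputs reproduce the lm values.

Interpret the slope and intercept in this context. Slope: each additional centimeter of shoulder girth is associated with an increase of about 0.60 cm in predicted height. Intercept: the predicted height (105.83 cm) at a shoulder girth of 0 cm. No one has a girth of 0, so the intercept only fixes the position of the line.
Calculate \(R^2\) of the regression line for predicting height from shoulder girth. R^2 = 0.6657^2 = 0.4432, so about 44% of the variation in height is explained by shoulder girth.
A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using this model. 105.83 + 0.6036*100 = 166.19 cm
If the selected student is actually 160 cm tall, calculate the residual and explain its meaning. 160 - 166.19 = -6.19 (the model overestimates this student's height by 6.19 cm)
A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child? No. 56 cm is far below the smallest shoulder girth in the data (85.9 cm), so the prediction would be an extrapolation outside the range the model was fit on.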

m=bdims
min(m$sho_gi)

## [1] 85.9
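The prediction, residual, and extrapolation questions above come down to a few lines of arithmetic. The sketch below plugs in the coefficients from the lm summary and the minimum girth just reported; the variable names are illustrative.

```r
# Coefficients from the lm summary above
b0 <- 105.83246
b1 <- 0.60364

predicted <- b0 + b1 * 100     # predicted height (cm) at a shoulder girth of 100 cm
residual  <- 160 - predicted   # observed minus predicted; negative means overestimate

# Extrapolation check: compare 56 cm with the smallest girth in the data
min_girth     <- 85.9
outside_range <- 56 < min_girth

c(predicted = predicted, residual = residual)
outside_range
```

With the lm coefficients, the prediction is close to 166.2 cm and the residual close to -6.2 cm.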

Murders and Poverty

murders %>%
  ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
  geom_point()

summary(lm(annual_murders_per_mil~ perc_pov, data = murders))

## 
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1663 -2.5613 -0.9552  2.8887 12.3475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.901      7.789  -3.839   0.0012 ** 
## perc_pov       2.559      0.390   6.562 3.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.6889 
## F-statistic: 43.06 on 1 and 18 DF,  p-value: 3.638e-06

Just referring to the plot and the summary above:

Write out the linear model

y = murders
mean(y$annual_murders_per_mil)

## [1] 20.57

sd(y$annual_murders_per_mil)

## [1] 9.881407

x = murders
mean(x$perc_pov)

## [1] 19.72

sd(x$perc_pov)

## [1] 3.242254

cor(y$annual_murders_per_mil, x$perc_pov, method="pearson")

## [1] 0.8397782

Equation: y = 2.56*x - 29.90, where y is annual_murders_per_mil and x is perc_pov

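Interpret the intercept: at a poverty rate of 0%, the model predicts about -29.9 annual murders per million. A negative rate is impossible and 0% is far outside the observed poverty rates, so the intercept only positions the line.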

intercept <- 20.57-2.56*19.72
intercept

## [1] -29.9132

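Interpret the slope: each additional percentage point of poverty is associated with about 2.56 more annual murders per million, on average.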

slope <- 0.84*9.88/3.24
slope

## [1] 2.561481

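Interpret \(R^2\): R^2 = 0.84^2 = 0.7056 (0.7052 in the lm summary), so about 71% of the variation in annual murders per million is explained by the linear relationship with the percentage in poverty.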
Calculate the correlation coefficient

cor(y$annual_murders_per_mil, x$perc_pov)

## [1] 0.8397782
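The slope, intercept, and R^2 for the murders model can all be recovered from the summary statistics printed above. The sketch below uses those values at full precision rather than the rounded ones; the variable names are illustrative.

```r
# Summary statistics reported above for the murders data
r    <- 0.8397782   # cor(annual_murders_per_mil, perc_pov)
sd_y <- 9.881407    # sd(annual_murders_per_mil)
sd_x <- 3.242254    # sd(perc_pov)
m_y  <- 20.57       # mean(annual_murders_per_mil)
m_x  <- 19.72       # mean(perc_pov)

slope     <- r * sd_y / sd_x
intercept <- m_y - slope * m_x
r_squared <- r^2

round(c(intercept = intercept, slope = slope, r_squared = r_squared), 3)
```

These can be checked against the lm summary, which reports -29.901, 2.559, and 0.7052.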

Project 3 Regression