Introduction

This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.

In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.

library(tidyverse)

Summary and Exercises

Single Variable Regression

We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1, x_2, \ldots, x_n\}\) and \(Y = \{y_1, y_2, \ldots, y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and the sum of squared errors \(\sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.

Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.

The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]

Note: the residuals are estimates of the \(\epsilon_i\) in the model.
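
To make this concrete, the least-squares estimates can be computed directly from the formulas \(\hat{\beta_1} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). A minimal sketch on simulated data (the sim_x and sim_y names are made up for illustration, not part of the project data):

# Minimal sketch: compute the least-squares estimates, predictions, and residuals by hand
set.seed(1)
sim_x <- runif(50, 0, 10)
sim_y <- 2 + 3 * sim_x + rnorm(50, sd = 2)

beta1_hat <- sum((sim_x - mean(sim_x)) * (sim_y - mean(sim_y))) / sum((sim_x - mean(sim_x))^2)
beta0_hat <- mean(sim_y) - beta1_hat * mean(sim_x)

y_hat <- beta0_hat + beta1_hat * sim_x   # predicted values
e     <- sim_y - y_hat                   # residuals
c(beta0_hat, beta1_hat)                  # should match coef(lm(sim_y ~ sim_x))
sum(e)                                   # residuals sum to (numerically) zero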

Examining Residual Variance

Please work through this document before continuing with this project.

Section I: Working with Simulated Scatter Plots

There are four data sets in the Project3 data folder: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data sets are groups drawn from a larger data set. You will need to load the files into your project using read_csv.

You will also need to read in the file `bdims.csv` to do the problems in Section II.

# load the four simulated data sets
prob1 <- read.csv("problem1.csv")
prob2 <- read.csv("problem2.csv")
prob3 <- read.csv("problem3.csv")
prob4 <- read.csv("problem4.csv")
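
The instructions mention read_csv (from readr, loaded with the tidyverse); read.csv works just as well here. For reference, the read_csv version of the same step would be:

# readr alternative (same files); read_csv returns a tibble instead of a data frame
prob1 <- read_csv("problem1.csv")
prob2 <- read_csv("problem2.csv")
prob3 <- read_csv("problem3.csv")
prob4 <- read_csv("problem4.csv")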

For each of the four files:

  1. Plot the scatter plot of x and y (x on the horizontal axis).
#plotting scatter plots for all four data sets

ggplot(prob1) + 
  geom_point(aes(x = x, y = y))

ggplot(prob2) + 
  geom_point(aes(x = x, y = y))

ggplot(prob3) + 
  geom_point(aes(x = x, y = y))

ggplot(prob4) + 
  geom_point(aes(x = x, y = y))

  2. Fit a regression model using lm, report the coefficients, and plot the regression line over the scatter plot.
prob1_fit <- lm(x ~ y, data = prob1)
summary(prob1_fit)
## 
## Call:
## lm(formula = x ~ y, data = prob1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.914  -6.822  -0.336   7.087  27.507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.4901     1.4702   11.22   <2e-16 ***
## y            -1.0063     0.0571  -17.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.71 on 73 degrees of freedom
## Multiple R-squared:  0.8097, Adjusted R-squared:  0.8071 
## F-statistic: 310.7 on 1 and 73 DF,  p-value: < 2.2e-16
prob2_fit <- lm(x ~ y, data = prob2)
summary(prob2_fit)
## 
## Call:
## lm(formula = x ~ y, data = prob2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -45.0  -22.5    0.0   22.5   45.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.500e+01  1.191e+01   3.778 0.000975 ***
## y           1.180e-15  1.170e+00   0.000 1.000000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.19 on 23 degrees of freedom
## Multiple R-squared:  5.247e-32,  Adjusted R-squared:  -0.04348 
## F-statistic: 1.207e-30 on 1 and 23 DF,  p-value: 1
prob3_fit <- lm(x ~ y, data = prob3)
summary(prob3_fit)
## Warning in summary.lm(prob3_fit): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y, data = prob3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.956e-16 -2.600e-19  8.510e-18  1.609e-17  5.663e-17 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.256e-16  1.896e-17 -6.626e+00 2.76e-08 ***
## y            1.000e+00  3.405e-17  2.937e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.625e+32 on 1 and 48 DF,  p-value: < 2.2e-16
prob4_fit <- lm(x ~ y, data = prob4)
summary(prob4_fit)
## 
## Call:
## lm(formula = x ~ y, data = prob4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1522 -1.1929  0.3144  1.0636  2.0662 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.09574    0.37736   0.254 0.800720    
## y            1.31191    0.31788   4.127 0.000133 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.337 on 52 degrees of freedom
## Multiple R-squared:  0.2467, Adjusted R-squared:  0.2322 
## F-statistic: 17.03 on 1 and 52 DF,  p-value: 0.0001332
ggplot(prob1, aes(x = x, y = y)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'

ggplot(prob2, aes(x = x, y = y)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'

ggplot(prob3, aes(x = x, y = y)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'

ggplot(prob4, aes(x = x, y = y)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
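
Note that stat_smooth(method = "lm") fits y ~ x for the plotted aesthetics, while the prob*_fit models above regress x on y, so the smoothed line is not exactly the line from those fits. To overlay the line implied by prob1_fit (solving x = b0 + b1*y for y), one could use geom_abline; a sketch:

# Sketch: overlay the line implied by prob1_fit (which models x ~ y) on the
# x-vs-y scatter plot by solving x = b0 + b1*y for y
b0 <- coef(prob1_fit)[1]
b1 <- coef(prob1_fit)[2]
ggplot(prob1, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = -b0 / b1, slope = 1 / b1, col = "blue")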

# playing around with code I found online that will show the coefficients neatly on the graph:

ggplotRegression <- function(fit) {

  require(ggplot2)

  # plot the data with the fitted line, putting the key fit statistics in the title
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) + 
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 = ", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}

ggplotRegression(prob1_fit)
## `geom_smooth()` using formula 'y ~ x'

ggplotRegression(prob2_fit)
## `geom_smooth()` using formula 'y ~ x'

ggplotRegression(prob3_fit)
## Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable

## Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
## `geom_smooth()` using formula 'y ~ x'

ggplotRegression(prob4_fit)
## `geom_smooth()` using formula 'y ~ x'

  3. Plot the residuals versus x.
library(modelr)
prob1_resid <- prob1 %>% 
  add_residuals(prob1_fit)
ggplot(prob1_resid, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

prob2_resid <- prob2 %>% 
  add_residuals(prob2_fit)
ggplot(prob2_resid, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

prob3_resid <- prob3 %>% 
  add_residuals(prob3_fit)
ggplot(prob3_resid, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

prob4_resid <- prob4 %>% 
  add_residuals(prob4_fit)
ggplot(prob4_resid, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

Record any observations you have in this document.

A linear regression model may not be appropriate for prob2; its residual plot shows no relationship between x and y.

  4. Calculate the correlation between x and y.
  5. Does the calculation indicate a strong linear relationship between x and y?
cor_prob1 <- cor.test(prob1$x, prob1$y, method = "pearson")
cor_prob1
## 
##  Pearson's product-moment correlation
## 
## data:  prob1$x and prob1$y
## t = -17.626, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9357108 -0.8455841
## sample estimates:
##        cor 
## -0.8998508
# cor = -0.899 (close to -1) and p-value < 2.2e-16 indicate a strong negative linear relationship

cor_prob2 <- cor.test(prob2$x, prob2$y, method = "pearson")
cor_prob2
## 
##  Pearson's product-moment correlation
## 
## data:  prob2$x and prob2$y
## t = 1.4204e-15, df = 23, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3951309  0.3951309
## sample estimates:
##          cor 
## 2.961786e-16
# cor = 2.961786e-16 (essentially zero) and p-value = 1 indicate no linear relationship

cor_prob3 <- cor.test(prob3$x, prob3$y, method = "pearson")
cor_prob3
## 
##  Pearson's product-moment correlation
## 
## data:  prob3$x and prob3$y
## t = Inf, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  1 1
## sample estimates:
## cor 
##   1
# cor = 1 (perfect positive correlation) and p-value < 2.2e-16 indicate an exact linear relationship

cor_prob4 <- cor.test(prob4$x, prob4$y, method = "pearson")
cor_prob4
## 
##  Pearson's product-moment correlation
## 
## data:  prob4$x and prob4$y
## t = 4.127, df = 52, p-value = 0.0001332
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2640820 0.6747383
## sample estimates:
##       cor 
## 0.4967192
# cor = 0.4967192 and p-value = 0.0001332 indicate a moderate linear relationship
  6. Normalize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code.

Hint: the z-score of a vector of data, \(x\), is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(x\). (A z-score sketch appears after the code below.)

# note: this rescales values to the [0, 1] range (min-max scaling), not the z-score
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

prob1_norm <- normalize(prob1)
prob2_norm <- normalize(prob2)
prob3_norm <- normalize(prob3)
prob4_norm <- normalize(prob4)

prob1_norm
prob2_norm
prob3_norm
prob4_norm
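
Note that normalize() above rescales each data set to the [0, 1] range (min-max scaling) rather than computing the z-scores described in the hint. A per-column z-score version could look like this sketch (the *_z names are just for illustration):

# Sketch: per-column z-scores ((value - mean) / sd), as described in the hint;
# scale() does this column-wise, so one line per data set is enough
prob1_z <- as.data.frame(scale(prob1))
prob2_z <- as.data.frame(scale(prob2))
prob3_z <- as.data.frame(scale(prob3))
prob4_z <- as.data.frame(scale(prob4))

# with z-scored data the slope of lm(x ~ y) equals the correlation between x and y
coef(lm(x ~ y, data = prob1_z))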
  7. Using lm, calculate the regression coefficients. What do you notice?
lm(x ~ y, prob1_norm)
## 
## Call:
## lm(formula = x ~ y, data = prob1_norm)
## 
## Coefficients:
## (Intercept)            y  
##       1.027       -1.006
lm(x ~ y, prob2_norm)
## 
## Call:
## lm(formula = x ~ y, data = prob2_norm)
## 
## Coefficients:
## (Intercept)            y  
##         0.5          0.0
lm(x ~ y, prob3_norm)
## 
## Call:
## lm(formula = x ~ y, data = prob3_norm)
## 
## Coefficients:
## (Intercept)            y  
##    1.57e-17     1.00e+00
lm(x ~ y, prob4_norm)
## 
## Call:
## lm(formula = x ~ y, data = prob4_norm)
## 
## Coefficients:
## (Intercept)            y  
##   -0.006963     1.311911

The slope coefficients for the normalized data are the same as for the original data; only the intercepts changed. This is because normalize() applied the same shift and scale to both x and y (min-max scaling over the whole data frame), which leaves the slope unchanged. With true z-score normalization, the slope would instead equal the correlation between x and y.

  8. Calculate Total Variation and \(R^2\).

# total variability = residual variability + regression variability
prob1_resvar = sum(residuals(prob1_fit)^2)
y_hat = predict(prob1_fit)
prob1_regvar = sum((y_hat - mean(prob1$x))^2)
prob1_totvar = sum((prob1$x - mean(prob1$x))^2)
prob1_totvar
## [1] 62022.72
prob1_resvar + prob1_regvar
## [1] 62022.72
prob2_resvar = sum(residuals(prob2_fit)^2)
y_hat = predict(prob2_fit)
prob2_regvar = sum((y_hat - mean(prob2$x))^2)
prob2_totvar = sum((prob2$x - mean(prob2$x))^2)
prob2_totvar
## [1] 18281.25
prob2_resvar + prob2_regvar
## [1] 18281.25
prob3_resvar = sum(residuals(prob3_fit)^2)
y_hat = predict(prob3_fit)
prob3_regvar = sum((y_hat - mean(prob3$x))^2)
prob3_totvar = sum((prob3$x - mean(prob3$x))^2)
prob3_totvar
## [1] 4.725709
prob3_resvar + prob3_regvar
## [1] 4.725709
prob4_resvar = sum(residuals(prob4_fit)^2)
y_hat = predict(prob4_fit)
prob4_regvar = sum((y_hat - mean(prob4$x))^2)
prob4_totvar = sum((prob4$x - mean(prob4$x))^2)
prob4_totvar
## [1] 123.4709
prob4_resvar + prob4_regvar
## [1] 123.4709
prob1_rsk = prob1_regvar / prob1_totvar
prob1_rsk
## [1] 0.8097315
prob2_rsk = prob2_regvar / prob2_totvar
prob2_rsk
## [1] 6.62805e-32
prob3_rsk = prob3_regvar / prob3_totvar
prob3_rsk
## [1] 1
prob4_rsk = prob4_regvar / prob4_totvar
prob4_rsk
## [1] 0.24673
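
As a check, these ratios should match the Multiple R-squared values reported by summary() for each fit, for example:

# sanity check: R^2 from the variance decomposition should match summary()'s value
summary(prob1_fit)$r.squared
summary(prob4_fit)$r.squared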
  9. Assess your results.

About 81% of the variation in problem1 is explained by the regression model; almost none of the variation in problem2 is explained; all of the variation in problem3 is explained; and about 25% of the variation in problem4 is explained.

Section II: Body Measurement Study

Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women). The data are contained in the file bdims.csv. A description of the variables can be found here.

bdims <- read.csv("bdims.csv")

Using R, perform the following tasks; give the most complete, specific answers you can given the data.

  1. Describe the relationship between shoulder girth and height.
bdims %>%
  ggplot(aes(x = sho_gi, y = hgt)) +
  geom_point()

## as shoulder girth increases, so does height; the relationship appears positive, moderately strong, and roughly linear

  2. How would the relationship change if shoulder girth were measured in inches while the units of height remained in centimeters?

Measuring in inches does not change the strength of the relationship: the correlation is unit-free, so it stays the same. The slope of the regression line, however, would be multiplied by 2.54, since each inch is 2.54 cm (see the sketch below).
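
A quick way to see this is to convert the girth to inches and compare; a sketch using the bdims columns from above:

# Sketch: converting shoulder girth from cm to inches (1 in = 2.54 cm) leaves the
# correlation unchanged but multiplies the slope of height on girth by 2.54
sho_gi_in <- bdims$sho_gi / 2.54
cor(bdims$hgt, bdims$sho_gi)             # correlation with girth in cm
cor(bdims$hgt, sho_gi_in)                # identical: correlation is unit-free
coef(lm(hgt ~ sho_gi, data = bdims))[2]  # slope: cm of height per cm of girth
coef(lm(bdims$hgt ~ sho_gi_in))[2]       # slope per inch of girth: 2.54 times larger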

  3. Write the equation of the regression line for predicting height. How would you find the coefficients without using lm?

The model is Y = a + bX, where Y = height, X = shoulder girth, b = slope, and a = intercept.
library(stats)

# slope formula = correlation * (sd of y / sd of x), with y = height and x = shoulder girth
slope <- cor(bdims$hgt, bdims$sho_gi) * (sd(bdims$hgt)/ sd(bdims$sho_gi))
slope
## [1] 0.6036442
## intercept formula = mean of y - slope * mean of x
intercept <- mean(bdims$hgt) - slope * mean(bdims$sho_gi)
intercept
## [1] 105.8325

##height = 105.83 + 0.6 * shoulder girth
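
These formula-based values can be checked against lm directly:

# check: lm should reproduce the formula-based slope and intercept computed above
coef(lm(hgt ~ sho_gi, data = bdims))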

  4. Interpret the slope and intercept in this context.

The intercept is the predicted height when shoulder girth is zero; since zero girth is impossible, the intercept has no practical interpretation here. The slope means that for every 1 cm increase in shoulder girth, predicted height increases by about 0.6 cm.

  5. Calculate \(R^2\) of the regression line for predicting height from shoulder girth.

rsk <- (cor(bdims$hgt, bdims$sho_gi)) ^ 2
rsk
## [1] 0.4432035
  6. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using this model.
intercept + slope * 100
## [1] 166.1969
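
Equivalently, the model could be fit with lm and the prediction made with predict(); a minimal sketch (hgt_fit is just an illustrative name):

# Sketch: the same prediction via lm() and predict()
hgt_fit <- lm(hgt ~ sho_gi, data = bdims)
predict(hgt_fit, newdata = data.frame(sho_gi = 100))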
  7. If the selected student is actually 160 cm tall, calculate the residual and explain its meaning.
160 - 166.1969
## [1] -6.1969

The residual is 160 - 166.2 ≈ -6.2 cm: the model over-predicts this student's height by about 6.2 cm.

  8. A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?
max(bdims$sho_gi)
## [1] 134.8
min(bdims$sho_gi)
## [1] 85.9

# 56 cm is well below the smallest shoulder girth in the data (85.9 cm), so using the model here would be extrapolation; it should not be used to predict this child's height

Murders and Poverty

murders %>%
  ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
  geom_point()

murders
summary(lm(annual_murders_per_mil~ perc_pov, data = murders))
## 
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1663 -2.5613 -0.9552  2.8887 12.3475 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -29.901      7.789  -3.839   0.0012 ** 
## perc_pov       2.559      0.390   6.562 3.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.6889 
## F-statistic: 43.06 on 1 and 18 DF,  p-value: 3.638e-06

Just referring to the plot and the summary above:

  1. Write out the linear model.

The model is Y = a + bX, with Y = annual_murders_per_mil, X = perc_pov, slope b = 2.559, and intercept a = -29.901; that is, annual_murders_per_mil = -29.901 + 2.559 * perc_pov.

  2. Interpret the intercept.

If the percentage in poverty were zero, the model would predict -29.901 annual murders per million, which is impossible; the intercept has no practical interpretation here.

  3. Interpret the slope.

For each one-percentage-point increase in poverty, the model predicts about 2.56 additional annual murders per million.

  4. Interpret \(R^2\).

About 70.52% of the variation in annual murders per million is explained by the percentage in poverty in this regression model.

  5. Calculate the correlation coefficient.
sqrt(.7052)
## [1] 0.8397619
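
Taking the square root of \(R^2\) only gives the magnitude of the correlation; since the slope is positive, the correlation is positive. It can also be computed directly from the data (assuming the murders data frame used above):

# direct computation; the sign is positive, matching the positive slope
cor(murders$annual_murders_per_mil, murders$perc_pov)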