This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.
In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1, x_2, \ldots, x_n\}\) and \(Y = \{y_1, y_2, \ldots, y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and the sum of squared errors \(\sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\), called \(\hat{\beta_0}\) and \(\hat{\beta_1}\), we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
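As a concrete illustration of the definitions above, here is a minimal sketch (using a small made-up data set, not one of the project files) that computes \(\hat{\beta_0}\) and \(\hat{\beta_1}\) directly from the least squares formulas and checks them against lm:
# Small made-up example, not one of the project data sets
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
# Closed-form least squares estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)
# lm() gives the same coefficients
coef(lm(y ~ x))
# Residuals are actual minus predicted; for a least squares fit they sum to (essentially) zero
y_hat <- beta0_hat + beta1_hat * x
e <- y - y_hat
sum(e)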
Please work through this document before continuing with this project.
There are four data sets in the Project3 Data folder: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data are some of the groups from this data set. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
setwd("/Users/justinpark/Desktop/2744203/Project3 Data")
prob1 <- read.csv("problem1.csv")
prob2 <- read.csv("problem2.csv")
prob3 <- read.csv("problem3.csv")
prob4 <- read.csv("problem4.csv")
bdims <- read.csv("bdims.csv")
For each of the four data sets, do the following:
1. Make a scatter plot of x and y (x on the horizontal axis).
2. Using lm, report the coefficients, and plot the regression line over the scatter plot.
3. Plot the residuals against x. Record any observations you have in this document.
4. Calculate the correlation between x and y.
5. What does the correlation tell you about the relationship between x and y?
6. Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code (see the sketch after this list). Hint: the z-score of a vector of data, x, is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(x\).
7. Using lm, calculate the regression coefficients of the standardized data. What do you notice?
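Here is a minimal sketch of the standardization step (tasks 6 and 7) for problem 1, assuming prob1 has numeric columns x and y as in the code below; the name prob1_std is just an illustrative choice:
# z-scores in one line each: scale() subtracts the mean and divides by the standard deviation
prob1_std <- data.frame(x = as.numeric(scale(prob1$x)),
                        y = as.numeric(scale(prob1$y)))
# Regression on the standardized data: the intercept is essentially 0
# and the slope equals the correlation between x and y
coef(lm(y ~ x, data = prob1_std))
cor(prob1_std$x, prob1_std$y)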
# first problem
prob1 %>%
filter(group ==1) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum = lm(y ~ x, data = prob1)
prob.res = resid(sum)
summary(sum)
##
## Call:
## lm(formula = y ~ x, data = prob1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.2072 -7.7558 0.3587 7.9145 23.7530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.52478 1.48347 9.117 1.12e-13 ***
## x -0.80463 0.04565 -17.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.37 on 73 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
## F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
# Correlation Calculation
prob1 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.0000000 -0.8998508
## y NA -0.8998508 1.0000000
The correlation is -0.8999, which indicates a fairly strong negative linear relationship between x and y. The summary statistics show that \(R^2\) is 0.8097, meaning the regression explains 80.97% of the variation in y, so the line is a fairly good fit.
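As a quick check that these two numbers agree, in simple linear regression \(R^2\) is the square of the correlation, so squaring -0.8999 should reproduce 0.8097 (reusing the objects already defined above):
# R-squared equals the squared correlation for simple linear regression
cor(prob1$x, prob1$y)^2      # approximately 0.8097
summary(sum)$r.squared       # same value from the fitted model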
# Residuals
plot(prob1$x, prob.res,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
# second problem
prob2 %>%
filter(group == 5) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum2 = lm(y ~ x, data = prob2)
prob.res2 = resid(sum2)
summary(sum2)
##
## Call:
## lm(formula = y ~ x, data = prob2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.967 -3.344 1.423 4.608 5.726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.967e+00 1.951e+00 4.597 0.000127 ***
## x 5.912e-17 3.716e-02 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.024 on 23 degrees of freedom
## Multiple R-squared: 1.033e-31, Adjusted R-squared: -0.04348
## F-statistic: 2.375e-30 on 1 and 23 DF, p-value: 1
# Correlation Calculation
prob2 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.000000e+00 2.961786e-16
## y NA 2.961786e-16 1.000000e+00
The correlation is \(2.961786 \times 10^{-16}\), essentially 0, which indicates no linear relationship between x and y. The summary statistics show that \(R^2\) is \(1.033 \times 10^{-31}\), meaning the regression explains essentially none of the variation in y, so the linear model is not a good fit at all.
# Residuals
plot(prob2$x, prob.res2,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
Since the graph is a parabola, the correlation is almost 0: a linear model simply cannot fit a parabola, even though y clearly depends on x.
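A small made-up illustration of this point (synthetic data, not the project file): for a symmetric parabola the correlation with x, and hence the fitted slope, is essentially zero even though y is completely determined by x.
# Made-up symmetric parabola: y depends on x exactly,
# yet the linear correlation and fitted slope are essentially zero
x_par <- seq(-10, 10, by = 0.5)
y_par <- x_par^2
cor(x_par, y_par)
coef(lm(y_par ~ x_par))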
# third problem
prob3 %>%
filter(group == 12) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum3 = lm(y ~ x, data = prob3)
prob.res3 = resid(sum3)
summary(sum3)
## Warning in summary.lm(sum3): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x, data = prob3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
## x 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
# Correlation Calculation
prob3 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1 1
## y NA 1 1
The correlation is 1.00, which indicates a perfect positive linear relationship between x and y. The summary statistics show that \(R^2\) is also equal to 1 (which makes sense, since \(1^2 = 1\)), meaning the regression explains 100% of the variation in y, so the linear model fits the data perfectly.
# Residuals
plot(prob3$x, prob.res3,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
At first glance all of the data points lie exactly on the fitted line, which explains the correlation and \(R^2\) of 1. However, the residual plot shows that there are still tiny differences between the predicted and actual values, on the order of \(10^{-16}\), i.e., rounding error rather than genuine scatter.
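One way to confirm that these leftover differences are just rounding error rather than real lack of fit is to look at the largest residual in absolute value (reusing prob.res3 from above):
# Largest absolute residual: on the order of 1e-16, i.e. floating-point
# rounding error rather than genuine scatter around the line
max(abs(prob.res3))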
# fourth problem
prob4 %>%
filter(group == 19) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Summary Stats
sum4 = lm(y ~ x, data = prob4)
prob.res4 = resid(sum4)
summary(sum4)
##
## Call:
## lm(formula = y ~ x, data = prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.06170 -0.13830 0.08706 0.34014 0.85016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.76533 0.09579 7.990 1.32e-10 ***
## x 0.18807 0.04557 4.127 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5064 on 52 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
## F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
# Correlation Calculation
prob4 %>%
cor(method = c("pearson", "kendall", "spearman"))
## Warning in cor(., method = c("pearson", "kendall", "spearman")): the standard
## deviation is zero
## group x y
## group 1 NA NA
## x NA 1.0000000 0.4967192
## y NA 0.4967192 1.0000000
The correlation is 0.4967, which indicates a weak-to-moderate positive relationship between x and y. The summary statistics show that \(R^2\) is 0.2467, meaning the regression explains only 24.67% of the variation in y, so the linear model is not a very good fit.
# Residuals
plot(prob4$x, prob.res4,
ylab = "Residuals",
xlab = "x")
abline(0, 0)
Anthropological researchers collected body measurements from 507 physically active individuals (247 men and 260 women). The data are contained in the file bdims.csv. A description of the variables can be found here.
Using R, perform the following tasks; give the most complete, specific answers you can given the data.
reg <- bdims %>%
dplyr::select(sho_gi, hgt)
ggplot(data = reg, aes(x = sho_gi, y = hgt)) +
geom_point() +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(
x = "Shoulder Girth",
y = "Height")
## `geom_smooth()` using formula 'y ~ x'
The scatter plot with the fitted line shows a positive association: overall, as shoulder girth increases, height increases as well.
The slope of the regression line would decrease significantly.
Fitting the regression of height on shoulder girth with lm:
lm(hgt ~ sho_gi, data = reg)
##
## Call:
## lm(formula = hgt ~ sho_gi, data = reg)
##
## Coefficients:
## (Intercept) sho_gi
## 105.8325 0.6036
Intercept: if the shoulder girth of an individual were 0 cm, their expected height would be 105.8325 cm.
Slope: as the shoulder girth increases by 1 cm, the expected height increases by 0.6036 cm.
corr <- cor(reg$sho_gi, reg$hgt, method = c("pearson", "kendall", "spearman"))
corr^2
## [1] 0.4432035
0.6036*(100)+105.8325
## [1] 166.1925
No. Since the data were collected from physically active men and women, a one-year-old cannot be considered even remotely similar to them, so it would not be appropriate to use the model to predict the child's height; this would be extrapolation far outside the range of the data. If the model were used anyway, the predicted height would be 139.6341 cm, which essentially claims that the infant is already well over a meter tall.
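The same plug-in calculations can be done with predict(); below is a minimal sketch, assuming the one-year-old in the question has a shoulder girth of 56 cm (the value that reproduces the 139.6341 cm figure above); the name bdims_fit is just an illustrative choice:
# Fit the model once and predict heights for new shoulder girths
bdims_fit <- lm(hgt ~ sho_gi, data = reg)
predict(bdims_fit, newdata = data.frame(sho_gi = c(100, 56)))
# 100 cm girth -> about 166.19 cm; 56 cm girth (assumed for the infant) -> about 139.63 cm,
# the latter being extrapolation far outside the range of the data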
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
summary(lm(annual_murders_per_mil ~ perc_pov, data = murders))
##
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1663 -2.5613 -0.9552 2.8887 12.3475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.901 7.789 -3.839 0.0012 **
## perc_pov 2.559 0.390 6.562 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.6889
## F-statistic: 43.06 on 1 and 18 DF, p-value: 3.638e-06
Just referring to the plot and the summary above:
\(\hat{y} = 2.559x - 29.901\)
At a poverty percentage of 0%, the expected annual murders per million is -29.901.
As the poverty percentage increases by 1 percentage point, the expected annual murders per million increases by 2.559.
70.52% of the variation in the murder rate is explained by the regression on poverty percentage, which indicates that the model accounts for a decent portion of the variability.
Since the slope is positive, the correlation is \(r = \sqrt{0.7052} = 0.8398\).
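A short sketch checking this directly against cor() (reusing the murders data plotted above); the name murders_fit is just an illustrative choice:
# Correlation computed two ways: from R-squared (with the sign of the slope) and directly
murders_fit <- lm(annual_murders_per_mil ~ perc_pov, data = murders)
sqrt(summary(murders_fit)$r.squared) * sign(coef(murders_fit)["perc_pov"])
cor(murders$perc_pov, murders$annual_murders_per_mil)   # both approximately 0.8398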