This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.
In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.
library(tidyverse)
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, the problem becomes: given \(X = \{x_1,x_2,\ldots,x_n\}\) and \(Y = \{y_1, y_2,\dots,y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and \(\sigma^2 = \sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
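To make the formulas concrete, here is a minimal sketch (using simulated data, not the project data) that computes the least squares estimates from the closed-form formulas and checks them against lm:
# simulated data for illustration only
set.seed(1)
x <- runif(20, 0, 10)
y <- 2 + 3 * x + rnorm(20)
# closed-form least squares estimates
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)
c(b0_hat, b1_hat)
# these should agree with the coefficients reported by lm
coef(lm(y ~ x))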
Please work through this document before continuing with this project.
There are four data sets in the Project3 data folder: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data sets are some of the groups from the original data. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
prob1 <- read.csv("problem1.csv")
prob2 <- read.csv("problem2.csv")
prob3 <- read.csv("problem3.csv")
prob4 <- read.csv("problem4.csv")
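Since the tidyverse is loaded, readr's read_csv (the function the instructions mention) would also work here and returns a tibble, e.g. prob1 <- read_csv("problem1.csv").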
Plot a scatter plot of x and y for each data set (x on the horizontal axis).
# plotting scatter plots for all four data sets
ggplot(prob1) +
geom_point(aes(x = x, y = y))
ggplot(prob2) +
geom_point(aes(x = x, y = y))
ggplot(prob3) +
geom_point(aes(x = x, y = y))
ggplot(prob4) +
geom_point(aes(x = x, y = y))
Use lm to fit a linear model to each data set, report the coefficients, and plot the regression line over the scatter plot.
prob1_fit <- lm(x ~ y, data = prob1)
summary(prob1_fit)
##
## Call:
## lm(formula = x ~ y, data = prob1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.914 -6.822 -0.336 7.087 27.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.4901 1.4702 11.22 <2e-16 ***
## y -1.0063 0.0571 -17.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.71 on 73 degrees of freedom
## Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
## F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
prob2_fit <- lm(x ~ y, data = prob2)
summary(prob2_fit)
##
## Call:
## lm(formula = x ~ y, data = prob2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.0 -22.5 0.0 22.5 45.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.500e+01 1.191e+01 3.778 0.000975 ***
## y 1.180e-15 1.170e+00 0.000 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.19 on 23 degrees of freedom
## Multiple R-squared: 5.247e-32, Adjusted R-squared: -0.04348
## F-statistic: 1.207e-30 on 1 and 23 DF, p-value: 1
prob3_fit <- lm(x ~ y, data = prob3)
summary(prob3_fit)
## Warning in summary.lm(prob3_fit): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y, data = prob3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
## y 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.402e-17 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
prob4_fit <- lm(x ~ y, data = prob4)
summary(prob4_fit)
##
## Call:
## lm(formula = x ~ y, data = prob4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1522 -1.1929 0.3144 1.0636 2.0662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.09574 0.37736 0.254 0.800720
## y 1.31191 0.31788 4.127 0.000133 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.337 on 52 degrees of freedom
## Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
## F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
ggplot(prob1, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
ggplot(prob2, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
ggplot(prob3, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
ggplot(prob4, aes(x = x, y = y)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
## `geom_smooth()` using formula 'y ~ x'
# playing around with code I found online that will show the coefficients neatly on the graph:
ggplotRegression <- function(fit) {
  require(ggplot2)
  ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
    geom_point() +
    stat_smooth(method = "lm", col = "red") +
    labs(title = paste("Adj R2 =", signif(summary(fit)$adj.r.squared, 5),
                       "Intercept =", signif(fit$coef[[1]], 5),
                       " Slope =", signif(fit$coef[[2]], 5),
                       " P =", signif(summary(fit)$coef[2, 4], 5)))
}
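One caveat: aes_string() is deprecated in recent ggplot2 releases. If that matters to you, the same mapping can be written with the .data pronoun, something like:
ggplot(fit$model, aes(x = .data[[names(fit$model)[2]]], y = .data[[names(fit$model)[1]]]))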
ggplotRegression(prob1_fit)
## `geom_smooth()` using formula 'y ~ x'
ggplotRegression(prob2_fit)
## `geom_smooth()` using formula 'y ~ x'
ggplotRegression(prob3_fit)
## Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
## Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
## `geom_smooth()` using formula 'y ~ x'
ggplotRegression(prob4_fit)
## `geom_smooth()` using formula 'y ~ x'
Plot the residuals against x for each data set.
library(modelr)
prob1_resid <- prob1 %>%
add_residuals(prob1_fit)
ggplot(prob1_resid, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
prob2_resid <- prob2 %>%
add_residuals(prob2_fit)
ggplot(prob2_resid, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
prob3_resid <- prob3 %>%
add_residuals(prob3_fit)
ggplot(prob3_resid, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
prob4_resid <- prob4 %>%
add_residuals(prob4_fit)
ggplot(prob4_resid, aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
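As an aside, add_residuals() attaches the model residuals as a resid column; the same numbers are available directly from the fitted object, e.g. head(residuals(prob1_fit)).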
Record any observations you have in this document.
## a linear regression model may not be appropriate for prob2.
Calculate the correlation between x and y for each data set. What does the correlation tell you about the relationship between x and y?
cor_prob1 <- cor.test(prob1$x, prob1$y, method = "pearson")
cor_prob1
##
## Pearson's product-moment correlation
##
## data: prob1$x and prob1$y
## t = -17.626, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9357108 -0.8455841
## sample estimates:
## cor
## -0.8998508
# cor = -0.899 (close to -1) and p-value < 2.2e-16, indicating a strong negative linear relationship
cor_prob2 <- cor.test(prob2$x, prob2$y, method = "pearson")
cor_prob2
##
## Pearson's product-moment correlation
##
## data: prob2$x and prob2$y
## t = 1.4204e-15, df = 23, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3951309 0.3951309
## sample estimates:
## cor
## 2.961786e-16
# cor = 2.961786e-16 (essentially zero) and p-value = 1, indicating no linear relationship
cor_prob3 <- cor.test(prob3$x, prob3$y, method = "pearson")
cor_prob3
##
## Pearson's product-moment correlation
##
## data: prob3$x and prob3$y
## t = Inf, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 1 1
## sample estimates:
## cor
## 1
# cor = 1 (perfect positive relationship) and p-value < 2.2e-16, indicating an exact linear relationship
cor_prob4 <- cor.test(prob4$x, prob4$y, method = "pearson")
cor_prob4
##
## Pearson's product-moment correlation
##
## data: prob4$x and prob4$y
## t = 4.127, df = 52, p-value = 0.0001332
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2640820 0.6747383
## sample estimates:
## cor
## 0.4967192
# cor = 0.4967192 and p-value = 0.0001332, indicating a moderate positive linear relationship
Normalize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code. Hint: the z-score of a vector of data, x, is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(x\).
# note: this performs min-max scaling using the overall min and max of the
# data frame, rather than the z-score described in the hint
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
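The one-line z-score version the hint describes could instead be done with base R's scale(), which centers each column at its mean and divides by its standard deviation (the _z name below is just illustrative):
prob1_z <- as.data.frame(scale(prob1))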
prob1_norm <- normalize(prob1)
prob2_norm <- normalize(prob2)
prob3_norm <- normalize(prob3)
prob4_norm <- normalize(prob4)
prob1_norm
prob2_norm
prob3_norm
prob4_norm
Using lm, calculate the regression coefficients for the normalized data. What do you notice?
lm(x ~ y, prob1_norm)
##
## Call:
## lm(formula = x ~ y, data = prob1_norm)
##
## Coefficients:
## (Intercept) y
## 1.027 -1.006
lm(x ~ y, prob2_norm)
##
## Call:
## lm(formula = x ~ y, data = prob2_norm)
##
## Coefficients:
## (Intercept) y
## 0.5 0.0
lm(x ~ y, prob3_norm)
##
## Call:
## lm(formula = x ~ y, data = prob3_norm)
##
## Coefficients:
## (Intercept) y
## 1.57e-17 1.00e+00
lm(x ~ y, prob4_norm)
##
## Call:
## lm(formula = x ~ y, data = prob4_norm)
##
## Coefficients:
## (Intercept) y
## -0.006963 1.311911
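The next block checks the sum-of-squares decomposition that underlies \(R^2\): for a least squares fit, the total sum of squares of the response splits into the regression and residual sums of squares, \[\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y_i})^2\] and \(R^2\) is the regression sum of squares divided by the total. (Here the response is x, since the models were fit as lm(x ~ y).)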
prob1_resvar = sum(residuals(prob1_fit)^2)
y_hat = predict(prob1_fit)
prob1_regvar = sum((y_hat - mean(prob1$x))^2)
prob1_totvar = sum((prob1$x - mean(prob1$x))^2)
prob1_totvar
## [1] 62022.72
prob1_resvar + prob1_regvar
## [1] 62022.72
prob2_resvar = sum(residuals(prob2_fit)^2)
y_hat = predict(prob2_fit)
prob2_regvar = sum((y_hat - mean(prob2$x))^2)
prob2_totvar = sum((prob2$x - mean(prob2$x))^2)
prob2_totvar
## [1] 18281.25
prob2_resvar + prob2_regvar
## [1] 18281.25
prob3_resvar = sum(residuals(prob3_fit)^2)
y_hat = predict(prob3_fit)
prob3_regvar = sum((y_hat - mean(prob3$x))^2)
prob3_totvar = sum((prob3$x - mean(prob3$x))^2)
prob3_totvar
## [1] 4.725709
prob3_resvar + prob3_regvar
## [1] 4.725709
prob4_resvar = sum(residuals(prob4_fit)^2)
y_hat = predict(prob4_fit)
prob4_regvar = sum((y_hat - mean(prob4$x))^2)
prob4_totvar = sum((prob4$x - mean(prob4$x))^2)
prob4_totvar
## [1] 123.4709
prob4_resvar + prob4_regvar
## [1] 123.4709
prob1_rsk = prob1_regvar / prob1_totvar
prob1_rsk
## [1] 0.8097315
prob2_rsk = prob2_regvar / prob2_totvar
prob2_rsk
## [1] 6.62805e-32
prob3_rsk = prob3_regvar / prob3_totvar
prob3_rsk
## [1] 1
prob4_rsk = prob4_regvar / prob4_totvar
prob4_rsk
## [1] 0.24673
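As a sanity check, these ratios can be compared against the R-squared values stored by summary(), e.g. summary(prob1_fit)$r.squared, which should reproduce prob1_rsk above.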
Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women). The data are contained in the file bdims.csv. A description of the variables can be found here.
bdims <- read.csv("bdims.csv")
Using R, perform the following tasks; give the most complete, specific answers you can given the data.
bdims %>%
ggplot(aes(x = sho_gi, y = hgt)) +
geom_point()
## as shoulder girth increases, so does height.
Write the equation of the regression line for predicting height from shoulder girth. Can you find the coefficients without using lm?
## Y = bX + a
## b = slope
## a = intercept
## Y = height
## X = shoulder girth
library(stats)
#slope formula = correlation * (sd x / sd y)
slope <- cor(bdims$hgt, bdims$sho_gi) * (sd(bdims$hgt)/ sd(bdims$sho_gi))
slope
## [1] 0.6036442
## intercept formula = mean of y - slope * mean of x
intercept <- mean(bdims$hgt) - slope * mean(bdims$sho_gi)
intercept
## [1] 105.8325
##height = 105.83 + 0.6 * shoulder girth
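These hand-computed coefficients can be verified against lm, e.g.:
coef(lm(hgt ~ sho_gi, data = bdims))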
Interpret the slope and intercept in this context.
## The intercept is the predicted height when shoulder girth is zero; since zero girth is impossible, it has no practical interpretation. The slope means that for every 1 cm increase in shoulder girth, predicted height increases by about 0.6 cm.
Calculate \(R^2\) of the regression line for predicting height from shoulder girth.
rsk <- (cor(bdims$hgt, bdims$sho_gi)) ^ 2
rsk
## [1] 0.4432035
# predicted height for a shoulder girth of 100 cm
intercept + slope * 100
## [1] 166.1969
# residual for a student who is actually 160 cm tall
160 - 166.1969
## [1] -6.1969
# range of shoulder girth in the data
max(bdims$sho_gi)
## [1] 134.8
min(bdims$sho_gi)
## [1] 85.9
# 56 cm is outside the range of the data, so the model shouldn't be used to predict height there
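For reference, the same prediction could be made through lm and predict() (the fit_hgt name is just illustrative):
fit_hgt <- lm(hgt ~ sho_gi, data = bdims)
predict(fit_hgt, newdata = data.frame(sho_gi = 100))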
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
murders
summary(lm(annual_murders_per_mil~ perc_pov, data = murders))
##
## Call:
## lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1663 -2.5613 -0.9552 2.8887 12.3475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.901 7.789 -3.839 0.0012 **
## perc_pov 2.559 0.390 6.562 3.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.512 on 18 degrees of freedom
## Multiple R-squared: 0.7052, Adjusted R-squared: 0.6889
## F-statistic: 43.06 on 1 and 18 DF, p-value: 3.638e-06
Just referring to the plot and the summary above:
Write out the linear model.
# Y = bX + a
# Y = annual_murders_per_mil
# X = perc_pov
# slope (b) = 2.559
# intercept (a) = -29.901
# annual_murders_per_mil = 2.559 * perc_pov - 29.901
Interpret the intercept
# if the poverty percentage were zero, the model predicts -29.901 murders per million, which is impossible, so the intercept has no practical interpretation.
Interpret the slope
# for every one-percentage-point increase in poverty, annual murders per million increase by about 2.559.
Interpret \(R^2\)
# R-squared = 0.7052: about 70.5% of the variation in annual murders per million is explained by the poverty percentage.
sqrt(.7052)
## [1] 0.8397619
# the square root of R-squared recovers the magnitude of the correlation, about 0.84.
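The correlation could also be computed directly to confirm its sign matches the positive slope:
cor(murders$perc_pov, murders$annual_murders_per_mil)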