This project is intended to extend and supplement last week's discussion of linear regression.
In some sections you will be asked to write R code to solve a particular problem; in others you will be asked to answer questions in your own words.
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, given \(X = \{x_1, x_2, \ldots, x_n\}\) and \(Y = \{y_1, y_2, \ldots, y_n\}\), the problem is to find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and \(\sigma^2 = \sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values. \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
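As a quick illustration of these formulas (a sketch with made-up toy vectors, not part of the assignment data), the least squares estimates can be computed directly from sample statistics and checked against lm:
x_demo <- c(1, 2, 3, 4, 5)
y_demo <- c(2.1, 3.9, 6.2, 8.1, 9.8)
b1 <- cor(x_demo, y_demo) * sd(y_demo) / sd(x_demo)  # slope: r * s_y / s_x
b0 <- mean(y_demo) - b1 * mean(x_demo)               # the line passes through the point of means
c(b0, b1)
coef(lm(y_demo ~ x_demo))                            # should agree with the hand calculation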
Please work through this document before continuing with this project.
There are four data sets in the Project3 data folder: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These are some of the groups from a larger data set. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
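The code below assumes the tidyverse (which supplies read_csv, ggplot, and the pipe) has been loaded, for example:
library(tidyverse)   # provides read_csv(), ggplot(), %>%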
problem1 <- read_csv("problem1.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
group = col_double(),
x = col_double(),
y = col_double()
)
problem2 <- read_csv("problem2.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
group = col_double(),
x = col_double(),
y = col_double()
)
problem3 <- read_csv("problem3.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
group = col_double(),
x = col_double(),
y = col_double()
)
problem4 <- read_csv("problem4.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
group = col_double(),
x = col_double(),
y = col_double()
)
bdims <- read_csv("bdims.csv")
-- Column specification ----------------------------------------------------------------------------------
cols(
.default = col_double()
)
i Use `spec()` for the full column specifications.
Make a scatter plot of x and y for each data set (x on the horizontal axis).
problem1 %>%
ggplot(aes (x = x, y = y)) +
geom_point()
problem2 %>%
ggplot(aes (x = x, y = y)) +
geom_point()
problem3 %>%
ggplot(aes (x = x, y = y)) +
geom_point()
problem4 %>%
ggplot(aes (x = x, y = y)) +
geom_point()
Fit a regression line to each data set with lm, report the coefficients, and plot the regression line over the scatter plot.
problem1 %>%
ggplot(aes (x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula 'y ~ x'
problem2 %>%
ggplot(aes (x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula 'y ~ x'
problem3 %>%
ggplot(aes (x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula 'y ~ x'
problem4 %>%
ggplot(aes (x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", color = "red")
`geom_smooth()` using formula 'y ~ x'
Add a horizontal line at y = 0 to each scatter plot of y against x.
problem1 %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(aes(yintercept = 0), color = "blue")
problem2 %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(aes(yintercept = 0), color = "blue")
problem3 %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(aes(yintercept = 0), color = "blue")
problem4 %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
geom_hline(aes(yintercept = 0), color = "blue")
Record any observations you have in this document.
I noticed that only problems 1 and 2 had two or more points on the x-axis (where y = 0). Problems 1 and 3, on the other hand, were the only two that showed a clear correlation between x and y.
Compute the correlation between x and y for each data set.
x_1 <- problem1$x
y_1 <- problem1$y
cor(x_1,y_1)
[1] -0.8998508
x_2 <- problem2$x
y_2 <- problem2$y
cor(x_2,y_2)
[1] 2.961786e-16
x_3 <- problem3$x
y_3 <- problem3$y
cor(x_3,y_3)
[1] 1
x_4 <- problem4$x
y_4 <- problem4$y
cor(x_4,y_4)
[1] 0.4967192
What do these correlations tell you about the relationship between x and y?
Problem 1 has a strong negative linear relationship. Problem 2 has no linear relationship. Problem 3 has a perfect positive linear relationship. Problem 4 has a weak positive linear relationship.
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code. Hint: the z-score of a vector of data, \(x\), is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(x\).
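One possible one-line version (a sketch using base R's scale(), which by default centers a vector and divides by its standard deviation; as.numeric() just drops the matrix attributes that scale() returns):
z_x <- as.numeric(scale(problem1$x))   # z-scores of x in one line
z_y <- as.numeric(scale(problem1$y))   # z-scores of y in one line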
fit_1 <- lm(y_1 ~ x_1, data = problem1)
summary(fit_1)
Call:
lm(formula = y_1 ~ x_1, data = problem1)
Residuals:
Min 1Q Median 3Q Max
-28.2072 -7.7558 0.3587 7.9145 23.7530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.52478 1.48347 9.117 1.12e-13 ***
x_1 -0.80463 0.04565 -17.626 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.37 on 73 degrees of freedom
Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
x_1
[1] -15.7438027 6.3466511 24.5400111 -22.0203522 22.4608333 16.9940902 18.8187851 52.8481370
[9] -24.2222947 57.8231678 -8.5777927 -21.3302128 -7.6398302 24.3375282 21.0175083 5.8473378
[17] -15.4495719 -5.3920128 56.4023496 22.5937831 -3.0899628 -15.0959242 9.2769701 -38.9936697
[25] 0.0129814 -8.4553978 54.3003207 49.3982151 13.6214104 -21.5138158 45.7206161 44.1084248
[33] 40.0146007 40.3045708 4.3797226 39.2820119 58.9118136 17.2623165 -16.3163644 42.1941206
[41] 41.9547268 5.7547167 72.0631999 -10.2215924 27.4984446 -58.7582354 10.6472258 53.3185447
[49] 0.9669828 -13.6724884 39.9856838 -10.7115498 24.8138088 -41.3297024 -30.5770295 1.0328095
[57] -18.1712121 60.9507155 46.2760723 -9.9096913 34.9260997 46.3004749 24.4574800 27.6148963
[65] 54.7531328 0.1320696 2.1786192 47.5187225 -26.5372182 22.1445153 14.9662584 31.4142112
[73] 49.7985232 24.8228289 23.6502614
mean(x_1)
[1] 15.13373
sd(x_1)
[1] 28.95073
z_score_1 = (x_1 - mean(x_1))/sd(x_1)
z_score_1
[1] -1.06655437 -0.30351828 0.32490658 -1.28335543 0.25308878 0.06425960 0.12728719 1.30271002
[9] -1.35941370 1.47455475 -0.81903006 -1.25951702 -0.78663149 0.31791252 0.20323426 -0.32076528
[17] -1.05639122 -0.70898864 1.42547764 0.25768106 -0.62947252 -1.04417572 -0.20230084 -1.86963827
[25] -0.52229236 -0.81480237 1.35287053 1.18354475 -0.05223762 -1.26585894 1.05651518 1.00082777
[33] 0.85942119 0.86943717 -0.37145882 0.83411652 1.51215814 0.07352452 -1.08633148 0.93470495
[41] 0.92643594 -0.32396455 1.96642596 -0.87580927 0.42709513 -2.55233483 -0.15497024 1.31895858
[49] -0.48933977 -0.99500819 0.85842236 -0.89273311 0.33436394 -1.95032820 -1.57891539 -0.48706602
[57] -1.15040059 1.58258475 1.07570144 -0.86503576 0.68365704 1.07654433 0.32205583 0.43111754
[65] 1.36851131 -0.51817888 -0.44748811 1.11862436 -1.43937449 0.24216270 -0.00578463 0.56235135
[73] 1.19737196 0.33467551 0.29417334
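As a follow-up sketch (standardizing y_1 the same way as x_1): when both variables are z-scored, the fitted slope is simply the correlation coefficient and the intercept is essentially zero.
z_score_y1 <- (y_1 - mean(y_1)) / sd(y_1)   # standardize y the same way as x
coef(lm(z_score_y1 ~ z_score_1))            # slope should equal cor(x_1, y_1)
cor(x_1, y_1)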
Now, using lm, calculate the regression coefficients with the roles of x and y reversed. What do you notice?
summary(lm(x_1 ~ y_1, data = problem1))
Call:
lm(formula = x_1 ~ y_1, data = problem1)
Residuals:
Min 1Q Median 3Q Max
-31.914 -6.822 -0.336 7.087 27.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.4901 1.4702 11.22 <2e-16 ***
y_1 -1.0063 0.0571 -17.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.71 on 73 degrees of freedom
Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
fit_1 <- lm(x_1 ~ y_1, data = problem1)
summary(fit_1)
Call:
lm(formula = x_1 ~ y_1, data = problem1)
Residuals:
Min 1Q Median 3Q Max
-31.914 -6.822 -0.336 7.087 27.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.4901 1.4702 11.22 <2e-16 ***
y_1 -1.0063 0.0571 -17.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.71 on 73 degrees of freedom
Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
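One way to see the connection between the two regressions: the slope from regressing y on x times the slope from regressing x on y equals the squared correlation, which is the Multiple R-squared reported in both summaries. A quick check:
b_yx <- coef(lm(y_1 ~ x_1))[2]   # slope of y on x (about -0.805)
b_xy <- coef(lm(x_1 ~ y_1))[2]   # slope of x on y (about -1.006)
b_yx * b_xy                      # should equal cor(x_1, y_1)^2
cor(x_1, y_1)^2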
df <- nrow(problem1) - 1
df
[1] 74
res_var1 = sum(residuals(fit_1)^2)
x_hat1 = predict(fit_1)
reg_var1 = sum((x_hat1 - mean(x_1))^2)
tot_var1 = sum((x_1 - mean(x_1))^2)
tot_var1
[1] 62022.72
res_var1 + reg_var1
[1] 62022.72
tot_var1 - (res_var1 + reg_var1)
[1] -2.182787e-11
Rsq = reg_var1/tot_var1
Rsq
[1] 0.8097315
cor(x_1,y_1)^2 #square of the correlation coefficient
[1] 0.8097315
Rsq - cor(x_1,y_1)^2
[1] 4.440892e-16
#8.b)
fit_2 <- lm(x_2 ~ y_2, data = problem2)
summary(fit_2)
Call:
lm(formula = x_2 ~ y_2, data = problem2)
Residuals:
Min 1Q Median 3Q Max
-45.0 -22.5 0.0 22.5 45.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.500e+01 1.191e+01 3.778 0.000975 ***
y_2 1.474e-15 1.170e+00 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.19 on 23 degrees of freedom
Multiple R-squared: 4.419e-32, Adjusted R-squared: -0.04348
F-statistic: 1.016e-30 on 1 and 23 DF, p-value: 1
res_var2 = sum(residuals(fit_2)^2)
x_hat2 = predict(fit_2)
reg_var2 = sum((x_hat2 - mean(x_2))^2)
tot_var2 = sum((x_2 - mean(x_2))^2)
tot_var2
[1] 18281.25
res_var2 + reg_var2
[1] 18281.25
tot_var2 - (res_var2 + reg_var2)
[1] 0
Rsq2 = reg_var2/tot_var2
Rsq2
[1] 8.008893e-32
cor(x_2,y_2)^2 #square of the correlation coefficient
[1] 8.772177e-32
Rsq2 - cor(x_2,y_2)^2
[1] -7.632839e-33
#8.c)
fit_3 <- lm(x_3 ~ y_3, data = problem3)
summary(fit_3)
Warning in summary.lm(fit_3) :
essentially perfect fit: summary may be unreliable
Call:
lm(formula = x_3 ~ y_3, data = problem3)
Residuals:
Min 1Q Median 3Q Max
-4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
y_3 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.402e-17 on 48 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
res_var3 = sum(residuals(fit_3)^2)
x_hat3 = predict(fit_3)
reg_var3 = sum((x_hat3 - mean(x_3))^2)
tot_var3 = sum((x_3 - mean(x_3))^2)
tot_var3
[1] 4.725709
res_var3 + reg_var3
[1] 4.725709
tot_var3 - (res_var3 + reg_var3)
[1] -2.664535e-15
Rsq3 = reg_var3/tot_var3
Rsq3
[1] 1
cor(x_3,y_3)^2 #square of the correlation coefficient
[1] 1
Rsq3 - cor(x_3,y_3)^2
[1] 6.661338e-16
#8.d)
fit_4 <- lm(x_4 ~ y_4, data = problem4)
summary(fit_4)
Call:
lm(formula = x_4 ~ y_4, data = problem4)
Residuals:
Min 1Q Median 3Q Max
-2.1522 -1.1929 0.3144 1.0636 2.0662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09574 0.37736 0.254 0.800720
y_4 1.31191 0.31788 4.127 0.000133 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.337 on 52 degrees of freedom
Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
res_var4 = sum(residuals(fit_4)^2)
x_hat4 = predict(fit_4)
reg_var4 = sum((x_hat4 - mean(x_4))^2)
tot_var4 = sum((x_4 - mean(x_4))^2)
tot_var4
[1] 123.4709
res_var4 + reg_var4
[1] 123.4709
tot_var4 - (res_var4 + reg_var4)
[1] -1.421085e-14
Rsq4 = reg_var4/tot_var4
Rsq4
[1] 0.24673
cor(x_4,y_4)^2 #square of the correlation coefficient
[1] 0.24673
Rsq4 - cor(x_4,y_4)^2
[1] 1.387779e-16
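The same decomposition is repeated for each problem above; a small helper function (a sketch, with the hypothetical name var_decomp) could compute the pieces for any fitted model:
var_decomp <- function(fit) {
  y_obs <- model.response(model.frame(fit))     # response variable used in the fit
  sse <- sum(residuals(fit)^2)                  # residual sum of squares
  ssr <- sum((fitted(fit) - mean(y_obs))^2)     # regression sum of squares
  sst <- sum((y_obs - mean(y_obs))^2)           # total sum of squares
  c(SST = sst, SSR = ssr, SSE = sse, Rsq = ssr / sst)
}
var_decomp(fit_4)   # should reproduce the problem 4 numbers above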
Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women). The data are contained in the file bdims.csv. A description of the variables can be found here.
bdims <- read.csv("bdims.csv")
Using R, perform the following tasks; give the most complete, specific answers you can given the data.
head(bdims)
bdims %>%
ggplot(aes(x = sho_gi, y = hgt)) +
geom_point()
Can you find the slope and intercept of the regression of hgt on sho_gi without using lm?
slope <- cor(bdims$sho_gi,bdims$hgt) * (sd(bdims$hgt)/ sd(bdims$sho_gi))
slope
[1] 0.6036442
intercept <- mean(bdims$hgt) - slope * mean(bdims$sho_gi)
intercept
[1] 105.8325
height <- intercept + slope * bdims$sho_gi
x_4 <- problem4$x
y_4 <- problem4$y
beta <- sum(x_4 * y_4) / sum(x_4^2)   # least squares slope for a fit with no intercept
beta
[1] 0.4409789
intercept + slope * 100
[1] 166.1969
160 - 166.1969
[1] -6.1969
intercept + slope * 59
[1] 141.4475
min(bdims$sho_gi)
[1] 85.9
max(bdims$sho_gi)
[1] 134.8
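As a cross-check (a sketch using lm and predict rather than the hand calculation), the fitted coefficients and the two predictions above can be reproduced directly:
fit_hgt <- lm(hgt ~ sho_gi, data = bdims)                     # same slope and intercept as above
coef(fit_hgt)
predict(fit_hgt, newdata = data.frame(sho_gi = c(100, 59)))
# sho_gi = 59 is far below min(bdims$sho_gi) = 85.9, so that prediction is an extrapolation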
The murders data set contains the percentage of residents living in poverty (perc_pov) and the annual number of murders per million residents (annual_murders_per_mil).
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
summary(lm(annual_murders_per_mil~ perc_pov, data = murders))
Call:
lm(formula = annual_murders_per_mil ~ perc_pov, data = murders)
Residuals:
Min 1Q Median 3Q Max
-9.1663 -2.5613 -0.9552 2.8887 12.3475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29.901 7.789 -3.839 0.0012 **
perc_pov 2.559 0.390 6.562 3.64e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.512 on 18 degrees of freedom
Multiple R-squared: 0.7052, Adjusted R-squared: 0.6889
F-statistic: 43.06 on 1 and 18 DF, p-value: 3.638e-06
Just referring to the plot and the summary above:
cor(murders$annual_murders_per_mil,murders$perc_pov)
[1] 0.8397782
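As a final check tying the two outputs together: the correlation is the positive square root of the Multiple R-squared from the summary (positive because the fitted slope is positive).
sqrt(0.7052)                                              # square root of R-squared from the summary
cor(murders$perc_pov, murders$annual_murders_per_mil)^2   # should be about 0.7052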