This project is intended to extend and supplement what we talked about last week in discussing Linear Regression.
In some of the sections you will be asked to write R code to solve a particular problem. In some you will be asked to answer questions in your own words.
We are trying to fit a straight line to a scatter plot. This means finding the equation of the line that comes closest to all the points. More specifically, given \(X = \{x_1,x_2,\ldots,x_n\}\) and \(Y = \{y_1, y_2,\ldots,y_n\}\), find \(\beta_0\) and \(\beta_1\) such that \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\] where \(\sum_{i=1}^{n}\epsilon_i = 0\) and \(\sigma^2 = \sum_{i=1}^{n}\epsilon_i^2\) is minimized. This is called least squares regression.
Given estimates of \(\beta_0\) and \(\beta_1\) called \(\hat{\beta_0}\) and \(\hat{\beta_1}\) we define: \[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\] The \(\hat{y_i}\) are called the predicted values.
The residuals, \(e_i\), are the differences between the actual values and the predicted values: \[e_i = y_i - \hat{y_i}\]
Note: the residuals are estimates of the \(\epsilon_i\) in the model.
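The definitions above can be checked numerically. The following is a minimal sketch on simulated data (the seed, sample size, and true coefficients are arbitrary assumptions, not part of the project data). It uses the standard closed-form least squares estimates, \(\hat{\beta_1} = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\), and compares them with `lm`.

```r
# Simulated data; the seed, sample size, and true coefficients are arbitrary.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

# Closed-form least squares estimates of the slope and intercept
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# Predicted values and residuals, following the definitions above
y_hat <- beta0_hat + beta1_hat * x
e <- y - y_hat

# lm should give the same estimates
fit <- lm(y ~ x)
coef(fit)
c(beta0_hat, beta1_hat)

# The residuals sum to zero, up to floating-point rounding
sum(e)
```

The residuals from `residuals(fit)` should agree with `e` here, since both come from the same fitted line.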
Please work through this document before continuing with this project.
There are four data sets in the Project3 data folder of the files section: problem1.csv, problem2.csv, problem3.csv, and problem4.csv. These data are some of the groups from this data. You will need to load the files into your project using read_csv.
You will also need to read in the file `bdims.csv` to do the problems in Section II.
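library(tidyverse)  # provides ggplot2 and the %>% pipe used below
problem1 <- read.csv("problem1.csv")
problem2 <- read.csv("problem2.csv")
problem3 <- read.csv("problem3.csv")
problem4 <- read.csv("problem4.csv")
bdims <- read.csv("bdims.csv")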
Plot x and y (x on the horizontal axis.)
Using lm, report the coefficients, and plot the regression line over the scatter plot.
Plot the residuals against x. Record any observations you have in this document.
Calculate the correlation between x and y.
What does this calculation indicate about the relationship between x and y?
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code. Hint: the z-score of a vector of data, x, is the vector \(z\) given by \(z = \frac{x - \bar{x}}{s_x}\) where \(s_x\) is the standard deviation of \(x\).
Using lm, calculate the regression coefficients. What do you notice?
#Problem1.csv
Plot x and y (x on the horizontal axis.)
ggplot(data = problem1) +
geom_point(mapping = aes(x = x, y = y))
Using lm, report the coefficients, and plot the regression line over the scatter plot.
fit_1p <- lm(problem1$x ~ problem1$y, data = problem1)
summary(fit_1p)
Call:
lm(formula = problem1$x ~ problem1$y, data = problem1)
Residuals:
Min 1Q Median 3Q Max
-31.914 -6.822 -0.336 7.087 27.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.4901 1.4702 11.22 <2e-16 ***
problem1$y -1.0063 0.0571 -17.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.71 on 73 degrees of freedom
Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
problem1 %>%
ggplot(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
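In R's formula interface the response goes on the left of `~`, so a regression of y on x is written `y ~ x`. The `lm` calls in this document put x on the left, which treats x as the response, while `geom_smooth` reports using `y ~ x`. The following is a minimal sketch on simulated data (the data frame `df` and its values are made up for illustration) of how the two directions differ in their coefficients but share the same R^2.

```r
# Simulated data frame; names and values are illustrative only.
set.seed(2)
df <- data.frame(x = rnorm(40))
df$y <- 5 - 2 * df$x + rnorm(40)

fit_y_on_x <- lm(y ~ x, data = df)  # y is the response, x the predictor
fit_x_on_y <- lm(x ~ y, data = df)  # roles reversed

# The two fits have different intercepts and slopes
coef(fit_y_on_x)
coef(fit_x_on_y)

# R^2 is the same in both directions,
# and the product of the two slopes equals R^2
summary(fit_y_on_x)$r.squared
summary(fit_x_on_y)$r.squared
coef(fit_y_on_x)[["x"]] * coef(fit_x_on_y)[["y"]]
```

Writing the formula with bare column names and `data = df` also keeps the coefficient labels short (`x` rather than `df$x`) and lets `predict` accept new data.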
Plot the residuals against x. Record any observations you have in this document.
The residuals are widely scattered. There may be a slight upward trend, but it is not very clear.
problem1 %>%
ggplot() +
geom_point(aes(x = x, y = fit_1p$residuals))
Calculate the correlation between x and y.
cor(problem1$x, problem1$y)
[1] -0.8998508
What does this calculation indicate about the relationship between x and y?
The value is close to -1, which indicates a strong negative linear relationship between the two variables. That makes sense, since the scatterplot shows a downward trend.
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code.
z = (x - mean)/standard deviation, so I can calculate it by defining the mean and sd for x and y.
x_bar_1p = mean(problem1$x)
sdx_1p = sd(problem1$x)
zx_1p <- (problem1$x - x_bar_1p)/sdx_1p
zx_1p
[1] -1.06655437 -0.30351828 0.32490658 -1.28335543 0.25308878 0.06425960 0.12728719
[8] 1.30271002 -1.35941370 1.47455475 -0.81903006 -1.25951702 -0.78663149 0.31791252
[15] 0.20323426 -0.32076528 -1.05639122 -0.70898864 1.42547764 0.25768106 -0.62947252
[22] -1.04417572 -0.20230084 -1.86963827 -0.52229236 -0.81480237 1.35287053 1.18354475
[29] -0.05223762 -1.26585894 1.05651518 1.00082777 0.85942119 0.86943717 -0.37145882
[36] 0.83411652 1.51215814 0.07352452 -1.08633148 0.93470495 0.92643594 -0.32396455
[43] 1.96642596 -0.87580927 0.42709513 -2.55233483 -0.15497024 1.31895858 -0.48933977
[50] -0.99500819 0.85842236 -0.89273311 0.33436394 -1.95032820 -1.57891539 -0.48706602
[57] -1.15040059 1.58258475 1.07570144 -0.86503576 0.68365704 1.07654433 0.32205583
[64] 0.43111754 1.36851131 -0.51817888 -0.44748811 1.11862436 -1.43937449 0.24216270
[71] -0.00578463 0.56235135 1.19737196 0.33467551 0.29417334
y_bar_1p = mean(problem1$y)
sdy_1p = sd(problem1$y)
zy_1p <- (problem1$y - y_bar_1p)/sdy_1p
zy_1p
[1] 1.32183109 0.85678665 -0.09191802 0.71123718 -0.27807998 0.76808027 -0.25612720
[8] -0.82522234 1.80487316 -0.92999073 0.75086145 1.24951035 0.31392153 -0.05951540
[15] -0.57165411 -0.08774540 0.50727757 0.35366250 -0.76572130 -0.39067126 0.55151613
[22] 0.74107760 0.45279473 2.12564480 0.39133553 1.03622035 -1.37709409 -0.54966027
[29] 0.34698756 1.62524228 -0.64226077 -0.77852258 -1.62753629 -1.07321545 0.57652277
[36] -0.93984133 -1.46150543 -0.23648526 1.74335915 -1.14944317 -1.01783066 0.31751458
[43] -1.63823568 0.89217938 -0.38429645 1.61137305 0.33022206 -0.91925677 0.23434865
[50] 1.22410311 -1.10038263 0.99873474 -0.58588528 0.96665337 1.96969441 0.77830471
[57] 1.11665915 -0.70284402 -1.69793010 0.85425273 -1.70481171 -0.27276224 -0.67448592
[64] -0.97025011 -1.60994011 -0.05508548 -0.49750072 -1.43863526 1.23852347 0.22739851
[71] 0.21521336 -1.05787695 -1.34812077 0.61640280 -0.05197951
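The standardization can indeed be done in one line. One option is base R's `scale()`, which by default centers a vector at its mean and divides by its standard deviation. The following is a minimal sketch on a made-up vector, checked against the manual formula used above.

```r
# Made-up values for illustration
x <- c(2, 4, 4, 5, 7, 9)

# Manual z-scores, as in the calculation above
z_manual <- (x - mean(x)) / sd(x)

# One-line version; scale() returns a matrix, so as.vector() drops the attributes
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)
```

For a data frame, `scale()` can be applied to several numeric columns at once, which standardizes each column separately.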
Using lm, calculate the regression coefficients. What do you notice?
The coefficients are 16.4901 (intercept) and -1.0063 (slope). The slope is negative, which makes sense given the downward direction of the data points.
fit_1p <- lm(problem1$x ~ problem1$y, data = problem1)
summary(fit_1p)
Call:
lm(formula = problem1$x ~ problem1$y, data = problem1)
Residuals:
Min 1Q Median 3Q Max
-31.914 -6.822 -0.336 7.087 27.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.4901 1.4702 11.22 <2e-16 ***
problem1$y -1.0063 0.0571 -17.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 12.71 on 73 degrees of freedom
Multiple R-squared: 0.8097, Adjusted R-squared: 0.8071
F-statistic: 310.7 on 1 and 73 DF, p-value: < 2.2e-16
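The fit above uses the original, unstandardized variables. If the prompt is read as referring to the z-scores from the previous step (an assumption about the intent of the exercise), the thing to notice is that the intercept becomes 0 and the slope equals the correlation. The following is a minimal sketch on simulated data, with arbitrary seed and parameters.

```r
# Simulated data; seed and parameters are arbitrary.
set.seed(3)
x <- rnorm(60, mean = 10, sd = 4)
y <- 1 - 0.8 * x + rnorm(60, sd = 2)

# Standardize both variables
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# Regress the standardized response on the standardized predictor
fit_z <- lm(zy ~ zx)

coef(fit_z)  # intercept is zero up to rounding; slope equals cor(x, y)
cor(x, y)
```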
R^2 is stored in Rsq_1p.
Rsq_1p = (cor(problem1$x, problem1$y))^2
Rsq_1p
[1] 0.8097315
res_var_1p = sum(residuals(fit_1p)^2)
y_hat_1p = predict(fit_1p)
reg_var_1p = sum((y_hat_1p - x_bar_1p)^2)
tot_var_1p = sum((problem1$x - x_bar_1p)^2)
tot_var_1p
[1] 62022.72
res_var_1p + reg_var_1p
[1] 62022.72
When I calculated total variability (tot_var_1p) I found that it was the same as residual variability + regression variability (res_var_1p + reg_var_1p) as it should be.
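The R^2 is 0.81, which means that about 81% of the variability is explained by the linear model, so the linear relationship is strong. R^2 cannot be negative and does not show direction; the correlation of -0.90 shows that the relationship is negative.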
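The decomposition of variability and its link to R^2 can be sketched as follows, on simulated data with arbitrary seed and parameters. Here the total sum of squares is split into a residual part and a regression part, and the share explained by the regression is compared with the squared correlation and with the value reported by `summary`.

```r
# Simulated data; seed and parameters are arbitrary.
set.seed(4)
x <- rnorm(30)
y <- 3 + 1.5 * x + rnorm(30)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)           # residual variability
ss_reg <- sum((fitted(fit) - mean(y))^2)  # regression variability
ss_tot <- sum((y - mean(y))^2)            # total variability

# Total variability equals residual plus regression variability
c(ss_tot, ss_res + ss_reg)

# Three equivalent ways to get R^2 in simple linear regression
c(ss_reg / ss_tot, cor(x, y)^2, summary(fit)$r.squared)
```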
#Problem2.csv
Plot x and y (x on the horizontal axis.)
ggplot(data = problem2) +
geom_point(mapping = aes(x = x, y = y))
Using lm, report the coefficients, and plot the regression line over the scatter plot.
fit_2p <- lm(problem2$x ~ problem2$y, data = problem2)
summary(fit_2p)
Call:
lm(formula = problem2$x ~ problem2$y, data = problem2)
Residuals:
Min 1Q Median 3Q Max
-45.0 -22.5 0.0 22.5 45.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.500e+01 1.191e+01 3.778 0.000975 ***
problem2$y 1.474e-15 1.170e+00 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.19 on 23 degrees of freedom
Multiple R-squared: 4.419e-32, Adjusted R-squared: -0.04348
F-statistic: 1.016e-30 on 1 and 23 DF, p-value: 1
problem2 %>%
ggplot(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
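Plot the residuals against x. Record any observations you have in this document.
The residuals all line up in what looks like a strong positive linear trend. Because the fit above has x as the response and a slope of essentially zero, each residual is just x minus the intercept of 45, so plotting the residuals against x gives a straight line.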
problem2 %>%
ggplot() +
geom_point(aes(x = x, y = fit_2p$residuals))
Calculate the correlation between x and y.
cor(problem2$x, problem2$y)
[1] 2.961786e-16
What does this calculation indicate about the relationship between x and y?
The value is essentially zero, far from both +1 and -1, which indicates that there is no linear relationship between x and y. This makes sense considering the non-linear shape of the graph.
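A correlation near zero rules out a linear relationship, not a relationship of any kind. The following is a minimal sketch with hypothetical data (an evenly spaced x and a parabola in y, chosen for illustration and not taken from problem2.csv) in which y is completely determined by x, yet the correlation and the fitted slope are both zero up to rounding.

```r
# Hypothetical data: evenly spaced x, and y a parabola symmetric about mean(x)
x <- seq(0, 90, length.out = 25)
y <- -(x - mean(x))^2

# The positive and negative deviations cancel, so the correlation is zero
cor(x, y)

# The least squares slope is also zero, so the fitted line is flat
coef(lm(y ~ x))[["x"]]
```

This is why the scatter plot and the residual plot are worth examining alongside the correlation.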
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code.
x_bar_2p = mean(problem2$x)
sdx_2p = sd(problem2$x)
zx_2p <- (problem2$x - x_bar_2p)/sdx_2p
zx_2p
[1] -1.6304789 -1.4946057 -1.3587324 -1.2228592 -1.0869860 -0.9511127 -0.8152395 -0.6793662
[9] -0.5434930 -0.4076197 -0.2717465 -0.1358732 0.0000000 0.1358732 0.2717465 0.4076197
[17] 0.5434930 0.6793662 0.8152395 0.9511127 1.0869860 1.2228592 1.3587324 1.4946057
[25] 1.6304789
y_bar_2p = mean(problem2$y)
sdy_2p = sd(problem2$y)
zy_2p <- (problem2$y - y_bar_2p)/sdy_2p
zy_2p
[1] -1.82326297 -1.43330628 -1.05002186 -0.67996781 -0.32947586 -0.00454302 0.28927102
[8] 0.54693901 0.76405220 0.93689570 1.06251212 1.13875214 1.16431125 1.13875214
[15] 1.06251212 0.93689570 0.76405220 0.54693901 0.28927102 -0.00454302 -0.32947586
[22] -0.67996781 -1.05002186 -1.43330628 -1.82326297
Using lm, calculate the regression coefficients. What do you notice?
The coefficients are 45 (intercept) and 1.47E-15 (slope). The slope is essentially zero, so the fitted line is flat.
fit_2p <- lm(problem2$x ~ problem2$y, data = problem2)
summary(fit_2p)
Call:
lm(formula = problem2$x ~ problem2$y, data = problem2)
Residuals:
Min 1Q Median 3Q Max
-45.0 -22.5 0.0 22.5 45.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.500e+01 1.191e+01 3.778 0.000975 ***
problem2$y 1.474e-15 1.170e+00 0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.19 on 23 degrees of freedom
Multiple R-squared: 4.419e-32, Adjusted R-squared: -0.04348
F-statistic: 1.016e-30 on 1 and 23 DF, p-value: 1
Rsq_2p = (cor(problem2$x, problem2$y))^2
Rsq_2p
[1] 8.772177e-32
res_var_2p = sum(residuals(fit_2p)^2)
y_hat_2p = predict(fit_2p)
reg_var_2p = sum((y_hat_2p - x_bar_2p)^2)
tot_var_2p = sum((problem2$x - x_bar_2p)^2)
tot_var_2p
[1] 18281.25
res_var_2p + reg_var_2p
[1] 18281.25
When I calculated total variability I found that it was the same as residual variability + regression variability as it should be.
R^2 is essentially 0 (R^2 always lies between 0 and 1), which indicates that there is no linear relationship.
#Problem3.csv
Plot x and y (x on the horizontal axis.)
ggplot(data = problem3) +
geom_point(mapping = aes(x = x, y = y))
Using lm, report the coefficients, and plot the regression line over the scatter plot.
fit_3p <- lm(problem3$x ~ problem3$y, data = problem3)
summary(fit_3p)
Warning in summary.lm(fit_3p) :
essentially perfect fit: summary may be unreliable
Call:
lm(formula = problem3$x ~ problem3$y, data = problem3)
Residuals:
Min 1Q Median 3Q Max
-4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
problem3$y 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.402e-17 on 48 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
problem3 %>%
ggplot(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
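Plot the residuals against x. Record any observations you have in this document.
The residuals are all essentially zero (on the order of 1e-16 or smaller), which is floating-point rounding error. Even the one residual that sits apart from the rest is only about one order of magnitude larger.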
problem3 %>%
ggplot() +
geom_point(aes(x = x, y = fit_3p$residuals))
Calculate the correlation between x and y.
cor(problem3$x, problem3$y)
[1] 1
What does this calculation indicate about the relationship between x and y?
The calculation indicates a perfectly linear, positive relationship between x and y. This was expected, since the regression line passes through every data point.
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code.
x_bar_3p = mean(problem3$x)
sdx_3p = sd(problem3$x)
zx_3p <- (problem3$x - x_bar_3p)/sdx_3p
zx_3p
[1] -0.64367687 -0.13217104 -0.31421737 1.54035206 -1.09617799 -1.26906610 1.61058719
[8] -0.06966477 -0.30227491 -0.94638054 -1.32007917 0.62367295 0.36714280 1.68395548
[15] 0.44967462 -1.28543292 -0.97257415 0.03623611 -1.48835200 -0.07304894 -0.65436527
[22] 1.52717692 0.81045348 -0.96943009 0.03878028 0.72809531 -0.01046365 1.58105824
[29] 0.79984929 -0.21573390 -1.11550789 -0.72138597 1.28637026 -0.09070372 0.10859138
[36] 0.73347587 0.95368194 -0.99417389 1.24073076 1.55423858 0.40017189 0.12270416
[43] -0.88347228 -1.48866105 1.33128210 -1.06272869 -1.42134764 1.52950490 -0.55126207
[50] -0.96543368
y_bar_3p = mean(problem3$y)
sdy_3p = sd(problem3$y)
zy_3p <- (problem3$y - y_bar_3p)/sdy_3p
zy_3p
[1] -0.64367687 -0.13217104 -0.31421737 1.54035206 -1.09617799 -1.26906610 1.61058719
[8] -0.06966477 -0.30227491 -0.94638054 -1.32007917 0.62367295 0.36714280 1.68395548
[15] 0.44967462 -1.28543292 -0.97257415 0.03623611 -1.48835200 -0.07304894 -0.65436527
[22] 1.52717692 0.81045348 -0.96943009 0.03878028 0.72809531 -0.01046365 1.58105824
[29] 0.79984929 -0.21573390 -1.11550789 -0.72138597 1.28637026 -0.09070372 0.10859138
[36] 0.73347587 0.95368194 -0.99417389 1.24073076 1.55423858 0.40017189 0.12270416
[43] -0.88347228 -1.48866105 1.33128210 -1.06272869 -1.42134764 1.52950490 -0.55126207
[50] -0.96543368
Using lm, calculate the regression coefficients. What do you notice?
The slope is 1 and the intercept is essentially 0. R warns that the summary may be unreliable because the fit is essentially perfect. This is consistent with the correlation of 1: the relationship is perfectly linear.
fit_3p <- lm(problem3$x ~ problem3$y, data = problem3)
summary(fit_3p)
Warning in summary.lm(fit_3p) :
essentially perfect fit: summary may be unreliable
Call:
lm(formula = problem3$x ~ problem3$y, data = problem3)
Residuals:
Min 1Q Median 3Q Max
-4.956e-16 -2.600e-19 8.510e-18 1.609e-17 5.663e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.256e-16 1.896e-17 -6.626e+00 2.76e-08 ***
problem3$y 1.000e+00 3.405e-17 2.937e+16 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.402e-17 on 48 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 8.625e+32 on 1 and 48 DF, p-value: < 2.2e-16
Rsq_3p = (cor(problem3$x, problem3$y))^2
Rsq_3p
[1] 1
res_var_3p = sum(residuals(fit_3p)^2)
y_hat_3p = predict(fit_3p)
reg_var_3p = sum((y_hat_3p - x_bar_3p)^2)
tot_var_3p = sum((problem3$x - x_bar_3p)^2)
tot_var_3p
[1] 4.725709
res_var_3p + reg_var_3p
[1] 4.725709
When I calculated total variability I found that it was the same as residual variability + regression variability as it should be.
R^2 being 1 indicates that the relationship between x and y is perfectly linear. This was expected because of the result of the correlation function.
#Problem4.csv
Plot x and y (x on the horizontal axis.)
ggplot(data = problem4) +
geom_point(mapping = aes(x = x, y = y))
Using lm, report the coefficients, and plot the regression line over the scatter plot.
fit_4p <- lm(problem4$x ~ problem4$y, data = problem4)
summary(fit_4p)
Call:
lm(formula = problem4$x ~ problem4$y, data = problem4)
Residuals:
Min 1Q Median 3Q Max
-2.1522 -1.1929 0.3144 1.0636 2.0662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09574 0.37736 0.254 0.800720
problem4$y 1.31191 0.31788 4.127 0.000133 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.337 on 52 degrees of freedom
Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
problem4 %>%
ggplot(aes(x = x, y = y)) +
geom_point(mapping = aes(x = x, y = y)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
Plot the residuals against x. Record any observations you have in this document.
Compared with the scatterplot, the residuals appear to follow a clearer trend than the original data points.
problem4 %>%
ggplot() +
geom_point(aes(x = x, y = fit_4p$residuals))
Calculate the correlation between x and y.
cor(problem4$x, problem4$y)
[1] 0.4967192
What does this calculation indicate about the relationship between x and y?
The calculation indicates a positive linear relationship between x and y, but only a moderate one, because 0.5 is not close to 1. Compared to problem 2, though, it is much stronger.
Standardize the x and y data; this means calculating the z-score for each \(x_i\) and \(y_i\). It is possible to do this with one line of R code.
x_bar_4p = mean(problem4$x)
sdx_4p = sd(problem4$x)
zx_4p <- (problem4$x - x_bar_4p)/sdx_4p
zx_4p
[1] -1.286867588 -1.222223502 -1.446498546 1.432376147 0.595256714 -1.300990643
[7] 0.001945231 -0.099751562 -0.382574167 1.634972846 -1.034031243 1.052969972
[13] -1.387510163 -0.299913621 -1.149362530 -0.978475404 1.144420615 0.746607313
[19] -0.736381889 0.009825397 -1.339462233 -0.452455817 1.563262859 0.434745861
[25] 0.565466075 -0.588058784 -0.282748541 1.651280588 1.189400303 1.511961511
[31] 1.049267867 0.950589137 -0.734200015 0.884971691 1.619289227 -0.649920524
[37] -0.303512614 1.048699280 -1.358993675 -0.420311190 -0.161859863 -1.098357984
[43] 0.295492485 1.566385979 1.629733477 -1.033699324 0.164212800 -0.352805222
[49] 0.603288532 -0.729562537 -0.956559433 -0.956559433 -0.301386930 -0.301386930
y_bar_4p = mean(problem4$y)
sdy_4p = sd(problem4$y)
zy_4p <- (problem4$y - y_bar_4p)/sdy_4p
zy_4p
[1] -0.078542927 -0.110099044 -0.073330393 0.806389528 0.531980985 0.057160582
[7] 1.290626050 0.990944093 -1.678594019 0.736640568 -0.249329080 0.773633206
[13] -0.088568439 1.322161404 -0.163184169 -0.121919435 0.595839764 0.255259297
[19] -1.750083769 1.052321485 0.032954962 -1.947898625 0.580866315 0.454077556
[25] 0.152211980 -1.748859732 1.095404765 0.759919557 0.917114336 0.904547705
[31] 0.668965741 -0.003743964 -1.492614008 0.358973324 0.776127811 -1.801120850
[37] -1.736803107 0.689425718 -0.017377795 -1.848648573 0.973880866 0.052300990
[43] 1.328938034 0.671689188 0.555642675 0.009841240 1.136730520 -2.012416343
[49] 0.277280058 -1.876912762 -0.976160401 -1.076364791 0.152468506 -0.109746585
Using lm, calculate the regression coefficients. What do you notice?
The intercept is 0.096 and the slope is 1.31. The positive slope indicates that the linear relationship is positive.
fit_4p <- lm(problem4$x ~ problem4$y, data = problem4)
summary(fit_4p)
Call:
lm(formula = problem4$x ~ problem4$y, data = problem4)
Residuals:
Min 1Q Median 3Q Max
-2.1522 -1.1929 0.3144 1.0636 2.0662
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09574 0.37736 0.254 0.800720
problem4$y 1.31191 0.31788 4.127 0.000133 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.337 on 52 degrees of freedom
Multiple R-squared: 0.2467, Adjusted R-squared: 0.2322
F-statistic: 17.03 on 1 and 52 DF, p-value: 0.0001332
Rsq_4p = (cor(problem4$x, problem4$y))^2
Rsq_4p
[1] 0.24673
res_var_4p = sum(residuals(fit_4p)^2)
y_hat_4p = predict(fit_4p)
reg_var_4p = sum((y_hat_4p - x_bar_4p)^2)
tot_var_4p = sum((problem4$x - x_bar_4p)^2)
tot_var_4p
[1] 123.4709
res_var_4p + reg_var_4p
[1] 123.4709
When I calculated total variability I found that it was the same as residual variability + regression variability as it should be.
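The R^2 is 0.25, which means that only about 25% of the variability is explained by the linear model, so the linear relationship is fairly weak. R^2 does not show direction; the correlation of 0.50 shows that the relationship is positive.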
Anthropological researchers collected body measurements from 507 individuals (247 men and 260 women.) The data are contained in the file bdims.csv. A description of the variables can be found here
Using R, perform the following tasks. Give the most complete and specific answers you can, given the data.
I can see from a scatter plot of sho_gi vs. hgt that as shoulder girth increases, so does height. Although the points are scattered, there is a clear positive linear trend.
ggplot(data = bdims) +
geom_point(mapping = aes(x = sho_gi, y = hgt))
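The relationship itself would not change if shoulder girth were converted between inches and centimeters so that the 2 variables are in comparable units. The direction, form, and strength (the correlation) stay the same; only the numerical value of the slope changes with the units.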
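Using lm, the fitted equation is y = 0.6036442x + 105.8324618, where x is shoulder girth (sho_gi) and y is height (hgt).
The slope m and intercept b in the equation above are extracted from the fitted model using coef.
# The definition of sho_gi_hgt is not shown in the original; it is assumed to be height regressed on shoulder girth
sho_gi_hgt <- lm(hgt ~ sho_gi, data = bdims)
m = coef(sho_gi_hgt)[[2]]
b = coef(sho_gi_hgt)[[1]]
c(m,b)
In this context, the slope (m) is the estimate of beta_1 and the intercept (b) is the estimate of beta_0. Together they describe the fitted linear relationship between x and y, so we can use x and the equation to predict y, although there will be some error. The slope means that for every additional 1 cm of shoulder girth (x), the model predicts an additional 0.6036442 cm of height (y). The intercept, 105.8324618 cm, is the predicted height when shoulder girth is 0 cm, which is not physically meaningful on its own.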
R^2 is equal to the square of the correlation coefficient. Therefore if I square the correlation between x (sho_gi) and y (hgt), I should get R^2 (R_sq_s).
R_sq_s = (cor(bdims$sho_gi, bdims$hgt))^2
R_sq_s
If x (shoulder girth) is 100 cm, then the model predicts a height (y) of 166.1969 cm.
(m*100)+b
The residual is the actual height minus the model-predicted height: 160 - 166.1969 = -6.1969 cm. The negative residual means that the model overestimated this person's height (which we already knew since 166 > 160).
160-166.1969
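The same prediction and residual can be obtained with `predict` instead of typing the equation by hand. The following is a minimal sketch on a synthetic stand-in for bdims: the column names `sho_gi` and `hgt` match the document, but the data frame `toy` and all of its values are simulated assumptions, so the numbers will not match the ones above.

```r
# Synthetic stand-in for bdims; values are simulated, not the real measurements.
set.seed(5)
toy <- data.frame(sho_gi = rnorm(100, mean = 108, sd = 10))
toy$hgt <- 105 + 0.6 * toy$sho_gi + rnorm(100, sd = 7)

fit <- lm(hgt ~ sho_gi, data = toy)

# Predicted height at a shoulder girth of 100 cm
pred <- predict(fit, newdata = data.frame(sho_gi = 100))

# The same prediction computed from the coefficients directly
by_hand <- coef(fit)[[1]] + coef(fit)[[2]] * 100

# Residual for an observed height of 160 cm: observed minus predicted
resid_160 <- 160 - pred
```

`predict` requires the model to be fitted with bare column names and a `data` argument, as here, so that the column in `newdata` can be matched to the predictor.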
You could try, but the dataset description says the data were collected from “physically active men and women”, so the model represents adults, not babies. Predicting a one year old’s height would mean extrapolating far outside the range of the observed data, and because babies grow very quickly compared with adults, a separate model for infant growth would be needed. Applying this model to a one year old would therefore be useful only for comparison, to see how far the model’s answer is from the child’s actual height.
murders %>%
ggplot(aes(x = perc_pov, y = annual_murders_per_mil)) +
geom_point()
summary(lm(annual_murders_per_mil~ perc_pov, data = murders))
Just referring to the plot and the summary above:
y = 2.559x - 29.901
The intercept, -29.901, is the predicted number of annual murders per million when percent poverty is 0. A negative murder rate is impossible, so the intercept has no practical meaning here; it only positions the line, which is consistent with the positive linear trend in the scatterplot.
The slope, 2.559, means that for each additional percentage point of poverty (x), the model predicts 2.559 more annual murders per million.
R^2 (R_sq_m in the code below) is 0.705. That means that about 70.5% of the variability in annual murders per million is explained by percent poverty through the model.
cor(murders$perc_pov, murders$annual_murders_per_mil)
[1] 0.8397782
R_sq_m = (cor(murders$perc_pov, murders$annual_murders_per_mil))^2
R_sq_m
[1] 0.7052275