## enter the data, 2-variable quantitative data
FatGrams = c(19,31,34,35,39,39,43)
Calories = c(410,580,590,570,640,680,660)
## make the scatterplot
plot(FatGrams, Calories, col = "purple", type ='p', pch = 16)
## calculate the linear regression model
lm.r = lm(Calories~FatGrams)
## add the regression line to the scatterplot
abline(lm.r, col = "dark green")
## state the correlation coefficient
cor(FatGrams, Calories)
## [1] 0.9606329
## summary provides lots of information
summary(lm.r)
##
## Call:
## lm(formula = Calories ~ FatGrams)
##
## Residuals:
## 1 2 3 4 5 6 7
## -11.009 26.325 3.159 -27.897 -2.119 37.881 -26.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.95 50.10 4.211 0.008404 **
## FatGrams 11.06 1.43 7.732 0.000578 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.33 on 5 degrees of freedom
## Multiple R-squared: 0.9228, Adjusted R-squared: 0.9074
## F-statistic: 59.78 on 1 and 5 DF, p-value: 0.0005782
##look at the residuals
resid(lm.r)
## 1 2 3 4 5 6
## -11.008600 26.325254 3.158718 -27.896794 -2.118843 37.881157
## 7
## -26.340891
## plot the residuals against the predictor (FatGrams) used in this model
plot(FatGrams, resid(lm.r), col = "red", type ='p', pch = 16, main = "Residual Plot")
You cannot estimate a burger's fat content from this model. Regressing Calories on FatGrams and regressing FatGrams on Calories give different lines, so predicting x from y requires fitting a new model with the variables swapped.
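As a rough check of the point above, the sketch below fits both directions and compares slopes. The names `fit_cal_on_fat` and `fit_fat_on_cal` are hypothetical, picked only so that `lm.r` is not overwritten. If the two lines were the same line, one slope would be the reciprocal of the other; for least-squares fits their product is r^2 instead.

```r
# Burger data copied from the chunk above.
FatGrams <- c(19, 31, 34, 35, 39, 39, 43)
Calories <- c(410, 580, 590, 570, 640, 680, 660)

# Hypothetical names for the two fits (not used elsewhere in this document).
fit_cal_on_fat <- lm(Calories ~ FatGrams)  # predicts Calories from FatGrams
fit_fat_on_cal <- lm(FatGrams ~ Calories)  # predicts FatGrams from Calories

slope_cal_on_fat <- unname(coef(fit_cal_on_fat)["FatGrams"])
slope_fat_on_cal <- unname(coef(fit_fat_on_cal)["Calories"])

# Slope you would get by algebraically inverting the first line.
slope_inverted <- 1 / slope_cal_on_fat

# The product of the two fitted slopes equals r^2, not 1.
slope_product <- slope_cal_on_fat * slope_fat_on_cal
r_squared <- cor(FatGrams, Calories)^2

print(c(inverted = slope_inverted, refitted = slope_fat_on_cal))
print(c(slope_product = slope_product, r_squared = r_squared))
```

Because r^2 is below 1 here, the inverted slope and the refitted slope disagree, which is why the second model is fitted separately.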
## enter the data, 2-variable quantitative data
FatGrams = c(19,31,34,35,39,39,43)
Calories = c(410,580,590,570,640,680,660)
## make the scatterplot
plot(Calories, FatGrams, col = "purple", type ='p', pch = 16)
## calculate the linear regression model
lm.r = lm(FatGrams~Calories)
## add the regression line to the scatterplot
abline(lm.r, col = "dark green")
## state the correlation coefficient
cor(FatGrams, Calories)
## [1] 0.9606329
## summary provides lots of information
summary(lm.r)
##
## Call:
## lm(formula = FatGrams ~ Calories)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.2609 -2.4510 -0.2857 2.3837 0.5407 -2.7981 2.8713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.96222 6.43253 -2.326 0.067545 .
## Calories 0.08347 0.01080 7.732 0.000578 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.375 on 5 degrees of freedom
## Multiple R-squared: 0.9228, Adjusted R-squared: 0.9074
## F-statistic: 59.78 on 1 and 5 DF, p-value: 0.0005782
##look at the residuals
resid(lm.r)
## 1 2 3 4 5 6
## -0.2609209 -2.4510035 -0.2857143 2.3837072 0.5407320 -2.7981110
## 7
## 2.8713105
plot(Calories,resid(lm.r), col = "red", type ='p', pch = 16, main = "Residual Plot")
lm.r$coefficients[1]+lm.r$coefficients[2]*600
## (Intercept)
## 35.12043
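The same prediction can also be sketched with `predict()`, which avoids indexing the coefficients by hand and can return a prediction interval as well. The name `fit_fat` is hypothetical, and the 95% level is just the default, not something the assignment specifies.

```r
# Burger data copied from the chunk above.
FatGrams <- c(19, 31, 34, 35, 39, 39, 43)
Calories <- c(410, 580, 590, 570, 640, 680, 660)

# Hypothetical name for the FatGrams-on-Calories fit.
fit_fat <- lm(FatGrams ~ Calories)

# Predicted fat grams for a 600-calorie burger, with a 95% prediction interval.
# The result is a 1-row matrix with columns fit, lwr, upr.
pred_600 <- predict(fit_fat,
                    newdata = data.frame(Calories = 600),
                    interval = "prediction")
print(pred_600)
```

The `fit` column should match the hand-computed value above; the interval gives a sense of how uncertain a single-burger estimate is with only 7 data points.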
I removed the Costa Rica data point because it is unlikely that women there give birth to 25 children on average.
## enter the data, 2-variable quantitative data
BirthsPerWoman = c(2.3,2.3,1.7,3.0,3.7,2.3,1.5,2.0,2.4,2.8,2.7,2.8,4.4,3.6,2.4,2.2,3.2,2.6,3.7,2.8,1.9,2.0,2.1,2.7,2.2)
LifeExp = c(74.6,70.5,75.4,71.9,64.5,70.9,79.8,78.0,72.6,67.8,74.5,71.1,67.6,68.2,70.8,75.1,70.1,75.1,71.2,70.4,77.5,77.4,75.2,73.7,78.6)
## make the scatterplot
plot(BirthsPerWoman, LifeExp, col = "purple", type ='p', pch = 16)
## calculate the linear regression model
lm.r = lm(LifeExp~BirthsPerWoman)
## add the regression line to the scatterplot
abline(lm.r, col = "dark green")
## state the correlation coefficient
cor(BirthsPerWoman, LifeExp)
## [1] -0.7956443
## summary provides lots of information
summary(lm.r)
##
## Call:
## lm(formula = LifeExp ~ BirthsPerWoman)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2653 -1.5492 0.3147 1.9628 3.8707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.4971 1.9019 44.427 < 2e-16 ***
## BirthsPerWoman -4.4399 0.7048 -6.299 1.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.387 on 23 degrees of freedom
## Multiple R-squared: 0.633, Adjusted R-squared: 0.6171
## F-statistic: 39.68 on 1 and 23 DF, p-value: 1.991e-06
##look at the residuals
resid(lm.r)
## 1 2 3 4 5 6
## 0.31474220 -3.78525780 -1.54921510 0.72269239 -3.56935743 -3.38525780
## 7 8 9 10 11 12
## 1.96279913 2.38276355 -1.24126491 -4.26529338 1.99071374 -0.96529338
## 13 14 15 16 17 18
## 2.63859276 -0.31335031 -3.04126491 0.37074932 -0.18932184 2.14672085
## 19 20 21 22 23 24
## 3.13064257 -1.66529338 1.43877067 1.78276355 0.02675644 1.19071374
## 25
## 3.87074932
plot(BirthsPerWoman,resid(lm.r), col = "red", type ='p', pch = 16, main = "Residual Plot")
The correlation is -0.7956443, meaning there is a fairly strong negative linear relationship between births per woman and life expectancy. r^2 is 0.633, meaning that about 63% of the variation in life expectancy is explained by the linear relationship with births per woman.
LifeExp = 84.50 - 4.44 * BirthsPerWoman
Yes, the line is appropriate, since the residuals are randomly scattered with no clear pattern.
For every increase of 1 birth per woman, the predicted life expectancy decreases by about 4.4 years. The intercept predicts that a country with an average of 0 births per woman would have a life expectancy of about 84.5 years, but this is extrapolation.
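Two of the claims above can be checked directly; this is only a sketch, and `fit_life` is a hypothetical name. It confirms that r^2 is the square of the correlation, and that 0 births per woman lies outside the observed range, which is what makes the intercept an extrapolation.

```r
# Data copied from the chunk above (Costa Rica already removed).
BirthsPerWoman <- c(2.3,2.3,1.7,3.0,3.7,2.3,1.5,2.0,2.4,2.8,2.7,2.8,4.4,
                    3.6,2.4,2.2,3.2,2.6,3.7,2.8,1.9,2.0,2.1,2.7,2.2)
LifeExp <- c(74.6,70.5,75.4,71.9,64.5,70.9,79.8,78.0,72.6,67.8,74.5,71.1,67.6,
             68.2,70.8,75.1,70.1,75.1,71.2,70.4,77.5,77.4,75.2,73.7,78.6)

# Hypothetical name for the fit.
fit_life <- lm(LifeExp ~ BirthsPerWoman)

# r^2 from the model summary equals the squared correlation.
r <- cor(BirthsPerWoman, LifeExp)
r_squared <- summary(fit_life)$r.squared
print(c(r = r, r_squared_from_r = r^2, r_squared_from_fit = r_squared))

# Observed range of the predictor: 0 is not inside it.
births_range <- range(BirthsPerWoman)
print(births_range)

# Predicting at 0 births just returns the intercept.
at_zero <- unname(predict(fit_life, newdata = data.frame(BirthsPerWoman = 0)))
print(at_zero)
```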
They could, but they could also provide more direct help, such as building more hospitals and providing better education and services.
## enter the data, 2-variable quantitative data
AverageSpeed = c(25.3,24.3,27.3,40.3,39.56,40.02,39.93,40.94,40.53,41.65,40.78,38.97,40.50)
YearPassed1900 = c(3,4,5,99,100,101,102,103,104,105,106,107,108)
## make the scatterplot
plot(YearPassed1900, AverageSpeed, col = "purple", type ='p', pch = 16)
## calculate the linear regression model
lm.r = lm(AverageSpeed~YearPassed1900)
## add the regression line to the scatterplot
abline(lm.r, col = "dark green")
## state the correlation coefficient
cor(YearPassed1900, AverageSpeed)
## [1] 0.9900163
## summary provides lots of information
summary(lm.r)
##
## Call:
## lm(formula = AverageSpeed ~ YearPassed1900)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85605 -0.23521 0.07753 0.65206 1.49483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.068848 0.574205 43.66 1.11e-13 ***
## YearPassed1900 0.147264 0.006322 23.30 1.03e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9573 on 11 degrees of freedom
## Multiple R-squared: 0.9801, Adjusted R-squared: 0.9783
## F-statistic: 542.7 on 1 and 11 DF, p-value: 1.035e-10
##look at the residuals
resid(lm.r)
## 1 2 3 4 5 6
## -0.2106385 -1.3579021 1.4948343 0.6520568 -0.2352068 0.0775296
## 7 8 9 10 11 12
## -0.1597340 0.7030024 0.1457388 1.1184752 0.1012116 -1.8560519
## 13
## -0.4733155
plot(YearPassed1900,resid(lm.r), col = "red", type ='p', pch = 16, main = "Residual Plot")
The correlation between average speed and year is very strong, positive, and linear, with an r of 0.99. However, there is no data between 1905 and 1999, which makes me wonder whether the relationship is really that strong.
AverageSpeed = 25.07 + 0.147 * YearPassed1900
Yes, the relationship is clearly linear and positive, with a strong correlation.
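The doubt about the 1905 to 1999 gap can be explored with a short sketch. It measures the largest gap between consecutive years and compares the overall correlation with the correlation inside the recent cluster only. The cutoff of 50 used to split the clusters is my own assumption; any value between 5 and 99 gives the same split.

```r
# Data copied from the chunk above.
AverageSpeed <- c(25.3,24.3,27.3,40.3,39.56,40.02,39.93,40.94,40.53,41.65,40.78,38.97,40.50)
YearPassed1900 <- c(3,4,5,99,100,101,102,103,104,105,106,107,108)

# Largest jump between consecutive observed years.
largest_gap <- max(diff(YearPassed1900))
print(largest_gap)

# Split into the early and recent clusters (cutoff of 50 is an assumption).
recent <- YearPassed1900 > 50

# Correlation over all points versus within the recent cluster only.
cor_all <- cor(YearPassed1900, AverageSpeed)
cor_recent <- cor(YearPassed1900[recent], AverageSpeed[recent])
print(c(all = cor_all, recent_only = cor_recent))
```

A high overall r alongside a low within-cluster r suggests the strong correlation is driven mostly by the difference between the two clusters, not by a steady year-to-year trend.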
## enter the data, 2-variable quantitative data
AverageSpeed = c(40.3,39.56,40.02,39.93,40.94,40.53,41.65,40.78,38.97,40.50)
YearPassed1900 = c(99,100,101,102,103,104,105,106,107,108)
## make the scatterplot
plot(YearPassed1900, AverageSpeed, col = "purple", type ='p', pch = 16)
## calculate the linear regression model
lm.r = lm(AverageSpeed~YearPassed1900)
## add the regression line to the scatterplot
abline(lm.r, col = "dark green")
## state the correlation coefficient
cor(YearPassed1900, AverageSpeed)
## [1] 0.1518561
## summary provides lots of information
summary(lm.r)
##
## Call:
## lm(formula = AverageSpeed ~ YearPassed1900)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4799 -0.2995 0.0820 0.3241 1.2754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.41636 8.98195 4.054 0.00366 **
## YearPassed1900 0.03770 0.08675 0.435 0.67537
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7879 on 8 degrees of freedom
## Multiple R-squared: 0.02306, Adjusted R-squared: -0.09906
## F-statistic: 0.1888 on 1 and 8 DF, p-value: 0.6754
##look at the residuals
resid(lm.r)
## 1 2 3 4 5 6
## 0.15163636 -0.62606061 -0.20375758 -0.33145455 0.64084848 0.19315152
## 7 8 9 10
## 1.27545455 0.36775758 -1.47993939 0.01236364
plot(YearPassed1900,resid(lm.r), col = "red", type ='p', pch = 16, main = "Residual Plot")
The regression is very weak and does not meet the conditions for regression, since this data alone shows no clear relationship.
The slope is almost 0, meaning that year and average speed are almost unrelated.
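One way to back up the "slope is almost 0" statement is a confidence interval for the slope. This is a sketch, with `fit_recent` as a hypothetical name and the 95% level as an assumed choice; if the interval contains 0, the data are consistent with no linear trend.

```r
# Data copied from the chunk above (1999 to 2008 only).
AverageSpeed <- c(40.3,39.56,40.02,39.93,40.94,40.53,41.65,40.78,38.97,40.50)
YearPassed1900 <- c(99,100,101,102,103,104,105,106,107,108)

# Hypothetical name for the fit.
fit_recent <- lm(AverageSpeed ~ YearPassed1900)

# 95% confidence interval for the slope (a 1-row matrix: lower, upper).
slope_ci <- confint(fit_recent, "YearPassed1900", level = 0.95)
print(slope_ci)

# TRUE when 0 lies inside the interval.
contains_zero <- slope_ci[1, 1] < 0 && slope_ci[1, 2] > 0
print(contains_zero)
```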
Bernard Hinault, because for his time he was more standard deviations away from the mean than Lance Armstrong was in 2005.
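The "standard deviations away from the mean" comparison is a z-score. Hinault's figures are not in the data above, so the sketch below computes only the 2005 side, and it assumes the relevant mean and standard deviation are those of the 1999 to 2008 speeds listed in this document. The original question may have intended a different reference group.

```r
# Data copied from the chunk above (1999 to 2008 only).
AverageSpeed <- c(40.3,39.56,40.02,39.93,40.94,40.53,41.65,40.78,38.97,40.50)
YearPassed1900 <- c(99,100,101,102,103,104,105,106,107,108)

# Speed for 2005 (105 years past 1900).
speed_2005 <- AverageSpeed[YearPassed1900 == 105]

# z-score: how many standard deviations the 2005 speed is above the mean
# of these ten years (assumed reference group).
z_2005 <- (speed_2005 - mean(AverageSpeed)) / sd(AverageSpeed)
print(z_2005)
```

The same formula applied to Hinault's speed and the mean and standard deviation of his era would give the number to compare against.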