library(tidyverse)
library(Stat2Data)
library(skimr)
t = 15.3/3.4
n = 82
p = 2 * pt(abs(t), df=(n-2), lower.tail = FALSE)
\[ t=\frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}}=\frac{15.3}{3.4}=4.5 \] - With 80 degrees of freedom, the two-sided p-value is far below 0.05, so the slope is statistically significant and we reject the null hypothesis that the slope is zero.
t1 = qt(0.025, 38)
b1 = 15.3
se = 3.4
b1 + t1*se
## [1] 8.41706
b1 - t1*se
## [1] 22.18294
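The interval above can also be wrapped in a small helper so that the lower and upper bounds are labeled explicitly. This is a minimal sketch: `slope_ci` is a hypothetical name, not a function from any package, and it assumes the usual t-based interval for a regression slope.

```r
# Hypothetical helper: t-based confidence interval for a regression slope.
slope_ci <- function(b1, se, df, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df)  # positive critical value
  c(lower = b1 - t_star * se, upper = b1 + t_star * se)
}

# Same inputs as the calculation above (slope 15.3, SE 3.4, 38 df).
ci <- slope_ci(b1 = 15.3, se = 3.4, df = 38)
print(ci)
```

Using the positive critical value keeps the sign of each bound obvious, whereas `qt(0.025, df)` is negative and so `b1 + t1*se` is the lower bound.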
data(TextPrices)
skim(TextPrices)
## Skim summary statistics
## n obs: 30
## n variables: 2
##
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Pages 0 30 30 464.53 287.08 51 212 456.5 672 1060 ▇▅▇▅▃▅▃▁
##
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Price 0 30 30 65.02 51.42 4.25 17.59 55.12 95.75 169.75
## hist
## ▇▂▂▃▂▂▂▂
ggplot(data=TextPrices) + geom_point(aes(x=Pages, y=Price), alpha=0.5)
m1 = lm(Price ~ Pages, data=TextPrices)
summary(m1)
##
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.475 -12.324 -0.584 15.304 72.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.42231 10.46374 -0.327 0.746
## Pages 0.14733 0.01925 7.653 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.665
## F-statistic: 58.57 on 1 and 28 DF, p-value: 2.452e-08
\[ H_0: \beta_1 = 0 \text{ (no linear relationship between price and pages)} \\ H_A: \beta_1 \ne 0 \text{ (a linear relationship between price and pages)} \] - The number of pages has a positive relationship with price: as the number of pages increases, the predicted price also increases. From the summary, the test statistic is 7.653 and the p-value is 2.45e-08, which is far below 0.05, so we reject the null hypothesis. Therefore, the number of pages is a useful predictor of a textbook's price.
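As a check on the summary table, the test statistic and two-sided p-value can be recomputed from the reported estimate and standard error. This sketch uses the rounded values printed by `summary(m1)`, so the results should agree with the table only up to rounding.

```r
# Values copied from the Pages row of the summary output.
b1 <- 0.14733
se <- 0.01925
df <- 28  # residual degrees of freedom (n - 2 with n = 30)

t_stat  <- b1 / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
print(c(t = t_stat, p = p_value))
```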
t1 = qt(0.025, 28)
b1 = 0.14733
se = 0.01925
b1 + t1*se
## [1] 0.1078982
b1 - t1*se
## [1] 0.1867618
For this question, use R as a calculator to show your arithmetic.
\[ r^2=\frac{SSModel}{SSTotal}=\frac{38}{102}\approx0.372549 \]
SSModel = 38
SSE = 64
SSTotal = 102
r2 = SSModel/SSTotal
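The arithmetic above relies on the sum-of-squares decomposition SSTotal = SSModel + SSE. A short sketch that checks the decomposition before computing r-squared (the variable names here are illustrative):

```r
ss_model <- 38
ss_error <- 64
ss_total <- ss_model + ss_error  # decomposition: total = model + error

r_squared <- ss_model / ss_total
print(r_squared)
```

Equivalently, `1 - ss_error / ss_total` gives the same value.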
data(RailsTrails)
skim(RailsTrails)
## Skim summary statistics
## n obs: 104
## n variables: 30
##
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n n_unique
## AcreGroup 0 104 104 2
## BedGroup 0 104 104 3
## DistGroup 0 104 104 2
## GarageGroup 0 104 104 2
## SFGroup 0 104 104 2
## StreetName 0 104 104 73
## top_counts ordered
## <= : 54, > 1: 50, NA: 0 FALSE
## 3 b: 52, 4+ : 36, 1-2: 16, NA: 0 FALSE
## Far: 64, Clo: 40, NA: 0 FALSE
## yes: 53, no: 51, NA: 0 FALSE
## > 1: 53, <= : 51, NA: 0 FALSE
## Lau: 8, Rya: 6, Bri: 3, Lon: 3 FALSE
##
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## Bedrooms 0 104 104 3.25 0.91 1 3 3
## BikeScore 0 104 104 57.28 22.67 18 36 54.5
## GarageSpaces 0 104 104 0.76 0.86 0 0 1
## HouseNum 0 104 104 52.5 30.17 1 26.75 52.5
## NumFullBaths 0 104 104 1.45 0.62 1 1 1
## NumHalfBaths 0 104 104 0.22 0.42 0 0 0
## NumRooms 0 104 104 6.62 1.67 4 5 6.5
## StreetNum 0 104 104 137.84 201.96 1 27 63.5
## WalkScore 0 104 104 38.88 26.25 2 14.75 36
## Zip 0 104 104 1061.17 0.99 1060 1060 1062
## p75 p100 hist
## 4 6 ▁▂▁▇▅▁▁▁
## 77.25 97 ▂▇▃▃▂▅▅▃
## 1 4 ▇▅▁▃▁▁▁▁
## 78.25 104 ▇▇▇▇▇▇▇▇
## 2 4 ▇▁▅▁▁▁▁▁
## 0 1 ▇▁▁▁▁▁▁▂
## 7.25 14 ▇▇▇▇▁▁▁▁
## 155 1086 ▇▂▁▁▁▁▁▁
## 60.75 94 ▇▇▂▅▃▅▂▂
## 1062 1062 ▆▁▁▁▁▁▁▇
##
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## Acre 0 104 104 0.26 0.12 0.05 0.17 0.25
## Adj1998 0 104 104 208.57 66.46 60.66 167 200.62
## Adj2007 0 104 104 327.55 104.98 162.6 260.56 303.65
## Adj2011 0 104 104 284.53 93.29 141.72 215.81 258.87
## Diff2014 0 104 104 84.53 76.57 -199.87 44.33 71.39
## Distance 0 104 104 1.11 0.94 0.039 0.33 0.76
## Latitude 0 104 104 42.33 0.013 42.3 42.32 42.32
## Longitude 0 104 104 -72.66 0.024 -72.73 -72.68 -72.66
## PctChange 0 104 104 42.2 30.2 -46.75 26.54 37.61
## Price1998 0 104 104 142.69 45.47 41.5 114.25 137.25
## Price2007 0 104 104 285.05 91.36 141.5 226.75 264.25
## Price2011 0 104 104 268.62 88.07 133.8 203.75 244.4
## Price2014 0 104 104 293.09 110.89 132.13 212.94 272.92
## SquareFeet 0 104 104 1.57 0.56 0.52 1.21 1.52
## p75 p100 hist
## 0.33 0.56 ▆▅▇▇▇▂▁▂
## 228.39 470.67 ▁▃▇▆▁▁▁▁
## 349.47 798.62 ▃▇▃▁▁▁▁▁
## 325.09 698.54 ▃▇▃▂▁▁▁▁
## 106.87 497.82 ▁▁▆▇▂▁▁▁
## 1.9 3.98 ▇▅▁▃▂▂▁▁
## 42.33 42.35 ▁▅▃▇▅▃▃▂
## -72.64 -72.61 ▁▁▆▅▇▅▇▂
## 51.17 130.49 ▁▁▃▇▆▂▁▂
## 156.25 322 ▁▃▇▆▁▁▁▁
## 304.12 695 ▃▇▃▁▁▁▁▁
## 306.93 659.5 ▃▇▃▂▁▁▁▁
## 334.17 879.33 ▇▇▅▂▁▁▁▁
## 1.83 4.03 ▃▇▇▃▁▁▁▁
myRailsTrails = lm(Adj2007 ~ SquareFeet, data = RailsTrails)
# skim() has no method for lm objects, so plot the raw data instead
ggplot(data=RailsTrails) + geom_point(aes(x=SquareFeet, y=Adj2007), alpha=0.5)
summary(myRailsTrails)
##
## Call:
## lm(formula = Adj2007 ~ SquareFeet, data = RailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.940 -36.635 -4.109 32.470 153.323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.973 15.541 4.695 8.32e-06 ***
## SquareFeet 162.526 9.351 17.381 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53 on 102 degrees of freedom
## Multiple R-squared: 0.7476, Adjusted R-squared: 0.7451
## F-statistic: 302.1 on 1 and 102 DF, p-value: < 2.2e-16
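\[ \widehat{Adj_{2007}} = 72.973 + 162.526 \cdot SquareFeet \\ \widehat{Adj_{2007}} = 72.973 + 162.526 \cdot 1.5 = 316.762 \] - The skim summary shows SquareFeet on a scale of thousands of square feet (mean 1.57) and Adj2007 in thousands of dollars, so a 1,500 sq ft home enters the model as 1.5. Its predicted adjusted 2007 price is about 316.762 thousand dollars, or roughly $316,762.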
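The unit conversion is the easy step to get wrong here, so it helps to make it explicit. The sketch below assumes, based on the skim summary (mean SquareFeet of 1.57), that SquareFeet is recorded in thousands of square feet and that Adj2007 is in thousands of dollars. `predict_adj2007` is a hypothetical helper name.

```r
# Coefficients copied from summary(myRailsTrails).
b0 <- 72.973
b1 <- 162.526

# Hypothetical helper: takes raw square feet, converts to thousands
# (assumed unit of the SquareFeet variable), returns price in $1000s.
predict_adj2007 <- function(square_feet) {
  b0 + b1 * (square_feet / 1000)
}

pred <- predict_adj2007(1500)
print(pred)
```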
t1 <- qt(0.025, 102)
b1 = 162.526
se = 9.351
b1 + t1*se
## [1] 143.9783
b1 - t1*se
## [1] 181.0737
plot(myRailsTrails, which = 1)
plot(myRailsTrails, which = 2)
plot(myRailsTrails, which = 3)
- The scatterplot shows slight curvature, which calls the linearity condition into question. In the Normal Q-Q plot, many points fall off the straight line, so the normality condition for the residuals is doubtful. In the Residuals vs. Fitted plot, the smoothed line is not flat and the spread of the residuals does not look constant, which suggests the equal-variance condition is also violated.
logRailsTrails = RailsTrails %>%
mutate(logSquareFeet = log(SquareFeet), logAdj2007 = log(Adj2007))
logmyRailsTrails = lm(logAdj2007 ~ logSquareFeet, data=logRailsTrails)
ggplot(data=logRailsTrails) + geom_point(aes(x=logSquareFeet, y=logAdj2007), alpha=0.5)
summary(logmyRailsTrails)
##
## Call:
## lm(formula = logAdj2007 ~ logSquareFeet, data = logRailsTrails)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36543 -0.11094 -0.01864 0.10474 0.35402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.47958 0.02137 256.41 <2e-16 ***
## logSquareFeet 0.69334 0.04082 16.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1461 on 102 degrees of freedom
## Multiple R-squared: 0.7388, Adjusted R-squared: 0.7362
## F-statistic: 288.5 on 1 and 102 DF, p-value: < 2.2e-16
\[ \widehat{\log(Adj_{2007})} = 5.47958 + 0.69334 \cdot \log(SquareFeet) \] - After log-transforming both variables, the relationship is closer to linear than on the original scale.
plot(logmyRailsTrails, which = 1)
plot(logmyRailsTrails, which = 2)
plot(logmyRailsTrails, which = 3)
t1 <- qt(0.025, 102)
b1 = 0.69334
se = 0.04082
b1 + t1*se
## [1] 0.6123737
b1 - t1*se
## [1] 0.7743063
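Because both variables are on the natural-log scale, predictions from the log model have to be back-transformed with `exp()` to return to thousands of dollars. The sketch below is illustrative only: it assumes SquareFeet is in thousands of square feet, the helper name `predict_log_model` is hypothetical, and the simple `exp()` back-transform is one common convention rather than the only option.

```r
# Coefficients copied from summary(logmyRailsTrails).
b0_log <- 5.47958
b1_log <- 0.69334

# Hypothetical helper: predict on the log scale, then back-transform.
predict_log_model <- function(square_feet_thousands) {
  exp(b0_log + b1_log * log(square_feet_thousands))
}

pred_log <- predict_log_model(1.5)  # a 1,500 sq ft home
doubling_factor <- 2^b1_log         # price multiplier when size doubles
print(c(prediction = pred_log, doubling = doubling_factor))
```

In a log-log model the slope acts as an elasticity: a 1% increase in square footage is associated with roughly a 0.69% increase in predicted price.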
For this question, use R as a calculator to show your arithmetic.
r = 0.701   # correlation coefficient
sy = 104807
sx = 657
y = 247235
x = 2009
b1 = r*(sy/sx)
b0 = y-b1*x
\[ \hat{Gate} = 22,576.49 + 111.826*Enroll \]
r^2
## [1] 0.491401
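The coefficients above follow from the standard summary-statistic formulas for simple linear regression. The worked version below assumes that 0.701 is the correlation r, that 104807 and 657 are the standard deviations of Gate and Enroll, and that 247235 and 2009 are their means; the intercept uses the unrounded slope.

\[ \hat{\beta}_1 = r\,\frac{s_y}{s_x} = 0.701\cdot\frac{104807}{657} \approx 111.826, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x} = 247235 - \hat{\beta}_1\cdot 2009 \approx 22576.49 \]

Squaring the correlation gives \( r^2 = 0.701^2 = 0.491401 \), so about 49% of the variability in Gate is explained by Enroll.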
22576.49+111.826*1445
## [1] 184165.1
\[ \hat{Gate} \approx 184,165 \] - The predicted number of persons is about 184,165. (d)
predicted = 22576.49+111.826*2200
residual = 130000 - predicted   # residual = observed - predicted
residual
## [1] -138593.7
\[ Residual = observed - predicted = 130,000 - 268,593.69 = -138,593.69 \]