library(tidyverse)
library(Stat2Data)
library(skimr)

Exercise 2.12

t = 15.3/3.4   # test statistic: estimated slope / standard error
n = 82         # sample size
p = 2 * pt(abs(t), df=(n-2), lower.tail = FALSE)   # two-sided p-value

\[ t=\frac{\hat{\beta}_1}{SE_{\hat{\beta}_1}}=\frac{15.3}{3.4}=4.5 \] - With n - 2 = 80 degrees of freedom, the two-sided p-value for t = 4.5 is far below 0.05, so the slope is statistically significant and we reject the null hypothesis that the slope is zero.

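t1 = qt(0.025, 38)   # lower-tail critical value, so t1 is negative
b1 = 15.3            # estimated slope
se = 3.4             # standard error of the slope
b1 + t1*se           # lower bound of the 95% interval
## [1] 8.41706
b1 - t1*se           # upper bound of the 95% interval
## [1] 22.18294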
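
The interval above can also be wrapped in a small helper. This is a minimal sketch: the function name `slope_ci` and its arguments are hypothetical and not part of the exercise, and the call reuses the same values (slope 15.3, standard error 3.4, 38 degrees of freedom) as the code above.

```r
# Hypothetical helper: t-based confidence interval for a regression slope.
slope_ci <- function(b1, se, df, level = 0.95) {
  # Upper-tail critical value, e.g. the 97.5th percentile for a 95% interval
  t_star <- qt(1 - (1 - level) / 2, df)
  c(lower = b1 - t_star * se, upper = b1 + t_star * se)
}

ci <- slope_ci(b1 = 15.3, se = 3.4, df = 38)
print(ci)
```

Using the positive critical value keeps the lower and upper bounds in the expected order, which avoids the sign juggling that comes with `qt(0.025, df)`.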

Exercise 2.16

data(TextPrices)
skim(TextPrices)
## Skim summary statistics
##  n obs: 30 
##  n variables: 2 
## 
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete  n   mean     sd p0 p25   p50 p75 p100     hist
##     Pages       0       30 30 464.53 287.08 51 212 456.5 672 1060 ▇▅▇▅▃▅▃▁
## 
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete  n  mean    sd   p0   p25   p50   p75   p100
##     Price       0       30 30 65.02 51.42 4.25 17.59 55.12 95.75 169.75
##      hist
##  ▇▂▂▃▂▂▂▂
ggplot(data=TextPrices) + geom_point(aes(x=Pages, y=Price), alpha=0.5)

m1 = lm(Price ~ Pages, data=TextPrices)
summary(m1)
## 
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.475 -12.324  -0.584  15.304  72.991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.42231   10.46374  -0.327    0.746    
## Pages        0.14733    0.01925   7.653 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.665 
## F-statistic: 58.57 on 1 and 28 DF,  p-value: 2.452e-08

\[ H_0: \beta_1 = 0 \ \text{(no linear relationship between price and pages)} \\ H_A: \beta_1 \ne 0 \ \text{(a linear relationship between price and pages)} \] - The estimated slope is positive (0.147), so the price of a textbook tends to increase with its number of pages. The summary gives a test statistic of t = 7.653 with a p-value of 2.45e-08, which is far below 0.05, so we reject the null hypothesis. The number of pages is therefore a statistically significant predictor of textbook price.
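
The test statistic and p-value reported by `summary(m1)` can be checked by hand. The sketch below only reuses the rounded slope and standard error printed in the summary, so the results agree with the printed output up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(m1) output
b1 <- 0.14733   # estimated slope for Pages
se <- 0.01925   # standard error of the slope
df <- 28        # residual degrees of freedom (n - 2 = 30 - 2)

t_stat <- b1 / se                                            # slope / standard error
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided p-value

print(t_stat)
print(p_value)
```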

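t1 = qt(0.025, 28)   # lower-tail critical value (negative), df = n - 2 = 28
b1 = 0.14733         # estimated slope for Pages
se = 0.01925         # standard error of the slope
b1 + t1*se           # lower bound of the 95% interval
## [1] 0.1078982
b1 - t1*se           # upper bound of the 95% interval
## [1] 0.1867618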

Exercise 2.22

For this question, use R as a calculator to show your arithmetic.

\[ r^2=\frac{SSModel}{SSTotal}=38/102\approx0.372549 \]

SSModel = 38
SSE = 64
SSTotal = 102
r2 = SSModel/SSTotal
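
As a quick check on the arithmetic, the three sums of squares used here are consistent with the usual decomposition, which also gives an equivalent form of r^2:

\[ SSTotal = SSModel + SSE = 38 + 64 = 102, \qquad r^2 = 1 - \frac{SSE}{SSTotal} = 1 - \frac{64}{102} \approx 0.3725 \]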

Exercise 2.46

data(RailsTrails)
skim(RailsTrails)
## Skim summary statistics
##  n obs: 104 
##  n variables: 30 
## 
## ── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##     variable missing complete   n n_unique
##    AcreGroup       0      104 104        2
##     BedGroup       0      104 104        3
##    DistGroup       0      104 104        2
##  GarageGroup       0      104 104        2
##      SFGroup       0      104 104        2
##   StreetName       0      104 104       73
##                        top_counts ordered
##           <= : 54, > 1: 50, NA: 0   FALSE
##  3 b: 52, 4+ : 36, 1-2: 16, NA: 0   FALSE
##           Far: 64, Clo: 40, NA: 0   FALSE
##            yes: 53, no: 51, NA: 0   FALSE
##           > 1: 53, <= : 51, NA: 0   FALSE
##    Lau: 8, Rya: 6, Bri: 3, Lon: 3   FALSE
## 
## ── Variable type:integer ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##      variable missing complete   n    mean     sd   p0     p25    p50
##      Bedrooms       0      104 104    3.25   0.91    1    3       3  
##     BikeScore       0      104 104   57.28  22.67   18   36      54.5
##  GarageSpaces       0      104 104    0.76   0.86    0    0       1  
##      HouseNum       0      104 104   52.5   30.17    1   26.75   52.5
##  NumFullBaths       0      104 104    1.45   0.62    1    1       1  
##  NumHalfBaths       0      104 104    0.22   0.42    0    0       0  
##      NumRooms       0      104 104    6.62   1.67    4    5       6.5
##     StreetNum       0      104 104  137.84 201.96    1   27      63.5
##     WalkScore       0      104 104   38.88  26.25    2   14.75   36  
##           Zip       0      104 104 1061.17   0.99 1060 1060    1062  
##      p75 p100     hist
##     4       6 ▁▂▁▇▅▁▁▁
##    77.25   97 ▂▇▃▃▂▅▅▃
##     1       4 ▇▅▁▃▁▁▁▁
##    78.25  104 ▇▇▇▇▇▇▇▇
##     2       4 ▇▁▅▁▁▁▁▁
##     0       1 ▇▁▁▁▁▁▁▂
##     7.25   14 ▇▇▇▇▁▁▁▁
##   155    1086 ▇▂▁▁▁▁▁▁
##    60.75   94 ▇▇▂▅▃▅▂▂
##  1062    1062 ▆▁▁▁▁▁▁▇
## 
## ── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##    variable missing complete   n   mean      sd       p0    p25    p50
##        Acre       0      104 104   0.26   0.12     0.05    0.17   0.25
##     Adj1998       0      104 104 208.57  66.46    60.66  167    200.62
##     Adj2007       0      104 104 327.55 104.98   162.6   260.56 303.65
##     Adj2011       0      104 104 284.53  93.29   141.72  215.81 258.87
##    Diff2014       0      104 104  84.53  76.57  -199.87   44.33  71.39
##    Distance       0      104 104   1.11   0.94     0.039   0.33   0.76
##    Latitude       0      104 104  42.33   0.013   42.3    42.32  42.32
##   Longitude       0      104 104 -72.66   0.024  -72.73  -72.68 -72.66
##   PctChange       0      104 104  42.2   30.2    -46.75   26.54  37.61
##   Price1998       0      104 104 142.69  45.47    41.5   114.25 137.25
##   Price2007       0      104 104 285.05  91.36   141.5   226.75 264.25
##   Price2011       0      104 104 268.62  88.07   133.8   203.75 244.4 
##   Price2014       0      104 104 293.09 110.89   132.13  212.94 272.92
##  SquareFeet       0      104 104   1.57   0.56     0.52    1.21   1.52
##     p75   p100     hist
##    0.33   0.56 ▆▅▇▇▇▂▁▂
##  228.39 470.67 ▁▃▇▆▁▁▁▁
##  349.47 798.62 ▃▇▃▁▁▁▁▁
##  325.09 698.54 ▃▇▃▂▁▁▁▁
##  106.87 497.82 ▁▁▆▇▂▁▁▁
##    1.9    3.98 ▇▅▁▃▂▂▁▁
##   42.33  42.35 ▁▅▃▇▅▃▃▂
##  -72.64 -72.61 ▁▁▆▅▇▅▇▂
##   51.17 130.49 ▁▁▃▇▆▂▁▂
##  156.25 322    ▁▃▇▆▁▁▁▁
##  304.12 695    ▃▇▃▁▁▁▁▁
##  306.93 659.5  ▃▇▃▂▁▁▁▁
##  334.17 879.33 ▇▇▅▂▁▁▁▁
##    1.83   4.03 ▃▇▇▃▁▁▁▁
myRailsTrails = lm(Adj2007 ~ SquareFeet, data = RailsTrails)
ggplot(data=RailsTrails) + geom_point(aes(x=SquareFeet, y=Adj2007), alpha=0.5)

summary(myRailsTrails)
## 
## Call:
## lm(formula = Adj2007 ~ SquareFeet, data = RailsTrails)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -131.940  -36.635   -4.109   32.470  153.323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   72.973     15.541   4.695 8.32e-06 ***
## SquareFeet   162.526      9.351  17.381  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53 on 102 degrees of freedom
## Multiple R-squared:  0.7476, Adjusted R-squared:  0.7451 
## F-statistic: 302.1 on 1 and 102 DF,  p-value: < 2.2e-16

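\[ \widehat{Adj_{2007}} = 72.973 + 162.526 \cdot SquareFeet \\ \widehat{Adj_{2007}} = 72.973 + 162.526 \cdot 1.5 \approx 316.76 \] - SquareFeet is recorded in thousands of square feet (its mean in the summary is 1.57) and Adj2007 in thousands of dollars, so a 1,500 sq ft home corresponds to SquareFeet = 1.5. Its predicted adjusted 2007 price is about 316.76 thousand dollars, or roughly $316,762.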
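
The unit conversion matters for this prediction, so it may help to spell it out in code. This sketch assumes, based on the `skim()` summary (SquareFeet has mean 1.57, Adj2007 has mean 327.55), that SquareFeet is in thousands of square feet and Adj2007 is in thousands of dollars; the variable names are illustrative.

```r
# Coefficients copied from summary(myRailsTrails)
b0 <- 72.973    # intercept (thousands of dollars)
b1 <- 162.526   # slope (thousands of dollars per 1,000 sq ft)

sqft <- 1500                  # home size in square feet
square_feet <- sqft / 1000    # convert to the assumed units of SquareFeet

pred_thousands <- b0 + b1 * square_feet
print(pred_thousands)         # predicted Adj2007, in thousands of dollars
print(pred_thousands * 1000)  # the same prediction in dollars
```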

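t1 = qt(0.025, 102)   # lower-tail critical value (negative), df = n - 2 = 102
b1 = 162.526          # estimated slope for SquareFeet
se = 9.351            # standard error of the slope
b1 + t1*se            # lower bound of the 95% interval
## [1] 143.9783
b1 - t1*se            # upper bound of the 95% interval
## [1] 181.0737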
plot(myRailsTrails, which = 1)

plot(myRailsTrails, which = 2)

plot(myRailsTrails, which = 3)

- The scatterplot shows some curvature, which casts doubt on the linearity condition. In the Normal Q-Q plot, many points fall away from the reference line, so the residuals do not look normally distributed. In the Residuals vs. Fitted plot, the smoothed line is not straight, which suggests the constant-variance condition is also in doubt.

logRailsTrails = RailsTrails %>%
  mutate(logSquareFeet = log(SquareFeet), logAdj2007 = log(Adj2007))
logmyRailsTrails = lm(logAdj2007 ~ logSquareFeet, data=logRailsTrails)
ggplot(data=logRailsTrails) + geom_point(aes(x=logSquareFeet, y=logAdj2007), alpha=0.5)

summary(logmyRailsTrails)
## 
## Call:
## lm(formula = logAdj2007 ~ logSquareFeet, data = logRailsTrails)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36543 -0.11094 -0.01864  0.10474  0.35402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.47958    0.02137  256.41   <2e-16 ***
## logSquareFeet  0.69334    0.04082   16.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1461 on 102 degrees of freedom
## Multiple R-squared:  0.7388, Adjusted R-squared:  0.7362 
## F-statistic: 288.5 on 1 and 102 DF,  p-value: < 2.2e-16

\[ \widehat{\log(Adj_{2007})} = 5.47958 + 0.69334 \cdot \log(SquareFeet) \] - After log-transforming both variables, the data follow a linear relationship more closely than they do on the original scale.
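
A log-log fit can be read back on the original scale by exponentiating the prediction. The sketch below assumes natural logs (which is what `log()` computes in R) and that SquareFeet is in thousands of square feet; the 1,500 sq ft home is only an illustrative input. Exponentiating a log-scale prediction gives a rough typical value rather than an estimate of the mean price.

```r
# Coefficients copied from summary(logmyRailsTrails)
b0 <- 5.47958
b1 <- 0.69334

square_feet <- 1.5                       # 1,500 sq ft, in thousands (assumed units)
log_pred <- b0 + b1 * log(square_feet)   # prediction on the log scale
pred <- exp(log_pred)                    # back-transformed, in thousands of dollars

print(log_pred)
print(pred)
```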

plot(logmyRailsTrails, which = 1)

plot(logmyRailsTrails, which = 2)

plot(logmyRailsTrails, which = 3)

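t1 = qt(0.025, 102)   # lower-tail critical value (negative), df = n - 2 = 102
b1 = 0.69334          # estimated slope for logSquareFeet
se = 0.04082          # standard error of the slope
b1 + t1*se            # lower bound of the 95% interval
## [1] 0.6123737
b1 - t1*se            # upper bound of the 95% interval
## [1] 0.7743063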

Exercise 2.61

For this question, use R as a calculator to show your arithmetic.
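
The slope and intercept here come from summary statistics alone. A worked version, assuming 0.701 is the correlation between Enroll and Gate, 104807 and 657 are the standard deviations of Gate and Enroll, and 247235 and 2009 are their means:

\[ \hat{\beta}_1 = r \cdot \frac{s_y}{s_x} = 0.701 \cdot \frac{104807}{657} \approx 111.826, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 247235 - \hat{\beta}_1 \cdot 2009 \approx 22,576.49 \]

The intercept uses the unrounded slope, which is why it matches the value R reports.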

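r = 0.701        # correlation between Enroll and Gate (r, not r^2)
sy = 104807      # standard deviation of Gate
sx = 657         # standard deviation of Enroll
y = 247235       # mean of Gate
x = 2009         # mean of Enroll
b1 = r*(sy/sx)   # slope
b0 = y-b1*x      # intercept

\[ \widehat{Gate} = 22,576.49 + 111.826 \cdot Enroll \]

r^2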
## [1] 0.491401
22576.49+111.826*1445
## [1] 184165.1

\[ \widehat{Gate} \approx 184,165 \] - The predicted number of people is about 184,165.

(d)

predicted = 22576.49+111.826*2200
residual = 130000 - predicted   # residual = observed - predicted
residual
## [1] -138593.7

\[ Residual = 130,000 - 268,593.69 = -138,593.69 \]