\[Y \approx \beta_0 + \beta_1X\] Simple as it gets. Use the training data \(X \in \Bbb R^n\) to produce the estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\), then create predictions \(\hat{y}\).

### Estimating Coefficients

Measure closeness of fit. Least squares is the most common method.
library(broom)
library(ggplot2)
# simulate a small data set and fit a simple linear regression
y <- rnorm(10)
x <- 1:10
mod <- lm(y ~ x)
df <- augment(mod)   # adds fitted values (.fitted) and residuals to the data
# plot the observations, the fitted values, and the residual segments between them
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_point(aes(y = .fitted), shape = 1) +
  geom_segment(aes(xend = x, yend = .fitted), color = "red")
Let \(\hat{y}_i = \hat{\beta_0} + \hat{\beta_1}x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\). Then the residual (error) is \(e_i = y_i - \hat{y_i}\). The residual sum of squares is defined as \[RSS = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2\] Using some calculus (set the derivatives to zero) we obtain the minimizing coefficients \[\begin{eqnarray} \hat{\beta_1} &=& \frac{\sum_{i=1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ \hat{\beta_0} &=& \bar{y} - \hat{\beta_1}\bar{x} \end{eqnarray}\] where \(\bar{y} \equiv \frac{1}{n} \sum_{i=1}^n y_i\) and \(\bar{x} \equiv \frac{1}{n} \sum_{i=1}^n x_i\) are the sample means of \(Y\) and \(X\) respectively.
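As a quick sanity check, the closed-form estimates can be computed directly and compared with the coefficients returned by lm(), reusing the simulated x, y, and mod from the code above.

# compute the least squares coefficients by hand and compare with lm()
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(mod)  # should match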
In the same way we estimate the population mean with the sample mean \(\mu \approx \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\), the least squares coefficients estimate the population regression line. This analogy rests on the concept of bias. If the sample mean \(\hat\mu\) is used to estimate the population mean \(\mu\), the estimate is unbiased: on average \(\hat\mu\) equals \(\mu\). Given a huge set of observations \(Y \in \Bbb R^n\) where \(n \gg 1\), there is no systematic over- or under-estimate of the true parameter, and \(\hat\mu \approx \mu\).

So how accurate is a single \(\hat\mu\) as an estimate of \(\mu\)? Compute the standard error, defined by \[\text{Var}(\hat\mu) = \text{SE}(\hat\mu)^2 = \frac{\sigma^2}{n}\] where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\). The formulas for the least squares coefficient standard errors are \[ \begin{eqnarray} \text{SE}(\hat\beta_0)^2 &=& \sigma^2 \left \lbrack \frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^n (x_i - \bar x)^2} \right \rbrack\\ \text{SE}(\hat\beta_1)^2 &=& \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} \\ \sigma^2 &=& \text{Var}(\epsilon) \\ \sigma \approx \text{RSE} &=& \sqrt{\text{RSS} / (n - 2)} \end{eqnarray} \] This assumes the errors are uncorrelated and have constant variance. The residual standard error (RSE) is computed from the data set because \(\sigma\) is generally unknown. A confidence interval can then be computed: a 95% confidence interval means there is approximately a 95% chance the interval \(\hat\beta_1 \pm 2 \cdot \text{SE}(\hat\beta_1)\) contains the true value of \(\beta_1\).

Standard errors can also be used for hypothesis testing, the null hypothesis \(H_0\) versus the alternative hypothesis \(H_a\): \[ \begin{eqnarray} H_0 : \text{There is no relationship between } X \text{ and } Y \Rightarrow \beta_1 &=& 0\\ H_a : \text{There is some relationship between } X \text{ and } Y \Rightarrow \beta_1 &\neq& 0 \end{eqnarray} \] So we need to test whether \(\hat\beta_1\), our estimate for \(\beta_1\), is sufficiently far from zero to be considered non-zero. How far is far enough? Use the t-statistic, which measures the number of standard errors \(\hat\beta_1\) is away from \(0\): \[t = \frac{\hat\beta_1 - 0}{\text{SE} (\hat\beta_1)}\] If there is no relationship between \(X\) and \(Y\), then \(t\) follows a t-distribution with \(n-2\) degrees of freedom. We check the probability of observing \(|t|\) or larger when \(\beta_1=0\) using a p-value. Small p-values indicate it is unlikely to observe such a substantial association purely by chance, so we reject the null hypothesis and declare there is some relationship between \(X\) and \(Y\). Typical p-value cutoffs are 5% or 1%.
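A minimal sketch, reusing the toy fit mod from above, that rebuilds \(\text{SE}(\hat\beta_1)\), the t-statistic, and its p-value from these formulas and compares them with the table produced by summary().

n <- length(x)
rse <- sqrt(sum(resid(mod)^2) / (n - 2))           # estimate of sigma
se_b1 <- rse / sqrt(sum((x - mean(x))^2))          # SE(beta_1)
t_b1 <- coef(mod)["x"] / se_b1                     # t-statistic
p_b1 <- 2 * pt(abs(t_b1), df = n - 2, lower.tail = FALSE)
c(se_b1, t_b1, p_b1)
summary(mod)$coefficients                          # should agree
confint(mod)                                       # roughly beta_1 +/- 2*SE(beta_1)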
Even if we knew the true regression line, we could not perfectly predict \(Y\) from \(X\) because of the errors \(\epsilon\). The RSE is an estimate of the standard deviation of \(\epsilon\): roughly speaking, the average amount the response deviates from the true regression line. \[ \text{RSE} = \sqrt{\frac{1}{n-2} \text{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n (y_i - \hat y_i)^2} \] Again, RSS is the residual sum of squares. In general, the RSE measures the lack of fit of the model to the data.
Similar to the RSE, but the \(R^2\) statistic measures a proportion, so the value is always between 0 and 1 and is independent of the scale of \(Y\). \[ R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2} \] TSS measures the total variance in the response \(Y\). In contrast, RSS measures the amount of variability left unexplained after the regression. Hence TSS - RSS measures the amount of response variability that is explained. If the model explains the response variability well, \(\text{RSS} \ll \text{TSS}\) and \(R^2 \approx 1\). Thus \(R^2\) measures how well \(X\) explains the variability in \(Y\).
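Again with the toy fit mod, RSE and \(R^2\) can be computed directly from RSS and TSS; the values should match sigma(mod) and summary(mod)$r.squared.

rss <- sum((y - fitted(mod))^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)       # total sum of squares
c(RSE = sqrt(rss / (length(y) - 2)), R2 = 1 - rss / tss)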
Adding the effects of multiple predictors. We’ll add the \(\mathbf 1\) vector in this equation, but omit it later for ease of notation. \[Y_n = \begin{bmatrix}\mathbf 1 & X \end{bmatrix}_{n \times (p+1)} \cdot \beta_{(p+1)}\]

### Estimating Regression Coefficients

Since \(\beta\) is unknown, we estimate it using the same least squares method. \[ \begin{eqnarray}
\hat Y &=& X \cdot \hat\beta \\
\text{RSS} &=& (Y - \hat Y)^T (Y - \hat Y)
\end{eqnarray} \] Estimating multiple regression coefficients can give different results than separate simple regressions. Each slope coefficient is the change in the response for a one-unit change in that variable while holding all the others constant. We can check for trouble by computing the correlations among the predictors: a strong correlation means one variable is largely determined by another, and its separate effect on the response may be superfluous.
This counter-intuitive idea can be illustrated by claiming that high ice cream sales cause more shark attacks. While no one has banned ice cream sales to reduce shark attacks, a more realistic explanation is that higher temperatures draw more people to the beach, where they buy ice cream and succumb to shark attacks.
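Back to the coefficient estimates themselves: the matrix-form RSS above is minimized by the normal equations, \(\hat\beta = (X^TX)^{-1}X^TY\). A minimal sketch on simulated data (the variable names here are arbitrary):

set.seed(42)
z1 <- rnorm(20); z2 <- rnorm(20)
zy <- 1 + 2 * z1 - 3 * z2 + rnorm(20)
Z <- cbind(1, z1, z2)               # design matrix with an intercept column
solve(t(Z) %*% Z, t(Z) %*% zy)      # normal-equations solution
coef(lm(zy ~ z1 + z2))              # should match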
Is at least one of the predictors useful in predicting the response? Test the following hypothesis using the F-statistic: \[ \begin{eqnarray} H_0 &:& \beta_1 = \beta_2 = \dots = \beta_p = 0 \\ H_a &:& \text{ at least one } \beta_j \neq 0 \\ \\ F &=& \frac{(TSS - RSS) / p}{RSS / (n - p - 1)} \end{eqnarray} \] It can be shown that the denominator has expected value \(\sigma^2\) and, if \(H_0\) is true, so does the numerator; thus if there is no relationship between the response and the predictors we expect the F-statistic to be close to 1. If \(H_a\) is true, then the numerator is \(> \sigma^2\) and \(F > 1\).
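The F-statistic can be reproduced by hand for the toy fit mod (here \(p = 1\)) and checked against summary(mod)$fstatistic.

p <- 1                                      # one predictor in mod
n <- length(y)
rss <- sum(resid(mod)^2)
tss <- sum((y - mean(y))^2)
((tss - rss) / p) / (rss / (n - p - 1))     # F-statistic
summary(mod)$fstatistic                     # value, numerator df, denominator df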
Do all of the predictors help explain \(Y\), or is only a subset useful?
The F-statistic and its associated p-value tell us whether at least one of the predictors is useful. But which ones? Variable selection pares the full set of predictors down to a useful subset. There are a total of \(2^p\) models containing subsets of \(p\) variables, so brute force is infeasible. Use a more automated and efficient approach; three classical methods follow, with a short sketch in code after the list:
+ Forward Selection: begin with the null model (an intercept and no predictors), then fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS. Continue adding predictors until some stopping rule is satisfied.
+ Backward Selection: begin with all the variables in the model, and remove the variable with the largest p-value (the least statistically significant). Continue removing predictors until a stopping rule is reached.
+ Mixed Selection: begin with the null model and add predictors one at a time; whenever a variable’s p-value rises above a certain threshold, remove it. Continue adding and removing predictors until all the variables in the model are significant and every excluded variable would have a large p-value if added.
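A sketch of automated selection with base R’s step(), which uses AIC rather than raw RSS or p-values as its stopping criterion; Boston comes from the MASS package, which is also attached in the lab section below.

library(MASS)
null_mod <- lm(medv ~ 1, data = Boston)   # intercept-only model
full_mod <- lm(medv ~ ., data = Boston)   # all predictors
fwd <- step(null_mod, scope = list(lower = formula(null_mod), upper = formula(full_mod)),
            direction = "forward", trace = 0)
bwd <- step(full_mod, direction = "backward", trace = 0)
formula(fwd); formula(bwd)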
\(R^2\): the proportion of variance explained; in simple regression it equals the squared correlation of the response and the predictor, and in multiple linear regression it equals \(\text{Cor}(Y,\hat Y)^2\). Adding more variables always increases \(R^2\), even if the predictor is only weakly associated with the response, because more predictors give the model more freedom to fit the training data (but not the test data). However, the RSE can increase slightly when weak predictors are added to the model, because the RSE depends on the residual degrees of freedom \(n - p - 1\) as well as on RSS \[\text{RSE} = \sqrt{\frac{1}{n-p-1} \text{RSS}} \]
Three sorts of fitted model prediction uncertainty:

+ Reducible error in the coefficient estimates: \(\hat\beta\) differs from \(\beta\), which confidence intervals quantify.
+ Model bias: assuming a linear form for \(f(X)\) is itself an approximation.
+ Irreducible error: even knowing \(f(X)\) exactly, the response still varies because of \(\epsilon\); prediction intervals account for this as well, so they are wider than confidence intervals.
Suppose a predictor has two qualitative (factor) levels: male and female. Incorporating them into a regression model is simple: create a dummy variable that takes on two numerical values
\[x_i = \begin{cases}
1 \quad &\text{if the } i \text{th person is female} \\
0 \quad &\text{if the } i \text{th person is male}
\end{cases} \] \[y_i = \beta_0 + \beta_1 x_i + \epsilon =
\begin{cases}
\beta_0 + \beta_1 + \epsilon \quad &\text{if the } i \text{th person is female} \\
\beta_0 + \epsilon \quad &\text{if the } i \text{th person is male}
\end{cases} \] So \(\beta_0\) is the average predicted response for males and \(\beta_0 + \beta_1\) is the average predicted response for females. But remember the p-value! A high p-value means the dummy predictor is not statistically significant. The choice of coding is arbitrary; it only changes how the coefficients are interpreted, not the fit.
An example is ethnicity, where the levels are Asian, Caucasian, and African American. The model just adds a second dummy variable: let \(x_{i1} = 1\) if the \(i\)th person is Asian and \(x_{i2} = 1\) if the \(i\)th person is Caucasian. \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon =
\begin{cases}
\beta_0 + \beta_1 + \epsilon \quad &\text{if the } i \text{th person is Asian} \\
\beta_0 + \beta_2 + \epsilon \quad &\text{if the } i \text{th person is Caucasian} \\
\beta_0 + \epsilon \quad &\text{if the } i \text{th person is African American}
\end{cases} \] The new coefficients act as the average differences from the baseline level (in this case African American).
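In R these dummy variables are built automatically for factors; a small sketch with model.matrix() (the factor values here are made up) shows the coding, with the baseline level receiving no column.

eth <- factor(c("African American", "Asian", "Caucasian", "Asian"))
contrasts(eth)        # dummy coding; the baseline is the first level
model.matrix(~ eth)   # the design matrix R would build for lm()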
Two of the most important assumptions are that the relationship between predictors and response is additive and linear.
A simple way to extend linear regression toward nonlinear relationships is polynomial regression. We can add second, third, fourth, or higher-degree polynomial terms and it is still linear regression, because the model is linear in the coefficients! \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \epsilon \]
Linear regression assumes a straight line (or plane) relationship between the predictors and response, but non-linearity is the rule, not the exception. Residual plots are useful visualizations for identifying non-linearity. We plot the residuals \(e_i = y_i - \hat y_i\) versus the fitted response values \(\hat y_i\). Ideally the plot shows no discernible pattern; a clear pattern indicates non-linear effects.
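A minimal residual-plot sketch on a simple Boston fit (MASS::Boston), using broom::augment() from the top of these notes together with ggplot2.

resid_df <- augment(lm(medv ~ lstat, data = MASS::Boston))
ggplot(resid_df, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")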
Given the error terms \(\epsilon_1, \epsilon_2, \dots, \epsilon_n\), they are uncorrelated if the value of \(\epsilon_i\) gives no information about \(\epsilon_{i+1}\). If they are correlated, then the estimated standard errors will tend to underestimate the true standard errors. That means the computed confidence intervals are narrower than they should be and p-values are smaller than they should be, leading to erroneous conclusions.
Homoscedasticity: an important assumption of linear models is that the error terms have constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). The standard errors, confidence intervals, and hypothesis tests rely on this assumption. A residual plot with a funnel shape indicates heteroscedasticity.
Outlier: a point \(y_i\) that is far from its predicted value \(\hat y_i\). Outliers inflate the RSE and standard errors, throwing off the associated statistics. A useful idea is to standardize the residuals by dividing each by its estimated standard error; observations beyond roughly \(\pm 3\) would be considered outliers.
Outliers are unusual \(y_i\) values given a predictor \(x_i\). High-leverage observations instead have an unusual \(x_i\) value, and they can strongly pull the regression coefficients. The leverage statistic \(h_i\) quantifies how far \(x_i\) is from the other predictor values; its average value is \((p+1)/n\).
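Base R provides both diagnostics directly; a quick sketch on the toy fit mod:

rstudent(mod)          # studentized residuals; |values| > 3 suggest outliers
hatvalues(mod)         # leverage statistics h_i
mean(hatvalues(mod))   # equals (p + 1) / n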
Collinearity: two predictor variables are highly correlated with each other, so one can largely be predicted from the other. Collinearity inflates the variance of the coefficient estimates, which reduces the power of the t-statistic for each predictor. The variance inflation factor (VIF) measures collinearity.
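A sketch of the VIF computation, assuming the car package is installed; values well above roughly 5-10 are commonly read as problematic collinearity.

library(car)
vif(lm(medv ~ ., data = MASS::Boston))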
library(MASS); library(ISLR)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# medv: median value of Boston homes. lstat: percent of households with low socioeconomic status
lm.fit <- lm(medv ~ lstat, data = Boston) # fit a linear model to the data
summary(lm.fit) # linear model information
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
confint(lm.fit) # coefficient confidence intervals
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# now use the model to predict on unseen values of lstat (x)
# Confidence and prediction intervals are centered on the same point,
# but prediction intervals include irreducible error as well as model error.
predict(lm.fit, tibble(lstat = c(5,10,15)),
interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, tibble(lstat = c(5,10,15)),
interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
ggplot(Boston, aes(lstat, medv)) +
geom_point() +
geom_smooth(method = "lm")
plot(lm.fit)
# Multiple Linear Regression
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
# shorthand for all the predictors (.)
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
# exclude a predictor
lm.fit1 <- lm(medv ~ . -age, data = Boston)
# equivalently, update the existing fit
lm.fit1 <- update(lm.fit, ~ . -age)
summary(lm.fit1)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis +
## rad + tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6054 -2.7313 -0.5188 1.7601 26.2243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
## crim -0.108006 0.032832 -3.290 0.001075 **
## zn 0.046334 0.013613 3.404 0.000719 ***
## indus 0.020562 0.061433 0.335 0.737989
## chas 2.689026 0.859598 3.128 0.001863 **
## nox -17.713540 3.679308 -4.814 1.97e-06 ***
## rm 3.814394 0.408480 9.338 < 2e-16 ***
## dis -1.478612 0.190611 -7.757 5.03e-14 ***
## rad 0.305786 0.066089 4.627 4.75e-06 ***
## tax -0.012329 0.003755 -3.283 0.001099 **
## ptratio -0.952211 0.130294 -7.308 1.10e-12 ***
## black 0.009321 0.002678 3.481 0.000544 ***
## lstat -0.523852 0.047625 -10.999 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
## F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
# Interaction terms
# lstat:age denotes only the interaction term
# lstat*age is shorthand for lstat + age + lstat:age
summary(lm(medv ~ lstat*age, data = Boston))
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
# Non-linear predictor transformations
lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
lm.fit <- lm(medv ~ lstat, data = Boston)
# anova performs a hypothesis test comparing the two models.
# H_0: both models fit the data equally well
# H_a: the full model is superior
anova(lm.fit, lm.fit2)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(lm.fit2)
lm.fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
summary(lm(medv ~ log(rm), data = Boston))
##
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.487 -2.875 -0.104 2.837 39.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76.488 5.028 -15.21 <2e-16 ***
## log(rm) 54.055 2.739 19.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
## F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
# Qualitative predictors
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
lm.fit <- lm(Sales ~ . +Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc) # shows the ordering of dummy variables
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Explain the conclusions stated by Table 3.4, pg. 74 (Coefficient, Std. error, t-statistic, p-value). The standard error quantifies how close the coefficient estimate is likely to be to the true population parameter: it is the average amount \(\hat\beta\) differs from \(\beta\). The further the \(x_i\) are spread out, the smaller the standard error, because spread-out predictors provide more leverage. Larger standard errors produce wider confidence intervals. If the estimate is statistically different (using the t-statistic to compute a p-value) from the null hypothesis \(H_0: \beta = 0\), then we can reject \(H_0\). Table 3.4 states that advertising on TV and radio has a significant effect on Sales, while newspaper advertising does not (p-value = 0.8599).
What’s the difference between the KNN classifier and KNN regression methods? KNN regression predicts a quantitative response at \(x_0\) by averaging the \(y_i\) values of the \(k\) closest training points; the KNN classifier predicts a qualitative response at \(x_0\) by taking a majority vote among the classes of the \(k\) closest training points.
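A minimal one-dimensional sketch of the distinction; knn_predict is a made-up helper for illustration, not a library function.

knn_predict <- function(x0, x, y, k, type = c("regression", "classification")) {
  type <- match.arg(type)
  nn <- order(abs(x - x0))[1:k]           # indices of the k nearest training points
  if (type == "regression") mean(y[nn])   # average the k nearest responses
  else names(which.max(table(y[nn])))     # majority vote among the k nearest classes
}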
Not sure what this \(i'\) number is… Basically this is an algebra problem.
Use Eq. 3.4 to show the least squares line always passes through \((\bar x, \bar y)\). Since \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\), evaluating the fitted line at \(x = \bar x\) gives \[ \hat y = \hat\beta_0 + \hat\beta_1 \bar x = (\bar y - \hat\beta_1 \bar x) + \hat\beta_1 \bar x = \bar y \]
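A quick numerical check with the toy fit mod from the top of these notes: predicting at \(\bar x\) returns \(\bar y\).

predict(mod, data.frame(x = mean(x)))  # prediction at the mean of x ...
mean(y)                                # ... equals the mean of y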
Prove Eq. 3.18, \(R^2 = \text{Cor}(X,Y)^2\), given \(\bar x = \bar y = 0\).
lm.fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
print("predict mpg for horsepower = 98")
## [1] "predict mpg for horsepower = 98"
print("confidence interval")
## [1] "confidence interval"
predict(lm.fit, tibble(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
print("prediction interval")
## [1] "prediction interval"
predict(lm.fit, tibble(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Strong negative relationship between horsepower and mpg.
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
##
## Call:
## lm(formula = mpg ~ ., data = Auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
library(ISLR)
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit) # Urban not significant
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
lm.fit1 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(lm.fit, lm.fit1) # basically the models are the same
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
plot(lm.fit1)
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
ggplot() + geom_point(aes(x, y))
ggplot() + geom_point(aes(y, x))
lm.fit <- lm(y ~ x + 0)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
lm.fit1 <- lm(x ~ y + 0)
summary(lm.fit1)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
x <- rnorm(100)
y <- rnorm(100)
ggplot() + geom_point(aes(x, y))
ggplot() + geom_point(aes(y, x))
lm.fit <- lm(y ~ x + 0)
summary(lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.69463 -0.51184 0.03249 0.72865 1.89998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.10757 0.09585 1.122 0.264
##
## Residual standard error: 0.9868 on 99 degrees of freedom
## Multiple R-squared: 0.01256, Adjusted R-squared: 0.00259
## F-statistic: 1.26 on 1 and 99 DF, p-value: 0.2644
lm.fit1 <- lm(x ~ y + 0)
summary(lm.fit1)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.70567 -0.54321 0.04474 0.65648 2.65721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.1168 0.1041 1.122 0.264
##
## Residual standard error: 1.028 on 99 degrees of freedom
## Multiple R-squared: 0.01256, Adjusted R-squared: 0.00259
## F-statistic: 1.26 on 1 and 99 DF, p-value: 0.2644
set.seed(1)
x <- rnorm(100)
eps <- rnorm(100, sd = 0.25)
y <- -1 + 0.5 * x + eps
ggplot() +
geom_point(aes(x, y)) +
geom_smooth(method = "lm", se = F, aes(x = x, y = y, color = "lsq")) +
geom_abline(aes(slope = 0.5, intercept = -1, color = "pop")) +
labs(color = "Regression Lines")
summary(lm(y ~ x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46921 -0.15344 -0.03487 0.13485 0.58654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.00942 0.02425 -41.63 <2e-16 ***
## x 0.49973 0.02693 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2407 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
summary(lm(y ~ x + I(x^2)))
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4913 -0.1563 -0.0322 0.1451 0.5675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.98582 0.02941 -33.516 <2e-16 ***
## x 0.50429 0.02700 18.680 <2e-16 ***
## I(x^2) -0.02973 0.02119 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2395 on 97 degrees of freedom
## Multiple R-squared: 0.7828, Adjusted R-squared: 0.7784
## F-statistic: 174.8 on 2 and 97 DF, p-value: < 2.2e-16
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
cor(x1, x2)
## [1] 0.8351212
ggplot() + geom_point(aes(x1, x2))
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
summary(lm(y ~ x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
summary(lm(y ~ x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
summary(lm(y ~ x1 + x2))
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm(y ~ x1))
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm(y ~ x2))
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
library(MASS);library(GGally)
ggpairs(Boston)
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## black 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16