How to read the summary of a Linear Regression Model in R

A simple example

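library(ISLR)    # provides the Carseats data set
data(Carseats)
# Regress Sales on Price and the two Yes/No factors Urban and US
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)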

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Model

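The output above shows a simple example of a linear regression model. The model is: \[Sales_i =\beta_0+\beta_1 Price_i +\beta_2 Urban_i + \beta_3US_i+\epsilon_i\] where $Urban_i$ and $US_i$ are indicator variables, equal to 1 when the corresponding factor is "Yes" and 0 otherwise. That is why the table labels them UrbanYes and USYes.

Here $\epsilon_i$ is the error term, which is assumed to be normally distributed with mean zero and constant variance.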
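
To see how the two Yes/No factors enter the model as numbers, it can help to look at the design matrix R builds from the formula. The following is a minimal sketch, assuming the ISLR package is installed as in the example; the name `X` is only illustrative.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)

# Design matrix for the formula: each Yes/No factor becomes a 0/1 column
X <- model.matrix(Sales ~ Price + Urban + US, data = Carseats)

colnames(X)   # column names match the rows of the coefficient table
head(X, 3)    # first few rows: intercept, price, and the two indicators
```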

Coefficients

Even if the above model is true, we never know the true values of the $\beta$s. What we can do is estimate them from the data. The Estimate column of the table shows these estimates.

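The $\beta$s are estimated by ordinary least squares, which has a closed-form formula. Because the estimates are computed from a random sample, they are themselves random variables. If $\epsilon$ in the model above is assumed to be normal, it can be shown that each estimate is normally distributed around the true $\beta$, and that the estimate minus the true value, divided by its estimated standard error, follows a t-distribution with the residual degrees of freedom (396 here). With this many degrees of freedom the t-distribution is very close to a standard normal. Std. Error is just what it says: the estimated standard deviation of the sampling distribution of each estimate.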
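
As a rough illustration of those formulas, the estimates and standard errors can be recomputed by hand from the least squares solution $\hat\beta=(X^TX)^{-1}X^Ty$ and the variance estimate $\hat\sigma^2 (X^TX)^{-1}$. This is only a sketch: `lm` itself uses a QR decomposition rather than this direct inversion, and the names `X`, `y`, `beta_hat`, `sigma2` and `se` are illustrative.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

X <- model.matrix(model)   # n x p design matrix
y <- Carseats$Sales        # response vector

XtX_inv  <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, y)    # least squares estimates
res      <- y - X %*% beta_hat             # residuals
sigma2   <- sum(res^2) / (nrow(X) - ncol(X))   # estimated error variance
se       <- sqrt(diag(sigma2 * XtX_inv))   # standard errors of the estimates

# Side by side with the Estimate and Std. Error columns of summary(model)
cbind(Estimate = drop(beta_hat), `Std. Error` = se)
```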

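The t values play the role that z-scores play for a normal distribution: each one is the Estimate divided by its Std. Error. They are used to test the null hypotheses $\beta_0=0$, $\beta_1=0$, $\beta_2=0$, $\beta_3=0$, one per row of the table. Each test asks whether the coefficient is significantly different from zero. If a slope coefficient is zero, then that predictor has no linear relationship with the response once the other predictors are accounted for.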
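
The relation between the three columns can be checked directly. A minimal sketch, assuming the ISLR package is installed; `coefs` and `t_manual` are illustrative names.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

coefs <- summary(model)$coefficients   # the coefficient table as a matrix

# t value = Estimate / Std. Error, row by row
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

cbind(manual = t_manual, reported = coefs[, "t value"])
```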

The p-values Pr(>|t|) are like the magic numbers. Theoretically, each one is the probability of getting a t value at least as extreme (in absolute value) as the one observed when the null is true. A low value means such an estimate would be unlikely if the null were true, so a low p-value indicates significance. If the p-value is lower than 0.05, the coefficient is significant at the 5% level; if it is lower than 0.01, it is significant at the 1% level. Here Price and USYes are highly significant, while UrbanYes (p = 0.936) is not. Which significance level to use is a judgment call.
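
The two-sided p-values can be reproduced from the t values and the residual degrees of freedom with `pt`. This is a sketch under the same assumptions as the example (ISLR installed); `p_manual` is an illustrative name.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

coefs <- summary(model)$coefficients
df    <- df.residual(model)   # residual degrees of freedom

# Two-sided p-value: probability of a t value at least this extreme under the null
p_manual <- 2 * pt(abs(coefs[, "t value"]), df = df, lower.tail = FALSE)

cbind(manual = p_manual, reported = coefs[, "Pr(>|t|)"])
```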

Residual standard error

This is the estimated standard deviation of $\epsilon$ in the model. Its degrees of freedom are the number of observations minus the number of coefficients estimated. In this case we have 400 observations and 4 coefficients to estimate $(\beta_0\dots\beta_3)$, which gives 400 - 4 = 396 degrees of freedom.
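
A short sketch of the calculation, assuming the ISLR package is installed: the residual standard error is the square root of the residual sum of squares divided by the degrees of freedom. The names `n`, `p`, `rss` and `rse` are illustrative.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

n   <- nobs(model)             # number of observations
p   <- length(coef(model))     # number of estimated coefficients
rss <- sum(residuals(model)^2) # residual sum of squares

rse <- sqrt(rss / (n - p))     # residual standard error

c(df = n - p, manual = rse, reported = summary(model)$sigma)
```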

Multiple R-squared

This tells us how much of the variability of the response is explained by our model. Theoretically, it is the ratio of the regression sum of squares to the total sum of squares. Here 23.93% of the variability in sales is explained by the model.
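
The ratio can be computed by hand as a check. A minimal sketch, assuming the ISLR package is installed; `tss`, `ssr`, `rss` and `r2` are illustrative names.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

y   <- Carseats$Sales
tss <- sum((y - mean(y))^2)               # total sum of squares
ssr <- sum((fitted(model) - mean(y))^2)   # regression sum of squares
rss <- sum(residuals(model)^2)            # residual sum of squares

r2 <- ssr / tss                           # equivalently 1 - rss / tss

c(manual = r2, reported = summary(model)$r.squared)
```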

Adding more predictors never decreases the Multiple R-squared value, even when the new predictors are irrelevant. So as the number of predictors grows, the value should be adjusted. The adjustment penalizes the loss of degrees of freedom. Here the adjusted value is 23.35%.
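
One way to see both points is to compute the adjustment by hand, using $1-(1-R^2)\frac{n-1}{n-p}$ with $p$ the number of estimated coefficients, and then add a predictor of pure noise. This is only a sketch: the `noise` column and the copy `d` are hypothetical additions for illustration, not part of the Carseats data.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

n  <- nobs(model)
p  <- length(coef(model))
r2 <- summary(model)$r.squared

# Adjusted R-squared penalizes the degrees of freedom used by the model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)

# Hypothetical irrelevant predictor: random noise added to a copy of the data
set.seed(1)
d <- Carseats
d$noise <- rnorm(nrow(d))
bigger <- lm(Sales ~ Price + Urban + US + noise, data = d)

# Multiple R-squared cannot go down when a predictor is added
c(original = r2, with_noise = summary(bigger)$r.squared)
c(adjusted_manual = adj_r2, adjusted_reported = summary(model)$adj.r.squared)
```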

Note: If there is only one continuous predictor and a continuous response, Multiple R-squared is simply the square of the correlation between the response and the predictor.
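
This can be checked with a one-predictor model. The sketch below assumes the ISLR package is installed and uses a separate regression of Sales on Price alone; `simple` is an illustrative name.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)

# Regression with a single continuous predictor
simple <- lm(Sales ~ Price, data = Carseats)

r2_simple <- summary(simple)$r.squared
cor_sq    <- cor(Carseats$Sales, Carseats$Price)^2   # squared correlation

c(r_squared = r2_simple, squared_correlation = cor_sq)
```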

F-statistic

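This tests the overall significance of the model. The F-test compares the fitted model with an intercept-only model; its null hypothesis is that all slope coefficients are zero $(\beta_1=\beta_2=\beta_3=0)$. A low p-value indicates that at least one predictor is related to the response. A p-value higher than the chosen threshold (0.05 or 0.1) means there is no evidence that the model explains the response better than the intercept-only model. Here the p-value is below 2.2e-16, so the null is rejected.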
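
The statistic can be recovered from R-squared as $F=\frac{R^2/k}{(1-R^2)/(n-k-1)}$, where $k$ is the number of predictors, or by comparing the model with an intercept-only fit. A minimal sketch, assuming the ISLR package is installed; `k`, `f_manual`, `p_manual` and `null_model` are illustrative names.

```r
library(ISLR)    # assumed installed, as in the example
data(Carseats)
model <- lm(Sales ~ Price + Urban + US, data = Carseats)

n  <- nobs(model)
k  <- length(coef(model)) - 1   # number of slope coefficients
r2 <- summary(model)$r.squared

# F statistic from R-squared, with k and n - k - 1 degrees of freedom
f_manual <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_manual <- pf(f_manual, k, n - k - 1, lower.tail = FALSE)

# Same test as a comparison against the intercept-only model
null_model <- lm(Sales ~ 1, data = Carseats)
cmp <- anova(null_model, model)

c(manual = f_manual, reported = unname(summary(model)$fstatistic["value"]))
cmp
```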