# Set up the matrix X of known values.
X=matrix(nrow=2,ncol=2,c(1,1,1,4))
Y=matrix(nrow=2,ncol=1,c(3,6))
# solve(X) returns the inverse of X; %*% is matrix multiplication.
# The first element of the result is b_0, the second is b_1.
solve(X)%*%Y
[,1]
[1,] 2
[2,] 1
The inverse of a square matrix is another square matrix such that, when multiplied with the original matrix, it produces an identity matrix.
An identity matrix is a square matrix whose off-diagonal elements are all 0 and whose diagonal elements are all 1.
These properties are used to find solutions to equations with unknowns.
\((\mathbf{X'X})^{-1}\) is the inverse of the matrix created by multiplying the transpose of matrix \(\mathbf{X}\) with \(\mathbf{X}\) leading to a square matrix of (K+1) dimensions.
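A quick numeric check in R, reusing the X defined at the top of this section: multiplying X by its inverse should return the identity matrix.
# Reusing the 2 x 2 matrix X from above.
X=matrix(nrow=2,ncol=2,c(1,1,1,4))
solve(X)%*%X        # X^{-1} X: the 2 x 2 identity matrix
round(solve(X)%*%X) # rounded to remove any floating-point noise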
There is a unique line that fits through any two distinct points in Euclidean space.
This line, fitted through the two points, can be conceptualized as a line of extrapolation.
An extrapolation line will give you a predicted Y value for any hypothetical X value.
You can check whether the equation we have identified as \(Y=X+2\) is correct by substituting values to (1) and (2).
This extrapolation line works perfectly only if the relationship between X and Y is deterministic.
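As a small illustration, the coefficients solved for above (b_0 = 2, b_1 = 1) can be plugged in for hypothetical X values; the X values below are made up for illustration.
b=solve(X)%*%Y      # b[1] is b_0 = 2, b[2] is b_1 = 1
x_new=c(2,3,10)     # hypothetical X values (assumed for illustration)
b[1]+b[2]*x_new     # predicted Y values along the line Y = X + 2: 4, 5, 12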
\[ \mathbf{Y}=\mathbf{X}\mathbf{B} \] - Where \(\mathbf{Y}\) is an \(n\) by 1 column vector, \(\mathbf{X}\) is an \(n\) by \(n\) square matrix, and \(\mathbf{B}\) is an \(n\) by 1 column vector.
Why do we have 1s in the first column of \(\mathbf{X}\)?
What would be the consequences if there were more rows in \(\mathbf{X}\) but no additional unknowns in the \(\mathbf{\beta}\) matrix? Is this realistic?
What we do when we have \(n\) equations with \(n\) unknowns will be different from what we do when we have \(n\) equations (observations) and far fewer unknowns.
As we will discuss in the next set of slides, regression requires us to think about a situation where we would like to know an outcome Y when we know some set of variables \((X_{1},\ldots,X_{K})\). Not only will we not know everything that affects Y, we will also not know the structural form of how X affects Y. We will have to make several assumptions involving X and how it relates to Y. Our main advantage will be that \(n\), the number of observations, should be a lot larger than \(K\), the number of known unknowns.
Think about a situation where you know some features (X: independent variables, covariates, predictors, …) that are somewhat and somehow associated with a variable to be predicted (Y: target variable, dependent variable, outcome variable, response variable, …)1
Most of us are neither omnipotent nor omniscient. This means that for the majority of phenomena, not only do we not know how values of X affect values of Y, we will also never observe all possible Xs to “know” what Y is going to be.
\[Y=f(X)\]
Regression is about declaring the structure of \(f\) and finding the parameters of \(f\).
To do this we need two things: a structural form for \(f\) and a way to estimate its parameters.
\[Y=\beta_{0}+\beta_{1} \times X_{1}+\epsilon \]
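A minimal simulation of this model in R may help; the parameter values, sample size, and error standard deviation below are assumptions chosen only for illustration.
set.seed(1)                      # for reproducibility
n=100
beta_0=2; beta_1=1               # assumed 'true' parameter values
x_1=runif(n,0,10)                # a covariate
epsilon=rnorm(n,mean=0,sd=1)     # the part of Y not accounted for by X
y=beta_0+beta_1*x_1+epsilon      # Y follows the additive S.L.R. structure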
What is the difference between the equation on the previous slide and the one with two sets of coordinates?
Think about the equation \(Y=\beta_{0}+\beta_{1}X_{1}+\epsilon\), what does \(\epsilon\) represent?
Assume for a moment that the additive model of S.L.R. is the correct one. If all the uncertainty that cannot be accounted for by X is represented by \(\epsilon\), is there a scenario in which you can reduce \(\epsilon\) to 0? What would this mean?
\[\mathbf{Y}=\beta_{0}+\beta_{1}\mathbf{X_{1}}+\mathbf{\epsilon}\] \[ \begin{bmatrix} y_{1} \\ y_{2} \\ y_{3} \\ \vdots\\ y_{n}\\ \end{bmatrix} = \beta_{0} + \beta_{1} \begin{bmatrix}x_{1}\\ x_{2} \\x_{3}\\ \vdots \\ x_{n} \\ \end{bmatrix} + \begin{bmatrix}\epsilon_{1}\\ \epsilon_{2} \\ \epsilon_{3}\\ \vdots \\ \epsilon_{n} \\ \end{bmatrix} \] - X and Y are data. \(\mathbf{\epsilon}\) are the error terms (more on them later). How do we estimate \(\beta_{0}\) and \(\beta_{1}\)?
Let us think about the discrepancy function between Y and f(X), which is going to lead us to pick the \(\beta_{0}\) and \(\beta_{1}\) estimates.
What are some choices?
\(\hat{Y_{i}}\) is the predicted value of Y for observation i.
The matrix \[\mathbf{H}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\] is called the hat matrix and can be used to identify influential observations.
The matrix multiplication \[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}=\hat{\mathbf{\beta}}\]
And \(X\hat{\mathbf{\beta}}= \hat{\mathbf{Y}}\)
We should ask ourselves how \(\hat{\mathbf{\beta}}\) are derived.
\(\hat{\mathbf{\beta}}\) is defined as B.L.U.E.: the Best Linear Unbiased Estimator.
\[Min \quad (\mathbf{Y}-\mathbf{X}\mathbf{\beta})'(\mathbf{Y}-\mathbf{X}\mathbf{\beta})\] \[Min \quad \mathbf{Y}'\mathbf{Y}-2\mathbf{\beta}'\mathbf{X}'\mathbf{Y}+\mathbf{\beta}'\mathbf{X}'\mathbf{X}\mathbf{\beta} \] - Take the derivative of the function with respect to \(\mathbf{\beta}\) and equate it to 0 to find the \(\mathbf{\hat{\beta}}\) that minimizes the function. Note that \(\mathbf{Y}'\mathbf{Y}\) does not contain any \(\beta\) terms, so its derivative is 0.
\[ -2\mathbf{X}'\mathbf{y}+2\mathbf{X}'\mathbf{X}\mathbf{\hat{\beta}}=0\]
\[ \mathbf{X}'\mathbf{X}\mathbf{\hat{\beta}}=\mathbf{X}'\mathbf{y}\]
\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\mathbf{\hat{\beta}}=\mathbf{I}\mathbf{\hat{\beta}}=\mathbf{\hat{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \]
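A minimal sketch with simulated data (all values assumed for illustration) showing that this formula reproduces what lm() estimates, and that \(\mathbf{X}\hat{\mathbf{\beta}}\) gives the fitted values:
set.seed(2)
n=50
x=runif(n,0,10)
y=3+2*x+rnorm(n)                       # assumed 'true' model for the simulation
X=cbind(1,x)                           # design matrix: a column of 1s plus x
beta_hat=solve(t(X)%*%X)%*%t(X)%*%y    # (X'X)^{-1} X'Y
beta_hat
coef(lm(y~x))                          # lm() gives the same estimates
H=X%*%solve(t(X)%*%X)%*%t(X)           # the hat matrix
head(H%*%y)                            # same as head(X%*%beta_hat), the fitted values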
Note how we have multiplied and summed the \(\mathbf{X}\) variables. Each value contributed to the sum independently of the values of the other multiplications.
This in turn means that the \(\beta\) values are constants; they do not change according to the differing levels of \(\mathbf{X}\).
Further assumptions will need to be made for additional predictive and inferential results.
Before going further we need to test a hypothesis about \(\mathbf{\beta}\).
To understand what that hypothesis is, we need to discuss a bit of visual intuition.
How would we predict the value of Y if we do not know any X and the only thing we know is Y itself?
To answer this question you need to know the discrepancy function between Y and what you will predict it to be.
If it is the squared loss function that we have discussed earlier \(\sum_{i=1}^{N} (Y_{i}-\hat{Y})^2\):
The 5 vertical lines drawn from each point's y-coordinate to the mean of Y demonstrate the error that the Y-only model would create for these 5 points. To distinguish this prediction from predictions with a regression model that uses X, we write it as \(\bar{Y}\).
Each \((Y_{i}-\bar{Y})\) term for each i would be the error term if we used \(\bar{Y}\) as our prediction for \(Y_{i}\).
If we obtain each and every \((Y_{i}-\bar{Y})\) term and add them up we get 0.
\[\sum_{i=1}^{i=N} (Y_{i}-\bar{Y})=0 \]
Show \[\sum_{i=1}^{i=N} (Y_{i}-\bar{Y})=0 \] using math.
Hint: \[\sum_{i=1}^{i=N} (Y_{i}-\bar{Y})=Y_{1}+\cdots+Y_{N}-N\bar{Y} \]
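A quick numeric check in R (the Y values are simulated, purely for illustration):
set.seed(3)
Y=rnorm(25,mean=50,sd=10)   # any set of Y values would do
sum(Y-mean(Y))              # essentially 0, up to floating-point error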
\[S.S.T.= \sum_{i=1}^{i=N} (Y_{i}-\bar{Y})^{2}\]
If we divided this with \(N-1\), we would obtain sample variance of Y.
SST is the total quantity of error we would commit if we used \(\bar{Y}\) as the prediction for each \(Y_{i}\).
It is also a measure of how much variation there is in Y.
We use information from a set of covariate(s) \(\mathbf{X}\) to come up with \(b_{0},b_{1},\cdots,b_{K}\) in order to predict each \(Y_{i}\) in a way that minimizes the sum of squared errors.
We would like to quantify how much better the regression model with covariates does compared to \(\bar{Y}\) as the predicted value. Note that for the purpose of predicting the observed data, the predictions using \(b_{0},b_{1},\cdots,b_{K}\), overall, never do worse than \(\bar{Y}\).
This aggregate construct leads to the quantity referred to as the Sum of Squares Regression.
\[S.S.R.=\sum_{i=1}^{i=N}(\hat{Y}_{i}-\bar{Y})^{2}\]
\[S.S.E.=\sum_{i=1}^{i=N}(Y_{i}-\hat{Y_{i}})^{2}=\sum_{i=1}^{i=N}\epsilon_{i}^{2} \]
\[\sum_{i=1}^{i=N}(Y_{i}-\bar{Y})^{2}=\sum_{i=1}^{i=N}(\hat{Y}_{i}-\bar{Y})^{2}+\sum_{i=1}^{i=N}(Y_{i}-\hat{Y}_{i})^{2} \]
Sum of Squares Total = Total amount of variation in Y
Sum of Squares Regression = Total amount of variation that the regression model accounts for.
Sum of Squares Error = Total amount of variation in Y that cannot be explained with the regression model.
\[r^{2}=\frac{SSR}{SST} \]
\(r^{2}\) is the percentage of variation in Y that the regression model can explain.
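A sketch with simulated data (values assumed for illustration) computing SST, SSR, SSE and \(r^{2}\) by hand and comparing with what lm() reports:
set.seed(4)
x=runif(100,0,10)
y=5+3*x+rnorm(100,sd=4)          # assumed 'true' model for the simulation
fit=lm(y~x)
y_hat=fitted(fit)
SST=sum((y-mean(y))^2)
SSR=sum((y_hat-mean(y))^2)
SSE=sum((y-y_hat)^2)
SST-(SSR+SSE)                    # essentially 0: the decomposition holds
SSR/SST                          # r^2 computed by hand
summary(fit)$r.squared           # matches lm()'s reported r^2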
So far we have made no use of distributional assumptions.
Now assume that \(\epsilon\) is normally distributed.
We can use mathematical statistics to prove that if \(\epsilon\) has a normal distribution, SSR and SSE have \(\chi^{2}\) distributions.
Furthermore, if SSR is divided by what is referred to as its degrees of freedom (K), you obtain the Mean Square Regression \(\left(Mean \quad Square \quad Regression=\frac{SSR}{K}\right)\), where K is the number of covariates.
Analogously, SSE divided by its degrees of freedom (n-K-1) leads to the Mean Square Error \(\left(Mean \quad Square \quad Error=\frac{SSE}{n-K-1}\right)\).
All of this leads in turn to \(F Ratio=\frac{MSR}{MSE}\), which has an F distribution with K and n-K-1 degrees of freedom.
All of these relationships follow from \(\epsilon\) being normally distributed.
The calculated F-Ratio is a test statistic equivalent in purpose to \(t_{computed}\), which allows the following hypothesis test, referred to as the Global F Test.
The null hypothesis states that none of the covariates have any effect. The alternative hypothesis states that at least one covariate has an effect on Y (at least one coefficient is different from 0).
It does NOT say all covariates have an effect on Y. That would be a very different test.
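A sketch (same kind of simulated data, values assumed for illustration) computing the global F test by hand:
set.seed(4)
n=100; K=1                             # one covariate
x=runif(n,0,10)
y=5+3*x+rnorm(n,sd=4)
fit=lm(y~x)
SSR=sum((fitted(fit)-mean(y))^2)
SSE=sum(resid(fit)^2)
MSR=SSR/K                              # mean square regression
MSE=SSE/(n-K-1)                        # mean square error
F_ratio=MSR/MSE
F_ratio                                # matches summary(fit)$fstatistic[1]
pf(F_ratio,K,n-K-1,lower.tail=FALSE)   # p value of the global F test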
First steps involve adding the Data Analysis ToolPak: click on File, More, Options.
Click on Add-ins, then click on Manage Excel Add-ins.
Make sure Analysis ToolPak is checked.
Data Analysis should appear on the Data tab
Select Regression, Click OK
The Input Y and Input X ranges designate the dependent and independent variables, respectively. The data has the labels Price and Size, so they are selected. It is not necessary, but check the Residuals box. Click OK.
The first thing to look at in the regression output is the p value associated with the global hypothesis test. If you can reject the null hypothesis, it makes sense to examine the rest of the output.
Next you can look at R Square, the coefficient estimates, etc., in no particular order.
\(r^{2}\) is calculated as \(\frac{SSR}{SST}\). Recall SSR is the quantity of variation in Y (Price in this example) that you can explain with your model and SST is the total quantity of variation in Y. Therefore \(r^{2}\) is a number between 0 and 1. It quantifies the percentage of variation that your regression model explains in your dependent variable.
\(r^{2}\) is a useful quantity, but its interpretation can be limited when extended to multiple linear regression. It has the property that as you add more variables, \(r^{2}\) never goes down, even if the variable you add has nothing to do with the dependent variable. If you solely focus on \(r^{2}\) as a measure of the strength of your model, you will be misled.
If you have \(n\) observations and \(n-1\) independent variables, \(r^{2}\) will be 1. If this is the case you clearly have an overfitted model where information from the model cannot be generalized to new data observed from the same process.
\(r^{2}\) is approximately 51 percent. You are able to explain 51% of the changes in the dependent/target variable Price of a house.
The estimate of \(\beta_{0}\), the intercept \(b_{0}\), is approximately 22,939. This is what you expect the price to be if the Size of the house is 0.
The coefficient estimate of Size is approximately 37.71. Each unit increase in Size increases the price of the house by approximately this amount.
The general interpretation of the intercept is that it is the expected value of the dependent variable when the independent variable is 0.
In this example the intercept is 22,939 dollars. How can a house that is 0 square feet have a positive value? Why is it not 0? Would you pay for a house that is 0 square feet?
We can all speculate about answers; however, the answer simply lies in \[\mathbf{\hat{\beta}}=(b_{0},b_{1})'=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
The first of the houses has a square footage of 1,076 square feet. The actual price of this particular house is 62,400 dollars.
Let us first find the predicted price for this house.
\[22939+37.7*1076=63504.2\] - The residual is simply \(Y-E(Y|X)\), and therefore the residual for this house is \(62400-63504.2=-1104.2\).
There is more than one house that is the same size. The predicted values are the same. What changes, of course, is the calculated residual.
If you had 100 houses for sale, each of the same size you would get 100 residuals.
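A small sketch using the coefficient estimates quoted above; the three prices below are made up for illustration.
size=rep(1076,3)               # three houses of identical size
price=c(62400,65000,60000)     # assumed observed prices
predicted=22939+37.71*size     # the prediction is identical for all three
price-predicted                # the residuals differ from house to house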
What properties should these residuals have?
# The file parameter specifies the file location, including the name of the file and its extension. NOTE THAT SLASHES ARE FORWARD / NOT BACKWARD \
# header = TRUE specifies that the file has labels on row 1
# sep="\t" means that each column is separated from the next by a tab; if the file were comma separated, "\t" would be replaced with ","
real_estate=read.table(file='C:/Users/rm84/Downloads/realestate.txt',header=TRUE,sep="\t")
# This is not always best practice, but the attach function allows us to refer to the column labels directly in the script.
attach(real_estate)
# The attach function allows us to write Price or Size instead of referring to the data object real_estate first or using column indices. o1 is the regression object we will call functions on.
o1=lm(Price~Size)
Call:
lm(formula = Price ~ Size)
Residuals:
Min 1Q Median 3Q Max
-26100 -10552 -1141 11000 28267
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22938.737 3110.831 7.374 6.65e-13 ***
Size 37.708 1.637 23.037 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12620 on 516 degrees of freedom
Multiple R-squared: 0.507, Adjusted R-squared: 0.5061
F-statistic: 530.7 on 1 and 516 DF, p-value: < 2.2e-16
\(r^{2}\) is 51%, \(b_{0}\) is 22,939 and \(b_{Size}\) is 37.7.
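These quantities can be pulled directly from the fitted object o1 (a minimal sketch):
coef(o1)               # b_0 and b_Size
summary(o1)$r.squared  # the r^2 reported above
head(fitted(o1))       # predicted Price for the first few houses
head(resid(o1))        # their residuals: Price minus predicted Price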
There is a linear relationship between X and Y
Residuals are Normally Distributed.
Residuals have an expectation of 0
Residuals have a constant variance
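A minimal sketch of graphical and numeric checks of these assumptions, using the residuals of o1:
plot(fitted(o1),resid(o1))            # no pattern and roughly constant spread expected
abline(h=0)                           # residuals should scatter around 0
hist(resid(o1))                       # rough check of normality
qqnorm(resid(o1)); qqline(resid(o1))  # points near the line suggest normality
mean(resid(o1))                       # essentially 0 by construction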