Fall 2017

Estimation: One Variable

Estimate the height of one of these people: Heights (inches) ~ \(N(67,3)\)
Let's say, estimate = \(c\)
estimation error = actual height - \(c\)
The "best" \(c\) is the one that makes the smallest root mean squared (r.m.s) error

The r.m.s. of the errors will be smallest if \(c = \mu\)
least squares estimate = \(\mu\) = 67 and its r.m.s. error = \(\sigma\) = 3
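
As a quick numerical check (a minimal sketch using simulated heights drawn from \(N(67, 3)\), not the class data), the r.m.s. error of a constant estimate is smallest near the mean:

```r
# Sketch: r.m.s. error of a constant estimate c, evaluated over a grid of
# candidate values, using simulated heights (not the class data).
set.seed(2017)
heights <- rnorm(1000, mean = 67, sd = 3)
rms_error <- function(c) sqrt(mean((heights - c)^2))
candidates <- seq(60, 74, by = 0.1)
best <- candidates[which.min(sapply(candidates, rms_error))]
print(best)              # close to mean(heights), i.e. about 67
print(rms_error(best))   # close to sd(heights), i.e. about 3
```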

Bivariate Normal Distributions

Estimation: Two Variables

Given the value of one variable, estimate the value of the other.
Assume both variables are approximately normally distributed.

Estimation: Two Variables

Regression Line

Bivariate Normal - in Standard Units

Derivation - Bivariate Regression Model

\[ \begin{align} \text{From the bivariate scatter plot in standard units:} \\ z_y & = r \cdot z_x \\ \frac{y-\mu_y}{\sigma_y} & = r \cdot \frac{x-\mu_x}{\sigma_x} \\ y-\mu_y & = r \cdot \frac{\sigma_y}{\sigma_x} (x-\mu_x) \\ y & = r \cdot \frac{\sigma_y}{\sigma_x} (x-\mu_x) + \mu_y \\ y & = \left(\mu_y - r \cdot \frac{\sigma_y}{\sigma_x}\mu_x\right) + \left(r \cdot \frac{\sigma_y}{\sigma_x}\right) x \\ y & = b_0 + b_1 x \\ \text{where } & \begin{cases} \text{slope } (b_1) & = r \cdot \frac{\sigma_y}{\sigma_x} \\ \text{intercept } (b_0) & = \mu_y - b_1 \mu_x \end{cases} \\ \text{When } x & = \mu_x, \; y = \mu_y \\ \end{align} \]

\[ \bbox[yellow,5px] { \color {black} {\implies \text {The regression line passes through the point of averages } (\mu_x, \mu_y).} } \]
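
A small sanity check of this fact, using simulated (not class) data: plugging \(x = \mu_x\) into the fitted line returns \(\mu_y\).

```r
# Sketch: the fitted value at x = mean(x) equals mean(y) (simulated data).
set.seed(1)
x <- rnorm(100, mean = 67, sd = 3)
y <- 35 + 2 * x + rnorm(100, sd = 20)
fit <- lm(y ~ x)
print(unname(predict(fit, newdata = data.frame(x = mean(x)))))
print(mean(y))   # the two printed values agree
```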

Finding the Equation

[1] "Height (inches) (x): mean = 67 sd = 3"
[1] "Weight (lb) (y): mean = 174 sd = 21"
[1] "r = 0.304"

Find the equation of the regression line for estimating weight based on height.

\[ \begin{align} \text{slope } (b_1) & = r \cdot \frac{\sigma_y}{\sigma_x} \\ \text{intercept } (b_0) & = \mu_y - b_1 \mu_x \\ \end{align} \]

[1] "slope (b1) = 2.07 lb per inch"
[1] "intercept (b0) = 35 lb"
[1] "Regression Equation: Est. weight = 35 + 2.07.(height)"
[1] "A person who is 60 inches tall is estimated to be 159 lb"

Plotting the Regression Line

Interpretation of Intercept

Mathematically, the intercept is described as the mean response \((Y)\) value when all predictor variables \((X)\) are set to zero. Sometimes a zero setting for the predictor variable(s) is nonsensical, which makes the intercept noninterpretable.
For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)
\(Height = 0\) is nonsensical; therefore, the model intercept has no interpretation.

Why is it still crucial to include the intercept in the model?

The constant in a regression model guarantees that the residuals have a mean of zero, which is a key assumption in regression analysis. If we don't include the constant, the regression line is forced to go through the origin, meaning the estimated response must be zero when all of the predictors are zero. If the fitted line doesn't naturally go through the origin, the regression coefficients and predictions will be biased. The constant guarantees that the residuals don't have an overall positive or negative bias.
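
A minimal sketch with simulated data illustrates the point: with an intercept, the residuals average exactly to zero; forcing the line through the origin generally leaves a nonzero mean residual.

```r
# Sketch (simulated data, not the class data).
set.seed(10)
height <- rnorm(100, mean = 67, sd = 3)
weight <- 35 + 2 * height + rnorm(100, sd = 20)
fit_with    <- lm(weight ~ height)        # intercept included (default)
fit_without <- lm(weight ~ height - 1)    # line forced through the origin
print(mean(residuals(fit_with)))          # essentially zero
print(mean(residuals(fit_without)))       # generally not zero
```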

Interpretation of Slope

The slope of a straight line measures how much the value of \(Y\) changes for every unit of change in \(X\).

For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)

The slope is \(2 \text { lb per inch}\) - meaning that if a group of people is one inch taller than another group, the former group will be on average 2 lb heavier than the latter.

In other words,

  1. Take all the people of any given height
  2. Then take all the people who are one inch taller
  3. The taller group is 2 lb heavier, on average.

Remember, the slope should NOT be interpreted as: if one person gets taller by 1 inch, he/she will put on 2 lb of weight.
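
As a small worked check of this group-comparison reading, using the slide's estimated equation:

```r
# Sketch: the difference in estimated weight between two heights one inch
# apart is exactly the slope of the line.
est_weight <- function(height) 35 + 2.07 * height
print(est_weight(68) - est_weight(67))   # 2.07 lb
```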

Comparing Two Lines

Which line to use?

Objectively, we want a line that produces the least estimation error.

Data = Fit + Residual

Residuals are the leftover variation in the data after accounting for the model fit.

Residual: difference between observed and expected

The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed response \((y_i)\) and its predicted value based on model fit \((\hat y_i)\): \(e_i = y_i - \hat y_i\)
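
For example, a minimal sketch with made-up observations (hypothetical heights and weights, not the class data), using the regression equation from the earlier example:

```r
# Sketch: residual = observed - predicted, for a few hypothetical people.
est_weight <- function(height) 35 + 2.07 * height
obs <- data.frame(height = c(62, 67, 72),
                  weight = c(150, 180, 176))
obs$predicted <- est_weight(obs$height)
obs$residual  <- obs$weight - obs$predicted
print(obs)
```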

Least Squares Line

In the scatter plot, the residual (in other words, the estimation error) is shown as the vertical distance between the observed point and the line. If an observation is above the line, then its residual is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.

Common practice is to choose the line that minimizes the sum of squared residuals: \(e_1^2 + e_2^2 + ... + e_n^2\)

There is only one line that minimizes the sum of squared residuals. It is called the least squares line.

Mathematically, it can be shown that the regression line is the least squares line.
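
A sketch with simulated data: the coefficients returned by lm(), which minimizes the sum of squared residuals, match the slope and intercept formulas derived earlier.

```r
# Sketch (simulated data): least-squares coefficients from lm() agree with
# b1 = r * sd(y)/sd(x) and b0 = mean(y) - b1 * mean(x).
set.seed(20)
x <- rnorm(200, mean = 67, sd = 3)
y <- 35 + 2 * x + rnorm(200, sd = 20)
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)
print(c(b0, b1))
print(unname(coef(lm(y ~ x))))   # same two numbers
```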

root mean squared (r.m.s) error of regression = r.m.s of residuals =

\[ \bbox[yellow,5px] { \color{black}{\sqrt {1-r^2} \cdot \sigma_y} } \]

r.m.s error of regression

\(\color{black}{\sqrt {1-r^2} \cdot \sigma_y}\)

  • \(r = 1 \text { or } -1: \space\) Scatter is a perfect straight line; r.m.s error of regression = 0
  • \(r = 0: \space\) No linear association; r.m.s error of regression = \(\sigma_y\)
  • All other \(r: \space\) Regression is not perfect, but better than using the average; r.m.s error of regression < \(\sigma_y\)
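
A sketch with simulated data confirming the formula: the r.m.s. of the residuals equals \(\sqrt {1-r^2} \cdot \sigma_y\), where \(\sigma_y\) is computed by dividing by \(n\) rather than \(n-1\).

```r
# Sketch (simulated data): r.m.s. of residuals vs. sqrt(1 - r^2) * sd(y),
# with sd(y) computed population-style (divide by n).
set.seed(30)
x <- rnorm(500, mean = 67, sd = 3)
y <- 35 + 2 * x + rnorm(500, sd = 20)
res   <- residuals(lm(y ~ x))
r     <- cor(x, y)
sdy_n <- sqrt(mean((y - mean(y))^2))
print(sqrt(mean(res^2)))       # r.m.s. of the residuals
print(sqrt(1 - r^2) * sdy_n)   # formula value; the two agree
```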

r.m.s error of regression

Conditions for the Least Squares Line

Linearity - The relationship between \(x\) and \(y\) should show a linear trend.

Nearly Normal Residuals - Generally the residuals should be nearly normal with a mean of zero. There should be no linear association between the residuals and \(x\), meaning \(cor(x,res)=0\). When this condition is violated, it is usually because of outliers.

Conditions for the Least Squares Line

Constant Variability - The variability of points around the least squares line remains roughly constant.

Independent Observations - Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis.
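
A minimal sketch of how these conditions can be eyeballed in R (simulated data; the plots are informal diagnostics, not formal tests):

```r
# Sketch (simulated data): quick residual diagnostics for the conditions above.
set.seed(40)
x <- rnorm(100, mean = 67, sd = 3)
y <- 35 + 2 * x + rnorm(100, sd = 20)
fit <- lm(y ~ x)
plot(x, residuals(fit)); abline(h = 0)   # look for no pattern, constant spread
hist(residuals(fit))                     # look for a roughly normal, centered shape
print(cor(x, residuals(fit)))            # should be essentially zero
```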

\(R^2: \text { Coefficient of Determination}\)

\(R^2\) is the proportion of the variance in the dependent variable that is predictable from the independent variable. For simple linear regression, \(R^2\) is the square of the correlation coefficient \(r\).

\(R^2 = 58\%\) suggests that \(58\%\) of the variability in y can be explained by the variability in x.

  • \(R^2 \space\) provides a measure of how useful the regression line is as a prediction tool.
  • If \(R^2 \space\) is close to 1, then the regression line is useful.
  • If \(R^2 \space\) is close to 0, then the regression line is not useful.
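
A sketch with simulated data: for simple linear regression, the \(R^2\) reported by summary(lm(...)) equals the squared correlation coefficient.

```r
# Sketch (simulated data): R^2 from the model summary equals cor(x, y)^2.
set.seed(50)
x <- rnorm(300, mean = 67, sd = 3)
y <- 35 + 2 * x + rnorm(300, sd = 20)
print(summary(lm(y ~ x))$r.squared)
print(cor(x, y)^2)   # same value
```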

Next Week


Chapter 9-11: Overview of Data Collection Principles