Chapter 3 defines the simple linear regression model, which has two parameters: the slope and the intercept. Later in the chapter we discuss the least squares point estimates, the model assumptions, the significance of the independent variable, confidence intervals, and the F-test. We cover each of these ideas individually and walk through the R code for applying them. Throughout this assignment we use the US States data set from the Lock5 textbook website, which contains demographic, economic, and health measures for the 50 states.
Important terms:
Dependent variable: this is the response variable which is denoted by y.
Independent variable: this is the predictor variable which is denoted by x.
The regression model describes the relationship between these two variables: each value of y is plotted against its corresponding value of x.
states<-read.csv(url("http://www.lock5stat.com/datasets/USStates.csv"))
head(states)
## State HouseholdIncome Region Population EighthGradeMath HighSchool
## 1 Alabama 43.253 S 4.849 269.2 84.9
## 2 Alaska 70.760 W 0.737 281.6 92.8
## 3 Arizona 49.774 W 6.731 279.7 85.6
## 4 Arkansas 40.768 S 2.966 277.9 87.1
## 5 California 61.094 W 38.803 275.9 84.1
## 6 Colorado 58.433 W 5.356 289.7 89.5
## College IQ GSP Vegetables Fruit Smokers PhysicalActivity Obese
## 1 24.9 95.7 32.615 74.2 54.1 21.5 45.4 32.4
## 2 24.7 99.0 61.156 80.8 60.3 22.6 55.3 28.4
## 3 25.5 97.4 35.195 76.2 60.5 16.3 51.9 26.8
## 4 22.4 97.5 31.837 72.0 49.5 25.9 41.2 34.6
## 5 31.4 95.5 46.029 82.7 69.6 12.5 56.3 24.1
## 6 37.0 101.6 46.242 80.9 64.3 17.7 60.4 21.3
## NonWhite HeavyDrinkers Electoral ObamaVote ObamaRomney TwoParents
## 1 30.7 4.3 9 0.384 R 58.7
## 2 33.1 8.2 3 0.408 R 69.6
## 3 20.8 6.3 11 0.446 R 62.7
## 4 21.7 5.0 6 0.369 R 62.0
## 5 37.7 6.4 55 0.602 O 65.3
## 6 15.8 6.7 9 0.515 O 69.9
## StudentSpending Insured
## 1 8.755 78.8
## 2 18.175 79.8
## 3 7.208 74.7
## 4 9.394 71.7
## 5 9.220 79.7
## 6 8.647 80.0
names(states)
## [1] "State" "HouseholdIncome" "Region"
## [4] "Population" "EighthGradeMath" "HighSchool"
## [7] "College" "IQ" "GSP"
## [10] "Vegetables" "Fruit" "Smokers"
## [13] "PhysicalActivity" "Obese" "NonWhite"
## [16] "HeavyDrinkers" "Electoral" "ObamaVote"
## [19] "ObamaRomney" "TwoParents" "StudentSpending"
## [22] "Insured"
attach(states)
These four steps read the data from an outside website, preview what the data look like, list the variable names, and attach the data frame so that we do not need to reference it by name every time we use a column.
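As a side note, attach() can cause confusion later if another object with the same column names is created. A minimal sketch of two common alternatives, shown only for illustration, is to refer to columns with the $ operator or to evaluate an expression inside the data frame with with():

# Refer to a column explicitly with $
mean(states$Obese)
# Or evaluate an expression inside the data frame with with()
with(states, mean(Obese))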
Next we plot the data to see what it looks like, which also lets us sanity-check our results at the end with basic logic. It is also important to label the axes and title the graph as accurately as possible.
plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)",
xlab = "Household Income ($1000s)",
main = "US States' Obesity Rates and Mean Household Incomes")
The formula for the fitted regression line is: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] Now we can plot the line of best fit (also called the linear regression line or line of means) on the graph.
mymod <- lm(Obese ~ HouseholdIncome)
mymod
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Coefficients:
## (Intercept) HouseholdIncome
## 42.1759 -0.2517
plot(HouseholdIncome,Obese,ylab="Obesity Rate (%)",xlab="Household Income ($1000s)")
abline(42.1759,-0.2517)
Now it is time to interpret the numbers displayed above. The slope \(\hat{\beta_1} = -0.2517\) means that for every $1,000 increase in mean household income, the predicted obesity rate decreases by about 0.2517 percentage points. The intercept \(\hat{\beta_0} = 42.1759\) is the predicted obesity rate when household income equals zero; because no state has a mean household income anywhere near zero, the intercept should not be given a literal interpretation here, it simply anchors the line. This is also a good time to look over the graph and check that the plotted line seems reasonable, to make sure a silly mistake was not made. Both time series data (a look at the change in a variable over time) and cross-sectional data (a look at many units during the same time period) can be used in linear regressions; in this case, we have cross-sectional data.
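As a quick illustration of the slope interpretation, the fitted coefficients can be pulled out of the model object with coef(). This is only a sketch, assuming the mymod object fitted above and that HouseholdIncome is recorded in thousands of dollars:

mymod <- lm(Obese ~ HouseholdIncome)
b <- coef(mymod)   # b[1] is the intercept, b[2] is the slope
# Predicted difference in obesity rate between two states whose mean
# household incomes differ by $10,000 (10 units of HouseholdIncome)
10 * b[2]          # roughly 10 * (-0.2517) = -2.5 percentage points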
The next thing to examine is a residual plot. We do this to check that the residuals look random, with no obvious pattern, which supports the conclusion that a linear regression is an appropriate fit. The resid() command extracts the residuals.
mymod <- lm(Obese ~ HouseholdIncome)
mymod.res = resid(mymod)
plot(states$HouseholdIncome, mymod.res, ylab="Residuals", xlab="Household Income", main="Chapter 3 Example")
abline(0, 0)
As you can see, the points are scattered fairly evenly above and below the zero line with no obvious pattern. This suggests that a linear regression is appropriate for this data set.
Another way to check for normality and to see if a data set follows a linear relationship is to look at the qq plots for the data. The quantile-quantile or q-q plot is an exploratory graphical device used to check the validity of a distributional assumption for a data set. In general, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question.
qqplot(Obese, HouseholdIncome, plot.it = TRUE, xlab = "Obesity", ylab = "Household Income")
Again, looking at the q-q plot, the points follow a reasonably straight line, so we can move on to further examination knowing that the assumptions appear reasonable.
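The q-q plot above compares the two variables to each other. The normality assumption in regression, however, concerns the error terms, so a common complementary check is a normal q-q plot of the residuals. A minimal sketch, assuming the mymod object fitted earlier:

mymod <- lm(Obese ~ HouseholdIncome)
qqnorm(resid(mymod), main = "Normal Q-Q Plot of Residuals")
qqline(resid(mymod))   # reference line; points close to it support normality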
The formula is the same as in part 3.1, but this section shows the calculation behind the least squares regression. We continue with the same example in order to show that the hand calculations match the lm() output.
First we calculate the slope: \(\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\), where n is the number of observations.
sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2)
## [1] -0.2516667
Next is the intercept: \(\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}\)
b1 <- (sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2))
(mean(Obese))-(b1*mean(HouseholdIncome))
## [1] 42.17588
It is important to remember that these are point estimates of the parameters.
Next we look at the sum of squared residuals: \(SSE = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\), where \(\hat{y_i} = \hat{\beta_0}+\hat{\beta_1}x_i\).
b0 <- (mean(Obese))-(b1*mean(HouseholdIncome))
sum((Obese - (b0 + b1*HouseholdIncome))^2)
SSE sums the squared vertical distances from each observation to the fitted line, so it grows with both the number of observations and the amount of scatter around the line; on its own it is hard to interpret (the residual standard error of 2.59 on 48 degrees of freedom reported by summary(mymod) later implies SSE is roughly 2.59² × 48 ≈ 322). Dividing by the degrees of freedom, as we do below, turns it into the mean square error.
Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be the least squares point estimates of the y-intercept \({\beta_0}\) and the slope \({\beta_1}\) in the simple linear regression model, and suppose that \(x_{0}\), a specified value of the independent variable x, is inside the experimental region. Then \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_0\) is the point estimate of the mean value of the dependent variable when the value of the independent variable is \(x_{0}\). In addition, \(\hat{y}\) is the point prediction of an individual value of the dependent variable when the value of the independent variable is \(x_{0}\); here the error term is predicted to be 0. For this example we use \(x_{0} = 50\), which is valid because 50 is within the observed range of x.
b0 <- (mean(Obese))-(b1*mean(HouseholdIncome))
b0+(b1*50)
## [1] 29.59254
Now we plot it as a check; the predicted point is drawn in red so that it is easy to see how the prediction formula works.
plot(HouseholdIncome,Obese,ylab="Obesity Rate (%)",xlab="Household Income ($1000s)")
abline(42.1759,-0.2516667)
points(50, b0+(b1*50), pch=19, col="red")
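The same point estimate can also be obtained with R's built-in predict() function, which is used again later for interval estimates. A short sketch, assuming mymod is the model fitted above:

mymod <- lm(Obese ~ HouseholdIncome)
predict(mymod, newdata = data.frame(HouseholdIncome = 50))
# should agree with b0 + b1*50, about 29.59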
There are several assumptions necessary to perform a regression:
Mean zero assumption. At any given value of x, the population of potential error term values has a mean equal to 0.
Constant variance assumption. At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote this constant variance as \(\sigma^2\).
Normality assumption. At any given value of x, the population of potential error term values has a normal distribution.
Independence assumption. Any one value of the error term \(\varepsilon\) is statistically independent of any other value of \(\varepsilon\). R's built-in diagnostic plots, shown in the sketch below, give a quick visual check of these assumptions.
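A minimal sketch of the standard diagnostic plots for a fitted lm object; the residuals-vs-fitted panel speaks to the mean zero and constant variance assumptions, and the normal q-q panel speaks to the normality assumption:

mymod <- lm(Obese ~ HouseholdIncome)
par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a 2 x 2 grid
plot(mymod)            # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # reset the plotting layout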
The next part of this section is on the mean square error and the standard error.
The mean square error is \(s^2 = \frac{SSE}{n-2}\) and the standard error is \(s = \sqrt{\frac{SSE}{n-2}}\), where n - 2 = 48 here.
mymod <- lm(Obese ~ HouseholdIncome)
SSE <- sum(resid(mymod)^2)
SSE/48
sqrt(SSE/48)
The value of s computed this way matches the residual standard error of 2.59 reported by summary(mymod) in the next section.
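As a cross-check, the same standard error is stored in the model summary object. A one-line sketch; this should equal the square root computed above:

summary(lm(Obese ~ HouseholdIncome))$sigma   # residual standard error s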
We test the null hypothesis that \({\beta_1} = 0\) against the alternative hypothesis that \({\beta_1} \neq 0\) (a two-sided test), or that \({\beta_1}\) is greater than or less than 0 (a one-sided test). One important formula in this section is \(t = \hat{\beta_1}/s_{\hat{\beta_1}}\), where \(s_{\hat{\beta_1}} = s/\sqrt{SS_{xx}}\); this statistic follows a t distribution with n - 2 degrees of freedom. The three alternative hypotheses, their rejection rules, and their p-values are laid out below.
HA: \({\beta_1} \neq 0\)
Two-sided test: reject if the calculated t value is greater than the positive t critical value with n - 2 degrees of freedom or smaller than the negative t critical value with n - 2 degrees of freedom.
p-value: twice the area under the t curve to the right of |t|.
HA: \({\beta_1} > 0\)
One-sided test: reject if the calculated t value is greater than the positive t critical value with n - 2 degrees of freedom.
p-value: the area under the t curve to the right of t.
HA: \({\beta_1} < 0\)
One-sided test: reject if the calculated t value is less than the negative t critical value with n - 2 degrees of freedom.
p-value: the area under the t curve to the left of t.
The t test statistic appears in the regression summary below.
mymod <- lm(Obese ~ HouseholdIncome)
summary (mymod)
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1702 -1.7152 0.4645 1.7809 4.6312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.17588 2.29764 18.356 < 2e-16 ***
## HouseholdIncome -0.25167 0.04257 -5.912 3.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 48 degrees of freedom
## Multiple R-squared: 0.4213, Adjusted R-squared: 0.4093
## F-statistic: 34.95 on 1 and 48 DF, p-value: 3.416e-07
With a t value of -5.912 and a p-value of 3.42e-07, we reject the null hypothesis at any reasonable significance level and conclude that there is significant evidence of a linear relationship between household income and obesity.
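To connect the summary output to the t distribution with n - 2 = 48 degrees of freedom, the p-value and the two-sided critical value can be reproduced by hand. A short sketch using the t statistic of -5.912 reported above:

2 * pt(-abs(-5.912), df = 48)   # two-sided p-value, roughly 3.4e-07
qt(0.975, df = 48)              # critical value for a two-sided test at alpha = 0.05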
Next we construct confidence intervals. In practical terms, we can say we are a certain percent confident that the true population slope lies between the calculated bounds. In probabilistic terms, we can say that the stated percentage of confidence intervals constructed in this manner would contain the true population slope. This next piece of code shows how this is done.
mymod <- lm(Obese ~ HouseholdIncome, data = states)
confint(mymod, level=.95)
## 2.5 % 97.5 %
## (Intercept) 37.5561608 46.7955955
## HouseholdIncome -0.3372578 -0.1660756
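The slope interval from confint() can also be reproduced by hand from the summary output: the estimate plus or minus the t critical value times the standard error. A sketch using the values reported by summary(mymod) above:

-0.25167 + c(-1, 1) * qt(0.975, df = 48) * 0.04257
# roughly (-0.337, -0.166), matching the confint() output above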
We can also compute a confidence interval for the mean value of y and a prediction interval for an individual value of y at a given \(x_{0}\). The distance value is \(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_{xx}}\). The confidence interval for the mean value of y is \(\hat{y} \pm t_{\alpha/2}\, s \sqrt{\text{distance value}}\); an easier R shortcut is shown below. For a prediction interval, the quantity under the square root becomes 1 + distance value.
thismod <- lm(Obese ~ HouseholdIncome)
newdata <- data.frame(HouseholdIncome = 50)
(predy <- predict(thismod, newdata, interval="predict") )
## fit lwr upr
## 1 29.59254 24.32658 34.85851
(confy <- predict(thismod, newdata, interval="confidence") )
## fit lwr upr
## 1 29.59254 28.80438 30.38071
As you can see, the prediction interval is wider, which is logical. A prediction interval covers a single new observation, and the variability of an individual value is larger than the variability of an estimated mean. It is also important to check that the two intervals are centered at the same point, which they are.
confy[1] == predy[1]
## [1] TRUE
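To visualize how the two kinds of intervals differ across the whole range of incomes, both can be computed over a grid of x values and drawn as bands around the fitted line. A minimal sketch; the object names (newx, cband, pband), the grid size, and the line styles are arbitrary choices made here for illustration:

thismod <- lm(Obese ~ HouseholdIncome)
newx <- data.frame(HouseholdIncome = seq(min(HouseholdIncome), max(HouseholdIncome), length.out = 100))
cband <- predict(thismod, newx, interval = "confidence")
pband <- predict(thismod, newx, interval = "prediction")
plot(HouseholdIncome, Obese, xlab = "Household Income ($1000s)", ylab = "Obesity Rate (%)")
abline(thismod)
lines(newx$HouseholdIncome, cband[, "lwr"], lty = 2)   # confidence band (mean of y)
lines(newx$HouseholdIncome, cband[, "upr"], lty = 2)
lines(newx$HouseholdIncome, pband[, "lwr"], lty = 3)   # prediction band (individual y)
lines(newx$HouseholdIncome, pband[, "upr"], lty = 3)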
This is denoted \(r^{2}\), which equals explained variation divided by total variation. Total variation is \(\sum_{i=1}^{n}(y_i-\bar{y})^2\) and explained variation is \(\sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2\). \(r^{2}\) is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model. The first two calculations below give \(r^{2}\), while the last two give r.
cor(Obese, HouseholdIncome)^2
## [1] 0.4213459
cor(HouseholdIncome, Obese)^2
## [1] 0.4213459
cor(Obese, HouseholdIncome)
## [1] -0.6491116
cor(HouseholdIncome, Obese)
## [1] -0.6491116
It is important to note that the correlation is the same in either order, which is logical. r > 0 when \({\beta_1}\) is positive and r < 0 when \({\beta_1}\) is negative; here r is negative, which makes sense because \(\hat{\beta_1}\) is negative. This is also consistent with the graph, which is why we plotted the data in the first place. Note that \(r^{2}\) matches the Multiple R-squared value shown earlier in the summary output.
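The same value can also be read directly off the fitted model. A one-line sketch, which should match both the squared correlation above and the Multiple R-squared line in the summary:

summary(lm(Obese ~ HouseholdIncome))$r.squared   # about 0.4213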
The F statistic is defined as \(F = \frac{\text{explained variation}}{\text{unexplained variation}/(n-2)}\), based on 1 numerator and n - 2 denominator degrees of freedom. The following code runs var.test() in R.
var.test(HouseholdIncome, Obese)
##
## F test to compare two variances
##
## data: HouseholdIncome and Obese
## F = 6.6525, num df = 49, denom df = 49, p-value = 6.117e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.775156 11.723024
## sample estimates:
## ratio of variances
## 6.652537
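The var.test() output above compares the variance of HouseholdIncome to the variance of Obese, which is a different question from the regression F test defined at the start of this section. A minimal sketch of how the regression F test itself is usually obtained, using anova() on the fitted model (output omitted here; its F value and p-value should match the F-statistic line in the summary(mymod) output shown earlier):

mymod <- lm(Obese ~ HouseholdIncome)
anova(mymod)   # regression F test on 1 and 48 degrees of freedom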
The regression F statistic is 34.95 on 1 and 48 degrees of freedom with a p-value of 3.416e-07, which is significant at any reasonable alpha. This means we have found statistically significant evidence of a linear relationship between Household Income and Obesity. It is important to note that this agrees with the t test in the earlier section, since in simple linear regression the F statistic equals the square of the t statistic: \((-5.912)^2 \approx 34.95\). Overall, this is a summary of the statistical principles covered within Chapter 3 of the Forecasting, Time Series, and Regression textbook written by Bowerman, O'Connell, and Koehler.