Chapter 3 is all about simple linear regression, a model with two parameters: a slope and an intercept. Later in the chapter we will discuss the least squares point estimates, the model assumptions, the significance of the independent variable, confidence intervals, and the F-test. We will discuss each of these ideas individually and go through the R code showing how to apply them. Throughout this assignment we will be using the US States data set from the Lock5 statistics site.

Chapter 3.1: The Simple Linear Regression Model

Important terms: 1. dependent variable: the response variable, denoted by y. 2. independent variable: the predictor variable, denoted by x. The regression model describes the relationship between these two variables: each value of y is plotted against its corresponding value of x.

states<-read.csv(url("http://www.lock5stat.com/datasets/USStates.csv"))
head(states)
##        State HouseholdIncome Region Population EighthGradeMath HighSchool
## 1    Alabama          43.253      S      4.849           269.2       84.9
## 2     Alaska          70.760      W      0.737           281.6       92.8
## 3    Arizona          49.774      W      6.731           279.7       85.6
## 4   Arkansas          40.768      S      2.966           277.9       87.1
## 5 California          61.094      W     38.803           275.9       84.1
## 6   Colorado          58.433      W      5.356           289.7       89.5
##   College    IQ    GSP Vegetables Fruit Smokers PhysicalActivity Obese
## 1    24.9  95.7 32.615       74.2  54.1    21.5             45.4  32.4
## 2    24.7  99.0 61.156       80.8  60.3    22.6             55.3  28.4
## 3    25.5  97.4 35.195       76.2  60.5    16.3             51.9  26.8
## 4    22.4  97.5 31.837       72.0  49.5    25.9             41.2  34.6
## 5    31.4  95.5 46.029       82.7  69.6    12.5             56.3  24.1
## 6    37.0 101.6 46.242       80.9  64.3    17.7             60.4  21.3
##   NonWhite HeavyDrinkers Electoral ObamaVote ObamaRomney TwoParents
## 1     30.7           4.3         9     0.384           R       58.7
## 2     33.1           8.2         3     0.408           R       69.6
## 3     20.8           6.3        11     0.446           R       62.7
## 4     21.7           5.0         6     0.369           R       62.0
## 5     37.7           6.4        55     0.602           O       65.3
## 6     15.8           6.7         9     0.515           O       69.9
##   StudentSpending Insured
## 1           8.755    78.8
## 2          18.175    79.8
## 3           7.208    74.7
## 4           9.394    71.7
## 5           9.220    79.7
## 6           8.647    80.0
names(states)
##  [1] "State"            "HouseholdIncome"  "Region"          
##  [4] "Population"       "EighthGradeMath"  "HighSchool"      
##  [7] "College"          "IQ"               "GSP"             
## [10] "Vegetables"       "Fruit"            "Smokers"         
## [13] "PhysicalActivity" "Obese"            "NonWhite"        
## [16] "HeavyDrinkers"    "Electoral"        "ObamaVote"       
## [19] "ObamaRomney"      "TwoParents"       "StudentSpending" 
## [22] "Insured"
attach(states)

We do these four steps to 1) read the data from an outside website, 2) see what the first few rows look like, 3) list the variable names so we know how to call each column, and 4) attach the data so we do not need to reference the data frame every time.

Next we plot the data in order to see what it looks like, so that we can sanity-check our results at the end with basic logic. It is also important to label the axes and graphs as accurately as possible.

plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)",
     xlab = "Household Income (thousands of USD)",
     main = "US States' Obesity Rates and Mean Household Incomes")

The formula for the regression model is y = β0 + β1x + ε. Now we can put a line of best fit, also called the linear regression line or line of means, on the graph.

mymod <- lm(Obese ~ HouseholdIncome)
mymod
## 
## Call:
## lm(formula = Obese ~ HouseholdIncome)
## 
## Coefficients:
##     (Intercept)  HouseholdIncome  
##         42.1759          -0.2517
plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)", xlab = "Household Income (thousands of USD)")
abline(42.1759, -0.2517)

Now it is time to interpret the meaning of the numbers displayed above. -0.2517 is the estimated slope b1 (the estimate of β1), which means that for every $1,000 increase in household income there is a 0.2517 percentage point decrease in the obesity rate. 42.1759 is the estimated intercept b0 (the estimate of β0); it is the predicted obesity rate when household income is zero, a value well outside the range of the data, so it should not be over-interpreted. This is also an important time to look over the graph and see if the plotted line seems reasonable, to make sure a silly mistake was not made. ε is not included in the fitted equation because the error term is unobservable and its predicted value is zero. Both time series and cross-sectional data can be used in linear regressions.
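Rather than retyping the coefficient estimates by hand, we can pull them straight out of the fitted model object; coef() and passing the fit directly to abline() are standard base R idioms. A minimal sketch, using the mymod object fitted above:

# pull the estimates from the model object instead of retyping them
coef(mymod)
# abline() also accepts a fitted lm object and draws the same fitted line
abline(mymod)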

3.2 The Least Squares Point Estimates

The formula is the same as in part 3.1, but this section shows the math for how the estimates are calculated. We will continue to use the same example in order to show that the hand calculations agree with the lm() output.

First we calculate the slope: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), where each sum runs from 1 to n, the number of observations.

sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2)
## [1] -0.2516667
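Equivalently, because this ratio is just the sample covariance of x and y divided by the sample variance of x (the n - 1 denominators cancel), base R's cov() and var() give the same estimate:

# equivalent shortcut: b1 = cov(x, y)/var(x)
cov(HouseholdIncome, Obese)/var(HouseholdIncome)
## [1] -0.2516667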

Next is the intercept: b0 = mean(y) - b1*mean(x)

b1 <- (sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2))
(mean(Obese))-(b1*mean(HouseholdIncome))
## [1] 42.17588

It is important to remember that these are point estimates of the unknown parameters.

Next we look at the sum of squared residuals: SSE = sum((y - yhat)^2) = sum((y - (b0 + b1*x))^2), summed over all n observations.

b0 <- mean(Obese) - b1*mean(HouseholdIncome)
sum((Obese - (b0 + b1*HouseholdIncome))^2)
## [1] 321.8777

This is the sum of the squared vertical distances from each observed point to the fitted regression line. Since many of the points sit well away from the line, summing fifty squared distances naturally produces a sizable number; dividing by the degrees of freedom in section 3.4 puts it on a more interpretable scale.
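As a quick cross-check, the same quantity can be computed from the fitted model object, since residuals() returns the y - yhat values; the result should agree with the hand computation above.

# SSE via the residuals of the fitted model
sum(residuals(mymod)^2)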

3.3 Point Estimates and Point Predictions

Let b0 and b1 be the least squares point estimates of the y-intercept β0 and the slope β1 in the simple linear regression model, and suppose that x0, a specified value of the independent variable x, is inside the experimental region. Then yhat = b0 + b1*x0 is the point estimate of the mean value of the dependent variable when the value of the independent variable is x0. In addition, yhat is the point prediction of an individual value of the dependent variable when the value of the independent variable is x0; here we predict the error term to be 0. For this example x0 will equal 50, which we can use because it is within the range of x.

b0 <- (mean(Obese))-(b1*mean(HouseholdIncome))
b0+(b1*50)
## [1] 29.59254
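The built-in predict() function gives the same point estimate without hand-coding the formula; a minimal sketch using the mymod object from earlier (the newdata column name must match the predictor in the model):

predict(mymod, newdata = data.frame(HouseholdIncome = 50))
##        1 
## 29.59254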

Now we plot it as a check; the point is drawn in red so that it is easy to see how the prediction formula works.

plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)", xlab = "Household Income (thousands of USD)")
abline(42.1759, -0.2516667)

points(50, b0 + b1*50, pch = 19, col = "red")

Look at how nicely that lines up: the red point falls exactly on the fitted line, as it should.

3.4 Model Assumptions and the Standard Error

There are several assumptions necessary to perform a regression: 1) Mean zero assumption. At any given value of x, the population of potential error term values has a mean equal to 0. 2) Constant variance assumption. At any given value of x, the population of potential error term values has a variance that does not depend on the value of x; that is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote the constant variance as σ^2. 3) Normality assumption. At any given value of x, the population of potential error term values has a normal distribution. 4) Independence assumption. Any one value of the error term ε is statistically independent of any other value of ε.

The next part of this section is on the mean square error and the standard error: 1) The point estimate of σ^2 is the mean square error, s^2 = SSE/(n-2). 2) The point estimate of σ is the standard error, s = sqrt(SSE/(n-2)). In my opinion this is implied, but I am trying to give an in-depth analysis.
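Before estimating σ^2, the first three assumptions can be eyeballed with residual plots; a minimal sketch, using the mymod fit from above (plot(mymod) offers similar built-in diagnostics):

# residuals vs x: look for mean 0 and roughly constant spread everywhere
plot(HouseholdIncome, residuals(mymod),
     xlab = "Household Income (thousands of USD)", ylab = "Residual")
abline(h = 0, lty = 2)
# normal quantile plot of the residuals: look for a roughly straight line
qqnorm(residuals(mymod))
qqline(residuals(mymod))

If nothing alarming shows up, we proceed to compute the point estimates s^2 and s: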

SSE <- sum((Obese - (b0 + b1*HouseholdIncome))^2)
SSE/48
## [1] 6.705784
(SSE/48)^.5
## [1] 2.589553

3.5 Testing the Significance of the Slope and the y-Intercept

We test the null hypothesis that β1 equals 0 versus the alternative hypothesis that β1 does not equal 0 if we are doing a two-sided test, or that β1 is greater than (or less than) 0 if we are performing a one-sided test. Some important formulas in this section: t = b1/sb1, where sb1 = s/sqrt(SSxx), and this t follows a t distribution with n-2 degrees of freedom. For ease of viewing I have sorted the three cases in the following comments.

# HA: beta1 not equal to zero
# two sided test where the calculated t value can either be greater than the positive t value with n-2 degrees of freedom or smaller than the negative t value with n-2 degrees of freedom
# p value: Twice the area under the t curve to the right of the absolute value of t
# HA: beta1 >0
# one sided test where the calculated t value is greater than the positive t value with n-2 degrees of freedom
# p value: area under the t curve to the right of t 
# HA: beta1 <0
# one sided test where the calculated t value is less than the negative t value with n-2 degrees of freedom
# p value: area under the t curve to the left of t 
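The three p-value rules above translate directly into pt() calls; a small sketch (these helper names are my own, not from the textbook):

# p-values for the three alternatives, given a t statistic and its df
p_two_sided <- function(t, df) 2*pt(-abs(t), df)              # HA: beta1 != 0
p_greater   <- function(t, df) pt(t, df, lower.tail = FALSE)  # HA: beta1 > 0
p_less      <- function(t, df) pt(t, df)                      # HA: beta1 < 0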

Below is the calculation of the sb1 value in this example. The s was calculated in the last step, and SSxx = sum((x - xbar)^2) is the same denominator that appeared in the slope calculation.

((SSE/48)^.5)/(sum((HouseholdIncome - mean(HouseholdIncome))^2))^.5
## [1] 0.0425692

Now we have all the information to calculate the t value:

sb1 <- ((SSE/48)^.5)/(sum((HouseholdIncome - mean(HouseholdIncome))^2))^.5
b1/sb1
## [1] -5.911946

This means that the t value is -5.911946.

qt(c(.025, .975), df=48)
## [1] -2.010635  2.010635

The critical values for the t statistic with n-2 = 48 degrees of freedom are (-2.010635, 2.010635). This means that we reject the null hypothesis in this case, because -5.911946 is less than -2.010635 in a two-sided test.

The corresponding p-value is twice the area under the t curve to the right of |t|, i.e. 2*pt(-abs(-5.911946), df = 48), which is on the order of 10^-7. With a t statistic this far into the tail, we reject the null hypothesis at any reasonable level of alpha: .001, .01 or .05.
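All of these pieces (b1, sb1, the t value and its p-value) are also reported together by summary(), which makes a handy cross-check of the hand calculations:

# coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
summary(mymod)$coefficients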

Next we are going to construct confidence intervals. In a practical interpretation, we can say that we are a certain percent confident that the true population slope is between the calculated values. In a probabilistic interpretation, we can say that a certain percentage of confidence intervals constructed in the same manner will contain the true population slope. This next piece of code shows how this is done.

mymod <- lm(Obese ~ HouseholdIncome, data = states)
confint(mymod, level=.95)
##                      2.5 %     97.5 %
## (Intercept)     37.5561608 46.7955955
## HouseholdIncome -0.3372578 -0.1660756
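The interval for the slope can also be rebuilt by hand as b1 ± t(.025) × sb1 with 48 degrees of freedom; the endpoints should match the HouseholdIncome row of the confint() output:

# 95% CI for the slope: estimate +/- critical value * standard error
b1 + c(-1, 1)*qt(.975, df = 48)*sb1
## [1] -0.3372578 -0.1660756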

3.6 Confidence and Prediction Intervals

We can quantify the uncertainty around yhat by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y. The formula for the distance value is (1/n) + ((x0 - xbar)^2)/SSxx. The confidence interval for a mean value of y is yhat ± t(α/2) × s × sqrt(distance value); an easier R shortcut can be seen below. If you want a prediction interval, the quantity under the square root is the distance value + 1.

thismod <- lm(Obese ~ HouseholdIncome)
newdata <- data.frame(HouseholdIncome = 50)
(predy <- predict(thismod, newdata, interval="predict") )
##        fit      lwr      upr
## 1 29.59254 24.32658 34.85851
(confy <- predict(thismod, newdata, interval="confidence") )
##        fit      lwr      upr
## 1 29.59254 28.80438 30.38071

As you can see, the prediction interval is wider, which is correct given the formula. This is because a prediction interval is for a single new observation, so its width reflects both the uncertainty in estimating the mean of y at x0 and the additional scatter of an individual value around that mean (the +1 under the square root). It is also important to check that the intervals are centered in the same place, which they are.

confy[1] == predy[1]
## [1] TRUE
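To connect the predict() shortcut back to the formulas above, the confidence interval for the mean value of y at x0 = 50 can be rebuilt by hand from the distance value; the endpoints should match the interval = "confidence" row above:

# distance value at x0 = 50, then yhat +/- t * s * sqrt(distance value)
d <- 1/50 + (50 - mean(HouseholdIncome))^2/sum((HouseholdIncome - mean(HouseholdIncome))^2)
(b0 + b1*50) + c(-1, 1)*qt(.975, df = 48)*((SSE/48)^.5)*sqrt(d)
## [1] 28.80438 30.38071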

3.7 Simple Coefficients of Determination and Correlation

The coefficient of determination is denoted r^2 and equals explained variation/total variation. Total variation is equal to sum((yi - ybar)^2) and explained variation is equal to sum((yhati - ybar)^2). r^2 is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model.

cor(Obese, HouseholdIncome)
## [1] -0.6491116
cor(HouseholdIncome, Obese)
## [1] -0.6491116
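Since r is the simple correlation coefficient, squaring it gives the coefficient of determination directly; this equals the explained variation/total variation ratio defined above:

# r^2: proportion of the variation in Obese explained by HouseholdIncome
cor(Obese, HouseholdIncome)^2
## [1] 0.4213459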

It is important to note that the correlation is the same in either order, which is logical. r > 0 when β1 is positive and r < 0 when β1 is negative; here r is negative, which makes sense because b1 is negative. This is also visible in the scatterplot, which is why we printed it for the viewer in the first place. The last part of this section is the t statistic for testing whether the population correlation coefficient ρ equals zero, t = (r*sqrt(n-2))/sqrt(1-r^2), which is calculated below.

(cor(Obese, HouseholdIncome)*(48^.5))/((1-cor(Obese, HouseholdIncome)^2)^.5)
## [1] -5.911946

This implies that we can reject H0: ρ = 0 in favor of HA: ρ ≠ 0 at significance levels .05, .01 and .001. It follows that we have extremely strong evidence of a linear relationship, or correlation, between obesity and household income. Note that this statistic, -5.911946, is exactly the t statistic for the slope from section 3.5, as it must be in simple linear regression.

3.8 An F-Test for the Model

The F statistic for the model is defined as explained variation/(unexplained variation/(n-2)), and it is compared to an F distribution with 1 numerator and n-2 denominator degrees of freedom. The following computes it in R directly from that definition.

yhat <- b0 + b1*HouseholdIncome
explained <- sum((yhat - mean(Obese))^2)
unexplained <- sum((Obese - yhat)^2)
explained/(unexplained/48)
## [1] 34.95111

Holy huge F stat, Batman! F = 34.95111 on 1 and 48 degrees of freedom, which is significant at any reasonable alpha; notice that it is exactly the square of the slope t statistic, (-5.911946)^2, as it should be in simple linear regression. Overall this is a summary of all the statistical principles covered within Chapter 3 of the Forecasting, Time Series and Regression textbook written by Bowerman, O'Connell and Koehler.