Chapter 3 defines the simple linear regression model, which has two parameters: the slope and the intercept. Later in the chapter we discuss the least squares point estimates, the model assumptions, the significance of the independent variable, confidence intervals, and the F-test. We cover each of these ideas individually and walk through the R code for applying them. Throughout this assignment we use the US States data set from the Lock5 textbook website, which contains demographic, economic, and health measures for the 50 states.
Important terms:
Dependent variable: this is the response variable which is denoted by y.
Independent variable: this is the predictor variable which is denoted by x.
The regression model describes the relationship between these two variables: each value of y is plotted against its corresponding value of x.
states<-read.csv(url("http://www.lock5stat.com/datasets/USStates.csv"))
head(states)
## State HouseholdIncome Region Population EighthGradeMath HighSchool
## 1 Alabama 43.253 S 4.849 269.2 84.9
## 2 Alaska 70.760 W 0.737 281.6 92.8
## 3 Arizona 49.774 W 6.731 279.7 85.6
## 4 Arkansas 40.768 S 2.966 277.9 87.1
## 5 California 61.094 W 38.803 275.9 84.1
## 6 Colorado 58.433 W 5.356 289.7 89.5
## College IQ GSP Vegetables Fruit Smokers PhysicalActivity Obese
## 1 24.9 95.7 32.615 74.2 54.1 21.5 45.4 32.4
## 2 24.7 99.0 61.156 80.8 60.3 22.6 55.3 28.4
## 3 25.5 97.4 35.195 76.2 60.5 16.3 51.9 26.8
## 4 22.4 97.5 31.837 72.0 49.5 25.9 41.2 34.6
## 5 31.4 95.5 46.029 82.7 69.6 12.5 56.3 24.1
## 6 37.0 101.6 46.242 80.9 64.3 17.7 60.4 21.3
## NonWhite HeavyDrinkers Electoral ObamaVote ObamaRomney TwoParents
## 1 30.7 4.3 9 0.384 R 58.7
## 2 33.1 8.2 3 0.408 R 69.6
## 3 20.8 6.3 11 0.446 R 62.7
## 4 21.7 5.0 6 0.369 R 62.0
## 5 37.7 6.4 55 0.602 O 65.3
## 6 15.8 6.7 9 0.515 O 69.9
## StudentSpending Insured
## 1 8.755 78.8
## 2 18.175 79.8
## 3 7.208 74.7
## 4 9.394 71.7
## 5 9.220 79.7
## 6 8.647 80.0
names(states)
## [1] "State" "HouseholdIncome" "Region"
## [4] "Population" "EighthGradeMath" "HighSchool"
## [7] "College" "IQ" "GSP"
## [10] "Vegetables" "Fruit" "Smokers"
## [13] "PhysicalActivity" "Obese" "NonWhite"
## [16] "HeavyDrinkers" "Electoral" "ObamaVote"
## [19] "ObamaRomney" "TwoParents" "StudentSpending"
## [22] "Insured"
attach(states)
These four steps read the data from an outside website, preview what the data look like, list the variable names, and attach the data frame so that we do not need to reference it by name every time we use a column.
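As a side note, attach() can cause confusion later if another object with the same column names is created. A minimal sketch of two common alternatives, shown only for illustration, is to refer to columns with the $ operator or to evaluate an expression inside the data frame with with():

# Refer to a column explicitly with $
mean(states$Obese)
# Or evaluate an expression inside the data frame with with()
with(states, mean(Obese))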
Next we plot the data to see what it looks like, which also lets us sanity-check our results at the end with basic logic. It is also important to label the axes and title the graph as accurately as possible.
plot(HouseholdIncome, Obese, ylab = "Obesity Rate (%)",
xlab = "Household Income ($1000s)",
main = "US States' Obesity Rates and Mean Household Incomes")
The formula for the fitted regression line is: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_i\] Now we can plot the line of best fit (also called the linear regression line or line of means) on the graph.
mymod <- lm(Obese ~ HouseholdIncome)
mymod
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Coefficients:
## (Intercept) HouseholdIncome
## 42.1759 -0.2517
plot(HouseholdIncome,Obese,ylab="Obesity Rate (%)",xlab="Household Income ($1000s)")
abline(42.1759,-0.2517)
Now it is time to interpret the numbers displayed above. The slope \(\hat{\beta_1} = -0.2517\) means that for every $1,000 increase in mean household income, the predicted obesity rate decreases by about 0.2517 percentage points. The intercept \(\hat{\beta_0} = 42.1759\) is the predicted obesity rate when household income equals zero; because no state has a mean household income anywhere near zero, the intercept should not be given a literal interpretation here, it simply anchors the line. This is also a good time to look over the graph and check that the plotted line seems reasonable, to make sure a silly mistake was not made. Both time series data (a look at the change in a variable over time) and cross-sectional data (a look at many units during the same time period) can be used in linear regressions; in this case, we have cross-sectional data.
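As a quick illustration of the slope interpretation, the fitted coefficients can be pulled out of the model object with coef(). This is only a sketch, assuming the mymod object fitted above and that HouseholdIncome is recorded in thousands of dollars:

mymod <- lm(Obese ~ HouseholdIncome)
b <- coef(mymod)   # b[1] is the intercept, b[2] is the slope
# Predicted difference in obesity rate between two states whose mean
# household incomes differ by $10,000 (10 units of HouseholdIncome)
10 * b[2]          # roughly 10 * (-0.2517) = -2.5 percentage points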
The next thing to examine is a residual plot. We do this to check that the residuals look random, with no obvious pattern, which supports the conclusion that a linear regression is an appropriate fit. The resid() command extracts the residuals.
mymod <- lm(Obese ~ HouseholdIncome)
mymod.res = resid(mymod)
plot(states$HouseholdIncome, mymod.res, ylab="Residuals", xlab="Household Income", main="Chapter 3 Example")
abline(0, 0)
As you can see, the points are scattered fairly evenly above and below the zero line with no obvious pattern. This suggests that a linear regression is appropriate for this data set.
Another way to check for normality and to see if a data set follows a linear relationship is to look at the qq plots for the data. The quantile-quantile or q-q plot is an exploratory graphical device used to check the validity of a distributional assumption for a data set. In general, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question.
qqplot(Obese, HouseholdIncome, plot.it = TRUE, xlab = "Obesity", ylab = "Household Income")
Again, looking at the q-q plot, the points follow a reasonably straight line, so we can move on to further examination knowing that the assumptions appear reasonable.
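The q-q plot above compares the two variables to each other. The normality assumption in regression, however, concerns the error terms, so a common complementary check is a normal q-q plot of the residuals. A minimal sketch, assuming the mymod object fitted earlier:

mymod <- lm(Obese ~ HouseholdIncome)
qqnorm(resid(mymod), main = "Normal Q-Q Plot of Residuals")
qqline(resid(mymod))   # reference line; points close to it support normality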
The formula is the same as in part 3.1, but this section shows the calculation behind the least squares regression. We continue with the same example in order to show that the hand calculations match the lm() output.
First we calculate the slope: \(\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\), where n is the number of observations.
sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2)
## [1] -0.2516667
Next is the intercept: \(\hat{\beta_0} = \bar{y}-\hat{\beta_1}\bar{x}\)
b1 <- (sum( (HouseholdIncome - mean(HouseholdIncome))*(Obese - mean(Obese) ) )/sum((HouseholdIncome - mean(HouseholdIncome) )^2))
(mean(Obese))-(b1*mean(HouseholdIncome))
## [1] 42.17588
It is important to remember that these are point estimates of the parameters.
Next we look at the sum of squared residuals: \(SSE = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\), where \(\hat{y_i} = \hat{\beta_0}+\hat{\beta_1}x_i\).
b0 <- (mean(Obese))-(b1*mean(HouseholdIncome))
sum((Obese - (b0 + b1*HouseholdIncome))^2)
SSE sums the squared vertical distances from each observation to the fitted line, so it grows with both the number of observations and the amount of scatter around the line; on its own it is hard to interpret (the residual standard error of 2.59 on 48 degrees of freedom reported by summary(mymod) later implies SSE is roughly 2.59² × 48 ≈ 322). Dividing by the degrees of freedom, as we do below, turns it into the mean square error.
Let \(\hat{\beta_0}\) and \(\hat{\beta_1}\) be the least squares point estimates of the y-intercept \({\beta_0}\) and the slope \({\beta_1}\) in the simple linear regression model, and suppose that \(x_{0}\), a specified value of the independent variable x, is inside the experimental region. Then \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x_0\) is the point estimate of the mean value of the dependent variable when the value of the independent variable is \(x_{0}\). In addition, \(\hat{y}\) is the point prediction of an individual value of the dependent variable when the value of the independent variable is \(x_{0}\); here the error term is predicted to be 0. For this example we use \(x_{0} = 50\), which is valid because 50 is within the observed range of x.
b0 <- (mean(Obese))-(b1*mean(HouseholdIncome))
b0+(b1*50)
## [1] 29.59254
Now we plot it as a check; the predicted point is drawn in red so that it is easy to see how the prediction formula works.
plot(HouseholdIncome,Obese,ylab="Obesity Rate (%)",xlab="Household Income ($1000s)")
abline(42.1759,-0.2516667)
points(50, b0+(b1*50), pch=19, col="red")
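The same point estimate can also be obtained with R's built-in predict() function, which is used again later for interval estimates. A short sketch, assuming mymod is the model fitted above:

mymod <- lm(Obese ~ HouseholdIncome)
predict(mymod, newdata = data.frame(HouseholdIncome = 50))
# should agree with b0 + b1*50, about 29.59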
There are several assumptions necessary to perform a regression:
Mean zero assumption. At any given value of x, the population of potential error term values has a mean equal to 0.
Constant variance assumption. At any given value of x, the population of potential error term values has a variance that does not depend on the value of x. That is, the different populations of potential error term values corresponding to different values of x have equal variances. We denote this constant variance as \(\sigma^2\).
Normality assumption. At any given value of x, the population of potential error term values has a normal distribution.
Independence assumption. Any one value of the error term \(\varepsilon\) is statistically independent of any other value of \(\varepsilon\). R's built-in diagnostic plots, shown in the sketch below, give a quick visual check of these assumptions.
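A minimal sketch of the standard diagnostic plots for a fitted lm object; the residuals-vs-fitted panel speaks to the mean zero and constant variance assumptions, and the normal q-q panel speaks to the normality assumption:

mymod <- lm(Obese ~ HouseholdIncome)
par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a 2 x 2 grid
plot(mymod)            # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))   # reset the plotting layout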
The next part of this section is on the mean square error and the standard error.
The mean square error is \(s^2 = \frac{SSE}{n-2}\) and the standard error is \(s = \sqrt{\frac{SSE}{n-2}}\), where n - 2 = 48 here.
mymod <- lm(Obese ~ HouseholdIncome)
SSE <- sum(resid(mymod)^2)
SSE/48
sqrt(SSE/48)
The value of s computed this way matches the residual standard error of 2.59 reported by summary(mymod) in the next section.
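As a cross-check, the same standard error is stored in the model summary object. A one-line sketch; this should equal the square root computed above:

summary(lm(Obese ~ HouseholdIncome))$sigma   # residual standard error s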
We test the null hypothesis that \({\beta_1} = 0\) against the alternative hypothesis that \({\beta_1} \neq 0\) (a two-sided test), or that \({\beta_1}\) is greater than or less than 0 (a one-sided test). One important formula in this section is \(t = \hat{\beta_1}/s_{\hat{\beta_1}}\), where \(s_{\hat{\beta_1}} = s/\sqrt{SS_{xx}}\); this statistic follows a t distribution with n - 2 degrees of freedom. The three alternative hypotheses, their rejection rules, and their p-values are laid out below.
HA: \({\beta_1} \neq 0\)
Two-sided test: reject if the calculated t value is greater than the positive t critical value with n - 2 degrees of freedom or smaller than the negative t critical value with n - 2 degrees of freedom.
p-value: twice the area under the t curve to the right of |t|.
HA: \({\beta_1} > 0\)
One-sided test: reject if the calculated t value is greater than the positive t critical value with n - 2 degrees of freedom.
p-value: the area under the t curve to the right of t.
HA: \({\beta_1} < 0\)
One-sided test: reject if the calculated t value is less than the negative t critical value with n - 2 degrees of freedom.
p-value: the area under the t curve to the left of t.
The t test statistic appears in the regression summary below.
mymod <- lm(Obese ~ HouseholdIncome)
summary (mymod)
##
## Call:
## lm(formula = Obese ~ HouseholdIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1702 -1.7152 0.4645 1.7809 4.6312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.17588 2.29764 18.356 < 2e-16 ***
## HouseholdIncome -0.25167 0.04257 -5.912 3.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.59 on 48 degrees of freedom
## Multiple R-squared: 0.4213, Adjusted R-squared: 0.4093
## F-statistic: 34.95 on 1 and 48 DF, p-value: 3.416e-07
With a t value of -5.912 and a p-value of 3.42e-07, we reject the null hypothesis at any reasonable significance level and conclude that there is significant evidence of a linear relationship between household income and obesity.
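To connect the summary output to the t distribution with n - 2 = 48 degrees of freedom, the p-value and the two-sided critical value can be reproduced by hand. A short sketch using the t statistic of -5.912 reported above:

2 * pt(-abs(-5.912), df = 48)   # two-sided p-value, roughly 3.4e-07
qt(0.975, df = 48)              # critical value for a two-sided test at alpha = 0.05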
Next we construct confidence intervals. In practical terms, we can say we are a certain percent confident that the true population slope lies between the calculated bounds. In probabilistic terms, we can say that the stated percentage of confidence intervals constructed in this manner would contain the true population slope. This next piece of code shows how this is done.
mymod <- lm(Obese ~ HouseholdIncome, data = states)
confint(mymod, level=.95)
## 2.5 % 97.5 %
## (Intercept) 37.5561608 46.7955955
## HouseholdIncome -0.3372578 -0.1660756
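The slope interval from confint() can also be reproduced by hand from the summary output: the estimate plus or minus the t critical value times the standard error. A sketch using the values reported by summary(mymod) above:

-0.25167 + c(-1, 1) * qt(0.975, df = 48) * 0.04257
# roughly (-0.337, -0.166), matching the confint() output above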
We can also compute a confidence interval for the mean value of y and a prediction interval for an individual value of y at a given \(x_{0}\). The distance value is \(\frac{1}{n} + \frac{(x_0-\bar{x})^2}{SS_{xx}}\). The confidence interval for the mean value of y is \(\hat{y} \pm t_{\alpha/2}\, s \sqrt{\text{distance value}}\); an easier R shortcut is shown below. For a prediction interval, the quantity under the square root becomes 1 + distance value.
thismod <- lm(Obese ~ HouseholdIncome)
newdata <- data.frame(HouseholdIncome = 50)
(predy <- predict(thismod, newdata, interval="predict") )
## fit lwr upr
## 1 29.59254 24.32658 34.85851
(confy <- predict(thismod, newdata, interval="confidence") )
## fit lwr upr
## 1 29.59254 28.80438 30.38071
As you can see, the prediction interval is wider, which is logical. A prediction interval covers a single new observation, and the variability of an individual value is larger than the variability of an estimated mean. It is also important to check that the two intervals are centered at the same point, which they are.
confy[1] == predy[1]
## [1] TRUE
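To visualize how the two kinds of intervals differ across the whole range of incomes, both can be computed over a grid of x values and drawn as bands around the fitted line. A minimal sketch; the object names (newx, cband, pband), the grid size, and the line styles are arbitrary choices made here for illustration:

thismod <- lm(Obese ~ HouseholdIncome)
newx <- data.frame(HouseholdIncome = seq(min(HouseholdIncome), max(HouseholdIncome), length.out = 100))
cband <- predict(thismod, newx, interval = "confidence")
pband <- predict(thismod, newx, interval = "prediction")
plot(HouseholdIncome, Obese, xlab = "Household Income ($1000s)", ylab = "Obesity Rate (%)")
abline(thismod)
lines(newx$HouseholdIncome, cband[, "lwr"], lty = 2)   # confidence band (mean of y)
lines(newx$HouseholdIncome, cband[, "upr"], lty = 2)
lines(newx$HouseholdIncome, pband[, "lwr"], lty = 3)   # prediction band (individual y)
lines(newx$HouseholdIncome, pband[, "upr"], lty = 3)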
This is denoted \(r^{2}\), which equals explained variation divided by total variation. Total variation is \(\sum_{i=1}^{n}(y_i-\bar{y})^2\) and explained variation is \(\sum_{i=1}^{n}(\hat{y_i}-\bar{y})^2\). \(r^{2}\) is the proportion of the total variation in the n observed values of the dependent variable that is explained by the simple linear regression model. The first two calculations below give \(r^{2}\), while the last two give r.
cor(Obese, HouseholdIncome)^2
## [1] 0.4213459
cor(HouseholdIncome, Obese)^2
## [1] 0.4213459
cor(Obese, HouseholdIncome)
## [1] -0.6491116
cor(HouseholdIncome, Obese)
## [1] -0.6491116
It is important to note that the correlation is the same in either order, which is logical. r > 0 when \({\beta_1}\) is positive and r < 0 when \({\beta_1}\) is negative; here r is negative, which makes sense because \(\hat{\beta_1}\) is negative. This is also consistent with the graph, which is why we plotted the data in the first place. Note that \(r^{2}\) matches the Multiple R-squared value shown earlier in the summary output.
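The same value can also be read directly off the fitted model. A one-line sketch, which should match both the squared correlation above and the Multiple R-squared line in the summary:

summary(lm(Obese ~ HouseholdIncome))$r.squared   # about 0.4213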
The F statistic is defined as \(F = \frac{\text{explained variation}}{\text{unexplained variation}/(n-2)}\), based on 1 numerator and n - 2 denominator degrees of freedom. The following code runs var.test() in R.
var.test(HouseholdIncome, Obese)
##
## F test to compare two variances
##
## data: HouseholdIncome and Obese
## F = 6.6525, num df = 49, denom df = 49, p-value = 6.117e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.775156 11.723024
## sample estimates:
## ratio of variances
## 6.652537
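The var.test() output above compares the variance of HouseholdIncome to the variance of Obese, which is a different question from the regression F test defined at the start of this section. A minimal sketch of how the regression F test itself is usually obtained, using anova() on the fitted model (output omitted here; its F value and p-value should match the F-statistic line in the summary(mymod) output shown earlier):

mymod <- lm(Obese ~ HouseholdIncome)
anova(mymod)   # regression F test on 1 and 48 degrees of freedom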
The regression F statistic is 34.95 on 1 and 48 degrees of freedom with a p-value of 3.416e-07, which is significant at any reasonable alpha. This means we have found statistically significant evidence of a linear relationship between Household Income and Obesity. It is important to note that this agrees with the t test in the earlier section, since in simple linear regression the F statistic equals the square of the t statistic: \((-5.912)^2 \approx 34.95\). Overall, this is a summary of the statistical principles covered within Chapter 3 of the Forecasting, Time Series, and Regression textbook written by Bowerman, O'Connell, and Koehler.