Data Set

To demonstrate how to apply RStudio to the Chapter 3 topics, we will use the Skating2010 data set in the resampledata package, which contains scoring data from men’s skating in the 2010 Olympics. It’s a good idea to start with a summary of the data set so we know what variables it contains.

library(resampledata)
## 
## Attaching package: 'resampledata'
## The following object is masked from 'package:datasets':
## 
##     Titanic
data(Skating2010)
attach(Skating2010)
summary(Skating2010)
##               Name              Country       Short            Free      
##  ABBOTT Jeremy  : 1   Japan         : 3   Min.   :57.22   Min.   :102.1  
##  AMODIO Florent : 1   United States : 3   1st Qu.:67.33   1st Qu.:116.9  
##  BACCHINI Paolo : 1   Canada        : 2   Median :72.57   Median :137.6  
##  BORODULIN Artem: 1   Czech Republic: 2   Mean   :74.13   Mean   :136.6  
##  BREZINA Michal : 1   France        : 2   3rd Qu.:81.36   3rd Qu.:154.5  
##  CHAN Patrick   : 1   Italy         : 2   Max.   :90.85   Max.   :167.4  
##  (Other)        :18   (Other)       :10                                  
##      Total      
##  Min.   :165.9  
##  1st Qu.:186.8  
##  Median :210.2  
##  Mean   :210.7  
##  3rd Qu.:238.6  
##  Max.   :257.7  
## 

As we can see, the data set has 5 variables. This guide focuses on two of them: we want to know whether the short skate score is related to the free skate score.

Plot

The first thing we should do is make a scatterplot to get a visual sense of how the two variables are related. To do this we use the plot function. Notice how we can edit the axis labels and add a title. While that may not be necessary for this example, R otherwise labels the axes with the variable names, which is not always ideal. Also take note of the order of the variables. When the variables are separated by a comma, the explanatory variable comes first and the response variable second, but if “~” is used instead of a comma, the response variable comes first. In this case the response variable is short skate score and the explanatory variable is free skate score.

plot(Free, Short, xlab = "Free Skate Score", ylab = "Short Skate Score", main = "Free and Short Skate Scores")

We can see a general upward trend: skaters with higher free skate scores tend to have higher short skate scores. The points also cluster fairly closely around a straight line, so a linear relationship looks plausible, and the next step is to create a linear regression model.
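
The two notations described above can be compared side by side. This is a minimal sketch that assumes the resampledata package is installed; it looks the columns up through the data frame instead of relying on attach, which is a stylistic choice rather than something this guide requires.

```r
library(resampledata)   # provides the Skating2010 data set
data(Skating2010)

# Comma notation: explanatory variable (x) first, response (y) second.
plot(Skating2010$Free, Skating2010$Short,
     xlab = "Free Skate Score", ylab = "Short Skate Score")

# Formula notation: response ~ explanatory, so the response comes first.
# The data argument tells R where to look up Short and Free.
plot(Short ~ Free, data = Skating2010,
     xlab = "Free Skate Score", ylab = "Short Skate Score")
```

Both calls should draw the same scatterplot, with free skate score on the x-axis.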

Linear Regression Model

To create a linear regression model in R, we use the “lm” function. In this function, we have to use a “~” and the response variable must always come first. So the general format should be “lm(response~explanatory)”. It is also helpful to save this model using any name you choose. Thus, your code should look something like this:

mod <- lm(Short~Free)

Now we simply call up our model using whatever the saved name is and interpret the results.

mod
## 
## Call:
## lm(formula = Short ~ Free)
## 
## Coefficients:
## (Intercept)         Free  
##     19.0609       0.4033

R gives us the estimated intercept and slope of the regression line. In this case the equation is: \[\widehat{\text{Short}} = 19.0609 + 0.4033 \times \text{Free}\]

Intercept

The intercept of 19.0609 can be thought of as the short skate score we would expect from a skater who got a free skate score of zero. However, a free skate score of zero is far outside the range of the data (the lowest free skate score is 102.1), so the intercept has no meaningful interpretation here.

Slope

The slope of 0.4033 can be interpreted as the amount by which we would expect a skater’s short skate score to increase for each additional point earned in the free skate.
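
To make the slope interpretation concrete, the sketch below plugs a few free skate scores into the fitted equation. It uses the rounded coefficients printed by the model, and short_hat is a hypothetical helper name introduced only for this illustration.

```r
# Rounded coefficients copied from the printed model.
b0 <- 19.0609   # intercept
b1 <- 0.4033    # slope

# Fitted regression line as a function of free skate score.
# (short_hat is a hypothetical helper, not part of any package.)
short_hat <- function(free) b0 + b1 * free

# Difference in predicted short skate score for skaters whose
# free skate scores are 1 point and 10 points apart.
diff_1  <- short_hat(141) - short_hat(140)   # equals the slope
diff_10 <- short_hat(150) - short_hat(140)   # equals 10 times the slope
c(diff_1, diff_10)
```

A one-point difference in free skate score changes the prediction by the slope, and a ten-point difference changes it by ten times the slope, regardless of the starting score.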

Plotting the Regression Line

Now that we have the equation for the regression line, it would be useful to plot it with the data we plotted before. The first line of code is the same as our plot from before. We then add the regression line using the “abline” function, entering the intercept followed by the slope.

plot(Free, Short, xlab = "Free Skate Score", ylab = "Short Skate Score", main = "Free and Short Skate Scores")
abline(19.0609, 0.4033)
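
Typing the coefficients by hand works, but abline also accepts a fitted model object, which avoids rounding and retyping. A minimal sketch, assuming the resampledata package is installed and using the data argument in place of attach:

```r
library(resampledata)
data(Skating2010)

# Same model as in the guide, fit with an explicit data argument.
mod <- lm(Short ~ Free, data = Skating2010)

plot(Short ~ Free, data = Skating2010,
     xlab = "Free Skate Score", ylab = "Short Skate Score",
     main = "Free and Short Skate Scores")

# abline() reads the intercept and slope from the model's coefficients.
abline(mod)
```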

Predictions

We can also use our regression model to make predictions. For example, suppose we want to predict the short skate score of a skater who earned a free skate score of 135. We could plug that value into the equation and calculate the prediction manually, or we could extract the coefficients from the model with the “coef” function and multiply them by a vector consisting of 1 followed by the value of the explanatory variable, like this:

coef(mod)%*%c(1, 135)
##          [,1]
## [1,] 73.50164

So for a skater whose free skate score was 135, we would expect him to earn about 73.50 in the short skate.
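
The manual route mentioned above looks like this. The sketch uses the rounded coefficients printed by the model, so it is an approximation of what coef(mod) gives at full precision.

```r
# Rounded coefficients copied from the printed model.
b0 <- 19.0609   # intercept
b1 <- 0.4033    # slope

# Plug a free skate score of 135 into the fitted equation.
manual_pred <- b0 + b1 * 135
manual_pred
```

The result, 73.5064, differs from 73.50164 only in the third decimal place, because the printed coefficients are rounded to four decimals.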

Checking Regression Assumptions

We can use residuals to test some of the assumptions we make when we make a regression model. First we should make a vector of our residuals.

myres <- mod$residuals

The first assumption is that the mean of the error is zero. We can check this easily by checking the mean of our residuals.

mean(myres)
## [1] 2.405212e-16

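This number is zero up to rounding error. Keep in mind, though, that least-squares residuals from a model with an intercept always average to zero by construction, so this check confirms that the model was fit correctly rather than providing independent evidence for the assumption. Another assumption of regression is that the errors are normally distributed. The easiest way to check this in R is to compare the quantiles of the residuals with those of a normal distribution.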

qqnorm(myres)
qqline(myres)

The points follow the line closely, so this assumption appears to be met. Another assumption is that the variance of the errors is constant. An easy way to check this is to plot the residuals against the explanatory variable and draw a horizontal line at 0 on the y-axis.

plot(myres~Free)
abline(0,0)

What we are looking for in this plot is whether the spread of the residuals changes noticeably across values of the explanatory variable. Here there is no evidence of that, so this assumption appears to be met. The last assumption is that each error term is independent of all the others. To verify this assumption we would need information on exactly how the data were collected.

F-Tests

In simple linear regression, an F-test checks whether two variables have a linear relationship. The null hypothesis is that there is no linear relationship, that is, the slope is zero, and the calculations for the test assume that to be true. The alternative hypothesis is that there is a linear relationship, so the slope is non-zero. To perform this test we can take a summary of our regression model.

summary(mod)
## 
## Call:
## lm(formula = Short ~ Free)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3032  -3.0333   0.0394   3.8877   7.8846 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.06091    7.77587   2.451   0.0226 *  
## Free         0.40326    0.05635   7.157 3.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.476 on 22 degrees of freedom
## Multiple R-squared:  0.6995, Adjusted R-squared:  0.6859 
## F-statistic: 51.22 on 1 and 22 DF,  p-value: 3.562e-07

The last line of the output gives us an F-statistic of 51.22 on 1 and 22 degrees of freedom and a p-value of 3.562e-07.

In simple linear regression models such as this one, the degrees of freedom of the F-statistic are always 1 and n-2, respectively, where n is the number of observations (here n = 24, so n-2 = 22). There is also a formula to calculate the F-statistic manually, but we’ll just continue to use R.

To reject the null hypothesis, we must have a p-value below our chosen significance level. For simplicity, we’ll just use a typical significance level of 0.05. Our p-value is much smaller than that so we can reject the null hypothesis in favor of the alternative hypothesis that there is a linear relationship between Free skate score and Short skate score.
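
The reported numbers can be cross-checked from the summary output without the full manual formula. With a single explanatory variable, the F-statistic equals the square of the slope’s t value, and the p-value is the upper-tail probability of the F distribution. The sketch below uses only values copied from the summary, so small rounding differences are expected.

```r
# Values copied from the summary(mod) output.
t_slope <- 7.157   # t value for the Free coefficient
f_stat  <- 51.22   # reported F-statistic
df1 <- 1           # numerator degrees of freedom
df2 <- 22          # denominator degrees of freedom (n - 2)

# With one explanatory variable, F is the square of the slope's t value.
t_slope^2

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value

# Decision at the 0.05 significance level.
p_value < 0.05
```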

Mean Squared Error

Another useful piece of information we can get from the summary of the regression model is the mean squared error. The mean squared error is an estimate of the variance of the y variable around the regression line for a given value of x. We could find it using the actual equations, but instead we should let R do the work for us. The summary does not print the mean squared error directly; it prints its square root, labeled “Residual standard error”, which in this case is 5.476. The mean squared error is therefore about 5.476^2 ≈ 29.99.
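
The relationship between the mean squared error and the “Residual standard error” line can be checked directly. This sketch assumes the resampledata package is installed and refits the same model with an explicit data argument.

```r
library(resampledata)
data(Skating2010)
mod <- lm(Short ~ Free, data = Skating2010)

# Sum of squared residuals divided by the residual degrees of freedom (n - 2).
mse <- sum(resid(mod)^2) / df.residual(mod)
mse

# The "Residual standard error" stored in the summary is the square root of the MSE.
rse <- summary(mod)$sigma
c(sqrt(mse), rse)
```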

R-Squared

A third statistic we can obtain from that same summary is the R-squared value. R-squared is the proportion of the variability of the response that is explained by its linear relationship with the predictor. In simpler terms, an R-squared value of 1 means there is a perfect linear relationship, while a value of 0 means no linear relationship. The summary lists the R-squared value as “Multiple R-squared”, which in our case is 0.6995, so about 70% of the variability in short skate scores is explained by free skate scores.
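
R-squared can also be computed from its definition as a check on the summary. The sketch below assumes the resampledata package is installed; the variable names sse and sst are labels chosen for this illustration.

```r
library(resampledata)
data(Skating2010)
mod <- lm(Short ~ Free, data = Skating2010)

# Variability left unexplained by the line (sse) and total variability (sst).
sse <- sum(resid(mod)^2)
sst <- sum((Skating2010$Short - mean(Skating2010$Short))^2)

# Proportion of the variability explained by the regression.
r2_manual <- 1 - sse / sst

# In simple linear regression this equals the squared correlation.
r2_cor <- cor(Skating2010$Short, Skating2010$Free)^2

c(r2_manual, r2_cor, summary(mod)$r.squared)
```

All three values should agree with the “Multiple R-squared” entry in the summary.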

Confidence Intervals

Confidence intervals are a good tool for finding the likely range of the true value of a parameter. The “confint” function is an easy way to construct a confidence interval with a confidence level of your choice. I’m going to use a 95% confidence level.

confint(mod, level = .95)
##                 2.5 %    97.5 %
## (Intercept) 2.9347422 35.187074
## Free        0.2864073  0.520122

R gives us intervals for both the intercept and the slope. For each one, we are 95% confident that the interval contains the true population value of that parameter.
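
The interval for the slope can be reproduced by hand as the estimate plus or minus a t critical value times the standard error. The sketch below uses the rounded estimate and standard error from the model summary, so it matches confint only up to rounding.

```r
# Slope estimate, standard error, and residual degrees of freedom,
# copied from the summary(mod) output.
est      <- 0.40326
se       <- 0.05635
df_resid <- 22

# Critical value of the t distribution for a 95% interval.
t_crit <- qt(0.975, df_resid)

# Estimate plus or minus the margin of error.
ci <- est + c(-1, 1) * t_crit * se
ci
```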

Intervals for Specific Data Frames

A more specific thing we can do is construct both confidence and prediction intervals for our response variable at a chosen value of the explanatory variable. To do this, we first create a new data frame that holds the chosen value of x in a column with the same name as the explanatory variable in the model.

newdata <- data.frame(Free = 140)

Now we can use the predict function to produce confidence and prediction intervals for Short skate score when Free skate score is 140. Notice that the only difference comes at the end when we input the type of interval. If I don’t specify a level of confidence R will just use 95%.

confy <- predict(mod, newdata, interval = "confidence")
predy <- predict(mod, newdata, interval = "prediction")
confy
##        fit      lwr      upr
## 1 75.51796 73.16528 77.87064
predy
##        fit      lwr      upr
## 1 75.51796 63.92005 87.11588

As you can see, both intervals are centered at the same value, but the prediction interval is much wider. The prediction interval is the range in which we are 95% confident the short skate score of an individual skater who scored 140 on the free skate will fall. The confidence interval, in contrast, is the range that we are 95% confident contains the true population mean short skate score for all skaters who score 140 on the free skate. Predicting one skater’s score involves more uncertainty than estimating a mean, which is why the prediction interval is wider.
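
The difference in width can be traced to the standard errors behind each interval. The sketch below assumes the resampledata package is installed; it asks predict for the standard error of the estimated mean and then adds the residual variance to get the standard error for an individual prediction. The names conf_int and pred_int are chosen for this illustration.

```r
library(resampledata)
data(Skating2010)
mod <- lm(Short ~ Free, data = Skating2010)
newdata <- data.frame(Free = 140)

# se.fit = TRUE also returns the standard error of the estimated mean,
# the residual standard error, and the residual degrees of freedom.
p <- predict(mod, newdata, se.fit = TRUE)
t_crit <- qt(0.975, p$df)

# Confidence interval: uncertainty in the estimated mean only.
conf_int <- p$fit + c(-1, 1) * t_crit * p$se.fit

# Prediction interval: adds the variability of an individual score.
se_pred  <- sqrt(p$se.fit^2 + p$residual.scale^2)
pred_int <- p$fit + c(-1, 1) * t_crit * se_pred

rbind(confidence = conf_int, prediction = pred_int)
```

The extra residual variance term is what makes the prediction interval wider than the confidence interval at the same free skate score.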