When you first dig into a data set, it is good to do some preliminary analysis. This can include loading the data, looking at the column names, and checking how many observations you are working with. In the code below, I load the MiceData. This data set covers 20 different mice and records the body weight, brain weight, and litter size of each one. In this project, I want to see whether body weight and brain weight are correlated, that is, whether the brain weight of a mouse depends on its body weight. I am going to ignore the litter size for now and just look at two of the three given variables.
MiceData <- read.csv("MiceData.csv", fileEncoding = "UTF-8-BOM") # strip the byte-order mark from the first column name
head(MiceData)
## LitterSize BodyWeight BrainWeight
## 1 3 9.447 0.444
## 2 4 9.780 0.436
## 3 4 9.155 0.417
## 4 4 9.613 0.429
## 5 5 8.850 0.425
## 6 5 9.610 0.434
numRows = nrow(MiceData)
numCols = ncol(MiceData)
numRows
## [1] 20
numCols
## [1] 3
names(MiceData)
## [1] "ï..LitterSize" "BodyWeight" "BrainWeight"
attach(MiceData)
Here we have 3 columns and 20 rows. This means that twenty mice were observed and the body weight, brain weight, and litter size were recorded for each one. For future calculations we need n = 20. The column names are given when names(MiceData) is called, and attach(MiceData) lets us refer to each column by name directly in later commands.
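For a quick numeric overview at this stage, a couple of extra base R commands can also help; this is just an optional sketch (output not shown):
str(MiceData)      # data type and a preview of each column
summary(MiceData)  # min, quartiles, mean, and max for each variable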
Next, we may want to graph our data. Calling plot(MiceData) by itself does not give us much useful information, since the axis labels and variable pairings are unclear. Thus, we need to decide how we want to organize and label our graph. For this project, I want to see if brain weight depends on body weight, so brain weight is the response variable and body weight is the predictor. With that in mind, we still use the plot command but give R more information about how the plot should be organized and labeled.
plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
xlab = "Bodyweight of Mouse",
main = "Brainweight and Bodyweight of Mice")
The data does seem to be somewhat positively correlated. That being said, the data is fairly sparse; ideally, I would like there to be more than 20 data points. However, there is a trend upward: the more the mouse weighs, the bigger its brain seems to be. Within this trend, the points are not packed tightly. There is still variation in brain weight at a given body weight, and mice with similar weights can have different sized brains. All in all, I'd say we should expect to see a positive correlation and have some results to analyze. It is important to make these observations when first looking at the data so you can be skeptical of unexpected results later on in the analysis.
When making the linear regression for the MiceData, I simply had to supply R with the data and it did the rest. Using the lm command and providing R with the predictor and response variables sets R up with enough information to generate a linear regression.
MData <- lm(BrainWeight ~ BodyWeight)
MData
##
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
##
## Coefficients:
## (Intercept) BodyWeight
## 0.33555 0.01048
From the linear regression above, we see that the y-intercept is .33555 and the slope is .01048. The intercept means that if the body weight of the mouse were 0, the predicted brain weight would be .33555. This, of course, does not make sense: a mouse cannot have a body weight of 0, and our data does not come anywhere near a body weight of 0, so this y-intercept is an extrapolation of the data. In this class, we do not want to use extrapolations, and therefore the y-intercept does not mean anything to us in this case. The slope is the coefficient in front of the body weight. It tells us that for each one-unit increase in a mouse's body weight, the predicted brain weight increases by .01048. This results in our equation looking as follows:
Brainweight = .33555 + .01048*(BodyWeight)
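Rather than retyping these rounded numbers later, we can also pull the intercept and slope straight out of the fitted model; a small sketch using the coef accessor:
myCoefs <- coef(MData)   # named vector with entries (Intercept) and BodyWeight
myCoefs["(Intercept)"]   # about 0.33555
myCoefs["BodyWeight"]    # about 0.01048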
Let’s see what adding this linear regression line to the graph looks like. We will use the same plot from above and use the abline function to graph the regression line on top of the original plot. Make sure to run the plot call and the abline call together, or abline will have no open plot to draw on and you will get an error message.
plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
xlab = "Bodyweight of Mouse",
main = "Brainweight and Bodyweight of Mice")
abline(.33555,.01048)
Looking at the regression line with the data tells us that the line follows the general trend of the data. Again, I am a little disappointed that the data is sparse, but I would say the regression line does a good job of following the general trend of the data.
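As a side note, abline can also be given the fitted model object directly, so we do not have to retype the rounded intercept and slope; on the same plot as above, this one line is equivalent:
abline(MData)   # draws the regression line using the coefficients stored in the model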
Now we want to use our linear approximation to predict data. We can take a specific body weight and try to predict the brain weight of the mouse.
Let’s say we have a body weight of 6.655: what will we predict for the brain weight?
How will our regression do if the body weight is 9.61? I used the linear regression line to predict the brain weights of mice with these body weights.
Brainweight = .33555 + .01048*(BodyWeight)
ExpectB1 = .33555 + .01048*(6.655)
ExpectB1
## [1] 0.4052944
ExpectB2 = .33555 + .01048*(9.61)
ExpectB2
## [1] 0.4362628
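The same predictions can also come from the model itself using predict, which avoids copying rounded coefficients by hand; the results should agree with the hand calculations above up to rounding. A small sketch:
newWeights <- data.frame(BodyWeight = c(6.655, 9.61))
predict(MData, newWeights)   # predicted brain weights for both body weights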
Now let’s compare these predictions to the observed brain weights at these body weights. For each prediction, I looked up the mouse in the data whose body weight matched and took its recorded brain weight.
MiceData[16,]
## LitterSize BodyWeight BrainWeight
## 16 10 6.655 0.405
ActualB1 = MiceData[16, "BrainWeight"]
ActualB1
## [1] 0.405
MiceData[6,]
## LitterSize BodyWeight BrainWeight
## 6 5 9.61 0.434
ActualB2 = MiceData[6, "BrainWeight"]
ActualB2
## [1] 0.434
Now, I want to look at the difference between what the linear regression predicted and what was actually observed. To do this, I take the actual value and subtract the expected value.
ErrorB1 = ActualB1 - ExpectB1
ErrorB1
## [1] -0.0002944
ErrorB2 = ActualB2 - ExpectB2
ErrorB2
## [1] -0.0022628
Let’s take a look at the results. ErrorB1: -0.0002944. It looks like our model overestimated a little, but definitely not by much; the prediction matches the observed value to three decimal places, which is pretty good, especially because there is some variability among the data.
ErrorB2: -0.0022628. This error is also small, and the prediction matches the observed value to two decimal places, so this looks pretty good for this weight as well.
I would say our model does a decent job of estimating these brain weights. Of course, the raw errors are so small that they are hard to interpret on their own, so a percent error, (observed - expected) divided by expected, may be more useful here.
Example:
percentDiff = ErrorB1/ExpectB1    # (observed - expected)/expected
percentDiff
## [1] -0.0007263856
perDiff = 100 * percentDiff       # expressed as a percentage
perDiff
## [1] -0.07263856
So the estimated brain weight only differed from the actual brain weight by about 0.073%. That’s pretty good!
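If we wanted the percent error for both test points at once, the same formula works on vectors; a quick sketch:
errors   <- c(ErrorB1, ErrorB2)
expected <- c(ExpectB1, ExpectB2)
100 * errors / expected   # percent error for each of the two predictions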
What we did above was look at the residuals for two data points that I isolated by hand. In R, we can have the model calculate all the residuals at once, which gives us a much better picture of how well our linear regression is predicting the data across the whole data set.
myResiduals = MData$residuals
myResiduals
## 1 2 3 4 5
## 0.0094437460 -0.0020462378 -0.0144959678 -0.0072960057 -0.0032994361
## 6 7 8 9 10
## -0.0022645644 -0.0185142376 0.0139180566 -0.0041028097 0.0060979864
## 11 12 13 14 15
## 0.0046701459 -0.0025621862 -0.0177184641 0.0213644508 0.0083732634
## 16 17 18 19 20
## -0.0002948878 0.0241714441 0.0071758977 -0.0246659673 0.0020457735
Seeing all these numbers can be a little overwhelming. It can be helpful to look at them in a histogram. This will give us a good visual representation of how our model is doing at predicting the data. When looking at the histogram, we want to determine if the residuals are normally distributed or not.
hist(myResiduals)
Looking at the above histogram, it looks as though the residuals are roughly normally distributed. To make this even easier to judge, let’s make a normal Q-Q plot. If the residuals are normal, the points should fall along a straight line. Let’s take a look (again, make sure to run these two lines together):
qqnorm(myResiduals)
qqline(myResiduals)
Interesting. It isn’t terrible, but not quite a straight line. There is some variation, but for the most part, I’d say the points at least follow the line, which is good. Another thing we learned about is homoscedasticity. Homoscedasticity means that the variance of the errors is roughly constant across all values of the predictor. To check this, we can plot the residuals against body weight and look at whether their spread stays about the same.
plot(MData$residuals ~ BodyWeight)
abline(0,0)
The residuals appear randomly scattered around zero with roughly constant spread. Therefore, there is no sign of heteroscedasticity (where the spread of the errors grows or shrinks as the x variable changes). Thus, the data appears homoscedastic.
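R can also produce a standard set of diagnostic plots (residuals vs. fitted values, a normal Q-Q plot, and more) straight from the model object, which is a quick way to double-check these same assumptions:
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2 x 2 grid
plot(MData)
par(mfrow = c(1, 1))   # reset the plotting layout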
We can use the R command summary(MData). This will give us a lot of information about our data. Let’s take a look:
summary(MData)
##
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.024666 -0.004901 -0.001171 0.007475 0.024171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.335548 0.017327 19.366 1.68e-13 ***
## BodyWeight 0.010480 0.002204 4.755 0.000158 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01308 on 18 degrees of freedom
## Multiple R-squared: 0.5567, Adjusted R-squared: 0.5321
## F-statistic: 22.61 on 1 and 18 DF, p-value: 0.0001583
The first thing I notice is that the intercept is .33555 and the slope coefficient on the body weight is .01048, which is familiar from above. We can see that the residual standard error is 0.01308. If our linear regression fit the data perfectly, this value would be 0, so the small value here suggests the regression fits the data pretty well. This value was found using 18 degrees of freedom, which makes sense because n = 20 and we estimated two parameters (the intercept and the slope), leaving n - 2 = 18. The Multiple R-squared value is 55.67%. This means that 55.67% of the variation in the response variable is explained by the predictor, so body weight explains about 55.67% of the variation in brain weight. The p-value for the body weight coefficient is .0001583. With alpha = 0.05, the p-value is less than alpha, so we reject the null hypothesis that the slope is 0. This means that brain weight does depend on body weight.
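The individual pieces of this summary can also be pulled out by name if we want to reuse them in later calculations; a brief sketch:
modelSummary <- summary(MData)
modelSummary$r.squared   # Multiple R-squared, 0.5567
modelSummary$sigma       # residual standard error, 0.01308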
We can see what R thinks about the correlation of the brain weight and body weight of the mice.
cor(BrainWeight, BodyWeight)
## [1] 0.7461485
cor(BodyWeight, BrainWeight)
## [1] 0.7461485
When computing correlation coefficients in R, the order of the arguments doesn’t matter. Here we see that there is a positive correlation between body weight and brain weight in mice. A value of 1.0 would mean a perfect positive correlation and 0 would mean no correlation, so .746 means they are fairly strongly positively correlated. This agrees with the trend we saw in the scatterplot, so we are satisfied with this numeric result.
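With a single predictor, squaring this correlation should reproduce the Multiple R-squared from the summary, which is a nice consistency check:
cor(BodyWeight, BrainWeight)^2   # about 0.5567, matching the Multiple R-squared above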
Let’s look at what R can generate for us when it comes to confidence intervals.
mod <- lm(BrainWeight ~ BodyWeight, data = MiceData )
confint(mod, level=.9)
## 5 % 95 %
## (Intercept) 0.305501930 0.36559330
## BodyWeight 0.006658234 0.01430263
From the information above, we can say that if we repeated this experiment many times and built a 90% confidence interval from each new sample, about 90% of those intervals would contain the true intercept; here, that interval is .3055 to .3656. Likewise, the interval .0067 to .0143 is a 90% confidence interval for the true slope on body weight.
R can also generate confidence intervals at a specific value of the predictor. For these examples, that value is the body weight of the mouse, and I am going to use 7.400. We start by creating a data frame where the body weight is equal to 7.400, and then we build the intervals from there.
newData <- data.frame(BodyWeight = 7.400)
newData
## BodyWeight
## 1 7.4
confidenceInt <- predict(MData, newData, interval = "confidence")
confidenceInt
## fit lwr upr
## 1 0.4131028 0.4067493 0.4194563
This tells us that, at a body weight of 7.400, we are 95% confident (the default level for predict) that the average brain weight of all mice with that body weight is between .4067 and .4195. This is not for a specific mouse, but for the average of all mice with that body weight.
Prediction intervals are different from confidence intervals. With a prediction interval, we are predicting where a single new mouse’s brain weight will fall given its body weight. With a confidence interval, we are estimating the average brain weight of all mice with a specific body weight; the prediction interval is about one specific mouse.
predictionInt <- predict(MData, newData, interval = "predict")
predictionInt
## fit lwr upr
## 1 0.4131028 0.3848935 0.4413122
Using the prediction interval, we are 95% confident that a specific mouse with a body weight of 7.400 will have a brain weight between .38489 and .44131.
The prediction interval is wider than the confidence interval. That is because the prediction interval is used given a specific mouse and its weight. It is harder to predict for a specific mouse since there will be more variability. For the confidence interval, we are just trying to predict where the average brain weight of mice with that bodyweight will fall.
We can use a little matrix arithmetic in R to compute the widths of these two intervals (upper bound minus lower bound) and compare them.
confWidth <- confidenceInt %*% c(0, -1, 1) #conf interval width
confWidth
## [,1]
## 1 0.01270702
predictWidth <- predictionInt %*% c(0, -1, 1) #pred interval width
predictWidth
## [,1]
## 1 0.05641872
These two intervals both have small widths because the variability is so small. Let’s do some mathematical comparison:
confWidth - predictWidth
## [,1]
## 1 -0.0437117
So we can see that the confidence interval is narrower than the prediction interval by .0437117. This is what we expected.
We can also check that the prediction and confidence intervals are centered at the same place (the first entry of each result is the fitted value) by using the all.equal command as follows:
all.equal(confidenceInt[1], predictionInt[1])
## [1] TRUE
In chapter three we learned how to analyze data in R. In order to analyze data well, you need to be able to graph it well. We learned how to pick the response and predictor variables and how to use R to graph them appropriately. We also learned how to use R to fit a linear regression model, which summarizes the data in a coherent way and lets us estimate and predict outcomes for data points we do not have. After we create the linear regression model, we should test how good it is by looking at the residuals, which are the differences between the actual values and the fitted values. Looking at the residuals helps us understand how well our model describes the data. Once we are confident in our linear regression model, we can use it to make confidence and prediction intervals, which are important for predicting future outcomes. We also became familiar with the R summary() command, which helps us interpret our model quickly and efficiently.