R Guide Chapter Three

Introduction to Dealing with Data Sets

When you first dig into a data set, it is good to do some preliminary analysis. This can include first loading the data and looking at the row names. It is also good to know how many pieces of data you are working with.

MiceData <- read.csv("MiceData.csv")

head(MiceData)

##   ï..LitterSize BodyWeight BrainWeight
## 1             3      9.447       0.444
## 2             4      9.780       0.436
## 3             4      9.155       0.417
## 4             4      9.613       0.429
## 5             5      8.850       0.425
## 6             5      9.610       0.434

MiceData

##    ï..LitterSize BodyWeight BrainWeight
## 1              3      9.447       0.444
## 2              4      9.780       0.436
## 3              4      9.155       0.417
## 4              4      9.613       0.429
## 5              5      8.850       0.425
## 6              5      9.610       0.434
## 7              6      8.298       0.404
## 8              6      8.543       0.439
## 9              7      7.400       0.409
## 10             7      8.335       0.429
## 11             8      7.040       0.414
## 12             8      7.253       0.409
## 13             9      6.600       0.387
## 14             9      7.260       0.433
## 15            10      6.305       0.410
## 16            10      6.655       0.405
## 17            11      7.183       0.435
## 18            11      6.133       0.407
## 19            12      5.450       0.368
## 20            12      6.050       0.401

numRows = nrow(MiceData)
numCols = ncol(MiceData)

numRows

## [1] 20

numCols

## [1] 3

names(MiceData)

## [1] "ï..LitterSize" "BodyWeight"    "BrainWeight"

attach(MiceData)

Here we have 3 columns and 20 rows. So our n = 20. The column names are also given when the names(MiceData) is called. ##Graphing the Data

Next, we may want to graph out data.

plot(MiceData)

For my RGuide, I am going to be trying to determine whether the body size of a mouse affects the brain size of the mouse.

Thus, the Brainweight depends on the BodyWeight.

Let’s plot this with axis:

plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
     xlab = "Bodyweight of Mouse", 
     main = "Brainweight and Bodyweight of Mice")

It’s always good to do a quick analysis after plotting the data. Let’s go ahead and do this.

The data does seem to be somewhat positively coorelated. That being said, the data is fairly sparse. There are only 20 observations so there is not a large amount of data to go off of. However, there is a trend upward. The more the mouse weighs, the bigger it’s brain seems to be. However, within this trend, the data is not packed tightly. There is still variation in the brain size based on the weight. Mice with the same weights can have different sized brains. All - in - all, I’d say we will expect to see a positive correlation and have some results to analyze.

Linear Regression

Here is the linear regression. We know that Brainweight depends on Bodyweight. Now, we let RStudio run the code for us.

MData <- lm(BrainWeight ~ BodyWeight)
MData

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
## 
## Coefficients:
## (Intercept)   BodyWeight  
##     0.33555      0.01048

This gives us the y-intercept of .33555 and the slope of .01048.

Thus: Brainweight = .33555 + .01048*(BodyWeight)

Great! Let’s see what adding this line to the graph look like. Make sure to run these two lines of code simultaneously or you will get an error message.

plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
     xlab = "Bodyweight of Mouse", 
     main = "Brainweight and Bodyweight of Mice")
abline(.33555,.01048)

This is our linear approximation of the data.

Using Linear Regression to Predict

Let’s use our linear approximation to predict data.

Let’s say we have a bodyweight of 6.655, what will we predict for the brainweight?

How about bodyweight of 7.183 or 9.61?

Brainweight = .33555 + .01048*(BodyWeight)

ExpectB1 = .33555 + .01048*(6.655)
ExpectB1

## [1] 0.4052944

ExpectB2 = .33555+.01048*(7.183)
ExpectB2

## [1] 0.4108278

ExpectB3 = .33555 + .01048*(9.61)
ExpectB3

## [1] 0.4362628

Now let’s compare to the observed brainweights with these bodyweights

MiceData[16,]

##    ï..LitterSize BodyWeight BrainWeight
## 16            10      6.655       0.405

ActualB1 = MiceData[16, "BrainWeight"]
ActualB1

## [1] 0.405

MiceData[17,]

##    ï..LitterSize BodyWeight BrainWeight
## 17            11      7.183       0.435

ActualB2 = MiceData[17, "BrainWeight"]
ActualB2

## [1] 0.435

MiceData[6,]

##   ï..LitterSize BodyWeight BrainWeight
## 6             5       9.61       0.434

ActualB3 = MiceData[6, "BrainWeight"]
ActualB3

## [1] 0.434

Let’s look at the error : actual - expected

ErrorB1 = ActualB1 - ExpectB1
ErrorB1

## [1] -0.0002944

ErrorB2 = ActualB2 - ExpectB2
ErrorB2

## [1] 0.02417216

ErrorB3 = ActualB3 - ExpectB3
ErrorB3

## [1] -0.0022628

Let’s take a look at the results: B1Error: -0.0002944

It looks like out model overestimated a little. Definetaly not by much. It is exact for the first 3 decimal places which is a pretty good predictor especially because there is some variability among the data.

B2Error: 0.02417216

This error is a bit worse. It looks like our model underestimated the weight of this mouse’s brain. We were not even corect for the

B3Error: -0.0022628

This error is pretty small and the estimation is exact for the first three decimal places. So this looks pretty good for this weight.

I would say our data does a decent job at estimating for these brain weights. Of course the numbers are so small it is hard to look at the numbers, so maybe a percent error would be good to look at in this case. Maybe (observed - expected)^2 divided by expected would be helpful since these numbers are so small.

Example:

percentDiff = ((ErrorB2))
percentDiff/(ExpectB2)

## [1] 0.05883769

perDiff = 100* percentDiff
perDiff

## [1] 2.417216

So our worst data point(of these three) is still only 2.417% off

Now, let’s look at some residuals.

myResiduals = MData$residuals
myResiduals

##             1             2             3             4             5 
##  0.0094437460 -0.0020462378 -0.0144959678 -0.0072960057 -0.0032994361 
##             6             7             8             9            10 
## -0.0022645644 -0.0185142376  0.0139180566 -0.0041028097  0.0060979864 
##            11            12            13            14            15 
##  0.0046701459 -0.0025621862 -0.0177184641  0.0213644508  0.0083732634 
##            16            17            18            19            20 
## -0.0002948878  0.0241714441  0.0071758977 -0.0246659673  0.0020457735

We want to see what the residuals look like by making a histogram. Are they normally distributed?

hist(myResiduals)

To make it even easier to analyze, let’s plot it, If the residuals are normal, they should be in a straight line. let’s take a look (again, make sure to plot these at the same time):

qqnorm(myResiduals)
qqline(myResiduals)

Interesting. It isn’t terrible, but not quite a straight line. There is some variation, but for the most part, I’d say they at least follow the line which is good.

Another thing we learned about is homoscedasticity. Homoscedasticity is when the error term is very similar for all the variables. In order to test this we can plot the original data and then plot the residuals with respect to bodyweight.

plot(MData$residuals ~ BodyWeight)
abline(0,0)

We don’t want the data to be tightly packed an any ares, and I would say that is true here. Therefore, there is no sign of heteroscedasticity (Where the data doesn’t follow a pattern and the error gets larger or smaller as the x variable changes).

Mean Square Error

Let’s look at the mean square error. That is the sqrt(sum(residuals)^2 / (n-2)):

sqrt(sum(MData$residuals)^2 / (18))

## [1] 3.577685e-19

Summary in R

Or we can just use the summary function in R to get all the data at once:

summary(MData)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.024666 -0.004901 -0.001171  0.007475  0.024171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.335548   0.017327  19.366 1.68e-13 ***
## BodyWeight  0.010480   0.002204   4.755 0.000158 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01308 on 18 degrees of freedom
## Multiple R-squared:  0.5567, Adjusted R-squared:  0.5321 
## F-statistic: 22.61 on 1 and 18 DF,  p-value: 0.0001583

Great! We see our y-int at .33555 and our slope at .01049. These numbers should be familiar from above when we found out linear regression equation.

Confidence Intervals

Let’s look at what R can generate for us when it comes to confidence intervals.

mod <-  lm(BrainWeight ~ BodyWeight, data = MiceData )
confint(mod, level=.9)

##                     5 %       95 %
## (Intercept) 0.305501930 0.36559330
## BodyWeight  0.006658234 0.01430263

So we can say that we are 90% sure that the brainweight of a mouse will fall between .006658234 and .01430263

Now, let’s look at the correlation between the brainweight and the bodyweight of a mouse. Again, the brainweight depends on the bodyweight of the mouse.

Correlation

cor(BrainWeight, BodyWeight)

## [1] 0.7461485

cor(BodyWeight, BrainWeight)

## [1] 0.7461485

When doing correlation coefficients in R, the order doesn’t matter. Here we see that there is a positive correlation between body weight and brain weight in mice. 1.0 would mean there is a perfect correlation and 0 would mean there is no correlation. Therefore, .746 means that they are pretty strongly positively correlated. This verifies the data that we have observed so we are satisfied with this numeric result.

Here is the line data done yet another way:

lm( formula = BrainWeight ~ BodyWeight)

## 
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
## 
## Coefficients:
## (Intercept)   BodyWeight  
##     0.33555      0.01048

Confidence Intervals

Let’s take a look at the difference between confidence intervals and prediction intervals

newData <- data.frame(BodyWeight = 7.400)
newData

##   BodyWeight
## 1        7.4

confidenceInt <- predict(MData, newData, interval = "confidence")
confidenceInt

##         fit       lwr       upr
## 1 0.4131028 0.4067493 0.4194563

This tells us that if the BodyWeight of the mouse is equal to 7.400 , we are confident that the average brainweight of all mice with the bodyweigh equal to 7.4000 will be between .4067 and .4195

The Actual BrainWeight from our data is .409

Prediction Intervals

predictionInt <- predict(MData, newData, interval = "predict")
predictionInt

##         fit       lwr       upr
## 1 0.4131028 0.3848935 0.4413122

Using the prediction interval, we are confident that given a mouse with a weight of 7.400, we are confident that the bodyweight of that specific mouse will fall between .38489 and .44131.

The prediction interval is wider than the confidence interval. That is because the prediction interval is used given a specific mouse and it’s weight. It is harder to predict for a specific mouse since there will be more variability. For the confidence interval, we are just trying to predict where the average brainweight of mice with that bodyweight will fall.

all.equal(confidenceInt[1], predictionInt[1])

## [1] TRUE

We can use this command to look at the widths of these two intervals and compare them.

confWidth <- confidenceInt %*% c(0, -1, 1)  #conf interval width
confWidth

##         [,1]
## 1 0.01270702

predictWidth <- predictionInt %*% c(0, -1, 1)  #pred interval width
predictWidth

##         [,1]
## 1 0.05641872

These two intervals both have small widths because the variability is so small. Let’s do some mathematical comparison:

confWidth - predictWidth

##         [,1]
## 1 -0.0437117

So we can see that the confidence interval is smaller than the prediction variable. This is what is expected.

R Guide Chapter Three

Kara Huyett

Introduction to Dealing with Data Sets

Linear Regression

Using Linear Regression to Predict

Mean Square Error

Summary in R

Confidence Intervals

Correlation

Confidence Intervals

Prediction Intervals