When you first dig into a data set, it is good to do some preliminary analysis. This can include first loading the data and looking at the row names. It is also good to know how many pieces of data you are working with.
MiceData <- read.csv("MiceData.csv")
head(MiceData)
## ï..LitterSize BodyWeight BrainWeight
## 1 3 9.447 0.444
## 2 4 9.780 0.436
## 3 4 9.155 0.417
## 4 4 9.613 0.429
## 5 5 8.850 0.425
## 6 5 9.610 0.434
MiceData
## ï..LitterSize BodyWeight BrainWeight
## 1 3 9.447 0.444
## 2 4 9.780 0.436
## 3 4 9.155 0.417
## 4 4 9.613 0.429
## 5 5 8.850 0.425
## 6 5 9.610 0.434
## 7 6 8.298 0.404
## 8 6 8.543 0.439
## 9 7 7.400 0.409
## 10 7 8.335 0.429
## 11 8 7.040 0.414
## 12 8 7.253 0.409
## 13 9 6.600 0.387
## 14 9 7.260 0.433
## 15 10 6.305 0.410
## 16 10 6.655 0.405
## 17 11 7.183 0.435
## 18 11 6.133 0.407
## 19 12 5.450 0.368
## 20 12 6.050 0.401
numRows = nrow(MiceData)
numCols = ncol(MiceData)
numRows
## [1] 20
numCols
## [1] 3
names(MiceData)
## [1] "ï..LitterSize" "BodyWeight" "BrainWeight"
attach(MiceData)
Here we have 3 columns and 20 rows. So our n = 20. The column names are also given when the names(MiceData) is called. ##Graphing the Data
Next, we may want to graph out data.
plot(MiceData)
For my RGuide, I am going to be trying to determine whether the body size of a mouse affects the brain size of the mouse.
Thus, the Brainweight depends on the BodyWeight.
Let’s plot this with axis:
plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
xlab = "Bodyweight of Mouse",
main = "Brainweight and Bodyweight of Mice")
It’s always good to do a quick analysis after plotting the data. Let’s go ahead and do this.
The data does seem to be somewhat positively coorelated. That being said, the data is fairly sparse. There are only 20 observations so there is not a large amount of data to go off of. However, there is a trend upward. The more the mouse weighs, the bigger it’s brain seems to be. However, within this trend, the data is not packed tightly. There is still variation in the brain size based on the weight. Mice with the same weights can have different sized brains. All - in - all, I’d say we will expect to see a positive correlation and have some results to analyze.
Here is the linear regression. We know that Brainweight depends on Bodyweight. Now, we let RStudio run the code for us.
MData <- lm(BrainWeight ~ BodyWeight)
MData
##
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
##
## Coefficients:
## (Intercept) BodyWeight
## 0.33555 0.01048
This gives us the y-intercept of .33555 and the slope of .01048.
Thus: Brainweight = .33555 + .01048*(BodyWeight)
Great! Let’s see what adding this line to the graph look like. Make sure to run these two lines of code simultaneously or you will get an error message.
plot(BodyWeight, BrainWeight, ylab = "Brainweight of Mouse",
xlab = "Bodyweight of Mouse",
main = "Brainweight and Bodyweight of Mice")
abline(.33555,.01048)
This is our linear approximation of the data.
Let’s use our linear approximation to predict data.
Let’s say we have a bodyweight of 6.655, what will we predict for the brainweight?
How about bodyweight of 7.183 or 9.61?
Brainweight = .33555 + .01048*(BodyWeight)
ExpectB1 = .33555 + .01048*(6.655)
ExpectB1
## [1] 0.4052944
ExpectB2 = .33555+.01048*(7.183)
ExpectB2
## [1] 0.4108278
ExpectB3 = .33555 + .01048*(9.61)
ExpectB3
## [1] 0.4362628
Now let’s compare to the observed brainweights with these bodyweights
MiceData[16,]
## ï..LitterSize BodyWeight BrainWeight
## 16 10 6.655 0.405
ActualB1 = MiceData[16, "BrainWeight"]
ActualB1
## [1] 0.405
MiceData[17,]
## ï..LitterSize BodyWeight BrainWeight
## 17 11 7.183 0.435
ActualB2 = MiceData[17, "BrainWeight"]
ActualB2
## [1] 0.435
MiceData[6,]
## ï..LitterSize BodyWeight BrainWeight
## 6 5 9.61 0.434
ActualB3 = MiceData[6, "BrainWeight"]
ActualB3
## [1] 0.434
Let’s look at the error : actual - expected
ErrorB1 = ActualB1 - ExpectB1
ErrorB1
## [1] -0.0002944
ErrorB2 = ActualB2 - ExpectB2
ErrorB2
## [1] 0.02417216
ErrorB3 = ActualB3 - ExpectB3
ErrorB3
## [1] -0.0022628
Let’s take a look at the results: B1Error: -0.0002944
It looks like out model overestimated a little. Definetaly not by much. It is exact for the first 3 decimal places which is a pretty good predictor especially because there is some variability among the data.
B2Error: 0.02417216
This error is a bit worse. It looks like our model underestimated the weight of this mouse’s brain. We were not even corect for the
B3Error: -0.0022628
This error is pretty small and the estimation is exact for the first three decimal places. So this looks pretty good for this weight.
I would say our data does a decent job at estimating for these brain weights. Of course the numbers are so small it is hard to look at the numbers, so maybe a percent error would be good to look at in this case. Maybe (observed - expected)^2 divided by expected would be helpful since these numbers are so small.
Example:
percentDiff = ((ErrorB2))
percentDiff/(ExpectB2)
## [1] 0.05883769
perDiff = 100* percentDiff
perDiff
## [1] 2.417216
So our worst data point(of these three) is still only 2.417% off
Now, let’s look at some residuals.
myResiduals = MData$residuals
myResiduals
## 1 2 3 4 5
## 0.0094437460 -0.0020462378 -0.0144959678 -0.0072960057 -0.0032994361
## 6 7 8 9 10
## -0.0022645644 -0.0185142376 0.0139180566 -0.0041028097 0.0060979864
## 11 12 13 14 15
## 0.0046701459 -0.0025621862 -0.0177184641 0.0213644508 0.0083732634
## 16 17 18 19 20
## -0.0002948878 0.0241714441 0.0071758977 -0.0246659673 0.0020457735
We want to see what the residuals look like by making a histogram. Are they normally distributed?
hist(myResiduals)
To make it even easier to analyze, let’s plot it, If the residuals are normal, they should be in a straight line. let’s take a look (again, make sure to plot these at the same time):
qqnorm(myResiduals)
qqline(myResiduals)
Interesting. It isn’t terrible, but not quite a straight line. There is some variation, but for the most part, I’d say they at least follow the line which is good.
Another thing we learned about is homoscedasticity. Homoscedasticity is when the error term is very similar for all the variables. In order to test this we can plot the original data and then plot the residuals with respect to bodyweight.
plot(MData$residuals ~ BodyWeight)
abline(0,0)
We don’t want the data to be tightly packed an any ares, and I would say that is true here. Therefore, there is no sign of heteroscedasticity (Where the data doesn’t follow a pattern and the error gets larger or smaller as the x variable changes).
Let’s look at the mean square error. That is the sqrt(sum(residuals)^2 / (n-2)):
sqrt(sum(MData$residuals)^2 / (18))
## [1] 3.577685e-19
Or we can just use the summary function in R to get all the data at once:
summary(MData)
##
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.024666 -0.004901 -0.001171 0.007475 0.024171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.335548 0.017327 19.366 1.68e-13 ***
## BodyWeight 0.010480 0.002204 4.755 0.000158 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01308 on 18 degrees of freedom
## Multiple R-squared: 0.5567, Adjusted R-squared: 0.5321
## F-statistic: 22.61 on 1 and 18 DF, p-value: 0.0001583
Great! We see our y-int at .33555 and our slope at .01049. These numbers should be familiar from above when we found out linear regression equation.
Let’s look at what R can generate for us when it comes to confidence intervals.
mod <- lm(BrainWeight ~ BodyWeight, data = MiceData )
confint(mod, level=.9)
## 5 % 95 %
## (Intercept) 0.305501930 0.36559330
## BodyWeight 0.006658234 0.01430263
So we can say that we are 90% sure that the brainweight of a mouse will fall between .006658234 and .01430263
Now, let’s look at the correlation between the brainweight and the bodyweight of a mouse. Again, the brainweight depends on the bodyweight of the mouse.
cor(BrainWeight, BodyWeight)
## [1] 0.7461485
cor(BodyWeight, BrainWeight)
## [1] 0.7461485
When doing correlation coefficients in R, the order doesn’t matter. Here we see that there is a positive correlation between body weight and brain weight in mice. 1.0 would mean there is a perfect correlation and 0 would mean there is no correlation. Therefore, .746 means that they are pretty strongly positively correlated. This verifies the data that we have observed so we are satisfied with this numeric result.
Here is the line data done yet another way:
lm( formula = BrainWeight ~ BodyWeight)
##
## Call:
## lm(formula = BrainWeight ~ BodyWeight)
##
## Coefficients:
## (Intercept) BodyWeight
## 0.33555 0.01048
Let’s take a look at the difference between confidence intervals and prediction intervals
newData <- data.frame(BodyWeight = 7.400)
newData
## BodyWeight
## 1 7.4
confidenceInt <- predict(MData, newData, interval = "confidence")
confidenceInt
## fit lwr upr
## 1 0.4131028 0.4067493 0.4194563
This tells us that if the BodyWeight of the mouse is equal to 7.400 , we are confident that the average brainweight of all mice with the bodyweigh equal to 7.4000 will be between .4067 and .4195
The Actual BrainWeight from our data is .409
predictionInt <- predict(MData, newData, interval = "predict")
predictionInt
## fit lwr upr
## 1 0.4131028 0.3848935 0.4413122
Using the prediction interval, we are confident that given a mouse with a weight of 7.400, we are confident that the bodyweight of that specific mouse will fall between .38489 and .44131.
The prediction interval is wider than the confidence interval. That is because the prediction interval is used given a specific mouse and it’s weight. It is harder to predict for a specific mouse since there will be more variability. For the confidence interval, we are just trying to predict where the average brainweight of mice with that bodyweight will fall.
all.equal(confidenceInt[1], predictionInt[1])
## [1] TRUE
We can use this command to look at the widths of these two intervals and compare them.
confWidth <- confidenceInt %*% c(0, -1, 1) #conf interval width
confWidth
## [,1]
## 1 0.01270702
predictWidth <- predictionInt %*% c(0, -1, 1) #pred interval width
predictWidth
## [,1]
## 1 0.05641872
These two intervals both have small widths because the variability is so small. Let’s do some mathematical comparison:
confWidth - predictWidth
## [,1]
## 1 -0.0437117
So we can see that the confidence interval is smaller than the prediction variable. This is what is expected.