Data analysis of FATRATS data for Intermediate Statistics using R

This R Markdown document highlights ways you can manipulate and plot data in R using both the console and R Markdown. The data are taken from a study of weight gain in rats fed varying sources of protein.

Importing Data

The first challenge in analyzing data with R is loading the data for use in the console. If the data come from an Excel document, first save the file as a csv, then import it into the R environment by running the code:

FATRATS <- read.csv("/Users/matthewhecking/Documents/Intermediate Stats using R/FATRATS.csv")

This code is specific to where you have the file saved. To find the path on a Mac, right-click the document and choose “Get Info”; the file's location is listed under “Where”.
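The import step can be tried without the original file. The sketch below writes a small made-up table to a temporary csv and reads it back; the column names imitate FATRATS but the values are invented, and `toy`, `tmp_csv` and `toy_read` are hypothetical names. `file.exists()` gives a quick check that a path is right before calling `read.csv()`.

```r
# Made-up rows that only imitate the layout of FATRATS.csv
toy <- data.frame(hilo   = c(1, 1, 2, 2),
                  source = c(1, 2, 1, 2),
                  weight = c(73, 98, 90, 76))

# Write to a temporary file so the example does not depend on any folder layout
tmp_csv <- tempfile(fileext = ".csv")
write.csv(toy, tmp_csv, row.names = FALSE)

# Confirm the path exists, then import it the same way as in the text
stopifnot(file.exists(tmp_csv))
toy_read <- read.csv(tmp_csv)
print(toy_read)
```

In an interactive session, `read.csv(file.choose())` opens a file picker instead of requiring a typed path.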

Once the data is imported and loaded, we can check it using the command:

head(FATRATS)

##   hilo source weight level animveg beefpork interanveg interbfprk
## 1    1      1     73     1       1        1          1          1
## 2    1      1    102     1       1        1          1          1
## 3    1      1    118     1       1        1          1          1
## 4    1      1    104     1       1        1          1          1
## 5    1      1     81     1       1        1          1          1
## 6    1      1    107     1       1        1          1          1

This gives us the first 6 rows of data from the file (the default for head), and shows us that the file has successfully been loaded into R.
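Beyond head(), a few other base R functions give a quick overview of a freshly loaded data frame. A short sketch on an invented data frame (`toy_rats` and its values are hypothetical, not the real FATRATS data):

```r
# Invented data: 8 rows, two columns
toy_rats <- data.frame(level  = rep(c(1, -1), each = 4),
                       weight = c(73, 102, 118, 104, 90, 76, 95, 86))

first_rows <- head(toy_rats)      # default is the first 6 rows
three_rows <- head(toy_rats, 3)   # ask for a different number of rows
print(first_rows)
print(dim(toy_rats))              # number of rows and columns
str(toy_rats)                     # type of each column
print(summary(toy_rats$weight))   # quartiles, extremes and the mean
```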

Basic Data Analysis

Now that the data set has been loaded, we can analyze it in R. To begin, we attach the data frame so that its columns can be referred to by name, using the command:

attach(FATRATS)
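attach() makes the columns visible by name, but attached columns can be masked by other objects that share a name, so many R users pass the data frame explicitly instead. A minimal sketch of that alternative, using an invented data frame (`toy_rats` and its values are assumptions for illustration):

```r
toy_rats <- data.frame(level  = rep(c(1, -1), each = 4),
                       weight = c(73, 102, 118, 104, 90, 76, 95, 86))

# with() evaluates an expression using the columns of the data frame
mean_by_level <- with(toy_rats, tapply(weight, level, mean))
print(mean_by_level)

# Modelling functions accept the data frame through the data argument
fit_toy <- glm(weight ~ level, data = toy_rats)
print(coef(fit_toy))
```

Either style gives the same fit; the commands in this document use the attached columns.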

Once the data frame is attached, we can use the glm function, which with its default settings fits an ordinary linear model, to get our first look at the data, by entering

fit = glm(FATRATS)
summary(fit)

## 
## Call:
## glm(formula = FATRATS)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -2.66e-15  -1.78e-15  -1.33e-15  -1.11e-15  -6.66e-16  
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  1.50e+00   1.43e-15  1.05e+15   <2e-16 ***
## source      -1.89e-16   2.55e-16 -7.40e-01     0.46    
## weight      -1.92e-17   1.50e-17 -1.28e+00     0.21    
## level       -5.00e-01   2.35e-16 -2.13e+15   <2e-16 ***
## animveg      9.84e-17   1.49e-16  6.60e-01     0.51    
## beefpork           NA         NA        NA       NA    
## interanveg   1.57e-16   1.54e-16  1.02e+00     0.31    
## interbfprk   2.27e-16   2.55e-16  8.90e-01     0.38    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.597e-30)
## 
##     Null deviance: 1.5000e+01  on 59  degrees of freedom
## Residual deviance: 1.3766e-28  on 53  degrees of freedom
## AIC: -3909
## 
## Number of Fisher Scoring iterations: 1

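This is not yet a useful model of weight gain. Because glm was given the whole data frame rather than a formula, R treats the first column, hilo, as the response and every other column as a predictor. Since hilo is an exact function of level (hilo = 1.5 - 0.5 × level, as the estimates show), the fit is perfect: the residuals are zero up to rounding error, and the very small p values for the intercept and level reflect that exact relationship rather than anything about weight. The NA for beefpork means that column is a linear combination of the other predictors. To get a meaningful model, we specify the response and the predictors explicitly in a formula. A first model of weight against level looks like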

fit = glm(weight ~ level)
summary(fit)

## 
## Call:
## glm(formula = weight ~ level)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -39.13   -8.73    1.13    9.52   26.40  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    87.87       1.94   45.41   <2e-16 ***
## level           7.27       1.94    3.76    4e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 224.7)
## 
##     Null deviance: 16199  on 59  degrees of freedom
## Residual deviance: 13031  on 58  degrees of freedom
## AIC: 499.1
## 
## Number of Fisher Scoring iterations: 2
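Two properties of the glm() calls above can be checked on a small invented data frame: when a data frame is passed in place of a formula, R uses the first column as the response and the rest as predictors, and glm() with its default gaussian family gives the same coefficients as lm(). The data and the names `toy`, `fit_glm` and `fit_lm` are assumptions for illustration.

```r
toy <- data.frame(y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2),
                  a = c(1, 2, 3, 4, 5, 6),
                  b = c(1, -1, 1, -1, 1, -1))

# The formula R derives from a data frame: first column ~ all the others
implied <- formula(toy)
print(implied)

# The default family is gaussian, so glm() and lm() agree on the estimates
fit_glm <- glm(y ~ a + b, data = toy)
fit_lm  <- lm(y ~ a + b, data = toy)
print(coef(fit_glm))
print(all.equal(coef(fit_glm), coef(fit_lm)))
```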

To extend the model, we can add animveg, beefpork, interanveg and interbfprk as predictors and summarize the fit as an analysis of variance table, by writing something like:

fit = aov(lm(weight ~ level + animveg + beefpork + interanveg + interbfprk))
summary(fit)

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## level        1   3168    3168   14.77 0.00032 ***
## animveg      1    264     264    1.23 0.27221    
## beefpork     1      2       2    0.01 0.91444    
## interanveg   1   1178    1178    5.49 0.02283 *  
## interbfprk   1      0       0    0.00 1.00000    
## Residuals   54  11586     215                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
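The table from aov() uses sequential sums of squares, so each term is assessed after the terms listed before it, and the same sums of squares can be obtained from anova() on an lm() fit. A hedged sketch on invented, deliberately unbalanced data (`toy` and its columns are hypothetical) illustrates both points; when predictors are correlated, as they are here, the order of the terms can change the table.

```r
toy <- data.frame(y = c(5, 7, 6, 9, 12, 11, 15, 14),
                  a = c(0, 0, 0, 1, 1, 1, 1, 1),
                  b = c(0, 0, 1, 0, 1, 1, 1, 1))

tab_aov   <- summary(aov(y ~ a + b, data = toy))[[1]]
tab_anova <- anova(lm(y ~ a + b, data = toy))
print(tab_aov)

# Both routes give the same sequential sums of squares
print(all.equal(tab_aov[["Sum Sq"]], tab_anova[["Sum Sq"]]))

# Reversing the order changes the sum of squares attributed to b
tab_rev <- anova(lm(y ~ b + a, data = toy))
print(tab_anova["b", "Sum Sq"])
print(tab_rev["b", "Sum Sq"])
```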

Graphing Data

After assessing significance numerically, it helps to examine the data visually. In R, this can be done with the interaction.plot function, which plots the mean weight for each combination of two factors:

interaction.plot(hilo, source, weight)

[Figure: interaction plot of mean weight against hilo, with one line per source]

The same information can be displayed differently by swapping the two factors, for example:

interaction.plot(source, hilo, weight)

[Figure: interaction plot of mean weight against source, with one line per hilo]
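interaction.plot() draws the mean of the response for each combination of the two factors, so the plotted values can be checked against a table of cell means. A small sketch with invented data (`toy`, its factor labels and its values are assumptions, not the FATRATS measurements), with axis labels and a legend title added:

```r
toy <- data.frame(level  = rep(c("high", "low"), each = 6),
                  source = rep(c("A", "B", "C"), times = 4),
                  weight = c(100, 86, 99, 104, 84, 101,
                             79, 84, 78, 81, 82, 80))

# Cell means: one value per level/source combination
cell_means <- with(toy, tapply(weight, list(level, source), mean))
print(cell_means)

# x.factor goes on the horizontal axis, one line per value of trace.factor
with(toy, interaction.plot(x.factor = level, trace.factor = source,
                           response = weight,
                           xlab = "Protein level", ylab = "Mean weight",
                           trace.label = "Source"))
```

Swapping x.factor and trace.factor gives the second arrangement of the plot.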