This R Markdown document focuses on how to import, analyze, and graph data using R, and how to include that work in an R Markdown document. The data were collected as part of a study measuring the rate of erosion of ballast stones supporting railroad tracks in the US.
The first challenge in using R is simply importing data in a usable format so that it can be analyzed in the console. The easiest way to go about this is to first bring the data set into the RStudio environment (if the data are stored in an Excel file, save it as a CSV file before importing). Next, the data set must be loaded into R by running the code:
ABRASION <- read.csv("/Users/matthewhecking/Documents/Intermediate Stats using R/ABRASION.csv")
This code is specific to where you have the file saved. To check its path, right-click on the file and choose “get info”; the file's path should be listed within. Because the file is comma-separated and its first line holds the column names, it is read with read.csv. Calling read.table with its default settings would instead read every line as a single text column named V1.
After the data set is loaded, we can confirm that R has read the file correctly by typing the command:
head(ABRASION, 5)
##   lab quarry abraded
## 1   1      1    2.70
## 2   1      1    2.80
## 3   1      1    2.50
## 4   1      1    3.40
## 5   1      1    2.55
This shows the first 5 rows of the file, with separate columns for lab, quarry, and abraded, and confirms that the file has been successfully loaded into R.
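To illustrate why the separator and header settings matter when importing a CSV file, here is a minimal sketch. It writes a few of the rows shown above to a temporary file (a hypothetical stand-in for ABRASION.csv, not the real data file) and reads it back two ways. The object names `tmp`, `wrong`, and `right` are made up for this example.

```r
# Write a tiny CSV to a temporary file (hypothetical stand-in for ABRASION.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("lab,quarry,abraded", "1,1,2.7", "1,1,2.8", "1,1,2.5"), tmp)

# read.table() defaults: whitespace separator, no header row.
# Every line, including the header, ends up in one text column called V1.
wrong <- read.table(tmp)

# read.csv() defaults: comma separator, first line used as column names.
right <- read.csv(tmp)

ncol(wrong)    # number of columns found with the defaults of read.table()
names(right)   # column names found by read.csv()
str(right)     # column types: lab, quarry, and abraded should all be numeric
```

If `str()` reports a single character column, the file was most likely read with the wrong separator.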
After confirming that the data set has been read into R correctly, we can begin to work with it. A convenient first step is the command “attach”, which is written as:
attach(ABRASION)
Once the data set is attached, its columns (lab, quarry, and abraded) can be referred to directly by name. (A quick reminder: whenever you are finished with a data set, make sure to detach it by typing “detach(ABRASION)”, so that R does not keep using the previous variables.)
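As a hedged aside, the same column access can be had without attaching anything, which avoids forgetting to detach later. The sketch below uses a small made-up data frame called `demo` (hypothetical values, not the study data) to compare `with()` against an `attach()`/`detach()` pair.

```r
# Small made-up data frame with the same column names as the study data
demo <- data.frame(lab     = c(1, 1, 2, 2),
                   quarry  = c(1, 2, 1, 2),
                   abraded = c(2.7, 3.1, 2.5, 3.4))

# with() evaluates the expression using the columns of demo, no attach() needed
mean_abraded <- with(demo, mean(abraded))

# The attach()/detach() pair gives the same result,
# but the detach() call must not be forgotten
attach(demo)
mean_attached <- mean(abraded)
detach(demo)

mean_abraded
mean_attached
```

Most modelling functions, including `glm`, also accept a `data` argument that serves the same purpose.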
The next step is to perform a simple analysis, such as a linear model fitted with glm (which uses the gaussian family by default). This is achieved by writing the commands:
fit = glm(abraded ~ lab + quarry)
summary(fit)
##
## Call:
## glm(formula = abraded ~ lab + quarry)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -2.041  -0.706  -0.287   0.388   4.120
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.62197    0.23230    11.3   <2e-16 ***
## lab          0.00773    0.03829     0.2     0.84
## quarry       0.46703    0.04484    10.4   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.232)
##
## Null deviance: 388.62 on 209 degrees of freedom
## Residual deviance: 254.97 on 207 degrees of freedom
## AIC: 644.7
##
## Number of Fisher Scoring iterations: 2
By looking at the p-values for this linear model, we can see how strongly each variable is related to the amount of abrasion. The p-value for lab (0.84) gives no evidence that abrasion depends on the lab, whereas the effect of quarry is highly significant (p < 2e-16). Note that this model contains only the two main effects, with no lab-by-quarry interaction term, and that lab and quarry are entered as numeric codes, so each is given a single slope.
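The following is a minimal sketch of two follow-ups to the summary above: pulling the p-values out of the fitted model rather than reading them off the printout, and treating lab and quarry as categories with `factor()`. It runs on simulated data (made-up values with the same column names, not the study data), and the names `demo`, `fit_num`, and `fit_fac` are hypothetical.

```r
# Simulated data (made-up values, not the study data) with the same column names
set.seed(1)
demo <- expand.grid(lab = 1:3, quarry = 1:4, rep = 1:5)
demo$abraded <- 2.5 + 0.4 * demo$quarry + rnorm(nrow(demo), sd = 0.5)

# Numeric coding: one slope per variable, as in the model above
fit_num <- glm(abraded ~ lab + quarry, data = demo)

# Extract the p-value column from the coefficient table
p_values <- coef(summary(fit_num))[, "Pr(>|t|)"]
p_values

# Factor coding: one coefficient for each non-reference lab and quarry
fit_fac <- glm(abraded ~ factor(lab) + factor(quarry), data = demo)
n_coef <- length(coef(fit_fac))  # 1 intercept + 2 labs + 3 quarries = 6

# Sequential F tests for lab and quarry as whole factors
anova(fit_fac, test = "F")
```

Whether numeric or factor coding is appropriate depends on whether the lab and quarry numbers are meaningful quantities or just labels; for identifiers such as these, factor coding is usually the safer assumption.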
This analysis can also be shown visually by graphing the data, using the command “interaction.plot”.
interaction.plot(lab, quarry, abraded, ylab = "grams of material lost during abrasion")
With this graphical representation, we can also spot unusual values within the data set, as is shown by the 5th quarry.
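By default, `interaction.plot` draws the mean response for each combination of its two factors. As a rough sketch, the table of means behind such a plot can be computed with `tapply` and inspected directly, which helps when checking whether one quarry really stands out. The data here are simulated (made-up values, not the study data), and `demo` and `cell_means` are hypothetical names.

```r
# Simulated data (made-up values, not the study data) with the same column names
set.seed(1)
demo <- expand.grid(lab = 1:3, quarry = 1:4, rep = 1:5)
demo$abraded <- 2.5 + 0.4 * demo$quarry + rnorm(nrow(demo), sd = 0.5)

# Mean abrasion for every lab/quarry combination:
# rows are labs, columns are quarries
cell_means <- with(demo, tapply(abraded, list(lab = lab, quarry = quarry), mean))
round(cell_means, 2)

# The same kind of plot as above, drawn from the data frame via with()
with(demo, interaction.plot(lab, quarry, abraded,
                            ylab = "grams of material lost during abrasion"))
```

Each line in the plot corresponds to one column of `cell_means`, so a quarry whose column is consistently higher or lower than the rest will appear as a line set apart from the others.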