Tutorial: Transferring Data Using RStudio and ggplot

Below is a step-by-step process for taking data presented in table form in an article, making it ready for R, and then graphing it using ggplot.

Step 1: Find an Article with the Data

This is the easiest step of the process. Once you have an article of interest, search out the table full of numbers. It will look something like this:

alt text


Step 2: Transfer Data into an R-Readable File

Unlike some statistics programs such as SPSS, R wants you to have your data set organized and ready to go before you start the program. This takes a little legwork. Microsoft Excel is a useful program for organizing the data in a spreadsheet before importing it into R.

First, open a new spreadsheet:

alt text

Second, transfer the data from the table into the spreadsheet. This is easier said than done. It takes an understanding of what the table is saying and how it is saying it.

For example, the table in Fig. 1 looks at the different intervention characteristics. It includes the names of those characteristics, the number and percentage of students receiving interventions with those characteristics, the IPND mean, the standard deviation, and the F and P values. Most of the data can be entered into Excel line by line, as seen below:

alt text

The major difference in the way I laid out this data is that I included the intervention characteristic grouping alongside each individual intervention characteristic. This will allow the data to be sorted and grouped in RStudio. When you save the spreadsheet, choose to save it as a .csv file instead of a .xlsx so it can be read by R.
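As a rough sketch of the layout described above, the snippet below builds a tiny table with one grouping column and one individual-characteristic column, then writes it out as a .csv. The column names match the ones used in the plotting code later in this tutorial. The group names, sub-level names, and means are hypothetical placeholders, not values from the article.

```r
# Hypothetical rows: the names and numbers are placeholders, not article data.
example <- data.frame(
  Characteristic.Main.Level = c("Group A", "Group A", "Group B"),
  Characteristic.Sub.Level  = c("Sub 1", "Sub 2", "Sub 3"),
  IPND.Mean                 = c(80, 85, 90)
)

# Write the table to a temporary .csv file, leaving out R's row numbers.
csv_path <- tempfile(fileext = ".csv")
write.csv(example, csv_path, row.names = FALSE)

# Print the raw file to see the one-row-per-characteristic layout.
writeLines(readLines(csv_path))
```

Each row repeats its grouping value, which is what lets R later treat all rows with the same grouping as one set.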


Step 3: Import Dataset into RStudio

Now it's time to import the data into RStudio. To do this, first open the program. Next, in the workspace area in the top right of the window, click Import Dataset and select the .csv data file. When the separate Import Dataset window opens, check over your data in the Data Frame box to make sure it is arranged correctly, and click Import.

alt text


Step 4: Assign the Data a Working Name

Now that the data is imported into the program, we want to give it a name so we can begin working with it. Right now our data is like a big ball of gas, and assigning it a working name will condense it into a liquid, or something workable. The code below assigns the data set to the name data. Within the quotation marks, insert either the web URL or the location of the data file on your system.

# This code will allow the reader to import the data and attach it to the
# term 'data' for use by R
data <- read.csv("~/Desktop/Homework 1 Data for R.csv")
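Before plotting, it can help to confirm that the import produced the columns you expect. The sketch below is self-contained, so it reads a few hypothetical placeholder rows from a text string instead of the file on the Desktop. The header wording with spaces is an assumption about how the spreadsheet was labeled.

```r
# Hypothetical stand-in for the exported spreadsheet; values are placeholders.
csv_text <- "Characteristic Main Level,Characteristic Sub Level,IPND Mean
Group A,Sub 1,80
Group A,Sub 2,85
Group B,Sub 3,90"

data <- read.csv(text = csv_text)

names(data)  # read.csv replaces the spaces in the headers with periods
str(data)    # shows the type R assigned to each column
head(data)   # shows the first few rows
```

This is why a header typed as "IPND Mean" in Excel is referred to as IPND.Mean in R code.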


Step 5: Download/Load the ggplot Package

Packages in R allow us to ask R to do specific things. One of these packages is ggplot, which is installed and loaded under the name ggplot2. It allows us to create very specific graphs based on our data. Whenever we want to use it, we have to load it in RStudio. To do this, look at the bottom right window of RStudio, find the ggplot2 package, and click the empty box beside it. That will load it into the system. If ggplot2 is not listed under your packages list, you have to install it. To do this, click Install Packages in the bottom right window, type ggplot2 into the packages line, and click install.

alt text
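The same install-and-load step can also be done from the console instead of the Packages pane. This is a minimal sketch. Note that the package is published under the name ggplot2.

```r
# Install ggplot2 only if it is not already available on this machine.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current session (needed in every new session).
library(ggplot2)
```

Installing only has to happen once per machine, while library() has to be run each time RStudio is restarted.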


Step 6: Time to Do Some Coding

Now comes the challenging part. Our data and ggplot are both loaded, so now we will use R code to develop a graph. The lines of code below are the pieces used in R and ggplot to build the graph step by step, taking our data from nothing to this:

alt text

ggplot(data = data, aes(x = Characteristic.Main.Level, y = IPND.Mean, color = Characteristic.Main.Level))

## (ggplot) tells the program that you will be making a plot

## (aes(x=Characteristic.Main.Level, y=IPND.Mean)) identifies that the
## Characteristic.Main.Level set of data will be along the X axis and the
## IPND.Mean will be on the Y axis

## (data=data) identifies the data set being used

## (color=Characteristic.Main.Level) will apply the color scheme based on
## that data set

geom_line(size = 4)

## (geom_line(size=4)) identifies the style of the graph in which the data
## will be displayed, and in this case the data will be presented as
## lines with a thickness of 4

alt text
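The pieces in this step are shown one at a time, but to run them they have to be joined with a plus sign at the end of each line. The sketch below joins the first two pieces. The data frame holds hypothetical placeholder values with the same column names as the tutorial data, so the block runs on its own.

```r
library(ggplot2)

# Hypothetical placeholder data; swap in the imported data set in practice.
data <- data.frame(
  Characteristic.Main.Level = c("Group A", "Group A", "Group B", "Group B"),
  Characteristic.Sub.Level  = c("Sub 1", "Sub 2", "Sub 3", "Sub 4"),
  IPND.Mean                 = c(80, 85, 90, 95)
)

# The base call and the geom are joined with `+`, which must end the line.
# Recent ggplot2 versions prefer linewidth = 4 over size = 4 for lines.
p <- ggplot(data = data, aes(x = Characteristic.Main.Level, y = IPND.Mean,
                             color = Characteristic.Main.Level)) +
  geom_line(size = 4)

print(p)
```

Because the X axis is a category, each grouping gets its own vertical line running between its lowest and highest mean.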


xlab("Intervention Characteristic") + ylab("Intervention Percentage Nonoverlapping Data (IPND)") + 
    ggtitle("Intervention Percentage of Nonoverlapping Data (IPND) by Study and Sample Characteristics") + 
    ylim(c(50, 100))

## (xlab('Intervention Characteristic') + ylab('Intervention Percentage
## Nonoverlapping Data (IPND)')) will change the titles of the X and Y
## axes

## (ggtitle('Intervention Percentage of Nonoverlapping Data (IPND) by
## Study and Sample Characteristics')) adds a title to the top of the
## graph

## (ylim(c(50,100))) changes the range of the Y axis. In the graph
## above, the Y axis starts at 70 and ends at 100. By using this code, it
## will start at 50 and end at 100.

alt text
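One thing to keep in mind with ylim is that values falling outside the chosen range are dropped from the plot with a warning, not just hidden, so the limits should contain every mean you want drawn. The sketch below applies the axis titles and limits to hypothetical placeholder data. The title is omitted to keep it short.

```r
library(ggplot2)

# Hypothetical placeholder data; every mean lies inside the 50 to 100 range.
data <- data.frame(
  Characteristic.Main.Level = c("Group A", "Group A", "Group B", "Group B"),
  Characteristic.Sub.Level  = c("Sub 1", "Sub 2", "Sub 3", "Sub 4"),
  IPND.Mean                 = c(80, 85, 90, 95)
)

# Axis titles and the Y range are added to the plot with `+`, like layers.
p <- ggplot(data = data, aes(x = Characteristic.Main.Level, y = IPND.Mean,
                             color = Characteristic.Main.Level)) +
  geom_line(size = 4) +
  xlab("Intervention Characteristic") +
  ylab("Intervention Percentage Nonoverlapping Data (IPND)") +
  ylim(c(50, 100))

print(p)
```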


geom_hline(aes(yintercept = 90), color = "gold", size = 1) + geom_text(aes(x = Characteristic.Main.Level, 
    y = IPND.Mean, label = Characteristic.Sub.Level), data = data, size = 5, 
    hjust = 0, vjust = 0, color = "black")

## (geom_hline(aes(yintercept=90), color='gold', size=1)) draws a
## horizontal line across the graph at the Y value of 90, colors it gold,
## and sets its thickness to 1

## (geom_text(aes(x=Characteristic.Main.Level, y=IPND.Mean,
## label=Characteristic.Sub.Level), data=data, size=5, hjust=0, vjust=0,
## color='black')) labels each data point with its
## Characteristic.Sub.Level value at that point's X and Y position,
## identifies the data set being used, sizes the text to 5, and colors it
## black.

## The values hjust and vjust change the horizontal and vertical
## alignment of the text relative to its data point

alt text
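To see what hjust and vjust do, it can help to draw the same labels with two different settings. This is a small sketch on hypothetical placeholder data. The geom_point layer is not part of the tutorial graph and is added here only to mark where each data point sits.

```r
library(ggplot2)

# Hypothetical placeholder data with the tutorial's column names.
data <- data.frame(
  Characteristic.Main.Level = c("Group A", "Group A", "Group B", "Group B"),
  Characteristic.Sub.Level  = c("Sub 1", "Sub 2", "Sub 3", "Sub 4"),
  IPND.Mean                 = c(80, 85, 90, 95)
)

# Points mark the exact position that each label is anchored to.
base <- ggplot(data, aes(x = Characteristic.Main.Level, y = IPND.Mean)) +
  geom_point()

# hjust = 0, vjust = 0: the label's lower-left corner sits on the point,
# so the text extends up and to the right.
lower_left <- base +
  geom_text(aes(label = Characteristic.Sub.Level), hjust = 0, vjust = 0)

# hjust = 1, vjust = 1: the label's upper-right corner sits on the point,
# so the text extends down and to the left.
upper_right <- base +
  geom_text(aes(label = Characteristic.Sub.Level), hjust = 1, vjust = 1)

print(lower_left)
print(upper_right)
```

A value of 0.5 for either setting centers the text on the point in that direction.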

theme(legend.position = "none", panel.grid.major = element_line(colour = "grey"), 
    panel.background = element_rect(fill = "white"), axis.line = element_line(size = 1, 
        colour = "black"))

## The (theme) command allows us to make fine tweaks to the graph.
## (legend.position='none') allows us to get rid of the legend on the side
## of the graph.

## The command (panel.grid.major = element_line(colour = 'grey')) changes
## the major grid lines to the color grey. The command (panel.background =
## element_rect(fill = 'white')) changes the background of the graph
## itself to the color white.

## The command (axis.line = element_line(size = 1, colour = 'black'))
## draws the axis lines in black with a thickness of 1.

alt text

We are very close to the final graph, but there is a key difference between the graph at the top of this step and what we have here: how the gold line is laid on the graph. In the top graph, the horizontal gold line is beneath the vertical lines, but in the graph above, it is on top. The reason for this difference is the location of the gold line code in relation to the other code. ggplot draws layers in the order they are added, so if we tell it to put the vertical lines on the graph first and the horizontal line second, it will draw the horizontal line on top of the vertical lines.
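The effect of layer order can be checked directly by building the plot both ways and listing the layers in drawing order. This is a minimal sketch on hypothetical placeholder data. The variable names are made up for illustration.

```r
library(ggplot2)

# Hypothetical placeholder data with the tutorial's column names.
data <- data.frame(
  Characteristic.Main.Level = c("Group A", "Group A", "Group B", "Group B"),
  Characteristic.Sub.Level  = c("Sub 1", "Sub 2", "Sub 3", "Sub 4"),
  IPND.Mean                 = c(80, 85, 90, 95)
)

base <- ggplot(data = data, aes(x = Characteristic.Main.Level, y = IPND.Mean,
                                color = Characteristic.Main.Level))

# Gold line added last: it is drawn over the vertical lines.
gold_on_top <- base +
  geom_line(size = 4) +
  geom_hline(aes(yintercept = 90), color = "gold", size = 1)

# Gold line added first: it is drawn beneath the vertical lines.
gold_beneath <- base +
  geom_hline(aes(yintercept = 90), color = "gold", size = 1) +
  geom_line(size = 4)

# List each plot's geoms in drawing order (first listed = bottom layer).
order_on_top  <- vapply(gold_on_top$layers,
                        function(layer) class(layer$geom)[1], character(1))
order_beneath <- vapply(gold_beneath$layers,
                        function(layer) class(layer$geom)[1], character(1))
print(order_on_top)
print(order_beneath)
```

Moving the geom_hline call ahead of geom_line is the only change needed to put the gold line underneath.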

So, in conclusion, I present to you the final string of code and the final graph. I hope this tutorial has been helpful. Remember to tip your waitress.

ggplot(data = data, aes(x = Characteristic.Main.Level, y = IPND.Mean, color = Characteristic.Main.Level)) + 
    geom_hline(aes(yintercept = 90), color = "gold", size = 1) + 
    geom_line(size = 4) + 
    xlab("Intervention Characteristic") + ylab("Intervention Percentage Nonoverlapping Data (IPND)") + 
    ggtitle("Intervention Percentage of Nonoverlapping Data (IPND) by Study and Sample Characteristics") + 
    ylim(c(50, 100)) + 
    geom_text(aes(x = Characteristic.Main.Level, y = IPND.Mean, label = Characteristic.Sub.Level), 
        data = data, size = 5, hjust = 0, vjust = 0, color = "black") + 
    theme(legend.position = "none", panel.grid.major = element_line(colour = "grey"), 
        panel.background = element_rect(fill = "white"), axis.line = element_line(size = 1, 
            colour = "black"))


From this:

alt text

To this:

alt text