The purpose of this lab was to extract a table from a research article, recreate the data in tabular form in Excel, and read it into R. This tutorial shows how to import the Excel data into RStudio and plot it with the R package ggplot2.
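1) Create Excel data file Using the table from the original research article, manually enter the data into an Excel spreadsheet, with one column per variable (in our case Year, Proportion, and Group). Save the spreadsheet as a comma-separated (.csv) file, since that is the format imported in the next step.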
2) Importing Excel data file into RStudio Under the Environment tab in RStudio, click “Import Dataset” and select the option to import data from a text file. Choose the file you saved your dataset in (a .csv file exported from Excel) and click Open. The import dialog presents several options for making sure your data are read correctly. You can name your dataset and choose the characters used as the separator, the decimal mark, and the quote character. The defaults are a comma as the separator, a period as the decimal mark, and a double quote as the quote character. These defaults work well for most datasets and are what we used in our lab. Click the Import button to finish. The table will appear in the script pane on the left-hand side of RStudio.
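The same import can also be done from the console with read.csv(), whose sep, dec, and quote arguments correspond to the separator, decimal, and quote options in the dialog. The sketch below is only an illustration: the file is a temporary one written by the code itself, and the values and the names csv_path and example_data are made up, not taken from our dataset.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
# (Hypothetical values; replace csv_path with the path to your own file.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Year,Proportion,Group",
             "2000,0.25,A",
             "2005,0.30,A",
             "2000,0.40,B",
             "2005,0.45,B"),
           csv_path)

# These argument values are the defaults of read.csv(), written out explicitly.
example_data <- read.csv(csv_path, header = TRUE, sep = ",", dec = ".", quote = "\"")

print(example_data)
```

If your file uses a different separator or decimal mark (for example, semicolons and decimal commas), change sep and dec to match.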
3) Review imported data to ensure proper variable display Visually review the data to confirm that the variables are named, organized, and displayed as you intend. The following is a link to our dataset:
https://www.dropbox.com/s/gztw8swqa4il1bj/Lab%201%20Stats.csv
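Besides looking at the table, a few console commands can help with this review. The sketch below uses a small made-up data frame (example_data, with hypothetical values) as a stand-in for an imported dataset; substitute the name you gave your own data.

```r
# Made-up stand-in for an imported dataset (hypothetical values).
example_data <- data.frame(
  Year = c(2000, 2005, 2000, 2005),
  Proportion = c(0.25, 0.30, 0.40, 0.45),
  Group = c("A", "A", "B", "B")
)

str(example_data)           # column names and the type R assigned to each column
head(example_data)          # the first few rows
summary(example_data)       # basic summaries, useful for spotting typing mistakes
unique(example_data$Group)  # the levels that will become separate lines

# The plot described later assumes numeric x and y variables.
stopifnot(is.numeric(example_data$Year), is.numeric(example_data$Proportion))
```

A numeric column that shows up as character in str() usually means a stray non-numeric entry slipped in during manual data entry.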
4) Access ggplot2 To load ggplot2, type the command library(ggplot2). This attaches the package to your current R session so that its functions can be used with the dataset you have just imported.
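library() only works if the package is already installed. A minimal sketch of one way to install it on first use is shown here; the CRAN mirror URL is our choice for the example, and any mirror should work.

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```

Installation is needed once per computer, while library(ggplot2) must be run in every new R session.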
5) To make a plot with ggplot2 Decide what type of plot would best represent the information in your dataset. For our purposes, we chose a line plot. The following directions are for a dataset containing a continuous outcome variable and two predictor variables. Our dataset, named “just.work.damn.it”, contains the following variables: “Proportion” as the outcome (y) variable and “Year” as the predictor (x) variable. The categorical variable “Group” separates the data by racial group, so that the relationship between Year and Proportion can be compared across groups. Each level of the Group variable is drawn as a separate line in the final plot.
The following code produces the line graph for our data:
ggplot(data=just.work.damn.it, aes(x=Year, y=Proportion, group=Group, shape=Group, color=Group)) + geom_line() + geom_point()
a) data= is where the name of your imported dataset goes.
b) aes stands for aesthetics, the visual properties of the graph. This is the part that maps variables in the dataset to those properties: the x and y axes, as well as size, color, fill, shape, and grouping.
c) x= names the predictor variable that is mapped to the x-axis (in our dataset, Year)
d) y= names the outcome variable that is mapped to the y-axis (in our dataset, Proportion)
e) group= names an additional predictor variable that splits the data into separate lines (in our dataset, Group, which records race)
f) shape= gives each level of the group variable its own point shape
g) color= gives each level of the group variable its own color
h) geom_line() sets the type of graph (in our case, a line graph)
i) geom_point() adds a point for each data value on each line
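To see what each piece contributes, the plot can also be built one layer at a time. The sketch below does this with a small made-up data frame (example_data, hypothetical values) laid out like ours; it is an illustration rather than the code we ran.

```r
library(ggplot2)

# Made-up data in the same layout as ours (hypothetical values).
example_data <- data.frame(
  Year = c(2000, 2005, 2010, 2000, 2005, 2010),
  Proportion = c(0.25, 0.30, 0.35, 0.40, 0.45, 0.50),
  Group = c("A", "A", "A", "B", "B", "B")
)

# Step 1: data and aesthetic mappings only; on its own this draws empty axes.
p <- ggplot(data = example_data,
            aes(x = Year, y = Proportion, group = Group,
                shape = Group, color = Group))

# Step 2: add one line per level of Group.
p <- p + geom_line()

# Step 3: mark each observation with a point.
p <- p + geom_point()

print(p)  # draw the finished plot
```

Printing p after step 1 or step 2 is a quick way to check the mappings before adding more layers.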
Note: All dataset and variable names are case sensitive and must be entered exactly as they appear in your data. Otherwise R will be unable to find the dataset or variable, and an error message will appear.
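A short illustration of how case sensitivity shows up in practice, using a made-up data frame (example_data is a hypothetical name, as is the misspelled Example_Data):

```r
# Made-up data frame for the demonstration (hypothetical values).
example_data <- data.frame(Year = c(2000, 2005), Proportion = c(0.25, 0.30))

# Correct capitalisation: prints the column.
print(example_data$Year)

# Wrong capitalisation of a column name: `$` quietly returns NULL instead of an error.
print(is.null(example_data$year))

# Wrong capitalisation of the dataset name: R raises an error, captured here as text.
result <- tryCatch(Example_Data, error = function(e) conditionMessage(e))
print(result)
```

Because the `$` case fails silently, checking names(example_data) is a safer way to confirm the exact spelling of each variable.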
The following is the complete code that produces our final line graph:
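# Load the plotting package
library("ggplot2")
# Read the CSV file exported from Excel (change the path to point to your own file)
just.work.damn.it <- read.csv("/Users/annagenchanok/Desktop/LAB 1 CSV FILE.csv")
# Line plot of Proportion by Year, with one line, shape, and color per Group
ggplot(data = just.work.damn.it, aes(x = Year, y = Proportion, group = Group,
                                     shape = Group, color = Group)) + geom_line() + geom_point()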