Originally delivered Thurs. Sep 01, 2016
This document is a walkthrough of a basic R session. It assumes that a file called “Lab1_data_PA_eagles.xlsx” is saved to your desktop. R does not normally read data directly from standard Excel files, so in this walkthrough we first re-save the data in an R-compatible format, a “csv” file, called “Lab1_data_PA_eagles.csv”.
If the file “Lab1_data_PA_eagles.xlsx” is not available, the necessary data can be generated using the code below without having to work with data from Excel.
#Generate data found in 1_Lab1_data_PA_eagles_XLS.xlsx
#Years for which the number of breeding pairs of eagles in Pennsylvania, USA, is known
year <- c(1980,1981,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000, 2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,NA)
#Number of pairs of eagles each year
eagles <- c(3,NA,NA,7,9,15,17,19,20,20,23,29,43,51,55,64,69,NA,96,100,NA,NA,NA,NA, NA,NA,NA,252,277,NA)
#create an R dataframe
eagles <- data.frame(year = year,
eagles = eagles)
write.csv(eagles, file ="Lab1_data_PA_eagles.csv")
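A note on write.csv(): by default it also writes the row numbers as an unnamed first column, and read.csv() later reports that column under the name “X”. Here is a minimal sketch of the difference, assuming a small made-up dataframe called “toy” and temporary files (neither is part of the lab materials):

```r
# Toy dataframe for illustration only (not the full eagle data)
toy <- data.frame(year = c(1980, 1990, 1991),
                  eagles = c(3, 7, 9))

# Temporary files, so nothing on the desktop gets overwritten
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(toy, file = with_rows)                        # default: row numbers are written
write.csv(toy, file = without_rows, row.names = FALSE)  # row numbers are left out

names(read.csv(with_rows))     # has an extra first column called "X"
names(read.csv(without_rows))  # only "year" and "eagles"
```

The rest of this walkthrough keeps the default, which is why a column called X shows up when the csv file is read back in.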
Some examples
see http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html
Save the file “Lab1_data_PA_eagles.xlsx” to your computer’s desktop. Today we will be using the desktop as the “working directory”.
In Excel, follow these steps
The data is now in a format that can be loaded into R.
Main parts of RStudio
You can change the locations of the windows. I prefer to have my console in the lower left, Plots etc. in the upper left, and the source viewer on the right.
These elements can be moved by going to Tools, Global Options, then Pane Layout.
We will now take the data we saved as a .csv file and load it into R.
Follow these steps
* “Session”
* “Set working directory”
* “Choose Directory” - select your computer’s desktop
* Select directory & click “Select folder”
* The command “setwd” shows up followed by the location of the directory you selected
You can set your working directory to be anywhere on the computer. It is essential to make sure that the csv file you want to load into R is in your working directory.
(Slide ~30 of ppt)
This should be your desktop, where we saved the csv file with the data.
getwd()
## [1] "C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab1_intro_to_R"
If you have not saved the file Lab1_data_PA_eagles.xlsx as a csv file to your desktop and/or your working directory, do so now.
getwd()
## [1] "C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab1_intro_to_R"
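If you are not sure whether the csv file really is in the working directory, you can ask R directly. This is only a sketch of one way to check, using the base R commands file.exists() and list.files(); the file name is the one used throughout this lab.

```r
# Where is R currently looking for files?
getwd()

# Is the csv file in that folder? TRUE means read.csv() will be able to find it.
file.exists("Lab1_data_PA_eagles.csv")

# Show only the csv files in the working directory
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, either the file was saved somewhere else or the working directory is not the folder you think it is.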
Try these commands in the source viewer
* date()
* ls()
Note that “l” = lower case “L”
#Today's date
date()
## [1] "Tue Sep 06 12:27:23 2016"
#Probably won't return anything interesting:
ls()
## [1] "eagles" "year"
“ls” means “list”. More on this command later.
Now try just “date” without the parentheses. What happens?
date
## function ()
## .Internal(date())
## <bytecode: 0x06c9a434>
## <environment: namespace:base>
This will return what appears to be nonsense: without the parentheses, R prints the definition of the command instead of running it.
NB: parentheses are key to the execution of commands in R!
When things don’t work, one of the first things to check is the parentheses!
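To see the difference more directly, here is a small sketch using two base R checks, is.function() and is.character(). They are not needed for the lab; they just confirm what R is handing back in each case.

```r
# Without parentheses, R hands back the command (a "function") itself
is.function(date)     # TRUE

# With parentheses, R runs the command and hands back its result
is.character(date())  # TRUE: date() returns the date as text
```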
Type “date()” & instead of clicking “Run” put the cursor right after the last parenthesis and press “Ctrl+Enter” on the keyboard (the “+” means “at the same time”, not the “+” key)
date()
## [1] "Tue Sep 06 12:27:23 2016"
Now execute the command “list.files()” using “Ctrl+Enter” on the keyboard
list.files()
This tells you what files are saved in your working directory (wd). There should be the original xlsx Excel file & the csv file you made using “save as” (and anything else on your desktop). We will now load the .csv file into R.
Copy and paste the CSV file name from the console into the source viewer, then execute the command “read.csv(file = "Lab1_data_PA_eagles.csv")”. You can type it, but you must be careful to have NO TYPOS. R is unforgiving when it comes to typos.
read.csv(file = "Lab1_data_PA_eagles.csv")
## X year eagles
## 1 1 1980 3
## 2 2 1981 NA
## 3 3 1989 NA
## 4 4 1990 7
## 5 5 1991 9
## 6 6 1992 15
## 7 7 1993 17
## 8 8 1994 19
## 9 9 1995 20
## 10 10 1996 20
## 11 11 1997 23
## 12 12 1998 29
## 13 13 1999 43
## 14 14 2000 51
## 15 15 2001 55
## 16 16 2002 64
## 17 17 2003 69
## 18 18 2004 NA
## 19 19 2005 96
## 20 20 2006 100
## 21 21 2007 NA
## 22 22 2008 NA
## 23 23 2009 NA
## 24 24 2010 NA
## 25 25 2011 NA
## 26 26 2012 NA
## 27 27 2013 NA
## 28 28 2014 252
## 29 29 2015 277
## 30 30 NA NA
You must have the file name in quotation marks and include the .csv. Any small error will cause things to not work.
Here are examples of mistakes that won’t work
#Incorrect - none of these will work
read.csv(file = Lab1_data_PA_eagles.csv) #missing quotes " "
read.csv(file = "Lab1_data_PA_eagles_CSV") #missing .csv
read.csv(file "Lab1_data_PA_eagles.csv") #missing =
Note that R returns error messages, but they aren’t necessarily very helpful in figuring out what the problem actually is.
Now type this: “eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")”. What happens when you execute this command?
eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")
It might actually look like not much has happened. But that’s good! It means the data has successfully been loaded into R. You have “assigned” the data from your file to the “object” named “eagles”.
“<-” is called the “assignment operator”. It is a special type of R command.
“<” usually shares a key with the comma ( , ). Type “shift + ,” to get it.
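Here is a minimal sketch of assignment with something simpler than a data file. The object name “x” is made up for the example.

```r
# Assign the number 5 to an object called x
x <- 5

# Typing the name of an object prints its contents
x

# An object can be used in later commands
x * 2

# Assigning to the same name again replaces the old contents, with no warning
x <- 10
x
```

The last step is worth remembering: re-using a name (such as “eagles”) silently replaces whatever was stored under that name before.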
If you type just “eagles” and execute it as a command what happens?
eagles
## X year eagles
## 1 1 1980 3
## 2 2 1981 NA
## 3 3 1989 NA
## 4 4 1990 7
## 5 5 1991 9
## 6 6 1992 15
## 7 7 1993 17
## 8 8 1994 19
## 9 9 1995 20
## 10 10 1996 20
## 11 11 1997 23
## 12 12 1998 29
## 13 13 1999 43
## 14 14 2000 51
## 15 15 2001 55
## 16 16 2002 64
## 17 17 2003 69
## 18 18 2004 NA
## 19 19 2005 96
## 20 20 2006 100
## 21 21 2007 NA
## 22 22 2008 NA
## 23 23 2009 NA
## 24 24 2010 NA
## 25 25 2011 NA
## 26 26 2012 NA
## 27 27 2013 NA
## 28 28 2014 252
## 29 29 2015 277
## 30 30 NA NA
This should be the exact same data that was in the original Excel file. We have saved these data into an “R object” that we can now work with.
Now execute the list command ls(). You should now see “eagles”.
This means that the object you assigned your data to is now in your “workspace.”
ls()
(slide 51ish)
Look at the “eagles” object using the summary() command. DO NOT put the “>” in front of it. The “>” is just part of the readout from the console.
summary(eagles)
## X year eagles
## Min. : 1.00 Min. :1980 Min. : 3.00
## 1st Qu.: 8.25 1st Qu.:1994 1st Qu.: 18.00
## Median :15.50 Median :2001 Median : 29.00
## Mean :15.50 Mean :2001 Mean : 61.53
## 3rd Qu.:22.75 3rd Qu.:2008 3rd Qu.: 66.50
## Max. :30.00 Max. :2015 Max. :277.00
## NA's :1 NA's :11
Check how big the eagles object is using the dim() command (“dim” is short for “dimension”). The output gives the number of rows first, then the number of columns.
dim(eagles)
## [1] 30 3
Look at the top of the eagles object
head(eagles)
## X year eagles
## 1 1 1980 3
## 2 2 1981 NA
## 3 3 1989 NA
## 4 4 1990 7
## 5 5 1991 9
## 6 6 1992 15
Look at the bottom of the eagles object
tail(eagles)
## X year eagles
## 25 25 2011 NA
## 26 26 2012 NA
## 27 27 2013 NA
## 28 28 2014 252
## 29 29 2015 277
## 30 30 NA NA
Try executing these commands directly from the console & also from the source viewer using the “Ctrl+Enter” shortcut (where the “+” means “at the same time”)
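A few other base R commands are handy for a quick look at a dataframe. This sketch uses a small made-up dataframe called “toy” so that it runs on its own; you could use “eagles” in its place.

```r
# Toy dataframe for illustration only
toy <- data.frame(year = c(1980, 1981, 1990),
                  eagles = c(3, NA, 7))

nrow(toy)   # number of rows
ncol(toy)   # number of columns
names(toy)  # the column names
str(toy)    # compact overview: the type of each column and its first few values
```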
Call up the help information for these commands
?dim
## starting httpd help server ...
## done
Unfortunately, the help files for R are designed w/ programmers in mind and are typically very encyclopedic. You can usually get some useful information from them, but when you are a beginner it can often be hard to find what you need.
You can often find information online, e.g., by googling “R dim command”. Usually the R help file will come up. Other information will also show up. For very basic R commands this might not always be productive, but for things related to stats, plotting, and programming there is frequently lots of information. Also check out the website stackoverflow.com
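Two base R commands can sometimes get you what you need faster than reading a whole help file. This is just a sketch of how they might be used: args() shows the options a command accepts, and apropos() lists commands whose names contain a piece of text.

```r
# What options ("arguments") does sd() accept?
args(sd)

# Which commands have "median" somewhere in their name?
apropos("median")
```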
If you want to make a plot of the number of eagles over time in PA, what command do you think will do it? Many R commands use fairly simple language.
plot(eagles ~ year, data = eagles)
One thing that makes R tricky is that there are multiple ways to accomplish the exact same thing.
Try typing in these different commands. The following commands all plot the exact same data, just with different colors (via “col = …”).
One consequence of this fact is that different books/instructors/etc. will use slightly different approaches, making it sometimes tricky to compare code written by different people.
plot(eagles$eagles ~ eagles$year,col = 2)
plot(eagles[,"eagles"] ~ eagles[,"year"], col = 3)
plot(eagles[,3] ~ eagles[,2], col = 4) #column 3 is eagles, column 2 is year (column 1 is X)
plot(eagles$year,eagles$eagles, col = 5)
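The different notations above are all ways of pulling a column out of a dataframe. A small sketch, assuming a made-up dataframe called “toy” with the same column layout (X, year, eagles), shows that “$”, a column name in brackets, and a column number in brackets all return identical data:

```r
# Toy dataframe for illustration only
toy <- data.frame(X = 1:3,
                  year = c(1990, 1991, 1992),
                  eagles = c(7, 9, 15))

by_dollar <- toy$eagles       # column picked out with $
by_name   <- toy[, "eagles"]  # column picked out by name
by_number <- toy[, 3]         # column picked out by position (eagles is the 3rd column)

identical(by_dollar, by_name)    # TRUE
identical(by_dollar, by_number)  # TRUE
```

Picking columns by number is the most fragile of the three, because the numbers change whenever a column is added or removed.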
R plots can be customized almost infinitely. Type these different commands into the source viewer & execute them.
plot(eagles ~ year, data = eagles, col = 2)
plot(eagles ~ year, data = eagles, col = 2, pch = 2)
plot(eagles ~ year, data = eagles, col = 2, pch = 2, xlab = "Year of census")
plot(eagles ~ year, data = eagles, col = 2, pch = 2, xlab = "Year of census",
ylab = "Number of eagles")
The parts of a command can be on separate lines. Be mindful of the commas though!
Here, each part of the command is on a separate line. This will produce the exact same plot as before.
plot(eagles ~ year,
data = eagles,
col = 2,
pch = 2,
xlab = "Year of census",
ylab = "Number of eagles")
You can even include blank lines
plot(eagles ~ year,
data = eagles,

col = 2,
pch = 2,

xlab = "Year of census",
ylab = "Number of eagles")
There are many commands for summarizing data in R, such as mean and median. However, you have to be careful about NAs!
mean(eagles$eagles)
## [1] NA
So, you asked for the mean of the eagles data, and you got NA. That’s really annoying.
Try this
mean(eagles$eagles, na.rm = T)
## [1] 61.52632
“na.rm = T” is short for “na.rm = TRUE”, which means “should I remove the NAs? Yes, do it.”
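Here is a minimal sketch of what na.rm does, using a made-up vector called “counts” that is small enough to check by hand. It also shows how to count the NAs with is.na().

```r
# Toy data for illustration only: four values, one of them missing
counts <- c(3, NA, 7, 9)

mean(counts)                # NA, because one value is missing
mean(counts, na.rm = TRUE)  # the mean of 3, 7 and 9 only

sum(is.na(counts))   # how many values are missing (1)
sum(!is.na(counts))  # how many values are NOT missing (3)
```

Writing out TRUE is a little safer than T: TRUE is reserved by R, while T is an ordinary object name that can be overwritten by accident.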
#R is case sensitive: "Mean" with a capital "M" is not the same command as "mean"
Mean(eagles$eagles, na.rm = T)
#This returns the error message "Error in UseMethod("Mean") :
# no applicable method for 'Mean' applied to an object of class "c('double', 'numeric')""
Note that the R error message is not very helpful : (
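If you want to look at an error message without your code stopping, base R’s tryCatch() can catch it. This is only a sketch; the exact wording of the message depends on which packages are loaded in your session (with no extra packages loaded, R typically reports that it could not find the function).

```r
# Try the misspelled command and keep the error message as text
msg <- tryCatch(
  Mean(c(1, 2, 3)),                        # capital "M": not the base R command
  error = function(e) conditionMessage(e)  # if there is an error, return its message
)
msg

# The correctly spelled command works
mean(c(1, 2, 3))
```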
mean(eagles$eagles, na.rm = T)
## [1] 61.52632
median(eagles$eagles, na.rm = T)
## [1] 29
min(eagles$eagles, na.rm = T)
## [1] 3
max(eagles$eagles, na.rm = T)
## [1] 277
summary(eagles$eagles)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.00 18.00 29.00 61.53 66.50 277.00 11
sd(eagles$eagles, na.rm = T)
## [1] 77.1149
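The standard error (se) is a very common summary statistic, but for some reason there is not a function for it in base R.
Use the sd() command and the square root command sqrt(). Divide the standard deviation by the square root of the sample size, which is the number of values that are not NA (here 30 rows - 11 NAs = 19).
sd(eagles$eagles, na.rm = T)/sqrt(19)
## [1] 17.69137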
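Typing the sample size by hand is easy to get wrong when a column contains NAs. One way around that, sketched here with a hypothetical helper called “std_error” (it is not part of base R or of this lab’s files), is to let R count the non-missing values:

```r
# Hypothetical helper: standard error of the mean, ignoring NAs
std_error <- function(x) {
  n <- sum(!is.na(x))            # sample size = number of non-missing values
  sd(x, na.rm = TRUE) / sqrt(n)  # standard deviation divided by sqrt(n)
}

# Toy data for illustration only: eight numbers and one NA
toy <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
std_error(toy)
```

With the lab data you could then type std_error(eagles$eagles) instead of counting the NAs yourself.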
Here is data for West Virginia as an example. I will add it to a column in the spreadsheet called “eagles.WV”
Things work best when your Excel file is “clean” & only has exactly what you want in it. Any extra, accidental typing can cause problems or make things confusing. A good practice is to always highlight cells to the right of and below your data, right click & select “Delete”. This will remove any accidental typing that occurred. Do this to the cells below your data also.
Reload the data; be sure to include the “.csv” at the end. Use this code: “eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")”. NOTE: I changed the name of my file to include “_w_2_states” so that I wouldn’t overwrite the original file. Don’t use the code with that name unless you changed your file name to the exact same thing.
#Use this code, w/o the "#" in front of it
# eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")
#NOTE: I changed the name of the file to include "_w_2_states" so that I wouldn't overwrite the original file. Don't use this code unless you changed the file name to the exact same thing
eagles <- read.csv(file = "Lab1_data_PA_eagles_w_2_states.csv")
Type ls() to see what is now in your workspace
ls()
## [1] "eagles" "year"
Look at the re-loaded eagles data object
summary(eagles)
## year eagles eagles.WV
## Min. :1980 Min. : 3.00 Min. : 0.000
## 1st Qu.:1994 1st Qu.: 18.00 1st Qu.: 4.000
## Median :2001 Median : 29.00 Median : 5.000
## Mean :2001 Mean : 61.53 Mean : 6.882
## 3rd Qu.:2008 3rd Qu.: 66.50 3rd Qu.:10.000
## Max. :2015 Max. :277.00 Max. :19.000
## NA's :10 NA's :12
dim(eagles)
## [1] 29 3
head(eagles)
## year eagles eagles.WV
## 1 1980 3 0
## 2 1981 NA 1
## 3 1989 NA NA
## 4 1990 7 2
## 5 1991 9 3
## 6 1992 15 4
tail(eagles)
## year eagles eagles.WV
## 24 2010 NA NA
## 25 2011 NA NA
## 26 2012 NA NA
## 27 2013 NA NA
## 28 2014 252 NA
## 29 2015 277 NA
Use “col = 1” to set PA to black
plot(eagles ~ year, data = eagles, col = 1)# col = 1 sets point color to black
We’ll add points to the first graph using the command “points()”. It’s very similar to the plot command. Be sure to change the name of the column of data being graphed. Use “col = 2” within the “points()” command to set the other state to red
#Main plot
plot(eagles ~ year, data = eagles, col = 1)# col = 1 sets point color to black
#adding points to graph with points()
points(eagles.WV ~ year,
data = eagles,
col = 2) #WV data; set colors to red using col = 2
In MS Word, it works well to right click in a document and click on the clipboard icon. The shortcut “Ctrl+V” should work in Word and PowerPoint
Plots can be modified using many different arguments contained within the main “plot()” command.
Change the type of point used for each state via pch =
#The main plot
plot(eagles ~ year, data = eagles,
col = 1, #col = 1 for black
pch = 2) #pch = 2 for triangles
#Add points for the other data
points(eagles.WV ~ year ,
data = eagles,
col = 2, #col = 2 for red
pch = 4) #pch = 4 for Xs
Plots should always have legends. Legends are highly customizable in R but can require a bit of coding. Here is how you could do it. This will be covered again in later labs in more detail. One thing to note that will be discussed later is the use of the “c(…)” in the code.
#The main plot
plot(eagles ~ year, data = eagles,
col = 1, #color of point
pch = 2) #shape of point
#Add new data with col = 2 and pch = 4
points(eagles.WV ~ year , data = eagles,
col = 2, #color of point
pch = 4) #shape of point
#Add a legend
legend("topleft", #where the legend goes
legend = c("PA","WV"), #the text the legend contains
col = c(1,2), #colors of the points in the legend
pch = c(2,4) ) #symbols of the points
Most work in R takes data from a spreadsheet and loads it using read.csv(). It is possible to also enter data manually. This is often useful for class exercises where small “toy” datasets are used that are easy to manage. We’ll enter data “by hand” to add a third state of data to our figure.
In R terms, we are creating a “vector” of data using the “c()” command. c() is a very common command that we will discuss more later
eagles.OH <- c(NA,NA,NA,16,19,20,24,26,30,33,38,47,57,63,73,79,88,NA,125,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
#Length of the OH eagle vector
length(eagles.OH)
## [1] 29
#Dimensions of the OH vector
#this returns NULL for a very R-ish reason: a vector has a length but no dimensions
dim(eagles.OH)
## NULL
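The difference between length() and dim() comes down to the kind of object. Here is a small sketch using a made-up vector “v” and a made-up dataframe “df”:

```r
# Toy vector for illustration only
v <- c(16, 19, NA, 24)

length(v)        # 4: a vector has a length...
is.null(dim(v))  # TRUE: ...but no dimensions, so dim() returns NULL

# The same numbers stored as a column of a dataframe
df <- data.frame(v = v)
dim(df)          # 4 rows and 1 column
```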
Note again that we have to use the cryptic “na.rm = T”
#This won't work
mean(eagles.OH)
## [1] NA
#This will b/c we include "na.rm = T"
mean(eagles.OH, na.rm = T)
## [1] 49.2
median(eagles.OH, na.rm = T)
## [1] 38
min(eagles.OH, na.rm = T)
## [1] 16
max(eagles.OH, na.rm = T)
## [1] 125
summary(eagles.OH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 16.0 25.0 38.0 49.2 68.0 125.0 14
sd(eagles.OH, na.rm = T)
## [1] 31.34873
The standard error can be calculated by hand
#Note that the length of the OH vector is 29, but many of those values are "NA" so that the actual sample size is only 15
sd(eagles.OH, na.rm = T)/sqrt(15)
## [1] 8.094207
Plots can be easily made when data is in the form of a vector
### Make a boxplot
#What is the thick black line?
boxplot(eagles.OH)
#Make a histogram
hist(eagles.OH)
We currently have 2 datasets living in R. One is a dataframe that we imported from a spreadsheet in CSV form. The other is a “vector” of data from OH. We can combine these using a command called cbind() for “column bind”.
eagles <- cbind(eagles, #original dataframe from csv
eagles.OH) #OH vector
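For cbind() to line things up properly, the vector needs one value for each row of the dataframe, in the same order as the rows. A sketch with made-up objects (“toy” and “OH” are not part of the lab files) shows a quick check to run first, plus a second way to add a column using “$”:

```r
# Toy dataframe and vector for illustration only
toy <- data.frame(year = c(1990, 1991, 1992),
                  PA = c(7, 9, 15))
OH <- c(16, 19, 20)

# Check first: is there one value per row?
length(OH) == nrow(toy)

# Option 1: column bind
combined <- cbind(toy, OH)
names(combined)

# Option 2: assign the vector to a new column with $
toy$OH <- OH
names(toy)
```

Both options end up with the same three columns; which one to use is mostly a matter of taste.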
See what the new eagles dataframe looks like
eagles
## year eagles eagles.WV eagles.OH
## 1 1980 3 0 NA
## 2 1981 NA 1 NA
## 3 1989 NA NA NA
## 4 1990 7 2 16
## 5 1991 9 3 19
## 6 1992 15 4 20
## 7 1993 17 5 24
## 8 1994 19 5 26
## 9 1995 20 5 30
## 10 1996 20 5 33
## 11 1997 23 6 38
## 12 1998 29 6 47
## 13 1999 43 7 57
## 14 2000 51 10 63
## 15 2001 55 12 73
## 16 2002 64 13 79
## 17 2003 69 14 88
## 18 2004 NA NA NA
## 19 2005 96 19 125
## 20 2006 100 NA NA
## 21 2007 NA NA NA
## 22 2008 NA NA NA
## 23 2009 NA NA NA
## 24 2010 NA NA NA
## 25 2011 NA NA NA
## 26 2012 NA NA NA
## 27 2013 NA NA NA
## 28 2014 252 NA NA
## 29 2015 277 NA NA
Look at the revised “eagles” object
dim(eagles)
## [1] 29 4
summary(eagles)
## year eagles eagles.WV eagles.OH
## Min. :1980 Min. : 3.00 Min. : 0.000 Min. : 16.0
## 1st Qu.:1994 1st Qu.: 18.00 1st Qu.: 4.000 1st Qu.: 25.0
## Median :2001 Median : 29.00 Median : 5.000 Median : 38.0
## Mean :2001 Mean : 61.53 Mean : 6.882 Mean : 49.2
## 3rd Qu.:2008 3rd Qu.: 66.50 3rd Qu.:10.000 3rd Qu.: 68.0
## Max. :2015 Max. :277.00 Max. :19.000 Max. :125.0
## NA's :10 NA's :12 NA's :14
names(eagles)
## [1] "year" "eagles" "eagles.WV" "eagles.OH"
head(eagles)
## year eagles eagles.WV eagles.OH
## 1 1980 3 0 NA
## 2 1981 NA 1 NA
## 3 1989 NA NA NA
## 4 1990 7 2 16
## 5 1991 9 3 19
## 6 1992 15 4 20
tail(eagles)
## year eagles eagles.WV eagles.OH
## 24 2010 NA NA NA
## 25 2011 NA NA NA
## 26 2012 NA NA NA
## 27 2013 NA NA NA
## 28 2014 252 NA NA
## 29 2015 277 NA NA
We will now make a plot w/ three sets of data from the 3 columns in the dataframe. We make the initial plot with plot(), then add West Virginia using “points(eagles.WV ~ …” to call up the eagles.WV column from the eagles dataframe. The OH data column is plotted using “points(eagles.OH ~ …”. We change the color each time using “col = …” and change the shape of the point with “pch = …” .
plot(eagles ~ year, data = eagles, col = 1, pch = 2) #The main plot
points(eagles.WV ~ year , data = eagles, col = 2, pch = 4) #add the WV data
points(eagles.OH ~ year , data = eagles, col = 3, pch = 5) #add the OH data
plot(eagles ~ year, data = eagles, col = 1, pch = 2)
points(eagles.WV ~ year , data = eagles, col = 2, pch = 4) #WV data
points(eagles.OH ~ year , data = eagles, col = 3, pch = 5) #OH data
legend("topleft", #where the legend goes
legend = c("PA","WV","OH"), #the text the legend contains
col = c(1,2,3), #colors of the points in the legend
pch = c(2,4,5) ) #symbols of the points
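One caution when layering data with points(): the axes are set by the first plot() command. That is fine here because PA has the largest counts, but if the first column plotted had the smallest values, the later points could fall off the top of the graph. A sketch of one way to handle this, using made-up data (“toy”, “small” and “large” are not part of the lab files) and the “ylim =” argument of plot():

```r
# Toy data for illustration only: the first series has the smaller values
toy <- data.frame(year = c(1990, 1991, 1992),
                  small = c(2, 3, 4),
                  large = c(16, 19, 20))

# Smallest and largest value across both columns, ignoring any NAs
y_limits <- range(toy$small, toy$large, na.rm = TRUE)

# Use those limits for the y axis so both series fit
plot(small ~ year, data = toy,
     ylim = y_limits,
     ylab = "Number of eagles")
points(large ~ year, data = toy, col = 2, pch = 4)
```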