Getting started with R

What we’ll do

Get some data ready in Excel
Set up RStudio
Load data into RStudio
Examine it
Plot it
Add more data to Excel
Examine it, plot it etc.

Part IIa. Prepping data in Excel

Save data to your R working directory (WD)

Save the file “Lab1_data_PA_eagles.xlsx” to your computer’s desktop. Today we will be using the desktop as the “working directory”.

Re-save the Excel file as a “csv” file

In Excel, follow these steps

1) “File”
2) “Save As”
3) “Browse”
4) Select the working directory (your desktop)
5) Select “Save as type”
6) Select “CSV (Comma delimited)”
7) Click “Save”

The data is now in a format that can be loaded into R.

Part IIb. Getting to know RStudio

Main parts of RStudio

The Console
The Source Viewer/Script editor
Plots/Packages/Help tabs

You can change the locations of the windows. I prefer to have my console in the lower left, Plots etc. in the upper left, and the source viewer on the right.

These elements can be moved by going to Tools, Global Options, then Pane Layout.

Part IIc. Loading Data into RStudio

We will now take the data we saved as a .csv file and load it into R.

Set the “working directory” (“WD”) in R

Follow these steps

1) “Session”
2) “Set working directory”
3) “Choose Directory” - select your computer’s desktop
4) Select the directory & click “Select folder”
5) The command “setwd” shows up in the console, followed by the location of the directory you selected

You can set your working directory to be anywhere on the computer. It is essential to make sure that the csv file you want to load into R is in your working directory.
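Here is a minimal sketch of the same idea done with code instead of the menus. The folder used here is R's temporary directory and the file "toy_example.csv" is made up, purely so the example runs on any computer; in the lab you would use the path to your own desktop.

```r
# A stand-in folder so this sketch runs on any computer (an assumption;
# in the lab this would be the path to your desktop)
my_folder <- tempdir()

# Change the working directory; setwd() hands back the previous one
old_wd <- setwd(my_folder)

# Save a tiny made-up file into the new working directory
write.csv(data.frame(year = c(1980, 1990)), file = "toy_example.csv")

# Check that R can see the file before trying to load it
file_found <- file.exists("toy_example.csv")
file_found

# Go back to the original working directory
setwd(old_wd)
```

If file.exists() says FALSE for your csv file, the file is not in your working directory and read.csv() will not be able to find it.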

Part IId. Interacting with R

(Slide ~30 of ppt)

Executing commands in the RStudio CONSOLE

Type “getwd()” in the console.
You must have both ( and )
Press Enter
R prints out your working directory (WD)

This should be your desktop, where we saved the csv file with the data.

getwd()

## [1] "C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab1_intro_to_R"

If you have not saved the file Lab1_data_PA_eagles.xlsx as a csv file to your desktop and/or your working directory, do so now.

Part IIe. Interacting with R from the SOURCE VIEWER part 1

Click on the source viewer pane in RStudio
Type “getwd()” in the source viewer
Click on the “Run” button in the upper right part of the pane
The getwd() command is sent over to the console and executed

getwd()

## [1] "C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/Lab/Lab1_intro_to_R"

Try these commands in the source viewer

date()
ls()

Note that “l” = lower case “L”

#Today's date
date()

## [1] "Tue Sep 06 12:27:23 2016"

#Probably won't return anything interesting:
ls()

## [1] "eagles" "year"

“ls” means “list”. More on this command later.

Now try just “date” without the parentheses. What happens?

date

## function () 
## .Internal(date())
## <bytecode: 0x06c9a434>
## <environment: namespace:base>

This returns what appears to be nonsense. Without the parentheses, R prints the code behind the command instead of running it.

NB: parentheses are key to the execution of commands in R!

When things don’t work, one of the first things to check is the parentheses!
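A small sketch of the difference, using the date command from above. The object name "todays_date" is made up for this example.

```r
# Without parentheses, date is just the name of a function
is.function(date)           # TRUE: "date" refers to the function itself

# With parentheses, the function is run and returns today's date as text
todays_date <- date()
is.character(todays_date)   # TRUE: the result is a piece of text
```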

Part IIf. Interacting with R

Executing commands from the SOURCE VIEWER part 2

Type “date()” & instead of clicking “Run” put the cursor right after the last parenthesis and press “Ctrl+Enter” on the keyboard (the “+” means “at the same time”, not the “+” key)

date()

## [1] "Tue Sep 06 12:27:23 2016"

Now execute the command “list.files()” using “Ctrl+Enter” on the keyboard

list.files()

This tells you what files are saved in your working directory (WD). There should be the original xlsx Excel file & the csv file you made using “Save As” (and anything else on your desktop). We will now load the .csv file into R.
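If your desktop is crowded, list.files() can be asked to show only the csv files. This is a sketch that uses a temporary folder and two made-up file names so that it runs anywhere; on your computer you could leave out "path =" to search your working directory.

```r
# Use a temporary folder with two made-up files so the sketch is self-contained
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "toy_data.csv"))
file.create(file.path(demo_dir, "toy_data.xlsx"))

# List everything in the folder
list.files(path = demo_dir)

# List only the files whose names end in ".csv"
csv_files <- list.files(path = demo_dir, pattern = "\\.csv$")
csv_files
```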

Loading data into R

Copy the CSV file name from the console and paste it into the source viewer, then execute the command “read.csv(file = "Lab1_data_PA_eagles.csv")”. You can type it, but you must be careful to have NO TYPOS. R is unforgiving when it comes to typos.

read.csv(file = "Lab1_data_PA_eagles.csv")

##     X year eagles
## 1   1 1980      3
## 2   2 1981     NA
## 3   3 1989     NA
## 4   4 1990      7
## 5   5 1991      9
## 6   6 1992     15
## 7   7 1993     17
## 8   8 1994     19
## 9   9 1995     20
## 10 10 1996     20
## 11 11 1997     23
## 12 12 1998     29
## 13 13 1999     43
## 14 14 2000     51
## 15 15 2001     55
## 16 16 2002     64
## 17 17 2003     69
## 18 18 2004     NA
## 19 19 2005     96
## 20 20 2006    100
## 21 21 2007     NA
## 22 22 2008     NA
## 23 23 2009     NA
## 24 24 2010     NA
## 25 25 2011     NA
## 26 26 2012     NA
## 27 27 2013     NA
## 28 28 2014    252
## 29 29 2015    277
## 30 30   NA     NA

You must have the file name in quotation marks and include the .csv. Any small error will cause things to not work.

Here are examples of mistakes that won’t work

#Incorrect - none of these will work
read.csv(file = Lab1_data_PA_eagles.csv)    #missing quotes " "
read.csv(file = "Lab1_data_PA_eagles_CSV")   #missing .csv
read.csv(file "Lab1_data_PA_eagles.csv")     #missing =

Note that R returns error messages, but they aren’t necessarily very helpful in figuring out what the problem actually is.
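One way to get a clearer hint is to ask R whether it can see the file at all before trying to load it. This is only a sketch; the misspelled file name below is made up on purpose and should not exist on your computer.

```r
# A file name with a deliberate typo (hypothetical; it should not exist)
file_name <- "Lab1_data_PA_eagels.csv"

# file.exists() answers TRUE or FALSE instead of giving an error
file.exists(file_name)

# tryCatch() captures the error message so it can be looked at
# (suppressWarnings() hides the warning R issues alongside the error)
msg <- tryCatch(
  suppressWarnings(read.csv(file = file_name)),
  error = function(e) conditionMessage(e)
)
msg
```

If file.exists() returns FALSE, the problem is the file name or the working directory, not the read.csv() command itself.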

Load eagles data into an R “object”

Now type this: “eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")”. What happens when you execute this command?

eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")

It might actually look like not much has happened. But that’s good! It means the data has successfully been loaded into R. You have “assigned” the data from your file to the “object” named “eagles”.

The “<-” command

“<-” is called the “assignment operator”. It is a special type of R command.

“<” usually shares a key with the comma ( , ). Type “Shift + ,” to get it.
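A tiny sketch of assignment using made-up objects called "x" and "y" (these names are just for illustration).

```r
# Assign the number 5 to an object called x; nothing is printed
x <- 5

# Typing the object's name prints what is stored in it
x

# Objects can be used in later commands
y <- x + 1
y
```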

R objects

If you type just “eagles” and execute it as a command what happens?

eagles

##     X year eagles
## 1   1 1980      3
## 2   2 1981     NA
## 3   3 1989     NA
## 4   4 1990      7
## 5   5 1991      9
## 6   6 1992     15
## 7   7 1993     17
## 8   8 1994     19
## 9   9 1995     20
## 10 10 1996     20
## 11 11 1997     23
## 12 12 1998     29
## 13 13 1999     43
## 14 14 2000     51
## 15 15 2001     55
## 16 16 2002     64
## 17 17 2003     69
## 18 18 2004     NA
## 19 19 2005     96
## 20 20 2006    100
## 21 21 2007     NA
## 22 22 2008     NA
## 23 23 2009     NA
## 24 24 2010     NA
## 25 25 2011     NA
## 26 26 2012     NA
## 27 27 2013     NA
## 28 28 2014    252
## 29 29 2015    277
## 30 30   NA     NA

This should be the exact same data that was in the original Excel file. We have saved these data into an “R object” that we can now work with.

Now execute the list command ls(). You should now see “eagles”.
This means that the object you assigned your data to is now in your “workspace.”

ls()

Interacting with data “objects” in R & Getting to know an R data object

(slide 51ish)

Look at the “eagles” object using the summary() command. DO NOT put the “>” in front of it. This “>” is just part of the readout from the Console.

summary(eagles)

##        X              year          eagles      
##  Min.   : 1.00   Min.   :1980   Min.   :  3.00  
##  1st Qu.: 8.25   1st Qu.:1994   1st Qu.: 18.00  
##  Median :15.50   Median :2001   Median : 29.00  
##  Mean   :15.50   Mean   :2001   Mean   : 61.53  
##  3rd Qu.:22.75   3rd Qu.:2008   3rd Qu.: 66.50  
##  Max.   :30.00   Max.   :2015   Max.   :277.00  
##                  NA's   :1      NA's   :11

Check how big the eagles object is using the dim() command (“dim” is short for “dimension”)

dim(eagles)

## [1] 30  3

Look at the top of the eagles object

head(eagles)

##   X year eagles
## 1 1 1980      3
## 2 2 1981     NA
## 3 3 1989     NA
## 4 4 1990      7
## 5 5 1991      9
## 6 6 1992     15

Look at the bottom of the eagles object

tail(eagles)

##     X year eagles
## 25 25 2011     NA
## 26 26 2012     NA
## 27 27 2013     NA
## 28 28 2014    252
## 29 29 2015    277
## 30 30   NA     NA

Try executing these commands directly from the console & also from the source viewer using the “Ctrl+Enter” shortcut (where the “+” means “at the same time”)

R help files

Call up the help information for a command by typing “?” in front of its name

?dim

## starting httpd help server ...

##  done

Unfortunately, the help files for R are designed with programmers in mind and are typically very encyclopedic. You can usually get some useful information from them, but when you are a beginner it can often be hard to find what you need.

R help on the internet

You can often find information online, e.g., by googling “R dim command”. Usually the R help file will come up. Other information will also show up. For very basic R commands this might not always be productive, but for things related to stats, plotting, and programming there is frequently lots of information. Also check out the website stackoverflow.com

Interacting w/ data in R: plotting the number of eagles in PA

Making a basic plot (slide 56 ish)

If you want to make a plot of the number of eagles over time in PA, what command do you think will do it? Many R commands use fairly simple language.

plot(eagles ~ year, data = eagles)

One thing that makes R tricky is that there are multiple ways to accomplish the exact same thing.

Try typing in these different commands. The following commands all produce the exact same figure, just with different colors (via “col = …”).

One consequence of this fact is that different books/instructors/etc. will use slightly different approaches, making it sometimes tricky to compare code written by different people.

Other ways of doing the exact same thing

plot(eagles$eagles ~ eagles$year,col = 2)
plot(eagles[,"eagles"] ~ eagles[,"year"], col = 3)
plot(eagles[,3] ~ eagles[,2], col = 4) #by column number: in the data loaded above, year is column 2 & eagles is column 3
plot(eagles$year,eagles$eagles, col = 5)
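To see that these different spellings really do pull out the same data, here is a sketch using a small made-up dataframe called "toy" (not the real eagle data). Note that referring to a column by number only works if you know the column order.

```r
# A small made-up data frame (not the real eagle data)
toy <- data.frame(year   = c(1980, 1990, 1991),
                  eagles = c(3, 7, 9))

# Three ways to pull out the same column
by_dollar <- toy$eagles        # by name, using $
by_name   <- toy[, "eagles"]   # by name, using square brackets
by_number <- toy[, 2]          # by position (eagles is the 2nd column here)

# identical() checks that two objects are exactly the same
identical(by_dollar, by_name)
identical(by_dollar, by_number)
```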

Customizing a plot in R

R plots can be customized almost infinitely. Type these different commands into the source viewer & execute them.

Change the color with “col =”

plot(eagles ~ year, data = eagles, col = 2)

Change the style of point with “pch =”

plot(eagles ~ year, data = eagles, col = 2, pch = 2)

Change the x axis label

plot(eagles ~ year, data = eagles, col = 2, pch = 2, xlab = "Year of census")

Change the y axis label

plot(eagles ~ year, data = eagles, col = 2, pch = 2, xlab = "Year of census",
     ylab = "Number of eagles")

Note that in the source viewer, a command can be split across separate lines. Be mindful of the commas though!

Here, each part of the command is on a separate line. This will produce the exact same plot as before.

plot(eagles ~ year, 
     data = eagles, 
     col = 2, 
     pch = 2, 
     xlab = "Year of census",
     ylab = "Number of eagles")

You can even include blank lines

plot(eagles ~ year, 
     
     data = eagles, 
     col = 2, 
     
     pch = 2, 
     
     xlab = "Year of census",
     
     ylab = "Number of eagles")

Summary stats for main PA eagle dataframe

There are many commands for summarizing data in R, such as mean() and median(). However, you have to be careful about NAs!

mean(eagles$eagles)

## [1] NA

So, you asked for the mean of the eagles data, and you got NA. That’s really annoying.

“NA” is a big deal in R

Try this

mean(eagles$eagles, na.rm = T)

## [1] 61.52632

“na.rm = T” is short for “na.rm = TRUE”, which means “should I remove the NAs? Yes, do it.”
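Here is the same behavior on a small made-up vector called "counts", where it is easy to check the answer by hand.

```r
# A small made-up vector with one missing value
counts <- c(3, NA, 7, 9)

# Any calculation that includes an NA returns NA
mean(counts)

# is.na() flags the missing values; sum() counts them
sum(is.na(counts))

# na.rm = TRUE drops the NAs before calculating the mean
mean_no_na <- mean(counts, na.rm = TRUE)
mean_no_na
```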

R is also VERY picky about upper vs. lower case

Mean(eagles$eagles, na.rm = T)

#This returns the error message "Error in UseMethod("Mean") : 
#  no applicable method for 'Mean' applied to an object of class "c('double', 'numeric')""

Note that the R error message is not very helpful : (
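Case matters for the names of objects too, not just commands. A short sketch with two made-up objects whose names differ only in their first letter:

```r
# Object names are case sensitive: these are two different objects
eagle_count <- 3
Eagle_count <- 100
eagle_count == Eagle_count   # FALSE

# "mean" and "Mean" are not the same name as far as R is concerned
identical("mean", "Mean")            # FALSE
identical("mean", tolower("Mean"))   # TRUE
```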

Most basic R commands are lower case

mean(eagles$eagles, na.rm = T)

## [1] 61.52632

median(eagles$eagles, na.rm = T)

## [1] 29

min(eagles$eagles, na.rm = T)

## [1] 3

max(eagles$eagles, na.rm = T)

## [1] 277

summary(eagles$eagles)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3.00   18.00   29.00   61.53   66.50  277.00      11

sd(eagles$eagles, na.rm = T)

## [1] 77.1149

Standard error in R

The standard error (se) is a very common summary statistic, but for some reason there is not a function for it in base R.

Calculate the standard error by hand

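Use the sd() command and the square root command sqrt(). The standard error is the standard deviation divided by the square root of the sample size. The PA column has 30 rows, but 11 of them are NA, so the sample size is 19.

sd(eagles$eagles, na.rm = T)/sqrt(19)

## [1] 17.69137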
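Typing the sample size in by hand is easy to get wrong when there are NAs. One alternative is to let R count the non-missing values. This is a sketch, not a built-in feature: the function name "se" and the vector "counts" are made up for this example.

```r
# A small helper for the standard error (the name "se" is made up for this
# sketch; it is not a built-in R function)
se <- function(x) {
  n <- sum(!is.na(x))          # sample size = number of non-missing values
  sd(x, na.rm = TRUE) / sqrt(n)
}

# A made-up vector with 4 real values and 2 NAs
counts <- c(2, 4, NA, 6, 8, NA)

sum(!is.na(counts))   # the sample size R will use
se(counts)
```

The same helper could then be used on a column of a dataframe, e.g. se(eagles$eagles).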

Reload eagle data with a 2nd state added

Open up the original CSV file in Excel
Type in new data from your assigned state in a new column
Name the column “eagles.XX” where “XX” is the postal code for that state.
Be sure to separate the words with a period; R is unfortunately picky about what characters show up in its column names
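To see what R does with column names it doesn't like, you can try the make.names() command, which applies the same kind of clean-up that read.csv() uses by default. The column names below are made-up examples.

```r
# Column names as they might be typed in Excel (made-up examples)
excel_names <- c("eagles WV", "eagles-WV", "eagles.WV")

# make.names() shows how R would rewrite each name to make it valid
r_names <- make.names(excel_names)
r_names
```

Spaces and dashes get swapped for periods, so it is simplest to use a period in the first place.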

Here is data for West Virginia as an example. I will add it to a column in the spreadsheet called “eagles.WV”

Preparing a file for loading into R

Things work best when your Excel file is “clean” & only has exactly what you want in it. Any extra, accidental typing can cause problems or make things confusing. A good practice is to always highlight the cells to the right of your data, right click & select “Delete”. This will remove any accidental typing that occurred. Do this to the cells below your data also.
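If a stray cell slips through anyway, it usually shows up in R as an extra row or column full of NAs. Here is a sketch of one way to drop such a row after loading, using a small made-up dataframe called "toy".

```r
# A made-up data frame with an empty last row, like the one a stray
# Excel cell can create
toy <- data.frame(year   = c(1980, 1981, NA),
                  eagles = c(3, NA, NA))

# Keep only the rows where year is not missing
toy_clean <- toy[!is.na(toy$year), ]

dim(toy)         # rows and columns before
dim(toy_clean)   # rows and columns after
```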

Reload data

Reload the data; be sure to include the “.csv” at the end. Use this code: “eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")”. NOTE: I changed the name of my file to include “_w_2_states” so that I wouldn’t overwrite the original file. Don’t use the “_w_2_states” file name unless you changed your file name to the exact same thing.

#Use this code, w/o the "#" in front of it
# eagles <- read.csv(file = "Lab1_data_PA_eagles.csv")

#NOTE: I changed the name of the file to include "_w_2_states" so that I wouldn't overwrite the original file.  Don't use this code unless you changed the file name to the exact same thing
eagles <- read.csv(file = "Lab1_data_PA_eagles_w_2_states.csv")

Type ls() to see what is now in your workspace

ls()

## [1] "eagles" "year"

Look at the re-loaded eagles data object

summary(eagles)

##       year          eagles         eagles.WV     
##  Min.   :1980   Min.   :  3.00   Min.   : 0.000  
##  1st Qu.:1994   1st Qu.: 18.00   1st Qu.: 4.000  
##  Median :2001   Median : 29.00   Median : 5.000  
##  Mean   :2001   Mean   : 61.53   Mean   : 6.882  
##  3rd Qu.:2008   3rd Qu.: 66.50   3rd Qu.:10.000  
##  Max.   :2015   Max.   :277.00   Max.   :19.000  
##                 NA's   :10       NA's   :12

dim(eagles)

## [1] 29  3

head(eagles)

##   year eagles eagles.WV
## 1 1980      3         0
## 2 1981     NA         1
## 3 1989     NA        NA
## 4 1990      7         2
## 5 1991      9         3
## 6 1992     15         4

tail(eagles)

##    year eagles eagles.WV
## 24 2010     NA        NA
## 25 2011     NA        NA
## 26 2012     NA        NA
## 27 2013     NA        NA
## 28 2014    252        NA
## 29 2015    277        NA

Plot the data from both states

Plot the PA data

Use “col = 1” to set PA to black

plot(eagles ~ year, data = eagles, col = 1)# col = 1 sets point color to black

Plot the 2nd state’s data with the points() command

We’ll add points to the first graph using the command “points()”. It’s very similar to the plot command. Be sure to change the name of the column of data being graphed. Use “col = 2” within the “points()” command to set the other state to red

#Main plot
plot(eagles ~ year, data = eagles, col = 1)# col = 1 sets point color to black
#adding points to graph with points()
points(eagles.WV ~ year, 
       data = eagles, 
       col = 2) #WV data; set colors to red using col = 2

Exporting R Figures

Click on “Export” in RStudio above the graph
Select “Copy to clipboard”
A window will pop up
Adjust the size and dimensions by clicking and dragging the bottom right corner
Click on “Copy plot”
The window will disappear
The plot is now stored in memory and can be pasted into PowerPoint or a Word document

In MS Word, it works well to right click in a document and click on the clipboard icon. The shortcut “Ctrl+V” should work in Word and PowerPoint
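A figure can also be saved to a file with code instead of the Export button. This is a sketch of one common approach using the png() command; the data are made up, and a temporary file is used only so the example runs anywhere.

```r
# Made-up data so the sketch runs on its own
year   <- c(1990, 1991, 1992)
eagles <- c(7, 9, 15)

# A temporary file is used here; in practice you would give a file name
# such as "eagle_plot.png" (a hypothetical name)
plot_file <- tempfile(fileext = ".png")

png(filename = plot_file, width = 600, height = 400)  # open the file
plot(eagles ~ year, xlab = "Year of census", ylab = "Number of eagles")
dev.off()                                              # close and save it

file.exists(plot_file)
```

Nothing shows up in the Plots tab while the file is open; the plot is drawn into the file and saved when dev.off() is run.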

Modifying plots

Plots can be modified using many different arguments contained within the main “plot()” command.

Change the plotting symbol

Change the type of point used for each state via “pch =”

#The main plot
plot(eagles ~ year, data = eagles,
     col = 1, #col = 1 for black
     pch = 2) #pch = 2 for triangles

#Add points for the other data
points(eagles.WV ~ year , 
       data = eagles, 
       col = 2, #col = 2 for red
       pch = 4) #pch = 4 for Xs

Adding a legend

Plots should always have legends. Legends are highly customizable in R but can require a bit of coding. Here is how you could do it. This will be covered again in later labs in more detail. One thing to note that will be discussed later is the use of the “c(…)” in the code.

#The main plot
plot(eagles ~ year, data = eagles, 
     col = 1,  #color of point
     pch = 2)  #shape of point

#Add new data with col = 2 and pch = 4
points(eagles.WV ~ year , data = eagles, 
       col = 2, #color of point
       pch = 4) #shape of point

#Add a legend
legend("topleft",             #where the legend goes
       legend = c("PA","WV"), #the text the legend contains
       col = c(1,2),          #colors of the points in the legend
       pch = c(2,4) )         #symbols of the points

Add data “by hand”

Most work in R takes data from a spreadsheet and loads it using read.csv(). It is possible to also enter data manually. This is often useful for class exercises where small “toy” datasets are used that are easy to manage. We’ll enter data “by hand” to add a third state of data to our figure.

Add data for Ohio, USA

In R terms, we are creating a “vector” of data using the “c()” command. c() is a very common command that we will discuss more later

eagles.OH <- c(NA,NA,NA,16,19,20,24,26,30,33,38,47,57,63,73,79,88,NA,125,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)

Get information about the OH data

#Length of the OH eagle vector
length(eagles.OH)

## [1] 29

#Dimensions of the OH vector
#this returns NULL for a very R-ish reason: a vector has a length but no dimensions
dim(eagles.OH)

## NULL
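A short sketch of the difference between a vector and a dataframe, using made-up objects called "ohio" and "toy".

```r
# A vector made with c() has a length but no dimensions
ohio <- c(16, 19, 20, NA)
length(ohio)        # number of elements
is.null(dim(ohio))  # TRUE: dim() has nothing to report

# A data frame has both rows and columns, so dim() returns two numbers
toy <- data.frame(year = c(1990, 1991, 1992, 1993), eagles.OH = ohio)
dim(toy)
```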

Summary statistics from the OH vector.

Note again that we have to use the cryptic “na.rm = T”

#This won't work
mean(eagles.OH)

## [1] NA

#This will b/c we include "na.rm = T"
mean(eagles.OH, na.rm = T)

## [1] 49.2

median(eagles.OH, na.rm = T)

## [1] 38

min(eagles.OH, na.rm = T)

## [1] 16

max(eagles.OH, na.rm = T)

## [1] 125

summary(eagles.OH)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    16.0    25.0    38.0    49.2    68.0   125.0      14

sd(eagles.OH, na.rm = T)

## [1] 31.34873

The standard error can be calculated by hand

#Note that the length of the OH vector is 29, but many of those values are "NA" so that the actual sample size is only 15
sd(eagles.OH, na.rm = T)/sqrt(15)

## [1] 8.094207

Plots can be easily made when data is in the form of a vector

#Make a boxplot
#What is the thick black line?
boxplot(eagles.OH)

#Make a histogram
hist(eagles.OH)

Combine OH data with main dataframe

We currently have 2 datasets living in R. One is a dataframe that we imported from a spreadsheet in CSV form. The other is a “vector” of data from OH. We can combine these using a command called cbind() for “column bind”.

eagles <- cbind(eagles,    #original dataframe from csv
                 eagles.OH) #OH vector
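cbind() only makes sense when the vector has one value for each row of the dataframe, so it is worth checking the lengths first. Here is a sketch with made-up data that also shows a second common way to add a column, using $.

```r
# A made-up data frame and a vector with the same number of rows
toy <- data.frame(year = c(1990, 1991, 1992), eagles = c(7, 9, 15))
eagles.OH <- c(16, 19, 20)

# Check that the lengths match before combining
nrow(toy) == length(eagles.OH)

# Option 1: cbind(), as in the lab
toy_cbind <- cbind(toy, eagles.OH)

# Option 2: assign the vector to a new column using $
toy_dollar <- toy
toy_dollar$eagles.OH <- eagles.OH

# Both options give the same column names
names(toy_cbind)
identical(names(toy_cbind), names(toy_dollar))
```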

See what the new eagles dataframe looks like

eagles

##    year eagles eagles.WV eagles.OH
## 1  1980      3         0        NA
## 2  1981     NA         1        NA
## 3  1989     NA        NA        NA
## 4  1990      7         2        16
## 5  1991      9         3        19
## 6  1992     15         4        20
## 7  1993     17         5        24
## 8  1994     19         5        26
## 9  1995     20         5        30
## 10 1996     20         5        33
## 11 1997     23         6        38
## 12 1998     29         6        47
## 13 1999     43         7        57
## 14 2000     51        10        63
## 15 2001     55        12        73
## 16 2002     64        13        79
## 17 2003     69        14        88
## 18 2004     NA        NA        NA
## 19 2005     96        19       125
## 20 2006    100        NA        NA
## 21 2007     NA        NA        NA
## 22 2008     NA        NA        NA
## 23 2009     NA        NA        NA
## 24 2010     NA        NA        NA
## 25 2011     NA        NA        NA
## 26 2012     NA        NA        NA
## 27 2013     NA        NA        NA
## 28 2014    252        NA        NA
## 29 2015    277        NA        NA

Look at the revised “eagles” object

dim(eagles)

## [1] 29  4

summary(eagles)

##       year          eagles         eagles.WV        eagles.OH    
##  Min.   :1980   Min.   :  3.00   Min.   : 0.000   Min.   : 16.0  
##  1st Qu.:1994   1st Qu.: 18.00   1st Qu.: 4.000   1st Qu.: 25.0  
##  Median :2001   Median : 29.00   Median : 5.000   Median : 38.0  
##  Mean   :2001   Mean   : 61.53   Mean   : 6.882   Mean   : 49.2  
##  3rd Qu.:2008   3rd Qu.: 66.50   3rd Qu.:10.000   3rd Qu.: 68.0  
##  Max.   :2015   Max.   :277.00   Max.   :19.000   Max.   :125.0  
##                 NA's   :10       NA's   :12       NA's   :14

names(eagles)

## [1] "year"      "eagles"    "eagles.WV" "eagles.OH"

head(eagles)

##   year eagles eagles.WV eagles.OH
## 1 1980      3         0        NA
## 2 1981     NA         1        NA
## 3 1989     NA        NA        NA
## 4 1990      7         2        16
## 5 1991      9         3        19
## 6 1992     15         4        20

tail(eagles)

##    year eagles eagles.WV eagles.OH
## 24 2010     NA        NA        NA
## 25 2011     NA        NA        NA
## 26 2012     NA        NA        NA
## 27 2013     NA        NA        NA
## 28 2014    252        NA        NA
## 29 2015    277        NA        NA

Make a plot with 3 sets of data

We will now make a plot with three sets of data from the 3 columns in the dataframe. We make the initial plot with plot(), then add West Virginia using “points(eagles.WV ~ …” to call up the eagles.WV column from the eagles dataframe. The OH data column is plotted using “points(eagles.OH ~ …”. We change the color each time using “col = …” and change the shape of the point with “pch = …”.

plot(eagles ~ year, data = eagles, col = 1, pch = 2) #The main plot
points(eagles.WV ~ year , data = eagles, col = 2, pch = 4) #add the WV data
points(eagles.OH ~ year , data = eagles, col = 3, pch = 5) #add the OH data

Make a finished plot with 3 sets of data and a legend

plot(eagles ~ year, data = eagles, col = 1, pch = 2)
points(eagles.WV ~ year , data = eagles, col = 2, pch = 4) #WV data
points(eagles.OH ~ year , data = eagles, col = 3, pch = 5) #OH data
legend("topleft",                  #where the legend goes
       legend = c("PA","WV","OH"), #the text the legend contains
       col = c(1,2,3),             #colors of the points in the legend
       pch = c(2,4,5) )            #symbols of the points

Lab 1: Introduction to R

brouwern@gmail.com

September 3, 2016

ENS 495 Lab 1: Intro to R

Preface

Part I

Introduction: What is R?

What does it do?

Why use R

Who uses it?

Part II