Introduction to R

Summer 2020

Why Are We Using R?

R is open source and freely available
Works on all standard platforms (Mac, Windows, Linux)
Standard tool for many sciences (e.g., Computer Science, Physics)
Easy to produce web-based slides using:
- R-Studio
- R-Markdown
Traditional statistics tool with GUI front ends, as well as a flexible and powerful programming language
Together with the ggplot2 package, it can produce very nice data visualizations

Install the Latest R

The latest version of R as of this presentation is 3.2.3

Go to http://cran.r-project.org/mirrors.html
Pick a mirror site (e.g., http://watson.nci.nih.gov/cran_mirror/)
Select the Download for … link that matches your platform
Follow the download and install prompts

If you are using Windows, choose the 64-bit version unless you know you are running a 32-bit version of Windows

Install the Latest R Studio

The latest version of R Studio as of this presentation is 0.99

Go to http://www.rstudio.com/products/RStudio/#Desk
Select Download RStudio Desktop, then pick your platform
Follow the download and install prompts
Watch their R Studio Overview video

The R-Studio Console

One can use the Console in R Studio to enter commands, functions, and data to perform statistical processes
The Console is located in the bottom left of the R Studio screen
When you see a grey box with code in it, you can copy and paste this into the Console and match the results to what the slides say under the grey box
For example:

print("Statistics Rocks (roughly speaking)!")

## [1] "Statistics Rocks (roughly speaking)!"

Later we’ll talk about using scripts to do more complicated things

Install Some Helpful Packages

Run R Studio
In the Console, type or copy the following:

install.packages(c("ggplot2","gcookbook","UsingR"))

If you are confused about how any function works in R, just type ?<functionname> for help – for example:

?install.packages

Test These Packages

library(ggplot2)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

library(gcookbook)

Test These Packages, p.2

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) + 
  geom_bar(position="dodge",stat="identity")

Many Built-In Distributions

R comes with many standard distributions
- Continuous: Exponential (exp), Normal (norm), Student $t$ (t), Uniform (unif), Chi-Squared (chisq), etc.
- Discrete: Binomial (binom), Geometric (geom), Poisson (pois), etc.
We can make use of a number of operations on distributions
- r<dist>() to draw random numbers
- q<dist>() to get quantile information
- p<dist>() to get cumulative probabilities (the probability of a value at or below a given point)
- d<dist>() to get density information
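As a rough sketch of how the four prefixes fit together, here they are applied to the standard normal distribution (norm). The variable names are illustrative only, and the seed is an arbitrary choice made so the draws repeat.

```r
# The four prefixes applied to the standard normal distribution
set.seed(1)                  # arbitrary seed so the draws are repeatable
draws   <- rnorm(5)          # r: five random numbers
upper   <- qnorm(0.975)      # q: value below which 97.5% of the mass lies (about 1.96)
below   <- pnorm(upper)      # p: probability of landing at or below that value (0.975)
density <- dnorm(0)          # d: height of the density curve at 0, i.e. 1/sqrt(2*pi)
print(c(upper, below, density))
```

Note that q<dist>() and p<dist>() undo each other: feeding the result of one into the other returns the starting value.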

Drawing Random Numbers

Seven random numbers from a binomial distribution, with $n=100, p=0.25$

rbinom(7,size=100,prob=0.25)

## [1] 28 33 32 20 28 27 31

Three random numbers $\sim N(5, 0.1)$ (mean 5, standard deviation 0.1)

rnorm(3,mean=5,sd=0.1)

## [1] 4.902377 4.940091 4.987001
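Because these draws are random, your numbers will differ from the ones printed on the slides. If you need repeatable results (for a lab write-up, say), one option is to fix the seed first. The sketch below assumes an arbitrary seed value of 2020; any integer works.

```r
# Fixing the seed makes random draws repeatable
set.seed(2020)                          # arbitrary seed value
first  <- rnorm(3, mean = 5, sd = 0.1)

set.seed(2020)                          # same seed again
second <- rnorm(3, mean = 5, sd = 0.1)

identical(first, second)                # TRUE: same seed, same draws
```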

Histograms, Ahoy!

qplot(rexp(100000,rate=2),binwidth=0.25)

Get Some Data!

Let’s get Edgar Anderson’s dataset on Iris flowers
This dataset is stored online as a comma separated file
We can read such files directly using read.csv()
We can also get a summary of all the variables in this dataset, as well as their quartile information (for numeric data) and level counts (for categorical data)
With the correct package, there are commands to read many data formats
- read.xlsx() for Microsoft Excel
- read.spss() for SPSS
- read.xport() for SAS
- Etc.
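If you want to see what read.csv() does without a network connection, a small sketch is to hand it CSV text directly through its text argument. The three rows below are made up for illustration and are not the real iris data.

```r
# A tiny, made-up CSV held in a string (illustrative values only)
csv_text <- "Sepal.Length,Species
5.1,setosa
7.0,versicolor
6.3,virginica"

# text= reads from the string instead of a file or URL;
# stringsAsFactors=TRUE turns the Species column into a factor
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)
summary(toy)
```

The first line of the text becomes the column names, just as it does when reading a file.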

Try This:

irisData = read.csv(
  "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv",
  stringsAsFactors=TRUE)  # keep Species as a factor (not the default in R 4.0+)
summary(irisData)

##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500

Deal with a Specific Column

Use <dataset-name>$<variable-name> to get the data from a specific column
Compute basic statistics on columns, e.g., mean()

mean(irisData$Sepal.Length)

## [1] 5.843333

Get all the levels of a categorical variable:

levels(irisData$Species)

## [1] "setosa"     "versicolor" "virginica"

Confidence Intervals, the Hard Way

A 95% confidence interval of a mean using the standard normal distribution:

\[ \bar{x} \pm z^\star \cdot \frac{s}{\sqrt{n}} \]

n = dim(irisData)[1]                 # Sample Size
xb = mean(irisData$Sepal.Length)     # x_bar
s  = sqrt(var(irisData$Sepal.Length)) # standard deviation
z  = qnorm(0.975)                    # Critical value for 95%
xb - z*s/sqrt(n)                     # lower bound
xb + z*s/sqrt(n)                     # upper bound

## [1] 5.710818

## [1] 5.975849
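The value 0.975 comes from splitting the leftover 5% evenly between the two tails: 1 - 0.05/2 = 0.975. As a sketch of how the same steps generalize to other confidence levels, the hypothetical helper z_interval() below (not part of R or any package) wraps them in a function. It assumes the built-in iris data frame, which ships with R, so that it runs offline.

```r
# Hypothetical helper: normal-based confidence interval for a mean
z_interval <- function(x, conf.level = 0.95) {
  alpha      <- 1 - conf.level
  z          <- qnorm(1 - alpha / 2)            # critical value, e.g. qnorm(0.975)
  half_width <- z * sd(x) / sqrt(length(x))     # z * s / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

ci95 <- z_interval(iris$Sepal.Length)                     # built-in iris data
ci99 <- z_interval(iris$Sepal.Length, conf.level = 0.99)  # wider interval
print(ci95)
print(ci99)
```

Here sd(x) is the same quantity as sqrt(var(x)) used above.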

Confidence Intervals, the Easy Way

R comes with an easy way to get confidence intervals using the Student $t$ distribution:

t.test(irisData$Sepal.Length)$conf.int

## [1] 5.709732 5.976934
## attr(,"conf.level")
## [1] 0.95

For a normal distribution, we load the UsingR package:

library(UsingR)
simple.z.test(irisData$Sepal.Length,
              sigma=sqrt(var(irisData$Sepal.Length)))

## [1] 5.975849 5.710818

Fun Stuff with Lists

To get a sequence from 1 to 10 stepping by 1, use 1:10
To make your own list of things, use c()
To count the values in the list, use length()

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

c(3,-1.2,44,103.009)

## [1]   3.000  -1.200  44.000 103.009

length(c(3,-1.2,44,103.009))

## [1] 4

More Fun Stuff with Lists

To sum the values in a list of numbers, use sum()
To produce a running, cumulative sum of a list, use cumsum()
The same idea applies to product, prod() and cumprod()

sum(rnorm(10))   # Sum 10 numbers drawn from std. norm

## [1] -5.292483

cumsum(1:10)     # Running sum of numbers from 1:10

##  [1]  1  3  6 10 15 21 28 36 45 55

cumprod(c(2,4,10))

## [1]  2  8 80

What is an R Source File?

You can type commands into the console
But for anything even moderately sophisticated, you’ll probably want to create a source file
An R source file is a plain-text file containing R commands that can be run at any time
There are many advantages, not the least of which is being able to give the file to someone else to run
Also, you’ll need to turn in a source file for homework

Creating a Source File

You can use any text editor, but we’ll use R Studio
Go to File $\rightarrow$ New File $\rightarrow$ R Script
Type commands into the file in the order that you would like them to be executed
- Don’t forget to include any library() loads
- The user running this file next may not have that library loaded
Save the file when you are done
- File $\rightarrow$ Save As, then give the file a location and name that you like
- Or just File $\rightarrow$ Save if it already has a location and name
- It is customary to give the file the extension .R or .r
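To make the advice above concrete, here is a sketch of what a very small source file might contain. The file name lab1.R and the variable names are hypothetical, and the script uses the built-in iris data so that it does not depend on any local files.

```r
# lab1.R: hypothetical file name, used here only for illustration

# Load every package the script needs at the top
library(stats)   # ships with R; stands in for whatever packages you actually use

# Commands run top to bottom, in the order written
sepal_lengths <- iris$Sepal.Length      # built-in copy of the iris data
sepal_mean    <- mean(sepal_lengths)
print(sepal_mean)
```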

Sourcing the File

To run the file in R Studio, either

Press the Source button in the top right of the editor panel
Or, in the console, type:

source('path/to/my/script.R')
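If you would like to try source() before you have a script of your own, one possible sketch is to write a two-line script to a temporary file and source that. The variable names script_path and y_demo are made up for this example.

```r
# Write a throwaway two-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("y_demo <- 1:3",
             "print(sum(y_demo))"),
           script_path)

# Run every command in the file, top to bottom
source(script_path)

# Variables created by the script are now available in the console
y_demo
```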

Make Sure Your Script Runs in a Clean Environment

It is always a good idea to make sure your script will run with your environment completely clean

From the top-right panel, select the Environment tab
Select the little broom icon to clean the environment, then confirm with Yes on the dialog
From the Session menu option, select Restart R and Clean Output
Source your file
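The broom icon can also be approximated from the Console. The sketch below uses a made-up variable, scratch_value, to stand in for whatever is left over from earlier work.

```r
scratch_value <- 42          # hypothetical leftover variable from earlier work
rm(list = ls())              # remove every object in the current environment
exists("scratch_value")      # FALSE: the variable is gone
```

Removing objects this way does not unload packages that were attached with library(), which is why restarting R is still part of the steps above.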

Turning in R Lab Assignments

You will be given several R lab assignments throughout the semester
Please submit these via git – there is a link in the assignment
I will source your file and compare its output to what I expect to see
I will also need any data files you used
- If you refer to files on your local machine, I won't be able to read them

Basic Data Types in R

Some Basic Data Types

numeric (numbers)
character (strings)
vectors (lists of one type)
factors (categorical variables)
lists (lists of arbitrary types)
matrices (2D arrays of one type, usually numeric)

Numeric Data

scalar numbers: integers and real values
arithmetic operations on these

x = 3
y = 4.7
x/y + 2*x - 3

## [1] 3.638298

y > x

## [1] TRUE

Character Data

strings
operations on strings

x = "hello"
nchar(x)

## [1] 5

gsub("he","HE-",x)

## [1] "HE-llo"

Concatenating Strings with paste()

x = "hello"
y = "world"
paste(x,y)

## [1] "hello world"

paste(x,y,"turtle",sep=':')

## [1] "hello:world:turtle"

Vectors

What I called a “list” last week is really a “vector” in R
R vectors aren’t just numeric
However, all the elements of a vector must be the same type
Creating a vector with a mixture of numeric and character values will force R to coerce the vector to character

c(1,2,9)

## [1] 1 2 9

c(1,2,"9")

## [1] "1" "2" "9"
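One way to see the coercion for yourself is to ask R for the class of each vector. This is only a small sketch, and the variable names are illustrative.

```r
nums  <- c(1, 2, 9)
mixed <- c(1, 2, "9")

class(nums)                 # "numeric"
class(mixed)                # "character": the numbers were turned into strings

# Arithmetic on the character vector would fail, so convert it back first
back <- as.numeric(mixed)
sum(back)                   # 12
```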

Vector Ops: Scaling, Indexing, and Length

x = c(-4,2,31)
3*x

## [1] -12   6  93

x[1]

## [1] -4

length(x)

## [1] 3

Vector Ops: Element-by-Element Ops

x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y

## [1] -4  6 27

x*y - z

## [1] -8 -2 15

Vector Ops: Named Elements

x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]

## Frank 
##     2

x["Mindy"] = 0
print(x)

##   Bob Frank Mindy 
##    -4     2     0

Vector Ops: Pasting with Vectors

x = 1:5
paste("String",x,sep='-')

## [1] "String-1" "String-2" "String-3" "String-4" "String-5"

Vector Ops: Appending to a Vector

x = seq(from=2,to=10,by=2)
print(x)

## [1]  2  4  6  8 10

x = c(x,-99)
x

## [1]   2   4   6   8  10 -99

x[7] = -98
x

## [1]   2   4   6   8  10 -99 -98

Factors

A factor is like a vector, except for categorical values
They can be created from vectors

x = c("good","good","bad","mediocre","good")
factor(x)

## [1] good     good     bad      mediocre good    
## Levels: bad good mediocre

factor(c(1,2,1,1,2,2,3,3,2,2))

##  [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3
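By default the levels are sorted alphabetically, which is why "bad" comes before "good" above. As a sketch of how you might control the order and count the values, the example below passes an explicit levels argument; the variable name ratings is made up.

```r
# Same data as above, but with the levels given in a meaningful order
ratings <- factor(c("good", "good", "bad", "mediocre", "good"),
                  levels = c("bad", "mediocre", "good"))

levels(ratings)        # "bad" "mediocre" "good"
table(ratings)         # counts per level: bad 1, mediocre 1, good 3
as.integer(ratings)    # 3 3 1 2 3: each value stored as its level's position
```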

Lists Are Like Mixed-Type Vectors

x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] -1 -2 -9

List Operations: Named Elements

x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]

## [1] 3

x$A

## [1] 3
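Single brackets also work on lists, but they return a smaller list rather than the element itself. The following sketch, with made-up contents, shows the difference.

```r
x <- list(A = 3, B = "hello")

class(x["A"])      # "list": single brackets keep the list wrapper
class(x[["A"]])    # "numeric": double brackets pull out the element itself

x[["A"]] + 1       # 4: arithmetic works on the extracted element
```

Trying x["A"] + 1 instead would stop with an error, because a list cannot be added to a number.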

Matrix Operations: Creating & Indexing

A = 1:6
dim(A) <- c(2,3)
print(A)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

A[1,3]

## [1] 5

A[,2]

## [1] 3 4

Matrix Ops: Indexing Rows or Columns

A = 1:6
dim(A) <- c(2,3)
A[2,]; A[,3]

## [1] 2 4 6

## [1] 5 6

A[,2] = c(-99,-98)
print(A)

##      [,1] [,2] [,3]
## [1,]    1  -99    5
## [2,]    2  -98    6

Matrix Ops: Simple Arithmetic

A = 1:6
dim(A) <- c(2,3)
3*A

##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    6   12   18

A + A

##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12

Matrix Ops: Transposition

A = 1:6
dim(A) <- c(2,3)
A

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

t(A)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

Matrix Ops: Multiplication

A = 1:6
dim(A) <- c(2,3)
A * A

##      [,1] [,2] [,3]
## [1,]    1    9   25
## [2,]    4   16   36

A %*% t(A)

##      [,1] [,2]
## [1,]   35   44
## [2,]   44   56
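The slides build matrices by setting dim() on a vector. An alternative sketch uses the matrix() function, which also lets you fill by rows; the name B is introduced here only for illustration.

```r
A <- matrix(1:6, nrow = 2)                  # fills column by column, as dim(A) <- c(2,3) does
B <- matrix(1:6, nrow = 2, byrow = TRUE)    # fills row by row instead

A[1, 3]                # 5
B[1, 3]                # 3

# %*% needs the inner dimensions to agree: (2x3) times (3x2) gives 2x2
dim(A %*% t(A))        # 2 2
dim(t(A) %*% A)        # 3 3
```

A %*% A is not defined here, since a 2x3 matrix cannot be multiplied by another 2x3 matrix.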

Dataframes

Data Frames

Basic data type for statistical operations
Implements a table
- Variables stored in columns
- Observations stored in rows
Typically columns are named
Values in each column must be the same type
- factors
- numeric vectors
- character vectors
Different columns may have different types

Data Frames: Creating Data Frames

Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData

##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C

Data Frames: Getting Variable Data

rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade

## [1] 96 82 97 74

rd$LetterGrade

## [1] A B A C
## Levels: A B C

Data Frames: Like Matrices, but Not

# rd from last slide
dim(rd)

## [1] 4 3

rd[2,]

##   Student NumberGrade LetterGrade
## 2     Sue          82           B
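A few more indexing patterns may help show where data frames resemble matrices and where they differ. This sketch rebuilds rd so that it runs on its own; the names top and col_types are made up.

```r
rd <- data.frame(Student     = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A", "B", "A", "C")))

rd[2, "NumberGrade"]                  # 82: row 2, column picked by name

# Like a matrix: rows can be selected with a logical condition
top <- rd[rd$NumberGrade > 90, ]      # rows for Bob and Cat
nrow(top)                             # 2

# Unlike a matrix: each column keeps its own type
col_types <- sapply(rd, class)
print(col_types)
```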

Data Frames: Summarizing

# rd from two slides ago
summary(rd)

##  Student  NumberGrade    LetterGrade
##  Bob:1   Min.   :74.00   A:2        
##  Cat:1   1st Qu.:80.00   B:1        
##  Lin:1   Median :89.00   C:1        
##  Sue:1   Mean   :87.25              
##          3rd Qu.:96.25              
##          Max.   :97.00

mode() versus class()

mode() describes how the data is stored
class() describes what class interprets that data

c(mode(4), mode("hello"))

## [1] "numeric"   "character"

x = as.Date("2014-01-22"); mode(x)

## [1] "numeric"

class(x)

## [1] "Date"
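The same distinction shows up with types you have already met. As a brief sketch (variable names are illustrative), a factor is stored as numbers and a data frame is stored as a list:

```r
f <- factor(c("a", "b", "a"))
mode(f)      # "numeric": stored as integer codes
class(f)     # "factor"

df <- data.frame(v = 1:2)
mode(df)     # "list": a data frame is a list of columns
class(df)    # "data.frame"
```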

assign(), get(), and rm()

assign() allows you to assign values to a named variable
get() allows you to get the value from a named variable
rm() removes a named variable altogether

x = 2
assign("y",3)
print(c(get("x"),y))

## [1] 2 3

rm(x)
# print(x)   # would now fail with an error, because x no longer exists
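These functions are mostly useful when the variable name is itself built by the program. The sketch below, using made-up names of the form square_1, square_2, and so on, creates several variables in a loop.

```r
# Create square_1, square_2, square_3 with names built by paste()
for (i in 1:3) {
  assign(paste("square", i, sep = "_"), i^2)
}

get("square_2")         # 4
square_1                # 1: the variables can also be used directly

rm(square_3)
exists("square_3")      # FALSE: exists() checks without raising an error
```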

Read Table

x = read.table('http://cs.ucf.edu/~wiegand/idc6700/datasets/color-cookbook-eg.txt', 
               header=TRUE)
head(x)

##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4

Learn More!

We’ll be learning a lot more about R as the semester progresses, including through a number of lab assignments. There is also plenty of online material, including:

The online Cookbook for R, http://www.cookbook-r.com
A simple article in Computer World introducing R, http://www.computerworld.com/article/2497143/business-intelligence-beginner-s-guide-to-r-introduction.html
Another introductory article in Computer World about getting your data into R, http://www.computerworld.com/article/2497164/business-intelligence/beginner-s-guide-to-r-get-your-data-into-r.html
The R-Bloggers site, http://www.r-bloggers.com
- This includes a two-part introduction to R series, http://rtutorialseries.blogspot.com/2009/10/r-tutorial-series-introduction-to-r_11.html

Or Check Out These Books

R Cookbook, by Paul Teetor

R Graphics Cookbook, by Winston Chang