Summer 2020

Why Are We Using R?

  • R is open source and freely available
  • Works on all standard platforms (Mac, Windows, Linux)
  • Standard tool for many sciences (e.g., Computer Science, Physics)
  • Easy to produce web-based slides using:
  • Traditional statistics tool with GUI front ends, as well as a flexible and powerful programming language
  • Together with the ggplot2 package, R can produce very nice data visualizations

Install the Latest R

The latest version of R as of this presentation is 3.2.3

  1. Go to http://cran.r-project.org/mirrors.html
  2. Pick a mirror site (e.g., http://watson.nci.nih.gov/cran_mirror/)
  3. Select the Download for … link that matches your platform
  4. Follow the download and install prompts

If you are using Windows, choose the 64-bit version unless you know you are running a 32-bit version of Windows

Install the Latest R Studio

The R-Studio Console

  • One can use the Console in R Studio to enter commands, functions, and data to perform statistical processes
  • The Console is located in the bottom left of the R Studio screen
  • When you see a grey box with code in it, you can copy and paste this into the Console and match the results to what the slides say under the grey box
  • For example:
print("Statistics Rocks (roughly speaking)!")
## [1] "Statistics Rocks (roughly speaking)!"
  • Later we’ll talk about using scripts to do more complicated things

Install Some Helpful Packages

  1. Run R Studio
  2. In the Console, type or copy the following:
install.packages(c("ggplot2","gcookbook","UsingR"))

If you are confused about how any function works in R, just type ?<functionname> for help – for example:

?install.packages

Test These Packages

library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(gcookbook)

Test These Packages, p.2

ggplot(cabbage_exp, aes(x=Date, y=Weight, fill=Cultivar)) + 
  geom_bar(position="dodge",stat="identity")

Many Built-In Distributions

  • R comes with many standard distributions
    • Continuous: Exponential (exp), Normal (norm), Student \(t\) (t), Uniform (unif), Chi-Squared (chisq), etc.
    • Discrete: Binomial (binom), Geometric (geom), Poisson (pois), etc.
  • We can make use of a number of operations on distributions
    • r<dist>() to draw random numbers
    • q<dist>() to get quantiles (the value below which a given probability lies)
    • p<dist>() to get cumulative probabilities (the probability of being at or below a given value)
    • d<dist>() to get density information
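
As a rough illustration of the four prefixes, here is a small sketch using the normal distribution (norm). The choice of distribution, the seed, and the input values are our own assumptions, not part of the original slides.

```r
# The four prefixes applied to the standard normal distribution (norm)
set.seed(42)            # hypothetical seed, so the random draw is repeatable
draws <- rnorm(5)       # r: five random numbers
p <- pnorm(1.96)        # p: P(X <= 1.96), a cumulative probability
q <- qnorm(0.975)       # q: the value with 97.5% of the probability below it
d <- dnorm(0)           # d: height of the density curve at 0
print(c(p, q, d))
```

Note that p<dist>() and q<dist>() undo each other: qnorm(pnorm(1.96)) gives back 1.96.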

Drawing Random Numbers

Seven random numbers from a binomial distribution, with \(n=100, p=0.25\)

rbinom(7,size=100,prob=0.25)
## [1] 28 33 32 20 28 27 31

Three random numbers \(\sim N(5, 0.1)\), where 0.1 is the standard deviation

rnorm(3,mean=5,sd=0.1)
## [1] 4.902377 4.940091 4.987001
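
Your numbers will differ from the ones shown, because each call draws fresh random values. If you want repeatable draws, one common approach is to set the seed first. A minimal sketch (the seed value 2020 is arbitrary and hypothetical):

```r
set.seed(2020)                            # any integer works as a seed
a <- rbinom(7, size = 100, prob = 0.25)
set.seed(2020)                            # reset to the same seed
b <- rbinom(7, size = 100, prob = 0.25)
identical(a, b)                           # TRUE: same seed, same draws
```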

Histograms, Ah-Hoy!

qplot(rexp(100000,rate=2),binwidth=0.25)

Get Some Data!

  • Let’s get Edgar Anderson’s dataset on Iris flowers
  • This dataset is stored online as a comma separated file
  • We can read such files directly using read.csv()
  • We can also get a summary of all the variables in this dataset, as well as their quartile information (for numeric data) and level counts (for categorical data)
  • With the correct package, there are commands to read many data formats
    • read.xlsx() for Microsoft Excel
    • read.spss() for SPSS
    • read.xport() for SAS
    • Etc.

Try This:

irisData = read.csv(
  "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv",
  stringsAsFactors = TRUE)  # needed in R >= 4.0 so Species is read as a factor
summary(irisData)
##        X           Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500

Deal with a Specific Column

  • Use <dataset-name>$<variable-name> to get the data from a specific column
  • Compute basic statistics on columns, e.g., mean()
mean(irisData$Sepal.Length)
## [1] 5.843333
  • Get all the levels of a categorical variable:
levels(irisData$Species)
## [1] "setosa"     "versicolor" "virginica"
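
Other summary functions work on columns the same way. The sketch below uses R's built-in copy of the iris data (an assumption, so that it runs without a network connection) and adds tapply() to compute a statistic separately for each level of a categorical variable.

```r
# Built-in copy of the iris data; column names match the downloaded file
data(iris)
sd(iris$Sepal.Length)       # standard deviation of one column
median(iris$Sepal.Length)   # median of one column
# Mean sepal length computed separately for each species
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means
```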

Confidence Intervals, the Hard Way

A 95% confidence interval of a mean using the standard normal distribution:

\[ \bar{x} \pm z^\star \cdot \frac{s}{\sqrt{n}} \]

n  = nrow(irisData)                  # Sample size
xb = mean(irisData$Sepal.Length)     # x_bar
s  = sd(irisData$Sepal.Length)       # Standard deviation
z  = qnorm(0.975)                    # Critical value for 95%
xb - z*s/sqrt(n)                     # Lower bound
xb + z*s/sqrt(n)                     # Upper bound
## [1] 5.710818
## [1] 5.975849

Confidence Intervals, the Easy Way

R comes with an easy way to get confidence intervals using the Student \(t\) distribution:

t.test(irisData$Sepal.Length)$conf.int
## [1] 5.709732 5.976934
## attr(,"conf.level")
## [1] 0.95
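
To see where these numbers come from, the same interval can be sketched by hand using the \(t\) critical value in place of \(z^\star\). This sketch assumes the built-in iris data, which holds the same Sepal.Length values.

```r
x     <- iris$Sepal.Length          # built-in copy of the column (assumption)
n     <- length(x)
se    <- sd(x) / sqrt(n)            # standard error of the mean
tstar <- qt(0.975, df = n - 1)      # t critical value for 95% confidence
ci    <- mean(x) + c(-1, 1) * tstar * se
ci                                  # lower bound, then upper bound
```

The interval is slightly wider than the one based on the normal distribution, because the \(t\) distribution has heavier tails.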

For a normal distribution, we load the UsingR package:

library(UsingR)
simple.z.test(irisData$Sepal.Length,
              sigma=sqrt(var(irisData$Sepal.Length)))
## [1] 5.710818 5.975849

Fun Stuff with Lists

  • To get a sequence from 1 to 10 stepping by 1, use 1:10
  • To make your own list of things, use c()
  • To count the values in the list, use length()
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
c(3,-1.2,44,103.009)
## [1]   3.000  -1.200  44.000 103.009
length(c(3,-1.2,44,103.009))
## [1] 4

More Fun Stuff with Lists

  • To sum the values in a list of numbers, use sum()
  • To produce a running, cumulative sum of a list, use cumsum()
  • The same idea applies to product, prod() and cumprod()
sum(rnorm(10))   # Sum 10 numbers drawn from std. norm
## [1] -5.292483
cumsum(1:10)     # Running sum of numbers from 1:10
##  [1]  1  3  6 10 15 21 28 36 45 55
cumprod(c(2,4,10))
## [1]  2  8 80
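
These functions combine nicely. For instance, a running mean can be sketched by dividing the running sum by the number of values seen so far (the variable names here are our own):

```r
x <- c(2, 4, 10)
# seq_along(x) is 1, 2, 3: how many values are included at each step
running_mean <- cumsum(x) / seq_along(x)
running_mean
```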

What is an R Source File?

  • You can type commands into the console

  • But for anything even moderately sophisticated, you’ll probably want to create a source file

  • An R source file is a plain-text file containing R commands that can be run at any time

  • There are many advantages, not the least of which is being able to give the file to someone else to run

  • Also, you’ll need to turn in a source file for homework

Creating a Source File

  • You can use any text editor, but we’ll use R Studio

  • Go to File \(\rightarrow\) New File \(\rightarrow\) R Script

  • Type commands into the file in the order that you would like them to be executed
    • Don’t forget to include any library() loads
    • The user running this file next may not have that library loaded
  • Save the file when you are done
    • File \(\rightarrow\) Save As, then give the file a location and name that you like
    • Or just File \(\rightarrow\) Save if it already has a location and name
    • It is customary to give the file the extension .R or .r

Sourcing the File

To run the file in R Studio, either

  • Press the Source button in the top right of the editor panel

  • Or, in the console, type:

source('path/to/my/script.R')
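
If you want to try source() without creating a file by hand, here is a small sketch that writes a two-line script to a temporary location and then runs it. The file contents and the variable name total are hypothetical.

```r
script_path <- tempfile(fileext = ".R")     # throwaway file location
writeLines(c("total <- sum(1:10)",
             "print(total)"), script_path)  # write two commands to the file
source(script_path)                         # runs the file; prints 55
```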

Make Sure Your Script Runs in a Clean Environment

It is always a good idea to make sure your script will run with your environment completely clean

  1. From the top-right panel, select the Environment tab

  2. Select the little broom icon to clear the environment, then confirm with Yes in the dialog

  3. From the Session menu option, select Restart R and Clear Output

  4. Source your file

Turning in R Lab Assignments

  • You will be given several R lab assignments throughout the semester

  • Please submit these via git – there is a link in the assignment

  • I will source your file and compare its output to what I expect to see

  • I will also need any data files you used
    • If you refer to files on your local machine, I won’t be able to read them

Basic Data Types in R

Some Basic Data Types

  • numeric (numbers)
  • character (strings)
  • vectors (lists of one type)
  • factors (categorical variables)
  • lists (lists of arbitrary types)
  • matrices (2D arrays of one type, usually numeric)

Numeric Data

  • scalar numbers: integers and real values
  • arithmetic operations on these
x = 3
y = 4.7
x/y + 2*x - 3
## [1] 3.638298
y > x
## [1] TRUE

Character Data

  • strings
  • operations on strings
x = "hello"
nchar(x)
## [1] 5
gsub("he","HE-",x)
## [1] "HE-llo"

Concatenating Strings with paste()

x = "hello"
y = "world"
paste(x,y)
## [1] "hello world"
paste(x,y,"turtle",sep=':')
## [1] "hello:world:turtle"

Vectors

  • What I called a “list” last week is really a “vector” in R
  • R vectors aren’t just numeric
  • However, all elements of a vector must be the same type
  • Creating a vector with a mixture of numeric and character values will force R to coerce the vector to character
c(1,2,9)
## [1] 1 2 9
c(1,2,"9")
## [1] "1" "2" "9"
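
You can check what type a vector ended up with using class(), and convert back explicitly when you need numbers again. A short sketch (variable names are our own):

```r
mixed <- c(1, 2, "9")
class(mixed)                # "character": the numbers were coerced to strings
nums <- as.numeric(mixed)   # convert back to numbers explicitly
sum(nums)                   # 12
```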

Vector Ops: Scaling, Indexing, and Length

x = c(-4,2,31)
3*x
## [1] -12   6  93
x[1]
## [1] -4
length(x)
## [1] 3

Vector Ops: Element-by-Element Ops

x = c(-4,2,3)
y = c(1,3,9)
z = 4*(1:3)
x*y
## [1] -4  6 27
x*y - z
## [1] -8 -2 15

Vector Ops: Named Elements

x = c(-4,2,3)
names(x) <- c("Bob","Frank","Mindy")
x["Frank"]
## Frank 
##     2
x["Mindy"] = 0
print(x)
##   Bob Frank Mindy 
##    -4     2     0

Vector Ops: Pasting with Vectors

x = 1:5
paste("String",x,sep='-')
## [1] "String-1" "String-2" "String-3" "String-4" "String-5"

Vector Ops: Appending to a Vector

x = seq(from=2,to=10,by=2)
print(x)
## [1]  2  4  6  8 10
x = c(x,-99)
x
## [1]   2   4   6   8  10 -99
x[7] = -98
x
## [1]   2   4   6   8  10 -99 -98

Factors

  • A factor is like a vector, except for categorical values
  • They can be created from vectors
x = c("good","good","bad","mediocre","good")
factor(x)
## [1] good     good     bad      mediocre good    
## Levels: bad good mediocre
factor(c(1,2,1,1,2,2,3,3,2,2))
##  [1] 1 2 1 1 2 2 3 3 2 2
## Levels: 1 2 3
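
By default the levels are sorted alphabetically. If the categories have a natural order, you can supply it yourself, and table() will count the observations per level in that order. A sketch, assuming the ordering bad, mediocre, good:

```r
x <- c("good", "good", "bad", "mediocre", "good")
f <- factor(x, levels = c("bad", "mediocre", "good"))  # explicit level order
levels(f)
counts <- table(f)    # number of observations in each level
counts
```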

Lists Are Like Mixed-Type Vectors

x = list()
x[[1]] = 1
x[[2]] = "hello"
x[[3]] = c(-1,-2,-9)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] -1 -2 -9

List Operations: Named Elements

x = list(3,1,4,5)
names(x) <- c("A","B","C","D")
x[["A"]]
## [1] 3
x$A
## [1] 3
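
Names can also be given when the list is created, and the elements need not share a type. A brief sketch (the element names and values are hypothetical):

```r
x <- list(A = 3, B = "hello", C = c(-1, -2, -9))  # names given at creation
names(x)
x$C[2]      # second element of the vector stored under "C"
str(x)      # compact overview of each element and its type
```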

Matrix Operations: Creating & Indexing

A = 1:6
dim(A) <- c(2,3)
print(A)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
A[1,3]
## [1] 5
A[,2]
## [1] 3 4
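
Setting dim() is one way to build a matrix; the matrix() function is a common alternative. The sketch below suggests the two give the same result, and shows the byrow option for filling row by row instead.

```r
A <- 1:6
dim(A) <- c(2, 3)
B <- matrix(1:6, nrow = 2, ncol = 3)       # filled column by column, like A
identical(A, B)                            # TRUE
C <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row instead
C[1, ]                                     # 1 2 3
```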

Matrix Ops: Indexing Rows or Columns

A = 1:6
dim(A) <- c(2,3)
A[2,]; A[,3]
## [1] 2 4 6
## [1] 5 6
A[,2] = c(-99,-98)
print(A)
##      [,1] [,2] [,3]
## [1,]    1  -99    5
## [2,]    2  -98    6

Matrix Ops: Simple Arithmetic

A = 1:6
dim(A) <- c(2,3)
3*A
##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    6   12   18
A + A
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12

Matrix Ops: Transposition

A = 1:6
dim(A) <- c(2,3)
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
t(A)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

Matrix Ops: Multiplication

A = 1:6
dim(A) <- c(2,3)
A * A
##      [,1] [,2] [,3]
## [1,]    1    9   25
## [2,]    4   16   36
A %*% t(A)
##      [,1] [,2]
## [1,]   35   44
## [2,]   44   56

Dataframes

Data Frames

  • Basic data type for statistical operations
  • Implements a table
    • Variables stored in columns
    • Observations stored in rows
  • Typically columns are named
  • Values in each column must be the same type
    • factors
    • numeric vectors
    • character vectors
  • Different columns may have different types

Data Frames: Creating Data Frames

Student = c("Bob", "Sue", "Cat", "Lin")
NumberGrade = c(96, 82, 97, 74)
LetterGrade = factor(c("A","B","A","C"))
RosterData = data.frame(Student,NumberGrade,LetterGrade)
RosterData
##   Student NumberGrade LetterGrade
## 1     Bob          96           A
## 2     Sue          82           B
## 3     Cat          97           A
## 4     Lin          74           C

Data Frames: Getting Variable Data

rd = data.frame( Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A","B","A","C")) )
rd$NumberGrade
## [1] 96 82 97 74
rd$LetterGrade
## [1] A B A C
## Levels: A B C

Data Frames: Like Matrices, but Not

# rd from last slide
dim(rd)
## [1] 4 3
rd[2,]
##   Student NumberGrade LetterGrade
## 2     Sue          82           B
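
Rows can also be selected with a logical condition in place of a row number. A sketch, rebuilding rd so that it runs on its own (the cutoff of 90 is arbitrary):

```r
rd <- data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A", "B", "A", "C")))
top <- rd[rd$NumberGrade > 90, ]            # rows where the condition is TRUE
top$Student
rd[rd$LetterGrade == "A", "NumberGrade"]    # one column for the matching rows
```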

Data Frames: Summarizing

# rd from two slides ago
summary(rd)
##  Student  NumberGrade    LetterGrade
##  Bob:1   Min.   :74.00   A:2        
##  Cat:1   1st Qu.:80.00   B:1        
##  Lin:1   Median :89.00   C:1        
##  Sue:1   Mean   :87.25              
##          3rd Qu.:96.25              
##          Max.   :97.00

mode() versus class()

  • mode() describes how the data is stored
  • class() describes what class interprets that data
c(mode(4), mode("hello"))
## [1] "numeric"   "character"
x = as.Date("2014-01-22"); mode(x)
## [1] "numeric"
class(x)
## [1] "Date"
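
To see the stored number behind a Date, you can strip the class with unclass(). A small sketch:

```r
x <- as.Date("2014-01-22")
days <- unclass(x)    # the underlying number: days since 1970-01-01
days
x + 7                 # arithmetic works on that number; result is still a Date
```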

assign(), get(), and rm()

  • assign() allows you to assign values to a named variable
  • get() allows you to get the value from a named variable
  • rm() removes a named variable altogether
x = 2
assign("y",3)
print(c(get("x"),y))
## [1] 2 3
rm(x)
#print(x)
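
One place assign() and get() come in handy is when the variable name is built as a string. A sketch with hypothetical names v1, v2, v3, plus exists() to check whether a name is defined:

```r
for (i in 1:3) {
  assign(paste("v", i, sep = ""), i^2)   # creates v1, v2, v3
}
get("v3")                                # 9
exists("v4")                             # FALSE: no such variable was created
```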

Read Table

x = read.table('http://cs.ucf.edu/~wiegand/idc6700/datasets/color-cookbook-eg.txt', 
               header=TRUE)
head(x) 
##   cond1 cond2 yval
## 1     A     I  2.0
## 2     A     J  2.5
## 3     A     K  1.6
## 4     A     L  0.8
## 5     B     I  2.2
## 6     B     J  2.4
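
read.table() can also read from a string through its text argument, which is handy for trying things out offline. The inline data below is hypothetical, laid out like the file above:

```r
# Whitespace-separated values with a header row
txt <- "cond1 cond2 yval
A I 2.0
A J 2.5
B I 2.2"
x <- read.table(text = txt, header = TRUE)
str(x)           # column names and types
mean(x$yval)     # average of the numeric column
```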

Learn More!

We’ll be learning a lot more about R as the semester progresses, including through a number of lab assignments. There are also many materials available online.

Or Check Out These Books

R Cookbook, by Paul Teetor

R Graphics Cookbook, by Winston Chang