Introduction to R

Katia Oleinik
koleinik@bu.edu

Helpful Links:


Login to SCC Cluster



Login: tuta30

Password: VizTut30



ssh -X scc4.bu.edu


cp -r /scratch/r-intro-2 .

R as a scientific calculator

Try executing the following commands in your Console Window:

> 2+3          # addition  
[1] 5
> 2^3          # power  
[1] 8
> log(2)       # built-in functions 
[1] 0.6931

Double Precision

By default R outputs 7 significant digits (single precision display). But all calculations are always done using double precision

> options(digits=15)   # change to double precision display
> exp(3)
[1] 20.0855369231877

Return back to the single precision output

> options(digits=7) 
> exp(3)
[1] 20.08554

Variables


> a <- 3
> A <- 7  # R is case sensetive
> 
> b = -5  

Both assignment operators are equivalent, the first one is more “traditional”.

Variables


Rules:

  • The name should contain only letters,digits, periods and underscores.
  • No other special characters ($, %, etc.) are allowed.
  • It should start with either a character or a period.
  • Avoid using names c, t, cat, F, T, D as those are built-in functions/constants

Variables

Let's create a few variables:

> str.var  <- "character variable"   
> num.var  <- 21.17                  
> bool.var <- TRUE                   
> comp.var <- 1-3i                   

To view the variable's value, either type in the variable name or use print() function:

> num.var  
[1] 21.17
> print(str.var)
[1] "character variable"

Variables

Check the mode of the variable (its type):

> mode(bool.var)
[1] "logical"
> mode(num.var)
[1] "numeric"
> mode(str.var)
[1] "character"

Data Types

R has a wide variety of data types including:

  • Vectors (numerical, character, boolean, complex)
  • Factors (numerical, character, boolean, complex)
  • Matrices
  • Data Frames
  • Lists

Vectors

Vector - an array of R objects of the same type.

To create a vector use function c() - concatinate:

> ( names <- c ("Alex", "Nick", "Mike") )
[1] "Alex" "Nick" "Mike"
> ( numbers <- c (21, -3, 7.25) )
[1] 21.00 -3.00  7.25

Note: If the R command is enclosed in parentheses, then after the command is executed the result is also printed to the screen.

Vectors

Vectors can be defined in a number of ways:

> ( vals <- c (2, -7, 5, 3, -1 ) )   
[1]  2 -7  5  3 -1
> ( vals <- 1:7 )                  
[1] 1 2 3 4 5 6 7
> ( vals <- seq(0, 3, by=0.5) )       
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0
> ( vals <- rep("o", 7) )            
[1] "o" "o" "o" "o" "o" "o" "o"

Vectors

Vectors can be defined in a number of ways:

> ( vals <- numeric(9) )
[1] 0 0 0 0 0 0 0 0 0
> ( vals <- rnorm(5,2,1.5 ) )
[1]  2.481861  2.439406  3.559156 -1.742139  3.746586
> ( vals <- rpois(5,4 )  )
[1] 2 7 3 5 8

Vectors


Vector elements can have labels:

> heights<-c(Alex=180, Bob=175, Clara=165, Don=185)
> heights
 Alex   Bob Clara   Don 
  180   175   165   185 

Vector arithmetics:

Do not use loops to perform operations on vectors. R operates on vectors!

> a <- 1:5
> b <- seq(2,10, by=2)
> 
> a+b
[1]  3  6  9 12 15
> b/a   
[1] 2 2 2 2 2

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> 
> x[2]     # returns second element        
[1] 145
> x[2:4]   # returns 2nd through 4th
[1] 145 958 456

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> x[c(1,3,5)]  # returns 1st, 3rd and 5th elemetns
[1] 734 958 924
> x[-2]        # returns all but 2nd element
[1] 734 958 456 924

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> x[c(TRUE, TRUE, FALSE, FALSE, TRUE)]   # returns 1st, 2nd, and 5th elements
[1] 734 145 924

Vector slicing (subsetting)

You can access particular elements in the vector in the following manner:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> # returns only those elements that less than 500
> x[x<500]     
[1] 145 456

Vector Functions


max(x), min(x), sum(x)

mean(x), median(x), range(x)

var(x) , cor(x,y)

sort(x), rank(x), order(x)

cumsum(), cumprod(x), cummin(x), cumprod(x)

duplicated(x), unique(x)

Vector Functions


Example:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> mean(x)
[1] 643.4
> sort(x)
[1] 145 456 734 924 958

Vector Functions


Usefule vector function:

> # define a numeric vector
> x <- c(734, 145, 958, 456, 924) 
> 
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  145.0   456.0   734.0   643.4   924.0   958.0 

Missing Values


To define a missing value, use NA:

> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)    
> 
> # Check if the element in the vector is missing
> is.na(x)              
[1] FALSE FALSE  TRUE FALSE  TRUE
> # Which elements in the vector are missing
> which(is.na(x))       
[1] 3 5

Missing Values


Missing value cannot be compared to anything:

> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)    
> 
> x == NA   # this does not work !     
[1] NA NA NA NA NA
> # Use is.na() instead
> is.na(x)              
[1] FALSE FALSE  TRUE FALSE  TRUE

Factors


Factor is a special type of a vector that stores “categorical” variables. To convert a vector into the factor use factor() function

> x <- c(0, 1, 1, 1, 0, 0, 1, 0)    
> x <- factor(x)
> table(x)
x
0 1 
4 4 

Factors


Each level in the factor variable can be named

> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))   
> 
> table(x)
x
   Fail Success 
      5       3 

Factors


Factors are treated differently by the summary() function:

> # define a numeric vector
> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))   
> 
> summary(x)
   Fail Success 
      5       3 

Matrices


Matrix is a 2 dimentional array of elements of the same type:

> matr <- matrix( c(1,2,3,4,5,6) , ncol=2)
> matr
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Note: R matrix is “column-major”.

Matrices


Matrix is a 2 dimentional array of elements of the same type:

> matr <- matrix( c(1,2,3,4,5,6) , nrow=2) 
> matr
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices


You can also convert array into a matrix:

> # First define the vector
> matr <- c(1,2,3,4,5,6) 
> # Then change dimensions
> dim(matr) <- c(2,3)      
> 
> matr
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Matrices


You can fill matrix by-row

> matr <- matrix( c(1,2,3,4,5,6) , ncol=2, byrow=TRUE)
> matr
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Matrix Operations


> smatr = matrix( c(1,-3, 2, 5) , ncol=2) 
> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # transpose matrix
> t(smatr) 
     [,1] [,2]
[1,]    1   -3
[2,]    2    5

Matrix Operations


> smatr = matrix( c(1,-3, 2, 5) , ncol=2) 
> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # Inverse matrix
> solve(smatr)   
          [,1]        [,2]
[1,] 0.4545455 -0.18181818
[2,] 0.2727273  0.09090909

Matrix Operations


> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # product of matricies elements
> smatr*smatr    
     [,1] [,2]
[1,]    1    4
[2,]    9   25

Matrix Operations


> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # matrix product
> smatr %*% smatr  
     [,1] [,2]
[1,]   -5   12
[2,]  -18   19

Matrix Operations


> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # inverse of each element of the matrix
> smatr^(-1)     
           [,1] [,2]
[1,]  1.0000000  0.5
[2,] -0.3333333  0.2

Matrix Operations

Some useful matrix functions:
colMeans(); rowMeans(); colSums(); rowSums()

> smatr
     [,1] [,2]
[1,]    1    2
[2,]   -3    5
> # inverse of each element of the matrix
> rowSums(smatr)    
[1] 3 2

R help


Access help file for the R function:

> ?matrix


> help(matrix)

R help


You can search for help using help.search() function:

> help.search("matrix")



Or using two question marks:

> ??matrix

R help


Get arguments of a function

> args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 
NULL

R help


Examples of function usage

> example(matrix)

matrix> is.matrix(as.matrix(1:10))
[1] TRUE

matrix> !is.matrix(warpbreaks)  # data.frame, NOT matrix!
[1] TRUE

matrix> warpbreaks[1:10,]
   breaks wool tension
1      26    A       L
2      30    A       L
3      54    A       L
4      25    A       L
5      70    A       L
6      52    A       L
7      51    A       L
8      26    A       L
9      67    A       L
10     18    A       M

matrix> as.matrix(warpbreaks[1:10,])  # using as.matrix.data.frame(.) method
   breaks wool tension
1  "26"   "A"  "L"    
2  "30"   "A"  "L"    
3  "54"   "A"  "L"    
4  "25"   "A"  "L"    
5  "70"   "A"  "L"    
6  "52"   "A"  "L"    
7  "51"   "A"  "L"    
8  "26"   "A"  "L"    
9  "67"   "A"  "L"    
10 "18"   "A"  "M"    

matrix> ## Example of setting row and column names
matrix> mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
matrix+                dimnames = list(c("row1", "row2"),
matrix+                                c("C.1", "C.2", "C.3")))

matrix> mdat
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

Clean current R session


Check variables in the current session

> objects()
 [1] "a"        "A"        "b"        "bool.var" "comp.var" "heights" 
 [7] "matr"     "mdat"     "names"    "num.var"  "numbers"  "smatr"   
[13] "str.var"  "vals"     "x"       

Or

> ls()
 [1] "a"        "A"        "b"        "bool.var" "comp.var" "heights" 
 [7] "matr"     "mdat"     "names"    "num.var"  "numbers"  "smatr"   
[13] "str.var"  "vals"     "x"       

Clean current R session


Remove array x from the memory

> rm(x)


Remove everything from your working enviroment

> rm(list=ls())

Running R script in a batch mode



R CMD BATCH Rprog.R

Rscript Rprog.R

R -q –vanilla < Rprog.R

Data Frames


A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). It is similar to SAS and SPSS datasets.

> names <- c("Alex", "Bob", "Cat")
> ages <- c(12,5,7)
> sex <- c("M","M","F")
> kids <- data.frame(Names=names,Ages=ages,Sex=sex)
> kids
  Names Ages Sex
1  Alex   12   M
2   Bob    5   M
3   Cat    7   F

Data Frames

Summary function will recognize each variable type:

> kids
  Names Ages Sex
1  Alex   12   M
2   Bob    5   M
3   Cat    7   F
> summary(kids)
  Names        Ages      Sex  
 Alex:1   Min.   : 5.0   F:1  
 Bob :1   1st Qu.: 6.0   M:2  
 Cat :1   Median : 7.0        
          Mean   : 8.0        
          3rd Qu.: 9.5        
          Max.   :12.0        

Data Frames


  1. Read the data from a file
  2. Clean the dataset if necessary, check for missing values
  3. Get summary of the data
  4. Use R graphics to explore each variable in the dataset
  5. Perform statistical analysis

Data Frames


Read the dataframe

> dt <- read.csv("USA_states.csv")   



See IO.R for more examples on reading a dataset.

Explore Data Frames


Look at the first fiew records

> head(dt)
       State Location Population Total.Area..km.2. House.seats
1    Alabama       SE    4822023            135767           7
2     Alaska       NW     731449           1723337           1
3    Arizona       SW    6553255            295233           9
4   Arkansas        S    2949131            137733           4
5 California       SW   38041430            423968          53
6   Colorado        C    5187582            269602           7
  Total.AP.exams Traffic.fatalities Driver.s.Ed.Required
1          39916                1.5                    0
2           4570                1.3                    0
3          53138                1.3                    0
4          43760                1.8                    0
5         632571                1.0                    1
6          67460                1.0                    1
  Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
1                            17.05                         18.20
2                            25.88                         30.34
3                            19.80                         25.82
4                            15.98                         18.24
5                            22.07                         28.60
6                               NA                            NA
          Statehood
1 December 14, 1819
2   January 3, 1959
3 February 14, 1912
4     June 15, 1836
5 September 9, 1850
6    August 1, 1876

Explore Data Frames


Look at the last fiew records

> tail(dt)
           State Location Population Total.Area..km.2. House.seats
45       Vermont       NE     626011             24905           1
46      Virginia        E    8185867            110787          11
47    Washington       NW    6897012            184661          10
48 West Virginia        E    1855413             62755           3
49     Wisconsin        N    5726398            169634           8
50       Wyoming        C     576412            253335           1
   Total.AP.exams Traffic.fatalities Driver.s.Ed.Required
45           6433                1.0                    1
46         149918                0.9                    1
47          75915                0.9                    1
48          11486                1.8                    0
49          61335                1.0                    1
50           2050                1.4                    0
   Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
45                            21.22                         25.89
46                               NA                            NA
47                            20.34                         21.17
48                            20.55                         21.25
49                            15.53                         22.86
50                            15.92                         28.04
           Statehood
45     March 4, 1791
46     June 25, 1788
47 November 11, 1889
48     June 20, 1863
49      May 29, 1848
50     July 10, 1890

Explore Data Frames


Get the names of the columns (variables)

> names(dt)        
 [1] "State"                            "Location"                        
 [3] "Population"                       "Total.Area..km.2."               
 [5] "House.seats"                      "Total.AP.exams"                  
 [7] "Traffic.fatalities"               "Driver.s.Ed.Required"            
 [9] "Deaths.per.100K.from.lung.cancer" "X..Cigarettes.sold.per.capita"   
[11] "Statehood"                       

Explore Data Frames


Get the number of the observations and the number of the variables

> dim(dt)
[1] 50 11

Explore Data Frames


> nrow(dt)    # number of rows
[1] 50
> ncol(dt)    # number of columns
[1] 11

Explore Data Frames


Get general summary of each variable

> summary(dt)      
        State       Location    Population       Total.Area..km.2.
 Alabama   : 1   C      :12   Min.   :  576412   Min.   :   4002  
 Alaska    : 1   NE     : 8   1st Qu.: 1855441   1st Qu.:  96909  
 Arizona   : 1   E      : 7   Median : 4491154   Median : 147872  
 Arkansas  : 1   N      : 7   Mean   : 6265714   Mean   : 196667  
 California: 1   S      : 6   3rd Qu.: 6834295   3rd Qu.: 219022  
 Colorado  : 1   NW     : 4   Max.   :38041430   Max.   :1723337  
 (Other)   :44   (Other): 6                                       
  House.seats    Total.AP.exams   Traffic.fatalities Driver.s.Ed.Required
 Min.   : 1.00   Min.   :  2050   Min.   :0.600      Min.   :0.00        
 1st Qu.: 3.00   1st Qu.: 11550   1st Qu.:0.925      1st Qu.:0.00        
 Median : 6.00   Median : 40581   Median :1.200      Median :1.00        
 Mean   : 8.70   Mean   : 76296   Mean   :1.222      Mean   :0.56        
 3rd Qu.: 9.75   3rd Qu.: 92869   3rd Qu.:1.400      3rd Qu.:1.00        
 Max.   :53.00   Max.   :632571   Max.   :2.000      Max.   :1.00        

 Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
 Min.   :12.01                    Min.   :12.12                
 1st Qu.:16.13                    1st Qu.:18.09                
 Median :20.32                    Median :21.21                
 Mean   :19.48                    Mean   :21.62                
 3rd Qu.:22.78                    3rd Qu.:25.34                
 Max.   :26.48                    Max.   :33.60                
 NA's   :8                        NA's   :8                    
            Statehood 
 November 2, 1889: 2  
 April 28, 1788  : 1  
 April 30, 1812  : 1  
 August 1, 1876  : 1  
 August 10, 1821 : 1  
 August 21, 1959 : 1  
 (Other)         :43  

Explore Data Frames


Display the structure of the dataset - type of each variable and list the first few values

> str(dt)     
'data.frame':   50 obs. of  11 variables:
 $ State                           : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Location                        : Factor w/ 8 levels "C","E","N","NE",..: 7 5 8 6 8 1 4 2 7 2 ...
 $ Population                      : int  4822023 731449 6553255 2949131 38041430 5187582 3590347 917092 19317568 9919945 ...
 $ Total.Area..km.2.               : num  135767 1723337 295233 137733 423968 ...
 $ House.seats                     : int  7 1 9 4 53 7 5 1 27 14 ...
 $ Total.AP.exams                  : int  39916 4570 53138 43760 632571 67460 56838 11081 344996 141528 ...
 $ Traffic.fatalities              : num  1.5 1.3 1.3 1.8 1 1 0.7 1.3 1.3 1.2 ...
 $ Driver.s.Ed.Required            : int  0 0 0 0 1 1 0 1 1 0 ...
 $ Deaths.per.100K.from.lung.cancer: num  17.1 25.9 19.8 16 22.1 ...
 $ X..Cigarettes.sold.per.capita   : num  18.2 30.3 25.8 18.2 28.6 ...
 $ Statehood                       : Factor w/ 49 levels "April 28, 1788",..: 9 21 16 30 49 3 24 14 37 18 ...

Explore Data Frames

An example of R graphics

> hist(dt$Traffic.fatalities)

plot of chunk unnamed-chunk-54

Explore Data Frames


Read the dataset using option without converting string variables as factors

> dt <- read.csv("USA_states.csv", stringsAsFactors=FALSE)   
> 
> #Edit the names of this dataset 
> names(dt) <- c("state", "location", "pop", "area", "house","ap", "traffic","ed","cancer","cigar","date")
> names(dt)
 [1] "state"    "location" "pop"      "area"     "house"    "ap"      
 [7] "traffic"  "ed"       "cancer"   "cigar"    "date"    

Explore Data Frames


Convert location & education variables to be a factor variables

> dt$location <- factor(dt$location)
> dt$ed <- factor(dt$ed)
> str(dt)   # display the structure of the dataset
'data.frame':   50 obs. of  11 variables:
 $ state   : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ location: Factor w/ 8 levels "C","E","N","NE",..: 7 5 8 6 8 1 4 2 7 2 ...
 $ pop     : int  4822023 731449 6553255 2949131 38041430 5187582 3590347 917092 19317568 9919945 ...
 $ area    : num  135767 1723337 295233 137733 423968 ...
 $ house   : int  7 1 9 4 53 7 5 1 27 14 ...
 $ ap      : int  39916 4570 53138 43760 632571 67460 56838 11081 344996 141528 ...
 $ traffic : num  1.5 1.3 1.3 1.8 1 1 0.7 1.3 1.3 1.2 ...
 $ ed      : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
 $ cancer  : num  17.1 25.9 19.8 16 22.1 ...
 $ cigar   : num  18.2 30.3 25.8 18.2 28.6 ...
 $ date    : chr  "December 14, 1819" "January 3, 1959" "February 14, 1912" "June 15, 1836" ...

Dataframe Slicing


> # element from the 3rd row, 5th column
> dt[3,5] 
> 
> # all elements from the 3rd row
> dt[3,] 
> 
> # all elements in the 5th column
> dt[ ,5]  
> 
> # same: 5th variable in the dataframe
> dt[[5]]    

Dataframe Slicing


> # accessing elemnts using variable name
> dt$area  
> 
> # list all the rows in the dataset for which variable ed is equal to 0
> dt[dt$ed == 0, ]  


Logical operators include:
is equal? (==); not equal? (!=);
greater? (>); greater than or equal? (>=);
less? (>); less than or equal? (>=);

Explore the Data Frame

Check for missing values

> # how many values are missing:
> sum(is.na(dt$cigar))
[1] 8
> # which obs. have missing values:
> which(is.na(dt$cigar))
[1]  6  9 10 11 29 33 37 46

Summaries for continues variables

> summary(dt$cigar)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  12.12   18.10   21.21   21.62   25.34   33.60       8 

Summaries for continues variables



> # does not work... because of missing variables
> mean(dt$cigar)     
[1] NA



> mean(dt$cigar, na.rm=TRUE)     
[1] 21.62381

Summaries for continues variables



> median(dt$cigar, na.rm=TRUE) 
> quantile(dt$cigar, na.rm=TRUE)    
> 
> # standard deviation
> sd(dt$cigar, na.rm=TRUE)         
> 
> # Tukey min, lower-hinge, median, upper-hindge, max
> fivenum(dt$cigar, na.rm=TRUE)    
> 
> max(dt$cigar, na.rm=TRUE) 
> min(dt$cigar, na.rm=TRUE) 

Graphics exploratory Analysis



> hist(dt$cigar)                  

plot of chunk unnamed-chunk-64

Graphics exploratory Analysis



> boxplot(dt$cigar, horizontal=TRUE)               # boxplot

plot of chunk unnamed-chunk-65

Graphics exploratory Analysis



> # barplot for categorical variable
> barplot(table(dt$location))

plot of chunk unnamed-chunk-66

Statistical Analysis


Summary information by groups

> tapply(dt$ap, dt$location, mean)
        C         E         N        NE        NW         S        SE 
 45101.17  92739.43  46612.43  68508.25  29091.50  82726.00 142052.67 
       SW 
237062.67 

Statistical Analysis


Correlation between 2 continious variables:

> cor(dt$house, dt$pop)   # strong correlation  
[1] 0.9996195
> plot(dt$house ~ dt$pop)     

plot of chunk unnamed-chunk-68

Statistical Analysis


Correlation between 2 continious variables:

> cor(dt$house, dt$area)     # weak correlation 
[1] 0.1194445
> plot(dt$house ~ dt$area)    

plot of chunk unnamed-chunk-69

Statistical Analysis


Search for all correlated data in the table at once

> pairs(dt[,c(3,4,5,6,9,10)])

plot of chunk unnamed-chunk-70

Statistical Analysis


Linear Model (fit a line)

> ( lm.fit <- lm(dt$ap ~dt$pop) )

Call:
lm(formula = dt$ap ~ dt$pop)

Coefficients:
(Intercept)       dt$pop  
 -2.313e+04    1.587e-02  

Statistical Analysis


Linear Model (fit a line)

> # Use data argument to specify the dataset name
> ( lm.fit <- lm(ap ~ pop, data=dt) )  

Call:
lm(formula = ap ~ pop, data = dt)

Coefficients:
(Intercept)          pop  
 -2.313e+04    1.587e-02  

Statistical Analysis


Linear Model (fit a line)

> # Use with function to specify the dataset name
> ( lm.fit <- with(dt, lm(ap ~ pop) ) )  # same command using data option

Call:
lm(formula = ap ~ pop)

Coefficients:
(Intercept)          pop  
 -2.313e+04    1.587e-02  

Statistical Analysis


Summarize the result of linear modeling:

> summary(lm.fit)   

Call:
lm(formula = ap ~ pop)

Residuals:
   Min     1Q Median     3Q    Max 
-74287 -12769   4817  14302  61586 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.313e+04  5.129e+03   -4.51 4.18e-05 ***
pop          1.587e-02  5.490e-04   28.90  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26900 on 48 degrees of freedom
Multiple R-squared:  0.9457,    Adjusted R-squared:  0.9445 
F-statistic: 835.4 on 1 and 48 DF,  p-value: < 2.2e-16

Statistical Analysis


Access elements of the output

> # access coefficients
> coef(lm.fit)   
  (Intercept)           pop 
-2.313141e+04  1.586852e-02 
> resid(lm.fit)  # access residual errors
         1          2          3          4          5          6 
-13470.971  16094.395 -27721.065  20093.058  52041.117   8272.147 
         7          8          9         10         11         12 
 22995.907  19659.515  61586.145   7244.538  12778.459   7532.564 
        13         14         15         16         17         18 
-14077.865 -12829.423  -8086.854  -7324.639   4794.695 -29565.833 
        19         20         21         22         23         24 
 16090.096  52478.089  11625.923 -44104.912   2481.421 -15202.956 
        25         26         27         28         29         30 
-38180.643  12049.307   4803.969   4830.251  12769.566  -6173.537 
        31         32         33         34         35         36 
  3402.003 -42417.720 -24302.581  14324.347 -60902.386 -10389.148 
        37         38         39         40         41         42 
-12587.562 -74287.050  16288.827 -10581.096  14233.313 -34736.628 
        43         44         45         46         47         48 
  7740.357  13543.223  19630.540  43151.794 -10398.981   5174.747 
        49         50 
 -6403.066  16034.603 

Statistical Analysis


Access elements of the output

> fitted(lm.fit) # predicted values for dt$ap
         1          2          3          4          5          6 
 53386.971 -11524.395  80859.065  23666.942 580529.883  59187.853 
         7          8          9         10         11         12 
 33842.093  -8578.515 283409.855 134283.462  -1037.459   2190.436 
        13         14         15         16         17         18 
181179.865  80606.423  25714.854  22663.639  46379.305  49893.833 
        19         20         21         22         23         24 
 -2039.096  70247.911  82333.077 133702.912  62227.579  24234.956 
        25         26         27         28         29         30 
 72428.643  -7181.307   6313.031  20648.749  -2173.566 117536.537 
        31         32         33         34         35         36 
  9962.997 287419.720 131619.581 -12029.347 160058.386  37404.148 
        37         38         39         40         41         42 
 38745.562 179407.050  -6464.827  51827.096  -9907.313  79319.628 
        43         44         45         46         47         48 
390389.643  22177.777 -13197.540 106766.206  86313.981   6311.253 
        49         50 
 67738.066 -13984.603 

Statistical Analysis


Plot linear model:

> plot(dt$ap~dt$pop)
> abline (lm.fit)

plot of chunk unnamed-chunk-77

Classical Tests


Tests for normality:

> # if the variable is normally distributed, the graph should be close to the straight line
> qqnorm(dt$cigar)  
> qqline(dt$cigar, lty=2)

plot of chunk unnamed-chunk-78

Classical Tests


> # p-value is large so we cannot reject the hypothesis that the data
> # is normally distributed
> shapiro.test(dt$cigar)  

    Shapiro-Wilk normality test

data:  dt$cigar
W = 0.9846, p-value = 0.8346

Classical Tests


Tests for normality:

> # This time the variable is not normally distributed
> qqnorm(dt$area)  
> qqline(dt$area, lty=2)

plot of chunk unnamed-chunk-80

Classical Tests


> # p-value is less than 0.05, so we can reject the null-hypothesis
> shapiro.test(dt$area) 

    Shapiro-Wilk normality test

data:  dt$area
W = 0.5198, p-value = 1.485e-11

Classical Tests


Fisher F test: compare if 2 sample variences are significantly different

> # p-value (large) and 95-confidence interval (includes 1) suggests
> # that we cannot reject the hypothesis that variences are the same for 2 samples
> var.test (dt$traffic[dt$ed ==0], dt$traffic[dt$ed==1] ) 

    F test to compare two variances

data:  dt$traffic[dt$ed == 0] and dt$traffic[dt$ed == 1]
F = 0.8706, num df = 21, denom df = 27, p-value = 0.7528
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3892307 2.0327402
sample estimates:
ratio of variances 
         0.8706086 

Classical Tests


Comparing 2 means: Student's t test

> # we reject null-hypothesis based on the confidence interval and p value
> t.test (dt$traffic[dt$ed ==0], dt$traffic[dt$ed==1] ) 

    Welch Two Sample t-test

data:  dt$traffic[dt$ed == 0] and dt$traffic[dt$ed == 1]
t = 2.5475, df = 46.532, p-value = 0.01422
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04802005 0.40912281
sample estimates:
mean of x mean of y 
 1.350000  1.121429 

Classical Tests


Paired t test:
Let's simulate another variable “number of trafic fatalities after after some law was implemented

> dt$traffic.after <- dt$traffic - rnorm(length(dt$traffic), mean=0.2, sd=.2)
> ( t.result<- t.test (dt$traffic, dt$traffic.after, paired=TRUE ) )

    Paired t-test

data:  dt$traffic and dt$traffic.after
t = 5.5731, df = 49, p-value = 1.056e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.09599496 0.20426285
sample estimates:
mean of the differences 
              0.1501289 


As we expected the difference in means is not equal to 0 based on 95 percent conf. interval and p value

Lists


Let's examine the output of the function

> str(t.result)
List of 9
 $ statistic  : Named num 5.57
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 49
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 1.06e-06
 $ conf.int   : atomic [1:2] 0.096 0.204
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 0.15
  ..- attr(*, "names")= chr "mean of the differences"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ alternative: chr "two.sided"
 $ method     : chr "Paired t-test"
 $ data.name  : chr "dt$traffic and dt$traffic.after"
 - attr(*, "class")= chr "htest"

Lists


The output of this function is a list, we can access each individual element of the list by name

> t.result$p.value
[1] 1.056408e-06
> t.result$conf.int
[1] 0.09599496 0.20426285
attr(,"conf.level")
[1] 0.95


We can also access list elements by the index:

> t.result[[3]]
[1] 1.056408e-06

Statistical Models



y ~ x - regression

y ~ x-1 - regression through the origin

y ~ x + z - multiple regression

y ~ x * z - multiple regression with interaction

y ~ xc - one way ANOVA (xc is a categorical variable)

y ~ xc1 + xc2 - two way ANOVA (xc1, xc2) …

Statistical Models


> lm.fit <- lm(dt$ap ~ dt$pop)
> summary(lm.fit )

Call:
lm(formula = dt$ap ~ dt$pop)

Residuals:
   Min     1Q Median     3Q    Max 
-74287 -12769   4817  14302  61586 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.313e+04  5.129e+03   -4.51 4.18e-05 ***
dt$pop       1.587e-02  5.490e-04   28.90  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26900 on 48 degrees of freedom
Multiple R-squared:  0.9457,    Adjusted R-squared:  0.9445 
F-statistic: 835.4 on 1 and 48 DF,  p-value: < 2.2e-16

Statistical Models


Now we can perform the analysis of variance

> anova(lm.fit )
Analysis of Variance Table

Response: dt$ap
          Df     Sum Sq    Mean Sq F value    Pr(>F)    
dt$pop     1 6.0464e+11 6.0464e+11   835.4 < 2.2e-16 ***
Residuals 48 3.4741e+10 7.2377e+08                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


> qf(0.95,1,48)   # density function with 48 degrees of freedom
[1] 4.042652

Statistical Models


Plot the results:

> par(mfrow=c(2,2))  # display 4 graphs at once
> plot(lm.fit )

plot of chunk unnamed-chunk-91

Multiple regression


Suppose Y = dt$ap - number of AP tests in the state
Predictors are:
X1 = dt$pop - population
X2 = dt$area - land area
X3 = dt$cigar - cigarets sold

Y = FUN(X1, X2, X3) + err

> lm.fit <- lm(dt$ap ~ dt$pop + dt$area + dt$cigar)
> summary(lm.fit)

Call:
lm(formula = dt$ap ~ dt$pop + dt$area + dt$cigar)

Residuals:
   Min     1Q Median     3Q    Max 
-68810  -8076   5554  14423  63597 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.226e+04  1.787e+04  -2.364   0.0233 *  
dt$pop       1.533e-02  5.852e-04  26.200   <2e-16 ***
dt$area      1.173e-02  1.531e-02   0.766   0.4483    
dt$cigar     8.021e+02  8.333e+02   0.963   0.3418    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26240 on 38 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.9519,    Adjusted R-squared:  0.9481 
F-statistic: 250.7 on 3 and 38 DF,  p-value: < 2.2e-16

Thank you!

For help email: koleinik@bu.edu or help@scc.bu.edu