Katia Oleinik
koleinik@bu.edu
SCV examples: http://scv.bu.edu/examples/r/
Examples of tasks replicated in SAS and R: http://sas-and-r.blogspot.com/
Many examples of statistical analysis using R: http://www.ats.ucla.edu/stat/r/
R for SAS, Stata and SPSS users: http://scv.bu.edu/examples/r/tutorials/
Login: tuta30
Password: VizTut30
ssh -X scc4.bu.edu
cp -r /scratch/r-intro-2 .
Try executing the following commands in your Console Window:
> 2+3 # addition
[1] 5
> 2^3 # power
[1] 8
> log(2) # built-in functions
[1] 0.6931
By default R outputs 7 significant digits (single precision display). But all calculations are always done using double precision
> options(digits=15) # change to double precision display
> exp(3)
[1] 20.0855369231877
Return back to the single precision output
> options(digits=7)
> exp(3)
[1] 20.08554
> a <- 3
> A <- 7 # R is case sensetive
>
> b = -5
Both assignment operators are equivalent, the first one is more “traditional”.
Rules:
Let's create a few variables:
> str.var <- "character variable"
> num.var <- 21.17
> bool.var <- TRUE
> comp.var <- 1-3i
To view the variable's value, either type in the variable name or use print() function:
> num.var
[1] 21.17
> print(str.var)
[1] "character variable"
Check the mode of the variable (its type):
> mode(bool.var)
[1] "logical"
> mode(num.var)
[1] "numeric"
> mode(str.var)
[1] "character"
R has a wide variety of data types including:
Vector - an array of R objects of the same type.
To create a vector use function c() - concatinate:
> ( names <- c ("Alex", "Nick", "Mike") )
[1] "Alex" "Nick" "Mike"
> ( numbers <- c (21, -3, 7.25) )
[1] 21.00 -3.00 7.25
Note: If the R command is enclosed in parentheses, then after the command is executed the result is also printed to the screen.
Vectors can be defined in a number of ways:
> ( vals <- c (2, -7, 5, 3, -1 ) )
[1] 2 -7 5 3 -1
> ( vals <- 1:7 )
[1] 1 2 3 4 5 6 7
> ( vals <- seq(0, 3, by=0.5) )
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0
> ( vals <- rep("o", 7) )
[1] "o" "o" "o" "o" "o" "o" "o"
Vectors can be defined in a number of ways:
> ( vals <- numeric(9) )
[1] 0 0 0 0 0 0 0 0 0
> ( vals <- rnorm(5,2,1.5 ) )
[1] 2.481861 2.439406 3.559156 -1.742139 3.746586
> ( vals <- rpois(5,4 ) )
[1] 2 7 3 5 8
Vector elements can have labels:
> heights<-c(Alex=180, Bob=175, Clara=165, Don=185)
> heights
Alex Bob Clara Don
180 175 165 185
Do not use loops to perform operations on vectors. R operates on vectors!
> a <- 1:5
> b <- seq(2,10, by=2)
>
> a+b
[1] 3 6 9 12 15
> b/a
[1] 2 2 2 2 2
You can access particular elements in the vector in the following manner:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
>
> x[2] # returns second element
[1] 145
> x[2:4] # returns 2nd through 4th
[1] 145 958 456
You can access particular elements in the vector in the following manner:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
> x[c(1,3,5)] # returns 1st, 3rd and 5th elemetns
[1] 734 958 924
> x[-2] # returns all but 2nd element
[1] 734 958 456 924
You can access particular elements in the vector in the following manner:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
> x[c(TRUE, TRUE, FALSE, FALSE, TRUE)] # returns 1st, 2nd, and 5th elements
[1] 734 145 924
You can access particular elements in the vector in the following manner:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
> # returns only those elements that less than 500
> x[x<500]
[1] 145 456
max(x), min(x), sum(x)
mean(x), median(x), range(x)
var(x) , cor(x,y)
sort(x), rank(x), order(x)
cumsum(), cumprod(x), cummin(x), cumprod(x)
duplicated(x), unique(x)
Example:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
> mean(x)
[1] 643.4
> sort(x)
[1] 145 456 734 924 958
Usefule vector function:
> # define a numeric vector
> x <- c(734, 145, 958, 456, 924)
>
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
145.0 456.0 734.0 643.4 924.0 958.0
To define a missing value, use NA:
> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)
>
> # Check if the element in the vector is missing
> is.na(x)
[1] FALSE FALSE TRUE FALSE TRUE
> # Which elements in the vector are missing
> which(is.na(x))
[1] 3 5
Missing value cannot be compared to anything:
> # define a numeric vector
> x <- c(734, 145, NA, 456, NA)
>
> x == NA # this does not work !
[1] NA NA NA NA NA
> # Use is.na() instead
> is.na(x)
[1] FALSE FALSE TRUE FALSE TRUE
Factor is a special type of a vector that stores “categorical” variables.
To convert a vector into the factor use factor() function
> x <- c(0, 1, 1, 1, 0, 0, 1, 0)
> x <- factor(x)
> table(x)
x
0 1
4 4
Each level in the factor variable can be named
> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))
>
> table(x)
x
Fail Success
5 3
Factors are treated differently by the summary() function:
> # define a numeric vector
> x <- factor( c(0, 0, 1, 1, 0, 0, 1, 0), labels=c("Fail","Success"))
>
> summary(x)
Fail Success
5 3
Matrix is a 2 dimentional array of elements of the same type:
> matr <- matrix( c(1,2,3,4,5,6) , ncol=2)
> matr
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
Note: R matrix is “column-major”.
Matrix is a 2 dimentional array of elements of the same type:
> matr <- matrix( c(1,2,3,4,5,6) , nrow=2)
> matr
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
You can also convert array into a matrix:
> # First define the vector
> matr <- c(1,2,3,4,5,6)
> # Then change dimensions
> dim(matr) <- c(2,3)
>
> matr
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
You can fill matrix by-row
> matr <- matrix( c(1,2,3,4,5,6) , ncol=2, byrow=TRUE)
> matr
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
> smatr = matrix( c(1,-3, 2, 5) , ncol=2)
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # transpose matrix
> t(smatr)
[,1] [,2]
[1,] 1 -3
[2,] 2 5
> smatr = matrix( c(1,-3, 2, 5) , ncol=2)
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # Inverse matrix
> solve(smatr)
[,1] [,2]
[1,] 0.4545455 -0.18181818
[2,] 0.2727273 0.09090909
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # product of matricies elements
> smatr*smatr
[,1] [,2]
[1,] 1 4
[2,] 9 25
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # matrix product
> smatr %*% smatr
[,1] [,2]
[1,] -5 12
[2,] -18 19
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # inverse of each element of the matrix
> smatr^(-1)
[,1] [,2]
[1,] 1.0000000 0.5
[2,] -0.3333333 0.2
Some useful matrix functions:
colMeans(); rowMeans(); colSums(); rowSums()
> smatr
[,1] [,2]
[1,] 1 2
[2,] -3 5
> # inverse of each element of the matrix
> rowSums(smatr)
[1] 3 2
Access help file for the R function:
> ?matrix
> help(matrix)
You can search for help using help.search() function:
> help.search("matrix")
Or using two question marks:
> ??matrix
Get arguments of a function
> args(matrix)
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
NULL
Examples of function usage
> example(matrix)
matrix> is.matrix(as.matrix(1:10))
[1] TRUE
matrix> !is.matrix(warpbreaks) # data.frame, NOT matrix!
[1] TRUE
matrix> warpbreaks[1:10,]
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
7 51 A L
8 26 A L
9 67 A L
10 18 A M
matrix> as.matrix(warpbreaks[1:10,]) # using as.matrix.data.frame(.) method
breaks wool tension
1 "26" "A" "L"
2 "30" "A" "L"
3 "54" "A" "L"
4 "25" "A" "L"
5 "70" "A" "L"
6 "52" "A" "L"
7 "51" "A" "L"
8 "26" "A" "L"
9 "67" "A" "L"
10 "18" "A" "M"
matrix> ## Example of setting row and column names
matrix> mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
matrix+ dimnames = list(c("row1", "row2"),
matrix+ c("C.1", "C.2", "C.3")))
matrix> mdat
C.1 C.2 C.3
row1 1 2 3
row2 11 12 13
Check variables in the current session
> objects()
[1] "a" "A" "b" "bool.var" "comp.var" "heights"
[7] "matr" "mdat" "names" "num.var" "numbers" "smatr"
[13] "str.var" "vals" "x"
Or
> ls()
[1] "a" "A" "b" "bool.var" "comp.var" "heights"
[7] "matr" "mdat" "names" "num.var" "numbers" "smatr"
[13] "str.var" "vals" "x"
Remove array x from the memory
> rm(x)
Remove everything from your working enviroment
> rm(list=ls())
R CMD BATCH Rprog.R
Rscript Rprog.R
R -q –vanilla < Rprog.R
A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.). It is similar to SAS and SPSS datasets.
> names <- c("Alex", "Bob", "Cat")
> ages <- c(12,5,7)
> sex <- c("M","M","F")
> kids <- data.frame(Names=names,Ages=ages,Sex=sex)
> kids
Names Ages Sex
1 Alex 12 M
2 Bob 5 M
3 Cat 7 F
Summary function will recognize each variable type:
> kids
Names Ages Sex
1 Alex 12 M
2 Bob 5 M
3 Cat 7 F
> summary(kids)
Names Ages Sex
Alex:1 Min. : 5.0 F:1
Bob :1 1st Qu.: 6.0 M:2
Cat :1 Median : 7.0
Mean : 8.0
3rd Qu.: 9.5
Max. :12.0
Read the dataframe
> dt <- read.csv("USA_states.csv")
See IO.R for more examples on reading a dataset.
Look at the first fiew records
> head(dt)
State Location Population Total.Area..km.2. House.seats
1 Alabama SE 4822023 135767 7
2 Alaska NW 731449 1723337 1
3 Arizona SW 6553255 295233 9
4 Arkansas S 2949131 137733 4
5 California SW 38041430 423968 53
6 Colorado C 5187582 269602 7
Total.AP.exams Traffic.fatalities Driver.s.Ed.Required
1 39916 1.5 0
2 4570 1.3 0
3 53138 1.3 0
4 43760 1.8 0
5 632571 1.0 1
6 67460 1.0 1
Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
1 17.05 18.20
2 25.88 30.34
3 19.80 25.82
4 15.98 18.24
5 22.07 28.60
6 NA NA
Statehood
1 December 14, 1819
2 January 3, 1959
3 February 14, 1912
4 June 15, 1836
5 September 9, 1850
6 August 1, 1876
Look at the last fiew records
> tail(dt)
State Location Population Total.Area..km.2. House.seats
45 Vermont NE 626011 24905 1
46 Virginia E 8185867 110787 11
47 Washington NW 6897012 184661 10
48 West Virginia E 1855413 62755 3
49 Wisconsin N 5726398 169634 8
50 Wyoming C 576412 253335 1
Total.AP.exams Traffic.fatalities Driver.s.Ed.Required
45 6433 1.0 1
46 149918 0.9 1
47 75915 0.9 1
48 11486 1.8 0
49 61335 1.0 1
50 2050 1.4 0
Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
45 21.22 25.89
46 NA NA
47 20.34 21.17
48 20.55 21.25
49 15.53 22.86
50 15.92 28.04
Statehood
45 March 4, 1791
46 June 25, 1788
47 November 11, 1889
48 June 20, 1863
49 May 29, 1848
50 July 10, 1890
Get the names of the columns (variables)
> names(dt)
[1] "State" "Location"
[3] "Population" "Total.Area..km.2."
[5] "House.seats" "Total.AP.exams"
[7] "Traffic.fatalities" "Driver.s.Ed.Required"
[9] "Deaths.per.100K.from.lung.cancer" "X..Cigarettes.sold.per.capita"
[11] "Statehood"
Get the number of the observations and the number of the variables
> dim(dt)
[1] 50 11
> nrow(dt) # number of rows
[1] 50
> ncol(dt) # number of columns
[1] 11
Get general summary of each variable
> summary(dt)
State Location Population Total.Area..km.2.
Alabama : 1 C :12 Min. : 576412 Min. : 4002
Alaska : 1 NE : 8 1st Qu.: 1855441 1st Qu.: 96909
Arizona : 1 E : 7 Median : 4491154 Median : 147872
Arkansas : 1 N : 7 Mean : 6265714 Mean : 196667
California: 1 S : 6 3rd Qu.: 6834295 3rd Qu.: 219022
Colorado : 1 NW : 4 Max. :38041430 Max. :1723337
(Other) :44 (Other): 6
House.seats Total.AP.exams Traffic.fatalities Driver.s.Ed.Required
Min. : 1.00 Min. : 2050 Min. :0.600 Min. :0.00
1st Qu.: 3.00 1st Qu.: 11550 1st Qu.:0.925 1st Qu.:0.00
Median : 6.00 Median : 40581 Median :1.200 Median :1.00
Mean : 8.70 Mean : 76296 Mean :1.222 Mean :0.56
3rd Qu.: 9.75 3rd Qu.: 92869 3rd Qu.:1.400 3rd Qu.:1.00
Max. :53.00 Max. :632571 Max. :2.000 Max. :1.00
Deaths.per.100K.from.lung.cancer X..Cigarettes.sold.per.capita
Min. :12.01 Min. :12.12
1st Qu.:16.13 1st Qu.:18.09
Median :20.32 Median :21.21
Mean :19.48 Mean :21.62
3rd Qu.:22.78 3rd Qu.:25.34
Max. :26.48 Max. :33.60
NA's :8 NA's :8
Statehood
November 2, 1889: 2
April 28, 1788 : 1
April 30, 1812 : 1
August 1, 1876 : 1
August 10, 1821 : 1
August 21, 1959 : 1
(Other) :43
Display the structure of the dataset - type of each variable and list the first few values
> str(dt)
'data.frame': 50 obs. of 11 variables:
$ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Location : Factor w/ 8 levels "C","E","N","NE",..: 7 5 8 6 8 1 4 2 7 2 ...
$ Population : int 4822023 731449 6553255 2949131 38041430 5187582 3590347 917092 19317568 9919945 ...
$ Total.Area..km.2. : num 135767 1723337 295233 137733 423968 ...
$ House.seats : int 7 1 9 4 53 7 5 1 27 14 ...
$ Total.AP.exams : int 39916 4570 53138 43760 632571 67460 56838 11081 344996 141528 ...
$ Traffic.fatalities : num 1.5 1.3 1.3 1.8 1 1 0.7 1.3 1.3 1.2 ...
$ Driver.s.Ed.Required : int 0 0 0 0 1 1 0 1 1 0 ...
$ Deaths.per.100K.from.lung.cancer: num 17.1 25.9 19.8 16 22.1 ...
$ X..Cigarettes.sold.per.capita : num 18.2 30.3 25.8 18.2 28.6 ...
$ Statehood : Factor w/ 49 levels "April 28, 1788",..: 9 21 16 30 49 3 24 14 37 18 ...
An example of R graphics
> hist(dt$Traffic.fatalities)
Read the dataset using option without converting string variables as factors
> dt <- read.csv("USA_states.csv", stringsAsFactors=FALSE)
>
> #Edit the names of this dataset
> names(dt) <- c("state", "location", "pop", "area", "house","ap", "traffic","ed","cancer","cigar","date")
> names(dt)
[1] "state" "location" "pop" "area" "house" "ap"
[7] "traffic" "ed" "cancer" "cigar" "date"
Convert location & education variables to be a factor variables
> dt$location <- factor(dt$location)
> dt$ed <- factor(dt$ed)
> str(dt) # display the structure of the dataset
'data.frame': 50 obs. of 11 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ location: Factor w/ 8 levels "C","E","N","NE",..: 7 5 8 6 8 1 4 2 7 2 ...
$ pop : int 4822023 731449 6553255 2949131 38041430 5187582 3590347 917092 19317568 9919945 ...
$ area : num 135767 1723337 295233 137733 423968 ...
$ house : int 7 1 9 4 53 7 5 1 27 14 ...
$ ap : int 39916 4570 53138 43760 632571 67460 56838 11081 344996 141528 ...
$ traffic : num 1.5 1.3 1.3 1.8 1 1 0.7 1.3 1.3 1.2 ...
$ ed : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
$ cancer : num 17.1 25.9 19.8 16 22.1 ...
$ cigar : num 18.2 30.3 25.8 18.2 28.6 ...
$ date : chr "December 14, 1819" "January 3, 1959" "February 14, 1912" "June 15, 1836" ...
> # element from the 3rd row, 5th column
> dt[3,5]
>
> # all elements from the 3rd row
> dt[3,]
>
> # all elements in the 5th column
> dt[ ,5]
>
> # same: 5th variable in the dataframe
> dt[[5]]
> # accessing elemnts using variable name
> dt$area
>
> # list all the rows in the dataset for which variable ed is equal to 0
> dt[dt$ed == 0, ]
Logical operators include:
is equal? (==); not equal? (!=);
greater? (>); greater than or equal? (>=);
less? (>); less than or equal? (>=);
Check for missing values
> # how many values are missing:
> sum(is.na(dt$cigar))
[1] 8
> # which obs. have missing values:
> which(is.na(dt$cigar))
[1] 6 9 10 11 29 33 37 46
> summary(dt$cigar)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
12.12 18.10 21.21 21.62 25.34 33.60 8
> # does not work... because of missing variables
> mean(dt$cigar)
[1] NA
> mean(dt$cigar, na.rm=TRUE)
[1] 21.62381
> median(dt$cigar, na.rm=TRUE)
> quantile(dt$cigar, na.rm=TRUE)
>
> # standard deviation
> sd(dt$cigar, na.rm=TRUE)
>
> # Tukey min, lower-hinge, median, upper-hindge, max
> fivenum(dt$cigar, na.rm=TRUE)
>
> max(dt$cigar, na.rm=TRUE)
> min(dt$cigar, na.rm=TRUE)
> hist(dt$cigar)
> boxplot(dt$cigar, horizontal=TRUE) # boxplot
> # barplot for categorical variable
> barplot(table(dt$location))
Summary information by groups
> tapply(dt$ap, dt$location, mean)
C E N NE NW S SE
45101.17 92739.43 46612.43 68508.25 29091.50 82726.00 142052.67
SW
237062.67
Correlation between 2 continious variables:
> cor(dt$house, dt$pop) # strong correlation
[1] 0.9996195
> plot(dt$house ~ dt$pop)
Correlation between 2 continious variables:
> cor(dt$house, dt$area) # weak correlation
[1] 0.1194445
> plot(dt$house ~ dt$area)
Search for all correlated data in the table at once
> pairs(dt[,c(3,4,5,6,9,10)])
Linear Model (fit a line)
> ( lm.fit <- lm(dt$ap ~dt$pop) )
Call:
lm(formula = dt$ap ~ dt$pop)
Coefficients:
(Intercept) dt$pop
-2.313e+04 1.587e-02
Linear Model (fit a line)
> # Use data argument to specify the dataset name
> ( lm.fit <- lm(ap ~ pop, data=dt) )
Call:
lm(formula = ap ~ pop, data = dt)
Coefficients:
(Intercept) pop
-2.313e+04 1.587e-02
Linear Model (fit a line)
> # Use with function to specify the dataset name
> ( lm.fit <- with(dt, lm(ap ~ pop) ) ) # same command using data option
Call:
lm(formula = ap ~ pop)
Coefficients:
(Intercept) pop
-2.313e+04 1.587e-02
Summarize the result of linear modeling:
> summary(lm.fit)
Call:
lm(formula = ap ~ pop)
Residuals:
Min 1Q Median 3Q Max
-74287 -12769 4817 14302 61586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.313e+04 5.129e+03 -4.51 4.18e-05 ***
pop 1.587e-02 5.490e-04 28.90 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26900 on 48 degrees of freedom
Multiple R-squared: 0.9457, Adjusted R-squared: 0.9445
F-statistic: 835.4 on 1 and 48 DF, p-value: < 2.2e-16
Access elements of the output
> # access coefficients
> coef(lm.fit)
(Intercept) pop
-2.313141e+04 1.586852e-02
> resid(lm.fit) # access residual errors
1 2 3 4 5 6
-13470.971 16094.395 -27721.065 20093.058 52041.117 8272.147
7 8 9 10 11 12
22995.907 19659.515 61586.145 7244.538 12778.459 7532.564
13 14 15 16 17 18
-14077.865 -12829.423 -8086.854 -7324.639 4794.695 -29565.833
19 20 21 22 23 24
16090.096 52478.089 11625.923 -44104.912 2481.421 -15202.956
25 26 27 28 29 30
-38180.643 12049.307 4803.969 4830.251 12769.566 -6173.537
31 32 33 34 35 36
3402.003 -42417.720 -24302.581 14324.347 -60902.386 -10389.148
37 38 39 40 41 42
-12587.562 -74287.050 16288.827 -10581.096 14233.313 -34736.628
43 44 45 46 47 48
7740.357 13543.223 19630.540 43151.794 -10398.981 5174.747
49 50
-6403.066 16034.603
Access elements of the output
> fitted(lm.fit) # predicted values for dt$ap
1 2 3 4 5 6
53386.971 -11524.395 80859.065 23666.942 580529.883 59187.853
7 8 9 10 11 12
33842.093 -8578.515 283409.855 134283.462 -1037.459 2190.436
13 14 15 16 17 18
181179.865 80606.423 25714.854 22663.639 46379.305 49893.833
19 20 21 22 23 24
-2039.096 70247.911 82333.077 133702.912 62227.579 24234.956
25 26 27 28 29 30
72428.643 -7181.307 6313.031 20648.749 -2173.566 117536.537
31 32 33 34 35 36
9962.997 287419.720 131619.581 -12029.347 160058.386 37404.148
37 38 39 40 41 42
38745.562 179407.050 -6464.827 51827.096 -9907.313 79319.628
43 44 45 46 47 48
390389.643 22177.777 -13197.540 106766.206 86313.981 6311.253
49 50
67738.066 -13984.603
Plot linear model:
> plot(dt$ap~dt$pop)
> abline (lm.fit)
Tests for normality:
> # if the variable is normally distributed, the graph should be close to the straight line
> qqnorm(dt$cigar)
> qqline(dt$cigar, lty=2)
> # p-value is large so we cannot reject the hypothesis that the data
> # is normally distributed
> shapiro.test(dt$cigar)
Shapiro-Wilk normality test
data: dt$cigar
W = 0.9846, p-value = 0.8346
Tests for normality:
> # This time the variable is not normally distributed
> qqnorm(dt$area)
> qqline(dt$area, lty=2)
> # p-value is less than 0.05, so we can reject the null-hypothesis
> shapiro.test(dt$area)
Shapiro-Wilk normality test
data: dt$area
W = 0.5198, p-value = 1.485e-11
Fisher F test: compare if 2 sample variences are significantly different
> # p-value (large) and 95-confidence interval (includes 1) suggests
> # that we cannot reject the hypothesis that variences are the same for 2 samples
> var.test (dt$traffic[dt$ed ==0], dt$traffic[dt$ed==1] )
F test to compare two variances
data: dt$traffic[dt$ed == 0] and dt$traffic[dt$ed == 1]
F = 0.8706, num df = 21, denom df = 27, p-value = 0.7528
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3892307 2.0327402
sample estimates:
ratio of variances
0.8706086
Comparing 2 means: Student's t test
> # we reject null-hypothesis based on the confidence interval and p value
> t.test (dt$traffic[dt$ed ==0], dt$traffic[dt$ed==1] )
Welch Two Sample t-test
data: dt$traffic[dt$ed == 0] and dt$traffic[dt$ed == 1]
t = 2.5475, df = 46.532, p-value = 0.01422
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.04802005 0.40912281
sample estimates:
mean of x mean of y
1.350000 1.121429
Paired t test:
Let's simulate another variable “number of trafic fatalities after after some law was implemented
> dt$traffic.after <- dt$traffic - rnorm(length(dt$traffic), mean=0.2, sd=.2)
> ( t.result<- t.test (dt$traffic, dt$traffic.after, paired=TRUE ) )
Paired t-test
data: dt$traffic and dt$traffic.after
t = 5.5731, df = 49, p-value = 1.056e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09599496 0.20426285
sample estimates:
mean of the differences
0.1501289
As we expected the difference in means is not equal to 0 based on 95 percent conf. interval and p value
Let's examine the output of the function
> str(t.result)
List of 9
$ statistic : Named num 5.57
..- attr(*, "names")= chr "t"
$ parameter : Named num 49
..- attr(*, "names")= chr "df"
$ p.value : num 1.06e-06
$ conf.int : atomic [1:2] 0.096 0.204
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 0.15
..- attr(*, "names")= chr "mean of the differences"
$ null.value : Named num 0
..- attr(*, "names")= chr "difference in means"
$ alternative: chr "two.sided"
$ method : chr "Paired t-test"
$ data.name : chr "dt$traffic and dt$traffic.after"
- attr(*, "class")= chr "htest"
The output of this function is a list, we can access each individual element of the list by name
> t.result$p.value
[1] 1.056408e-06
> t.result$conf.int
[1] 0.09599496 0.20426285
attr(,"conf.level")
[1] 0.95
We can also access list elements by the index:
> t.result[[3]]
[1] 1.056408e-06
y ~ x - regression
y ~ x-1 - regression through the origin
y ~ x + z - multiple regression
y ~ x * z - multiple regression with interaction
y ~ xc - one way ANOVA (xc is a categorical variable)
y ~ xc1 + xc2 - two way ANOVA (xc1, xc2)
…
> lm.fit <- lm(dt$ap ~ dt$pop)
> summary(lm.fit )
Call:
lm(formula = dt$ap ~ dt$pop)
Residuals:
Min 1Q Median 3Q Max
-74287 -12769 4817 14302 61586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.313e+04 5.129e+03 -4.51 4.18e-05 ***
dt$pop 1.587e-02 5.490e-04 28.90 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26900 on 48 degrees of freedom
Multiple R-squared: 0.9457, Adjusted R-squared: 0.9445
F-statistic: 835.4 on 1 and 48 DF, p-value: < 2.2e-16
Now we can perform the analysis of variance
> anova(lm.fit )
Analysis of Variance Table
Response: dt$ap
Df Sum Sq Mean Sq F value Pr(>F)
dt$pop 1 6.0464e+11 6.0464e+11 835.4 < 2.2e-16 ***
Residuals 48 3.4741e+10 7.2377e+08
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> qf(0.95,1,48) # density function with 48 degrees of freedom
[1] 4.042652
Plot the results:
> par(mfrow=c(2,2)) # display 4 graphs at once
> plot(lm.fit )
Suppose Y = dt$ap - number of AP tests in the state
Predictors are:
X1 = dt$pop - population
X2 = dt$area - land area
X3 = dt$cigar - cigarets sold
Y = FUN(X1, X2, X3) + err
> lm.fit <- lm(dt$ap ~ dt$pop + dt$area + dt$cigar)
> summary(lm.fit)
Call:
lm(formula = dt$ap ~ dt$pop + dt$area + dt$cigar)
Residuals:
Min 1Q Median 3Q Max
-68810 -8076 5554 14423 63597
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.226e+04 1.787e+04 -2.364 0.0233 *
dt$pop 1.533e-02 5.852e-04 26.200 <2e-16 ***
dt$area 1.173e-02 1.531e-02 0.766 0.4483
dt$cigar 8.021e+02 8.333e+02 0.963 0.3418
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 26240 on 38 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.9519, Adjusted R-squared: 0.9481
F-statistic: 250.7 on 3 and 38 DF, p-value: < 2.2e-16
For help email: koleinik@bu.edu or help@scc.bu.edu