R

Michael Crawford
June 4th, 2015

History of R

  • R is based upon the S language
  • S was written in 1976 to be a statistical analysis environment
  • R was first released in 1993 as private software but was made free in 1995
  • In 2000 the 1.0.0 version of R was release
  • R version 3.2.0 was released in April 2015

Benefits of R

  • Built from the ground up with statistics and data mining in mind
  • Interactive (allows you to “play” with the data)
  • Large and active user base to help when you run into issues
  • There's a package for that
  • Very good (and free) Integrated Development Environment (RStudio)

Basic Statistics

x <- c(23, 33, 14, 15, 42, 28, 33, 45, 23, 34, 39, 
       21, 36, 23, 34, 36, 25, 9, 11, 19, 35, 24, 
       31, 29, 16, 23, 34, 24, 38, 15, 13, 35, 28)
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.00   21.00   28.00   26.91   34.00   45.00 
mean(x)
[1] 26.90909
sd(x)
[1] 9.494616
quantile(x,c(.1,.9),type = 1)
10% 90% 
 14  38 
hist(x)

plot of chunk unnamed-chunk-4

Z Tests

library(TeachingDemos)
z.test(x,mu = mean(x),sd = sd(x),conf.level = 0.9)

    One Sample z-test

data:  x
z = 0, n = 33.0000, Std. Dev. = 9.4946, Std. Dev. of the sample
mean = 1.6528, p-value = 1
alternative hypothesis: true mean is not equal to 26.90909
90 percent confidence interval:
 24.19048 29.62771
sample estimates:
mean of x 
 26.90909 

Note we had to load the 'TeachingDemos' library because r does not have z-tests by default since normally t-tests are used

z.test(x,mu = mean(x),sd = sd(x), conf.level = 0.9, alternative = 'greater')

    One Sample z-test

data:  x
z = 0, n = 33.0000, Std. Dev. = 9.4946, Std. Dev. of the sample
mean = 1.6528, p-value = 0.5
alternative hypothesis: true mean is greater than 26.90909
90 percent confidence interval:
 24.79094      Inf
sample estimates:
mean of x 
 26.90909 

All you have to do is change 'greater' to 'less' to get the other side of the interval

z.test(x,mu = mean(x),sd = sd(x), conf.level = 0.9, alternative = 'less')

T Tests

RISC_I <- c(140,120,176,288,992,144,2736,2796,752,17720,96)
t.test(RISC_I,mu = mean(RISC_I),conf.level = 0.9)

    One Sample t-test

data:  RISC_I
t = 0, df = 10, p-value = 1
alternative hypothesis: true mean is not equal to 2360
90 percent confidence interval:
 -478.3372 5198.3372
sample estimates:
mean of x 
     2360 
Z8002 <- c(126,180,141,374,1091,302,1368,1398,602,17720,240)
t.test(RISC_I,Z8002,conf.level=0.9,paired = TRUE)

    Paired t-test

data:  RISC_I and Z8002
t = 1.2531, df = 10, p-value = 0.2387
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 -98.11371 537.75007
sample estimates:
mean of the differences 
               219.8182 

Note that in this case the p-value is the alpha with which we can have confidence. So this is saying we can only be sure with a confidence of 76% (i.e. at a confidence of 0.76 the interval will no longer pass through 0)

Data Frames

Data Frames (along with vectors) are one of the fundamental data structures in R. Think of it as an excel worksheet where you have rows and columns.

DiskIOs <- c(14,16,27,42,39,50,83)
CPUTime <- c(2,5,7,9,10,13,20)
data <- data.frame(DiskIOs,CPUTime)
data
  DiskIOs CPUTime
1      14       2
2      16       5
3      27       7
4      42       9
5      39      10
6      50      13
7      83      20

One of the nice things about them is you can do operations on columns, rows or even the entire data frame.

log10(data$DiskIOs)
[1] 1.146128 1.204120 1.431364 1.623249 1.591065 1.698970 1.919078
log10(data)
   DiskIOs   CPUTime
1 1.146128 0.3010300
2 1.204120 0.6989700
3 1.431364 0.8450980
4 1.623249 0.9542425
5 1.591065 1.0000000
6 1.698970 1.1139434
7 1.919078 1.3010300

Linear Models

fit <- lm(DiskIOs ~ CPUTime, data)
fit$coefficients
(Intercept)     CPUTime 
   1.137500    3.985417 
confint(fit,level=0.9)
                  5 %     95 %
(Intercept) -5.556274 7.831274
CPUTime      3.369950 4.600884
predict(fit,data.frame(CPUTime=c(40)),
        interval='prediction',level=0.9)
       fit      lwr      upr
1 160.5542 139.5045 181.6038
predict(fit,data.frame(CPUTime=c(40)),
        interval='confidence',level=0.9)
       fit      lwr      upr
1 160.5542 141.4449 179.6634
summary(fit)

Call:
lm(formula = DiskIOs ~ CPUTime, data = data)

Residuals:
     1      2      3      4      5      6      7 
 4.892 -5.065 -2.035  4.994 -1.992 -2.948  2.154 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.1375     3.3219   0.342    0.746    
CPUTime       3.9854     0.3054  13.048 4.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.381 on 5 degrees of freedom
Multiple R-squared:  0.9715,    Adjusted R-squared:  0.9658 
F-statistic: 170.3 on 1 and 5 DF,  p-value: 4.716e-05

Linear Models

plot(DiskIOs ~ CPUTime,data=data)
abline(fit)

plot of chunk unnamed-chunk-13

plot(data$CPUTime,fit$residuals, xlab="CPU Time", ylab="Residuals")

plot of chunk unnamed-chunk-14

One Factor Anova

data <- read.csv("table20_1.csv")
data
   Processor Size
1          R  144
2          R  120
3          R  176
4          R  288
5          R  144
6          V  101
7          V  144
8          V  211
9          V  288
10         V   72
11         Z  130
12         Z  180
13         Z  141
14         Z  374
15         Z  302
data.anova <- aov(Size~Processor,data)
summary(data.anova)
            Df Sum Sq Mean Sq F value Pr(>F)
Processor    2  10992    5496   0.699  0.516
Residuals   12  94365    7864               
plot(TukeyHSD(data.anova,'Processor',conf.level=.9))

plot of chunk unnamed-chunk-16

One Factor Anova

library(agricolae)
HSD.test(data.anova,"Processor",alpha=0.1,
         console=TRUE)

Study: data.anova ~ "Processor"

HSD Test for Size 

Mean Square Error:  7863.767 

Processor,  means

   Size       std r Min Max
R 174.4  66.54923 5 120 288
V 163.2  87.19920 5  72 288
Z 225.4 107.51186 5 130 374

alpha: 0.1 ; Df Error: 12 
Critical Value of Studentized Range: 3.203947 

Honestly Significant Difference: 127.062 

Means with the same letter are not significantly different.

Groups, Treatments and means
a    Z   225.4 
a    R   174.4 
a    V   163.2 

Two Factor Anova

data <- read.csv("table18_11.csv")
data
   Processor Workload  Time
1          A        I 41.16
2          A        I 39.02
3          A        I 42.56
4          A        J 51.50
5          A        J 52.50
6          A        J 50.50
7          B        I 63.17
8          B        I 59.25
9          B        I 64.23
10         B        J 48.08
11         B        J 48.98
12         B        J 47.10
data.anova <- aov(Time~Processor*Workload,data)
summary(data.anova)
                   Df Sum Sq Mean Sq F value   Pr(>F)    
Processor           1  239.1   239.1  80.086 1.93e-05 ***
Workload            1    9.6     9.6   3.213    0.111    
Processor:Workload  1  459.4   459.4 153.853 1.67e-06 ***
Residuals           8   23.9     3.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(TukeyHSD(data.anova,'Workload',
              conf.level=.9))

plot of chunk unnamed-chunk-19

Two Factor Anova

HSD.test(data.anova,"Workload",alpha=0.1,
         console=TRUE)

Study: data.anova ~ "Workload"

HSD Test for Time 

Mean Square Error:  2.9861 

Workload,  means

      Time       std r   Min   Max
I 51.56500 11.839498 6 39.02 64.23
J 49.77667  2.077861 6 47.10 52.50

alpha: 0.1 ; Df Error: 8 
Critical Value of Studentized Range: 2.629798 

Honestly Significant Difference: 1.855235 

Means with the same letter are not significantly different.

Groups, Treatments and means
a    I   51.56 
a    J   49.78 
HSD.test(data.anova,"Processor",alpha=0.1,
         console=TRUE)

Study: data.anova ~ "Processor"

HSD Test for Time 

Mean Square Error:  2.9861 

Processor,  means

      Time      std r   Min   Max
A 46.20667 5.940931 6 39.02 52.50
B 55.13500 7.955272 6 47.10 64.23

alpha: 0.1 ; Df Error: 8 
Critical Value of Studentized Range: 2.629798 

Honestly Significant Difference: 1.855235 

Means with the same letter are not significantly different.

Groups, Treatments and means
a    B   55.14 
b    A   46.21 

Two Factor Anova

plot(TukeyHSD(data.anova,"Processor:Workload",
              conf.level=.9))

plot of chunk unnamed-chunk-22

tx <- with(data,interaction(Processor,Workload))
HSD.test(aov(Time~tx,data),"tx",alpha=0.1,
         console=TRUE)

Study: aov(Time ~ tx, data) ~ "tx"

HSD Test for Time 

Mean Square Error:  2.9861 

tx,  means

        Time       std r   Min   Max
A.I 40.91333 1.7828442 3 39.02 42.56
A.J 51.50000 1.0000000 3 50.50 52.50
B.I 62.21667 2.6233058 3 59.25 64.23
B.J 48.05333 0.9402836 3 47.10 48.98

alpha: 0.1 ; Df Error: 8 
Critical Value of Studentized Range: 3.834156 

Honestly Significant Difference: 3.825263 

Means with the same letter are not significantly different.

Groups, Treatments and means
a    B.I     62.22 
b    A.J     51.5 
b    B.J     48.05 
c    A.I     40.91 

Links

Help

  • Stack Overflow is a great resource. Most questions will get answered within 30 minutes
  • You can type question mark command to get help on any command
?sum
  • You can get examples with the command 'example'
example(sum)
  • Install packages with the install.packages command
install.packages("agricolae")
  • Note that once you save your R markdown file, whatever directory it is in will be the working directory.

Extra Credit Assignment

  • Redo Problem 4b from the midterm in R.
  • Also include a Tukey test on the interactions using the HSD.test function from the agricolae package as was shown in the presentation
  • Use R Markdown to generate solution.
  • Turn in either the HTML or PDF output from R-Studio
  • If you import your data from a csv file, include the csv with submission

Due Date: Thursday June 11th, 2015