As a first step in exploring the two datasets, we load the libraries used in the analysis.
library(tidyr)
library(dplyr)
library(magrittr)
library(MASS)
For the CO2 dataset, we start with a structural overview and a summary to see what the dataset contains. We then group the observations by Type and Treatment and compute the mean and standard deviation of uptake for each group:
str(CO2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 84 obs. of 5 variables:
## $ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
## $ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
## $ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
## - attr(*, "formula")=Class 'formula' length 3 uptake ~ conc | Plant
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' length 2 ~Treatment * Type
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Ambient carbon dioxide concentration"
## ..$ y: chr "CO2 uptake rate"
## - attr(*, "units")=List of 2
## ..$ x: chr "(uL/L)"
## ..$ y: chr "(umol/m^2 s)"
summary(CO2)
## Plant Type Treatment conc
## Qn1 : 7 Quebec :42 nonchilled:42 Min. : 95
## Qn2 : 7 Mississippi:42 chilled :42 1st Qu.: 175
## Qn3 : 7 Median : 350
## Qc1 : 7 Mean : 435
## Qc3 : 7 3rd Qu.: 675
## Qc2 : 7 Max. :1000
## (Other):42
## uptake
## Min. : 7.70
## 1st Qu.:17.90
## Median :28.30
## Mean :27.21
## 3rd Qu.:37.12
## Max. :45.50
##
Q1_1<-summarise(group_by(CO2,Type,Treatment),mean(uptake),sd(uptake))
Q1_1
## Source: local data frame [4 x 4]
## Groups: Type [?]
##
## Type Treatment `mean(uptake)` `sd(uptake)`
## <fctr> <fctr> <dbl> <dbl>
## 1 Quebec nonchilled 35.33333 9.596371
## 2 Quebec chilled 31.75238 9.644823
## 3 Mississippi nonchilled 25.95238 7.402136
## 4 Mississippi chilled 15.81429 4.058976
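The same grouped summary can also be written as a pipeline with named output columns, which makes the result easier to refer to later. This is a minimal sketch rather than the original method: the names `uptake_by_group`, `mean_uptake`, and `sd_uptake` are illustrative, and the `as.data.frame()` call is an assumption made to drop the extra grouped-data classes that CO2 carries.

```r
library(dplyr)

# Mean and standard deviation of uptake for each Type x Treatment group,
# with explicit column names instead of `mean(uptake)` / `sd(uptake)`
uptake_by_group <- CO2 %>%
  as.data.frame() %>%              # plain data frame (assumption, see above)
  group_by(Type, Treatment) %>%
  summarise(
    mean_uptake = mean(uptake),
    sd_uptake   = sd(uptake)
  ) %>%
  ungroup()

uptake_by_group
```

The values should match the table above; only the column names differ.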
Next, we want to know whether the differences between groups are statistically significant or could have arisen by chance. We use a one-way ANOVA for Type and for Treatment separately, and a two-way ANOVA with both factors as additive effects:
fit_Type<-aov(CO2$uptake~CO2$Type)
fit_Ttmt<-aov(CO2$uptake~CO2$Treatment)
fit_Ty_Tt<-aov(CO2$uptake~CO2$Type+CO2$Treatment)
summary(fit_Type)
## Df Sum Sq Mean Sq F value Pr(>F)
## CO2$Type 1 3366 3366 43.52 3.83e-09 ***
## Residuals 82 6341 77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Ttmt)
## Df Sum Sq Mean Sq F value Pr(>F)
## CO2$Treatment 1 988 988.1 9.293 0.0031 **
## Residuals 82 8719 106.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Ty_Tt)
## Df Sum Sq Mean Sq F value Pr(>F)
## CO2$Type 1 3366 3366 50.92 3.68e-10 ***
## CO2$Treatment 1 988 988 14.95 0.000222 ***
## Residuals 81 5353 66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that mean uptake differs significantly at the 1% level (p < 0.01) by Type, by Treatment, and for both factors in the additive two-way model (Type + Treatment), which does not include an interaction term.
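The two-way model above treats Type and Treatment as additive effects. A possible extension, which is not part of the original analysis, is to add an interaction term to check whether the effect of chilling differs between the two plant origins. The sketch below uses the `data = CO2` form of `aov()`; the names `fit_interaction` and `anova_table` are illustrative.

```r
# Two-way ANOVA with interaction: Type * Treatment expands to
# Type + Treatment + Type:Treatment. Passing data = CO2 avoids the
# CO2$ prefixes and gives cleaner row labels in the output.
fit_interaction <- aov(uptake ~ Type * Treatment, data = CO2)

# summary() returns a list whose first element is the ANOVA table
anova_table <- summary(fit_interaction)[[1]]
anova_table
```

The `Type:Treatment` row of the table reports the test for the interaction.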
Next, let’s look at a summary of the mtcars dataset:
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
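We are interested in three pairs of variables, so we build a contingency table for each pair (using with() rather than attach(), so that the global environment is not modified):
t1<-with(mtcars,table(vs,am))
t2<-with(mtcars,table(gear,carb))
t3<-with(mtcars,table(cyl,gear))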
t1
## am
## vs 0 1
## 0 12 6
## 1 7 7
t2
## carb
## gear 1 2 3 4 6 8
## 3 3 4 3 5 0 0
## 4 4 4 0 4 0 0
## 5 0 2 0 1 1 1
t3
## gear
## cyl 3 4 5
## 4 1 8 2
## 6 2 4 1
## 8 12 0 2
Judging from the tables above, my guess is that the first two pairs of variables are not independent, while the third pair is independent. Let’s check this with chi-squared tests of independence:
chisq.test(t1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: t1
## X-squared = 0.34754, df = 1, p-value = 0.5555
chisq.test(t2)
##
## Pearson's Chi-squared test
##
## data: t2
## X-squared = 16.518, df = 10, p-value = 0.08573
chisq.test(t3)
##
## Pearson's Chi-squared test
##
## data: t3
## X-squared = 18.036, df = 4, p-value = 0.001214
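Only the third relationship (cyl and gear) is significant, at the 1% level (p = 0.0012), so we reject independence for that pair. The first (vs and am, p = 0.56) is not significant, and the second (gear and carb, p = 0.086) is significant only at the 10% level, so the data do not provide enough evidence to reject independence for those two pairs. This is the opposite of the initial guess.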
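With only 32 cars spread over several cells, some expected counts are small, and the chi-squared approximation can be unreliable in that case. As a hedged cross-check that is not part of the original analysis, the sketch below inspects the expected counts for the cyl and gear table and recomputes the p-value by Monte Carlo simulation. The names `cyl_gear`, `chi_res`, `expected_counts`, and `chi_sim` are illustrative, and the seed is arbitrary.

```r
# Rebuild the cyl-by-gear table directly from mtcars
cyl_gear <- with(mtcars, table(cyl, gear))

# Expected counts under independence; suppressWarnings() hides the
# small-sample warning that chisq.test() may emit for this table
chi_res <- suppressWarnings(chisq.test(cyl_gear))
expected_counts <- chi_res$expected
round(expected_counts, 2)

# Monte Carlo p-value, which does not rely on the large-sample approximation
set.seed(1)  # arbitrary seed, only for reproducibility
chi_sim <- chisq.test(cyl_gear, simulate.p.value = TRUE, B = 10000)
chi_sim$p.value
```

If the simulated p-value leads to the same decision as the asymptotic one, the conclusion for cyl and gear does not depend on the approximation.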