Load Libraries

As a first step in exploring the two datasets, we load the libraries used in the analysis.

library(tidyr)
library(dplyr)
library(magrittr)
library(MASS)

Question 1 - CO2 Dataset Exploration

For the CO2 dataset, we first summarise the data to see what it contains. We then group by Type and Treatment and compute the mean and standard deviation of uptake for each group:

str(CO2)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   84 obs. of  5 variables:
##  $ Plant    : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Type     : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
##  $ conc     : num  95 175 250 350 500 675 1000 95 175 250 ...
##  $ uptake   : num  16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
##  - attr(*, "formula")=Class 'formula' length 3 uptake ~ conc | Plant
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula' length 2 ~Treatment * Type
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Ambient carbon dioxide concentration"
##   ..$ y: chr "CO2 uptake rate"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(uL/L)"
##   ..$ y: chr "(umol/m^2 s)"
summary(CO2)
##      Plant             Type         Treatment       conc     
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95  
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175  
##  Qn3    : 7                                    Median : 350  
##  Qc1    : 7                                    Mean   : 435  
##  Qc3    : 7                                    3rd Qu.: 675  
##  Qc2    : 7                                    Max.   :1000  
##  (Other):42                                                  
##      uptake     
##  Min.   : 7.70  
##  1st Qu.:17.90  
##  Median :28.30  
##  Mean   :27.21  
##  3rd Qu.:37.12  
##  Max.   :45.50  
## 
Q1_1<-summarise(group_by(CO2,Type,Treatment),mean(uptake),sd(uptake))
Q1_1
## Source: local data frame [4 x 4]
## Groups: Type [?]
## 
##          Type  Treatment `mean(uptake)` `sd(uptake)`
##        <fctr>     <fctr>          <dbl>        <dbl>
## 1      Quebec nonchilled       35.33333     9.596371
## 2      Quebec    chilled       31.75238     9.644823
## 3 Mississippi nonchilled       25.95238     7.402136
## 4 Mississippi    chilled       15.81429     4.058976
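
The grouped summary above can also be written as a pipeline with named output columns, which avoids the backticked `mean(uptake)` and `sd(uptake)` headers. This is only a sketch: the names co2_df, uptake_by_group, n, mean_uptake and sd_uptake are illustrative and not part of the original analysis, and the printed format depends on the installed dplyr version.

```r
library(dplyr)

# CO2 carries extra groupedData classes; as.data.frame() reduces it to a plain data frame
co2_df <- as.data.frame(CO2)

uptake_by_group <- co2_df %>%
  group_by(Type, Treatment) %>%
  summarise(
    n           = n(),           # observations per group
    mean_uptake = mean(uptake),  # group mean of uptake
    sd_uptake   = sd(uptake)     # group standard deviation of uptake
  ) %>%
  ungroup()                      # drop any grouping left over after summarise()

uptake_by_group
```

The n column makes it easy to confirm that the four groups are the same size before comparing their means.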

Next, we want to know whether the differences between groups are significant or could have arisen by chance. We use one-way and two-way ANOVA:

fit_Type<-aov(CO2$uptake~CO2$Type)
fit_Ttmt<-aov(CO2$uptake~CO2$Treatment)
fit_Ty_Tt<-aov(CO2$uptake~CO2$Type+CO2$Treatment)

summary(fit_Type)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## CO2$Type     1   3366    3366   43.52 3.83e-09 ***
## Residuals   82   6341      77                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Ttmt)
##               Df Sum Sq Mean Sq F value Pr(>F)   
## CO2$Treatment  1    988   988.1   9.293 0.0031 **
## Residuals     82   8719   106.3                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_Ty_Tt)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## CO2$Type       1   3366    3366   50.92 3.68e-10 ***
## CO2$Treatment  1    988     988   14.95 0.000222 ***
## Residuals     81   5353      66                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results above show that the difference in uptake is significant at the 1% level (p < 0.01) for Type and for Treatment when each is tested alone, and that both factors remain significant when they are included together in the additive Type + Treatment model.
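
The two-way model fitted above is additive, so it does not test whether the effect of Treatment differs between the two Types. A possible follow-up, sketched below, adds the interaction term and compares the two nested models. The object names fit_additive and fit_interaction are illustrative and not part of the original analysis. Both fits also treat the 84 rows as independent, although each of the 12 plants contributes seven measurements, so the p-values should be read with some caution.

```r
# The formula interface with data= keeps the term labels short (Type rather than CO2$Type)
fit_additive    <- aov(uptake ~ Type + Treatment, data = CO2)

# Type * Treatment expands to Type + Treatment + Type:Treatment
fit_interaction <- aov(uptake ~ Type * Treatment, data = CO2)

summary(fit_interaction)

# F test for whether the interaction term improves on the additive model
anova(fit_additive, fit_interaction)
```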

Question 2 - mtcars Dataset Exploration

First, let’s look at a summary of the mtcars dataset:

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
attach(mtcars)

We are interested in three pairs of variables (vs and am, gear and carb, cyl and gear), so we build a contingency table for each:

t1<-table(vs,am)
t2<-table(gear,carb)
t3<-table(cyl,gear)
t1
##    am
## vs   0  1
##   0 12  6
##   1  7  7
t2
##     carb
## gear 1 2 3 4 6 8
##    3 3 4 3 5 0 0
##    4 4 4 0 4 0 0
##    5 0 2 0 1 1 1
t3
##    gear
## cyl  3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2
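
The same three tables can be built without attach(), which leaves the column names out of the search path. The sketch below uses with() for that, and adds row proportions for the third table as one way to make the pattern easier to read. The prop.table() step is an addition and not part of the original analysis.

```r
# with() evaluates each call inside mtcars, so no attach() is needed
t1 <- with(mtcars, table(vs, am))
t2 <- with(mtcars, table(gear, carb))
t3 <- with(mtcars, table(cyl, gear))

# Share of each gear count within every cylinder group (each row sums to 1)
round(prop.table(t3, margin = 1), 2)
```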

Judging from the tables above, my guess is that the first two pairs are not independent, while the third pair is. Let’s check this with chi-squared tests:

chisq.test(t1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t1
## X-squared = 0.34754, df = 1, p-value = 0.5555
chisq.test(t2)
## 
##  Pearson's Chi-squared test
## 
## data:  t2
## X-squared = 16.518, df = 10, p-value = 0.08573
chisq.test(t3)
## 
##  Pearson's Chi-squared test
## 
## data:  t3
## X-squared = 18.036, df = 4, p-value = 0.001214

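The third relationship (cyl and gear) is significant at the 1% level (p = 0.0012), so we reject independence for that pair. The first two are not significant at the 5% level (p = 0.56 and p = 0.086), so we cannot reject independence for them, which is the opposite of the initial guess.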
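
Several cells in the gear/carb and cyl/gear tables hold very small counts, and the chi-squared approximation becomes unreliable when expected counts fall below about 5. A possible robustness check, sketched below, inspects the expected counts and re-tests with methods that do not rely on the large-sample approximation. The object names, the seed and B = 10000 are illustrative choices, not part of the original analysis.

```r
# Rebuild the two larger tables directly from mtcars so the sketch is self-contained
tab_gear_carb <- table(mtcars$gear, mtcars$carb)
tab_cyl_gear  <- table(mtcars$cyl, mtcars$gear)

# Expected counts under independence; chisq.test() warns when these are small
expected_cyl_gear <- suppressWarnings(chisq.test(tab_cyl_gear))$expected
expected_cyl_gear

# Fisher's exact test does not use the chi-squared approximation
exact_cyl_gear <- fisher.test(tab_cyl_gear)
exact_cyl_gear

# Monte Carlo p-value for the sparser 3 x 6 table (seed fixed for reproducibility)
set.seed(1)
sim_gear_carb <- chisq.test(tab_gear_carb, simulate.p.value = TRUE, B = 10000)
sim_gear_carb
```

If these checks agree with the approximate p-values, the conclusions above stand. If they disagree, the exact or simulated result is usually the safer one to report for tables this sparse.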