Summarization Tabulation and Merging Data

Read the cereal data, and produce quick summaries using str, summary, contents and describe. Interpret the results

cereal <- read.csv("cereals.csv")
str(cereal)

## 'data.frame':    77 obs. of  13 variables:
##  $ Cereal.name : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Manufacturer: Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
##  $ Cold.or.Hot : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
##  $ calories    : int  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein     : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat         : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium      : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber       : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo       : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars      : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass      : int  280 135 320 330 NA 70 30 100 125 190 ...
##  $ vitamins    : int  25 0 25 25 25 25 25 25 25 25 ...
##  $ rating      : num  68.4 34 59.4 93.7 34.4 ...

summary(cereal)

##                     Cereal.name Manufacturer Cold.or.Hot    calories    
##  100%_Bran                : 1   A: 1         C:74        Min.   : 50.0  
##  100%_Natural_Bran        : 1   G:22         H: 3        1st Qu.:100.0  
##  All-Bran                 : 1   K:23                     Median :110.0  
##  All-Bran_with_Extra_Fiber: 1   N: 6                     Mean   :106.9  
##  Almond_Delight           : 1   P: 9                     3rd Qu.:110.0  
##  Apple_Cinnamon_Cheerios  : 1   Q: 8                     Max.   :160.0  
##  (Other)                  :71   R: 8                                    
##     protein           fat            sodium          fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##                                                                  
##      carbo          sugars           potass          vitamins     
##  Min.   : 5.0   Min.   : 0.000   Min.   : 15.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 42.50   1st Qu.: 25.00  
##  Median :14.5   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.8   Mean   : 7.026   Mean   : 98.67   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##  NA's   :1      NA's   :1        NA's   :2                        
##      rating     
##  Min.   :18.04  
##  1st Qu.:33.17  
##  Median :40.40  
##  Mean   :42.67  
##  3rd Qu.:50.83  
##  Max.   :93.70  
##

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

contents(cereal)

## 
## Data frame:cereal    77 observations and 13 variables    Maximum # NAs:2
## 
## 
##              Levels Storage NAs
## Cereal.name      77 integer   0
## Manufacturer      7 integer   0
## Cold.or.Hot       2 integer   0
## calories            integer   0
## protein             integer   0
## fat                 integer   0
## sodium              integer   0
## fiber                double   0
## carbo                double   1
## sugars              integer   1
## potass              integer   2
## vitamins            integer   0
## rating               double   0
## 
## +------------+-----------------------------------------------------------+
## |Variable    |Levels                                                     |
## +------------+-----------------------------------------------------------+
## |Cereal.name |100%_Bran,100%_Natural_Bran,All-Bran                       |
## |            |All-Bran_with_Extra_Fiber,Almond_Delight                   |
## |            |Apple_Cinnamon_Cheerios,Apple_Jacks,Basic_4,Bran_Chex      |
## |            |Bran_Flakes,Cap'n'Crunch,Cheerios,Cinnamon_Toast_Crunch    |
## |            |Clusters,Cocoa_Puffs,Corn_Chex,Corn_Flakes,Corn_Pops       |
## |            |Count_Chocula,Cracklin'_Oat_Bran,Cream_of_Wheat_(Quick)    |
## |            |Crispix,Crispy_Wheat_&_Raisins,Double_Chex,Froot_Loops     |
## |            |Frosted_Flakes,Frosted_Mini-Wheats                         |
## |            |Fruit_&_Fibre_Dates,_Walnuts,_and_Oats,Fruitful_Bran       |
## |            |Fruity_Pebbles,Golden_Crisp,Golden_Grahams,Grape-Nuts      |
## |            |Grape_Nuts_Flakes,Great_Grains_Pecan,Honey-comb            |
## |            |Honey_Graham_Ohs,Honey_Nut_Cheerios                        |
## |            |Just_Right_Crunchy__Nuggets,Just_Right_Fruit_&_Nut,Kix,Life|
## |            |Lucky_Charms,Maypo,Muesli_Raisins,_Dates,_&_Almonds        |
## |            |Muesli_Raisins,_Peaches,_&_Pecans,Mueslix_Crispy_Blend     |
## |            |Multi-Grain_Cheerios,Nut&Honey_Crunch                      |
## |            |Nutri-Grain_Almond-Raisin,Nutri-grain_Wheat                |
## |            |Oatmeal_Raisin_Crisp,Post_Nat._Raisin_Bran,Product_19      |
## |            |Puffed_Rice,Puffed_Wheat,Quaker_Oat_Squares,Quaker_Oatmeal |
## |            |Raisin_Bran,Raisin_Nut_Bran,Raisin_Squares,Rice_Chex       |
## |            |Rice_Krispies,Shredded_Wheat,Shredded_Wheat_'n'Bran        |
## |            |Shredded_Wheat_spoon_size,Smacks,Special_K                 |
## |            |Strawberry_Fruit_Wheats,Total_Corn_Flakes,Total_Raisin_Bran|
## |            |Total_Whole_Grain,Triples,Trix,Wheat_Chex,Wheaties         |
## |            |Wheaties_Honey_Gold                                        |
## +------------+-----------------------------------------------------------+
## |Manufacturer|A,G,K,N,P,Q,R                                              |
## +------------+-----------------------------------------------------------+
## |Cold.or.Hot |C,H                                                        |
## +------------+-----------------------------------------------------------+

describe(cereal)

## cereal 
## 
##  13  Variables      77  Observations
## ---------------------------------------------------------------------------
## Cereal.name 
##        n  missing distinct 
##       77        0       77 
## 
## lowest : 100%_Bran                 100%_Natural_Bran         All-Bran                  All-Bran_with_Extra_Fiber Almond_Delight           
## highest: Triples                   Trix                      Wheat_Chex                Wheaties                  Wheaties_Honey_Gold      
## ---------------------------------------------------------------------------
## Manufacturer 
##        n  missing distinct 
##       77        0        7 
##                                                     
## Value          A     G     K     N     P     Q     R
## Frequency      1    22    23     6     9     8     8
## Proportion 0.013 0.286 0.299 0.078 0.117 0.104 0.104
## ---------------------------------------------------------------------------
## Cold.or.Hot 
##        n  missing distinct 
##       77        0        2 
##                       
## Value          C     H
## Frequency     74     3
## Proportion 0.961 0.039
## ---------------------------------------------------------------------------
## calories 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       77        0       11    0.933    106.9    19.86       70       90 
##      .25      .50      .75      .90      .95 
##      100      110      110      124      140 
##                                                                       
## Value         50    70    80    90   100   110   120   130   140   150
## Frequency      3     2     1     7    17    29    10     2     3     2
## Proportion 0.039 0.026 0.013 0.091 0.221 0.377 0.130 0.026 0.039 0.026
##                 
## Value        160
## Frequency      1
## Proportion 0.013
## ---------------------------------------------------------------------------
## protein 
##        n  missing distinct     Info     Mean      Gmd 
##       77        0        6    0.912    2.545    1.166 
##                                               
## Value          1     2     3     4     5     6
## Frequency     13    25    28     8     1     2
## Proportion 0.169 0.325 0.364 0.104 0.013 0.026
## ---------------------------------------------------------------------------
## fat 
##        n  missing distinct     Info     Mean      Gmd 
##       77        0        5    0.892    1.013    1.049 
##                                         
## Value          0     1     2     3     5
## Frequency     27    30    14     5     1
## Proportion 0.351 0.390 0.182 0.065 0.013
## ---------------------------------------------------------------------------
## sodium 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       77        0       27    0.995    159.7    93.51        0        0 
##      .25      .50      .75      .90      .95 
##      130      180      210      254      282 
## 
## lowest :   0  15  45  70  75, highest: 250 260 280 290 320
## ---------------------------------------------------------------------------
## fiber 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       77        0       13    0.966    2.152    2.289      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      1.0      2.0      3.0      4.4      5.2 
##                                                                       
## Value        0.0   1.0   1.5   2.0   2.5   2.7   3.0   4.0   5.0   6.0
## Frequency     19    16     3    10     1     1    15     4     4     1
## Proportion 0.247 0.208 0.039 0.130 0.013 0.013 0.195 0.052 0.052 0.013
##                             
## Value        9.0  10.0  14.0
## Frequency      1     1     1
## Proportion 0.013 0.013 0.013
## ---------------------------------------------------------------------------
## carbo 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       76        1       21    0.994     14.8    4.434     8.75    10.50 
##      .25      .50      .75      .90      .95 
##    12.00    14.50    17.00    21.00    21.00 
## 
## lowest :  5  7  8  9 10, highest: 19 20 21 22 23
## ---------------------------------------------------------------------------
## sugars 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       76        1       16    0.991    7.026    5.061      0.0      1.5 
##      .25      .50      .75      .90      .95 
##      3.0      7.0     11.0     13.0     14.0 
##                                                                       
## Value          0     1     2     3     4     5     6     7     8     9
## Frequency      7     1     3    13     1     5     7     4     5     4
## Proportion 0.092 0.013 0.039 0.171 0.013 0.066 0.092 0.053 0.066 0.053
##                                               
## Value         10    11    12    13    14    15
## Frequency      5     5     7     4     3     2
## Proportion 0.066 0.066 0.092 0.053 0.039 0.026
## ---------------------------------------------------------------------------
## potass 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       75        2       35    0.998    98.67     74.3     25.0     30.0 
##      .25      .50      .75      .90      .95 
##     42.5     90.0    120.0    190.0    246.0 
## 
## lowest :  15  20  25  30  35, highest: 240 260 280 320 330
## ---------------------------------------------------------------------------
## vitamins 
##        n  missing distinct     Info     Mean      Gmd 
##       77        0        3    0.451    28.25    15.64 
##                             
## Value          0    25   100
## Frequency      8    63     6
## Proportion 0.104 0.818 0.078
## ---------------------------------------------------------------------------
## rating 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       77        0       77        1    42.67    15.48    22.67    27.92 
##      .25      .50      .75      .90      .95 
##    33.17    40.40    50.83    60.09    68.27 
## 
## lowest : 18.04285 19.82357 21.87129 22.39651 22.73645
## highest: 68.23588 68.40297 72.80179 74.47295 93.70491
## ---------------------------------------------------------------------------

Find the average sodium, fiber and carbohydrate contents by Manufacturer.

library(doBy)
summaryBy(sodium + fiber + carbo ~ Manufacturer, data=cereal, FUN=mean)

##   Manufacturer sodium.mean fiber.mean carbo.mean
## 1            A      0.0000   0.000000   16.00000
## 2            G    200.4545   1.272727   14.72727
## 3            K    174.7826   2.739130   15.13043
## 4            N     37.5000   4.000000   16.00000
## 5            P    146.1111   2.777778   13.22222
## 6            Q     92.5000   1.337500         NA
## 7            R    198.1250   1.875000   17.62500

Add a new variable ‘SodiumClass’, which is ‘high’ when sodium >150 and ‘low’ otherwise. Make sure the new variable is a factor. Now, find the average, minimum and maximum sugar content for ‘low’ and ‘high’ sodium.

cereal$sodiumClass <- factor(ifelse(cereal$sodium < 150,"low","high"))

with(cereal, tapply(sugars,sodiumClass,min,na.rm=TRUE))

## high  low 
##    1    0

with(cereal, tapply(sugars,sodiumClass,max,na.rm=TRUE))

## high  low 
##   14   15

with(cereal, tapply(sugars,sodiumClass,mean,na.rm=TRUE))

##     high      low 
## 7.250000 6.642857

#or

library(doBy)
summaryBy(sugars ~ sodiumClass, data=cereal, FUN=c(min,max,mean), na.rm=TRUE)

##   sodiumClass sugars.min sugars.max sugars.mean
## 1        high          1         14    7.250000
## 2         low          0         15    6.642857

Find the maximum sugar content by Manufacturer and sodiumClass, using tapply. Inspect the result and notice there are missing values. Try to use na.rm=TRUE as an additional argument to tapply, only to find out that the values are still missing. Finally, use xtabs to count the number of observations by the two factors to find out we have missing values in the tapply result.

with(cereal, tapply(sugars, list(sodiumClass,Manufacturer), FUN=max, na.rm=TRUE))

##       A  G  K  N  P  Q  R
## high NA 14 13 NA 14 12 11
## low   3 12 15  6 15  8 11

Repeat the previous question with summaryBy. Compare the results.

library(doBy)
summaryBy(sugars ~ sodiumClass + Manufacturer, data=cereal, FUN=max, na.rm=TRUE)

##    sodiumClass Manufacturer sugars.max
## 1         high            G         14
## 2         high            K         13
## 3         high            P         14
## 4         high            Q         12
## 5         high            R         11
## 6          low            A          3
## 7          low            G         12
## 8          low            K         15
## 9          low            N          6
## 10         low            P         15
## 11         low            Q          8
## 12         low            R         11

Count the number of observations by Manufacturer and whether the cereal is ‘hot’ or ‘cold’, using xtabs

cereal <- read.csv("Cereals.csv")
xtabs( ~ Cold.or.Hot + Manufacturer, data=cereal)

##            Manufacturer
## Cold.or.Hot  A  G  K  N  P  Q  R
##           C  0 22 23  5  9  7  8
##           H  1  0  0  1  0  1  0

Using the ‘Age and memory’ dataset, find the mean and maximum number of words recalled by ‘Older’ and ‘Younger’ age classes.

words <- read.table("eysenck.txt", header=TRUE)

with(words, tapply(Words, Age, mean))

##   Older Younger 
##   10.06   13.16

with(words, tapply(Words, Age, max))

##   Older Younger 
##      23      22

#or
library(doBy)
summaryBy(Words ~ Age, data=words, FUN=c(mean,max))

##       Age Words.mean Words.max
## 1   Older      10.06        23
## 2 Younger      13.16        22

Using the HFE weather dataset find the mean air temperature by month.

hfemet <- read.csv("HFEmet2008.csv")

tail(hfemet)

##               DateTime  Tair AirPress   RH       VPD PAR Rain  wind
## 17563 12/31/2008 21:00 17.14   100.60 77.8 0.4356361   0    0 0.000
## 17564 12/31/2008 21:30 16.30   100.60 81.8 0.3385888   0    0 0.000
## 17565 12/31/2008 22:00 16.14   100.61 83.4 0.3056883   0    0 0.031
## 17566 12/31/2008 22:30 17.02   100.68 78.1 0.4264954   0    0 0.079
## 17567 12/31/2008 23:00 16.42   100.69 82.3 0.3318132   0    0 0.007
## 17568 12/31/2008 23:30 15.67   100.66 86.9 0.2340969   0    0 0.000
##       winddirection
## 17563         66.89
## 17564         66.89
## 17565         94.60
## 17566        118.20
## 17567        136.30
## 17568        143.50

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

hfemet$DateTime <- mdy_hm(hfemet$DateTime)

hfemet$month <- month(hfemet$DateTime)

with(hfemet, tapply(Tair, month, mean, na.rm=TRUE))

##         1         2         3         4         5         6         7 
## 22.742312 20.463714 19.940297 15.251501 11.894728 12.517657  9.205703 
##         8         9        10        11        12 
##  9.522370 14.989961 18.062657 19.417347 21.650544

First read the pupae data. Also read this short dataset, which gives a label ‘roomnumber’ for each CO2 treatment

CO2_treatment	Roomnumber
280	1
400	2

Merge the short dataset onto the pupae data.

pupae <- read.csv("pupae.csv")

CO2room <- data.frame(CO2_treatment=c(280,400), Roomnumber=1:2)

pupae <- merge(pupae, CO2room)

head(pupae)

##   CO2_treatment T_treatment Gender PupalWeight Frass Roomnumber
## 1           280     ambient      0       0.244 1.900          1
## 2           280     ambient      1       0.319 2.770          1
## 3           280     ambient      0       0.221    NA          1
## 4           280     ambient      0       0.280 1.996          1
## 5           280     ambient      0       0.257 1.069          1
## 6           280     ambient      1       0.333 2.257          1

First, run the following code to construct three dataframes that we will attempt to merge together.

dataset1 <- data.frame(unit=letters[1:9], treatment=rep(LETTERS[1:3],each=3),
Damage=runif(9,50,100))
unitweight <- data.frame(unit=letters[c(1,2,4,6,8,9)], Weight = rnorm(6,100,0.3))
treatlocation <- data.frame(treatment=LETTERS[1:3], Glasshouse=c("G1","G2","G3"))

Merge the three datasets together, to end up with one dataframe that has the columns ‘unit’, ‘treatment’, ‘Glasshouse’, ‘Damage’ and ‘Weight’. Some units do not have measurements of Weight. Merge the datasets in two ways to either include or exclude the units without Weight measurements.

dtreat <- merge(dataset1, treatlocation, by="treatment")

merge(dtreat, unitweight, all=TRUE, by="unit")

##   unit treatment   Damage Glasshouse    Weight
## 1    a         A 70.16935         G1 100.08244
## 2    b         A 87.97205         G1  99.67351
## 3    c         A 64.02076         G1        NA
## 4    d         B 96.30700         G2  99.84443
## 5    e         B 84.67580         G2        NA
## 6    f         B 91.38782         G2 100.24163
## 7    g         C 70.96169         G3        NA
## 8    h         C 62.30425         G3 100.58621
## 9    i         C 89.77666         G3 100.42488

merge(dtreat, unitweight, all=FALSE, by="unit")

##   unit treatment   Damage Glasshouse    Weight
## 1    a         A 70.16935         G1 100.08244
## 2    b         A 87.97205         G1  99.67351
## 3    d         B 96.30700         G2  99.84443
## 4    f         B 91.38782         G2 100.24163
## 5    h         C 62.30425         G3 100.58621
## 6    i         C 89.77666         G3 100.42488

cereal <- read.csv("Cereals.csv")

boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer")

Now, redraw the plot with Manufacturer in order of increasing mean sodium content

cereal$Manufacturer <- with(cereal, reorder(Manufacturer, sodium, mean))
boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer")

change the boxplots so that the width varies with the number of observations per manufacturer

boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer",
varwidth=TRUE)

hfeif <- read.csv('HFEIFplotmeans.csv', stringsAsFactors=FALSE)
hfeif$Date <- as.Date(mdy(hfeif$Date))
hfeif$Year <- year(hfeif$Date)
hfeif2012 <- subset(hfeif, Year == 2012)

There are four treatments in the dataframe. Calculate the variance of diameter for each of the treatments.These are the within-treatment variances. Also calculate thevariance of tree diameter across all plots (this is one number). This is the plot-to-plot variance.

withinvar <- with(hfeif2012, tapply(diameter,treat,FUN=var))

plotvar <- var(hfeif2012$diameter)

also calculate the mean within-treatment variance. Compare the value to the plot-to-plot variance. What can you tentatively conclude about the treatment effect?

mean(withinvar)

## [1] 1.150171

plotvar

## [1] 3.988293

Variance within treatments <<<<< variance across all plots. There is likely a treatment effect.

Read the dataset (‘weightloss.csv’), and convert the ‘Date’ variable to the Date class.

weightloss <- read.csv("weightloss.csv")
str(weightloss)

## 'data.frame':    67 obs. of  2 variables:
##  $ Date  : Factor w/ 67 levels "1/04/05","10/02/05",..: 2 11 14 17 20 22 24 29 31 33 ...
##  $ Weight: num  224 222 220 219 218 ...

weightloss$Date <- as.Date(dmy(weightloss$Date))

Add a new variable to the dataset, with the subjects’s weight in kilograms

weightloss$weightkg <- weightloss$Weight / 2.204

Produce a line plot that shows weight (in kg) versus time.

plot(weightkg ~ Date, data=weightloss, type='l')

The problem with the plot you just produced is that all measurements are connected by a line, although we would like to have line breaks for the days where the weight was not measured. To do this, construct a dataframe based on the weightloss dataset that has daily values.

Make an entirely new dataframe, with a Date variable, ranging from the 1rst to last days in the weightloss dataset, with a step of one day

Using merge, paste the Weight data onto this new dataframe. Check for missing values. Use the new dataframe to make the plot.

dateseq <- seq(min(weightloss$Date), max(weightloss$Date), by="day")

weightall <- data.frame(Date=dateseq)

weightall <- merge(weightall, weightloss, all=TRUE)

plot(weightkg ~ Date, data=weightall, type='l')

Based on the new dataframe you just produced, graph the daily change in weight versus time. Also add a dashed horizontal line at y=0.

weightall$weightloss <- c(NA, diff(weightall$weightkg))
plot(weightloss ~ Date, data=weightall, type='l', col="red", lwd=2)
abline(h=0)

Summarization Tabulation and Merging Data

Kushan De Silva

August 16, 2017