Read the cereal data, and produce quick summaries using str, summary, contents and describe. Interpret the results
cereal <- read.csv("cereals.csv")
str(cereal)
## 'data.frame': 77 obs. of 13 variables:
## $ Cereal.name : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Manufacturer: Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
## $ Cold.or.Hot : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
## $ calories : int 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : int 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : int 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : int 280 135 320 330 NA 70 30 100 125 190 ...
## $ vitamins : int 25 0 25 25 25 25 25 25 25 25 ...
## $ rating : num 68.4 34 59.4 93.7 34.4 ...
summary(cereal)
## Cereal.name Manufacturer Cold.or.Hot calories
## 100%_Bran : 1 A: 1 C:74 Min. : 50.0
## 100%_Natural_Bran : 1 G:22 H: 3 1st Qu.:100.0
## All-Bran : 1 K:23 Median :110.0
## All-Bran_with_Extra_Fiber: 1 N: 6 Mean :106.9
## Almond_Delight : 1 P: 9 3rd Qu.:110.0
## Apple_Cinnamon_Cheerios : 1 Q: 8 Max. :160.0
## (Other) :71 R: 8
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
##
## carbo sugars potass vitamins
## Min. : 5.0 Min. : 0.000 Min. : 15.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 42.50 1st Qu.: 25.00
## Median :14.5 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.8 Mean : 7.026 Mean : 98.67 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## NA's :1 NA's :1 NA's :2
## rating
## Min. :18.04
## 1st Qu.:33.17
## Median :40.40
## Mean :42.67
## 3rd Qu.:50.83
## Max. :93.70
##
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
contents(cereal)
##
## Data frame:cereal 77 observations and 13 variables Maximum # NAs:2
##
##
## Levels Storage NAs
## Cereal.name 77 integer 0
## Manufacturer 7 integer 0
## Cold.or.Hot 2 integer 0
## calories integer 0
## protein integer 0
## fat integer 0
## sodium integer 0
## fiber double 0
## carbo double 1
## sugars integer 1
## potass integer 2
## vitamins integer 0
## rating double 0
##
## +------------+-----------------------------------------------------------+
## |Variable |Levels |
## +------------+-----------------------------------------------------------+
## |Cereal.name |100%_Bran,100%_Natural_Bran,All-Bran |
## | |All-Bran_with_Extra_Fiber,Almond_Delight |
## | |Apple_Cinnamon_Cheerios,Apple_Jacks,Basic_4,Bran_Chex |
## | |Bran_Flakes,Cap'n'Crunch,Cheerios,Cinnamon_Toast_Crunch |
## | |Clusters,Cocoa_Puffs,Corn_Chex,Corn_Flakes,Corn_Pops |
## | |Count_Chocula,Cracklin'_Oat_Bran,Cream_of_Wheat_(Quick) |
## | |Crispix,Crispy_Wheat_&_Raisins,Double_Chex,Froot_Loops |
## | |Frosted_Flakes,Frosted_Mini-Wheats |
## | |Fruit_&_Fibre_Dates,_Walnuts,_and_Oats,Fruitful_Bran |
## | |Fruity_Pebbles,Golden_Crisp,Golden_Grahams,Grape-Nuts |
## | |Grape_Nuts_Flakes,Great_Grains_Pecan,Honey-comb |
## | |Honey_Graham_Ohs,Honey_Nut_Cheerios |
## | |Just_Right_Crunchy__Nuggets,Just_Right_Fruit_&_Nut,Kix,Life|
## | |Lucky_Charms,Maypo,Muesli_Raisins,_Dates,_&_Almonds |
## | |Muesli_Raisins,_Peaches,_&_Pecans,Mueslix_Crispy_Blend |
## | |Multi-Grain_Cheerios,Nut&Honey_Crunch |
## | |Nutri-Grain_Almond-Raisin,Nutri-grain_Wheat |
## | |Oatmeal_Raisin_Crisp,Post_Nat._Raisin_Bran,Product_19 |
## | |Puffed_Rice,Puffed_Wheat,Quaker_Oat_Squares,Quaker_Oatmeal |
## | |Raisin_Bran,Raisin_Nut_Bran,Raisin_Squares,Rice_Chex |
## | |Rice_Krispies,Shredded_Wheat,Shredded_Wheat_'n'Bran |
## | |Shredded_Wheat_spoon_size,Smacks,Special_K |
## | |Strawberry_Fruit_Wheats,Total_Corn_Flakes,Total_Raisin_Bran|
## | |Total_Whole_Grain,Triples,Trix,Wheat_Chex,Wheaties |
## | |Wheaties_Honey_Gold |
## +------------+-----------------------------------------------------------+
## |Manufacturer|A,G,K,N,P,Q,R |
## +------------+-----------------------------------------------------------+
## |Cold.or.Hot |C,H |
## +------------+-----------------------------------------------------------+
describe(cereal)
## cereal
##
## 13 Variables 77 Observations
## ---------------------------------------------------------------------------
## Cereal.name
## n missing distinct
## 77 0 77
##
## lowest : 100%_Bran 100%_Natural_Bran All-Bran All-Bran_with_Extra_Fiber Almond_Delight
## highest: Triples Trix Wheat_Chex Wheaties Wheaties_Honey_Gold
## ---------------------------------------------------------------------------
## Manufacturer
## n missing distinct
## 77 0 7
##
## Value A G K N P Q R
## Frequency 1 22 23 6 9 8 8
## Proportion 0.013 0.286 0.299 0.078 0.117 0.104 0.104
## ---------------------------------------------------------------------------
## Cold.or.Hot
## n missing distinct
## 77 0 2
##
## Value C H
## Frequency 74 3
## Proportion 0.961 0.039
## ---------------------------------------------------------------------------
## calories
## n missing distinct Info Mean Gmd .05 .10
## 77 0 11 0.933 106.9 19.86 70 90
## .25 .50 .75 .90 .95
## 100 110 110 124 140
##
## Value 50 70 80 90 100 110 120 130 140 150
## Frequency 3 2 1 7 17 29 10 2 3 2
## Proportion 0.039 0.026 0.013 0.091 0.221 0.377 0.130 0.026 0.039 0.026
##
## Value 160
## Frequency 1
## Proportion 0.013
## ---------------------------------------------------------------------------
## protein
## n missing distinct Info Mean Gmd
## 77 0 6 0.912 2.545 1.166
##
## Value 1 2 3 4 5 6
## Frequency 13 25 28 8 1 2
## Proportion 0.169 0.325 0.364 0.104 0.013 0.026
## ---------------------------------------------------------------------------
## fat
## n missing distinct Info Mean Gmd
## 77 0 5 0.892 1.013 1.049
##
## Value 0 1 2 3 5
## Frequency 27 30 14 5 1
## Proportion 0.351 0.390 0.182 0.065 0.013
## ---------------------------------------------------------------------------
## sodium
## n missing distinct Info Mean Gmd .05 .10
## 77 0 27 0.995 159.7 93.51 0 0
## .25 .50 .75 .90 .95
## 130 180 210 254 282
##
## lowest : 0 15 45 70 75, highest: 250 260 280 290 320
## ---------------------------------------------------------------------------
## fiber
## n missing distinct Info Mean Gmd .05 .10
## 77 0 13 0.966 2.152 2.289 0.0 0.0
## .25 .50 .75 .90 .95
## 1.0 2.0 3.0 4.4 5.2
##
## Value 0.0 1.0 1.5 2.0 2.5 2.7 3.0 4.0 5.0 6.0
## Frequency 19 16 3 10 1 1 15 4 4 1
## Proportion 0.247 0.208 0.039 0.130 0.013 0.013 0.195 0.052 0.052 0.013
##
## Value 9.0 10.0 14.0
## Frequency 1 1 1
## Proportion 0.013 0.013 0.013
## ---------------------------------------------------------------------------
## carbo
## n missing distinct Info Mean Gmd .05 .10
## 76 1 21 0.994 14.8 4.434 8.75 10.50
## .25 .50 .75 .90 .95
## 12.00 14.50 17.00 21.00 21.00
##
## lowest : 5 7 8 9 10, highest: 19 20 21 22 23
## ---------------------------------------------------------------------------
## sugars
## n missing distinct Info Mean Gmd .05 .10
## 76 1 16 0.991 7.026 5.061 0.0 1.5
## .25 .50 .75 .90 .95
## 3.0 7.0 11.0 13.0 14.0
##
## Value 0 1 2 3 4 5 6 7 8 9
## Frequency 7 1 3 13 1 5 7 4 5 4
## Proportion 0.092 0.013 0.039 0.171 0.013 0.066 0.092 0.053 0.066 0.053
##
## Value 10 11 12 13 14 15
## Frequency 5 5 7 4 3 2
## Proportion 0.066 0.066 0.092 0.053 0.039 0.026
## ---------------------------------------------------------------------------
## potass
## n missing distinct Info Mean Gmd .05 .10
## 75 2 35 0.998 98.67 74.3 25.0 30.0
## .25 .50 .75 .90 .95
## 42.5 90.0 120.0 190.0 246.0
##
## lowest : 15 20 25 30 35, highest: 240 260 280 320 330
## ---------------------------------------------------------------------------
## vitamins
## n missing distinct Info Mean Gmd
## 77 0 3 0.451 28.25 15.64
##
## Value 0 25 100
## Frequency 8 63 6
## Proportion 0.104 0.818 0.078
## ---------------------------------------------------------------------------
## rating
## n missing distinct Info Mean Gmd .05 .10
## 77 0 77 1 42.67 15.48 22.67 27.92
## .25 .50 .75 .90 .95
## 33.17 40.40 50.83 60.09 68.27
##
## lowest : 18.04285 19.82357 21.87129 22.39651 22.73645
## highest: 68.23588 68.40297 72.80179 74.47295 93.70491
## ---------------------------------------------------------------------------
Find the average sodium, fiber and carbohydrate contents by Manufacturer.
library(doBy)
summaryBy(sodium + fiber + carbo ~ Manufacturer, data=cereal, FUN=mean)
## Manufacturer sodium.mean fiber.mean carbo.mean
## 1 A 0.0000 0.000000 16.00000
## 2 G 200.4545 1.272727 14.72727
## 3 K 174.7826 2.739130 15.13043
## 4 N 37.5000 4.000000 16.00000
## 5 P 146.1111 2.777778 13.22222
## 6 Q 92.5000 1.337500 NA
## 7 R 198.1250 1.875000 17.62500
Add a new variable ‘SodiumClass’, which is ‘high’ when sodium >150 and ‘low’ otherwise. Make sure the new variable is a factor. Now, find the average, minimum and maximum sugar content for ‘low’ and ‘high’ sodium.
cereal$sodiumClass <- factor(ifelse(cereal$sodium < 150,"low","high"))
with(cereal, tapply(sugars,sodiumClass,min,na.rm=TRUE))
## high low
## 1 0
with(cereal, tapply(sugars,sodiumClass,max,na.rm=TRUE))
## high low
## 14 15
with(cereal, tapply(sugars,sodiumClass,mean,na.rm=TRUE))
## high low
## 7.250000 6.642857
#or
library(doBy)
summaryBy(sugars ~ sodiumClass, data=cereal, FUN=c(min,max,mean), na.rm=TRUE)
## sodiumClass sugars.min sugars.max sugars.mean
## 1 high 1 14 7.250000
## 2 low 0 15 6.642857
Find the maximum sugar content by Manufacturer and sodiumClass, using tapply. Inspect the result and notice there are missing values. Try to use na.rm=TRUE as an additional argument to tapply, only to find out that the values are still missing. Finally, use xtabs to count the number of observations by the two factors to find out we have missing values in the tapply result.
with(cereal, tapply(sugars, list(sodiumClass,Manufacturer), FUN=max, na.rm=TRUE))
## A G K N P Q R
## high NA 14 13 NA 14 12 11
## low 3 12 15 6 15 8 11
Repeat the previous question with summaryBy. Compare the results.
library(doBy)
summaryBy(sugars ~ sodiumClass + Manufacturer, data=cereal, FUN=max, na.rm=TRUE)
## sodiumClass Manufacturer sugars.max
## 1 high G 14
## 2 high K 13
## 3 high P 14
## 4 high Q 12
## 5 high R 11
## 6 low A 3
## 7 low G 12
## 8 low K 15
## 9 low N 6
## 10 low P 15
## 11 low Q 8
## 12 low R 11
Count the number of observations by Manufacturer and whether the cereal is ‘hot’ or ‘cold’, using xtabs
cereal <- read.csv("Cereals.csv")
xtabs( ~ Cold.or.Hot + Manufacturer, data=cereal)
## Manufacturer
## Cold.or.Hot A G K N P Q R
## C 0 22 23 5 9 7 8
## H 1 0 0 1 0 1 0
Using the ‘Age and memory’ dataset, find the mean and maximum number of words recalled by ‘Older’ and ‘Younger’ age classes.
words <- read.table("eysenck.txt", header=TRUE)
with(words, tapply(Words, Age, mean))
## Older Younger
## 10.06 13.16
with(words, tapply(Words, Age, max))
## Older Younger
## 23 22
#or
library(doBy)
summaryBy(Words ~ Age, data=words, FUN=c(mean,max))
## Age Words.mean Words.max
## 1 Older 10.06 23
## 2 Younger 13.16 22
Using the HFE weather dataset find the mean air temperature by month.
hfemet <- read.csv("HFEmet2008.csv")
tail(hfemet)
## DateTime Tair AirPress RH VPD PAR Rain wind
## 17563 12/31/2008 21:00 17.14 100.60 77.8 0.4356361 0 0 0.000
## 17564 12/31/2008 21:30 16.30 100.60 81.8 0.3385888 0 0 0.000
## 17565 12/31/2008 22:00 16.14 100.61 83.4 0.3056883 0 0 0.031
## 17566 12/31/2008 22:30 17.02 100.68 78.1 0.4264954 0 0 0.079
## 17567 12/31/2008 23:00 16.42 100.69 82.3 0.3318132 0 0 0.007
## 17568 12/31/2008 23:30 15.67 100.66 86.9 0.2340969 0 0 0.000
## winddirection
## 17563 66.89
## 17564 66.89
## 17565 94.60
## 17566 118.20
## 17567 136.30
## 17568 143.50
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
hfemet$DateTime <- mdy_hm(hfemet$DateTime)
hfemet$month <- month(hfemet$DateTime)
with(hfemet, tapply(Tair, month, mean, na.rm=TRUE))
## 1 2 3 4 5 6 7
## 22.742312 20.463714 19.940297 15.251501 11.894728 12.517657 9.205703
## 8 9 10 11 12
## 9.522370 14.989961 18.062657 19.417347 21.650544
First read the pupae data. Also read this short dataset, which gives a label ‘roomnumber’ for each CO2 treatment
| CO2_treatment | Roomnumber |
|---|---|
| 280 | 1 |
| 400 | 2 |
Merge the short dataset onto the pupae data.
pupae <- read.csv("pupae.csv")
CO2room <- data.frame(CO2_treatment=c(280,400), Roomnumber=1:2)
pupae <- merge(pupae, CO2room)
head(pupae)
## CO2_treatment T_treatment Gender PupalWeight Frass Roomnumber
## 1 280 ambient 0 0.244 1.900 1
## 2 280 ambient 1 0.319 2.770 1
## 3 280 ambient 0 0.221 NA 1
## 4 280 ambient 0 0.280 1.996 1
## 5 280 ambient 0 0.257 1.069 1
## 6 280 ambient 1 0.333 2.257 1
First, run the following code to construct three dataframes that we will attempt to merge together.
dataset1 <- data.frame(unit=letters[1:9], treatment=rep(LETTERS[1:3],each=3),
Damage=runif(9,50,100))
unitweight <- data.frame(unit=letters[c(1,2,4,6,8,9)], Weight = rnorm(6,100,0.3))
treatlocation <- data.frame(treatment=LETTERS[1:3], Glasshouse=c("G1","G2","G3"))
Merge the three datasets together, to end up with one dataframe that has the columns ‘unit’, ‘treatment’, ‘Glasshouse’, ‘Damage’ and ‘Weight’. Some units do not have measurements of Weight. Merge the datasets in two ways to either include or exclude the units without Weight measurements.
dtreat <- merge(dataset1, treatlocation, by="treatment")
merge(dtreat, unitweight, all=TRUE, by="unit")
## unit treatment Damage Glasshouse Weight
## 1 a A 70.16935 G1 100.08244
## 2 b A 87.97205 G1 99.67351
## 3 c A 64.02076 G1 NA
## 4 d B 96.30700 G2 99.84443
## 5 e B 84.67580 G2 NA
## 6 f B 91.38782 G2 100.24163
## 7 g C 70.96169 G3 NA
## 8 h C 62.30425 G3 100.58621
## 9 i C 89.77666 G3 100.42488
merge(dtreat, unitweight, all=FALSE, by="unit")
## unit treatment Damage Glasshouse Weight
## 1 a A 70.16935 G1 100.08244
## 2 b A 87.97205 G1 99.67351
## 3 d B 96.30700 G2 99.84443
## 4 f B 91.38782 G2 100.24163
## 5 h C 62.30425 G3 100.58621
## 6 i C 89.77666 G3 100.42488
cereal <- read.csv("Cereals.csv")
boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer")
Now, redraw the plot with Manufacturer in order of increasing mean sodium content
cereal$Manufacturer <- with(cereal, reorder(Manufacturer, sodium, mean))
boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer")
change the boxplots so that the width varies with the number of observations per manufacturer
boxplot(sodium ~ Manufacturer, data=cereal, ylab="Sodium content", xlab="Manufacturer",
varwidth=TRUE)
hfeif <- read.csv('HFEIFplotmeans.csv', stringsAsFactors=FALSE)
hfeif$Date <- as.Date(mdy(hfeif$Date))
hfeif$Year <- year(hfeif$Date)
hfeif2012 <- subset(hfeif, Year == 2012)
There are four treatments in the dataframe. Calculate the variance of diameter for each of the treatments.These are the within-treatment variances. Also calculate thevariance of tree diameter across all plots (this is one number). This is the plot-to-plot variance.
withinvar <- with(hfeif2012, tapply(diameter,treat,FUN=var))
plotvar <- var(hfeif2012$diameter)
also calculate the mean within-treatment variance. Compare the value to the plot-to-plot variance. What can you tentatively conclude about the treatment effect?
mean(withinvar)
## [1] 1.150171
plotvar
## [1] 3.988293
Variance within treatments <<<<< variance across all plots. There is likely a treatment effect.
Read the dataset (‘weightloss.csv’), and convert the ‘Date’ variable to the Date class.
weightloss <- read.csv("weightloss.csv")
str(weightloss)
## 'data.frame': 67 obs. of 2 variables:
## $ Date : Factor w/ 67 levels "1/04/05","10/02/05",..: 2 11 14 17 20 22 24 29 31 33 ...
## $ Weight: num 224 222 220 219 218 ...
weightloss$Date <- as.Date(dmy(weightloss$Date))
Add a new variable to the dataset, with the subjects’s weight in kilograms
weightloss$weightkg <- weightloss$Weight / 2.204
Produce a line plot that shows weight (in kg) versus time.
plot(weightkg ~ Date, data=weightloss, type='l')
The problem with the plot you just produced is that all measurements are connected by a line, although we would like to have line breaks for the days where the weight was not measured. To do this, construct a dataframe based on the weightloss dataset that has daily values.
Make an entirely new dataframe, with a Date variable, ranging from the 1rst to last days in the weightloss dataset, with a step of one day
Using merge, paste the Weight data onto this new dataframe. Check for missing values. Use the new dataframe to make the plot.
dateseq <- seq(min(weightloss$Date), max(weightloss$Date), by="day")
weightall <- data.frame(Date=dateseq)
weightall <- merge(weightall, weightloss, all=TRUE)
plot(weightkg ~ Date, data=weightall, type='l')
Based on the new dataframe you just produced, graph the daily change in weight versus time. Also add a dashed horizontal line at y=0.
weightall$weightloss <- c(NA, diff(weightall$weightkg))
plot(weightloss ~ Date, data=weightall, type='l', col="red", lwd=2)
abline(h=0)