2.1 Read
a. Download the Cereal.csv file from the Canvas page, use the read.csv command to read it into R, and assign the result to an object called cereal.
cereal = read.csv("cereal.csv", header = TRUE)
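If the file is not at hand, the behaviour of read.csv can be tried on an inline string instead. The sketch below is only an illustration: `csv_text` and `toy` are hypothetical names, and the two rows are copied from the head(cereal) output rather than read from the real file.

```r
# Hypothetical inline CSV standing in for cereal.csv (two rows only).
csv_text <- "name,mfr,calories\n100%_Bran,N,70\nAll-Bran,K,70"

# header = TRUE (the read.csv default) treats the first line as column names.
toy <- read.csv(text = csv_text, header = TRUE)

names(toy)  # column names taken from the header line
nrow(toy)   # number of data rows, excluding the header
```

Spelling out `header = TRUE` avoids relying on partial argument matching (`head`) and on `T`, which is an ordinary variable that can be reassigned.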
2.2 Data frames
a. The cereal object created above should now be available in R. Use the head function to inspect the first few lines of the data frame and use class to check that cereal is in fact a data frame.
head(cereal)
## name mfr type calories protein fat sodium fiber carbo
## 1 100%_Bran N C 70 4 1 130 10.0 5.0
## 2 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond_Delight R C 110 2 2 200 1.0 14.0
## 6 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
class(cereal)
## [1] "data.frame"
b. What are the column names of the cereal data frame? How many rows are there? (Use dim and nrow.)
colnames(cereal)
## [1] "name" "mfr" "type" "calories" "protein" "fat"
## [7] "sodium" "fiber" "carbo" "sugars" "potass" "vitamins"
## [13] "shelf" "weight" "cups" "rating"
dim(cereal)
## [1] 77 16
nrow(cereal)
## [1] 77
c. Extract the calories column using the $ operator and using the [[ operator.
cereal$calories
## [1] 70 120 70 50 110 110 110 130 90 90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140 90 130 120 100 50 50 100
## [58] 100 120 100 90 110 110 80 90 90 110 110 90 110 140 100 110 110 100 100
## [77] 110
cereal[['calories']]
## [1] 70 120 70 50 110 110 110 130 90 90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140 90 130 120 100 50 50 100
## [58] 100 120 100 90 110 110 80 90 90 110 110 90 110 140 100 110 110 100 100
## [77] 110
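The two outputs above are the same vector. A minimal sketch of the difference between `$`, `[[` and single-bracket `[` follows, using a hypothetical three-row data frame `toy` (not the real cereal data) so that it runs on its own.

```r
# Hypothetical toy data frame; the calories values are the first three from the output above.
toy <- data.frame(name = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
                  calories = c(70, 120, 70))

by_dollar  <- toy$calories       # plain numeric vector
by_bracket <- toy[["calories"]]  # the same numeric vector
as_df      <- toy["calories"]    # single bracket keeps a one-column data frame

identical(by_dollar, by_bracket) # TRUE
class(as_df)                     # "data.frame"
```

`[[` is the form to reach for when the column name is stored in a variable, since `$` takes the name literally.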
d. Extract rows 1 to 10 from the cereal data frame.
cereal[1:10, ]
## name mfr type calories protein fat sodium fiber carbo
## 1 100%_Bran N C 70 4 1 130 10.0 5.0
## 2 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond_Delight R C 110 2 2 200 1.0 14.0
## 6 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## 7 Apple_Jacks K C 110 2 0 125 1.0 11.0
## 8 Basic_4 G C 130 3 2 210 2.0 18.0
## 9 Bran_Chex R C 90 2 1 200 4.0 15.0
## 10 Bran_Flakes P C 90 3 0 210 5.0 13.0
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1.00 0.33 68.40297
## 2 8 135 0 3 1.00 1.00 33.98368
## 3 5 320 25 3 1.00 0.33 59.42551
## 4 0 330 25 3 1.00 0.50 93.70491
## 5 8 -1 25 3 1.00 0.75 34.38484
## 6 10 70 25 1 1.00 0.75 29.50954
## 7 14 30 25 2 1.00 1.00 33.17409
## 8 8 100 25 3 1.33 0.75 37.03856
## 9 6 125 25 1 1.00 0.67 49.12025
## 10 5 190 25 3 1.00 0.67 53.31381
e. Make a new data frame called Kelloggs that contains only the rows belonging to the manufacturer Kellogg's (where mfr takes the value "K").
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Kelloggs <- cereal %>% filter(mfr == "K")
print(Kelloggs)
## name mfr type calories protein fat sodium fiber carbo
## 1 All-Bran K C 70 4 1 260 9 7
## 2 All-Bran_with_Extra_Fiber K C 50 4 0 140 14 8
## 3 Apple_Jacks K C 110 2 0 125 1 11
## 4 Corn_Flakes K C 100 2 0 290 1 21
## 5 Corn_Pops K C 110 1 0 90 1 13
## 6 Cracklin'_Oat_Bran K C 110 3 3 140 4 10
## 7 Crispix K C 110 2 0 220 1 21
## 8 Froot_Loops K C 110 2 1 125 1 11
## 9 Frosted_Flakes K C 110 1 0 200 1 14
## 10 Frosted_Mini-Wheats K C 100 3 0 0 3 14
## 11 Fruitful_Bran K C 120 3 0 240 5 14
## 12 Just_Right_Crunchy__Nuggets K C 110 2 1 170 1 17
## 13 Just_Right_Fruit_&_Nut K C 140 3 1 170 2 20
## 14 Mueslix_Crispy_Blend K C 160 3 2 150 3 17
## 15 Nut&Honey_Crunch K C 120 2 1 190 0 15
## 16 Nutri-Grain_Almond-Raisin K C 140 3 2 220 3 21
## 17 Nutri-grain_Wheat K C 90 3 0 170 3 18
## 18 Product_19 K C 100 3 0 320 1 20
## 19 Raisin_Bran K C 120 3 1 210 5 14
## 20 Raisin_Squares K C 90 2 0 0 2 15
## 21 Rice_Krispies K C 110 2 0 290 0 22
## 22 Smacks K C 110 2 1 70 1 9
## 23 Special_K K C 110 6 0 230 1 16
## sugars potass vitamins shelf weight cups rating
## 1 5 320 25 3 1.00 0.33 59.42551
## 2 0 330 25 3 1.00 0.50 93.70491
## 3 14 30 25 2 1.00 1.00 33.17409
## 4 2 35 25 1 1.00 1.00 45.86332
## 5 12 20 25 2 1.00 1.00 35.78279
## 6 7 160 25 3 1.00 0.50 40.44877
## 7 3 30 25 3 1.00 1.00 46.89564
## 8 13 30 25 2 1.00 1.00 32.20758
## 9 11 25 25 1 1.00 0.75 31.43597
## 10 7 100 25 2 1.00 0.80 58.34514
## 11 12 190 25 3 1.33 0.67 41.01549
## 12 6 60 100 3 1.00 1.00 36.52368
## 13 9 95 100 3 1.30 0.75 36.47151
## 14 13 160 25 3 1.50 0.67 30.31335
## 15 9 40 25 2 1.00 0.67 29.92429
## 16 7 130 25 3 1.33 0.67 40.69232
## 17 2 90 25 3 1.00 1.00 59.64284
## 18 3 45 100 3 1.00 1.00 41.50354
## 19 12 240 25 2 1.33 0.75 39.25920
## 20 6 110 25 3 1.00 0.50 55.33314
## 21 3 35 25 1 1.00 1.00 40.56016
## 22 15 40 25 2 1.00 0.75 31.23005
## 23 3 55 25 1 1.00 1.00 53.13132
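The same row filter can also be written without dplyr. The sketch below assumes a hypothetical three-row data frame `toy`, with values taken from rows shown earlier, and compares logical indexing with `subset()`.

```r
# Hypothetical toy data frame (three rows taken from the output above).
toy <- data.frame(name = c("All-Bran", "Basic_4", "Apple_Jacks"),
                  mfr = c("K", "G", "K"),
                  calories = c(70, 130, 110))

# Logical indexing: keep rows where the condition is TRUE, keep all columns.
kelloggs_base <- toy[toy$mfr == "K", ]

# subset() evaluates the condition inside the data frame.
kelloggs_subset <- subset(toy, mfr == "K")

nrow(kelloggs_base)   # 2
kelloggs_subset$name
```

One visible difference from `filter()`: the base R versions keep the original row names (1 and 3 here), whereas the dplyr output above is renumbered from 1.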
2.3 Factors
a. Load the cereal data again with the read.csv command. This time, use the optional argument stringsAsFactors = TRUE.
cereal = read.csv("cereal.csv", stringsAsFactors = TRUE)
b. The mfr and type columns are now factors. Check that this is true.
class(cereal$mfr)
## [1] "factor"
class(cereal$type)
## [1] "factor"
c. How many levels are there in mfr and type? (Use the functions levels or nlevels.)
levels(cereal$mfr)
## [1] "A" "G" "K" "N" "P" "Q" "R"
levels(cereal$type)
## [1] "C" "H"
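`levels` lists the levels, so the count still has to be read off the output; `nlevels` returns the count directly. A small sketch, assuming a hypothetical factor `mfr` built from the six manufacturer codes in the head(cereal) output rather than the full column:

```r
# Hypothetical factor built from the first six mfr values shown earlier.
mfr <- factor(c("N", "Q", "K", "K", "R", "G"))

levels(mfr)   # "G" "K" "N" "Q" "R" (sorted alphabetically by default)
nlevels(mfr)  # 5
table(mfr)    # number of observations per level
```

On the real data, `nlevels(cereal$mfr)` and `nlevels(cereal$type)` would be the corresponding calls.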
2.4 Vectors
a. Extract the calories column into a new vector called cereal.calories.
cereal.calories <- cereal$calories
b. How many elements are there in cereal.calories? (Use length.)
length(cereal.calories)
## [1] 77
c. Extract the 5th to the 10th elements from cereal.calories.
cereal.calories[5:10]
## [1] 110 110 110 130 90 90
d. Add one more element to cereal.calories using c().
cereal.calories <- c(cereal.calories,100)
print(cereal.calories)
## [1] 70 120 70 50 110 110 110 130 90 90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140 90 130 120 100 50 50 100
## [58] 100 120 100 90 110 110 80 90 90 110 110 90 110 140 100 110 110 100 100
## [77] 110 100
2.5 Matrix
a. Can you coerce the cereal data frame to a matrix? Use as.matrix(cereal), then check that the elements have been coerced to the character type.
cereal_matrix <- as.matrix(cereal)
class(cereal_matrix[1,9])
## [1] "character"
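b. Now do this again, but this time leave out the mfr, name and type columns. Check that the elements are now numeric.
# Drop the three non-numeric columns; str() below shows that every remaining
# column is int or num, so as.matrix(cereal_abc) would give a numeric matrix.
cereal_abc <- cereal %>% select(-mfr, -name, -type)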
str(cereal_abc)
## 'data.frame': 77 obs. of 13 variables:
## $ calories: int 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : int 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : int 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : int 280 135 320 330 -1 70 30 100 125 190 ...
## $ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
## $ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
## $ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ rating : num 68.4 34 59.4 93.7 34.4 ...
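The str output confirms the column types, but the matrix step itself can be checked directly. The sketch below uses a hypothetical two-row data frame `toy` (values from the first rows shown earlier) to contrast the two coercions; `m_all` and `m_num` are illustrative names.

```r
# Hypothetical toy data frame mixing character and numeric columns.
toy <- data.frame(name = c("100%_Bran", "All-Bran"),
                  mfr = c("N", "K"),
                  calories = c(70L, 70L),
                  fiber = c(10, 9))

# A matrix holds a single type, so any character column forces character.
m_all <- as.matrix(toy)
typeof(m_all)      # "character"

# Keep only the numeric columns before coercing.
m_num <- as.matrix(toy[sapply(toy, is.numeric)])
is.numeric(m_num)  # TRUE
dim(m_num)         # 2 2
```

Selecting columns with `sapply(toy, is.numeric)` avoids naming the columns to drop, which is one possible alternative to the `select()` call above.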
3 Numerical summary
3.1 Summary
Use the summary function to obtain the median, 1st quartile and 3rd quartile of the sodium column.
summary(cereal$sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 130.0 180.0 159.7 210.0 320.0
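To pull out just the three requested values rather than read them from the printed table, the summary object can be indexed by name. A sketch, assuming a hypothetical vector `sodium` holding only the six sodium values from the head(cereal) output:

```r
# Hypothetical short vector: the first six sodium values shown earlier.
sodium <- c(130, 15, 260, 140, 200, 180)

s <- summary(sodium)

# summary() returns a named vector, so elements can be selected by name.
s[c("1st Qu.", "Median", "3rd Qu.")]

# quantile() gives the same three quantities directly.
q <- quantile(sodium, probs = c(0.25, 0.5, 0.75))
q
```

On the real data the equivalent call would be `summary(cereal$sodium)[c("1st Qu.", "Median", "3rd Qu.")]`.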
3.2 Basic statistics
a. Find the max, min, standard deviation and mean of sodium (max(), min(), sd(), mean()).
max(cereal$sodium)
## [1] 320
min(cereal$sodium)
## [1] 0
sd(cereal$sodium)
## [1] 83.8323
mean(cereal$sodium)
## [1] 159.6753
b. Find the mean sodium for each mfr.
mean.sodium.by.mfr <- aggregate(sodium ~ mfr, data = cereal, FUN = mean, na.rm = TRUE)
print(mean.sodium.by.mfr)
## mfr sodium
## 1 A 0.0000
## 2 G 200.4545
## 3 K 174.7826
## 4 N 37.5000
## 5 P 146.1111
## 6 Q 92.5000
## 7 R 198.1250
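`aggregate` returns a data frame; `tapply` is a common alternative that returns a named vector instead. The sketch below assumes a hypothetical six-row data frame `toy` built from the head(cereal) output, so the means differ from the full-data table above.

```r
# Hypothetical toy data: mfr and sodium for the first six cereals shown earlier.
toy <- data.frame(mfr = c("N", "Q", "K", "K", "R", "G"),
                  sodium = c(130, 15, 260, 140, 200, 180))

# One mean per manufacturer, returned as a named vector.
means <- tapply(toy$sodium, toy$mfr, mean)
means["K"]  # (260 + 140) / 2 = 200

# The formula interface used above gives the same numbers as a data frame.
agg <- aggregate(sodium ~ mfr, data = toy, FUN = mean)
agg
```

The named-vector form is convenient for lookups such as `means["K"]`, while the data frame form is easier to merge or print as a table.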
4 Graphical summary
4.1 Boxplot
a. Make a boxplot of sodium against mfr using boxplot().
boxplot(sodium ~ mfr, data=cereal, horizontal = TRUE,
main="Sodium by Manufacturer",
xlab = "Sodium",
ylab = "Manufacturer")
4.2 Scatterplot
a. Plot calories against sodium using plot().
plot(calories ~ sodium, data=cereal, main="calories against sodium")
5 Write Data to File
b. Write the data frame containing only the Kellogg's observations to a file called kelloggs.csv. Use the write.csv command.
write.csv(Kelloggs,file='kelloggs.csv')
head(Kelloggs)
## name mfr type calories protein fat sodium fiber carbo
## 1 All-Bran K C 70 4 1 260 9 7
## 2 All-Bran_with_Extra_Fiber K C 50 4 0 140 14 8
## 3 Apple_Jacks K C 110 2 0 125 1 11
## 4 Corn_Flakes K C 100 2 0 290 1 21
## 5 Corn_Pops K C 110 1 0 90 1 13
## 6 Cracklin'_Oat_Bran K C 110 3 3 140 4 10
## sugars potass vitamins shelf weight cups rating
## 1 5 320 25 3 1 0.33 59.42551
## 2 0 330 25 3 1 0.50 93.70491
## 3 14 30 25 2 1 1.00 33.17409
## 4 2 35 25 1 1 1.00 45.86332
## 5 12 20 25 2 1 1.00 35.78279
## 6 7 160 25 3 1 0.50 40.44877
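One way to confirm that the file was written as intended is to read it back. The sketch below is a round trip under stated assumptions: `toy` is a hypothetical two-row data frame, and the file goes to a temporary path rather than kelloggs.csv.

```r
# Hypothetical toy data frame (two Kellogg's rows from the output above).
toy <- data.frame(name = c("All-Bran", "Apple_Jacks"),
                  mfr = c("K", "K"),
                  calories = c(70L, 110L))

# Write to a temporary file so the sketch leaves nothing behind.
path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv from adding the row numbers as an extra column.
write.csv(toy, file = path, row.names = FALSE)

back <- read.csv(path)
names(back)  # "name" "mfr" "calories"

unlink(path) # remove the temporary file
```

With the default `row.names = TRUE`, as in the call above, the row numbers are written as an unnamed first column, which read.csv then reads back as a column called X.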