Before you start the tutorial, create a new RStudio project in a new directory. Then, start your tutorial by writing your answers in a new R Markdown file.
(a). Download the Cereal.csv file from the Canvas page, read it into R with the read.csv command, and assign the result to an object called cereal.
cereal = read.csv("Cereal.csv", header = TRUE)
#print (cereal)
(a). There should now be a data frame in R called cereal. Use the head function to inspect the first few rows of the data frame, and use class to check that cereal is in fact a data frame.
head(cereal, n = 3)
##                name mfr type calories protein fat sodium fiber carbo sugars
## 1         100%_Bran   N    C       70       4   1    130    10     5      6
## 2 100%_Natural_Bran   Q    C      120       3   5     15     2     8      8
## 3          All-Bran   K    C       70       4   1    260     9     7      5
##   potass vitamins shelf weight cups   rating
## 1    280       25     3      1 0.33 68.40297
## 2    135        0     3      1 1.00 33.98368
## 3    320       25     3      1 0.33 59.42551
class(cereal)
## [1] "data.frame"
(b). What are the column names of the cereal data frame? How many rows are there? (dim and nrow)
dim(cereal)
## [1] 77 16
nrow(cereal)
## [1] 77
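The answer above covers the dimensions but not the column names. A minimal sketch of how names, nrow and ncol fit together follows. It uses a small hypothetical data frame called toy, built from the first three rows and three of the columns printed earlier, so that it runs without Cereal.csv; on the real data you would call names(cereal) instead.

```r
# Hypothetical toy data frame: three rows and three columns taken from the head() output above
toy <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70)
)

col_names <- names(toy)  # for a data frame, colnames(toy) returns the same vector
n_rows <- nrow(toy)
n_cols <- ncol(toy)

print(col_names)
print(dim(toy))          # dim() reports rows first, then columns
```

For the cereal data frame, dim(cereal) already shows 16 columns, so names(cereal) should return 16 names.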
(c). Extract the calories column using the $ operator and using the [[ operator.
cereal$calories
## [1] 70 120 70 50 110 110 110 130 90 90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140 90 130 120 100 50 50 100
## [58] 100 120 100 90 110 110 80 90 90 110 110 90 110 140 100 110 110 100 100
## [77] 110
identical(cereal$calories, cereal[["calories"]])
## [1] TRUE
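It may help to see how the $ and [[ operators differ from single-bracket indexing, which is easy to confuse with them. The sketch below uses a hypothetical toy data frame (values copied from the first three rows of the output above); the variable name wanted is an assumption introduced only for illustration.

```r
# Hypothetical toy data frame with values from the first three cereal rows
toy <- data.frame(mfr = c("N", "Q", "K"), calories = c(70, 120, 70))

as_vector <- toy[["calories"]]  # [[ returns the column itself (a numeric vector)
as_frame  <- toy["calories"]    # [ returns a one-column data frame

# [[ accepts a column name stored in a variable; $ takes the name literally
wanted <- "calories"
via_variable <- toy[[wanted]]
via_dollar   <- toy$wanted      # NULL, because there is no column literally called "wanted"

print(class(as_vector))
print(class(as_frame))
```

This is one reason [[ is often preferred inside functions and loops, where the column name is not known in advance.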
(d). Extract rows 1 to 10 from the cereal data frame.
cereal[1:10,]
## name mfr type calories protein fat sodium fiber carbo
## 1 100%_Bran N C 70 4 1 130 10.0 5.0
## 2 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond_Delight R C 110 2 2 200 1.0 14.0
## 6 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## 7 Apple_Jacks K C 110 2 0 125 1.0 11.0
## 8 Basic_4 G C 130 3 2 210 2.0 18.0
## 9 Bran_Chex R C 90 2 1 200 4.0 15.0
## 10 Bran_Flakes P C 90 3 0 210 5.0 13.0
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1.00 0.33 68.40297
## 2 8 135 0 3 1.00 1.00 33.98368
## 3 5 320 25 3 1.00 0.33 59.42551
## 4 0 330 25 3 1.00 0.50 93.70491
## 5 8 -1 25 3 1.00 0.75 34.38484
## 6 10 70 25 1 1.00 0.75 29.50954
## 7 14 30 25 2 1.00 1.00 33.17409
## 8 8 100 25 3 1.33 0.75 37.03856
## 9 6 125 25 1 1.00 0.67 49.12025
## 10 5 190 25 3 1.00 0.67 53.31381
(e). Make a new data frame called Kelloggs that contains only the rows belonging to the manufacturer Kellogg's (where mfr takes the value "K").
Kelloggs = cereal[cereal$mfr=="K",]
print (Kelloggs)
## name mfr type calories protein fat sodium fiber carbo
## 3 All-Bran K C 70 4 1 260 9 7
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14 8
## 7 Apple_Jacks K C 110 2 0 125 1 11
## 17 Corn_Flakes K C 100 2 0 290 1 21
## 18 Corn_Pops K C 110 1 0 90 1 13
## 20 Cracklin'_Oat_Bran K C 110 3 3 140 4 10
## 22 Crispix K C 110 2 0 220 1 21
## 25 Froot_Loops K C 110 2 1 125 1 11
## 26 Frosted_Flakes K C 110 1 0 200 1 14
## 27 Frosted_Mini-Wheats K C 100 3 0 0 3 14
## 29 Fruitful_Bran K C 120 3 0 240 5 14
## 39 Just_Right_Crunchy__Nuggets K C 110 2 1 170 1 17
## 40 Just_Right_Fruit_&_Nut K C 140 3 1 170 2 20
## 47 Mueslix_Crispy_Blend K C 160 3 2 150 3 17
## 49 Nut&Honey_Crunch K C 120 2 1 190 0 15
## 50 Nutri-Grain_Almond-Raisin K C 140 3 2 220 3 21
## 51 Nutri-grain_Wheat K C 90 3 0 170 3 18
## 54 Product_19 K C 100 3 0 320 1 20
## 59 Raisin_Bran K C 120 3 1 210 5 14
## 61 Raisin_Squares K C 90 2 0 0 2 15
## 63 Rice_Krispies K C 110 2 0 290 0 22
## 67 Smacks K C 110 2 1 70 1 9
## 68 Special_K K C 110 6 0 230 1 16
## sugars potass vitamins shelf weight cups rating
## 3 5 320 25 3 1.00 0.33 59.42551
## 4 0 330 25 3 1.00 0.50 93.70491
## 7 14 30 25 2 1.00 1.00 33.17409
## 17 2 35 25 1 1.00 1.00 45.86332
## 18 12 20 25 2 1.00 1.00 35.78279
## 20 7 160 25 3 1.00 0.50 40.44877
## 22 3 30 25 3 1.00 1.00 46.89564
## 25 13 30 25 2 1.00 1.00 32.20758
## 26 11 25 25 1 1.00 0.75 31.43597
## 27 7 100 25 2 1.00 0.80 58.34514
## 29 12 190 25 3 1.33 0.67 41.01549
## 39 6 60 100 3 1.00 1.00 36.52368
## 40 9 95 100 3 1.30 0.75 36.47151
## 47 13 160 25 3 1.50 0.67 30.31335
## 49 9 40 25 2 1.00 0.67 29.92429
## 50 7 130 25 3 1.33 0.67 40.69232
## 51 2 90 25 3 1.00 1.00 59.64284
## 54 3 45 100 3 1.00 1.00 41.50354
## 59 12 240 25 2 1.33 0.75 39.25920
## 61 6 110 25 3 1.00 0.50 55.33314
## 63 3 35 25 1 1.00 1.00 40.56016
## 67 15 40 25 2 1.00 0.75 31.23005
## 68 3 55 25 1 1.00 1.00 53.13132
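Note that the row labels in the output above (3, 4, 7, ...) are the original row names from cereal, not a fresh 1 to 23 count. The following sketch illustrates this on a hypothetical five-row toy data frame and compares logical indexing with the base R subset() function, which is an alternative the tutorial itself does not use.

```r
# Hypothetical toy data frame: name, mfr and sodium for the first five cereal rows
toy <- data.frame(
  name   = c("100%_Bran", "100%_Natural_Bran", "All-Bran",
             "All-Bran_with_Extra_Fiber", "Almond_Delight"),
  mfr    = c("N", "Q", "K", "K", "R"),
  sodium = c(130, 15, 260, 140, 200)
)

is_kelloggs <- toy$mfr == "K"           # logical vector, one entry per row
kelloggs_index  <- toy[is_kelloggs, ]   # the comma matters: select rows, keep all columns
kelloggs_subset <- subset(toy, mfr == "K")

print(is_kelloggs)
print(rownames(kelloggs_index))         # original row names are kept
```

Forgetting the trailing comma, as in toy[is_kelloggs], would try to select columns rather than rows.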
(a). Load the Cereal data again with the read.csv command. This time, use the optional argument stringsAsFactors = TRUE.
cereal2 = read.csv("Cereal.csv", header = TRUE, stringsAsFactors = TRUE)
(b). The mfr and type columns are now factors. Check that this is true.
class(cereal2$mfr) == class(cereal2$type)
## [1] TRUE
class(cereal2$type) == "factor"
## [1] TRUE
(c). How many levels are there in mfr and type? (use the functions levels or nlevels)
levels(cereal2$mfr)
## [1] "A" "G" "K" "N" "P" "Q" "R"
levels(cereal2$type)
## [1] "C" "H"
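The question also mentions nlevels, which returns the count directly instead of the labels. A small sketch on a hypothetical factor (six manufacturer codes, not the full column) shows how levels, nlevels and table relate:

```r
# Hypothetical factor built from six manufacturer codes
mfr <- factor(c("N", "Q", "K", "K", "R", "G"))

level_names <- levels(mfr)   # distinct values, sorted
level_count <- nlevels(mfr)  # same as length(levels(mfr))
counts      <- table(mfr)    # number of observations per level

print(level_names)
print(level_count)
print(counts)
```

On the tutorial data, nlevels(cereal2$mfr) and nlevels(cereal2$type) should agree with the seven and two levels listed above.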
(a). Extract the calories column into a new vector called cereal.calories.
cereal.calories = cereal2$calories
print (cereal.calories)
## [1] 70 120 70 50 110 110 110 130 90 90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140 90 130 120 100 50 50 100
## [58] 100 120 100 90 110 110 80 90 90 110 110 90 110 140 100 110 110 100 100
## [77] 110
(b). How many elements are there in cereal.calories? (length)
length (cereal.calories)
## [1] 77
(c). Extract the 5th to the 10th element from cereal.calories.
cereal.calories[5:10]
## [1] 110 110 110 130 90 90
(d). Add one more element to cereal.calories using c().
cereal.calories <- c(cereal.calories, 99999)
print (cereal.calories)
## [1] 70 120 70 50 110 110 110 130 90 90 120 110
## [13] 120 110 110 110 100 110 110 110 100 110 100 100
## [25] 110 110 100 120 120 110 100 110 100 110 120 120
## [37] 110 110 110 140 110 100 110 100 150 150 160 100
## [49] 120 140 90 130 120 100 50 50 100 100 120 100
## [61] 90 110 110 80 90 90 110 110 90 110 140 100
## [73] 110 110 100 100 110 99999
(a). Can you force the cereal data frame to be a matrix? (as.matrix(cereal)) Check that the elements have been coerced to the character type.
matrix_cereal = as.matrix(cereal)
is.matrix(matrix_cereal)
## [1] TRUE
class (matrix_cereal[,6])
## [1] "character"
(b). Now do this again, but this time leave out the mfr, name and type columns. Check that the elements are now numeric.
tichu_cereal = as.matrix(cereal[, 4:16])  # drop name, mfr and type (columns 1 to 3), then convert
is.numeric(tichu_cereal)
## [1] TRUE
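The reason for the two different results is that a matrix can hold only one type, so a single character column forces every element to character. A minimal sketch on a hypothetical two-row toy data frame:

```r
# Hypothetical toy data frame with one character column and two numeric columns
toy <- data.frame(
  name     = c("100%_Bran", "All-Bran"),
  calories = c(70, 70),
  fat      = c(1, 1)
)

mixed        <- as.matrix(toy)                        # name column forces character
numeric_only <- as.matrix(toy[, c("calories", "fat")]) # numeric columns only

print(typeof(mixed))
print(typeof(numeric_only))
```

Selecting columns by name, as here, is an alternative to the positional 4:16 used above and does not depend on column order.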
(a). Use the summary function to extract the median, 1st quartile and 3rd quartile data from the sodium column.
summary(cereal$sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   130.0   180.0   159.7   210.0   320.0
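summary prints all six statistics; to extract only the three that the question asks for, the result can be indexed by name. The sketch below assumes a hypothetical vector holding the first six sodium values from the rows printed earlier, and also shows quantile() as an alternative route.

```r
# First six sodium values from the cereal rows shown earlier (hypothetical stand-in for the column)
sodium <- c(130, 15, 260, 140, 200, 180)

s <- summary(sodium)
three_stats <- s[c("1st Qu.", "Median", "3rd Qu.")]    # index the summary by name
quartiles   <- quantile(sodium, probs = c(0.25, 0.5, 0.75))

print(three_stats)
print(quartiles)
```

On the full data, the same indexing applied to summary(cereal$sodium) picks out the 1st quartile, median and 3rd quartile shown above.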
(a). Find the max, min, standard deviation and mean of the sodium column (max(), min(), sd(), mean()).
max(cereal$sodium)
## [1] 320
min(cereal$sodium)
## [1] 0
sd(cereal$sodium)
## [1] 83.8323
mean(cereal$sodium)
## [1] 159.6753
(b). Find the mean sodium of each mfr.
K = cereal[cereal$mfr=="K",]
mean(K$sodium)
## [1] 174.7826
A = cereal[cereal$mfr=="A",]
mean(A$sodium)
## [1] 0
G = cereal[cereal$mfr=="G",]
mean(G$sodium)
## [1] 200.4545
N = cereal[cereal$mfr=="N",]
mean(N$sodium)
## [1] 37.5
P = cereal[cereal$mfr=="P",]
mean(P$sodium)
## [1] 146.1111
Q = cereal[cereal$mfr=="Q",]
mean(Q$sodium)
## [1] 92.5
R = cereal[cereal$mfr=="R",]
mean(R$sodium)
## [1] 198.125
aggregate(cereal$sodium, by = list(cereal$mfr), FUN = mean)
##   Group.1        x
## 1       A   0.0000
## 2       G 200.4545
## 3       K 174.7826
## 4       N  37.5000
## 5       P 146.1111
## 6       Q  92.5000
## 7       R 198.1250
# aggregate(x = column to summarise, by = list of grouping column(s), FUN = function applied to each group)
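Subsetting each manufacturer by hand works, but it repeats the same two lines seven times. Besides the aggregate call above, two other base R options give the same group means: the formula interface of aggregate, which labels the output columns, and tapply. A sketch on a hypothetical six-row toy data frame (values taken from the first six cereal rows):

```r
# Hypothetical toy data frame: mfr and sodium for the first six cereal rows
toy <- data.frame(
  mfr    = c("N", "Q", "K", "K", "R", "G"),
  sodium = c(130, 15, 260, 140, 200, 180)
)

# Formula interface: output columns are named mfr and sodium instead of Group.1 and x
by_formula <- aggregate(sodium ~ mfr, data = toy, FUN = mean)

# tapply returns a named array with one mean per manufacturer
by_tapply <- tapply(toy$sodium, toy$mfr, mean)

print(by_formula)
print(by_tapply)
```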
(a). Make a boxplot of the sodium against mfr using boxplot().
cereal = read.csv("Cereal.csv", header = TRUE)
boxplot(sodium ~ mfr, data = cereal, xlab = "Manufacturer", ylab = "Sodium", main = "Boxplot of Sodium by Manufacturer")
(a). Plot calories against sodium using plot().
plot(calories~sodium, data = cereal, xlab = "Sodium", ylab = "Calories", main = "Scatter Plot of Sodium VS. Calories")
(a). Write the data frame containing only the Kellogg’s observations to a file called kelloggs.csv. Use the write.csv command.
write.csv(Kelloggs, file = "kelloggs.csv")
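By default, write.csv also writes the row names (3, 4, 7, ... for Kelloggs) as an unnamed first column, which read.csv will read back as an extra column called X. If that is not wanted, row.names = FALSE leaves them out. The sketch below checks the round trip on a hypothetical two-row toy data frame and writes to a temporary file so that nothing in the project directory is overwritten.

```r
# Hypothetical toy data frame with two Kellogg's rows from the output above
toy <- data.frame(
  name   = c("All-Bran", "Corn_Flakes"),
  mfr    = c("K", "K"),
  sodium = c(260, 290)
)

out_file <- tempfile(fileext = ".csv")              # temporary path, assumption for this sketch
write.csv(toy, file = out_file, row.names = FALSE)  # omit the row-name column

round_trip <- read.csv(out_file, header = TRUE)
print(names(round_trip))
```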