Preparation and assumed knowledge

  • Download the Cereal.csv data file from Canvas for this lab.

Aims

  • In this intro lab, you will learn to write an R Markdown file and run some simple code. The questions aim to familiarise you with R syntax.
  • Please submit this lab work on Canvas.


1 Data structure

Before you start the tutorial, create a new RStudio project in a new directory. Then, start your tutorial by writing your answers in a new R Markdown file.

2 File I/O

2.1 Read

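  1. Download the Cereal.csv file from the Canvas page, then use the read.csv command to read it into R and assign the result to an object called cereal.
# This path is specific to one machine; replace it with your own project directory.
# Inside an RStudio project the working directory is already the project folder,
# so setwd() can usually be omitted.
setwd("/Users/zixuan/Desktop/2024S1/stat5003 /w1")
cereal <- read.csv("Cereal.csv")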
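The read step can be tried without the real file. The following is a minimal sketch, assuming a tiny stand-in CSV written to a temporary folder; the names demo_path and demo are hypothetical and the three columns are only a cut-down version of the real data.

```r
# Write a small stand-in CSV to a temporary folder (this is NOT the real Cereal.csv).
demo_path <- file.path(tempdir(), "demo_cereal.csv")
writeLines(c("name,mfr,calories",
             "100%_Bran,N,70",
             "All-Bran,K,70"), demo_path)

# Confirm the file is where we expect before trying to read it.
stopifnot(file.exists(demo_path))

# read.csv() returns a data frame; the first line of the file becomes the column names.
demo <- read.csv(demo_path)
str(demo)

# A relative path such as "Cereal.csv" is looked up inside the working directory.
getwd()
```

If read.csv reports that it cannot open the file, comparing getwd() with the folder that actually holds the CSV is usually the quickest check.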

2.2 Data frames

  1. You should now have an object called cereal in your workspace. Use the head function to inspect the first few rows of the data frame, and use class to check that cereal is in fact a data frame.
head(cereal, 10)
##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2          100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple_Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic_4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran_Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran_Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381
class(cereal)
## [1] "data.frame"
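  1. What are the column names of the cereal data frame? How many rows are there? (names, dim and nrow)
names(cereal)
##  [1] "name"     "mfr"      "type"     "calories" "protein"  "fat"     
##  [7] "sodium"   "fiber"    "carbo"    "sugars"   "potass"   "vitamins"
## [13] "shelf"    "weight"   "cups"     "rating"
dim(cereal)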
## [1] 77 16
nrow(cereal)
## [1] 77
  1. Extract the calories column using the $ operator and using the [[ operator.
cereal$calories
##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110
cereal[["calories"]]
##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110
  1. Extract rows 1 to 10 from the cereal data frame.
head(cereal,10)
##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2          100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple_Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic_4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran_Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran_Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381
  1. Make a new data frame called Kelloggs that contains only the rows belonging to the manufacturer Kellogg’s (where mfr takes the value "K").
Kelloggs <- subset(cereal, mfr == "K")
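The extraction steps in this section can be compared side by side on a small example. This is a sketch only, assuming a toy data frame built from four of the rows printed above; the names toy, col, k_subset and k_bracket are hypothetical.

```r
# Toy data frame using four rows from the printed output (three columns only).
toy <- data.frame(name = c("100%_Bran", "All-Bran", "Apple_Jacks", "Basic_4"),
                  mfr = c("N", "K", "K", "G"),
                  calories = c(70, 70, 110, 130))

# $ needs the literal column name; [[ also accepts a name stored in a variable.
col <- "calories"
by_dollar  <- toy$calories
by_bracket <- toy[[col]]
identical(by_dollar, by_bracket)
is.null(toy$col)          # there is no column literally called "col"

# Rows are selected with [rows, columns]; leaving columns empty keeps them all.
first_two <- toy[1:2, ]

# subset() and logical indexing are two ways to keep only the "K" rows.
k_subset  <- subset(toy, mfr == "K")
k_bracket <- toy[toy$mfr == "K", ]
all.equal(k_subset, k_bracket)
```

The [[ form is the one to reach for when the column name is computed or passed in, for example inside a loop or a function.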

2.3 Factors

  1. Load the Cereal data again with the read.csv command. This time, use the optional argument stringsAsFactors = TRUE.
cereal <- read.csv("Cereal.csv", stringsAsFactors = TRUE)
  1. The mfr and type columns are now factors. Check that this is true.
class(cereal$mfr)
## [1] "factor"
class(cereal$type)
## [1] "factor"
  1. How many levels are there in mfr and type? (use the functions levels or nlevels)
levels(cereal$mfr)
## [1] "A" "G" "K" "N" "P" "Q" "R"
nlevels(cereal$mfr)
## [1] 7
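To see what a factor stores, here is a small sketch that assumes only the ten mfr values visible in the head output above, so it has fewer levels than the full column. The variable name mfr10 is hypothetical.

```r
# The mfr values of the first ten rows, copied from the printed output.
mfr10 <- factor(c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"))

# Levels are the distinct values, sorted alphabetically by default.
levels(mfr10)
nlevels(mfr10)

# table() counts how many observations fall in each level.
counts <- table(mfr10)
counts

# Internally a factor is a vector of integer codes that index into the levels.
codes <- as.integer(mfr10)
codes[1:3]
```

The same calls work on cereal$type, which is how the second half of the question about levels can be answered.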

2.4 Vectors

  1. Extract the calories into a new vector called cereal.calories.
cereal.calories <- cereal$calories
cereal.calories
##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110
  1. How many elements are there in cereal.calories? (length)
length(cereal.calories)
## [1] 77
  1. Extract the 5th to the 10th element from cereal.calories.
cereal.calories[5:10]
## [1] 110 110 110 130  90  90
  1. Add one more element to cereal.calories using c().
length(cereal.calories)
## [1] 77
cereal.calories <- c(cereal.calories, 3)
length(cereal.calories)
## [1] 78
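A few other ways of indexing and extending a vector are sketched below, assuming only the first ten calories values from the printed output. The names cal and longer are hypothetical, and the appended value 3 is arbitrary, as in the answer above.

```r
# First ten calories values, copied from the printed output above.
cal <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)

fifth_to_tenth <- cal[5:10]       # positive range: elements 5 to 10
without_first  <- cal[-1]         # negative index drops that element
above_100      <- cal[cal > 100]  # logical index keeps matching elements

# append() behaves like c() but can insert at a chosen position.
longer <- append(cal, 3, after = 0)  # put the new value at the front

length(cal)
length(longer)
```

Note that c() and append() return a new vector; the original only changes if the result is assigned back to it.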

2.5 Matrix

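  1. Can you coerce the cereal data frame to a matrix? (as.matrix(cereal)). Check that the elements have been coerced to the character type.
cereal_matrix <- as.matrix(cereal)
class(cereal_matrix)
## [1] "matrix" "array"
# class() only confirms that the object is a matrix; typeof() reports the element type
typeof(cereal_matrix)
## [1] "character"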
  1. Now do this again, but this time leave out the mfr, name and type columns. Check that the elements are now numeric.
# Drop the three non-numeric columns first, then convert what is left
cereal_numeric <- cereal[, !(names(cereal) %in% c("mfr", "name", "type"))]
cereal_matrix <- as.matrix(cereal_numeric)
str(cereal_matrix)
##  num [1:77, 1:13] 70 120 70 50 110 110 110 130 90 90 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:13] "calories" "protein" "fat" "sodium" ...
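Instead of listing the columns to drop by name, the numeric columns can be detected automatically. This is one possible sketch, assuming a toy data frame with two rows from the printed output; toy, numeric_cols and toy_matrix are hypothetical names.

```r
# Toy data frame mixing character and numeric columns.
toy <- data.frame(name = c("100%_Bran", "All-Bran"),
                  mfr = c("N", "K"),
                  calories = c(70, 70),
                  sodium = c(130, 260))

# A matrix holds one type only, so mixed columns are all coerced to character.
mixed <- as.matrix(toy)
typeof(mixed)

# sapply() applies is.numeric to each column and returns one TRUE/FALSE per column.
numeric_cols <- sapply(toy, is.numeric)
numeric_cols

# Keep only the numeric columns; drop = FALSE guarantees a data frame is returned.
toy_matrix <- as.matrix(toy[, numeric_cols, drop = FALSE])
typeof(toy_matrix)
dim(toy_matrix)
```

This approach keeps working if columns are added or renamed, at the cost of being less explicit about which columns were removed.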

3 Numerical summary

3.1 Summary

  1. Use the summary function to extract the median, 1st quartile and 3rd quartile data from the sodium column.
cereal_number <- summary(cereal$sodium)
sodium_stats <- cereal_number[c("Median", "1st Qu.", "3rd Qu.")]
sodium_stats
##  Median 1st Qu. 3rd Qu. 
##     180     130     210
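The quartiles can also be requested directly with quantile(). The sketch below assumes only the first ten sodium values from the printed output, so its results differ from the full-column values shown above; sodium10 and q are hypothetical names.

```r
# First ten sodium values, copied from the printed output above.
sodium10 <- c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)

# quantile() takes the probabilities you want: 0.25, 0.5 and 0.75 give
# the 1st quartile, the median and the 3rd quartile.
q <- quantile(sodium10, probs = c(0.25, 0.5, 0.75))
q

# summary() reports the same three quantities under different names.
s <- summary(sodium10)[c("1st Qu.", "Median", "3rd Qu.")]
all.equal(as.numeric(q), as.numeric(s))
```

quantile() is convenient when other cut points are needed, such as deciles, since any vector of probabilities can be supplied.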

3.2 Basic statistics

  1. Find the maximum, minimum, standard deviation and mean of the sodium column (max(), min(), sd(), mean()).
max(cereal$sodium)
## [1] 320
min(cereal$sodium)
## [1] 0
sd(cereal$sodium)
## [1] 83.8323
mean(cereal$sodium)
## [1] 159.6753
  1. Find the mean sodium for each manufacturer (mfr).
mean_sodium_per_mfr <- aggregate(sodium ~ mfr, data = cereal, FUN = mean)
mean_sodium_per_mfr
##   mfr   sodium
## 1   A   0.0000
## 2   G 200.4545
## 3   K 174.7826
## 4   N  37.5000
## 5   P 146.1111
## 6   Q  92.5000
## 7   R 198.1250
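Group means can be computed in more than one way in base R. The following sketch compares tapply() with aggregate(), assuming a toy data frame made from the first ten rows only, so the means do not match the full-data table above. The names toy, by_tapply and by_aggregate are hypothetical.

```r
# mfr and sodium for the first ten rows, copied from the printed output.
toy <- data.frame(mfr = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
                  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210))

# tapply(values, groups, function) returns a named vector, one entry per group.
by_tapply <- tapply(toy$sodium, toy$mfr, mean)
by_tapply

# aggregate() with a formula returns a data frame with one row per group.
by_aggregate <- aggregate(sodium ~ mfr, data = toy, FUN = mean)
by_aggregate

# Both order the groups alphabetically, so the means line up.
all.equal(as.numeric(by_tapply), by_aggregate$sodium)
```

tapply() is handy for a quick look at a single variable, while the data frame returned by aggregate() is easier to merge or plot afterwards.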

4 Graphical summary

4.1 Boxplot

  1. Make a boxplot of sodium against mfr using boxplot().
boxplot(sodium ~ mfr, data = cereal, xlab = "Manufacturer", ylab = "Sodium", main = "Sodium by manufacturer")

4.2 Scatterplot

  1. Plot calories against sodium using plot().
plot(calories ~ sodium, data = cereal, xlab = "Sodium", ylab = "Calories", main = "Calories against sodium")
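Plots drawn in an R Markdown chunk appear in the knitted document, but it can also be useful to save one to a file. This is a rough sketch, assuming a toy data frame from the first ten rows and a temporary output path; toy and plot_path are hypothetical names, and the PDF device is used only because it ships with base R.

```r
# mfr, sodium and calories for the first ten rows, copied from the printed output.
toy <- data.frame(mfr = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
                  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210),
                  calories = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90))

# Open a PDF device; everything plotted until dev.off() goes into this file.
plot_path <- file.path(tempdir(), "toy_plots.pdf")
pdf(plot_path)

# Each plotting call starts a new page in the PDF.
boxplot(sodium ~ mfr, data = toy,
        xlab = "Manufacturer", ylab = "Sodium", main = "Sodium by manufacturer")
plot(calories ~ sodium, data = toy,
     xlab = "Sodium", ylab = "Calories", main = "Calories against sodium")

# Close the device so the file is written to disk.
invisible(dev.off())
file.exists(plot_path)
```

In both formulas the variable on the left of ~ goes on the vertical axis and the one on the right goes on the horizontal axis.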

5 Write Data to File

  1. Write the data frame containing only the Kellogg’s observations to a file called kelloggs.csv. Use the write.csv command.
write.csv(Kelloggs, "kelloggs.csv", row.names = FALSE)
# Check with the same lowercase file name; some file systems are case-sensitive
file.exists("kelloggs.csv")
## [1] TRUE
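The effect of row.names = FALSE is easiest to see by writing a file and reading it back. The sketch below assumes a two-row toy data frame and a temporary file; toy, out_path, back and with_rownames are hypothetical names.

```r
# Two Kellogg's rows from the printed output (three columns only).
toy <- data.frame(name = c("All-Bran", "Apple_Jacks"),
                  mfr = c("K", "K"),
                  calories = c(70, 110))
out_path <- file.path(tempdir(), "toy_kelloggs.csv")

# With row.names = FALSE the file contains only the data columns.
write.csv(toy, out_path, row.names = FALSE)
back <- read.csv(out_path)
names(back)

# With the default (row.names = TRUE) the row labels are written as an extra,
# unnamed first column, which read.csv() then names "X".
write.csv(toy, out_path)
with_rownames <- read.csv(out_path)
names(with_rownames)
```

Leaving out the row names keeps the file tidy for anyone who opens it in a spreadsheet or reads it into another program.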