Preparation and assumed knowledge

Download the Cereal.csv data file from Canvas for this lab.

Aims

In this introductory lab, you will learn to write an R Markdown file and run some simple code. The questions here aim to familiarise you with R syntax.
Please submit this lab work on Canvas.

1 Data structure

Before you start the tutorial, create a new RStudio project in a new directory. Then, start your tutorial by writing your answers in a new R Markdown file.

2 File I/O

2.1 Read

Download the Cereal.csv file from the Canvas page, use the read.csv command to read it into R, and assign the result to an object called cereal.

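# Point R at the folder that contains Cereal.csv (adjust this path for your machine)
setwd("/Users/zixuan/Desktop/2024S1/stat5003 /w1")
# Read the CSV file into a data frame called cereal
cereal <- read.csv("Cereal.csv")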
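
If you would rather not depend on setwd, one option is to build the file path explicitly and confirm that the file exists before reading it. The sketch below is only an illustration: toy_cereal.csv and its two rows are a hypothetical stand-in for Cereal.csv, written to a temporary folder so that the code runs on any machine.

```r
# Hypothetical stand-in for Cereal.csv: write a tiny CSV to a temporary folder
csv_path <- file.path(tempdir(), "toy_cereal.csv")
writeLines(c("name,mfr,calories",
             "100%_Bran,N,70",
             "All-Bran,K,70"),
           csv_path)

# Check that the file is where we expect before trying to read it
stopifnot(file.exists(csv_path))

# Read the file using its full path, so no setwd call is needed
toy <- read.csv(csv_path)
str(toy)
```

Inside an RStudio project the working directory is already the project folder, so a relative path such as "Cereal.csv" should work as long as the file is saved there.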

2.2 Data frames

There should now be an object called cereal in your R session. Use the head function to inspect the first few rows of the data frame, and use class to check that cereal is in fact a data frame.

head(cereal, 10)

##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2          100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple_Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic_4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran_Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran_Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381

class(cereal)

## [1] "data.frame"

What are the column names of the cereal data frame? How many rows are there? (names, dim and nrow)

dim(cereal)

## [1] 77 16

nrow(cereal)

## [1] 77
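
The column names themselves can be listed with names (or colnames). Here is a minimal sketch on a small data frame called toy, a hypothetical object built from a few values in the output above so that the example runs without Cereal.csv.

```r
# Hypothetical three-row data frame using values from the first rows of cereal
toy <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70)
)

names(toy)  # column names as a character vector
ncol(toy)   # number of columns
nrow(toy)   # number of rows
dim(toy)    # rows and columns together
```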

Extract the calories column using the $ operator and using the [[ operator.

cereal$calories

##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110

cereal[["calories"]]

##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110
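
Both operators return the column as a plain vector. As a rough illustration of how they differ from single brackets, the sketch below uses a hypothetical data frame called toy; the variable col_name is likewise only for illustration.

```r
# Hypothetical three-row data frame using values from the first rows of cereal
toy <- data.frame(
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70)
)

by_dollar <- toy$calories        # column name written directly in the code
by_name   <- toy[["calories"]]   # column name given as a string
identical(by_dollar, by_name)    # TRUE

# [[ also accepts a name stored in a variable, which $ does not
col_name <- "calories"
toy[[col_name]]

# Single brackets keep the data frame structure instead of returning a vector
class(toy["calories"])           # "data.frame"
```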

Extract rows 1 to 10 from the cereal data frame.

cereal[1:10, ]

##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2          100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple_Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic_4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran_Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran_Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381

Make a new data frame called Kelloggs that contains only the rows belonging to the manufacturer Kellogg’s (where mfr takes the value "K").

Kelloggs <- subset(cereal, mfr == "K")
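
The same rows can also be selected with a logical index inside square brackets. The sketch below compares the two approaches on a hypothetical data frame called toy; the object names are assumptions made for the example.

```r
# Hypothetical data frame using values from the first rows of cereal
toy <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran",
               "All-Bran_with_Extra_Fiber"),
  mfr      = c("N", "Q", "K", "K"),
  calories = c(70, 120, 70, 50)
)

# subset() evaluates the condition using the columns of toy
kelloggs_subset <- subset(toy, mfr == "K")

# Logical indexing: keep the rows where the condition is TRUE, and all columns
kelloggs_index <- toy[toy$mfr == "K", ]

# Both approaches should select the same two rows
all.equal(kelloggs_subset, kelloggs_index)
```

One difference worth knowing: if mfr contained missing values, subset would drop those rows, whereas logical indexing would return rows filled with NA for them.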

2.3 Factors

Load the cereal data again with the read.csv command. This time, use the optional argument stringsAsFactors = TRUE.

cereal <- read.csv("Cereal.csv", stringsAsFactors = TRUE)

The mfr and type columns are now factors. Check that this is true.

class(cereal$mfr)

## [1] "factor"

class(cereal$type)

## [1] "factor"

How many levels are there in mfr and type? (use the functions levels or nlevels)

levels(cereal$mfr)

## [1] "A" "G" "K" "N" "P" "Q" "R"

nlevels(cereal$mfr)

## [1] 7
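
To count the levels of several factor columns in one step, sapply can apply nlevels to each of them. This is a sketch on a hypothetical data frame called toy, built from the first six rows shown earlier, so its counts will differ from those of the full cereal data.

```r
# Hypothetical data frame from the first six rows of cereal, stored as factors
toy <- data.frame(
  mfr  = c("N", "Q", "K", "K", "R", "G"),
  type = c("C", "C", "C", "C", "C", "C"),
  stringsAsFactors = TRUE
)

# Apply nlevels to each selected column; the result is a named vector
level_counts <- sapply(toy[c("mfr", "type")], nlevels)
level_counts

# table() additionally shows how many rows fall into each level
table(toy$mfr)
```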

2.4 Vectors

Extract the calories into a new vector called cereal.calories.

cereal.calories <- cereal$calories
cereal.calories

##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110

How many elements are there in cereal.calories? (length)

length(cereal.calories)

## [1] 77

Extract the 5th to the 10th element from cereal.calories.

cereal.calories[5:10]

## [1] 110 110 110 130  90  90

Add one more element to cereal.calories using c().

length(cereal.calories)

## [1] 77

cereal.calories <- c(cereal.calories, 3)
length(cereal.calories)

## [1] 78
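
c() always adds the new element at the end. If the position matters, base R's append function takes an after argument. A short sketch with a hypothetical vector called calories:

```r
# Hypothetical short vector using the first few calorie values
calories <- c(70, 120, 70)

# c() places the new value at the end
calories <- c(calories, 50)
length(calories)                    # 4

# append() can insert a value after a chosen position
append(calories, 110, after = 1)    # 70 110 120 70 50
```

Neither function changes the vector in place; the result has to be assigned back, as in the second step of the sketch.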

2.5 Matrix

Can you coerce the cereal data frame to a matrix? (as.matrix(cereal)) Check that the elements have been coerced to the character type.

cereal_matrix <- as.matrix(cereal)
class(cereal_matrix)

## [1] "matrix" "array"
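
Note that class only reports that the object is a matrix. To check the type of the elements, typeof or is.character can be used. The sketch below shows this on a hypothetical data frame called toy that mixes a text column with a numeric one.

```r
# Hypothetical data frame with one character column and one numeric column
toy <- data.frame(
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70)
)

toy_matrix <- as.matrix(toy)

class(toy_matrix)         # describes the container (a matrix)
typeof(toy_matrix)        # "character": every element is now text
is.character(toy_matrix)  # TRUE
```

A matrix can hold only one type, so a single non-numeric column is enough to turn the numbers into text as well.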

Now do this again, but this time leave out the mfr, name and type columns. Check that the elements are now numeric.

cereal_matrix <- cereal[, !(names(cereal) %in% c("mfr", "name", "type"))]
cereal_matrix <- as.matrix(cereal_matrix)
str(cereal_matrix)

##  num [1:77, 1:13] 70 120 70 50 110 110 110 130 90 90 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:13] "calories" "protein" "fat" "sodium" ...
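
Instead of listing the columns to drop by name, the numeric columns can be picked out programmatically. This is one possible sketch, again on a hypothetical data frame called toy:

```r
# Hypothetical data frame with two text columns and two numeric columns
toy <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70),
  sodium   = c(130, 15, 260)
)

# TRUE for each column that holds numbers, FALSE otherwise
numeric_cols <- sapply(toy, is.numeric)

# Keep only the numeric columns, then convert to a matrix
toy_numeric <- as.matrix(toy[numeric_cols])

is.numeric(toy_numeric)   # TRUE
colnames(toy_numeric)
```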

3 Numerical summary

3.1 Summary

Use the summary function to extract the median, first quartile and third quartile of the sodium column.

cereal_number <- summary(cereal$sodium)
sodium_stats <- cereal_number[c("Median", "1st Qu.", "3rd Qu.")]
sodium_stats

##  Median 1st Qu. 3rd Qu. 
##     180     130     210
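
The quantile function gives the same quartiles directly and lets you request any other probabilities. The sketch below uses a hypothetical vector called sodium that holds only the first ten sodium values shown earlier, so its results differ from those of the full column.

```r
# Sodium values from the first ten rows of cereal (hypothetical short vector)
sodium <- c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)

# First quartile, median and third quartile in one call
sodium_quartiles <- quantile(sodium, probs = c(0.25, 0.5, 0.75))
sodium_quartiles          # 132.5 190.0 207.5

# The same three numbers taken from summary() for comparison
summary(sodium)[c("1st Qu.", "Median", "3rd Qu.")]
```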

3.2 Basic statistics

Find the maximum, minimum, standard deviation and mean of the sodium column (max(), min(), sd(), mean()).

max(cereal$sodium)

## [1] 320

min(cereal$sodium)

## [1] 0

sd(cereal$sodium)

## [1] 83.8323

mean(cereal$sodium)

## [1] 159.6753

Find the mean sodium of each mfr.

mean_sodium_per_mfr <- aggregate(sodium ~ mfr, data = cereal, FUN = mean)
mean_sodium_per_mfr

##   mfr   sodium
## 1   A   0.0000
## 2   G 200.4545
## 3   K 174.7826
## 4   N  37.5000
## 5   P 146.1111
## 6   Q  92.5000
## 7   R 198.1250
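
tapply is another base R way to compute a grouped mean. It returns a named vector rather than a data frame. A sketch on a hypothetical data frame called toy, built from the first ten rows shown earlier:

```r
# Manufacturer and sodium values from the first ten rows of cereal
toy <- data.frame(
  mfr    = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)
)

# Mean sodium within each manufacturer, as a named vector
mean_by_mfr <- tapply(toy$sodium, toy$mfr, mean)
mean_by_mfr               # G 195, K 175, N 130, P 210, Q 15, R 200

# Individual groups can then be looked up by name
mean_by_mfr[["K"]]        # 175
```

aggregate is convenient when the result should stay a data frame; tapply is handy when you want to look up a group by name.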

4 Graphical summary

4.1 Boxplot

Make a boxplot of sodium against mfr using boxplot().

boxplot(sodium ~ mfr, data = cereal, xlab = "Manufacturer", ylab = "Sodium", main = "Sodium by manufacturer")
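
boxplot also returns the numbers it draws, which can help when reading the plot. The sketch below sets plot = FALSE to obtain them without opening a graphics device. The data frame toy is a hypothetical example built from the first ten rows shown earlier.

```r
# Manufacturer and sodium values from the first ten rows of cereal
toy <- data.frame(
  mfr    = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)
)

# plot = FALSE returns the summary statistics without drawing anything
bp <- boxplot(sodium ~ mfr, data = toy, plot = FALSE)

bp$names   # one entry per box, in the order the boxes would be drawn
bp$stats   # one column per box: lower whisker, lower hinge, median,
           # upper hinge, upper whisker
bp$n       # number of observations in each group
```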

4.2 Scatterplot

Plot calories against sodium using plot().

plot(calories ~ sodium, data = cereal, xlab = "Sodium", ylab = "Calories", main = "Calories against sodium")

5 Write Data to File

Write the data frame containing only the Kellogg’s observations to a file called kelloggs.csv using the write.csv command.

write.csv(Kelloggs, "kelloggs.csv", row.names = FALSE)
# Confirm the file was written (the name must match exactly on case-sensitive file systems)
file.exists("kelloggs.csv")

## [1] TRUE
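
A quick way to confirm that the file holds what you expect is to read it back in. The sketch below does this with a hypothetical two-row data frame called toy_k, written to a temporary folder rather than to the project directory.

```r
# Hypothetical two-row data frame standing in for Kelloggs
toy_k <- data.frame(
  name     = c("All-Bran", "Apple_Jacks"),
  mfr      = c("K", "K"),
  calories = c(70L, 110L)
)

# Write to a temporary location; row.names = FALSE avoids an extra index column
out_path <- file.path(tempdir(), "toy_kelloggs.csv")
write.csv(toy_k, out_path, row.names = FALSE)

# Read the file back and compare its shape with the original
read_back <- read.csv(out_path)
identical(names(read_back), names(toy_k))   # TRUE
nrow(read_back) == nrow(toy_k)              # TRUE
```

Without row.names = FALSE, write.csv adds the row names as a first column, which then comes back as an extra column when the file is read.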

Lab Week 1

STAT5003

Dr. Qiuzhuang Sun
