Tutorial 1

Question 1. Data Structure

Before you start the tutorial, create a new RStudio project in a new directory. Then, start your tutorial by writing your answers in a new R Markdown file.

Question 2. File I/O

2.1 Read

(a). Download the Cereal.csv file from the Canvas page, use the read.csv command to read the CSV file into R, and assign the result to an object called cereal.

cereal = read.csv("Cereal.csv", header = TRUE)
#print (cereal)

2.2 Data frames

(a). You should now have a data frame in R called cereal (created in 2.1). Use the head function to inspect the first few rows of the data frame and use class to check that cereal is in fact a data frame.

head(cereal, n = 3)

##                name mfr type calories protein fat sodium fiber carbo sugars
## 1         100%_Bran   N    C       70       4   1    130    10     5      6
## 2 100%_Natural_Bran   Q    C      120       3   5     15     2     8      8
## 3          All-Bran   K    C       70       4   1    260     9     7      5
##   potass vitamins shelf weight cups   rating
## 1    280       25     3      1 0.33 68.40297
## 2    135        0     3      1 1.00 33.98368
## 3    320       25     3      1 0.33 59.42551

class(cereal)

## [1] "data.frame"

(b). What are the column names of the cereal data frame? How many rows are there? (dim and nrow)

dim(cereal)

## [1] 77 16

nrow(cereal)

## [1] 77
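Part (b) also asks for the column names, which dim and nrow do not report. The following is a minimal sketch, assuming a small stand-in data frame called toy (a hypothetical name) built from a few values copied from the head output above. On the full data, names(cereal) would be the equivalent call.

```r
# Toy stand-in for the cereal data frame (hypothetical name "toy"):
# three rows and four columns copied from the head() output above.
toy <- data.frame(
  name     = c("100%_Bran", "100%_Natural_Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70),
  sodium   = c(130, 15, 260)
)

col_names <- names(toy)  # colnames(toy) gives the same result for a data frame
n_rows    <- nrow(toy)   # first value returned by dim()
n_cols    <- ncol(toy)   # second value returned by dim()

print(col_names)
print(c(rows = n_rows, cols = n_cols))
```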

(c). Extract the calories column using the $ operator and using the [[ operator.

cereal$calories

##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110

identical(cereal$calories, cereal[["calories"]])

## [1] TRUE
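As a side note to part (c), the sketch below contrasts the two operators with single-bracket indexing, which keeps the data frame structure instead of returning a vector. It assumes a small stand-in data frame called toy (a hypothetical name) with values taken from the first three rows shown earlier.

```r
# Toy stand-in for the cereal data frame (hypothetical name "toy").
toy <- data.frame(
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70)
)

by_dollar <- toy$calories        # vector
by_double <- toy[["calories"]]   # the same vector
by_single <- toy["calories"]     # a one-column data frame, not a vector

# [[ also accepts a column name stored in a variable, which $ does not.
col_name    <- "calories"
by_variable <- toy[[col_name]]

print(class(by_double))
print(class(by_single))
```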

(d). Extract rows 1 to 10 from the cereal data frame.

cereal[1:10,]

##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100%_Bran   N    C       70       4   1    130  10.0   5.0
## 2          100%_Natural_Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond_Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple_Cinnamon_Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple_Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic_4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran_Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran_Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381

(e). Make a new data frame called Kelloggs that contains only the rows belonging to the manufacturer Kellogg's (rows where mfr takes the value "K").

Kelloggs = cereal[cereal$mfr=="K",]
print (Kelloggs)

##                           name mfr type calories protein fat sodium fiber carbo
## 3                     All-Bran   K    C       70       4   1    260     9     7
## 4    All-Bran_with_Extra_Fiber   K    C       50       4   0    140    14     8
## 7                  Apple_Jacks   K    C      110       2   0    125     1    11
## 17                 Corn_Flakes   K    C      100       2   0    290     1    21
## 18                   Corn_Pops   K    C      110       1   0     90     1    13
## 20          Cracklin'_Oat_Bran   K    C      110       3   3    140     4    10
## 22                     Crispix   K    C      110       2   0    220     1    21
## 25                 Froot_Loops   K    C      110       2   1    125     1    11
## 26              Frosted_Flakes   K    C      110       1   0    200     1    14
## 27         Frosted_Mini-Wheats   K    C      100       3   0      0     3    14
## 29               Fruitful_Bran   K    C      120       3   0    240     5    14
## 39 Just_Right_Crunchy__Nuggets   K    C      110       2   1    170     1    17
## 40      Just_Right_Fruit_&_Nut   K    C      140       3   1    170     2    20
## 47        Mueslix_Crispy_Blend   K    C      160       3   2    150     3    17
## 49            Nut&Honey_Crunch   K    C      120       2   1    190     0    15
## 50   Nutri-Grain_Almond-Raisin   K    C      140       3   2    220     3    21
## 51           Nutri-grain_Wheat   K    C       90       3   0    170     3    18
## 54                  Product_19   K    C      100       3   0    320     1    20
## 59                 Raisin_Bran   K    C      120       3   1    210     5    14
## 61              Raisin_Squares   K    C       90       2   0      0     2    15
## 63               Rice_Krispies   K    C      110       2   0    290     0    22
## 67                      Smacks   K    C      110       2   1     70     1     9
## 68                   Special_K   K    C      110       6   0    230     1    16
##    sugars potass vitamins shelf weight cups   rating
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 7      14     30       25     2   1.00 1.00 33.17409
## 17      2     35       25     1   1.00 1.00 45.86332
## 18     12     20       25     2   1.00 1.00 35.78279
## 20      7    160       25     3   1.00 0.50 40.44877
## 22      3     30       25     3   1.00 1.00 46.89564
## 25     13     30       25     2   1.00 1.00 32.20758
## 26     11     25       25     1   1.00 0.75 31.43597
## 27      7    100       25     2   1.00 0.80 58.34514
## 29     12    190       25     3   1.33 0.67 41.01549
## 39      6     60      100     3   1.00 1.00 36.52368
## 40      9     95      100     3   1.30 0.75 36.47151
## 47     13    160       25     3   1.50 0.67 30.31335
## 49      9     40       25     2   1.00 0.67 29.92429
## 50      7    130       25     3   1.33 0.67 40.69232
## 51      2     90       25     3   1.00 1.00 59.64284
## 54      3     45      100     3   1.00 1.00 41.50354
## 59     12    240       25     2   1.33 0.75 39.25920
## 61      6    110       25     3   1.00 0.50 55.33314
## 63      3     35       25     1   1.00 1.00 40.56016
## 67     15     40       25     2   1.00 0.75 31.23005
## 68      3     55       25     1   1.00 1.00 53.13132
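The same kind of row filter can also be written with subset(), which some readers find easier to read than logical indexing. This is only a sketch: toy is a hypothetical stand-in data frame holding the name, mfr and sodium values of the first ten rows printed in part (d).

```r
# Toy stand-in (hypothetical name "toy"): first ten rows from part (d),
# restricted to three columns.
toy <- data.frame(
  name = c("100%_Bran", "100%_Natural_Bran", "All-Bran",
           "All-Bran_with_Extra_Fiber", "Almond_Delight",
           "Apple_Cinnamon_Cheerios", "Apple_Jacks", "Basic_4",
           "Bran_Chex", "Bran_Flakes"),
  mfr    = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)
)

# Logical indexing, as in the answer above.
k_index <- toy[toy$mfr == "K", ]

# subset() evaluates the condition inside the data frame,
# so the column can be referred to without the toy$ prefix.
k_subset <- subset(toy, mfr == "K")

print(k_subset)
```

One difference worth knowing: if mfr contained missing values, logical indexing would return extra rows filled with NA, whereas subset() drops them.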

2.3 Factors

(a). Load the Cereal data again with the read.csv command. This time, use the optional argument stringsAsFactors = TRUE.

cereal2 = read.csv("Cereal.csv", header = TRUE, stringsAsFactors = TRUE)

(b). The mfr and type columns are now factors. Check that this is true.

class(cereal2$mfr) == class(cereal2$type)

## [1] TRUE

class(cereal2$type) == "factor"

## [1] TRUE

(c). How many levels are there in mfr and type? (use the functions levels or nlevels)

levels(cereal2$mfr)

## [1] "A" "G" "K" "N" "P" "Q" "R"

levels(cereal2$type)

## [1] "C" "H"
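The levels output above lists the levels (7 for mfr and 2 for type); nlevels returns the count directly. A minimal sketch, assuming a hypothetical factor mfr_toy built from the mfr values of the first ten rows only, so it has fewer levels than the full column:

```r
# Hypothetical factor built from the mfr values of the first ten rows.
mfr_toy <- factor(c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"))

level_names <- levels(mfr_toy)   # distinct values, sorted alphabetically
level_count <- nlevels(mfr_toy)  # same as length(levels(mfr_toy))
counts      <- table(mfr_toy)    # number of observations per level

print(level_count)
print(counts)
```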

2.4 Vectors

(a). Extract the calories into a new vector called cereal.calories.

cereal.calories = cereal2$calories
print (cereal.calories)

##  [1]  70 120  70  50 110 110 110 130  90  90 120 110 120 110 110 110 100 110 110
## [20] 110 100 110 100 100 110 110 100 120 120 110 100 110 100 110 120 120 110 110
## [39] 110 140 110 100 110 100 150 150 160 100 120 140  90 130 120 100  50  50 100
## [58] 100 120 100  90 110 110  80  90  90 110 110  90 110 140 100 110 110 100 100
## [77] 110

(b). How many elements are there in cereal.calories? (length)

length (cereal.calories)

## [1] 77

(c). Extract the 5th to the 10th element from cereal.calories.

cereal.calories[5:10]

## [1] 110 110 110 130  90  90

(d). Add one more element to cereal.calories using c().

cereal.calories <- c(cereal.calories, 99999)
print (cereal.calories)

##  [1]    70   120    70    50   110   110   110   130    90    90   120   110
## [13]   120   110   110   110   100   110   110   110   100   110   100   100
## [25]   110   110   100   120   120   110   100   110   100   110   120   120
## [37]   110   110   110   140   110   100   110   100   150   150   160   100
## [49]   120   140    90   130   120   100    50    50   100   100   120   100
## [61]    90   110   110    80    90    90   110   110    90   110   140   100
## [73]   110   110   100   100   110 99999

2.5 Matrix

(a). Can you coerce the cereal data frame to a matrix (as.matrix(cereal))? Check that the elements have been coerced to the character type.

matrix_cereal = as.matrix(cereal)
is.matrix(matrix_cereal)

## [1] TRUE

class (matrix_cereal[,6])

## [1] "character"

(b). Now do this again, but this time leave out the mfr, name and type columns. Check that the elements are now numeric.

# Drop the name, mfr and type columns (columns 1 to 3), then coerce to a matrix
tichu_cereal = as.matrix(cereal[,4:16])
is.numeric(tichu_cereal)

## [1] TRUE
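The reason for the difference is that a matrix can hold only one type, so a single character column forces every element to character. The sketch below illustrates this on a hypothetical stand-in data frame called toy, using values from the first three rows shown earlier.

```r
# Toy stand-in (hypothetical name "toy") with one character column
# and two numeric columns.
toy <- data.frame(
  mfr      = c("N", "Q", "K"),
  calories = c(70, 120, 70),
  sodium   = c(130, 15, 260)
)

# With the character column included, every element becomes a string.
mixed <- as.matrix(toy)

# With only the numeric columns, the matrix stays numeric.
numeric_only <- as.matrix(toy[, c("calories", "sodium")])

print(typeof(mixed))
print(typeof(numeric_only))
```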

Question 3. Numerical Summary

3.1 Summary

(a). Use the summary function to extract the median, 1st quartile and 3rd quartile data from the sodium column.

summary(cereal$sodium)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   130.0   180.0   159.7   210.0   320.0
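To pull individual values out of the summary rather than reading them off the printout, the result can be indexed by name, or quantile() can be called directly. A minimal sketch, assuming a hypothetical vector sodium_toy that holds only the sodium values of the first ten rows, so the numbers differ from the full-data summary above:

```r
# Hypothetical vector: sodium values of the first ten rows only.
sodium_toy <- c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)

s <- summary(sodium_toy)

# The summary is a named numeric vector, so elements can be picked by name.
q1     <- s[["1st Qu."]]
med    <- s[["Median"]]
q3     <- s[["3rd Qu."]]

# quantile() computes the same three values in one call.
quartiles <- quantile(sodium_toy, probs = c(0.25, 0.5, 0.75))

print(c(q1 = q1, median = med, q3 = q3))
```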

3.2 Basic statistics

(a). Find the max, min, standard deviation and mean of the sodium column (max(), min(), sd(), mean()).

max(cereal$sodium)

## [1] 320

min(cereal$sodium)

## [1] 0

sd(cereal$sodium)

## [1] 83.8323

mean(cereal$sodium)

## [1] 159.6753

(b). Find the mean sodium of each mfr.

K = cereal[cereal$mfr=="K",]
mean(K$sodium)

## [1] 174.7826

A = cereal[cereal$mfr=="A",]
mean(A$sodium)

## [1] 0

G = cereal[cereal$mfr=="G",]
mean(G$sodium)

## [1] 200.4545

N = cereal[cereal$mfr=="N",]
mean(N$sodium)

## [1] 37.5

P = cereal[cereal$mfr=="P",]
mean(P$sodium)

## [1] 146.1111

Q = cereal[cereal$mfr=="Q",]
mean(Q$sodium)

## [1] 92.5

R = cereal[cereal$mfr=="R",]
mean(R$sodium)

## [1] 198.125

aggregate(cereal$sodium, by = list(cereal$mfr), FUN = mean)

##   Group.1        x
## 1       A   0.0000
## 2       G 200.4545
## 3       K 174.7826
## 4       N  37.5000
## 5       P 146.1111
## 6       Q  92.5000
## 7       R 198.1250

# aggregate(x = column to summarise, by = list of grouping columns, FUN = summary function applied to each group)
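Two other common ways to get a per-group mean in one call are the formula interface of aggregate and tapply. The sketch below assumes a hypothetical stand-in data frame called toy containing the mfr and sodium values of the first ten rows only, so the means differ from the full-data results above.

```r
# Toy stand-in (hypothetical name "toy"): mfr and sodium of the first ten rows.
toy <- data.frame(
  mfr    = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)
)

# Formula interface: "sodium grouped by mfr". Keeps the column names
# instead of producing Group.1 and x.
by_formula <- aggregate(sodium ~ mfr, data = toy, FUN = mean)

# tapply returns a named vector with one element per group.
by_tapply <- tapply(toy$sodium, toy$mfr, mean)

print(by_formula)
print(by_tapply)
```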

Question 4. Graphical Summary

4.1 Box plot

(a). Make a boxplot of the sodium against mfr using boxplot().

cereal = read.csv("Cereal.csv", header = TRUE)
boxplot(sodium~mfr, data=cereal, xlab = "Manufacturer", ylab = "Sodium", main = "Boxplot of Sodium by Manufacturer")

4.2 Scatterplot

(a). Plot calories against sodium using plot().

plot(calories~sodium, data = cereal, xlab = "Sodium", ylab = "Calories", main = "Scatter Plot of Calories vs. Sodium")

Question 5. Write Data to File

(a). Write the data frame containing only the Kellogg’s observations to a file called kelloggs.csv. Use the write.csv command.

write.csv(Kelloggs, file = "kelloggs.csv")
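By default write.csv also writes the row names as an unnamed first column, which comes back as an extra column named X when the file is read again. If that is not wanted, row.names = FALSE suppresses it. A minimal round-trip sketch, assuming a hypothetical two-row data frame toy and a temporary file instead of kelloggs.csv:

```r
# Toy stand-in (hypothetical name "toy"): two Kellogg's rows from the output above.
toy <- data.frame(
  name   = c("All-Bran", "Apple_Jacks"),
  mfr    = c("K", "K"),
  sodium = c(260, 125)
)

# Write to a temporary file so nothing is left in the project directory.
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# Read the file back and check that the columns survived the round trip.
back <- read.csv(path, header = TRUE)
unlink(path)

print(back)
```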