WQD7004 PROGRAMMING FOR DATA SCIENCE: Assignment 1

Task

Write a simple R Markdown to explain some of the R codes you have learned regarding data frame. You are free to write your own program and add new codes such as how would you show all the rows except the last one or how would you get the last 6 rows of the data frame. Eg. you can explain what a data frame is, and then write the code to show how R handles data frame, and so on. But one mandatory topic is you must explain the different ways R returns back a vector or a data frame when you access values from a data frame. Your program doesn’t have to be long. Play around with the different font sizes in markdown to make your markdown readable. Explore. Publish your markdown on RPubs and submit the link only. Adding new codes that have not been discussed will earn you high marks.

Note : Best to create your own data frame or use a data frame which is small in size and so that it would be easy to see the results after the execution of each or after a few codes.

1. What is a data frame?

Let’s load some data.

data("iris")

Next, test some basic R commands.

class(iris)

## [1] "data.frame"

typeof(iris)

## [1] "list"

A data frame is a two-dimensional data structure in R used to store tabular data.

According to this Stack Overflow article, the typeof() function returns how an object is stored in memory, while the class() function returns the abstract type of an object.

In other words, a data frame is stored in memory as a list. In fact, a data frame can be regarded as a special list whose elements (the columns) are vectors of equal length.
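The list nature of a data frame can be checked with ordinary list functions. The sketch below uses a small made-up data frame (`staff` and its values are hypothetical, chosen only for illustration):

```r
# A small illustrative data frame (hypothetical values)
staff <- data.frame(Name = c("Ali", "Sam", "Chan"),
                    Allowance = c(500, 450, 650))

is.list(staff)    # a data frame is a list underneath
length(staff)     # list length = number of columns
lengths(staff)    # every column holds the same number of elements
staff[[2]]        # list-style extraction returns the column vector
```

Because each column is one list element, length() counts columns, not rows.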

A data frame looks very similar to an Excel spreadsheet.

2. Creating a data frame

A data frame can be created in several ways:

Method 1: load from an R built-in dataset.

data("iris")

Method 2: read from an external source using the read.csv() or read.table() functions.

# Read from an external source (TRUE/FALSE are safer than T/F, which can be reassigned):
Data_Entry<-read.csv("Data Entry.csv",stringsAsFactors = FALSE,header = TRUE)

Method 3: explicitly create a data frame using the data.frame() function.

allowance<-data.frame(Name=c("Ali","Sam","Chan","Vijay","Intan")
                      ,Allowance=c(500,450,650,300,495)
                      )

Method 4: coerce from other object types using as.data.frame() function.

workhour<-as.data.frame(list(Name=c("Ali","Sam","Chan","Vijay","Intan")
                             ,workhour=c(100,90,120,70,95)))

3. Exploring data frame

Dimensions of a data frame

The dimensions of a data frame can be obtained in several ways:

dim() function

dim(iris)
is.vector(dim(iris))

## [1] 150   5
## [1] TRUE

ncol() and nrow()

ncol(iris)
nrow(iris)

## [1] 5
## [1] 150

The dim() function returns a vector with two elements. The first element is the number of rows of the data frame (the same value returned by nrow()) and the second element is the number of columns (the same value returned by ncol()).
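The relationship between dim(), nrow() and ncol() can be verified directly. This is a minimal check on the built-in iris data; the variable names are only illustrative:

```r
d <- dim(iris)                     # integer vector: c(rows, columns)

rows_match <- d[1] == nrow(iris)   # first element is the row count
cols_match <- d[2] == ncol(iris)   # second element is the column count

c(rows_match, cols_match)
```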

Structure of data frame

cat("\nstr function \n")
str(iris)
cat("\nSummary function \n")
summary(iris)

## 
## str function 
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## 
## Summary function 
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

The str() and summary() functions are handy for getting an overview of a data frame. str() returns the column names, the type/class of each column and its first few values. summary() reports the minimum, quartiles, median, mean and maximum of each numeric column, and the count of each level for factor columns.

Unlike a matrix, a data frame can hold a different class in each column.
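To see the difference from a matrix, we can compare the column classes of iris with what happens when it is converted by as.matrix(). This is a short sketch; the variable names are illustrative:

```r
# Class of each column: four numeric columns and one factor
col_classes <- sapply(iris, class)
col_classes

# A matrix can hold only one type, so the conversion
# turns every value into a character string
m <- as.matrix(iris)
typeof(m)
```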

View data frame

There are several ways to view a data frame, depending on your purpose.

View(iris)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

By default, the head() function shows the first 6 rows of a data frame.

head(iris,10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

head() returns a different number of rows if that number is given as the second argument.

tail(iris)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

By default, the tail() function shows the last 6 rows of a data frame.

tail(iris,10)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

tail() returns a different number of rows if that number is given as the second argument.
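head() and tail() also accept a negative n, which is one way to show all the rows except the last one. A minimal sketch on iris (the variable names are illustrative):

```r
all_but_last  <- head(iris, -1)   # every row except the last one
all_but_first <- tail(iris, -1)   # every row except the first one
last_six      <- tail(iris)       # last 6 rows (the default n)

nrow(all_but_last)
row.names(last_six)
```

A negative n means "everything except that many rows", counted from the opposite end.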

Row names and column names of a data frame

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

row.names(iris)

##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
##  [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
##  [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
##  [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
##  [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
##  [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
##  [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
##  [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"

names() returns the column names of a data frame, while row.names() returns the row names of a data frame.

iris2<-iris

cat(paste0("\nThe original column names of the data frame:",paste0(names(iris2),collapse = ", "),"\n"))

names(iris2)[1]<-"Sepal.Length2"
names(iris2)[names(iris2)=="Species"]<-"Species2"

cat(paste0("\nThe new column names of the data frame:",paste0(names(iris2),collapse = ", "),"\n"))

## 
## The original column names of the data frame:Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
## 
## The new column names of the data frame:Sepal.Length2, Sepal.Width, Petal.Length, Petal.Width, Species2

A column can be renamed by assigning a new character string to the corresponding position of names(), located either by index or by matching the old name.

4. Accessing Elements from Data Frame

There are several ways to access the elements of a data frame.

Method 1: Integer Vectors as index

head(iris)
 cat("\n")

iris[1,1]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 
## [1] 5.1

iris[1,1] returns the element in the first row and first column.

iris[2,1]

## [1] 4.9

iris[1,3]

## [1] 1.4

As shown in the examples above, iris[2,1] returns the element in the second row and first column, while iris[1,3] returns the element in the first row and third column. In general, df[x,y] returns the element in row x and column y of the data frame df.

Now that we know how to access a specific element in a data frame, what if we want to access more than one element?

1:4
c(2,4)
iris[1:4,c(2,4)]

## [1] 1 2 3 4
## [1] 2 4
##   Sepal.Width Petal.Width
## 1         3.5         0.2
## 2         3.0         0.2
## 3         3.2         0.2
## 4         3.1         0.2

We can use integer vectors as indices. Passing 1:4, which is equivalent to c(1,2,3,4), as the row index returns the first to fourth rows. Passing c(2,4) as the column index returns the second and fourth columns of the data frame.

cat("head(iris)\n")
head(iris)

cat("\nhead(iris[,c(-2,-5)])\n")
head(iris[,c(-2,-5)])

## head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 
## head(iris[,c(-2,-5)])
##   Sepal.Length Petal.Length Petal.Width
## 1          5.1          1.4         0.2
## 2          4.9          1.4         0.2
## 3          4.7          1.3         0.2
## 4          4.6          1.5         0.2
## 5          5.0          1.4         0.2
## 6          5.4          1.7         0.4

From the example above, negative indices remove the indicated columns from the result, e.g. c(-2,-5) removes the second and fifth columns. The original data frame is not changed.
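Negative indices work on rows as well, which gives another way to show every row except the last one. Columns cannot be negated by name, so a logical vector is used instead. This is a sketch on iris with illustrative variable names:

```r
# Drop the last row with a negative row index
no_last_row <- iris[-nrow(iris), ]

# Negative indices only work with numbers, so to drop columns by name
# build a logical vector that is FALSE for the unwanted names
keep <- !(names(iris) %in% c("Sepal.Width", "Species"))
no_named_cols <- iris[, keep]

dim(no_last_row)
names(no_named_cols)
```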

Method 2: Logical vectors as index

head(iris[,c(T,F,T,F,T)])

##   Sepal.Length Petal.Length Species
## 1          5.1          1.4  setosa
## 2          4.9          1.4  setosa
## 3          4.7          1.3  setosa
## 4          4.6          1.5  setosa
## 5          5.0          1.4  setosa
## 6          5.4          1.7  setosa

R returns the columns (or rows) whose corresponding element in the logical vector is TRUE.

iris[c(T,F,F,F,F,T),]

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 150          5.9         3.0          5.1         1.8  virginica

If the T/F vector is shorter than the number of rows/columns of the data frame, R recycles (replicates) the T/F vector until its length matches. Here the six-element vector is repeated across all 150 rows, so rows 1, 6, 7, 12, 13, 18 and so on are returned.
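Recycling is easier to see with a shorter pattern. In this minimal sketch, a two-element logical vector selects every other row of iris:

```r
# TRUE, FALSE is recycled to TRUE, FALSE, TRUE, FALSE, ...
# so only the odd-numbered rows are kept
odd_rows <- iris[c(TRUE, FALSE), ]

nrow(odd_rows)
head(row.names(odd_rows))
```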

iris[iris[,1]>5,1]
is.data.frame(iris[iris[,1]>5,1])
is.data.frame(iris[iris[,1]>5,1,drop=F]) # drop=F keeps the result as a data frame

##   [1] 5.1 5.4 5.4 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 5.1 5.2 5.2 5.4 5.2 5.5 5.5
##  [19] 5.1 5.1 5.1 5.3 7.0 6.4 6.9 5.5 6.5 5.7 6.3 6.6 5.2 5.9 6.0 6.1 5.6 6.7
##  [37] 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0
##  [55] 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1
##  [73] 6.3 6.5 7.6 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6
##  [91] 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9
## [109] 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
## [1] FALSE
## [1] TRUE

Since logical vectors can be used as an index, any condition that evaluates to a T/F vector can be used to select elements of a data frame.

Notice from the above examples that R automatically returns a vector when only one column is selected. To keep the result as a data frame, we can add the “drop=F” argument.
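The cases in which R returns a vector and those in which it returns a data frame can be put side by side. The sketch below uses iris; the threshold 7.5 and the variable names are arbitrary choices for illustration:

```r
big <- iris$Sepal.Length > 7.5    # logical vector, one value per row

as_vector <- iris[big, "Sepal.Length"]                 # one column: dropped to a vector
as_frame  <- iris[big, "Sepal.Length", drop = FALSE]   # one column, kept as a data frame
two_cols  <- iris[big, c("Sepal.Length", "Species")]   # two columns: always a data frame
one_row   <- iris[1, ]                                 # one row: still a data frame

class(as_vector)
class(as_frame)
class(two_cols)
class(one_row)
```

Only the single-column case is simplified to a vector. A single row stays a data frame because its columns can have different classes.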

Method 3: Character vectors as index

names(iris)

cat('\niris[["Species"]]\n')
head(iris[["Species"]])
is.data.frame(iris[["Species"]])
cat('\niris["Species"]\n')
head(iris["Species"])
is.data.frame(iris["Species"])
cat('\niris$Species\n')
head(iris$Species)
is.data.frame(iris$Species)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## 
## iris[["Species"]]
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
## [1] FALSE
## 
## iris["Species"]
##   Species
## 1  setosa
## 2  setosa
## 3  setosa
## 4  setosa
## 5  setosa
## 6  setosa
## [1] TRUE
## 
## iris$Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
## [1] FALSE

We can use the column names of the data frame to access its elements.

“[[]]”, “[]” and “$” return the same values, except that “[]” returns the result as a data frame while the other two return it as a vector (here, a factor).
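The same comparison on a small numeric column makes the difference easier to see. The data frame `pay` below is hypothetical and exists only for this illustration:

```r
# Hypothetical example data
pay <- data.frame(Name = c("Ali", "Sam", "Chan"),
                  Allowance = c(500, 450, 650))

pay["Allowance"]      # single brackets: a one-column data frame
pay[["Allowance"]]    # double brackets: the numeric vector itself
pay$Allowance         # dollar sign: also the numeric vector
pay[, "Allowance"]    # row-and-column form with one column: a vector

# The vector forms can be passed straight to functions such as mean()
mean(pay[["Allowance"]])
```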

5. Modifying a data frame

A data frame can be modified in several ways:

Override values: A value can be assigned to a specific position of a data frame

cat("\nThe original value of iris2[1,1] is:\n")
iris2[1,1]

iris2[1,1] <- -999   # spaces make it clear that -999 is being assigned

cat("\nThe new value of iris2[1,1] is:\n")
iris2[1,1]

## 
## The original value of iris2[1,1] is:
## [1] 5.1
## 
## The new value of iris2[1,1] is:
## [1] -999

Binding data frames

head(cbind(iris,Data_Entry))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Data_Entry   ID
## 1          5.1         3.5          1.4         0.2  setosa        Ali A003
## 2          4.9         3.0          1.4         0.2  setosa        Ali A003
## 3          4.7         3.2          1.3         0.2  setosa        Ali A003
## 4          4.6         3.1          1.5         0.2  setosa        Ali A003
## 5          5.0         3.6          1.4         0.2  setosa        Ali A003
## 6          5.4         3.9          1.7         0.4  setosa        Ali A003
##   Student
## 1     Yes
## 2     Yes
## 3     Yes
## 4     Yes
## 5     Yes
## 6     Yes

cbind(), which stands for column-bind, combines two data frames by column.

iris2<-rbind(iris
             ,c(Sepal.Length=-999
                ,Sepal.Width=999
                ,Petal.Length=-999
                ,Petal.Width=0.999
                ,Species="setosa")
             )

tail(iris2)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 146          6.7           3          5.2         2.3 virginica
## 147          6.3         2.5            5         1.9 virginica
## 148          6.5           3          5.2           2 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9           3          5.1         1.8 virginica
## 151         -999         999         -999       0.999    setosa

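rbind(), which stands for row-bind, combines two data frames by row. Here the new row is passed as a vector built with c(). Because that vector contains the string "setosa", all of its elements are character strings, and the numeric columns of iris2 are converted to character as a result (note that 3.0 is now printed as 3).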
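To keep the column types intact, the new row can be supplied as a one-row data frame instead of a c() vector. This is a sketch of that alternative, not the code used above; `new_row` and `iris_plus` are illustrative names:

```r
# One-row data frame: each value keeps its own type
new_row <- data.frame(Sepal.Length = -999,
                      Sepal.Width  = 999,
                      Petal.Length = -999,
                      Petal.Width  = 0.999,
                      Species      = factor("setosa", levels = levels(iris$Species)))

iris_plus <- rbind(iris, new_row)

sapply(iris_plus, class)   # numeric columns stay numeric, Species stays a factor
nrow(iris_plus)
```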

Create a new column

s<-rep(c("train","test"),150/2)
iris2<-iris
iris2$sample<-s
head(iris2)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sample
## 1          5.1         3.5          1.4         0.2  setosa  train
## 2          4.9         3.0          1.4         0.2  setosa   test
## 3          4.7         3.2          1.3         0.2  setosa  train
## 4          4.6         3.1          1.5         0.2  setosa   test
## 5          5.0         3.6          1.4         0.2  setosa  train
## 6          5.4         3.9          1.7         0.4  setosa   test

In R, we can create a new column by assigning a vector to the column name after the $ sign.
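A new column can also be computed from existing columns, and “[[]]” can be used in place of “$” when the column name is stored in a variable. The column names `Petal.Ratio` and `Sepal.Ratio` below are made up for this sketch:

```r
iris_ratio <- iris

# Vectorised arithmetic: one result per row
iris_ratio$Petal.Ratio <- iris_ratio$Petal.Length / iris_ratio$Petal.Width

# Equivalent form when the name is held in a variable
new_name <- "Sepal.Ratio"
iris_ratio[[new_name]] <- iris_ratio$Sepal.Length / iris_ratio$Sepal.Width

names(iris_ratio)
```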

Remove a column

iris2$sample<-NULL
head(iris2)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Assigning “NULL” to an existing column removes that column from the data frame.

6. Merging data frames

It’s easy to combine two data frames with the same number of rows, as shown by the cbind() function.

head(iris)
head(Data_Entry)

head(iris2<-cbind(iris,Data_Entry))

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##   Data_Entry   ID Student
## 1        Ali A003     Yes
## 2        Ali A003     Yes
## 3        Ali A003     Yes
## 4        Ali A003     Yes
## 5        Ali A003     Yes
## 6        Ali A003     Yes
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Data_Entry   ID
## 1          5.1         3.5          1.4         0.2  setosa        Ali A003
## 2          4.9         3.0          1.4         0.2  setosa        Ali A003
## 3          4.7         3.2          1.3         0.2  setosa        Ali A003
## 4          4.6         3.1          1.5         0.2  setosa        Ali A003
## 5          5.0         3.6          1.4         0.2  setosa        Ali A003
## 6          5.4         3.9          1.7         0.4  setosa        Ali A003
##   Student
## 1     Yes
## 2     Yes
## 3     Yes
## 4     Yes
## 5     Yes
## 6     Yes

But what if we want to merge two data frames based on a specific merging key, rather than simply placing their columns side by side?

head(iris2)
allowance

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Data_Entry   ID
## 1          5.1         3.5          1.4         0.2  setosa        Ali A003
## 2          4.9         3.0          1.4         0.2  setosa        Ali A003
## 3          4.7         3.2          1.3         0.2  setosa        Ali A003
## 4          4.6         3.1          1.5         0.2  setosa        Ali A003
## 5          5.0         3.6          1.4         0.2  setosa        Ali A003
## 6          5.4         3.9          1.7         0.4  setosa        Ali A003
##   Student
## 1     Yes
## 2     Yes
## 3     Yes
## 4     Yes
## 5     Yes
## 6     Yes
##    Name Allowance
## 1   Ali       500
## 2   Sam       450
## 3  Chan       650
## 4 Vijay       300
## 5 Intan       495

How can we add the “Allowance” column from the allowance data frame to iris2, using the name of the data entry staff (Data_Entry in iris2, Name in allowance) as the merging key?

Method 1: merge()

iris3<-merge(iris2,allowance,by.x = "Data_Entry",by.y="Name")
head(iris3)

##   Data_Entry Sepal.Length Sepal.Width Petal.Length Petal.Width Species   ID
## 1        Ali          5.1         3.5          1.4         0.2  setosa A003
## 2        Ali          4.9         3.0          1.4         0.2  setosa A003
## 3        Ali          4.7         3.2          1.3         0.2  setosa A003
## 4        Ali          4.6         3.1          1.5         0.2  setosa A003
## 5        Ali          5.0         3.6          1.4         0.2  setosa A003
## 6        Ali          5.4         3.9          1.7         0.4  setosa A003
##   Student Allowance
## 1     Yes       500
## 2     Yes       500
## 3     Yes       500
## 4     Yes       500
## 5     Yes       500
## 6     Yes       500
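By default merge() keeps only the rows whose key appears in both data frames. The sketch below, on two small hypothetical tables (`entries` and `allowance_small` are invented for illustration), shows how the all.x argument keeps unmatched rows from the first table:

```r
# Hypothetical tables; "Mei" has no allowance on record
entries <- data.frame(Data_Entry = c("Ali", "Sam", "Mei"),
                      Records    = c(29, 14, 5))
allowance_small <- data.frame(Name      = c("Ali", "Sam"),
                              Allowance = c(500, 450))

# Default: only keys found in both tables
inner <- merge(entries, allowance_small, by.x = "Data_Entry", by.y = "Name")

# all.x = TRUE: keep every row of entries, filling gaps with NA
left <- merge(entries, allowance_small, by.x = "Data_Entry", by.y = "Name",
              all.x = TRUE)

inner
left
```

With all.x = TRUE the result behaves like a left join: Mei is kept and her Allowance is NA.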

Method 2: left_join() from library(dplyr)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

iris4<-left_join(iris3,workhour,by=c("Data_Entry"="Name"))
head(iris4)

##   Data_Entry Sepal.Length Sepal.Width Petal.Length Petal.Width Species   ID
## 1        Ali          5.1         3.5          1.4         0.2  setosa A003
## 2        Ali          4.9         3.0          1.4         0.2  setosa A003
## 3        Ali          4.7         3.2          1.3         0.2  setosa A003
## 4        Ali          4.6         3.1          1.5         0.2  setosa A003
## 5        Ali          5.0         3.6          1.4         0.2  setosa A003
## 6        Ali          5.4         3.9          1.7         0.4  setosa A003
##   Student Allowance workhour
## 1     Yes       500      100
## 2     Yes       500      100
## 3     Yes       500      100
## 4     Yes       500      100
## 5     Yes       500      100
## 6     Yes       500      100

7. Preliminary Exploratory Analysis

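The table() function counts the records for each combination of two variables (columns). The table below shows the number of records entered by each staff member for each flower species. Note that merge() sorts its result by the merging key by default, so the rows of iris4 are not guaranteed to be in the same order as those of iris; taking both columns from iris4 (iris4$Data_Entry and iris4$Species) avoids any mismatch.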

table(iris4$Data_Entry,iris$Species)

##        
##         setosa versicolor virginica
##   Ali       29          0         0
##   Chan      21         32         0
##   Intan      0         18         3
##   Sam        0          0        14
##   Vijay      0          0        33
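How table() cross-tabulates two columns is easier to follow on a few rows. The data frame `toy` below is invented for illustration and is unrelated to the real data entry file:

```r
# Hypothetical toy data: who entered which species
toy <- data.frame(Staff   = c("Ali", "Ali", "Chan", "Chan", "Chan", "Sam"),
                  Species = c("setosa", "setosa", "setosa", "versicolor",
                              "versicolor", "virginica"))

counts <- table(toy$Staff, toy$Species)   # rows = Staff, columns = Species
counts

rowSums(counts)                  # records per staff member
prop.table(counts, margin = 1)   # share of each species within a staff member
```

Both columns come from the same data frame, so each count refers to matching rows.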

The hist() and plot() functions enable us to visualise data. The following are some examples of visualising the size of the flower petals:

Example 1: Petal Length

ax=c(min(iris4$Petal.Length),max(iris4$Petal.Length))
brk<-pretty(ax,n=30)

h1<-hist(iris4$Petal.Length[iris4$Species=="setosa"],plot = F,breaks=brk)
h2<-hist(iris4$Petal.Length[iris4$Species=="versicolor"],plot = F,breaks=brk)
h3<-hist(iris4$Petal.Length[iris4$Species=="virginica"],plot = F,breaks=brk)

plot(h1,col="lightblue",xlim = ax,ylim = c(0,25),main=NULL,xlab = "Petal Length by Flower Species")
plot(h2,add=T,col="#FFC0CB7F",xlim = ax)
plot(h3,add=T,col="#00FF0033",xlim = ax)

legend("topright",legend = c("setosa","versicolor","virginica")
       ,col=c("lightblue","#FFC0CB7F","#00FF0033"), lwd=10)

Example 2: Petal Width

ax=c(min(iris4$Petal.Width),max(iris4$Petal.Width))
brk<-pretty(ax,n=30)

h1<-hist(iris4$Petal.Width[iris4$Species=="setosa"],plot = F,breaks=brk)
h2<-hist(iris4$Petal.Width[iris4$Species=="versicolor"],plot = F,breaks=brk)
h3<-hist(iris4$Petal.Width[iris4$Species=="virginica"],plot = F,breaks=brk)

plot(h1,col="lightblue",xlim = ax,ylim = c(0,30),main=NULL,xlab = "Petal Width by Flower Species")
plot(h2,add=T,col="#FFC0CB7F",xlim = ax)
plot(h3,add=T,col="#00FF0033",xlim = ax)

legend("topright",legend = c("setosa","versicolor","virginica")
       ,col=c("lightblue","#FFC0CB7F","#00FF0033"), lwd=10)

Example 3: Petal Size = Petal Length X Petal Width

iris4$Petal.Size<-iris4$Petal.Length*iris4$Petal.Width

ax=c(min(iris4$Petal.Size),max(iris4$Petal.Size))
brk<-pretty(ax,n=20)

h1<-hist(iris4$Petal.Size[iris4$Species=="setosa"],plot = F,breaks=brk)
h2<-hist(iris4$Petal.Size[iris4$Species=="versicolor"],plot = F,breaks=brk)
h3<-hist(iris4$Petal.Size[iris4$Species=="virginica"],plot = F,breaks=brk)



plot(h1,col="lightblue",xlim = ax,ylim = c(0,55),main=NULL,xlab = "Petal Size by Flower Species")
plot(h2,add=T,col="#FFC0CB7F",xlim = ax)
plot(h3,add=T,col="#00FF0033",xlim = ax)

legend("topright",legend = c("setosa","versicolor","virginica")
       ,col=c("lightblue","#FFC0CB7F","#00FF0033"), lwd=10)

The histograms above show that petal size varies among the setosa, versicolor and virginica species.
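The visual impression can be backed up with a per-species summary. This is a minimal sketch that recomputes the petal size on a copy of iris (`iris_size` is an illustrative name) and averages it within each species using tapply():

```r
iris_size <- iris
iris_size$Petal.Size <- iris_size$Petal.Length * iris_size$Petal.Width

# tapply() applies mean() separately within each level of Species
mean_size <- tapply(iris_size$Petal.Size, iris_size$Species, mean)
mean_size
```

The means increase from setosa to versicolor to virginica, in line with the ordering seen in the histograms.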