Question 1a)

Chick Weight

The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.

ChickWeight Dataset

ChickWeight

Boxplot of Weight of Chicks on Diet 1

Five Number Summary for Weight of Chicks on Diet 1

For chicks on diet 1, the minimum weight is 35 grams, the first quartile is 57.5 grams, the median is 88 grams, the third quartile is 137 grams, and the maximum is 305 grams.
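A minimal sketch of how a five-number summary and boxplot like these can be produced with base R. It assumes fivenum() was the function used, which the report does not confirm; quantile() or summary() may give slightly different quartiles because fivenum() reports Tukey's hinges. The variable names diet1 and five are illustrative.

```r
# ChickWeight ships with R's built-in datasets package, so no extra packages are needed
diet1 <- ChickWeight$weight[ChickWeight$Diet == 1]

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(diet1)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
print(five)

# Horizontal boxplot of the same weights
boxplot(diet1, horizontal = TRUE,
        main = "Weight of Chicks on Diet 1", xlab = "Weight (g)")
```

Changing the 1 in the first line to 2, 3 or 4 gives the corresponding summary for the other diets.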

Number of Outliers for Weight of Chicks on Diet 1
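One way to count the outliers is sketched below, assuming the usual boxplot rule (points more than 1.5 times the interquartile range beyond the hinges). boxplot.stats() applies this rule by default; the names stats and n_outliers are illustrative.

```r
diet1 <- ChickWeight$weight[ChickWeight$Diet == 1]

# boxplot.stats() returns the whisker/hinge values in $stats and the
# points lying beyond the whiskers in $out (coef = 1.5 by default)
stats <- boxplot.stats(diet1)
n_outliers <- length(stats$out)

print(stats$out)   # the outlying weights themselves
print(n_outliers)  # how many there are
```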

Boxplot of Weight of Chicks on Diet 2

Five Number Summary for Weight of Chicks on Diet 2

For chicks on diet 2, the minimum weight is 39 grams, the first quartile is 65 grams, the median is 104.5 grams, the third quartile is 163 grams, and the maximum is 331 grams.

Number of Outliers for Weight of Chicks on Diet 2

Boxplot of Weight of Chicks on Diet 3

Five Number Summary for Weight of Chicks on Diet 3

For chicks on diet 3, the minimum weight is 39 grams, the first quartile is 67 grams, the median is 125.5 grams, the third quartile is 199.5 grams, and the maximum is 373 grams.

Number of Outliers for Weight of Chicks on Diet 3

Boxplot of Weight of Chicks on Diet 4

Five Number Summary for Weight of Chicks on Diet 4

For chicks on diet 4, the minimum weight is 39 grams, the first quartile is 69 grams, the median is 129.5 grams, the third quartile is 185 grams, and the maximum is 322 grams.

Number of Outliers for Weight of Chicks on Diet 4
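To compare the four diets at a glance, the per-diet statistics can also be computed in one pass. This is a sketch using base R only; summary_by_diet and outliers_by_diet are illustrative names, and the 1.5 × IQR outlier rule is an assumption about how outliers were defined.

```r
# Split the weights into one vector per diet
weights_by_diet <- split(ChickWeight$weight, ChickWeight$Diet)

# One row per diet, one column per statistic
summary_by_diet <- t(sapply(weights_by_diet, fivenum))
colnames(summary_by_diet) <- c("Min", "Q1", "Median", "Q3", "Max")
print(summary_by_diet)

# Number of points beyond the boxplot whiskers for each diet
outliers_by_diet <- sapply(weights_by_diet,
                           function(w) length(boxplot.stats(w)$out))
print(outliers_by_diet)

# Side-by-side boxplots, one box per diet
boxplot(weight ~ Diet, data = ChickWeight,
        main = "Weight of Chicks by Diet", xlab = "Diet", ylab = "Weight (g)")
```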

ChickWeight DataFrame Codebook

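library(memisc)  # provides data.set() and codebook()

# Convert the data frame to a memisc data set, then print its codebook
data <- data.set(ChickWeight)
codebook(data)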
## ================================================================================
## 
##    ChickWeight.weight
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
##    Measurement: interval
## 
##         Min:  35.000
##         Max: 373.000
##        Mean: 121.818
##    Std.Dev.:  71.010
## 
## ================================================================================
## 
##    ChickWeight.Time
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
##    Measurement: interval
## 
##         Min:  0.000
##         Max: 21.000
##        Mean: 10.718
##    Std.Dev.:  6.753
## 
## ================================================================================
## 
##    ChickWeight.Chick
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Measurement: ordinal
## 
##    Values and labels       N Percent
##                                     
##     1 '18'                 2     0.3
##     2 '16'                 7     1.2
##     3 '15'                 8     1.4
##     4 '13'                12     2.1
##     5 '9'                 12     2.1
##     6 '20'                12     2.1
##     7 '10'                12     2.1
##     8 '8'                 11     1.9
##     9 '17'                12     2.1
##    10 '19'                12     2.1
##    11 '4'                 12     2.1
##    12 '6'                 12     2.1
##    13 '11'                12     2.1
##    14 '3'                 12     2.1
##    15 '1'                 12     2.1
##    16 '12'                12     2.1
##    17 '2'                 12     2.1
##    18 '5'                 12     2.1
##    19 '14'                12     2.1
##    20 '7'                 12     2.1
##    21 '24'                12     2.1
##    22 '30'                12     2.1
##    23 '22'                12     2.1
##    24 '23'                12     2.1
##    25 '27'                12     2.1
##    26 '28'                12     2.1
##    27 '26'                12     2.1
##    28 '25'                12     2.1
##    29 '29'                12     2.1
##    30 '21'                12     2.1
##    31 '33'                12     2.1
##    32 '37'                12     2.1
##    33 '36'                12     2.1
##    34 '31'                12     2.1
##    35 '39'                12     2.1
##    36 '38'                12     2.1
##    37 '32'                12     2.1
##    38 '40'                12     2.1
##    39 '34'                12     2.1
##    40 '35'                12     2.1
##    41 '44'                10     1.7
##    42 '45'                12     2.1
##    43 '43'                12     2.1
##    44 '41'                12     2.1
##    45 '47'                12     2.1
##    46 '49'                12     2.1
##    47 '46'                12     2.1
##    48 '50'                12     2.1
##    49 '42'                12     2.1
##    50 '48'                12     2.1
## 
## ================================================================================
## 
##    ChickWeight.Diet
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Measurement: nominal
## 
##    Values and labels       N Percent
##                                     
##    1 '1'                 220    38.1
##    2 '2'                 120    20.8
##    3 '3'                 120    20.8
##    4 '4'                 118    20.4

Summary of ChickWeight Dataframe
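A summary of this kind can be produced with base R alone. The following is a minimal sketch, not necessarily the calls used for the report; diet_counts is an illustrative name.

```r
# Structure: column names, types and the first few values
str(ChickWeight)

# Per-column summary: quartiles for numeric columns, counts for factors
summary(ChickWeight)

# Number of observations recorded under each diet
diet_counts <- table(ChickWeight$Diet)
print(diet_counts)
```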


Question 1b)

Demonstrate these FIVE (5) functions of dplyr for data manipulation:
i. filter()
ii. arrange()
iii. mutate()
iv. select()
v. summarise()


Creating a dataset containing the state, gender, blood oxygen level and number of days since contracting Covid for Covid patients in Malaysia
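library(dplyr)        # filter(), arrange(), mutate(), select(), summarise()
library(randomNames)  # randomNames() generates placeholder patient names

# Eight states repeated five times each (40 patients), then shuffled
state <- rep(c("Perak", "Selangor", "Johor", "Kedah", "Kelantan", "Terengganu", "Pahang", "Perlis"), each = 5)
state <- sample(state)

# Alternating genders of the same length, then shuffled
gender <- rep_len(c("Male", "Female"), length.out = length(state))
gender <- sample(gender)

# Blood oxygen readings drawn at random from 80.0 to 100.0 in steps of 0.1
bloodOx <- seq(80, 100, by = 0.1)
bloodOx <- sample(bloodOx, length(state), replace = TRUE)

# Days since contracting Covid drawn at random from 1 to 120
daysCovid <- seq(1, 120, by = 1)
daysCovid <- sample(daysCovid, length(state), replace = TRUE)

names <- randomNames(length(state))

# Assemble the data frame and give the columns descriptive names
covidData <- data.frame(names, state, gender, bloodOx, daysCovid)
colnames(covidData) <- c("Names", "State", "Gender", "Blood Oxygen Concentration", "Days Contracted Covid")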

Dataset Created

covidData
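Because the dataset is built with sample(), its contents change on every run. If reproducible output is wanted, the random number generator can be seeded first. The sketch below uses only base R; the seed value 2024 is an arbitrary choice for illustration.

```r
# Seed the generator; 2024 is an arbitrary illustrative value
set.seed(2024)
first_draw <- sample(seq(80, 100, by = 0.1), 5, replace = TRUE)

# Resetting the same seed reproduces exactly the same draw
set.seed(2024)
second_draw <- sample(seq(80, 100, by = 0.1), 5, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```

Calling set.seed() once, before the first sample() call, would make the generated states, genders, readings and days identical across runs.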

i) filter()

The filter() function from the dplyr package subsets a data frame, retaining all rows that satisfy the given logical conditions. Suppose a patient can be discharged once their blood oxygen concentration is above 95. We can obtain the patients whose blood oxygen concentration is higher than 95 by running the following line of R code:

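# Refer to the column inside covidData; backticks are needed because its name contains spaces
filter(covidData, `Blood Oxygen Concentration` > 95)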

R keeps only the rows of the data frame whose value in the “Blood Oxygen Concentration” column is greater than 95. The result is a new data frame containing just those rows; “covidData” itself is not modified.
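filter() also accepts several conditions at once. The sketch below uses a small hypothetical data frame (patients, with made-up values) rather than the generated dataset, so it can be run on its own.

```r
library(dplyr)

# Hypothetical toy data; check.names = FALSE keeps the spaces in the column name
patients <- data.frame(
  Names = c("A", "B", "C", "D"),
  Gender = c("Male", "Female", "Female", "Male"),
  `Blood Oxygen Concentration` = c(96.2, 91.0, 97.5, 95.0),
  check.names = FALSE
)

# Conditions separated by commas are combined with AND.
# Backticks are required because the column name contains spaces.
dischargeable <- filter(patients,
                        `Blood Oxygen Concentration` > 95,
                        Gender == "Female")
print(dischargeable)
```

Only patient C satisfies both conditions here; patient D is excluded because 95.0 is not strictly greater than 95.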


ii) arrange()

The arrange() function from the dplyr package is used to order the rows of a data frame by the values of selected columns. Suppose we want to list the Covid patients by the number of days since they contracted Covid, from the most days to the fewest. We can run the following R code:

# Sort by the column inside covidData, largest value first
arrange(covidData, desc(`Days Contracted Covid`))

arrange() returns a copy of “covidData” with its rows reordered so that the first row contains the largest value in the “Days Contracted Covid” column. In other words, the rows are sorted in descending order of that column. The desc() function requests descending order for the variable in its parentheses; without it, arrange() sorts in ascending order.
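arrange() can sort by more than one column, using later columns to break ties in earlier ones. A minimal sketch on a hypothetical data frame (patients, with made-up values):

```r
library(dplyr)

# Hypothetical toy data for illustration
patients <- data.frame(
  Names = c("A", "B", "C", "D"),
  State = c("Perak", "Johor", "Perak", "Johor"),
  daysCovid = c(10, 45, 30, 5)
)

# Sort by State (ascending), then by daysCovid (descending) within each state
sorted <- arrange(patients, State, desc(daysCovid))
print(sorted)

# The original data frame keeps its row order unless the result is assigned back
print(patients$Names)
```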


iii) mutate()

The mutate() function from the dplyr package is used to create new variables from existing data. It takes the data set followed by one or more name = expression pairs, where the name is the new variable to create and the expression defines how it is computed. Suppose we want to determine the current cumulative hospital fees for all patients, with a net hospital fee of RM150 per day. We can run the following R code:

# Compute the fee from the data frame's own column; mutate() names the new column "Fees"
covidData <- mutate(covidData, Fees = `Days Contracted Covid` * 150)
covidData

For each row of “covidData”, R multiplies the value in the “Days Contracted Covid” column by 150 and stores the result in a new column called “Fees”.
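Several columns can be created in a single mutate() call. The sketch below uses a hypothetical data frame (patients, with made-up values) and reuses the RM150 daily fee and the above-95 discharge rule from this section; the column name Dischargeable is illustrative.

```r
library(dplyr)

# Hypothetical toy data for illustration
patients <- data.frame(
  Names = c("A", "B", "C"),
  bloodOx = c(96.2, 91.0, 97.5),
  daysCovid = c(3, 10, 5)
)

# Each name = expression pair adds one column, computed row by row
patients <- mutate(patients,
                   Fees = daysCovid * 150,
                   Dischargeable = bloodOx > 95)
print(patients)
```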


iv) select()

The select() function from the dplyr package is used to subset the columns (variables) of a data frame, for example by name. Suppose we want a table with only two columns, the patients' names and their fees. We can keep just the “Names” and “Fees” columns by running the R code below:

dplyr::select(covidData, "Names", "Fees")

R looks up the columns “Names” and “Fees” in “covidData” and returns a new data frame consisting of only those two columns. The original data frame is unchanged.
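Besides listing columns by name, select() can drop columns or pick them with helper functions. A minimal sketch on a hypothetical data frame (patients, with made-up values; the FeesPaid column is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data for illustration
patients <- data.frame(
  Names = c("A", "B", "C"),
  State = c("Perak", "Johor", "Kedah"),
  Fees = c(450, 1500, 750),
  FeesPaid = c(TRUE, FALSE, TRUE)
)

by_name <- select(patients, Names, Fees)             # keep the listed columns
without_state <- select(patients, -State)            # drop one column
fee_columns <- select(patients, starts_with("Fee"))  # keep columns by name prefix

print(names(by_name))
print(names(without_state))
print(names(fee_columns))
```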


v) summarise()

The summarise() function from the dplyr package reduces a data set to summary statistics. Its output is a data frame. Suppose we want the average blood oxygen concentration across all patients and the median number of days since a patient contracted Covid. We can execute the following line of R code:

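# Summarise the columns inside covidData; backticks are needed because the names contain spaces
summarise(covidData, bloodOx_mean = mean(`Blood Oxygen Concentration`), daysCovid_median = median(`Days Contracted Covid`))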

R sums all the values in the “Blood Oxygen Concentration” column, divides the total by the number of observations, and stores the result in the variable “bloodOx_mean”. Next, R sorts the values in the “Days Contracted Covid” column, finds the middle value, and stores the result in the variable “daysCovid_median”. The summarise() function then returns a data frame with a single row and two columns, “bloodOx_mean” and “daysCovid_median”.
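summarise() is often combined with group_by() to obtain one summary row per group instead of a single row overall. The sketch below computes the same two statistics per state on a hypothetical data frame (patients, with made-up values).

```r
library(dplyr)

# Hypothetical toy data for illustration
patients <- data.frame(
  State = c("Perak", "Johor", "Perak", "Johor"),
  bloodOx = c(90, 96, 94, 98),
  daysCovid = c(10, 45, 30, 5)
)

# group_by() marks the groups; summarise() then returns one row per state
by_state <- summarise(group_by(patients, State),
                      bloodOx_mean = mean(bloodOx),
                      daysCovid_median = median(daysCovid),
                      .groups = "drop")
print(by_state)
```

With an even number of values per group, median() averages the two middle values, so Johor's median of 5 and 45 days is 25.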