ChickWeight
For chicks on diet 1, the minimum weight is 35 grams and the maximum is 305 grams; the first quartile is 57.5 grams, the median is 88 grams and the third quartile is 137 grams.
For chicks on diet 2, the minimum weight is 39 grams and the maximum is 331 grams; the first quartile is 65 grams, the median is 104.5 grams and the third quartile is 163 grams.
For chicks on diet 3, the minimum weight is 39 grams and the maximum is 373 grams; the first quartile is 67 grams, the median is 125.5 grams and the third quartile is 199.5 grams.
For chicks on diet 4, the minimum weight is 39 grams and the maximum is 322 grams; the first quartile is 69 grams, the median is 129.5 grams and the third quartile is 185 grams.
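The figures above can be recomputed from the built-in ChickWeight data. The following is a minimal sketch, assuming the quartiles were read from a boxplot (the source does not say how they were obtained); it uses fivenum(), which returns the minimum, lower hinge, median, upper hinge and maximum that boxplot() draws. The name summary_table is only illustrative.

```r
# ChickWeight ships with base R (datasets package), so no extra package is needed.
# Split the weights by diet and compute Tukey's five-number summary for each group.
weights_by_diet <- split(ChickWeight$weight, ChickWeight$Diet)
summary_table <- t(sapply(weights_by_diet, fivenum))

# One row per diet, one column per statistic
colnames(summary_table) <- c("Min", "Q1", "Median", "Q3", "Max")
summary_table
```

Note that quantile() with its default settings interpolates differently from fivenum(), so its quartiles can differ slightly from the hinges.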
library(memisc)  # provides data.set() and codebook()
data <- data.set(ChickWeight)
codebook(data)
## ================================================================================
##
## ChickWeight.weight
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
## Measurement: interval
##
## Min: 35.000
## Max: 373.000
## Mean: 121.818
## Std.Dev.: 71.010
##
## ================================================================================
##
## ChickWeight.Time
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
## Measurement: interval
##
## Min: 0.000
## Max: 21.000
## Mean: 10.718
## Std.Dev.: 6.753
##
## ================================================================================
##
## ChickWeight.Chick
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
## Measurement: ordinal
##
## Values and labels N Percent
##
## 1 '18' 2 0.3
## 2 '16' 7 1.2
## 3 '15' 8 1.4
## 4 '13' 12 2.1
## 5 '9' 12 2.1
## 6 '20' 12 2.1
## 7 '10' 12 2.1
## 8 '8' 11 1.9
## 9 '17' 12 2.1
## 10 '19' 12 2.1
## 11 '4' 12 2.1
## 12 '6' 12 2.1
## 13 '11' 12 2.1
## 14 '3' 12 2.1
## 15 '1' 12 2.1
## 16 '12' 12 2.1
## 17 '2' 12 2.1
## 18 '5' 12 2.1
## 19 '14' 12 2.1
## 20 '7' 12 2.1
## 21 '24' 12 2.1
## 22 '30' 12 2.1
## 23 '22' 12 2.1
## 24 '23' 12 2.1
## 25 '27' 12 2.1
## 26 '28' 12 2.1
## 27 '26' 12 2.1
## 28 '25' 12 2.1
## 29 '29' 12 2.1
## 30 '21' 12 2.1
## 31 '33' 12 2.1
## 32 '37' 12 2.1
## 33 '36' 12 2.1
## 34 '31' 12 2.1
## 35 '39' 12 2.1
## 36 '38' 12 2.1
## 37 '32' 12 2.1
## 38 '40' 12 2.1
## 39 '34' 12 2.1
## 40 '35' 12 2.1
## 41 '44' 10 1.7
## 42 '45' 12 2.1
## 43 '43' 12 2.1
## 44 '41' 12 2.1
## 45 '47' 12 2.1
## 46 '49' 12 2.1
## 47 '46' 12 2.1
## 48 '50' 12 2.1
## 49 '42' 12 2.1
## 50 '48' 12 2.1
##
## ================================================================================
##
## ChickWeight.Diet
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
## Measurement: nominal
##
## Values and labels N Percent
##
## 1 '1' 220 38.1
## 2 '2' 120 20.8
## 3 '3' 120 20.8
## 4 '4' 118 20.4
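As a cross-check on the codebook output, the same counts and ranges can be obtained with base R alone. This is a small sketch rather than a replacement for codebook(); the variable names are illustrative.

```r
# Cross-check the codebook with base R (ChickWeight is in the datasets package)
n_rows <- nrow(ChickWeight)                      # total number of observations
diet_counts <- table(ChickWeight$Diet)           # observations per diet
mean_weight <- round(mean(ChickWeight$weight), 3)
time_range <- range(ChickWeight$Time)            # first and last measurement day

n_rows
diet_counts
mean_weight
time_range
```

The base R function sd() divides by n - 1, so its result may differ slightly from the Std.Dev. reported in the codebook if that value was computed with a different denominator.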
Demonstrate these FIVE (5) functions of dplyr for data manipulation:
i. filter()
ii. arrange()
iii. mutate()
iv. select()
v. summarise()
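library(dplyr)        # filter(), arrange(), mutate(), select(), summarise()
library(randomNames)  # randomNames()

# 8 states x 5 patients each = 40 rows, in random order
state <- rep(c("Perak", "Selangor", "Johor", "Kedah", "Kelantan", "Terengganu", "Pahang", "Perlis"), each = 5)
state <- sample(state)
# Equal numbers of male and female patients, in random order
gender <- rep_len(c("Male", "Female"), length.out = length(state))
gender <- sample(gender)
# Blood oxygen readings sampled from 80.0 to 100.0 in steps of 0.1
bloodOx <- seq(80, 100, by = 0.1)
bloodOx <- sample(bloodOx, length(state), replace = TRUE)
# Days since contracting Covid, sampled from 1 to 120
daysCovid <- seq(1, 120, by = 1)
daysCovid <- sample(daysCovid, length(state), replace = TRUE)
# One random name per patient
names <- randomNames(length(state))
covidData <- data.frame(names, state, gender, bloodOx, daysCovid)
colnames(covidData) <- c("Names", "State", "Gender", "Blood Oxygen Concentration", "Days Contracted Covid")
covidData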
The function filter() from the package dplyr is used to subset a data frame, retaining all rows that satisfy the given logical conditions. Suppose a patient can be discharged once their blood oxygen concentration is above 95. We can obtain the patients whose blood oxygen concentration is higher than 95 by running the following line of R code. Because the column name contains spaces, it is wrapped in backticks:
filter(covidData, `Blood Oxygen Concentration` > 95)
R keeps every row of the data frame whose value in the “Blood Oxygen Concentration” column is greater than 95. The result is a data frame containing only those rows.
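To make the row selection concrete, here is a minimal sketch on a toy data frame with made-up values (the names toy and discharged are illustrative and not part of the original data). It also shows the base R subsetting that filter() corresponds to here.

```r
library(dplyr)

# Toy data with made-up values; check.names = FALSE keeps the space in the column name
toy <- data.frame(
  Names = c("A", "B", "C", "D"),
  `Blood Oxygen Concentration` = c(96.2, 88.5, 95.0, 99.1),
  check.names = FALSE
)

# Keep only rows strictly above 95 (a reading of exactly 95.0 is excluded)
discharged <- filter(toy, `Blood Oxygen Concentration` > 95)
discharged$Names  # "A" "D"

# Base R equivalent: logical indexing on the rows
base_subset <- toy[toy$`Blood Oxygen Concentration` > 95, ]
```

Several conditions can be passed to filter() separated by commas, in which case a row must satisfy all of them to be kept.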
The function arrange() from the package dplyr is used to arrange the rows of a data frame by the values of selected columns. Suppose we want to order the Covid patients by how many days they have had Covid, from the largest number of days to the smallest. We can run the following R code:
arrange(covidData, desc(`Days Contracted Covid`))
R rearranges the rows of “covidData” such that the first row contains the largest value in the column “Days Contracted Covid”. In other words, the rows are sorted by “Days Contracted Covid” in descending order. The desc() function tells arrange() to sort the variable in its parentheses in descending order; without it, arrange() sorts in ascending order.
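The effect of desc() can be checked on a toy data frame with made-up values (toy, sorted and base_sorted are illustrative names). The sketch also compares the result with base R's order().

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  Names = c("A", "B", "C", "D"),
  `Days Contracted Covid` = c(14, 90, 3, 41),
  check.names = FALSE
)

# Largest number of days first
sorted <- arrange(toy, desc(`Days Contracted Covid`))
sorted$Names  # "B" "D" "A" "C"

# Base R equivalent using order()
base_sorted <- toy[order(toy$`Days Contracted Covid`, decreasing = TRUE), ]
```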
The function mutate() from the package dplyr is used to create new variables from the data. mutate() takes the data set followed by one or more name = expression pairs, where the name is the new variable and the expression defines how it is computed. Suppose we want to determine the current cumulative hospital fees for all patients, and the net hospital fee is RM150 per day. We can run the following lines of R code:
covidData <- mutate(covidData, Fees = `Days Contracted Covid` * 150)
covidData
For each row of “covidData”, R multiplies the value in the “Days Contracted Covid” column by 150 and stores the result in a new column called “Fees”. The existing columns are kept unchanged.
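The fee calculation can be verified by hand on a toy data frame with made-up values. In this sketch, daily_rate is an illustrative variable holding the RM150 rate from the text, and toy and billed are likewise illustrative names.

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  Names = c("A", "B", "C"),
  `Days Contracted Covid` = c(14, 90, 3),
  check.names = FALSE
)

daily_rate <- 150  # RM per day, as stated in the text

# Add the Fees column; the original columns are kept
billed <- mutate(toy, Fees = `Days Contracted Covid` * daily_rate)
billed$Fees  # 2100 13500 450
```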
The function select() from the package dplyr is used to subset the columns (variables) of a data frame by name or by condition. Suppose we want a list consisting of only two columns, the names of the patients and their fees. We can subset the columns “Names” and “Fees” by running the R code below:
dplyr::select(covidData, "Names", "Fees")
R looks up the columns “Names” and “Fees” in the dataset “covidData” and returns a new data frame consisting of only those two columns. The original dataset is not modified.
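A short sketch of select() on a toy data frame with made-up values (toy, picked and dropped are illustrative names). The dplyr:: prefix is presumably used in the text because another attached package also exports a function called select() (MASS is a common example); that reason is an assumption, as the source does not state it.

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  Names = c("A", "B"),
  State = c("Perak", "Johor"),
  Fees = c(2100, 450)
)

# Keep two columns by name (quoted names such as "Names" also work)
picked <- dplyr::select(toy, Names, Fees)
names(picked)  # "Names" "Fees"

# Same result by dropping the unwanted column instead
dropped <- dplyr::select(toy, -State)
```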
The function summarise() from the package dplyr is used to reduce a dataset to summary statistics. The output of the summarise() function is a data frame. Suppose we want to determine the average blood oxygen concentration among all the patients and the median number of days a patient has had Covid. We can execute the following line of R code:
summarise(covidData, bloodOx_mean = mean(`Blood Oxygen Concentration`), daysCovid_median = median(`Days Contracted Covid`))
R adds up all the values in the column “Blood Oxygen Concentration”, divides the total by the number of observations, and stores the result in the variable “bloodOx_mean”. Next, R sorts the values in the column “Days Contracted Covid”, takes the middle value (with an even number of rows, as in the 40-row “covidData”, the average of the two middle values), and stores the result in the variable “daysCovid_median”. The summarise() function then returns a data frame with one row and two columns, “bloodOx_mean” and “daysCovid_median”.
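The two statistics can be reproduced by hand on a toy data frame with made-up values, which makes the description above easy to check. The names toy, result, manual_mean and manual_median are illustrative.

```r
library(dplyr)

# Toy data with made-up values (4 rows, so the median averages the two middle values)
toy <- data.frame(
  `Blood Oxygen Concentration` = c(90, 95, 100, 85),
  `Days Contracted Covid` = c(10, 3, 40, 7),
  check.names = FALSE
)

result <- summarise(
  toy,
  bloodOx_mean = mean(`Blood Oxygen Concentration`),
  daysCovid_median = median(`Days Contracted Covid`)
)
result  # one row: bloodOx_mean = 92.5, daysCovid_median = 8.5

# The same values computed step by step
manual_mean <- sum(toy$`Blood Oxygen Concentration`) / nrow(toy)  # 370 / 4
sorted_days <- sort(toy$`Days Contracted Covid`)                   # 3 7 10 40
manual_median <- (sorted_days[2] + sorted_days[3]) / 2             # (7 + 10) / 2
```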