a) Confusion Matrix

b) Exploratory Data Analysis

Weight & Height of American women aged 30-39

data(women)
women
##    height weight
## 1      58    115
## 2      59    117
## 3      60    120
## 4      61    123
## 5      62    126
## 6      63    129
## 7      64    132
## 8      65    135
## 9      66    139
## 10     67    142
## 11     68    146
## 12     69    150
## 13     70    154
## 14     71    159
## 15     72    164

Summary of the height of the sample of American women:

summary(women$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    58.0    61.5    65.0    65.0    68.5    72.0

To describe the data, the summary() function is applied to the height variable. Based on the result above, height ranges from a minimum of 58 inches to a maximum of 72 inches. The mean and the median are both 65 inches.

quantile(women$height)
##   0%  25%  50%  75% 100% 
## 58.0 61.5 65.0 68.5 72.0

To find the quartiles of the height, the quantile() function is used.

IQR(women$height)
## [1] 7

To find the interquartile range of the height, the IQR() function is used. The interquartile range is 7.
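
As a quick check, the interquartile range is the third quartile minus the first quartile from the quantile() output above. A minimal sketch on the same built-in data (the names q and iqr_manual are illustrative only):

```r
# First and third quartiles of height, as reported by quantile()
q <- quantile(women$height, probs = c(0.25, 0.75))

# IQR = third quartile - first quartile
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_manual                                  # 68.5 - 61.5 = 7

# Agrees with the built-in function
all.equal(iqr_manual, IQR(women$height))    # TRUE
```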

var(women$height)
## [1] 20

To find the variance of the height, the var() function is used. The variance is 20.

sd(women$height)
## [1] 4.472136

To find the standard deviation of the height, the sd() function is used. The standard deviation is 4.472136.
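
The two results above are linked: sd() is the square root of var(), and var() is the sample variance with divisor n - 1. A short sketch that recomputes both by hand (the names x, n, var_manual and sd_manual are illustrative only):

```r
x <- women$height
n <- length(x)                                # 15 observations

# Sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)
var_manual                                    # 280 / 14 = 20

# Standard deviation is the square root of the variance
sd_manual <- sqrt(var_manual)
sd_manual                                     # 4.472136

all.equal(sd_manual, sd(x))                   # TRUE
```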

Summary of the weight of the sample of American women:

summary(women$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   115.0   124.5   135.0   136.7   148.0   164.0

To describe the data, the summary() function is applied to the weight variable. Based on the result above, weight ranges from a minimum of 115 lbs to a maximum of 164 lbs. The mean is 136.7 lbs. The median is 135.0 lbs.

quantile(women$weight)
##    0%   25%   50%   75%  100% 
## 115.0 124.5 135.0 148.0 164.0

To find the quartiles of the weight, the quantile() function is used.

IQR(women$weight)
## [1] 23.5

To find the interquartile range of the weight, the IQR() function is used. The interquartile range is 23.5.

var(women$weight)
## [1] 240.2095

To find the variance of the weight, the var() function is used. The variance is 240.2095.

sd(women$weight)
## [1] 15.49869

To find the standard deviation of the weight, the sd() function is used. The standard deviation is 15.49869.

Plot:

plot(women, xlab = "Height (in)", ylab = "Weight (lb)", main = "Weight & Height of American Women aged 30-39")
fit <- lm(weight~height, data = women)
abline(fit, lty="dashed")
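
The dashed line drawn by abline() is the least-squares line from lm(). As a sketch of how that line can be read numerically, the coefficients and a prediction can be inspected directly (the name pred_65 is illustrative only):

```r
# Same model as in the plot: weight regressed on height
fit <- lm(weight ~ height, data = women)

# Intercept and slope of the dashed line
coef(fit)      # intercept about -87.52, slope 3.45 lb per inch

# Predicted weight at the mean height of 65 in;
# a least-squares line passes through the point of means
pred_65 <- predict(fit, newdata = data.frame(height = 65))
pred_65        # about 136.73, the mean weight
```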

Codebook:

codebook(women)
## ================================================================================
## 
##    height
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: 58.000
##         Max: 72.000
##        Mean: 65.000
##    Std.Dev.:  4.320
##    Skewness:  0.000
##    Kurtosis: -1.211
## 
## ================================================================================
## 
##    weight
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min: 115.000
##         Max: 164.000
##        Mean: 136.733
##    Std.Dev.:  14.973
##    Skewness:   0.252
##    Kurtosis:  -1.100
summary(women)
##      height         weight     
##  Min.   :58.0   Min.   :115.0  
##  1st Qu.:61.5   1st Qu.:124.5  
##  Median :65.0   Median :135.0  
##  Mean   :65.0   Mean   :136.7  
##  3rd Qu.:68.5   3rd Qu.:148.0  
##  Max.   :72.0   Max.   :164.0
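
The Std.Dev. values in the codebook output above (4.320 and 14.973) differ from the sd() results (4.472136 and 15.49869). The numbers are consistent with the codebook dividing by n instead of n - 1; this is inferred from the figures, not taken from the function's documentation. A sketch with an illustrative helper, sd_n():

```r
# Standard deviation with divisor n (population form).
# sd_n is a hypothetical helper, not part of any package.
sd_n <- function(x) sqrt(mean((x - mean(x))^2))

round(sd_n(women$height), 3)   # 4.32, matches the codebook's 4.320
round(sd_n(women$weight), 3)   # 14.973, matches the codebook

# sd() uses divisor n - 1, so it is slightly larger
sd(women$height)               # 4.472136
```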

c) Data Manipulation with dplyr

states <- c("Perlis","Kedah","Penang","Perak","Selangor","Kuala Lumpur","Putrajaya", "Malacca","Johore","Kelantan","Terengganu", "Negeri Sembilan","Pahang")
region <- c("Northern","Northern","Northern","Northern","Central","Central","Central","Southern","Southern","East Coast","East Coast","Southern","East Coast")
cases <- c(8,422,372,279,2235,447,30,209,549,626,266,434,263)
covidmalaysia <- data.frame(states, cases, region)
print(covidmalaysia)
##             states cases     region
## 1           Perlis     8   Northern
## 2            Kedah   422   Northern
## 3           Penang   372   Northern
## 4            Perak   279   Northern
## 5         Selangor  2235    Central
## 6     Kuala Lumpur   447    Central
## 7        Putrajaya    30    Central
## 8          Malacca   209   Southern
## 9           Johore   549   Southern
## 10        Kelantan   626 East Coast
## 11      Terengganu   266 East Coast
## 12 Negeri Sembilan   434   Southern
## 13          Pahang   263 East Coast

This dataset records the number of Covid-19 cases in the states and federal territories of Peninsular Malaysia on 23 May 2021. The data is retrieved from the official website of the Ministry of Health, covid-19.moh.gov.my

  1. Change the column name
library(dplyr)  # provides rename(), filter(), mutate(), left_join() and %>%
covidmalaysia2 <- rename(covidmalaysia, Peninsular_States = states, Covid19_Cases = cases)
covidmalaysia2
##    Peninsular_States Covid19_Cases     region
## 1             Perlis             8   Northern
## 2              Kedah           422   Northern
## 3             Penang           372   Northern
## 4              Perak           279   Northern
## 5           Selangor          2235    Central
## 6       Kuala Lumpur           447    Central
## 7          Putrajaya            30    Central
## 8            Malacca           209   Southern
## 9             Johore           549   Southern
## 10          Kelantan           626 East Coast
## 11        Terengganu           266 East Coast
## 12   Negeri Sembilan           434   Southern
## 13            Pahang           263 East Coast

The rename() function is used to change the column names. The column ‘states’ is changed to ‘Peninsular_States’. The column ‘cases’ is changed to ‘Covid19_Cases’.

  2. Pick rows based on their values
covidmalaysia2 %>% filter(region=="Central")
##   Peninsular_States Covid19_Cases  region
## 1          Selangor          2235 Central
## 2      Kuala Lumpur           447 Central
## 3         Putrajaya            30 Central

The filter() function is used to pick the rows of ‘covidmalaysia2’ whose region is “Central”. The output contains Selangor, Kuala Lumpur, and Putrajaya.
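
filter() also accepts several conditions, which are combined with a logical AND. A minimal sketch, rebuilding the renamed data frame so the example runs on its own (the threshold of 100 cases and the name central_large are illustrative choices, not part of the original analysis):

```r
library(dplyr)

# Rebuild the renamed data frame from the values shown above
covidmalaysia2 <- data.frame(
  Peninsular_States = c("Perlis", "Kedah", "Penang", "Perak", "Selangor",
                        "Kuala Lumpur", "Putrajaya", "Malacca", "Johore",
                        "Kelantan", "Terengganu", "Negeri Sembilan", "Pahang"),
  Covid19_Cases = c(8, 422, 372, 279, 2235, 447, 30, 209, 549, 626, 266, 434, 263),
  region = c("Northern", "Northern", "Northern", "Northern", "Central",
             "Central", "Central", "Southern", "Southern", "East Coast",
             "East Coast", "Southern", "East Coast")
)

# Central rows with more than 100 cases (both conditions must hold)
central_large <- covidmalaysia2 %>%
  filter(region == "Central", Covid19_Cases > 100)
central_large   # Selangor and Kuala Lumpur

# Equivalent base R subsetting
covidmalaysia2[covidmalaysia2$region == "Central" &
                 covidmalaysia2$Covid19_Cases > 100, ]
```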

  3. Add new columns to a data frame
cluster <- c(0, 79, 121,20,72,40,0,59,194,140,113,55,117)
closecontact <- c(6,194,110,167,1768,263,23,99,254,361,112,253, 110)
covidmalaysia2 %>% mutate(cluster,closecontact, .after=Covid19_Cases)
##    Peninsular_States Covid19_Cases cluster closecontact     region
## 1             Perlis             8       0            6   Northern
## 2              Kedah           422      79          194   Northern
## 3             Penang           372     121          110   Northern
## 4              Perak           279      20          167   Northern
## 5           Selangor          2235      72         1768    Central
## 6       Kuala Lumpur           447      40          263    Central
## 7          Putrajaya            30       0           23    Central
## 8            Malacca           209      59           99   Southern
## 9             Johore           549     194          254   Southern
## 10          Kelantan           626     140          361 East Coast
## 11        Terengganu           266     113          112 East Coast
## 12   Negeri Sembilan           434      55          253   Southern
## 13            Pahang           263     117          110 East Coast

Two new columns called ‘cluster’ and ‘closecontact’ are added after ‘Covid19_Cases’ using the mutate() function. The data is retrieved from covid-19.moh.gov.my
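
mutate() returns a new data frame and leaves its input untouched, so the added columns persist only if the result is assigned. A minimal sketch on the three Central rows only (the names central and central_ext are illustrative, not from the original; the .after argument assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Three Central rows, copied from the table above
central <- data.frame(
  Peninsular_States = c("Selangor", "Kuala Lumpur", "Putrajaya"),
  Covid19_Cases = c(2235, 447, 30),
  region = "Central"
)
cluster <- c(72, 40, 0)
closecontact <- c(1768, 263, 23)

# Assign the result to keep the new columns
central_ext <- central %>%
  mutate(cluster, closecontact, .after = Covid19_Cases)

names(central)       # still the original three columns
names(central_ext)   # cluster and closecontact inserted after Covid19_Cases
```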

  4. Combine datasets
covidmalaysia3 <- data.frame(Peninsular_States = c("Perlis","Kedah","Penang","Perak","Selangor","Kuala Lumpur","Putrajaya", "Malacca","Johore","Kelantan","Terengganu", "Negeri Sembilan","Pahang"), importcase = c(0,0,0,0,0,4,0,0,0,0,0,0,0))

covidmalaysia4 <- left_join(x = covidmalaysia2, y = covidmalaysia3, by = "Peninsular_States")
covidmalaysia4
##    Peninsular_States Covid19_Cases     region importcase
## 1             Perlis             8   Northern          0
## 2              Kedah           422   Northern          0
## 3             Penang           372   Northern          0
## 4              Perak           279   Northern          0
## 5           Selangor          2235    Central          0
## 6       Kuala Lumpur           447    Central          4
## 7          Putrajaya            30    Central          0
## 8            Malacca           209   Southern          0
## 9             Johore           549   Southern          0
## 10          Kelantan           626 East Coast          0
## 11        Terengganu           266 East Coast          0
## 12   Negeri Sembilan           434   Southern          0
## 13            Pahang           263 East Coast          0

The left_join() function combines ‘covidmalaysia2’ with ‘covidmalaysia3’ by matching rows on the ‘Peninsular_States’ column. Every row of ‘covidmalaysia2’ is kept and the ‘importcase’ column is added, and the result is stored in ‘covidmalaysia4’. The data is retrieved from covid-19.moh.gov.my
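
In the join above every state has a match in the second data frame. As a sketch of what left_join() does when a key is missing on the right-hand side, the example below uses a deliberately incomplete lookup table (left_tbl, right_tbl and joined are illustrative names, and right_tbl is a made-up subset for demonstration only):

```r
library(dplyr)

# Left table: three rows copied from the data above
left_tbl <- data.frame(
  Peninsular_States = c("Selangor", "Kuala Lumpur", "Putrajaya"),
  Covid19_Cases = c(2235, 447, 30)
)

# Right table: intentionally contains only one of the three keys
right_tbl <- data.frame(
  Peninsular_States = "Kuala Lumpur",
  importcase = 4
)

# All left rows are kept; unmatched keys get NA in the joined column
joined <- left_join(left_tbl, right_tbl, by = "Peninsular_States")
joined
```

Rows without a match keep their left-hand values and receive NA for importcase, which is why a left join never drops rows from the first data frame.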