Question a

The table below shows the prediction results for people taking a screening test for coronavirus disease (COVID-19).

The confusion matrix for the table above has True Positive (TP) = 120, False Negative (FN) = 15, False Positive (FP) = 10 and True Negative (TN) = 50.
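The four counts above are enough to derive the usual classification metrics. The sketch below assumes the standard definitions of accuracy, precision, recall (sensitivity), specificity and F1 score; which of these the question actually asks for is not stated here, so treat it as an illustration only.

```r
# Counts taken from the confusion matrix above
TP <- 120; FN <- 15; FP <- 10; TN <- 50

total       <- TP + FN + FP + TN
accuracy    <- (TP + TN) / total        # share of all predictions that are correct
precision   <- TP / (TP + FP)           # share of predicted positives that are truly positive
recall      <- TP / (TP + FN)           # sensitivity: share of actual positives that are detected
specificity <- TN / (TN + FP)           # share of actual negatives that are detected
f1          <- 2 * precision * recall / (precision + recall)

metrics <- c(accuracy = accuracy, precision = precision, recall = recall,
             specificity = specificity, f1 = f1)
print(round(metrics, 4))
```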

Question b

The “iris” data set is chosen for performing EDA and creating the codebook.

data("iris")

Overview of the dataset

Dimension (number of observations and attributes)

dim(iris)

## [1] 150   5

∴ sample size = 150, number of attributes = 5

Class and type of object (data set)

class(iris)

## [1] "data.frame"

typeof(iris)

## [1] "list"

Names of attributes

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Data type of each attribute

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

∴ Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric, while Species is a factor variable.

Detect missing values by column (attribute)

colSums(is.na(iris))

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

∴ There are no missing values in the data set.

View the first few rows of the dataset

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Summary statistics using univariate analysis

Store the names of the numerical columns, dropping the column “Species” (index 5 in the iris dataset)

col_names <- names(iris)[-5]

Run the univariate analysis: minimum, maximum, mean, median, quartiles, variance, standard deviation and interquartile range

sapply(iris[col_names], min)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1

sapply(iris[col_names], max)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5

sapply(iris[col_names], mean)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333

sapply(iris[col_names], median)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         5.80         3.00         4.35         1.30

sapply(iris[col_names], quantile)

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            4.3         2.0         1.00         0.1
## 25%           5.1         2.8         1.60         0.3
## 50%           5.8         3.0         4.35         1.3
## 75%           6.4         3.3         5.10         1.8
## 100%          7.9         4.4         6.90         2.5

sapply(iris[col_names], var)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.6856935    0.1899794    3.1162779    0.5810063

sapply(iris[col_names], sd)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377

sapply(iris[col_names], IQR)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5

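Alternatively, summary() reports the minimum, quartiles, median, mean and maximum in one call (but not the variance, standard deviation or interquartile range)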

summary(iris)

Further information is available at https://www.statmethods.net/stats/descriptives.html
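The separate sapply calls can also be collected into a single table, which is often easier to paste into a codebook. This is a minimal sketch; the helper name describe and the object stats_table are illustrative and not part of the original analysis.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]   # the four numeric columns

# Hypothetical helper: returns several descriptive statistics for one numeric vector
describe <- function(x) {
  c(min = min(x), max = max(x), mean = mean(x), median = median(x),
    var = var(x), sd = sd(x), IQR = IQR(x))
}

# sapply() gives one column per variable; t() turns that into one row per variable
stats_table <- t(sapply(iris[num_cols], describe))
print(round(stats_table, 3))
```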

Correlation plot for variables

A correlation plot can only be produced for continuous (numerical) variables

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.0.5

## corrplot 0.88 loaded

# Store the matrix under its own name so the cor() function is not masked
cor_mat <- cor(iris[col_names])
corrplot(cor_mat, method = "number")

∴ Sepal width has a negative (inverse) correlation with each of the other numerical columns (sepal length, petal length, petal width)

An explanation of the correlation coefficient is available at http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
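The sign of each correlation can also be checked numerically, without reading values off the plot. The sketch below uses only base R and the default Pearson coefficient; the object names are illustrative.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]
cor_mat  <- cor(iris[num_cols])                 # Pearson correlation by default

# Correlations of Sepal.Width with the other three numeric columns
sw <- cor_mat["Sepal.Width", setdiff(num_cols, "Sepal.Width")]
print(round(sw, 3))

# TRUE when every one of those correlations is negative
print(all(sw < 0))
```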

Visualization of variables

Histograms for numerical variables

hist(iris$Sepal.Length, xlab = "Sepal length", main = "Histogram of sepal length")

hist(iris$Sepal.Width, xlab = "Sepal width", main = "Histogram of sepal width")

hist(iris$Petal.Length, xlab = "Petal length", main = "Histogram of petal length")

hist(iris$Petal.Width, xlab = "Petal width", main = "Histogram of petal width")
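The four calls differ only in the column they plot, so they can be written as a loop that places the histograms in one 2 x 2 grid. This is a sketch using base graphics; the label construction with gsub() is an illustrative choice and produces lower-case labels.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]

old_par <- par(mfrow = c(2, 2))                  # 2 x 2 grid of panels
for (col in num_cols) {
  # "Sepal.Length" -> "sepal length"
  label <- tolower(gsub(".", " ", col, fixed = TRUE))
  hist(iris[[col]], xlab = label, main = paste("Histogram of", label))
}
par(old_par)                                     # restore the previous layout
```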

A histogram applies only to numerical variables, so a bar chart is used instead for the categorical (factor) variable “Species”

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

ggplot(iris) + geom_bar(aes(x=Species)) + labs(title = "Count for flower species")

Check for outliers

Boxplots for numerical variables

boxplot(iris$Sepal.Length, ylab = "Sepal length", xlab = "Flower", main = "Boxplot for sepal length")

boxplot(iris$Sepal.Width, ylab = "Sepal width", xlab = "Flower", main = "Boxplot for sepal width")

boxplot(iris$Petal.Length, ylab = "Petal length", xlab = "Flower", main = "Boxplot for petal length")

boxplot(iris$Petal.Width, ylab = "Petal width", xlab = "Flower", main = "Boxplot for petal width")

∴ Among the four numerical variables, only sepal width shows outliers
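The points drawn as outliers can also be listed directly. The sketch below relies on boxplot.stats(), which applies the same 1.5 x IQR rule that boxplot() uses by default; the object names are illustrative.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]

# boxplot.stats()$out holds the points lying beyond 1.5 * IQR from the hinges
outliers <- lapply(iris[num_cols], function(x) boxplot.stats(x)$out)
n_out    <- lengths(outliers)                    # number of outliers per column

print(n_out)
print(outliers[n_out > 0])                       # values for columns that have outliers
```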

Further check with “Species” as the grouping variable: boxplots of the numerical variables, filled by Species

ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)

Create a data frame storing the number of outliers for each variable, by species (the counts are entered manually)

# Outlier counts per species, in the order setosa, versicolor, virginica
Sepal.Length <- c(0,0,1)
Sepal.Width <- c(2,0,2)
Petal.Length <- c(3,1,0)
Petal.Width <- c(2,0,0)
name <- c("setosa", "versicolor", "virginica")
out_df <- data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
rownames(out_df) <- name
out_df

##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa                0           2            3           2
## versicolor            0           0            1           0
## virginica             1           2            0           0
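Instead of typing the counts by hand, they can be computed per species. This sketch uses boxplot.stats() with its default 1.5 x IQR rule; because its hinges are computed slightly differently from the quartiles ggplot2 draws, and because overlapping points appear as a single dot in a plot, the computed counts may not match numbers read off the boxplots. The helper count_outliers is a hypothetical name.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Hypothetical helper: number of points flagged as outliers in one numeric vector
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Split the numeric columns by species, then count outliers in every column of each piece
by_species <- split(iris[num_cols], iris$Species)
out_counts <- t(sapply(by_species, function(d) sapply(d, count_outliers)))

print(out_counts)                                # one row per species, one column per variable
```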

If a more complete codebook is required, see https://www.r-bloggers.com/2018/03/generating-codebooks-in-r/

Question c

Load the “dplyr” library

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read the CSV file prepared in the working directory

data <- read.csv("person.csv")

Change the existing column name to something new.

data1 <- rename(data, First_Name=ï..f_name, Last_Name=l_name, Age=age, Salary=salary)

“rename” is used to replace the original column names with new names, in the form rename(original data set, new column name = original column name), where [new column name = original column name] can be repeated as many times as needed.
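The “ï..” prefix on the first column name is what a UTF-8 byte-order mark (BOM) at the start of a CSV file typically looks like after read.csv has turned it into a valid name. Assuming that is the cause here (the file itself is not shown), the mark can be stripped at read time with fileEncoding = "UTF-8-BOM". The sketch below builds a small temporary file to demonstrate this; its contents are illustrative.

```r
# Build a small CSV that starts with a UTF-8 byte-order mark (bytes EF BB BF)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("f_name,l_name,age,salary\nLucy,Heartfilia,27,9000\n")), tmp)

# "UTF-8-BOM" makes read.csv() drop the mark, so the first column keeps its real name
person <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(person))

unlink(tmp)                                      # remove the temporary file
```

With the file read this way, the first column could be referred to as f_name in rename() rather than by the garbled name.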

Pick rows based on their values.

data2 <- filter(data1, Salary<20000)

“filter” is used to pick, from the previous data set, the rows for persons with a salary of less than 20000.

Add new columns to a data frame.

data3 <- mutate(data1, Year_Salary = Salary*12)

“mutate” is used to create a new column “Year_Salary”, computed by multiplying the existing “Salary” column by 12.
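The three verbs can also be chained with the pipe operator. The sketch below uses three rows copied from the output shown in this document, with the raw column names assumed to be f_name, l_name, age and salary. Unlike the steps above, where mutate is applied to data1, this chain applies mutate to the filtered rows.

```r
library(dplyr)

# Three rows taken from this document's output; raw column names are assumed
person <- data.frame(
  f_name = c("Lucy", "Tony", "Eren"),
  l_name = c("Heartfilia", "Stark", "Jaegar"),
  age    = c(27, 54, 23),
  salary = c(9000, 45000, 3500)
)

result <- person %>%
  rename(First_Name = f_name, Last_Name = l_name,
         Age = age, Salary = salary) %>%         # new name = old name
  filter(Salary < 20000) %>%                     # keep rows with salary below 20000
  mutate(Year_Salary = Salary * 12)              # add the yearly salary

print(result)
```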

Combine data across two or more data frames. Read the CSV files (data frames) to be merged with the chosen data set.

data_country <- read.csv("country.csv")
data_occupation <- read.csv("occupation.csv")
str(data_country)

## 'data.frame':    10 obs. of  2 variables:
##  $ ï..Country: chr  "Japan" "USA" "USA" "USA" ...
##  $ Last_Name : chr  "Heartfilia" "Cooper" "Stark" "Allen" ...

str(data_occupation)

## 'data.frame':    10 obs. of  2 variables:
##  $ ï..Occupation: chr  "Novel writer" "Scientist" "Chief executive officer" "Crime scene investigator" ...
##  $ Salary       : int  9000 12000 45000 7500 11000 42500 7500 3500 8000 2500

Two methods for merging the data frames are shown.

i. The first uses “inner_join”, where the common column between the two data frames is identified and declared using “by”

data4 <- inner_join(data3, data_country, by = "Last_Name")
data4

##    First_Name  Last_Name Age Salary Year_Salary ï..Country
## 1        Lucy Heartfilia  27   9000      108000      Japan
## 2     Sheldon     Cooper  24  12000      144000        USA
## 3       Tony       Stark  54  45000      540000        USA
## 4       Barry      Allen  29   7500       90000        USA
## 5       Wanda   Maximoff  32  11000      132000        USA
## 6      Bruce       Wayne  38  42500      510000        USA
## 7      Edward      Elric  22   7500       90000      Japan
## 8        Eren     Jaegar  23   3500       42000      Japan
## 9    Sherlock     Holmes  40   8000       96000         UK
## 10     Edward        Wee  23   2500       30000      M'sia

Rename the newly added column

data4 <- rename(data4, Country = ï..Country)

ii. The second uses “full_join”; when “by” is omitted, the join is made automatically on the columns common to both data frames (here “Salary”)

data5 <- full_join(data4, data_occupation)

## Joining, by = "Salary"

data5

##    First_Name  Last_Name Age Salary Year_Salary Country
## 1        Lucy Heartfilia  27   9000      108000   Japan
## 2     Sheldon     Cooper  24  12000      144000     USA
## 3       Tony       Stark  54  45000      540000     USA
## 4       Barry      Allen  29   7500       90000     USA
## 5       Barry      Allen  29   7500       90000     USA
## 6       Wanda   Maximoff  32  11000      132000     USA
## 7      Bruce       Wayne  38  42500      510000     USA
## 8      Edward      Elric  22   7500       90000   Japan
## 9      Edward      Elric  22   7500       90000   Japan
## 10       Eren     Jaegar  23   3500       42000   Japan
## 11   Sherlock     Holmes  40   8000       96000      UK
## 12     Edward        Wee  23   2500       30000   M'sia
##               ï..Occupation
## 1              Novel writer
## 2                 Scientist
## 3   Chief executive officer
## 4  Crime scene investigator
## 5                Pharmacist
## 6                   Actress
## 7         Director of board
## 8  Crime scene investigator
## 9                Pharmacist
## 10                  Butcher
## 11     Private investigator
## 12     Full stack developer
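The result above has 12 rows although each input has 10, because the salary 7500 occurs twice in both data frames, so every matching row on one side is paired with both rows on the other. The sketch below reproduces the effect on a three-row subset of the values shown above; the column name Occupation is used in place of the garbled one, and the object names are illustrative.

```r
library(dplyr)

people <- data.frame(
  Last_Name = c("Allen", "Elric", "Stark"),
  Salary    = c(7500, 7500, 45000)
)
occupation <- data.frame(
  Occupation = c("Crime scene investigator", "Pharmacist", "Chief executive officer"),
  Salary     = c(7500, 7500, 45000)
)

# Salary is not a unique key: list the values that occur more than once
dup_keys <- unique(occupation$Salary[duplicated(occupation$Salary)])
print(dup_keys)

# Each 7500 row on the left matches both 7500 rows on the right (2 x 2 = 4 rows),
# plus one row for 45000. Newer dplyr versions may warn about such a match,
# hence suppressWarnings().
joined <- suppressWarnings(full_join(people, occupation, by = "Salary"))
print(joined)
print(nrow(joined))
```

Joining on a column that identifies each person uniquely, rather than on Salary, would avoid the duplicated rows.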