Question a

Below is a table showing the prediction results for people who took a screening test for coronavirus disease (COVID-19). [Photo: table of prediction results]

The confusion matrix for the table above is shown below, where True Positive (TP) = 120, False Negative (FN) = 15, False Positive (FP) = 10 and True Negative (TN) = 50. [Photo: confusion matrix]
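As a minimal sketch, the four counts above can be arranged into a matrix and used to derive the usual evaluation metrics. The answer itself only states the counts; the metric names and formulas below are the standard definitions and are added here for illustration, and the variable names are hypothetical.

```r
# Counts taken from the confusion matrix above
tp <- 120  # true positives
fn <- 15   # false negatives
fp <- 10   # false positives
tn <- 50   # true negatives

# Rows are the actual class, columns are the predicted class
confusion <- matrix(
  c(tp, fn,
    fp, tn),
  nrow = 2, byrow = TRUE,
  dimnames = list(Actual    = c("Positive", "Negative"),
                  Predicted = c("Positive", "Negative"))
)
print(confusion)

total       <- sum(confusion)      # number of people screened
accuracy    <- (tp + tn) / total   # share of correct predictions
precision   <- tp / (tp + fp)      # correct among predicted positives
recall      <- tp / (tp + fn)      # sensitivity: correct among actual positives
specificity <- tn / (tn + fp)      # correct among actual negatives

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 4)
```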

Question b

The “iris” data set is chosen to perform exploratory data analysis (EDA) and to create the codebook.

data("iris")

Overview of the dataset

  1. Dimension (number of observations and attributes)
dim(iris)
## [1] 150   5

∴ sample size = 150, number of attributes = 5

  1. Class and type of object (data set)
class(iris)
## [1] "data.frame"
typeof(iris)
## [1] "list"
  1. Name of attributes
names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
  1. Data type of each attribute
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

∴ Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric, while Species is a factor variable.

  1. Detect missing values in each column (attribute)
colSums(is.na(iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

∴ There are no missing values in the data set.

  1. View the first few rows of the data set
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
  1. Summarise the numerical variables using univariate analysis
  1. Declare the names of the numerical columns by dropping the “Species” column (index 5) from the iris data set
col_names <- names(iris)[-5]
  1. Run the univariate analysis: minimum, maximum, mean, median, quartiles, variance, standard deviation and interquartile range
sapply(iris[col_names], min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          4.3          2.0          1.0          0.1
sapply(iris[col_names], max)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          7.9          4.4          6.9          2.5
sapply(iris[col_names], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##     5.843333     3.057333     3.758000     1.199333
sapply(iris[col_names], median)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         5.80         3.00         4.35         1.30
sapply(iris[col_names], quantile)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            4.3         2.0         1.00         0.1
## 25%           5.1         2.8         1.60         0.3
## 50%           5.8         3.0         4.35         1.3
## 75%           6.4         3.3         5.10         1.8
## 100%          7.9         4.4         6.90         2.5
sapply(iris[col_names], var)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.6856935    0.1899794    3.1162779    0.5810063
sapply(iris[col_names], sd)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8280661    0.4358663    1.7652982    0.7622377
sapply(iris[col_names], IQR)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##          1.3          0.5          3.5          1.5

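Alternatively, summary() reports the minimum, quartiles, median, mean and maximum of every numerical column in a single call (it does not report the variance, standard deviation or interquartile range):

summary(iris)

Further information can be obtained here: https://www.statmethods.net/stats/descriptives.html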
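The separate sapply() calls above can also be gathered into one table. The following is only a sketch of that idea; the helper name `describe` and the row labels are hypothetical and not part of the original answer.

```r
data("iris")

# Hypothetical helper: all univariate statistics for one numeric vector
describe <- function(x) {
  c(min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    var    = var(x),
    sd     = sd(x),
    iqr    = IQR(x))
}

# Select the numeric columns by type instead of by position
num_cols  <- names(iris)[sapply(iris, is.numeric)]
stats_tbl <- sapply(iris[num_cols], describe)

# One column per variable, one row per statistic
round(stats_tbl, 3)
```

Selecting columns with is.numeric() avoids hard-coding the index of “Species”, so the same code keeps working if the column order changes.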

Correlation plot for variables

A correlation plot can only be drawn for continuous (numerical) variables.

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.5
## corrplot 0.88 loaded
# Store the matrix under its own name so it does not mask the cor() function
cor_mat <- cor(iris[col_names])
corrplot(cor_mat, method = "number")

∴ Sepal width has an inverse (negative) relationship with every other numerical column (sepal length, petal length, petal width).

An explanation of the correlation coefficient can be obtained here: http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
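The conclusion about sepal width can be checked numerically without the plot. This is a small sketch using only base R; the object names are hypothetical.

```r
data("iris")

# Correlation matrix of the four numeric columns
cor_mat <- cor(iris[, sapply(iris, is.numeric)])

# Correlations of Sepal.Width with each of the other three variables
sepal_width_cor <- cor_mat["Sepal.Width", colnames(cor_mat) != "Sepal.Width"]
round(sepal_width_cor, 2)

# TRUE if every one of those correlations is negative
all(sepal_width_cor < 0)
```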

Visualisation of variables

Histograms for numerical variables

hist(iris$Sepal.Length, xlab = "Sepal length", main = "Histogram of sepal length")

hist(iris$Sepal.Width, xlab = "Sepal width", main = "Histogram of sepal width")

hist(iris$Petal.Length, xlab = "Petal length", main = "Histogram of petal length")

hist(iris$Petal.Width, xlab = "Petal width", main = "Histogram of petal width")

A histogram applies only to numerical variables, so for the categorical (factor) variable “Species” a bar chart of counts is used instead.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(iris) + geom_bar(aes(x=Species)) + labs(title = "Count for flower species")

Check for outliers

Boxplots for numerical variables
boxplot(iris$Sepal.Length, ylab = "Sepal length", xlab = "Flower", main = "Boxplot for sepal length")

boxplot(iris$Sepal.Width, ylab = "Sepal width", xlab = "Flower", main = "Boxplot for sepal width")

boxplot(iris$Petal.Length, ylab = "Petal length", xlab = "Flower", main = "Boxplot for petal length")

boxplot(iris$Petal.Width, ylab = "Petal width", xlab = "Flower", main = "Boxplot for petal width")

∴ Outliers exist only for sepal width.
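The same check can be made without reading the plots. The sketch below assumes the default boxplot rule (points more than 1.5 times the interquartile range beyond the hinges) and uses boxplot.stats() from base R; the object names are hypothetical.

```r
data("iris")

num_cols <- names(iris)[sapply(iris, is.numeric)]

# boxplot.stats()$out holds the values that boxplot() would draw as points
outliers <- lapply(iris[num_cols], function(x) boxplot.stats(x)$out)

# Number of outlying observations per variable
outlier_counts <- lengths(outliers)
outlier_counts

# The outlying sepal width values themselves
outliers$Sepal.Width
```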

Further check by group: boxplots of the numerical variables with “Species” on the x-axis and as the fill colour
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)

ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)

Create a data frame storing the number of outliers, as read from the boxplots above, for each variable and species

Sepal.Length <- c(0,0,1)
Sepal.Width <- c(2,0,2)
Petal.Length <- c(3,1,0)
Petal.Width <- c(2,0,0)
name <- c("setosa", "versicolor", "virginica")
out_df <- data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
rownames(out_df) <- name
out_df
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa                0           2            3           2
## versicolor            0           0            1           0
## virginica             1           2            0           0
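The counts above were entered by hand from the plots. As a hedged alternative, they could be computed with boxplot.stats() per species, as sketched below. Two caveats: this counts every outlying observation, so tied values that overlap as one point on a plot are counted separately, and boxplot.stats() uses hinges while geom_boxplot() uses quantiles, so the totals may differ slightly from a visual count. The names `count_outliers` and `out_counts` are hypothetical.

```r
data("iris")

num_cols <- names(iris)[sapply(iris, is.numeric)]

# Hypothetical helper: number of values beyond 1.5 * IQR from the hinges
count_outliers <- function(x) length(boxplot.stats(x)$out)

# For each numeric column, apply the helper within each species
out_counts <- sapply(num_cols, function(col) {
  tapply(iris[[col]], iris$Species, count_outliers)
})

# Rows are species, columns are variables
out_counts <- as.data.frame(out_counts)
out_counts
```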

If you require a more formal codebook, see https://www.r-bloggers.com/2018/03/generating-codebooks-in-r/

Question c

Load the “dplyr” library

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read the CSV file prepared in the working directory

data <- read.csv("person.csv")
  1. Change the existing column name to something new.
data1 <- rename(data, First_Name=ï..f_name, Last_Name=l_name, Age=age, Salary=salary)

“rename” replaces the original column names with new ones, in the form rename(data set, new column name = original column name), where the [new column name = original column name] pair can be repeated as many times as needed.
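The `ï..` prefix on the first column name is what a UTF-8 byte-order mark (BOM) at the start of a CSV file typically turns into when the file is read without declaring its encoding. As a sketch, assuming the files here were saved as UTF-8 with a BOM, passing fileEncoding = "UTF-8-BOM" to read.csv() drops the mark so the first column keeps its plain name. The temporary file below is a made-up stand-in for “person.csv”.

```r
# Write a small CSV that starts with the three BOM bytes (EF BB BF)
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("f_name,l_name\nLucy,Heartfilia\n")), tmp)

# Declaring the encoding strips the BOM before the header is parsed
person <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(person)

unlink(tmp)  # remove the temporary file
```

With the column read as `f_name`, the rename step would refer to `f_name` rather than `ï..f_name`.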

  1. Pick rows based on their values.
data2 <- filter(data1, Salary<20000)

“filter” picks the rows of the previous data set for people whose salary is less than 20000.

  1. Add new columns to a data frame.
data3 <- mutate(data1, Year_Salary = Salary*12)

“mutate” creates a new column “Year_Salary”, computed by multiplying the existing “Salary” column by 12.
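The three verbs can also be chained with the pipe operator. The sketch below rebuilds the person data inline from the values printed later in this answer, using plain column names for the illustration. Note one difference from the steps above: here mutate() is applied to the filtered rows, whereas data3 above is built from the unfiltered data1.

```r
library(dplyr)

# Person data re-entered from the printed output of this answer
person <- data.frame(
  f_name = c("Lucy", "Sheldon", "Tony", "Barry", "Wanda",
             "Bruce", "Edward", "Eren", "Sherlock", "Edward"),
  l_name = c("Heartfilia", "Cooper", "Stark", "Allen", "Maximoff",
             "Wayne", "Elric", "Jaegar", "Holmes", "Wee"),
  age    = c(27, 24, 54, 29, 32, 38, 22, 23, 40, 23),
  salary = c(9000, 12000, 45000, 7500, 11000, 42500, 7500, 3500, 8000, 2500)
)

low_paid <- person %>%
  rename(First_Name = f_name, Last_Name = l_name,
         Age = age, Salary = salary) %>%   # new name = old name
  filter(Salary < 20000) %>%               # keep salaries below 20000
  mutate(Year_Salary = Salary * 12)        # add the yearly salary

low_paid
```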

  1. Combine data across two or more data frames. Read the CSV files (data frames) to be merged with the chosen data set.
data_country <- read.csv("country.csv")
data_occupation <- read.csv("occupation.csv")
str(data_country)
## 'data.frame':    10 obs. of  2 variables:
##  $ ï..Country: chr  "Japan" "USA" "USA" "USA" ...
##  $ Last_Name : chr  "Heartfilia" "Cooper" "Stark" "Allen" ...
str(data_occupation)
## 'data.frame':    10 obs. of  2 variables:
##  $ ï..Occupation: chr  "Novel writer" "Scientist" "Chief executive officer" "Crime scene investigator" ...
##  $ Salary       : int  9000 12000 45000 7500 11000 42500 7500 3500 8000 2500

Two methods of merging the data frames are shown.

i. The first method uses “inner_join”, where the common column between the two data frames is declared with “by”.

data4 <- inner_join(data3, data_country, by = "Last_Name")
data4
##    First_Name  Last_Name Age Salary Year_Salary ï..Country
## 1        Lucy Heartfilia  27   9000      108000      Japan
## 2     Sheldon     Cooper  24  12000      144000        USA
## 3       Tony       Stark  54  45000      540000        USA
## 4       Barry      Allen  29   7500       90000        USA
## 5       Wanda   Maximoff  32  11000      132000        USA
## 6      Bruce       Wayne  38  42500      510000        USA
## 7      Edward      Elric  22   7500       90000      Japan
## 8        Eren     Jaegar  23   3500       42000      Japan
## 9    Sherlock     Holmes  40   8000       96000         UK
## 10     Edward        Wee  23   2500       30000      M'sia

Rename the newly added column

data4 <- rename(data4, Country = ï..Country)

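ii. The second method uses “full_join” without “by”, in which case the data frames are joined automatically on every column name they share (here, “Salary”). Because two people earn 7500 and two occupations also list 7500, each of those people is matched to both occupations, so the result has 12 rows instead of 10.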

data5 <- full_join(data4, data_occupation)
## Joining, by = "Salary"
data5
##    First_Name  Last_Name Age Salary Year_Salary Country
## 1        Lucy Heartfilia  27   9000      108000   Japan
## 2     Sheldon     Cooper  24  12000      144000     USA
## 3       Tony       Stark  54  45000      540000     USA
## 4       Barry      Allen  29   7500       90000     USA
## 5       Barry      Allen  29   7500       90000     USA
## 6       Wanda   Maximoff  32  11000      132000     USA
## 7      Bruce       Wayne  38  42500      510000     USA
## 8      Edward      Elric  22   7500       90000   Japan
## 9      Edward      Elric  22   7500       90000   Japan
## 10       Eren     Jaegar  23   3500       42000   Japan
## 11   Sherlock     Holmes  40   8000       96000      UK
## 12     Edward        Wee  23   2500       30000   M'sia
##               ï..Occupation
## 1              Novel writer
## 2                 Scientist
## 3   Chief executive officer
## 4  Crime scene investigator
## 5                Pharmacist
## 6                   Actress
## 7         Director of board
## 8  Crime scene investigator
## 9                Pharmacist
## 10                  Butcher
## 11     Private investigator
## 12     Full stack developer

Rename the newly added column

data5 <- rename(data5, Occupation = ï..Occupation)
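The repeated rows for Allen and Elric come from joining on a column whose values are not unique. The sketch below reproduces the effect on a three-row subset of the data, with hypothetical object names, and shows one way to check a join key for duplicates beforehand. Recent dplyr versions may also print a warning about a many-to-many relationship for this kind of join.

```r
library(dplyr)

# Subset of the people and occupation data, re-entered from the output above
people <- data.frame(
  Last_Name = c("Allen", "Elric", "Holmes"),
  Salary    = c(7500, 7500, 8000)
)
occupation <- data.frame(
  Occupation = c("Crime scene investigator", "Pharmacist", "Private investigator"),
  Salary     = c(7500, 7500, 8000)
)

# A non-zero result means the key column contains repeated values
anyDuplicated(occupation$Salary)

# Each person earning 7500 is paired with both occupations paying 7500
joined <- full_join(people, occupation, by = "Salary")
joined
nrow(joined)
```

A key that identifies each person uniquely, such as the last name in this data, would avoid the duplicated rows, but the occupation file would need to contain that column.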

Compare the data frames produced at each step

data
##    ï..f_name     l_name age salary
## 1       Lucy Heartfilia  27   9000
## 2    Sheldon     Cooper  24  12000
## 3      Tony       Stark  54  45000
## 4      Barry      Allen  29   7500
## 5      Wanda   Maximoff  32  11000
## 6     Bruce       Wayne  38  42500
## 7     Edward      Elric  22   7500
## 8       Eren     Jaegar  23   3500
## 9   Sherlock     Holmes  40   8000
## 10    Edward        Wee  23   2500
data1
##    First_Name  Last_Name Age Salary
## 1        Lucy Heartfilia  27   9000
## 2     Sheldon     Cooper  24  12000
## 3       Tony       Stark  54  45000
## 4       Barry      Allen  29   7500
## 5       Wanda   Maximoff  32  11000
## 6      Bruce       Wayne  38  42500
## 7      Edward      Elric  22   7500
## 8        Eren     Jaegar  23   3500
## 9    Sherlock     Holmes  40   8000
## 10     Edward        Wee  23   2500
data2
##   First_Name  Last_Name Age Salary
## 1       Lucy Heartfilia  27   9000
## 2    Sheldon     Cooper  24  12000
## 3      Barry      Allen  29   7500
## 4      Wanda   Maximoff  32  11000
## 5     Edward      Elric  22   7500
## 6       Eren     Jaegar  23   3500
## 7   Sherlock     Holmes  40   8000
## 8     Edward        Wee  23   2500
data3
##    First_Name  Last_Name Age Salary Year_Salary
## 1        Lucy Heartfilia  27   9000      108000
## 2     Sheldon     Cooper  24  12000      144000
## 3       Tony       Stark  54  45000      540000
## 4       Barry      Allen  29   7500       90000
## 5       Wanda   Maximoff  32  11000      132000
## 6      Bruce       Wayne  38  42500      510000
## 7      Edward      Elric  22   7500       90000
## 8        Eren     Jaegar  23   3500       42000
## 9    Sherlock     Holmes  40   8000       96000
## 10     Edward        Wee  23   2500       30000
data_country
##    ï..Country  Last_Name
## 1       Japan Heartfilia
## 2         USA     Cooper
## 3         USA      Stark
## 4         USA      Allen
## 5         USA   Maximoff
## 6         USA      Wayne
## 7       Japan      Elric
## 8       Japan     Jaegar
## 9          UK     Holmes
## 10      M'sia        Wee
data_occupation
##               ï..Occupation Salary
## 1              Novel writer   9000
## 2                 Scientist  12000
## 3   Chief executive officer  45000
## 4  Crime scene investigator   7500
## 5                   Actress  11000
## 6         Director of board  42500
## 7                Pharmacist   7500
## 8                   Butcher   3500
## 9      Private investigator   8000
## 10     Full stack developer   2500
data4
##    First_Name  Last_Name Age Salary Year_Salary Country
## 1        Lucy Heartfilia  27   9000      108000   Japan
## 2     Sheldon     Cooper  24  12000      144000     USA
## 3       Tony       Stark  54  45000      540000     USA
## 4       Barry      Allen  29   7500       90000     USA
## 5       Wanda   Maximoff  32  11000      132000     USA
## 6      Bruce       Wayne  38  42500      510000     USA
## 7      Edward      Elric  22   7500       90000   Japan
## 8        Eren     Jaegar  23   3500       42000   Japan
## 9    Sherlock     Holmes  40   8000       96000      UK
## 10     Edward        Wee  23   2500       30000   M'sia
data5
##    First_Name  Last_Name Age Salary Year_Salary Country
## 1        Lucy Heartfilia  27   9000      108000   Japan
## 2     Sheldon     Cooper  24  12000      144000     USA
## 3       Tony       Stark  54  45000      540000     USA
## 4       Barry      Allen  29   7500       90000     USA
## 5       Barry      Allen  29   7500       90000     USA
## 6       Wanda   Maximoff  32  11000      132000     USA
## 7      Bruce       Wayne  38  42500      510000     USA
## 8      Edward      Elric  22   7500       90000   Japan
## 9      Edward      Elric  22   7500       90000   Japan
## 10       Eren     Jaegar  23   3500       42000   Japan
## 11   Sherlock     Holmes  40   8000       96000      UK
## 12     Edward        Wee  23   2500       30000   M'sia
##                  Occupation
## 1              Novel writer
## 2                 Scientist
## 3   Chief executive officer
## 4  Crime scene investigator
## 5                Pharmacist
## 6                   Actress
## 7         Director of board
## 8  Crime scene investigator
## 9                Pharmacist
## 10                  Butcher
## 11     Private investigator
## 12     Full stack developer