The table below shows the prediction results for people taking a screening test for coronavirus disease (COVID-19).
The confusion matrix for the table above is as follows, where True Positive (TP) = 120, False Negative (FN) = 15, False Positive (FP) = 10, and True Negative (TN) = 50.
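As a minimal sketch, the four counts can be arranged into a 2 x 2 matrix in R and used to derive common summary metrics. The layout (actual classes in rows, predicted classes in columns) and the choice of metrics are assumptions made for illustration; the text above states only the four counts.

```r
# Counts taken from the text above
TP <- 120; FN <- 15; FP <- 10; TN <- 50

# Assumed layout: rows = actual class, columns = predicted class
conf_mat <- matrix(c(TP, FN,
                     FP, TN),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual = c("Positive", "Negative"),
                                   Predicted = c("Positive", "Negative")))
conf_mat

accuracy    <- (TP + TN) / sum(conf_mat)  # share of all predictions that are correct
precision   <- TP / (TP + FP)             # share of predicted positives that are truly positive
sensitivity <- TP / (TP + FN)             # share of actual positives detected (recall)
specificity <- TN / (TN + FP)             # share of actual negatives correctly identified

round(c(accuracy = accuracy, precision = precision,
        sensitivity = sensitivity, specificity = specificity), 3)
```

With these counts the accuracy is 170/195, or about 0.872.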
The “iris” data set is used to perform exploratory data analysis (EDA) and to create the codebook.
data("iris")
dim(iris)
## [1] 150 5
∴ sample size = 150, number of attributes = 5
class(iris)
## [1] "data.frame"
typeof(iris)
## [1] "list"
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
∴ Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are numeric, while Species is a factor variable.
colSums(is.na(iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 0 0 0
∴ There are no missing values in the data set.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
col_names <- names(iris)[-5]  # keep the four numeric columns, drop Species
sapply(iris[col_names], min)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4.3 2.0 1.0 0.1
sapply(iris[col_names], max)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 7.9 4.4 6.9 2.5
sapply(iris[col_names], mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
sapply(iris[col_names], median)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.80 3.00 4.35 1.30
sapply(iris[col_names], quantile)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 4.3 2.0 1.00 0.1
## 25% 5.1 2.8 1.60 0.3
## 50% 5.8 3.0 4.35 1.3
## 75% 6.4 3.3 5.10 1.8
## 100% 7.9 4.4 6.90 2.5
sapply(iris[col_names], var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
sapply(iris[col_names], sd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
sapply(iris[col_names], IQR)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1.3 0.5 3.5 1.5
Alternatively, obtain most of these statistics directly:
summary(iris)
Further information can be found at https://www.statmethods.net/stats/descriptives.html
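The separate sapply calls above can also be collected into one table. The following is a sketch only; the helper name `describe` is hypothetical and is not part of base R or of the original notes.

```r
data("iris")

# Select the numeric columns by type rather than by position
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Hypothetical helper returning several descriptive statistics for one vector
describe <- function(x) {
  c(min = min(x), median = median(x), mean = mean(x),
    max = max(x), sd = sd(x), IQR = IQR(x))
}

# One column per variable, one row per statistic
stats_tbl <- sapply(iris[num_cols], describe)
round(stats_tbl, 3)
```

Selecting columns with is.numeric avoids relying on Species being the fifth column.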
A correlation plot can only be produced from continuous variables.
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.5
## corrplot 0.88 loaded
cor_mat <- cor(iris[col_names])  # avoid naming the result "cor", which masks the function
corrplot(cor_mat, method="number")
∴ Sepal width is negatively correlated with all other columns (sepal length, petal length, petal width).
An explanation of the correlation coefficient can be found at http://www.r-tutor.com/elementary-statistics/numerical-measures/correlation-coefficient
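The conclusion about sepal width can also be checked numerically without a plot. This is a sketch using only base R; the variable names are chosen for illustration.

```r
data("iris")

# Pearson correlation matrix of the numeric columns (the default method of cor())
cor_mat <- cor(iris[sapply(iris, is.numeric)])

# Correlations of sepal width with every numeric variable
round(cor_mat["Sepal.Width", ], 2)

# Names of the variables that are negatively correlated with sepal width
neg_with_sw <- names(which(cor_mat["Sepal.Width", ] < 0))
neg_with_sw
```

The printed coefficients also show the strength of each relationship, which the sign alone does not convey.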
Histograms for numerical variables
hist(iris$Sepal.Length, xlab = "Sepal length", main = "Histogram of sepal length")
hist(iris$Sepal.Width, xlab = "Sepal width", main = "Histogram of sepal width")
hist(iris$Petal.Length, xlab = "Petal length", main = "Histogram of petal length")
hist(iris$Petal.Width, xlab = "Petal width", main = "Histogram of petal width")
Histograms apply only to numerical variables, so for the categorical (factor) variable “Species” the code is modified to draw a bar chart instead.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
ggplot(iris) + geom_bar(aes(x=Species)) + labs(title = "Count for flower species")
boxplot(iris$Sepal.Length, ylab = "Sepal length", xlab = "Flower", main = "Boxplot for sepal length")
boxplot(iris$Sepal.Width, ylab = "Sepal width", xlab = "Flower", main = "Boxplot for sepal width")
boxplot(iris$Petal.Length, ylab = "Petal length", xlab = "Flower", main = "Boxplot for petal length")
boxplot(iris$Petal.Width, ylab = "Petal width", xlab = "Flower", main = "Boxplot for petal width")
∴ Outliers exist only for sepal width.
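The points that boxplot() draws as outliers can also be listed directly. The sketch below uses boxplot.stats() from base R (package grDevices), which applies the same 1.5 x IQR rule based on hinges that boxplot() uses by default.

```r
data("iris")
num_cols <- names(iris)[sapply(iris, is.numeric)]

# For each numeric column, collect the values lying beyond the whiskers
outliers <- lapply(iris[num_cols], function(x) boxplot.stats(x)$out)

# Number of outliers per variable
lengths(outliers)

# The outlying sepal-width values themselves
outliers$Sepal.Width
```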
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)
ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)
ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) + geom_boxplot(outlier.size = 2)
ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) + geom_boxplot(outlier.size = 2)
Create a data frame storing the number of outliers for each variable within each species, as read from the box plots above
Sepal.Length <- c(0,0,1)
Sepal.Width <- c(2,0,2)
Petal.Length <- c(3,1,0)
Petal.Width <- c(2,0,0)
name <- c("setosa", "versicolor", "virginica")
out_df <- data.frame(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
rownames(out_df) <- name
out_df
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 0 2 3 2
## versicolor 0 0 1 0
## virginica 1 2 0 0
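Rather than typing the counts by hand, they can be computed. The sketch below is one possible approach, not the method used above; `count_outliers` is a hypothetical helper. Note that boxplot.stats() bases its fences on hinges, whereas geom_boxplot() uses quantiles, and overlapping points are hard to count by eye, so the computed counts may not match the hand-entered table exactly.

```r
data("iris")

# Hypothetical helper: number of values beyond the 1.5 * IQR whiskers
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Apply the helper to every numeric column within each species
out_counts <- aggregate(. ~ Species, data = iris, FUN = count_outliers)
out_counts
```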
If you require a more formal codebook, see https://www.r-bloggers.com/2018/03/generating-codebooks-in-r/
Load the “dplyr” library
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Read the CSV file prepared in the working directory
data <- read.csv("person.csv")
data1 <- rename(data, First_Name=ï..f_name, Last_Name=l_name, Age=age, Salary=salary)
“rename” replaces the original column names with new ones. It takes the form rename(original data set, new column name = original column name), where the [new column name = original column name] pair can be repeated as many times as needed.
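In dplyr's rename(), the new name goes on the left of the equals sign and the existing name on the right. The sketch below shows this on a small inline data frame; `people` is a hypothetical stand-in for the contents of person.csv, which is not shown in these notes.

```r
library(dplyr)

# Hypothetical stand-in for person.csv
people <- data.frame(f_name = c("Lucy", "Tony"),
                     l_name = c("Heartfilia", "Stark"),
                     age    = c(27, 54),
                     salary = c(9000, 45000))

# New name on the left, existing name on the right
renamed <- rename(people, First_Name = f_name, Last_Name = l_name,
                  Age = age, Salary = salary)
names(renamed)
```

Columns that are not mentioned in the call keep their original names.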
data2 <- filter(data1, Salary<20000)
“filter” keeps only the rows of the previous data set in which the person's salary is less than 20000.
data3 <- mutate(data1, Year_Salary = Salary*12)
“mutate” creates a new column, “Year_Salary”, computed by multiplying the existing “Salary” column by 12.
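The two verbs can also be chained with the pipe operator. The sketch below uses a small hypothetical data frame and, unlike the steps above (which apply mutate to the unfiltered data1), filters first and then adds the yearly salary, purely for illustration.

```r
library(dplyr)

# Hypothetical rows in the same shape as data1
people <- data.frame(Last_Name = c("Heartfilia", "Stark", "Wee"),
                     Salary    = c(9000, 45000, 2500))

low_paid <- people %>%
  filter(Salary < 20000) %>%          # keep rows with salary below 20000
  mutate(Year_Salary = Salary * 12)   # add the yearly salary as a new column
low_paid
```

Neither verb modifies `people`; each returns a new data frame, which is why the results above are assigned to data2 and data3.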
data_country <- read.csv("country.csv")
data_occupation <- read.csv("occupation.csv")
str(data_country)
## 'data.frame': 10 obs. of 2 variables:
## $ ï..Country: chr "Japan" "USA" "USA" "USA" ...
## $ Last_Name : chr "Heartfilia" "Cooper" "Stark" "Allen" ...
str(data_occupation)
## 'data.frame': 10 obs. of 2 variables:
## $ ï..Occupation: chr "Novel writer" "Scientist" "Chief executive officer" "Crime scene investigator" ...
## $ Salary : int 9000 12000 45000 7500 11000 42500 7500 3500 8000 2500
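The “ï..” prefix on the first column name is typical of a CSV file saved with a UTF-8 byte-order mark (BOM). That this is the cause here is an assumption, since the files themselves are not shown. If so, the prefix can be avoided at read time with the "UTF-8-BOM" encoding that R provides, as in this self-contained sketch, which writes a temporary file with a BOM and reads it back:

```r
# Write a small CSV that starts with the UTF-8 byte-order mark (EF BB BF)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("Country,Last_Name\nJapan,Heartfilia\nUSA,Cooper\n"), con)
close(con)

# "UTF-8-BOM" tells R to drop the byte-order mark before parsing the header
fixed <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
col_names_read <- names(fixed)
col_names_read

unlink(tmp)  # remove the temporary file
```

Reading the files this way would make the later rename steps for “ï..Country” and “ï..Occupation” unnecessary.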
Two methods of merging the data frames are shown.
i. The first uses “inner_join”, where the common column between the two data frames must be identified and declared using “by”.
data4 <- inner_join(data3, data_country, by = "Last_Name")
data4
## First_Name Last_Name Age Salary Year_Salary ï..Country
## 1 Lucy Heartfilia 27 9000 108000 Japan
## 2 Sheldon Cooper 24 12000 144000 USA
## 3 Tony Stark 54 45000 540000 USA
## 4 Barry Allen 29 7500 90000 USA
## 5 Wanda Maximoff 32 11000 132000 USA
## 6 Bruce Wayne 38 42500 510000 USA
## 7 Edward Elric 22 7500 90000 Japan
## 8 Eren Jaegar 23 3500 42000 Japan
## 9 Sherlock Holmes 40 8000 96000 UK
## 10 Edward Wee 23 2500 30000 M'sia
Rename the newly added column
data4 <- rename(data4, Country = ï..Country)
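ii. The second uses “full_join” without “by”, in which case the data frames are merged automatically on the columns they have in common (here “Salary”). Because the salary 7500 appears twice in each data frame, those rows are matched in every combination, so the result has 12 rows instead of 10.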
data5 <- full_join(data4, data_occupation)
## Joining, by = "Salary"
data5
## First_Name Last_Name Age Salary Year_Salary Country
## 1 Lucy Heartfilia 27 9000 108000 Japan
## 2 Sheldon Cooper 24 12000 144000 USA
## 3 Tony Stark 54 45000 540000 USA
## 4 Barry Allen 29 7500 90000 USA
## 5 Barry Allen 29 7500 90000 USA
## 6 Wanda Maximoff 32 11000 132000 USA
## 7 Bruce Wayne 38 42500 510000 USA
## 8 Edward Elric 22 7500 90000 Japan
## 9 Edward Elric 22 7500 90000 Japan
## 10 Eren Jaegar 23 3500 42000 Japan
## 11 Sherlock Holmes 40 8000 96000 UK
## 12 Edward Wee 23 2500 30000 M'sia
## ï..Occupation
## 1 Novel writer
## 2 Scientist
## 3 Chief executive officer
## 4 Crime scene investigator
## 5 Pharmacist
## 6 Actress
## 7 Director of board
## 8 Crime scene investigator
## 9 Pharmacist
## 10 Butcher
## 11 Private investigator
## 12 Full stack developer
Rename the newly added column
data5 <- rename(data5, Occupation = ï..Occupation)
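The full join above returns 12 rows although each input has 10, because the join key “Salary” is not unique: every row with a given salary on one side is paired with every row with that salary on the other. The sketch below reproduces the effect on two small hypothetical data frames. Recent dplyr versions may print a warning about a many-to-many relationship here.

```r
library(dplyr)

# Hypothetical inputs: the salary 7500 occurs twice on each side
left  <- data.frame(Last_Name = c("Allen", "Elric", "Stark"),
                    Salary    = c(7500, 7500, 45000))
right <- data.frame(Occupation = c("Crime scene investigator", "Pharmacist",
                                   "Chief executive officer"),
                    Salary     = c(7500, 7500, 45000))

# 2 x 2 combinations for 7500 plus 1 match for 45000
joined <- full_join(left, right, by = "Salary")
joined
nrow(joined)
```

Joining on a column that uniquely identifies each person, where one is available, avoids pairing people with occupations that merely share a salary.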
Print each data frame to compare the results of every step
data
## ï..f_name l_name age salary
## 1 Lucy Heartfilia 27 9000
## 2 Sheldon Cooper 24 12000
## 3 Tony Stark 54 45000
## 4 Barry Allen 29 7500
## 5 Wanda Maximoff 32 11000
## 6 Bruce Wayne 38 42500
## 7 Edward Elric 22 7500
## 8 Eren Jaegar 23 3500
## 9 Sherlock Holmes 40 8000
## 10 Edward Wee 23 2500
data1
## First_Name Last_Name Age Salary
## 1 Lucy Heartfilia 27 9000
## 2 Sheldon Cooper 24 12000
## 3 Tony Stark 54 45000
## 4 Barry Allen 29 7500
## 5 Wanda Maximoff 32 11000
## 6 Bruce Wayne 38 42500
## 7 Edward Elric 22 7500
## 8 Eren Jaegar 23 3500
## 9 Sherlock Holmes 40 8000
## 10 Edward Wee 23 2500
data2
## First_Name Last_Name Age Salary
## 1 Lucy Heartfilia 27 9000
## 2 Sheldon Cooper 24 12000
## 3 Barry Allen 29 7500
## 4 Wanda Maximoff 32 11000
## 5 Edward Elric 22 7500
## 6 Eren Jaegar 23 3500
## 7 Sherlock Holmes 40 8000
## 8 Edward Wee 23 2500
data3
## First_Name Last_Name Age Salary Year_Salary
## 1 Lucy Heartfilia 27 9000 108000
## 2 Sheldon Cooper 24 12000 144000
## 3 Tony Stark 54 45000 540000
## 4 Barry Allen 29 7500 90000
## 5 Wanda Maximoff 32 11000 132000
## 6 Bruce Wayne 38 42500 510000
## 7 Edward Elric 22 7500 90000
## 8 Eren Jaegar 23 3500 42000
## 9 Sherlock Holmes 40 8000 96000
## 10 Edward Wee 23 2500 30000
data_country
## ï..Country Last_Name
## 1 Japan Heartfilia
## 2 USA Cooper
## 3 USA Stark
## 4 USA Allen
## 5 USA Maximoff
## 6 USA Wayne
## 7 Japan Elric
## 8 Japan Jaegar
## 9 UK Holmes
## 10 M'sia Wee
data_occupation
## ï..Occupation Salary
## 1 Novel writer 9000
## 2 Scientist 12000
## 3 Chief executive officer 45000
## 4 Crime scene investigator 7500
## 5 Actress 11000
## 6 Director of board 42500
## 7 Pharmacist 7500
## 8 Butcher 3500
## 9 Private investigator 8000
## 10 Full stack developer 2500
data4
## First_Name Last_Name Age Salary Year_Salary Country
## 1 Lucy Heartfilia 27 9000 108000 Japan
## 2 Sheldon Cooper 24 12000 144000 USA
## 3 Tony Stark 54 45000 540000 USA
## 4 Barry Allen 29 7500 90000 USA
## 5 Wanda Maximoff 32 11000 132000 USA
## 6 Bruce Wayne 38 42500 510000 USA
## 7 Edward Elric 22 7500 90000 Japan
## 8 Eren Jaegar 23 3500 42000 Japan
## 9 Sherlock Holmes 40 8000 96000 UK
## 10 Edward Wee 23 2500 30000 M'sia
data5
## First_Name Last_Name Age Salary Year_Salary Country
## 1 Lucy Heartfilia 27 9000 108000 Japan
## 2 Sheldon Cooper 24 12000 144000 USA
## 3 Tony Stark 54 45000 540000 USA
## 4 Barry Allen 29 7500 90000 USA
## 5 Barry Allen 29 7500 90000 USA
## 6 Wanda Maximoff 32 11000 132000 USA
## 7 Bruce Wayne 38 42500 510000 USA
## 8 Edward Elric 22 7500 90000 Japan
## 9 Edward Elric 22 7500 90000 Japan
## 10 Eren Jaegar 23 3500 42000 Japan
## 11 Sherlock Holmes 40 8000 96000 UK
## 12 Edward Wee 23 2500 30000 M'sia
## Occupation
## 1 Novel writer
## 2 Scientist
## 3 Chief executive officer
## 4 Crime scene investigator
## 5 Pharmacist
## 6 Actress
## 7 Director of board
## 8 Crime scene investigator
## 9 Pharmacist
## 10 Butcher
## 11 Private investigator
## 12 Full stack developer