2.a) Begin your markdown by interpreting the confusion matrix. No R code is required to answer this question.
The model correctly predicted the positive class of the screening test 120 times (true positives) and incorrectly predicted it 15 times (false positives). It correctly predicted the negative class 50 times (true negatives) and incorrectly predicted it 10 times (false negatives).
The following can be computed from this confusion matrix:
i) The model made 170 correct predictions (120 + 50).
ii) The model made 25 incorrect predictions (15 + 10).
iii) There are 195 total scored cases (120 + 15 + 10 + 50).
iv) The error rate is 25/195 = 0.1282.
v) The overall accuracy rate is 170/195 = 0.8718.
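Although no R code is required for this question, the arithmetic above can be checked with a minimal sketch. The names tp, fp, tn, and fn are my own labels, assuming the 15 and 10 incorrect predictions are false positives and false negatives respectively; they do not come from the question.

```r
# Counts read off the confusion matrix (variable names are illustrative)
tp <- 120  # positive class predicted correctly
fp <- 15   # positive class predicted incorrectly
tn <- 50   # negative class predicted correctly
fn <- 10   # negative class predicted incorrectly

correct   <- tp + tn            # correct predictions
incorrect <- fp + fn            # incorrect predictions
total     <- tp + fp + tn + fn  # all scored cases

error_rate <- incorrect / total
accuracy   <- correct / total

round(c(error_rate = error_rate, accuracy = accuracy), 4)
```

The error rate and the accuracy share the same denominator and therefore sum to 1.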
2.b) Find and get a data set from the data sets available within R. Perform exploratory data analysis (EDA) and prepare a codebook on that data set. Explain every answer given.
library(memisc)
# To list the datasets available in R, type: data()
inspray <- datasets::InsectSprays
typeof(inspray)
## [1] "list"
head(inspray, n=20)
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
## 19 17 B
## 20 17 B
#To find the mean of the insect count
mean(inspray$count)
## [1] 9.5
#Summary of insect count
summary(inspray$count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 3.00 7.00 9.50 14.25 26.00
#The range of insect count
range(inspray$count)
## [1] 0 26
# Quartiles of the insect count
quantile(inspray$count)
## 0% 25% 50% 75% 100%
## 0.00 3.00 7.00 14.25 26.00
# Interquartile range of the insect count
IQR(inspray$count)
## [1] 11.25
# Standard deviation of the insect count
sd(inspray$count)
## [1] 7.203286
#summary of InsectSprays dataset
summary(inspray)
## count spray
## Min. : 0.00 A:12
## 1st Qu.: 3.00 B:12
## Median : 7.00 C:12
## Mean : 9.50 D:12
## 3rd Qu.:14.25 E:12
## Max. :26.00 F:12
#The histogram of Insect Sprays count
hist(inspray$count)
#The boxplot of Insect Sprays data
boxplot(count ~ spray, data = inspray,
xlab = "Type of spray", ylab = "Insect count",
main = "InsectSprays data", varwidth = TRUE, col = "lightgreen")
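The boxplot compares the sprays visually; the same comparison can also be made numerically. A minimal sketch, assuming the built-in InsectSprays data and using base R's tapply() (this per-spray summary is an addition, not part of the original analysis):

```r
# Per-spray mean and median of the insect count
inspray <- datasets::InsectSprays

group_means   <- tapply(inspray$count, inspray$spray, mean)
group_medians <- tapply(inspray$count, inspray$spray, median)

# One row per spray, for side-by-side comparison with the boxplot
data.frame(spray  = names(group_means),
           mean   = round(as.numeric(group_means), 2),
           median = as.numeric(group_medians))
```

Because every spray has 12 observations, the average of the six group means equals the overall mean of 9.5 reported earlier.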
# Call the codebook function from memisc
codebook(inspray)
## ================================================================================
##
## count
##
## --------------------------------------------------------------------------------
##
## Storage mode: double
##
## Min: 0.000
## Max: 26.000
## Mean: 9.500
## Std.Dev.: 7.153
## Skewness: 0.571
## Kurtosis: -0.774
##
## ================================================================================
##
## spray
##
## --------------------------------------------------------------------------------
##
## Storage mode: integer
## Factor with 6 levels
##
## Levels and labels N Valid
##
## 1 'A' 12 16.7
## 2 'B' 12 16.7
## 3 'C' 12 16.7
## 4 'D' 12 16.7
## 5 'E' 12 16.7
## 6 'F' 12 16.7
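Note that sd() reports 7.203286 for the insect count, while the codebook reports a Std.Dev. of 7.153. The gap is consistent with sd() dividing by n - 1 and the codebook figure dividing by n. That the codebook uses the n divisor is my inference from the numbers, not something stated in the output. A short sketch to check this:

```r
x <- datasets::InsectSprays$count
n <- length(x)

sample_sd     <- sd(x)                           # divides by n - 1
population_sd <- sqrt(sum((x - mean(x))^2) / n)  # divides by n

round(c(sample_sd = sample_sd, population_sd = population_sd), 3)
```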
For the following questions I use a dataset that I created myself. First, we load the dplyr library.
library(dplyr)
Next, we read the two datasets from CSV files and display the first 16 rows of laptop_price and laptop_spec.
The laptop_spec dataset is used in question c)iv, where two data frames are combined.
setwd("D:/")
laptop_price <- read.csv("./laptop_price.csv")
laptop_spec <- read.csv("./laptop_spec.csv")
head(laptop_price, n=16)
## laptop_ID Company Product Ram Price
## 1 1 Apple MacBook Pro 8GB 1339.69
## 2 2 Apple Macbook Air 8GB 898.94
## 3 3 HP 250 G6 8GB 575.00
## 4 4 Apple MacBook Pro 16GB 2537.45
## 5 5 Apple MacBook Pro 8GB 1803.60
## 6 6 Acer Aspire 3 4GB 400.00
## 7 7 Apple MacBook Pro 16GB 2139.97
## 8 8 Apple Macbook Air 8GB 1158.70
## 9 9 Asus ZenBook UX430UN 16GB 1495.00
## 10 10 Acer Swift 3 8GB 770.00
## 11 11 HP 250 G6 4GB 393.90
## 12 12 HP 250 G6 4GB 344.99
## 13 13 Apple MacBook Pro 16GB 2439.97
## 14 14 Dell Inspiron 3567 4GB 498.90
## 15 15 Apple MacBook 12" 8GB 1262.40
## 16 16 Apple MacBook Pro 8GB 1518.55
head(laptop_spec, n=16)
## laptop_ID Product Ram Cpu OpSys
## 1 1 MacBook Pro 8GB Intel Core i5 2.3GHz macOS
## 2 2 Macbook Air 8GB Intel Core i5 1.8GHz macOS
## 3 3 250 G6 8GB Intel Core i5 7200U 2.5GHz No OS
## 4 4 MacBook Pro 16GB Intel Core i7 2.7GHz macOS
## 5 5 MacBook Pro 8GB Intel Core i5 3.1GHz macOS
## 6 6 Aspire 3 4GB AMD A9-Series 9420 3GHz Windows 10
## 7 7 MacBook Pro 16GB Intel Core i7 2.2GHz Mac OS X
## 8 8 Macbook Air 8GB Intel Core i5 1.8GHz macOS
## 9 9 ZenBook UX430UN 16GB Intel Core i7 8550U 1.8GHz Windows 10
## 10 10 Swift 3 8GB Intel Core i5 8250U 1.6GHz Windows 10
## 11 11 250 G6 4GB Intel Core i5 7200U 2.5GHz No OS
## 12 12 250 G6 4GB Intel Core i3 6006U 2GHz No OS
## 13 13 MacBook Pro 16GB Intel Core i7 2.8GHz macOS
## 14 14 Inspiron 3567 4GB Intel Core i3 6006U 2GHz Windows 10
## 15 15 MacBook 12" 8GB Intel Core M m3 1.2GHz macOS
## 16 16 MacBook Pro 8GB Intel Core i5 2.3GHz macOS
c)i. Change the existing column name to something new.
To rename a column in a data frame, we can use rename() from dplyr. In this example, we rename the Company column of the laptop_price dataset to Brand.
laptop_price <- dplyr::rename(laptop_price, Brand = Company)
head(laptop_price, n=10)
## laptop_ID Brand Product Ram Price
## 1 1 Apple MacBook Pro 8GB 1339.69
## 2 2 Apple Macbook Air 8GB 898.94
## 3 3 HP 250 G6 8GB 575.00
## 4 4 Apple MacBook Pro 16GB 2537.45
## 5 5 Apple MacBook Pro 8GB 1803.60
## 6 6 Acer Aspire 3 4GB 400.00
## 7 7 Apple MacBook Pro 16GB 2139.97
## 8 8 Apple Macbook Air 8GB 1158.70
## 9 9 Asus ZenBook UX430UN 16GB 1495.00
## 10 10 Acer Swift 3 8GB 770.00
c)ii. Pick rows based on their values.
In this example, we keep only the rows that satisfy both conditions: a price greater than 1000 and the brand Apple.
dplyr::filter(laptop_price, Price > 1000, Brand == "Apple")
## laptop_ID Brand Product Ram Price
## 1 1 Apple MacBook Pro 8GB 1339.69
## 2 4 Apple MacBook Pro 16GB 2537.45
## 3 5 Apple MacBook Pro 8GB 1803.60
## 4 7 Apple MacBook Pro 16GB 2139.97
## 5 8 Apple Macbook Air 8GB 1158.70
## 6 13 Apple MacBook Pro 16GB 2439.97
## 7 15 Apple MacBook 12" 8GB 1262.40
## 8 16 Apple MacBook Pro 8GB 1518.55
## 9 18 Apple MacBook Pro 16GB 2858.00
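In filter(), conditions separated by commas are combined with a logical AND, so the call above is equivalent to joining the two conditions with &. A minimal sketch on a small made-up data frame (toy and its values are hypothetical, not taken from laptop_price.csv):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  Brand = c("Apple", "Apple", "HP", "Acer"),
  Price = c(1500, 900, 1200, 400)
)

# Comma-separated conditions: a row is kept only if all of them hold
by_comma <- filter(toy, Price > 1000, Brand == "Apple")

# The same selection written with an explicit logical AND
by_and <- filter(toy, Price > 1000 & Brand == "Apple")

by_comma
```

Only the Apple row priced at 1500 passes both conditions; the HP row is expensive enough but has the wrong brand.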
c)iii. Add new columns to a data frame.
We can use the mutate() function in dplyr to add a new column to a data frame. In this example, we add a new column called Full_Product_Name by combining the Brand and Product columns.
laptop_price <- dplyr::mutate(laptop_price, Full_Product_Name = paste(Brand, Product))
head(laptop_price, n=10)
## laptop_ID Brand Product Ram Price Full_Product_Name
## 1 1 Apple MacBook Pro 8GB 1339.69 Apple MacBook Pro
## 2 2 Apple Macbook Air 8GB 898.94 Apple Macbook Air
## 3 3 HP 250 G6 8GB 575.00 HP 250 G6
## 4 4 Apple MacBook Pro 16GB 2537.45 Apple MacBook Pro
## 5 5 Apple MacBook Pro 8GB 1803.60 Apple MacBook Pro
## 6 6 Acer Aspire 3 4GB 400.00 Acer Aspire 3
## 7 7 Apple MacBook Pro 16GB 2139.97 Apple MacBook Pro
## 8 8 Apple Macbook Air 8GB 1158.70 Apple Macbook Air
## 9 9 Asus ZenBook UX430UN 16GB 1495.00 Asus ZenBook UX430UN
## 10 10 Acer Swift 3 8GB 770.00 Acer Swift 3
c)iv. Combine data across two or more data frames.
The left_join() function in dplyr can be used to join two data frames. In this example, we combine the two datasets with a left join. Because no by argument is given, dplyr matches rows on every column the two data frames share: "laptop_ID", "Product", and "Ram".
newdata <- dplyr::left_join(laptop_price, laptop_spec)
## Joining, by = c("laptop_ID", "Product", "Ram")
head(newdata, n=10)
## laptop_ID Brand Product Ram Price Full_Product_Name
## 1 1 Apple MacBook Pro 8GB 1339.69 Apple MacBook Pro
## 2 2 Apple Macbook Air 8GB 898.94 Apple Macbook Air
## 3 3 HP 250 G6 8GB 575.00 HP 250 G6
## 4 4 Apple MacBook Pro 16GB 2537.45 Apple MacBook Pro
## 5 5 Apple MacBook Pro 8GB 1803.60 Apple MacBook Pro
## 6 6 Acer Aspire 3 4GB 400.00 Acer Aspire 3
## 7 7 Apple MacBook Pro 16GB 2139.97 Apple MacBook Pro
## 8 8 Apple Macbook Air 8GB 1158.70 Apple Macbook Air
## 9 9 Asus ZenBook UX430UN 16GB 1495.00 Asus ZenBook UX430UN
## 10 10 Acer Swift 3 8GB 770.00 Acer Swift 3
## Cpu OpSys
## 1 Intel Core i5 2.3GHz macOS
## 2 Intel Core i5 1.8GHz macOS
## 3 Intel Core i5 7200U 2.5GHz No OS
## 4 Intel Core i7 2.7GHz macOS
## 5 Intel Core i5 3.1GHz macOS
## 6 AMD A9-Series 9420 3GHz Windows 10
## 7 Intel Core i7 2.2GHz Mac OS X
## 8 Intel Core i5 1.8GHz macOS
## 9 Intel Core i7 8550U 1.8GHz Windows 10
## 10 Intel Core i5 8250U 1.6GHz Windows 10
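Relying on automatic key detection works here, but naming the keys makes the intent explicit and avoids the "Joining, by" message. A minimal sketch on made-up data (price and spec and their values are hypothetical), which also shows that a left join keeps unmatched rows from the first data frame and fills the missing columns with NA:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
price <- data.frame(laptop_ID = c(1, 2, 3),
                    Price     = c(1300, 900, 575))
spec  <- data.frame(laptop_ID = c(1, 2),
                    OpSys     = c("macOS", "macOS"))

# Explicit join key; laptop_ID 3 has no match in spec
joined <- left_join(price, spec, by = "laptop_ID")
joined
```

All three rows of price survive the join, and the OpSys value for laptop_ID 3 is NA because spec has no row with that key.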