Nur Liyana Madihah binti Hazizun

Matric number: 17193796/2

(a)

The model correctly predicted the positive class for the screening test 120 times and incorrectly predicted it 10 times.

The model correctly predicted the negative class for the screening test 50 times and incorrectly predicted it 15 times.

The following can be computed from this confusion matrix:

The model made 170 correct predictions (120 + 50).

The model made 25 incorrect predictions (10 + 15).

There are 195 total scored cases (120 + 15 + 10 + 50).

The error rate is 25/195 = 5/39 = 0.1282 (incorrect predictions divided by total cases).

The overall accuracy rate is 34/39 = 0.8718

The precision is 12/13 = 0.9231

The sensitivity is 8/9 = 0.8889

The specificity is 5/6 = 0.8333

The negative predictive value is 10/13 = 0.7692
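
The figures above can be checked with a few lines of R. This is only a sketch: the names tp, fp, tn, fn and metrics are illustrative and not part of the original answer, and it assumes the 10 wrong positive predictions are false positives and the 15 wrong negative predictions are false negatives.

```r
# Counts read from the confusion matrix described in (a)
tp <- 120  # predicted positive, actually positive
fp <- 10   # predicted positive, actually negative (assumed)
tn <- 50   # predicted negative, actually negative
fn <- 15   # predicted negative, actually positive (assumed)

total <- tp + fp + tn + fn

# Each metric is a ratio of counts from the four cells
metrics <- c(
  error_rate  = (fp + fn) / total,
  accuracy    = (tp + tn) / total,
  precision   = tp / (tp + fp),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  npv         = tn / (tn + fn)
)

round(metrics, 4)
```

Note that the error rate and the accuracy always sum to 1, which is a quick sanity check on the arithmetic.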


(b)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris
Codebook
class(iris)
## [1] "data.frame"
sapply(iris, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
Standard deviation

The standard deviation of the sepal length is 0.8280661

sd(iris$Sepal.Length)
## [1] 0.8280661

The standard deviation of the sepal width is 0.4358663

sd(iris$Sepal.Width)
## [1] 0.4358663

The standard deviation of the petal length is 1.7652982

sd(iris$Petal.Length)
## [1] 1.765298

The standard deviation of the petal width is 0.7622377

sd(iris$Petal.Width)
## [1] 0.7622377
Variance

The variance of the sepal length is 0.6856935

var(iris$Sepal.Length)
## [1] 0.6856935

The variance of the sepal width is 0.1899794

var(iris$Sepal.Width)
## [1] 0.1899794

The variance of the petal length is 3.1162779

var(iris$Petal.Length)
## [1] 3.116278

The variance of the petal width is 0.5810063

var(iris$Petal.Width)
## [1] 0.5810063
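
The eight calls above can also be collapsed into one table. The sketch below uses only base R; the names num_cols and spread are illustrative. It also confirms that each variance is the square of the corresponding standard deviation.

```r
# Keep only the numeric columns of iris (drops Species)
num_cols <- iris[sapply(iris, is.numeric)]

# One row per variable, with its standard deviation and variance
spread <- data.frame(
  sd  = sapply(num_cols, sd),
  var = sapply(num_cols, var)
)
spread

# The variance is the square of the standard deviation
all.equal(spread$var, spread$sd^2)
```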
Data Visualisation

Numerical Data

Sepal.Length = iris$Sepal.Length
hist(Sepal.Length)

In the histogram above, sepal length is on the x-axis and the frequency of observations is on the y-axis. The bin width is 0.5. The tallest bars cover sepal lengths between 5.5 and 6.5, so that is where the observations are most concentrated.
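
The bin width and the bar heights can be read from the histogram object instead of from the picture. This is a small sketch using base R's hist() with plot = FALSE; the names h and bins are illustrative.

```r
# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, plot = FALSE)

# Bin edges and the width of each bin
h$breaks
diff(h$breaks)

# Number of observations in each bin, labelled by its lower and upper edge
bins <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
bins
```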

Categorical Data

Species = iris$Species
table(Species)
## Species
##     setosa versicolor  virginica 
##         50         50         50

From the table, all 3 species have the same number of observations: 50 each.

barplot(table(Species))

In the bar graph above, Species is on the x-axis and the frequency of each species is on the y-axis. Since all the species have the same number of observations, the bars are all the same height.


(c)

The dataset is obtained from https://github.com/cmdlinetips/data/blob/master/sample_data_to_convert_column_to_datetime_pandas.csv

library(dplyr)
info = read.csv("https://raw.githubusercontent.com/cmdlinetips/data/master/sample_data_to_convert_column_to_datetime_pandas.csv")
info
i. Change column name to something new

rename() is a dplyr function that lets the user rename columns of a data frame.

The call is rename(X, B = A), where X is the data frame, B is the new column name and A is the old column name.

info <- rename(info,
  "Date" = "date",
  "Precipitation" = "precipitation",
  "Maximum temp" = "temp_max",
  "Minimum temp" = "temp_min",
  "Wind" = "wind",
  "Weather" = "weather"
)
info
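
New names that contain a space, such as "Maximum temp", are allowed, but they must be wrapped in backticks whenever they are used later. The sketch below shows this on a small made-up data frame; weather, renamed and warm are hypothetical names and the values are not taken from the dataset.

```r
library(dplyr)

# Toy data standing in for two columns of the weather dataset (made-up values)
weather <- data.frame(
  temp_max = c(12.8, 10.6),
  temp_min = c(5.0, 2.8)
)

# new name = old name; backticks allow a space in the new name
renamed <- rename(weather,
  `Maximum temp` = temp_max,
  `Minimum temp` = temp_min
)
names(renamed)

# The same backticks are needed when the column is used afterwards
warm <- filter(renamed, `Maximum temp` > 11)
warm
```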
ii. Pick rows based on their values
rain <- filter(info, Weather == "rain")
rain

The output lists the rows whose value in the Weather column is “rain”.
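
filter() also accepts several conditions at once. Conditions separated by commas must all hold, and %in% matches any value from a set. The sketch below uses a small made-up data frame; toy, rainy_windy and wet are hypothetical names.

```r
library(dplyr)

# Toy data with the same column names as the renamed dataset (made-up values)
toy <- data.frame(
  Weather = c("rain", "sun", "rain", "drizzle"),
  Wind    = c(4.7, 2.3, 6.1, 4.5)
)

# Rows where it rained AND the wind value is above 5
rainy_windy <- filter(toy, Weather == "rain", Wind > 5)
rainy_windy

# Rows where Weather is either "rain" or "drizzle"
wet <- filter(toy, Weather %in% c("rain", "drizzle"))
wet
```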

iii. Add new columns to a data frame
info["New"] <- "Value"
info

A new column “New” is added to the data frame, with the value “Value” in every row.
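
The same step can be done with dplyr's mutate(), which can also compute the new column from existing ones. This is a sketch on a small made-up data frame; toy and temp_range are hypothetical names.

```r
library(dplyr)

# Toy data standing in for the temperature columns (made-up values)
toy <- data.frame(
  temp_max = c(12.8, 10.6, 11.7),
  temp_min = c(5.0, 2.8, 7.2)
)

# Add a constant column and a column derived from two existing columns
toy <- mutate(toy,
  New        = "Value",
  temp_range = temp_max - temp_min
)
toy
```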

iv. Combine data across two or more data frames

A second dataset is obtained from https://github.com/cmdlinetips/data/blob/master/combine_year_month_day_into_date_pandas.csv

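# Read the second dataset
newInfo = read.csv("https://raw.githubusercontent.com/cmdlinetips/data/master/combine_year_month_day_into_date_pandas.csv")
newInfo
# bind_cols() places the data frames side by side and matches rows by position,
# so both data frames must have the same number of rows
xy = bind_cols(info, newInfo)
xy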
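
bind_cols() is one of several ways to combine data frames in dplyr. The sketch below contrasts it with bind_rows() and left_join() on small made-up data frames; a, b and the result names are hypothetical, and the shared key column id is an assumption that does not come from the datasets above.

```r
library(dplyr)

# Two toy data frames that share a key column "id" (made-up values)
a <- data.frame(id = 1:3, wind = c(4.7, 4.5, 2.3))
b <- data.frame(id = c(1L, 2L, 4L), weather = c("rain", "sun", "fog"))

# Side by side: rows are matched by position, so row counts must agree
side_by_side <- bind_cols(a, data.frame(flag = c(TRUE, FALSE, TRUE)))

# Stacked: rows of the second data frame are appended to the first
stacked <- bind_rows(a, data.frame(id = 4L, wind = 6.1))

# Joined: rows are matched on the key; ids missing from b give NA
joined <- left_join(a, b, by = "id")
joined
```

A join is usually the safer choice when the two data frames describe the same records but may be in a different order.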