AAQ2

By: IRDINA BATRISYIA BINTI ASRUL NIZAM (17186649/2)

QUESTION A

The model made (120+50) = 170 correct predictions

The model made (10+15) = 25 incorrect predictions

There are (120 + 15 + 10 + 50) = 195 total scored cases

The error rate = 25/195 = 0.1282

The overall accuracy rate = 170/195 = 0.8718

Sensitivity = TP/(TP+FN) = 120/(120+10) = 0.92

Specificity = TN/(TN+FP) = 50/(50+15) = 0.77

Precision = TP/(TP+FP) = 120/(120+15) = 0.89

Negative predictive value = TN/(TN+FN) = 50/(50+10) = 0.833

F-measure = (2 x Recall x Precision)/(Recall + Precision) = (2 x 0.923 x 0.889)/(0.923 + 0.889) = 0.906
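
The metrics above can be checked with a few lines of R. This is a minimal sketch that assumes the confusion matrix cells are TP = 120, FN = 10, FP = 15 and TN = 50. The assignment of 10 and 15 to FN and FP is inferred from the stated sensitivity (0.92) and precision (0.89), and the variable names are illustrative.

```r
# Confusion-matrix counts (assumed cell assignment, see the note above)
TP <- 120; FN <- 10; FP <- 15; TN <- 50
total <- TP + FN + FP + TN

metrics <- c(
  accuracy    = (TP + TN) / total,
  error_rate  = (FP + FN) / total,
  sensitivity = TP / (TP + FN),   # also called recall
  specificity = TN / (TN + FP),
  precision   = TP / (TP + FP),
  npv         = TN / (TN + FN)
)

# F-measure is the harmonic mean of precision and recall
metrics["f_measure"] <- 2 * metrics[["precision"]] * metrics[["sensitivity"]] /
  (metrics[["precision"]] + metrics[["sensitivity"]])

print(round(metrics, 4))
```

Accuracy and error rate should always sum to 1, which gives a quick sanity check on the arithmetic.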

QUESTION B

library(ggplot2) 
library(dplyr)
library(memisc)

data.df <- as.data.frame(HairEyeColor)
str(data.df)

## 'data.frame':    32 obs. of  4 variables:
##  $ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: num  32 53 10 3 11 50 10 30 10 25 ...

summary(HairEyeColor)

## Number of cases in table: 592 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 164.92, df = 24, p-value = 5.321e-23
##  Chi-squared approximation may be incorrect

VISUALIZING DATA

qplot(data = data.df, Eye, Freq, geom="boxplot", color=Sex)

Brown and blue are the most common eye colors for both males and females.

qplot(data = data.df, Hair, Freq, geom="boxplot", color=Sex)

Brown is the most common hair color for both males and females.
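
The boxplots summarise the 32 cell counts rather than the 592 people, so the observations above are easier to confirm from the marginal totals. The sketch below collapses the built-in HairEyeColor table by sex; the object names hair_by_sex, eye_by_sex, top_hair and top_eye are illustrative.

```r
# HairEyeColor has dimensions Hair (1), Eye (2) and Sex (3).
# margin.table() sums over the dimensions that are not listed.
hair_by_sex <- margin.table(HairEyeColor, c(1, 3))
eye_by_sex  <- margin.table(HairEyeColor, c(2, 3))

print(hair_by_sex)
print(eye_by_sex)

# Most frequent level within each sex
top_hair <- rownames(hair_by_sex)[apply(hair_by_sex, 2, which.max)]
top_eye  <- rownames(eye_by_sex)[apply(eye_by_sex, 2, which.max)]
print(top_hair)
print(top_eye)
```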

Density plot of different eye colors

qplot(data = data.df, Eye, geom = "density", fill = Eye, alpha = I(0.6))
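
Because data.df holds one row per Hair/Eye/Sex combination (32 rows) rather than one row per person, a bar chart of summed frequencies may be easier to read than a boxplot or density plot. The following is a sketch of one alternative, not part of the original analysis; eye_totals and p are illustrative names.

```r
library(ggplot2)

data.df <- as.data.frame(HairEyeColor)

# Sum the cell counts over hair color so there is one row per Eye x Sex group
eye_totals <- aggregate(Freq ~ Eye + Sex, data = data.df, FUN = sum)

# Side-by-side bars: one bar per sex within each eye color
p <- ggplot(eye_totals, aes(x = Eye, y = Freq, fill = Sex)) +
  geom_col(position = "dodge") +
  labs(x = "Eye color", y = "Number of people")
print(p)
```

Aggregating first matters here: with position = "dodge", geom_col() does not add up several rows that share the same x value and fill group.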

CODEBOOK

codebook(data.df) # call the codebook function

## ================================================================================
## 
##    Hair
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 4 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'Black'             8  25.0
##    2 'Brown'             8  25.0
##    3 'Red'               8  25.0
##    4 'Blond'             8  25.0
## 
## ================================================================================
## 
##    Eye
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 4 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'Brown'             8  25.0
##    2 'Blue'              8  25.0
##    3 'Hazel'             8  25.0
##    4 'Green'             8  25.0
## 
## ================================================================================
## 
##    Sex
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: integer
##    Factor with 2 levels
## 
##    Levels and labels     N Valid
##                                 
##    1 'Male'             16  50.0
##    2 'Female'           16  50.0
## 
## ================================================================================
## 
##    Freq
## 
## --------------------------------------------------------------------------------
## 
##    Storage mode: double
## 
##         Min:  2.000
##         Max: 66.000
##        Mean: 18.500
##    Std.Dev.: 17.955
##    Skewness:  1.364
##    Kurtosis:  0.749

QUESTION C

Demonstrate useful functions of dplyr for data manipulation for the following:

# Load the dataset from the source URL into the data frame c
c <- read.csv("https://raw.githubusercontent.com/IrdnaBtrisyia/AAQ2/main/Movies.csv")
View(c)

Change the existing column name to something new.

# showing the old variable names
names(c)

## [1] "movie_title"              "release_date"            
## [3] "total_gross"              "inflation_adjusted_gross"

# rename() uses the form new_name = old_name
c <- dplyr::rename(
  c,
  "Title"   = "movie_title",
  "Release" = "release_date",
  "Income"  = "total_gross",
  "AGI"     = "inflation_adjusted_gross"
)
# showing the new name for variables
names(c)

## [1] "Title"   "Release" "Income"  "AGI"

Pick rows based on their values.

Show the row whose AGI equals 14641561.

show(dplyr::filter(c, AGI == 14641561))

##            Title  Release  Income      AGI
## 1 The Jerky Boys 2/3/1995 7555256 14641561

Show the row whose Income equals 14276095.

show(dplyr::filter(c, Income == 14276095))

##                             Title   Release   Income      AGI
## 1 Baby: Secret of the Lost Legend 3/22/1985 14276095 33900697

Add new columns to a data frame.

Add a new column, Deductions, computed as AGI minus Income.

c <- c %>%
  dplyr::select(Title:AGI) %>%
  dplyr::mutate(Deductions = AGI - Income)
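
As a quick check of the mutate() step, the sketch below applies the same Deductions = AGI - Income calculation to the two rows shown in the filter() output earlier. The data frame name movies_sample is illustrative and not part of the original script.

```r
library(dplyr)

# Two rows copied from the filter() output shown earlier
movies_sample <- data.frame(
  Title  = c("The Jerky Boys", "Baby: Secret of the Lost Legend"),
  Income = c(7555256, 14276095),
  AGI    = c(14641561, 33900697)
)

# mutate() appends the new column and leaves the existing ones unchanged
movies_sample <- movies_sample %>%
  mutate(Deductions = AGI - Income)

print(movies_sample)
```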

Combine data across two or more data frames.

Combine the movies data frame with the Genre data frame and store the result in a new data frame called movies.

genre <- read.csv("https://raw.githubusercontent.com/IrdnaBtrisyia/AAQ2/main/Genre.csv")
genre <- dplyr::rename(genre, "Title" = "movie_title")
movies <- dplyr::inner_join(c, genre, by = "Title")
View(movies)
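
inner_join() keeps only the titles that appear in both data frames, so any movie without a genre entry is dropped. The sketch below illustrates this on a tiny example. The two film rows come from the output shown earlier, while the genre table (including its genre column name and values) is hypothetical and made up for illustration.

```r
library(dplyr)

films <- data.frame(
  Title  = c("The Jerky Boys", "Baby: Secret of the Lost Legend"),
  Income = c(7555256, 14276095)
)

# Hypothetical genre table: only one title matches a row in films
genres <- data.frame(
  Title = c("The Jerky Boys", "Some Other Film"),
  genre = c("Comedy", "Drama")
)

matched   <- inner_join(films, genres, by = "Title")  # titles present in both
all_films <- left_join(films, genres, by = "Title")   # every film, NA if no genre

print(matched)
print(all_films)
```

If keeping every movie matters more than having a genre for each one, left_join() is the usual alternative.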

Reference: https://www.kaggle.com/prateekmaj21/disney-movies