(a)

Find and get a dataset from the datasets available within R.
Perform exploratory data analysis (EDA) and prepare a codebook on that dataset using a newer method in R.

Import libraries

library(MASS) # contains the Boston dataset used below
library(ggplot2)
library(dplyr)
library(tidyverse)
library(corrplot)
library(DT)
library(dataMaid)
library(mlbench) # contains BostonHousing (not used below; Boston comes from MASS)

EDA on the Boston housing dataset

The Boston housing dataset contains housing data for 506 census tracts of Boston from the 1970 census.

Display dimension of dataset
The dataset contains 14 attributes with 506 entries.

dim(Boston)
## [1] 506  14

Quick glance at the dataset

head(Boston) 

Check whether the dataset contains NA values

colSums(is.na(Boston))
##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv 
##       0       0       0       0

Structure of dataset
All variables are stored as doubles (num) except ‘chas’ and ‘rad’, which are integers.

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The summary of the dataset
From the summary, ‘crim’, ‘zn’ and ‘black’ show a large gap between median and mean, which indicates skewed distributions with many extreme values. For ‘rm’ the gap is small (median 6.208, mean 6.285), so any outliers there have to be checked graphically.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
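
The mean/median comparison can also be checked numerically rather than read off the summary table. The following is a minimal sketch, assuming the Boston data frame from MASS; the name `gap` and the choice to scale the difference by the standard deviation are assumptions made here for illustration, not part of the original analysis.

```r
library(MASS)  # provides the Boston data frame

# Standardised gap between mean and median for each variable:
# values far from 0 hint at a skewed distribution.
gap <- sapply(Boston, function(x) (mean(x) - median(x)) / sd(x))

# Sort by absolute size so the most skewed variables come first
gap <- gap[order(abs(gap), decreasing = TRUE)]
round(gap, 2)
```

Scaling by the standard deviation makes variables with very different units (for example ‘tax’ and ‘nox’) comparable on one list.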

Checking outliers
The box plots confirm that ‘crim’, ‘zn’, ‘rm’ and ‘black’ all contain many outliers.

par(mfrow = c(1, 4), pch=16) # Display 1 x 4 plots
boxplot(Boston$crim, main='crim',col='Grey')
boxplot(Boston$zn, main='zn',col='Yellow')
boxplot(Boston$rm, main='rm',col='Green')
boxplot(Boston$black, main='black',col='Sky Blue')
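
The points drawn beyond the whiskers can also be counted directly. This is a small sketch using base R's `boxplot.stats()`, which applies the same 1.5 x IQR rule that `boxplot()` uses by default; the names `vars` and `n_out` are illustrative.

```r
library(MASS)  # provides the Boston data frame

# Count the points that boxplot() would draw beyond the whiskers
# (more than 1.5 * IQR away from the box, the default rule).
vars <- c("crim", "zn", "rm", "black")
n_out <- sapply(Boston[vars], function(x) length(boxplot.stats(x)$out))
n_out
```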

Plotting histograms
Histograms enable us to observe the skewness of the data.
In the Boston housing dataset, most variables are skewed; the exception is ‘rm’, which is approximately normally distributed.

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv and chas
  gather(key = "var", value = "value", -medv, -chas) %>%  
  # ggplot histogram
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  # some formatting to make the plot nicer
  theme_gray() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Histogram for Variables") # title
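
`gather()` still works, but tidyr now documents it as superseded by `pivot_longer()`. The sketch below shows what an equivalent reshaping step might look like; `boston_long` and `p` are illustrative names, and the plot formatting is reduced to the essentials.

```r
library(MASS)     # Boston data frame
library(tidyr)    # pivot_longer()
library(ggplot2)

# Reshape to long format: one row per (variable, value) pair,
# leaving out medv and chas as in the gather() call.
boston_long <- pivot_longer(Boston, cols = -c(medv, chas),
                            names_to = "var", values_to = "value")

p <- ggplot(boston_long, aes(x = value)) +
  # bins = 30 is the default; setting it explicitly avoids the message
  geom_histogram(bins = 30) +
  facet_wrap(~ var, scales = "free")
p
```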

Determine correlations between variables
corrplot provides a great function for this purpose!

# The darker the color (and the larger the circle), the stronger the correlation; with the default palette, blue is positive and red is negative
corrplot(cor(Boston))
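
The plot is easier to interpret next to the numbers behind it. As a minimal sketch, the correlations of every variable with ‘medv’ can be pulled out of the same matrix and sorted by strength; `cors` is an illustrative name.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlations of every variable with medv
cors <- cor(Boston)[, "medv"]
cors <- cors[names(cors) != "medv"]   # drop the trivial self-correlation

# Strongest relationships (positive or negative) first
cors[order(abs(cors), decreasing = TRUE)]
```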

Plotting each variable against per capita crime rate

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv and crim
  gather(key = "var", value = "value", -crim, -medv) %>%
  # ggplot scatter(point) plot
  ggplot(aes(x = value, y = crim)) +
  geom_point(shape = 20) +
  facet_wrap(~var, scales = "free") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Scatter Plot of Variables against Per capita Crime Rate") 

Inspecting linear relation between medv and other variables

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv
  gather(key = "var", value = "value", -medv) %>%
  # ggplot scatter(point) plot
  ggplot(aes(x = value, y = medv)) +
  geom_point(shape = 20) +
  stat_smooth(formula = y~x, method = "lm", se = TRUE, col = "green") +
  facet_wrap(~var, scales = "free") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Scatter Plot of Variables against Median Value (medv)") 
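
Each panel's green line is an ordinary least-squares fit. To put numbers on one of them, the same kind of model can be fitted with `lm()`. The sketch below uses ‘lstat’ as the predictor purely as an example; `fit` is an illustrative name.

```r
library(MASS)  # provides the Boston data frame

# Simple linear regression of medv on lstat, the kind of fit
# stat_smooth(method = "lm") draws in each panel.
fit <- lm(medv ~ lstat, data = Boston)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # share of the variance in medv explained by lstat
```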

Creating code book with dataMaid

Codebook for Boston

Data report overview

The dataset examined has the following dimensions:

Feature                  Result
Number of observations   506
Number of variables      14


Codebook summary table

Label    Variable  Class    # unique values  Missing  Description
crim     crim      numeric  504              0.00 %   Per capita crime rate by town
zn       zn        numeric  26               0.00 %   Proportion of residential land zoned for lots over 25,000 sq.ft
indus    indus     numeric  76               0.00 %   Proportion of non-retail business acres per town
chas     chas      integer  2                0.00 %   Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox      nox       numeric  81               0.00 %   Nitric oxides concentration (parts per 10 million)
rm       rm        numeric  446              0.00 %   Average number of rooms per dwelling
age      age       numeric  356              0.00 %   Proportion of owner-occupied units built prior to 1940
dis      dis       numeric  412              0.00 %   Weighted distances to five Boston employment centres
rad      rad       integer  9                0.00 %   Index of accessibility to radial highways
tax      tax       numeric  66               0.00 %   Full-value property-tax rate per USD 10,000
ptratio  ptratio   numeric  46               0.00 %   Pupil-teacher ratio by town
black    black     numeric  357              0.00 %   1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat    lstat     numeric  455              0.00 %   Percentage of lower status of the population
medv     medv      numeric  229              0.00 %   Median value of owner-occupied homes in USD 1000's

Variable list

crim

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 504
Median 0.26
1st and 3rd quartiles 0.08; 3.68
Min. and max. 0.01; 88.98

zn

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 26
Median 0
1st and 3rd quartiles 0; 12.5
Min. and max. 0; 100

indus

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 76
Median 9.69
1st and 3rd quartiles 5.19; 18.1
Min. and max. 0.46; 27.74

chas

Feature Result
Variable type integer
Number of missing obs. 0 (0 %)
Number of unique values 2
Median 0
1st and 3rd quartiles 0; 0
Min. and max. 0; 1

nox

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 81
Median 0.54
1st and 3rd quartiles 0.45; 0.62
Min. and max. 0.38; 0.87

rm

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 446
Median 6.21
1st and 3rd quartiles 5.89; 6.62
Min. and max. 3.56; 8.78

age

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 356
Median 77.5
1st and 3rd quartiles 45.02; 94.07
Min. and max. 2.9; 100

dis

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 412
Median 3.21
1st and 3rd quartiles 2.1; 5.19
Min. and max. 1.13; 12.13

rad

Feature Result
Variable type integer
Number of missing obs. 0 (0 %)
Number of unique values 9
Median 5
1st and 3rd quartiles 4; 24
Min. and max. 1; 24

tax

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 66
Median 330
1st and 3rd quartiles 279; 666
Min. and max. 187; 711

ptratio

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 46
Median 19.05
1st and 3rd quartiles 17.4; 20.2
Min. and max. 12.6; 22

black

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 357
Median 391.44
1st and 3rd quartiles 375.38; 396.22
Min. and max. 0.32; 396.9

lstat

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 455
Median 11.36
1st and 3rd quartiles 6.95; 16.96
Min. and max. 1.73; 37.97

medv

Feature Result
Variable type numeric
Number of missing obs. 0 (0 %)
Number of unique values 229
Median 21.2
1st and 3rd quartiles 17.02; 25
Min. and max. 5; 50

Report generation information:

  • Created by: zianttt (username: ziant).

  • Report creation time: Fri Dec 31 2021 11:42:11

  • Report was run from directory: C:/Users/ziant/Desktop/Y1S1/WIA1007 Data Science/Individual assignment/AA1

  • dataMaid v1.4.1 [Pkg: 2021-10-08 from CRAN (R 4.1.2)]

  • R version 4.1.1 (2021-08-10).

  • Platform: x86_64-w64-mingw32/x64 (64-bit)(Windows 10 x64 (build 19042)).

  • Function call: dataMaid::makeDataReport(data = Boston, mode = c("summarize", "visualize", "check"), smartNum = FALSE, file = "codebook_Boston.Rmd", replace = TRUE, checks = list(character = "showAllFactorLevels", factor = "showAllFactorLevels", labelled = "showAllFactorLevels", haven_labelled = "showAllFactorLevels", numeric = NULL, integer = NULL, logical = NULL, Date = NULL), listChecks = FALSE, maxProbVals = Inf, codebook = TRUE, reportTitle = "Codebook for Boston")

(b)

Demonstrate these FIVE (5) functions of dplyr for data manipulation:

i. filter ( )
ii. arrange ( )
iii. mutate ( )
iv. select ( ) 
v. summarise ( )

Loading and inspecting data.

input <- read.csv("cereal.csv", sep = ";")
df <- data.frame(input)
head(df)
str(df)
## 'data.frame':    78 obs. of  16 variables:
##  $ name    : chr  "String" "100% Bran" "100% Natural Bran" "All-Bran" ...
##  $ mfr     : chr  "Categorical" "N" "Q" "K" ...
##  $ type    : chr  "Categorical" "C" "C" "C" ...
##  $ calories: chr  "Int" "70" "120" "70" ...
##  $ protein : chr  "Int" "4" "3" "4" ...
##  $ fat     : chr  "Int" "1" "5" "1" ...
##  $ sodium  : chr  "Int" "130" "15" "260" ...
##  $ fiber   : chr  "Float" "10" "2" "9" ...
##  $ carbo   : chr  "Float" "5" "8" "7" ...
##  $ sugars  : chr  "Int" "6" "8" "5" ...
##  $ potass  : chr  "Int" "280" "135" "320" ...
##  $ vitamins: chr  "Int" "25" "0" "25" ...
##  $ shelf   : chr  "Int" "3" "3" "3" ...
##  $ weight  : chr  "Float" "1" "1" "1" ...
##  $ cups    : chr  "Float" "0.33" "1" "0.33" ...
##  $ rating  : chr  "Float" "68.402973" "33.983679" "59.425505" ...

The first row is not needed because it is not a data entry; it only holds type labels. The non-categorical variables such as rating, weight and carbo should be converted to numeric to enable calculations.

# Remove first row
data <- df[-1, ]

# Convert character class to numeric class
i <- c(4:16)
data[ , i] <- apply(data[ , i], 2,        
                    function(x) as.numeric(as.character(x)))
str(data)
## 'data.frame':    77 obs. of  16 variables:
##  $ name    : chr  "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr  "N" "Q" "K" "K" ...
##  $ type    : chr  "C" "C" "C" "C" ...
##  $ calories: num  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : num  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : num  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : num  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : num  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : num  280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: num  25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : num  3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num  68.4 34 59.4 93.7 34.4 ...
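
An alternative to converting after the fact is to keep the type-label row out of the import, so that `read.csv()` infers numeric columns on its own. The sketch below runs on a hypothetical four-line extract in the same layout as cereal.csv (header, type row, data); `csv_text`, `header` and `cereal` are illustrative names, and the real file would be passed as the file argument instead of `text =`.

```r
# Hypothetical extract in the same layout as cereal.csv:
# header row, a row of type labels, then the data.
csv_text <- "name;mfr;calories;rating
String;Categorical;Int;Float
100% Bran;N;70;68.402973
All-Bran;K;70;59.425505"

# Read the column names on their own ...
header <- names(read.csv(text = csv_text, sep = ";", nrows = 1))

# ... then read the data, skipping both the header and the type row,
# so the column types are inferred from real values only.
cereal <- read.csv(text = csv_text, sep = ";", skip = 2,
                   header = FALSE, col.names = header)
str(cereal)
```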

filter()

It is used to subset data with matching logical conditions. It takes a dataframe and logical conditions as arguments. Logical operators such as ‘&’, ‘|’, ‘!’ can be used to create conditions.

# Filtering entries with rating higher than or equal to 60
good_rating <- filter(data, rating >= 60.00)
# Filtering entries with types equal to C and calories higher than 50
filtered_data <- filter(data, type=='C' & calories > 50)
good_rating
filtered_data
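
The other logical operators mentioned above work the same way. The following sketch uses a small hypothetical data frame shaped like the cereal data (the values are made up for illustration) to show ‘|’, ‘!’ and comma-separated conditions.

```r
library(dplyr)

# Small hypothetical data frame in the shape of the cereal data
cereals <- data.frame(
  name     = c("A", "B", "C", "D"),
  type     = c("C", "C", "H", "C"),
  calories = c(70, 90, 100, 50),
  rating   = c(68.4, 34.0, 64.5, 93.7)
)

# '|' keeps rows that match either condition
either <- filter(cereals, rating >= 60 | calories > 100)

# '!' negates a condition: everything that is not type "C"
not_cold <- filter(cereals, !(type == "C"))

# Comma-separated conditions are combined with AND, the same as '&'
both <- filter(cereals, type == "C", calories > 50)
```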

arrange()

It is used to sort rows by the variables specified. It takes a dataframe and one or more sorting variables as arguments; later variables break ties in earlier ones.

# Sorting the data based on ratings (desc means in descending order) and calories
data_sorted <- arrange(data, desc(rating), calories)
data_sorted
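
How the second sorting variable breaks ties is easiest to see on a tiny example. This sketch uses a hypothetical data frame (`scores`, with made-up values) in which two rows share the same rating.

```r
library(dplyr)

# Hypothetical data: rows A and C have the same rating
scores <- data.frame(
  name     = c("A", "B", "C", "D"),
  rating   = c(60, 75, 60, 40),
  calories = c(110, 90, 70, 100)
)

# Highest rating first; among equal ratings, fewest calories first
ranked <- arrange(scores, desc(rating), calories)
ranked$name
```

Here C is placed before A because both have a rating of 60 and C has fewer calories.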

mutate()

It is used to create new variables. It takes a dataframe and expressions as arguments.

# Create new variables: average_rating (the overall mean of rating, repeated on every row) and caloriesCarboRatio (the calorie-to-carbohydrate ratio)
mutated_data <- mutate(data, average_rating=mean(rating), caloriesCarboRatio=calories/carbo)
mutated_data
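
Because `mean(rating)` returns a single number, `mutate()` repeats it on every row. If a per-manufacturer average is wanted instead, one option is to group first. The sketch below uses a hypothetical data frame with made-up values; `group_by()` is not part of the five functions demonstrated here and is used only for illustration.

```r
library(dplyr)

# Hypothetical data: two manufacturers, two cereals each
cereals <- data.frame(
  mfr    = c("K", "K", "N", "N"),
  rating = c(50, 70, 30, 40)
)

# Without grouping, mean(rating) is one number recycled to every row
overall <- mutate(cereals, average_rating = mean(rating))

# With group_by(), the mean is computed within each manufacturer
by_mfr <- cereals %>%
  group_by(mfr) %>%
  mutate(average_rating = mean(rating)) %>%
  ungroup()
```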

select()

It is used to select only desired variables. It takes a dataframe and variables by name as arguments.

# Select every column except mfr and type columns
selected_data <- select(data, -c(mfr, type))
selected_data
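
One caveat when MASS and dplyr are loaded together: both export a function called `select()`, and whichever package is attached last wins. Writing `dplyr::select()` avoids the ambiguity. The sketch below, on a hypothetical data frame, also shows two other common ways to choose columns; `where()` inside `select()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data frame in the shape of the cereal data
cereals <- data.frame(
  name = c("A", "B"), mfr = c("K", "N"), type = c("C", "H"),
  calories = c(70, 120), rating = c(68.4, 34.0)
)

# Namespace-qualified call: unaffected by MASS::select if MASS is attached later
no_labels <- dplyr::select(cereals, -c(mfr, type))

# Keep only the numeric columns
numeric_only <- dplyr::select(cereals, where(is.numeric))

# Pick columns by name and rename one of them in the same step
renamed <- dplyr::select(cereals, cereal = name, rating)
```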

summarise()

It is used to summarize data. It takes a dataframe and summary functions as arguments.

# Summary for the average rating, median rating, and highest protein content of all cereal brands
summary_data <- summarise(data, average_rating = mean(rating), median_rating = median(rating),
                          highest_protein=max(protein))
summary_data
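
`summarise()` is often combined with `group_by()` to get one summary row per group instead of one row overall. The following is a minimal sketch on a hypothetical data frame with made-up values; the grouping step goes beyond the five functions demonstrated here and is shown only as an illustration.

```r
library(dplyr)

# Hypothetical data: two manufacturers, two cereals each
cereals <- data.frame(
  mfr     = c("K", "K", "N", "N"),
  rating  = c(50, 70, 30, 40),
  protein = c(2, 4, 3, 1)
)

# One summary row per manufacturer
per_mfr <- cereals %>%
  group_by(mfr) %>%
  summarise(
    n               = n(),            # number of cereals in the group
    average_rating  = mean(rating),
    highest_protein = max(protein)
  )
per_mfr
```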