(a)

Find and get a dataset from the datasets available within R.
Perform exploratory data analysis (EDA) and prepare a codebook on that dataset using a newer method in R.

Import libraries

library(MASS)      # provides the Boston dataset used below
library(ggplot2)
library(dplyr)
library(tidyverse)
library(corrplot)
library(DT)
library(dataMaid)
library(mlbench)   # provides BostonHousing, a variant of the same data (not used below)

EDA on the Boston housing dataset

The Boston housing dataset holds housing data for 506 census tracts of Boston from the 1970 census.

Display dimension of dataset
The dataset contains 14 attributes with 506 entries.

dim(Boston)

## [1] 506  14

Quick glance at the dataset

head(Boston)

Check whether the dataset contains NA values

colSums(is.na(Boston))

##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv 
##       0       0       0       0

Structure of dataset
Every variable is stored as a double (num) except ‘chas’ and ‘rad’, which are stored as integers (int).

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The summary of the dataset
From the summary, the mean and median differ markedly for ‘crim’, ‘zn’ and ‘black’, which points to skewed distributions and possibly many outliers. For ‘rm’ the mean (6.285) and median (6.208) are close, so outliers in that variable have to be checked graphically.

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
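The comparison of mean and median can also be made numerically. The following is a minimal sketch, not part of the original analysis: it divides the gap by each variable's standard deviation so that variables on different scales can be compared. The names `gap` and `std_gap` are illustrative.

```r
# Sketch: standardised gap between mean and median for each Boston variable
boston <- MASS::Boston

gap <- data.frame(
  mean   = sapply(boston, mean),
  median = sapply(boston, median),
  sd     = sapply(boston, sd)
)

# Express the gap in units of the variable's standard deviation
gap$std_gap <- (gap$mean - gap$median) / gap$sd

# Largest absolute gaps first
gap[order(-abs(gap$std_gap)), ]
```

A large absolute `std_gap` hints at skewness, but it does not by itself count outliers.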

Checking outliers
The box plots show many points beyond the whiskers for all four variables: ‘crim’, ‘zn’, ‘rm’ and ‘black’.

par(mfrow = c(1, 4), pch=16) # Display 1 x 4 plots
boxplot(Boston$crim, main='crim',col='Grey')
boxplot(Boston$zn, main='zn',col='Yellow')
boxplot(Boston$rm, main='rm',col='Green')
boxplot(Boston$black, main='black',col='Sky Blue')
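The points drawn beyond the whiskers can also be counted. This sketch assumes the default 1.5 x IQR rule that `boxplot()` uses and relies on `boxplot.stats()` from base R; the name `n_out` is illustrative.

```r
# Sketch: count the points that the box plots draw as outliers
boston <- MASS::Boston
vars <- c("crim", "zn", "rm", "black")

# boxplot.stats()$out holds the values lying beyond the whiskers
n_out <- sapply(boston[vars], function(x) length(boxplot.stats(x)$out))
n_out
```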

Plotting histograms
Histograms show how skewed each variable is.
In the Boston housing dataset most variables are skewed; ‘rm’ is the exception and looks approximately normal.

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv and chas
  gather(key = "var", value = "value", -medv, -chas) %>%  
  # ggplot histogram
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  # some formatting to make the plot nicer
  theme_gray() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Histogram for Variables") # title
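In current tidyr, `gather()` is marked as superseded and `pivot_longer()` is the recommended replacement. The sketch below shows one possible equivalent of the reshaping step above; the object names `long` and `p` are illustrative, and `bins = 30` is an explicit choice that matches the ggplot2 default.

```r
library(tidyr)
library(ggplot2)

# Reshape to long format: one row per (observation, variable) pair,
# keeping medv and chas out of the reshaping as in the code above
long <- pivot_longer(
  MASS::Boston,
  cols      = -c(medv, chas),
  names_to  = "var",
  values_to = "value"
)

# Same faceted histogram as before, with the bin count stated explicitly
p <- ggplot(long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ var, scales = "free") +
  ggtitle("Histogram for Variables")
p
```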

Determine correlations between variables
corrplot provides a great function for this purpose!

# The darker the color, the stronger the correlation between the variables
corrplot(cor(Boston))
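The correlation matrix behind the plot can also be read directly. As a small sketch, the following ranks the other variables by the absolute value of their correlation with ‘medv’; the name `cors` is illustrative.

```r
# Sketch: rank variables by the strength of their correlation with medv
boston <- MASS::Boston

cors <- cor(boston)[, "medv"]          # column of correlations with medv
cors <- cors[names(cors) != "medv"]    # drop the trivial self-correlation

# Strongest relationships (positive or negative) first
cors[order(-abs(cors))]
```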

Plotting each variable against the per capita crime rate

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv and crim
  gather(key = "var", value = "value", -crim, -medv) %>%
  # ggplot scatter(point) plot
  ggplot(aes(x = value, y = crim)) +
  geom_point(shape = 20) +
  facet_wrap(~var, scales = "free") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Scatter Plot of Variables against Per capita Crime Rate")

Inspecting linear relation between medv and other variables

Boston %>%
  # gather key (variable name) and value (entry value) as 2 columns except medv
  gather(key = "var", value = "value", -medv) %>%
  # ggplot scatter(point) plot
  ggplot(aes(x = value, y = medv)) +
  geom_point(shape = 20) +
  stat_smooth(formula = y~x, method = "lm", se = TRUE, col = "green") +
  facet_wrap(~var, scales = "free") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Scatter Plot of Variables against Median Value (medv)")

Creating code book with dataMaid
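The report below records a `dataMaid::makeDataReport()` call with `codebook = TRUE`. A codebook of this kind can probably be produced along the following lines. This is a hedged sketch, not the exact code used for the report: it assumes dataMaid's `makeCodebook()` wrapper and assumes that the Description column is read from a `"shortDescription"` attribute on each variable. Only two descriptions are set here for brevity.

```r
library(MASS)  # Boston data

boston <- Boston

# Assumption: dataMaid reads the "shortDescription" attribute of a column
# for the Description field of the codebook summary table
attr(boston$crim, "shortDescription") <- "Per capita crime rate by town"
attr(boston$medv, "shortDescription") <- "Median value of owner-occupied homes in USD 1000s"

# Only attempt the report when dataMaid is installed
if (requireNamespace("dataMaid", quietly = TRUE)) {
  dataMaid::makeCodebook(
    boston,
    reportTitle = "Codebook for Boston",
    file        = "codebook_Boston.Rmd",
    replace     = TRUE,   # overwrite an earlier report with the same name
    openResult  = FALSE   # do not open the rendered file automatically
  )
}
```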

Codebook for Boston

Data report overview

The dataset examined has the following dimensions:

Feature	Result
Number of observations	506
Number of variables	14

Codebook summary table

Label	Variable	Class	# unique values	Description
crim	crim	numeric	504	Per capita crime rate by town
zn	zn	numeric	26	Proportion of residential land zoned for lots over 25,000 sq.ft
indus	indus	numeric	76	Proportion of non-retail business acres per town
chas	chas	integer	2	Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox	nox	numeric	81	Nitric oxides concentration (parts per 10 million)
rm	rm	numeric	446	Average number of rooms per dwelling
age	age	numeric	356	Proportion of owner-occupied units built prior to 1940
dis	dis	numeric	412	Weighted distances to five Boston employment centres
rad	rad	integer	9	Index of accessibility to radial highways
tax	tax	numeric	66	Full-value property-tax rate per USD 10,000
ptratio	ptratio	numeric	46	Pupil-teacher ratio by town
black	black	numeric	357	1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
lstat	lstat	numeric	455	Percentage of lower status of the population
medv	medv	numeric	229	Median value of owner-occupied homes in USD 1000s

Variable list

crim

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	504
Median	0.26
1st and 3rd quartiles	0.08; 3.68
Min. and max.	0.01; 88.98

zn

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	26
Median	0
1st and 3rd quartiles	0; 12.5
Min. and max.	0; 100

indus

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	76
Median	9.69
1st and 3rd quartiles	5.19; 18.1
Min. and max.	0.46; 27.74

chas

Feature	Result
Variable type	integer
Number of missing obs.	0 (0 %)
Number of unique values	2
Median	0
1st and 3rd quartiles	0; 0
Min. and max.	0; 1

nox

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	81
Median	0.54
1st and 3rd quartiles	0.45; 0.62
Min. and max.	0.38; 0.87

rm

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	446
Median	6.21
1st and 3rd quartiles	5.89; 6.62
Min. and max.	3.56; 8.78

age

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	356
Median	77.5
1st and 3rd quartiles	45.02; 94.07
Min. and max.	2.9; 100

dis

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	412
Median	3.21
1st and 3rd quartiles	2.1; 5.19
Min. and max.	1.13; 12.13

rad

Feature	Result
Variable type	integer
Number of missing obs.	0 (0 %)
Number of unique values	9
Median	5
1st and 3rd quartiles	4; 24
Min. and max.	1; 24

tax

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	66
Median	330
1st and 3rd quartiles	279; 666
Min. and max.	187; 711

ptratio

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	46
Median	19.05
1st and 3rd quartiles	17.4; 20.2
Min. and max.	12.6; 22

black

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	357
Median	391.44
1st and 3rd quartiles	375.38; 396.22
Min. and max.	0.32; 396.9

lstat

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	455
Median	11.36
1st and 3rd quartiles	6.95; 16.96
Min. and max.	1.73; 37.97

medv

Feature	Result
Variable type	numeric
Number of missing obs.	0 (0 %)
Number of unique values	229
Median	21.2
1st and 3rd quartiles	17.02; 25
Min. and max.	5; 50

Report generation information:

Created by: zianttt (username: ziant).
Report creation time: Fri Dec 31 2021 11:42:11
Report was run from directory: C:/Users/ziant/Desktop/Y1S1/WIA1007 Data Science/Individual assignment/AA1
dataMaid v1.4.1 [Pkg: 2021-10-08 from CRAN (R 4.1.2)]
R version 4.1.1 (2021-08-10).
Platform: x86_64-w64-mingw32/x64 (64-bit)(Windows 10 x64 (build 19042)).
Function call: dataMaid::makeDataReport(data = Boston, mode = c("summarize", "visualize", "check"), smartNum = FALSE, file = "codebook_Boston.Rmd", replace = TRUE, checks = list(character = "showAllFactorLevels", factor = "showAllFactorLevels", labelled = "showAllFactorLevels", haven_labelled = "showAllFactorLevels", numeric = NULL, integer = NULL, logical = NULL, Date = NULL), listChecks = FALSE, maxProbVals = Inf, codebook = TRUE, reportTitle = "Codebook for Boston")

(b)

Demonstrate these FIVE (5) functions of dplyr for data manipulation:

i. filter ( )
ii. arrange ( )
iii. mutate ( )
iv. select ( ) 
v. summarise ( )

Loading and inspecting data.

# read.csv() already returns a data frame, so no data.frame() wrapper is needed
df <- read.csv("cereal.csv", sep = ";")
head(df)

str(df)

## 'data.frame':    78 obs. of  16 variables:
##  $ name    : chr  "String" "100% Bran" "100% Natural Bran" "All-Bran" ...
##  $ mfr     : chr  "Categorical" "N" "Q" "K" ...
##  $ type    : chr  "Categorical" "C" "C" "C" ...
##  $ calories: chr  "Int" "70" "120" "70" ...
##  $ protein : chr  "Int" "4" "3" "4" ...
##  $ fat     : chr  "Int" "1" "5" "1" ...
##  $ sodium  : chr  "Int" "130" "15" "260" ...
##  $ fiber   : chr  "Float" "10" "2" "9" ...
##  $ carbo   : chr  "Float" "5" "8" "7" ...
##  $ sugars  : chr  "Int" "6" "8" "5" ...
##  $ potass  : chr  "Int" "280" "135" "320" ...
##  $ vitamins: chr  "Int" "25" "0" "25" ...
##  $ shelf   : chr  "Int" "3" "3" "3" ...
##  $ weight  : chr  "Float" "1" "1" "1" ...
##  $ cups    : chr  "Float" "0.33" "1" "0.33" ...
##  $ rating  : chr  "Float" "68.402973" "33.983679" "59.425505" ...

The first row is not needed because it is not a data entry; it only records the type of each column. The non-categorical variables such as rating, weight and carbo should be converted to numeric to enable calculations.

# Remove first row
data <- df[-1, ]

# Convert character class to numeric class
i <- c(4:16)
data[ , i] <- apply(data[ , i], 2,        
                    function(x) as.numeric(as.character(x)))
str(data)

## 'data.frame':    77 obs. of  16 variables:
##  $ name    : chr  "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr  "N" "Q" "K" "K" ...
##  $ type    : chr  "C" "C" "C" "C" ...
##  $ calories: num  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : num  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : num  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : num  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : num  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : num  280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: num  25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : num  3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num  68.4 34 59.4 93.7 34.4 ...
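The row removal and type conversion above can also be written with dplyr itself. The sketch below is an alternative, not the method used above: it builds a small stand-in data frame (`raw`, with values copied from the output shown earlier) because `cereal.csv` is not included here, then uses `slice()` and `mutate(across())`.

```r
library(dplyr)

# Stand-in for the first rows of cereal.csv: the first row holds type labels
raw <- data.frame(
  name     = c("String", "100% Bran", "All-Bran"),
  mfr      = c("Categorical", "N", "K"),
  calories = c("Int", "70", "70"),
  rating   = c("Float", "68.402973", "59.425505"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  slice(-1) %>%                                    # drop the type-label row
  mutate(across(c(calories, rating), as.numeric))  # convert selected columns

str(clean)
```

With the full file, the column selection inside `across()` would need to cover all sixteen minus the three character columns, for example `across(calories:rating, as.numeric)`.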

filter()

filter() keeps the rows of a data frame that satisfy one or more logical conditions. It takes a data frame and the conditions as arguments; the logical operators ‘&’, ‘|’ and ‘!’ can be used to combine conditions.

# Filtering entries with rating higher or equal to 60
good_rating <- filter(data, rating >= 60.00)
# Filtering entries with types equal to C and calories higher than 50
filtered_data <- filter(data, type=='C' & calories > 50)
good_rating

filtered_data
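The examples above only use ‘&’. As a brief sketch of the other two operators, the following uses a four-row stand-in data frame (`cereal`, with values rounded from the output shown earlier) so that it runs without `cereal.csv`.

```r
library(dplyr)

# Four-row stand-in for the cereal data
cereal <- data.frame(
  name     = c("100% Bran", "100% Natural Bran", "All-Bran",
               "All-Bran with Extra Fiber"),
  mfr      = c("N", "Q", "K", "K"),
  calories = c(70, 120, 70, 50),
  rating   = c(68.4, 34.0, 59.4, 93.7)
)

# OR: keep rows made by manufacturer K or rated at least 60
k_or_good <- filter(cereal, mfr == "K" | rating >= 60)

# NOT: keep rows whose calories are not above 60
low_cal <- filter(cereal, !(calories > 60))

k_or_good
low_cal
```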

arrange()

arrange() sorts the rows of a data frame by the variables specified. It takes a data frame and one or more sorting variables as arguments; later variables break ties in earlier ones.

# Sorting the data based on ratings (desc means in descending order) and calories
data_sorted <- arrange(data, desc(rating), calories)
data_sorted

mutate()

It is used to create new variables. It takes a dataframe and expressions as arguments.

# Create average_rating (the overall mean of rating, repeated on every row) and caloriesCarboRatio (the calorie-to-carbohydrate ratio)
mutated_data <- mutate(data, average_rating=mean(rating), caloriesCarboRatio=calories/carbo)
mutated_data

select()

It is used to select only desired variables. It takes a dataframe and variables by name as arguments.

# Select every column except mfr and type columns
selected_data <- select(data, -c(mfr, type))
selected_data

summarise()

summarise() collapses a data frame into summary statistics. It takes a data frame and one or more summary expressions as arguments.

# Summary of the average rating, median rating, and highest protein content across all cereals
summary_data <- summarise(data, average_rating = mean(rating), median_rating = median(rating), 
                          highest_protein = max(protein))
summary_data
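summarise() is often combined with `group_by()` to get one summary row per group rather than one row overall. The sketch below goes slightly beyond the five functions asked for and uses a four-row stand-in data frame (`cereal`, with values rounded from the output shown earlier); the name `by_mfr` is illustrative.

```r
library(dplyr)

# Four-row stand-in for the cereal data
cereal <- data.frame(
  name   = c("100% Bran", "100% Natural Bran", "All-Bran",
             "All-Bran with Extra Fiber"),
  mfr    = c("N", "Q", "K", "K"),
  rating = c(68.4, 34.0, 59.4, 93.7)
)

# One summary row per manufacturer
by_mfr <- cereal %>%
  group_by(mfr) %>%
  summarise(
    n              = n(),            # number of cereals in the group
    average_rating = mean(rating),
    median_rating  = median(rating),
    .groups        = "drop"          # return an ungrouped result
  )
by_mfr
```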

AAQ1

TAN ZI AN U2102755

12/30/2021
