Find and get a dataset from the datasets available within R.
Perform exploratory data analysis (EDA) and prepare a codebook on that dataset using a newer method in R.
Import libraries
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(corrplot)
library(DT)
library(dataMaid)
library(mlbench) # provides BostonHousing; the Boston data frame used below comes from MASS
The Boston housing dataset contains housing data for 506 census tracts of Boston from the 1970 census.
Display dimension of dataset
The dataset contains 14 attributes with 506 entries.
dim(Boston)
## [1] 506 14
Quick glance at the dataset
head(Boston)
Check whether the dataset contains NA values
colSums(is.na(Boston))
## crim zn indus chas nox rm age dis rad tax
## 0 0 0 0 0 0 0 0 0 0
## ptratio black lstat medv
## 0 0 0 0
Structure of dataset
All variables are stored as doubles (num) except ‘chas’ and ‘rad’, which are stored as integers (int).
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The summary of the dataset
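From the summary, ‘crim’, ‘zn’ and ‘black’ show a large gap between median and mean (for example, 0.26 versus 3.61 for ‘crim’), which suggests skewed distributions with many outliers. The gap for ‘rm’ is much smaller (6.21 versus 6.29), but it is included in the outlier check as well.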
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Checking outliers
The box plots confirm that all four variables (‘crim’, ‘zn’, ‘rm’ and ‘black’) contain many outliers.
par(mfrow = c(1, 4), pch=16) # Display 1 x 4 plots
boxplot(Boston$crim, main='crim',col='Grey')
boxplot(Boston$zn, main='zn',col='Yellow')
boxplot(Boston$rm, main='rm',col='Green')
boxplot(Boston$black, main='black',col='Sky Blue')
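The visual impression from the box plots can be backed up with counts. The sketch below assumes the same 1.5 x IQR whisker rule that boxplot() uses by default; `count_outliers` is a helper name introduced here for illustration only.

```r
library(MASS)  # provides the Boston data frame

# Count the points that fall beyond the box plot whiskers (default coef = 1.5)
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Apply the helper to the four variables inspected above
outlier_counts <- sapply(Boston[c("crim", "zn", "rm", "black")], count_outliers)
outlier_counts
```

The result is a named vector with one count per variable, which makes it easier to compare the variables than reading the plots alone.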
Plotting histograms
Histograms enable us to observe the skewness of the data.
In the Boston housing dataset, most variables are skewed. The exception is ‘rm’, which looks approximately normal.
Boston %>%
# gather key (variable name) and value (entry value) as 2 columns except medv and chas
gather(key = "var", value = "value", -medv, -chas) %>%
# ggplot histogram
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~ var, scales = "free") +
# some formatting to make the plot nicer
theme_gray() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Histogram for Variables") # title
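Skewness can also be estimated numerically rather than judged by eye. This is a minimal sketch that uses the moment-based sample skewness; `skewness` is a helper defined here, not a base R function, and other packages may use slightly different estimators.

```r
library(MASS)  # provides the Boston data frame

# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Same variables as in the histograms (medv and chas excluded)
vars <- setdiff(names(Boston), c("medv", "chas"))
skew <- sapply(Boston[vars], skewness)
round(sort(skew), 2)
```

Values near 0 indicate a roughly symmetric distribution, positive values a long right tail, and negative values a long left tail.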
Determine correlations between variables
The corrplot() function from the corrplot package visualises the correlation matrix.
# The darker the color, the stronger the correlation between the variables
corrplot(cor(Boston))
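The plot is easier to interpret alongside the numbers it is drawn from. As a small sketch, the correlations of every variable with ‘medv’ can be pulled out of the same matrix and sorted; `cor_matrix` and `medv_cor` are names chosen here for illustration.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation matrix, the same input that corrplot() receives
cor_matrix <- cor(Boston)

# Correlations with medv, from most positive to most negative
medv_cor <- sort(cor_matrix[, "medv"], decreasing = TRUE)
round(medv_cor[names(medv_cor) != "medv"], 2)
```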
Plotting each variable against per capita crime rate
Boston %>%
# gather key (variable name) and value (entry value) as 2 columns except medv and crim
gather(key = "var", value = "value", -crim, -medv) %>%
# ggplot scatter(point) plot
ggplot(aes(x = value, y = crim)) +
geom_point(shape = 20) +
facet_wrap(~var, scales = "free") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Scatter Plot of Variables against Per capita Crime Rate")
Inspecting linear relation between medv and other variables
Boston %>%
# gather key (variable name) and value (entry value) as 2 columns except medv
gather(key = "var", value = "value", -medv) %>%
# ggplot scatter(point) plot
ggplot(aes(x = value, y = medv)) +
geom_point(shape = 20) +
stat_smooth(formula = y~x, method = "lm", se = TRUE, col = "green") +
facet_wrap(~var, scales = "free") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Scatter Plot of Variables against Median Value (medv)")
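The tidyr documentation marks gather() as superseded by pivot_longer(). The reshaping step used in the plots can be sketched with the newer function as follows; the object names `long` and `p` are assumptions made for this example.

```r
library(MASS)     # provides the Boston data frame
library(tidyr)
library(ggplot2)

# Reshape to long format: one row per (observation, variable) pair, keeping medv
long <- pivot_longer(Boston, cols = -medv, names_to = "var", values_to = "value")

# Same faceted scatter plot with a linear fit as above
p <- ggplot(long, aes(x = value, y = medv)) +
  geom_point(shape = 20) +
  stat_smooth(formula = y ~ x, method = "lm", se = TRUE, col = "green") +
  facet_wrap(~var, scales = "free")
```

Printing `p` draws the plot. The long table has 13 rows per census tract, one for each variable other than ‘medv’.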
Creating code book with dataMaid
The dataset examined has the following dimensions:
| Feature | Result |
|---|---|
| Number of observations | 506 |
| Number of variables | 14 |
| Label | Variable | Class | # unique values | Missing | Description |
|---|---|---|---|---|---|
| crim | crim | numeric | 504 | 0.00 % | Per capita crime rate by town |
| zn | zn | numeric | 26 | 0.00 % | Proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | indus | numeric | 76 | 0.00 % | Proportion of non-retail business acres per town |
| chas | chas | integer | 2 | 0.00 % | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | nox | numeric | 81 | 0.00 % | Nitric oxides concentration (parts per 10 million) |
| rm | rm | numeric | 446 | 0.00 % | Average number of rooms per dwelling |
| age | age | numeric | 356 | 0.00 % | Proportion of owner-occupied units built prior to 1940 |
| dis | dis | numeric | 412 | 0.00 % | Weighted distances to five Boston employment centres |
| rad | rad | integer | 9 | 0.00 % | Index of accessibility to radial highways |
| tax | tax | numeric | 66 | 0.00 % | Full-value property-tax rate per USD 10,000 |
| ptratio | ptratio | numeric | 46 | 0.00 % | Pupil-teacher ratio by town |
| black | black | numeric | 357 | 0.00 % | 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town |
| lstat | lstat | numeric | 455 | 0.00 % | Percentage of lower status of the population |
| medv | medv | numeric | 229 | 0.00 % | Median value of owner-occupied homes in USD 1000s |
crim
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 504 |
| Median | 0.26 |
| 1st and 3rd quartiles | 0.08; 3.68 |
| Min. and max. | 0.01; 88.98 |
zn
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 26 |
| Median | 0 |
| 1st and 3rd quartiles | 0; 12.5 |
| Min. and max. | 0; 100 |
indus
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 76 |
| Median | 9.69 |
| 1st and 3rd quartiles | 5.19; 18.1 |
| Min. and max. | 0.46; 27.74 |
chas
| Feature | Result |
|---|---|
| Variable type | integer |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 2 |
| Median | 0 |
| 1st and 3rd quartiles | 0; 0 |
| Min. and max. | 0; 1 |
nox
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 81 |
| Median | 0.54 |
| 1st and 3rd quartiles | 0.45; 0.62 |
| Min. and max. | 0.38; 0.87 |
rm
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 446 |
| Median | 6.21 |
| 1st and 3rd quartiles | 5.89; 6.62 |
| Min. and max. | 3.56; 8.78 |
age
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 356 |
| Median | 77.5 |
| 1st and 3rd quartiles | 45.02; 94.07 |
| Min. and max. | 2.9; 100 |
dis
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 412 |
| Median | 3.21 |
| 1st and 3rd quartiles | 2.1; 5.19 |
| Min. and max. | 1.13; 12.13 |
rad
| Feature | Result |
|---|---|
| Variable type | integer |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 9 |
| Median | 5 |
| 1st and 3rd quartiles | 4; 24 |
| Min. and max. | 1; 24 |
tax
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 66 |
| Median | 330 |
| 1st and 3rd quartiles | 279; 666 |
| Min. and max. | 187; 711 |
ptratio
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 46 |
| Median | 19.05 |
| 1st and 3rd quartiles | 17.4; 20.2 |
| Min. and max. | 12.6; 22 |
black
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 357 |
| Median | 391.44 |
| 1st and 3rd quartiles | 375.38; 396.22 |
| Min. and max. | 0.32; 396.9 |
lstat
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 455 |
| Median | 11.36 |
| 1st and 3rd quartiles | 6.95; 16.96 |
| Min. and max. | 1.73; 37.97 |
medv
| Feature | Result |
|---|---|
| Variable type | numeric |
| Number of missing obs. | 0 (0 %) |
| Number of unique values | 229 |
| Median | 21.2 |
| 1st and 3rd quartiles | 17.02; 25 |
| Min. and max. | 5; 50 |
Report generation information:
Created by: zianttt (username: ziant).
Report creation time: Fri Dec 31 2021 11:42:11
Report was run from directory: C:/Users/ziant/Desktop/Y1S1/WIA1007 Data Science/Individual assignment/AA1
dataMaid v1.4.1 [Pkg: 2021-10-08 from CRAN (R 4.1.2)]
R version 4.1.1 (2021-08-10).
Platform: x86_64-w64-mingw32/x64 (64-bit)(Windows 10 x64 (build 19042)).
Function call: dataMaid::makeDataReport(data = Boston, mode = c("summarize", "visualize", "check"), smartNum = FALSE, file = "codebook_Boston.Rmd", replace = TRUE, checks = list(character = "showAllFactorLevels", factor = "showAllFactorLevels", labelled = "showAllFactorLevels", haven_labelled = "showAllFactorLevels", numeric = NULL, integer = NULL, logical = NULL, Date = NULL), listChecks = FALSE, maxProbVals = Inf, codebook = TRUE, reportTitle = "Codebook for Boston")
Demonstrate these FIVE (5) functions of dplyr for data manipulation:
i. filter ( )
ii. arrange ( )
iii. mutate ( )
iv. select ( )
v. summarise ( )
Loading and inspecting data.
input <- read.csv("cereal.csv", sep = ";")
df <- data.frame(input)
head(df)
str(df)
## 'data.frame': 78 obs. of 16 variables:
## $ name : chr "String" "100% Bran" "100% Natural Bran" "All-Bran" ...
## $ mfr : chr "Categorical" "N" "Q" "K" ...
## $ type : chr "Categorical" "C" "C" "C" ...
## $ calories: chr "Int" "70" "120" "70" ...
## $ protein : chr "Int" "4" "3" "4" ...
## $ fat : chr "Int" "1" "5" "1" ...
## $ sodium : chr "Int" "130" "15" "260" ...
## $ fiber : chr "Float" "10" "2" "9" ...
## $ carbo : chr "Float" "5" "8" "7" ...
## $ sugars : chr "Int" "6" "8" "5" ...
## $ potass : chr "Int" "280" "135" "320" ...
## $ vitamins: chr "Int" "25" "0" "25" ...
## $ shelf : chr "Int" "3" "3" "3" ...
## $ weight : chr "Float" "1" "1" "1" ...
## $ cups : chr "Float" "0.33" "1" "0.33" ...
## $ rating : chr "Float" "68.402973" "33.983679" "59.425505" ...
The first row is not needed because it is not a data entry. The non-categorical variables, such as rating, weight and carbo, should be converted to numeric to enable calculations.
# Remove first row
data <- df[-1, ]
# Convert character class to numeric class
i <- c(4:16)
data[ , i] <- apply(data[ , i], 2,
function(x) as.numeric(as.character(x)))
str(data)
## 'data.frame': 77 obs. of 16 variables:
## $ name : chr "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
## $ mfr : chr "N" "Q" "K" "K" ...
## $ type : chr "C" "C" "C" "C" ...
## $ calories: num 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : num 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : num 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : num 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : num 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : num 280 135 320 330 -1 70 30 100 125 190 ...
## $ vitamins: num 25 0 25 25 25 25 25 25 25 25 ...
## $ shelf : num 3 3 3 3 3 1 2 3 1 3 ...
## $ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ rating : num 68.4 34 59.4 93.7 34.4 ...
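The same clean-up can also be written with dplyr itself. This is a sketch on a small stand-in data frame (`raw`, built from the first few values shown above) rather than on the real cereal.csv file, and it assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Toy stand-in for the raw data: the first row holds type labels, not data
raw <- data.frame(
  name     = c("String", "100% Bran", "100% Natural Bran", "All-Bran"),
  calories = c("Int", "70", "120", "70"),
  rating   = c("Float", "68.402973", "33.983679", "59.425505"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  slice(-1) %>%                                      # drop the type-label row
  mutate(across(c(calories, rating), as.numeric))    # convert the chosen columns

str(clean)
```

Naming the columns inside across() avoids relying on column positions such as 4:16, which would silently break if the column order changed.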
filter() subsets the rows that match logical conditions. It takes a data frame and one or more logical conditions as arguments. Logical operators such as ‘&’, ‘|’ and ‘!’ can be used to build the conditions.
# Filtering entries with rating higher or equal to 60
good_rating <- filter(data, rating >= 60.00)
# Filtering entries with types equal to C and calories higher than 50
filtered_data <- filter(data, type=='C' & calories > 50)
good_rating
filtered_data
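The ‘|’ and ‘!’ operators mentioned above work the same way as ‘&’. A short sketch on a four-row toy data frame (`cereal`, using rounded values from the first rows of the data) shows each of them.

```r
library(dplyr)

# Toy subset of the cereal data, for illustration only
cereal <- data.frame(
  name     = c("100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with Extra Fiber"),
  type     = c("C", "C", "C", "C"),
  calories = c(70, 120, 70, 50),
  rating   = c(68.4, 34.0, 59.4, 93.7)
)

# OR: keep rows where at least one condition holds
either <- filter(cereal, rating >= 60 | calories > 100)

# NOT: keep rows where the condition does not hold
not_low_cal <- filter(cereal, !(calories <= 50))

# Conditions separated by commas are combined with AND
both <- filter(cereal, type == "C", calories > 50)
```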
arrange() sorts the rows by the variables specified. It takes a data frame and the sorting variables as arguments.
# Sorting the data based on ratings (desc means in descending order) and calories
data_sorted <- arrange(data, desc(rating), calories)
data_sorted
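When several sorting variables are given, later ones only break ties in earlier ones. The sketch below illustrates this on a toy data frame (`cereal`, rounded values from the first rows of the data).

```r
library(dplyr)

# Toy subset of the cereal data, for illustration only
cereal <- data.frame(
  name     = c("100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with Extra Fiber"),
  calories = c(70, 120, 70, 50),
  rating   = c(68.4, 34.0, 59.4, 93.7)
)

# Sort by calories (ascending); rows with equal calories are ordered by rating (descending)
sorted <- arrange(cereal, calories, desc(rating))
sorted
```

The two cereals with 70 calories end up next to each other, with the higher-rated one first.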
mutate() creates new variables. It takes a data frame and expressions as arguments.
# Create new variables: average_rating (the overall mean rating, repeated in every row) and caloriesCarboRatio (the calorie-to-carbohydrate ratio)
mutated_data <- mutate(data, average_rating=mean(rating), caloriesCarboRatio=calories/carbo)
mutated_data
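Because mean() returns a single number, mutate() recycles it into every row. A common use of such a column is to compare each row against it, as in this sketch on a toy data frame; `cereal` and `rating_diff` are names introduced here for illustration.

```r
library(dplyr)

# Toy subset of the cereal data, for illustration only
cereal <- data.frame(
  name   = c("100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with Extra Fiber"),
  rating = c(68.4, 34.0, 59.4, 93.7)
)

with_mean <- mutate(
  cereal,
  average_rating = mean(rating),            # same value in every row
  rating_diff    = rating - average_rating  # later expressions can use earlier ones
)
with_mean
```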
select() keeps only the desired variables. It takes a data frame and variable names as arguments.
# Select every column except mfr and type columns
selected_data <- select(data, -c(mfr, type))
selected_data
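Besides naming columns directly, select() accepts selection helpers. The following sketch, on a toy data frame and assuming dplyr 1.0 or later for where(), keeps the name column plus every numeric column.

```r
library(dplyr)

# Toy subset of the cereal data, for illustration only
cereal <- data.frame(
  name     = c("100% Bran", "100% Natural Bran", "All-Bran"),
  mfr      = c("N", "Q", "K"),
  type     = c("C", "C", "C"),
  calories = c(70, 120, 70),
  rating   = c(68.4, 34.0, 59.4)
)

# Keep name, then every column for which is.numeric() returns TRUE
numeric_only <- select(cereal, name, where(is.numeric))
names(numeric_only)
```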
summarise() collapses data into summary values. It takes a data frame and summary functions as arguments.
# Summary for the average rating, median rating, and highest protein content of all cereal brands
summary_data <- summarise(data, average_rating = mean(rating), median_rating = median(rating),
highest_protein = max(protein))
summary_data
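summarise() is often combined with group_by() to get one summary row per group. This is a minimal sketch on a toy data frame (`cereal`, rounded values from the first rows of the data), grouping by manufacturer.

```r
library(dplyr)

# Toy subset of the cereal data, for illustration only
cereal <- data.frame(
  name   = c("100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with Extra Fiber"),
  mfr    = c("N", "Q", "K", "K"),
  rating = c(68.4, 34.0, 59.4, 93.7)
)

# One row per manufacturer with the number of cereals and their mean rating
by_mfr <- cereal %>%
  group_by(mfr) %>%
  summarise(n = n(), average_rating = mean(rating), .groups = "drop")
by_mfr
```

The `.groups = "drop"` argument returns an ungrouped result, so later steps are not affected by the grouping.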