COURSEWORK SUBMISSION SHEET
Student’s Name: Qudus Tewogbade Saka
Registration No: B00982741
Course Title: MSc Data Science
COM736: Data Validation and Visualisation
CRN: 61089
Coursework 1
This item of coursework will contribute 50% of the overall module marks. The solutions to all the following exercises need to be submitted to the module assessment area on Blackboard, as a lab-based assignment, by Friday in the ninth week (i.e. 12.00 noon on Friday, 28/3/2025), contributing to your portfolio of evidence relating to Data Validation and Visualisation exercises. You may like to use this file to present your functioning code along with program outputs through this R Markdown document (http://rmarkdown.rstudio.com).
Exercise 1. The dataset mpg is part of the ggplot2 package. It contains a subset of the fuel economy data that the Environmental Protection Agency (EPA) makes available on https://fueleconomy.gov/. It contains only car models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. It is a data frame with 234 rows and 11 variables: manufacturer name (manufacturer), model name (model), engine displacement (displ), year of manufacture (year), number of cylinders (cyl), type of transmission (trans), type of drive train (drv), city miles per gallon (cty), highway miles per gallon (hwy), fuel type (fl) and type of car (class). Applying an appropriate R data visualisation method to the mpg data, perform the following tasks.
(a). Write code that displays a box-plot graph which plots in the order of decreasing medians of the vehicle’s miles-per-gallon on highway (hwy) against their manufacturers. Plot the graph and list the manufacturers in the order of fuel efficiency of their vehicles. Using the graph, find out which companies produce the most and the least fuel efficient vehicles. [7 marks]
(b). Write code that displays a graph which plots in the order of
decreasing medians of the vehicle’s miles-per-gallon on highway (hwy)
against the type of car (class). Plot the graph and list the classes of
vehicle in the order of their fuel efficiency.
[7 marks]
(c). Draw a bar chart of manufacturers in terms of numbers of different types of cars manufactured. Based on this, comment on classes of vehicles manufactured by the companies producing the most and the least fuel efficient vehicles and possible reason(s) for highest/lowest fuel efficiency. [6 marks]
# Code for the above exercise
#######################
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(mpg)
# (a) Boxplot of highway mpg (hwy) against manufacturers ordered by median fuel efficiency
mpg_manufacturer <- mpg %>%
  group_by(manufacturer) %>%
  summarise(median_hwy = median(hwy, na.rm = TRUE)) %>%
  arrange(desc(median_hwy))
ggplot(mpg, aes(x = reorder(manufacturer, -hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway Fuel Efficiency by Manufacturer", x = "Manufacturer", y = "Highway MPG") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mpg_manufacturer)
## # A tibble: 15 × 2
## manufacturer median_hwy
## <chr> <dbl>
## 1 honda 32
## 2 volkswagen 29
## 3 hyundai 26.5
## 4 audi 26
## 5 nissan 26
## 6 pontiac 26
## 7 subaru 26
## 8 toyota 26
## 9 chevrolet 23
## 10 jeep 18.5
## 11 ford 18
## 12 mercury 18
## 13 dodge 17
## 14 lincoln 17
## 15 land rover 16.5
# (b) Boxplot of highway mpg (hwy) against car class ordered by median fuel efficiency
mpg_class <- mpg %>%
  group_by(class) %>%
  summarise(median_hwy = median(hwy, na.rm = TRUE)) %>%
  arrange(desc(median_hwy))
ggplot(mpg, aes(x = reorder(class, -hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway Fuel Efficiency by Car Class", x = "Car Class", y = "Highway MPG") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(mpg_class)
## # A tibble: 7 × 2
## class median_hwy
## <chr> <dbl>
## 1 compact 27
## 2 midsize 27
## 3 subcompact 26
## 4 2seater 25
## 5 minivan 23
## 6 suv 17.5
## 7 pickup 17
# (c) Bar chart of manufacturers and the number of different types of cars produced
mpg_manufacturer_class <- mpg %>%
  group_by(manufacturer, class) %>%
  summarise(count = n(), .groups = 'drop')
# The counts are already computed above, so map them to y and draw them with
# geom_col(); stat = "count" would only count rows of the summary (1 per bar).
ggplot(mpg_manufacturer_class, aes(x = manufacturer, y = count, fill = class)) +
  geom_col(position = "dodge") +
  labs(title = "Number of Different Car Types by Manufacturer", x = "Manufacturer", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
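Part (c) can also be read as asking how many distinct vehicle classes each manufacturer covers. The sketch below is one possible way to tabulate that; the names `class_variety`, `n_classes` and `n_rows` are illustrative choices, not part of the original submission.

```r
library(ggplot2)  # provides the mpg data
library(dplyr)

# For each manufacturer: number of distinct vehicle classes and number of rows
class_variety <- mpg %>%
  group_by(manufacturer) %>%
  summarise(
    n_classes = n_distinct(class),  # how many different car types appear
    n_rows    = n(),                # how many rows (model variants) in mpg
    .groups   = "drop"
  ) %>%
  arrange(desc(n_classes))

print(class_variety)

# Bar chart of class variety, widest range of classes first
variety_plot <- ggplot(class_variety,
                       aes(x = reorder(manufacturer, -n_classes), y = n_classes)) +
  geom_col() +
  labs(title = "Distinct Vehicle Classes per Manufacturer",
       x = "Manufacturer", y = "Number of classes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(variety_plot)
```

Reading this table next to the median highway mpg tables from (a) and (b) may help link a manufacturer's fuel efficiency to the classes it produces.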
Exercise 2. The diamonds dataset within R’s ggplot2 contains 10 columns (price, carat, cut, color, clarity, length (x), width (y), depth (z), depth percentage, top width) for 53,940 different diamonds. Using this dataset, carry out the following tasks.
(a). Write code to plot histograms for carat and price. Plot these
graphs and comment on their shapes.
[6 marks]
(b). Write code to plot bar charts of cut proportioned in terms of color and again bar charts of cuts proportioned in terms of clarity. Comment on how proportions of diamonds change in terms of clarity and colour under different cut categories. [6 marks]
(c). Write code to display an appropriate graph that facilitates the investigation of a three-way relationship between cut, carat and price. Plot the graph. What inferences can you draw regarding the three-way relationship? [8 marks]
#Code for the above exercise.
#######################
data(diamonds)
# (a) Histograms for carat and price
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Distribution of Carat", x = "Carat", y = "Count")
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "red", color = "black") +
  labs(title = "Distribution of Price", x = "Price", y = "Count")
# (b) Bar charts of cut proportioned in terms of color and clarity
ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +
  labs(title = "Color Proportions within Each Cut", x = "Cut", y = "Proportion")
ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill") +
  labs(title = "Clarity Proportions within Each Cut", x = "Cut", y = "Proportion")
# (c) Three-way relationship between cut, carat, and price
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.3) +
  labs(title = "Carat vs Price by Cut", x = "Carat", y = "Price") +
  theme_minimal()
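With more than 50,000 overlapping points, the coloured scatter plot for (c) can be hard to read. As a hedged complement, the sketch below summarises carat and price per cut and draws one panel per cut on log scales. The object names (`cut_summary`, `facet_plot`) and the choice of log axes are assumptions made for illustration, not the submission's method.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Per-cut medians: is a price difference between cuts partly a carat difference?
cut_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    median_carat           = median(carat),
    median_price           = median(price),
    median_price_per_carat = median(price / carat),
    .groups = "drop"
  )
print(cut_summary)

# One panel per cut; log-log axes spread out the dense low-carat region
facet_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~ cut) +
  labs(title = "Carat vs Price, One Panel per Cut",
       x = "Carat (log scale)", y = "Price (log scale)") +
  theme_minimal()
print(facet_plot)
```

Comparing `median_carat` across cuts is a quick check on whether carat confounds the apparent effect of cut on price.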
Exercise 3. Before selecting a particular machine learning technique for a data science problem, it is important to study the data distribution, particularly through visualization. However, visualizing multivariate data with more than two variables is difficult in a two-dimensional plot. In this exercise, you are required to study R’s iris dataset, which is multivariate data consisting of four features or properties (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) characterizing three species of iris flower (setosa, versicolor, and virginica). Principal component analysis (PCA) is a technique that can help facilitate visualization of a multivariate data distribution. The first two principal components (PC1 and PC2) obtained after applying PCA can explain the majority of the variation in the data. In order to study the data variability in the iris dataset, perform the following tasks.
(a). Write code to obtain PC scores.
[6 marks]
(b). Write code to obtain a scatter plot representing PC1 vs. PC2,
wherein data clusters corresponding to three flower types are clearly
marked using possibly an ellipsoid. Run the codes to make the scatter
plot, mark flowers and comment on the feature distribution.
[10 marks]
(c). Write code to obtain correlation heatmap between PC scores and comment on the appropriateness of the map. [4 marks]
#Code for the above exercise.
#######################
data(iris)
# (a) Perform PCA on the numerical features
iris_pca <- prcomp(iris[,1:4], center = TRUE, scale. = TRUE)
print(summary(iris_pca))
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
# (b) Scatter plot of PC1 vs. PC2 with ellipsoids for species
library(ggplot2)
# Keep the scores and the species labels in one data frame so that aes()
# refers only to columns of the plotted data
pca_scores <- data.frame(iris_pca$x, Species = iris$Species)
ggplot(pca_scores, aes(x = PC1, y = PC2, color = Species)) +
  stat_ellipse(aes(fill = Species), geom = "polygon", alpha = 0.2) +
  geom_point(size = 3) +
  labs(title = "PCA of Iris Dataset", x = "PC1", y = "PC2", color = "Species", fill = "Species") +
  theme_minimal()
# (c) Correlation heatmap between PCA scores
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
iris_pca_scores <- as.data.frame(iris_pca$x)
corr_matrix <- cor(iris_pca_scores)
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.col = "black")
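For part (a), the PC scores are the columns of `iris_pca$x`. The sketch below, using base R only, shows one way to check them: the scores are rebuilt by projecting the standardised data onto the loadings, then the off-diagonal correlations between scores are inspected. PC scores are uncorrelated by construction, which bears on how informative the heatmap in (c) can be. The names `scores_manual`, `max_diff` and `off_diag` are illustrative.

```r
# PCA on the four standardised iris measurements
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# (a) PC scores: first few rows
print(head(pca$x))

# Rebuild the scores by hand: standardised data times the rotation (loadings)
X <- scale(iris[, 1:4], center = TRUE, scale = TRUE)
scores_manual <- X %*% pca$rotation
max_diff <- max(abs(scores_manual - pca$x))

# Correlations between different PCs should be numerically zero
score_cor <- cor(pca$x)
off_diag <- score_cor[upper.tri(score_cor)]

cat("Largest gap between manual and prcomp scores:", max_diff, "\n")
cat("Largest absolute off-diagonal correlation:", max(abs(off_diag)), "\n")
```

A heatmap of correlations between the original four features and the PC scores, for example `cor(iris[, 1:4], pca$x)`, may be a more informative alternative.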
Exercise 4. In this task, you are required to analyze the Animals dataset from the MASS package. This dataset contains brain weight (in grams) and body weight (in kilograms) for 28 different animal species. The three largest animals are dinosaurs, whose measurements are obviously the result of scientific modeling rather than precise measurements.
A scatter plot given below fails to describe any obvious relationship between the brain weight and body weight variables. You are required to apply appropriate power transformations to the variables to obtain a more interpretable plot and describe the obtained relationship. To this end, undertake the following tasks.
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
data(Animals)
ggplot(Animals, aes(x = brain, y = body)) + geom_point()
Task-1. Check whether each of the variables has normal distribution. Your response should be based on appropriate confirmatory statistical tests as well as smoothed histogram plots. [10 marks]
#Code for the above task.
#########################
library(ggplot2)
library(MASS)
library(e1071)
data(Animals)
par(mfrow = c(1, 2))
hist(Animals$brain, probability = TRUE, main = "Brain Weight Distribution", col = "lightblue", xlab = "Brain Weight")
lines(density(Animals$brain), col = "red", lwd = 2)
hist(Animals$body, probability = TRUE, main = "Body Weight Distribution", col = "lightblue", xlab = "Body Weight")
lines(density(Animals$body), col = "red", lwd = 2)
shapiro.test(Animals$brain)
##
## Shapiro-Wilk normality test
##
## data: Animals$brain
## W = 0.45173, p-value = 3.763e-09
shapiro.test(Animals$body)
##
## Shapiro-Wilk normality test
##
## data: Animals$body
## W = 0.27831, p-value = 1.115e-10
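The Shapiro-Wilk p-values above reject normality but do not say in which direction the distributions depart from it. As a minimal sketch, the moment-based skewness of each variable can be computed with a small helper. `sample_skewness` is a hypothetical function written for this illustration; `skewness()` from the e1071 package, which the document already loads, should give a comparable figure.

```r
animals <- MASS::Animals  # brain weight (g) and body weight (kg)

# Hypothetical helper: moment-based sample skewness
# (about 0 for a symmetric sample, positive for a long right tail)
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew <- sapply(animals[, c("brain", "body")], sample_skewness)
print(skew)
```

Positive values for both variables indicate long right tails, which is the situation a power transformation with a small lambda is meant to address.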
Task-2. A power transformation of a variable X consists of raising X
to the power lambda. Using an appropriate statistical test and/or plot,
find the best lambda values needed for transforming each of the variables
requiring power transformation.
[10 marks]
#Code for the above task.
#########################
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
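# Box-Cox profile log-likelihood for an intercept-only model of each variable;
# boxCox() draws the plot and returns the lambda grid (x) and log-likelihood (y)
lambda_brain <- boxCox(lm(brain ~ 1, data = Animals), lambda = seq(-2, 2, by = 0.1))
lambda_body <- boxCox(lm(body ~ 1, data = Animals), lambda = seq(-2, 2, by = 0.1))
# The best lambda is the grid value that maximises the log-likelihood
lambda_brain_value <- lambda_brain$x[which.max(lambda_brain$y)]
lambda_body_value <- lambda_body$x[which.max(lambda_body$y)]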
cat("Optimal lambda for Brain Weight:", lambda_brain_value, "\n")
## Optimal lambda for Brain Weight: 0.1010101
cat("Optimal lambda for Body Weight:", lambda_body_value, "\n")
## Optimal lambda for Body Weight: 0.02020202
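To make the quantity maximised in the Box-Cox plot explicit, the sketch below computes the profile log-likelihood for an intercept-only model in base R and maximises it with `optimize()`. This is an illustrative re-derivation under the standard Box-Cox formulation, not the `car` implementation. `boxcox_loglik` is a hypothetical helper name, and the continuous optimum should land close to, but need not equal, the grid values reported above.

```r
animals <- MASS::Animals

# Hypothetical helper: Box-Cox profile log-likelihood for an intercept-only model
#   z = (y^lambda - 1) / lambda   (log(y) when lambda is 0)
#   loglik = -n/2 * log(RSS / n) + (lambda - 1) * sum(log(y))
boxcox_loglik <- function(lambda, y) {
  n <- length(y)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum((z - mean(z))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

# Maximise over the same range as the grid search, lambda in [-2, 2]
opt_brain <- optimize(boxcox_loglik, interval = c(-2, 2),
                      y = animals$brain, maximum = TRUE)
opt_body  <- optimize(boxcox_loglik, interval = c(-2, 2),
                      y = animals$body, maximum = TRUE)

cat("Continuous-optimum lambda, brain:", opt_brain$maximum, "\n")
cat("Continuous-optimum lambda, body: ", opt_body$maximum, "\n")
```

Because both estimates sit near zero, a plain log transformation is a commonly used and more interpretable alternative to the fitted powers.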
Task-3. Apply power transformation and verify whether transformed variables have a normal distribution through confirmatory statistical tests as well as smoothed histogram plots. [10 marks]
#Code for the above task.
#########################
library(car)
Animals$brain_transformed <- Animals$brain^lambda_brain_value
Animals$body_transformed <- Animals$body^lambda_body_value
par(mfrow = c(1, 2))
hist(Animals$brain_transformed, probability = TRUE, main = "Transformed Brain Weight", col = "lightgreen", xlab = "Transformed Brain")
lines(density(Animals$brain_transformed), col = "red", lwd = 2)
hist(Animals$body_transformed, probability = TRUE, main = "Transformed Body Weight", col = "lightgreen", xlab = "Transformed Body")
lines(density(Animals$body_transformed), col = "red", lwd = 2)
shapiro.test(Animals$brain_transformed)
##
## Shapiro-Wilk normality test
##
## data: Animals$brain_transformed
## W = 0.97199, p-value = 0.6349
shapiro.test(Animals$body_transformed)
##
## Shapiro-Wilk normality test
##
## data: Animals$body_transformed
## W = 0.98441, p-value = 0.9396
Task-4. Create a scatter plot of the transformed data. Based on the visual inspection of the plot, provide your interpretation of the relationship between brain weight and body weight variables. You may like to add an appropriate smoothed line curve to your plot to help in interpretation. [10 marks]
#Code for the above task.
#########################
library(ggplot2)
transformed_plot <- ggplot(Animals, aes(x = brain_transformed, y = body_transformed)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Transformed Brain vs. Body Weight") +
  xlab("Transformed Brain Weight") +
  ylab("Transformed Body Weight") +
  theme_minimal()
print(transformed_plot)
## `geom_smooth()` using formula = 'y ~ x'
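To put a number on the relationship seen in the transformed scatter plot, one option is the correlation on the log scale, which the near-zero lambdas suggest is a reasonable stand-in for the fitted powers. The sketch below is an assumption-laden illustration, not part of the original submission. It treats the three heaviest animals as the dinosaurs mentioned in the exercise text and compares the correlation with and without them; the names `dino_idx`, `cor_all` and `cor_no_dino` are illustrative.

```r
animals <- MASS::Animals

# The exercise text says the three largest animals are dinosaurs;
# locate them by body weight rather than by row name
dino_idx <- order(animals$body, decreasing = TRUE)[1:3]
cat("Three heaviest animals:", rownames(animals)[dino_idx], "\n")

# Pearson correlation between log brain weight and log body weight
cor_all <- cor(log(animals$brain), log(animals$body))
cor_no_dino <- cor(log(animals$brain[-dino_idx]), log(animals$body[-dino_idx]))

cat("Correlation on log scale, all animals:      ", cor_all, "\n")
cat("Correlation on log scale, without dinosaurs:", cor_no_dino, "\n")
```

If the two figures differ noticeably, the dinosaurs are influential points and the fitted line in the plot should be interpreted with that in mind.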