UNIVERSITY OF ULSTER
FACULTY OF COMPUTING, ENGINEERING AND BUILT ENVIRONMENT
COURSEWORK SUBMISSION SHEET
This sheet must be completed in full and attached to the front of each item of assessment before submission to Module Coordinator/Instructor, Professor Girijesh Prasad, via Blackboard Learn.
Student’s Name: OLATUNDE AYOOLA IBRAHIM
Registration No.: B00869150
Course Title: MSc Data Science
Module Code/Title: COM736: Data Validation and Visualisation
Instructor: Professor Girijesh Prasad
Date Due: Friday 25th March 2022
Submitted work is subject to the following assessment policies:
Coursework must be submitted by the specified date.
Students may seek prior consent from the Course Director to submit coursework after the official deadline; such requests must be accompanied by a satisfactory explanation, and in the case of illness by a medical certificate.
Coursework submitted without consent after the deadline will not normally be accepted and will therefore receive a mark of zero.
I declare that this is all my own work and that any material I have referred to has been accurately referenced. I have read the University’s policy on plagiarism and understand the definition of plagiarism. If it is shown that material has been plagiarised, or I have otherwise attempted to obtain an unfair advantage for myself or others, I understand that I may face sanctions in accordance with the policies and procedures of the University. A mark of zero may be awarded and the reason for that mark will be recorded on my file.
Course Title: - Data Validation and Visualisation COM736(17946)
# Load all required libraries
library(ggplot2)
library(gridExtra)
library(MASS)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(e1071)
| COM736: Data Validation and Visualisation |
| CRN: 17946 |
| Coursework 1 |
This item of coursework will contribute 50% of the overall module marks. The solutions to all the following exercises need to be submitted to the module assessment area of Blackboard, as a lab-based assignment, by the end of the day on Friday in the ninth week (i.e. 25th March 2022), contributing to your portfolio of evidence relating to Data Validation and Visualisation exercises. You may like to use this file to present your functioning code along with program outputs through this R Markdown document (http://rmarkdown.rstudio.com).
Exercise 1. The dataset mpg is part of the ggplot2 package. It contains a subset of the fuel economy data that the US Environmental Protection Agency (EPA) makes available on https://fueleconomy.gov/. It contains only car models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. It is a dataframe with 234 rows and 11 variables: manufacturer name (manufacturer), model name (model), engine displacement (displ), year of manufacture (year), number of cylinders (cyl), type of transmission (trans), type of drive train (drv), city miles per gallon (cty), highway miles per gallon (hwy), fuel type (fl) and type of car (class). Applying an appropriate R data visualisation method on the mpg data, perform the following tasks.
(a). Write code that displays a graph which plots in the order of
decreasing medians of the vehicle’s miles-per-gallon on highway (hwy)
against their manufacturers. Plot the graph and list the manufacturers
in the order of fuel efficiency of their vehicles. Using the graph, find
out which companies produce the most and the least fuel efficient
vehicles.
[8 marks]
# Code for the above exercise
#(a)
data(mpg)
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
# Order the manufacturers by decreasing median hwy by negating hwy before taking the median
fetchMedianManufacturer = with(mpg, reorder(manufacturer, -hwy, median))
fetchMedianManufacturer
## [1] audi audi audi audi audi audi
## [7] audi audi audi audi audi audi
## [13] audi audi audi audi audi audi
## [19] chevrolet chevrolet chevrolet chevrolet chevrolet chevrolet
## [25] chevrolet chevrolet chevrolet chevrolet chevrolet chevrolet
## [31] chevrolet chevrolet chevrolet chevrolet chevrolet chevrolet
## [37] chevrolet dodge dodge dodge dodge dodge
## [43] dodge dodge dodge dodge dodge dodge
## [49] dodge dodge dodge dodge dodge dodge
## [55] dodge dodge dodge dodge dodge dodge
## [61] dodge dodge dodge dodge dodge dodge
## [67] dodge dodge dodge dodge dodge dodge
## [73] dodge dodge ford ford ford ford
## [79] ford ford ford ford ford ford
## [85] ford ford ford ford ford ford
## [91] ford ford ford ford ford ford
## [97] ford ford ford honda honda honda
## [103] honda honda honda honda honda honda
## [109] hyundai hyundai hyundai hyundai hyundai hyundai
## [115] hyundai hyundai hyundai hyundai hyundai hyundai
## [121] hyundai hyundai jeep jeep jeep jeep
## [127] jeep jeep jeep jeep land rover land rover
## [133] land rover land rover lincoln lincoln lincoln mercury
## [139] mercury mercury mercury nissan nissan nissan
## [145] nissan nissan nissan nissan nissan nissan
## [151] nissan nissan nissan nissan pontiac pontiac
## [157] pontiac pontiac pontiac subaru subaru subaru
## [163] subaru subaru subaru subaru subaru subaru
## [169] subaru subaru subaru subaru subaru toyota
## [175] toyota toyota toyota toyota toyota toyota
## [181] toyota toyota toyota toyota toyota toyota
## [187] toyota toyota toyota toyota toyota toyota
## [193] toyota toyota toyota toyota toyota toyota
## [199] toyota toyota toyota toyota toyota toyota
## [205] toyota toyota toyota volkswagen volkswagen volkswagen
## [211] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [217] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [223] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [229] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## attr(,"scores")
## audi chevrolet dodge ford honda hyundai jeep
## -26.0 -23.0 -17.0 -18.0 -32.0 -26.5 -18.5
## land rover lincoln mercury nissan pontiac subaru toyota
## -16.5 -17.0 -18.0 -26.0 -26.0 -26.0 -26.0
## volkswagen
## -29.0
## 15 Levels: honda volkswagen hyundai audi nissan pontiac subaru ... land rover
#using boxplot for visualisation
boxplot(hwy ~ fetchMedianManufacturer, data = mpg,
xlab = "Manufacturers", ylab = "Miles/Gallon on Highway",
main = "hwy vs manufacturers in order of decreasing median ", varwidth = TRUE,
col = "green")
# Fuel efficiency decreases from Honda (highest median hwy, 32 mpg) to Land Rover (lowest median hwy, 16.5 mpg), so Honda produces the most and Land Rover the least fuel-efficient vehicles. The order is given below.
# honda volkswagen hyundai audi nissan pontiac subaru toyota chevrolet jeep ... land rover
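The ranking read from the boxplot can also be checked numerically. The following is a minimal sketch, assuming only that ggplot2 is installed (it supplies mpg); the names `hwyMedians`, `mostEfficient` and `leastEfficient` are illustrative and not part of the original submission.

```r
library(ggplot2)  # provides the mpg dataset

# Median highway mpg per manufacturer, sorted from most to least efficient
hwyMedians <- sort(tapply(mpg$hwy, mpg$manufacturer, median), decreasing = TRUE)
print(hwyMedians)

# The first and last names are the most and least fuel-efficient manufacturers
mostEfficient <- names(hwyMedians)[1]
leastEfficient <- names(hwyMedians)[length(hwyMedians)]
```

Sorting the medians directly avoids printing all 234 factor values just to read off the level order.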
(b). Write code that displays a graph which plots in the order of
decreasing medians of the vehicle’s miles-per-gallon on highway (hwy)
against the type of car (class). Plot the graph and list the classes of
vehicle in the order of their fuel efficiency.
[8 marks]
#(b)
# Order the classes by decreasing median hwy by negating hwy before taking the median
fetchMedianClass = with(mpg, reorder(class, -hwy, median))
fetchMedianClass
## [1] compact compact compact compact compact compact
## [7] compact compact compact compact compact compact
## [13] compact compact compact midsize midsize midsize
## [19] suv suv suv suv suv 2seater
## [25] 2seater 2seater 2seater 2seater suv suv
## [31] suv suv midsize midsize midsize midsize
## [37] midsize minivan minivan minivan minivan minivan
## [43] minivan minivan minivan minivan minivan minivan
## [49] pickup pickup pickup pickup pickup pickup
## [55] pickup pickup pickup suv suv suv
## [61] suv suv suv suv pickup pickup
## [67] pickup pickup pickup pickup pickup pickup
## [73] pickup pickup suv suv suv suv
## [79] suv suv suv suv suv pickup
## [85] pickup pickup pickup pickup pickup pickup
## [91] subcompact subcompact subcompact subcompact subcompact subcompact
## [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize midsize midsize midsize midsize midsize
## [115] midsize subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv suv suv suv
## [127] suv suv suv suv suv suv
## [133] suv suv suv suv suv suv
## [139] suv suv suv compact compact midsize
## [145] midsize midsize midsize midsize midsize midsize
## [151] suv suv suv suv midsize midsize
## [157] midsize midsize midsize suv suv suv
## [163] suv suv suv subcompact subcompact subcompact
## [169] subcompact compact compact compact compact suv
## [175] suv suv suv suv suv midsize
## [181] midsize midsize midsize midsize midsize midsize
## [187] compact compact compact compact compact compact
## [193] compact compact compact compact compact compact
## [199] suv suv pickup pickup pickup pickup
## [205] pickup pickup pickup compact compact compact
## [211] compact compact compact compact compact compact
## [217] compact compact compact compact compact subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize
## [229] midsize midsize midsize midsize midsize midsize
## attr(,"scores")
## 2seater compact midsize minivan pickup subcompact suv
## -25.0 -27.0 -27.0 -23.0 -17.0 -26.0 -17.5
## Levels: compact midsize subcompact 2seater minivan suv pickup
#using boxplot for visualisation
boxplot(hwy ~ fetchMedianClass, data = mpg,
xlab = "Class", ylab = "Miles/Gallon on Highway",
main = "hwy vs class in order of decreasing median ", varwidth = TRUE,
col = "blue")
# Fuel efficiency decreases from compact (highest median hwy, 27 mpg, tied with midsize) to pickup (lowest median hwy, 17 mpg). The order is given below.
# compact midsize subcompact 2seater minivan suv pickup
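The same ordered boxplot can be drawn with ggplot2 instead of base graphics. This is only an alternative sketch, not the method used above; `classOrder` and `classBox` are illustrative names, and the fill colour is an arbitrary choice.

```r
library(ggplot2)  # provides mpg and the plotting functions

# Class levels ordered by decreasing median hwy (negating hwy reverses the order)
classOrder <- levels(with(mpg, reorder(class, -hwy, FUN = median)))
print(classOrder)

# Boxplot with the x axis sorted by the same ordering
classBox <- ggplot(mpg, aes(x = reorder(class, -hwy, FUN = median), y = hwy)) +
  geom_boxplot(fill = "steelblue", varwidth = TRUE) +
  labs(x = "Class", y = "Miles/Gallon on Highway",
       title = "hwy vs class in order of decreasing median")
print(classBox)
```

Doing the reordering inside `aes()` keeps the ordering and the plot in one expression, so no intermediate factor has to be stored.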
(c). Draw a bar chart of manufacturers in terms of numbers of
different types of cars manufactured. Based on this, comment on classes
of vehicles manufactured by the companies producing the most and the
least fuel efficient vehicles and possible reason(s) for highest/lowest
fuel efficiency.
[4 marks]
# Code for the above exercise
#(c)
ggplot(mpg, aes(x= manufacturer, fill = class)) + geom_bar() + ggtitle("Bar chart of manufacturers based on product class")
# From the chart, SUVs appear to consume the most fuel: the least fuel-efficient manufacturer (Land Rover) produces only SUVs.
# Subcompact, followed by compact, vehicles are the most fuel-efficient, and they make up most of the output of the manufacturers with the most efficient vehicles. Honda, for example, produces only subcompact cars in this dataset.
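The class mix behind the stacked bars can be confirmed with a cross-tabulation. A minimal sketch, assuming ggplot2 is installed; `classCounts` is an illustrative name.

```r
library(ggplot2)  # provides the mpg dataset

# Number of models per manufacturer and class
classCounts <- table(mpg$manufacturer, mpg$class)

# Rows for the most and least fuel-efficient manufacturers
print(classCounts[c("honda", "land rover"), ])
```

Reading the two rows side by side shows which classes each manufacturer builds without having to judge bar segment heights by eye.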
Exercise 2. The diamonds dataset within R’s ggplot2 contains 10 columns (price, carat, cut, color, clarity, length(x), width(y), depth(z), depth percentage, top width) for 53940 different diamonds. Using this dataset, carry out the following tasks.
(a). Write code to plot histograms for carat and price. Plot these
graphs and comment on their shapes.
[6 marks]
#Code for the above exercise.
data("diamonds")
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
caratPlot=ggplot(diamonds, aes(x=carat)) + geom_histogram()
pricePlot=ggplot(diamonds, aes(x=price)) + geom_histogram()
grid.arrange(caratPlot, pricePlot, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
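# Based on their shapes, both histograms are skewed to the right (positive skew): most diamonds have a low carat and a low price, with a long tail towards higher values. This agrees with the summary above, where the mean exceeds the median for both variables.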
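A numeric skewness value can complement the visual reading of the histograms. The sketch below uses a hand-written moment-based helper, `sampleSkewness`, which is a hypothetical name introduced here; the `skewness()` function in the e1071 package loaded earlier could be used instead. A positive value indicates a longer right tail and a negative value a longer left tail.

```r
library(ggplot2)  # provides the diamonds dataset

# Hypothetical helper: moment-based sample skewness (third central moment
# divided by the 1.5th power of the second central moment)
sampleSkewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

caratSkew <- sampleSkewness(diamonds$carat)
priceSkew <- sampleSkewness(diamonds$price)
print(c(carat = caratSkew, price = priceSkew))
```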
(b). Write code to plot bar charts of cut proportioned in terms of color and again bar charts of cuts proportioned in terms of clarity. Comment on how proportions of diamonds change in terms of clarity and colour under different cut categories. [6 marks]
cutFillColor = ggplot(diamonds, aes(x= cut, fill = color)) + geom_bar() + ggtitle("Barchart of cut proportioned in terms of color")
cutFillClarity = ggplot(diamonds, aes(x= cut, fill = clarity)) +geom_bar() + ggtitle("Barchart of cut proportioned in terms of clarity")
grid.arrange(cutFillColor,cutFillClarity, ncol=2)
# For the Fair cut category, almost all the colours are evenly represented, whereas IF clarity is rare or absent.
# For the Good cut category, the chart shows that colours E and F are slightly more common than the rest, whereas I1, SI2 and SI1 clarity dominate the category.
# For the Very Good cut category, colour J is scarce compared with the remaining colours, and IF clarity is rare.
# For the Premium cut category, IF clarity is still the rarest, while all colours are well represented.
#For Ideal Cut category, we can almost say the number of Color D is equal to that of I1 clarity.
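Because the cut categories differ greatly in size, stacked counts make it hard to compare shares between them. One possible alternative, sketched here as an assumption rather than as the method used above, is to normalise each bar to 1 with `position = "fill"` and to tabulate the row proportions. The object names are illustrative.

```r
library(ggplot2)  # provides diamonds and the plotting functions

# Each bar is scaled to 1, so segment heights are within-cut proportions
cutColorProp <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Proportion of colours within each cut")
cutClarityProp <- ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "Proportion of clarity grades within each cut")
print(cutColorProp)
print(cutClarityProp)

# The same proportions as a table: each row (cut) sums to 1
colorShare <- prop.table(table(diamonds$cut, diamonds$color), margin = 1)
print(round(colorShare, 3))
```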
(c). Write code to display an appropriate graph that facilitates the investigation of a three-way relationship between cut, carat and price. Plot the graph. What inferences can you draw regarding the three way relationship? [8 marks]
qplot(x = carat, y = price, data = diamonds, color = cut)
# The main inference is that price rises with carat for every cut category: the points for all cuts trend towards the upper right corner, which means that the higher the carat, the higher the price.
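With nearly 54,000 overlapping points, the per-cut trends are hard to separate in a plain scatter plot. The following sketch, which is an assumption about what might help and not part of the original answer, uses transparency, logarithmic axes and one fitted line per cut; `logPlot` and `logCor` are illustrative names.

```r
library(ggplot2)  # provides diamonds and the plotting functions

# Semi-transparent points on log-log axes with a straight-line fit per cut
logPlot <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Price vs carat by cut (log-log scale)")
print(logPlot)

# Strength of the overall carat-price association on the log scale
logCor <- cor(log10(diamonds$carat), log10(diamonds$price))
print(logCor)
```

Separate lines per cut make it easier to compare how price changes with carat across the cut categories.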
Exercise 3. Before deciding about selecting a particular machine learning technique for a data science problem, it is important to study the data distribution particularly through visualization. However, visualizing a multivariate data with two or more variables is difficult in a two dimensional plot. In this exercise, you are required to study the R’s iris dataset which is a multivariate data consisting of four features or properties (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) characterizing three species of iris flower (setosa, versicolor, and virginica). The principal component analysis (PCA) is a technique that can help facilitate visualization of a multivariate data distribution. The first two principal components (PC1 and PC2) obtained after applying PCA, can explain the majority of variations in the data. In order to study the data variability in iris data-set, perform the following tasks.
(a). Write code to obtain PC scores.
[6 marks]
#Code for the above exercise.
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
pcaIris = prcomp(iris[,-5], scale. = TRUE)
names(pcaIris)
## [1] "sdev" "rotation" "center" "scale" "x"
head(pcaIris$x)
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.12727962 0.024087508
## [2,] -2.074013 0.6718827 0.23382552 0.102662845
## [3,] -2.356335 0.3407664 -0.04405390 0.028282305
## [4,] -2.291707 0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825 0.006586116
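The exercise states that the first two principal components explain most of the variation. This can be checked from the standard deviations stored in the `prcomp` result; the sketch below refits the PCA so that it runs on its own, and `varExplained` and `cumVarExplained` are illustrative names.

```r
# PCA on the four standardised iris measurements (Species column dropped)
pcaIris <- prcomp(iris[, -5], scale. = TRUE)

# Proportion of total variance explained by each component, and the running total
varExplained <- pcaIris$sdev^2 / sum(pcaIris$sdev^2)
cumVarExplained <- cumsum(varExplained)
print(round(rbind(proportion = varExplained, cumulative = cumVarExplained), 3))
```

The same figures are reported by `summary(pcaIris)`.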
(b). Write code to obtain a scatter plot representing PC1 vs. PC2,
wherein data clusters corresponding to three flower types are clearly
marked using possibly an ellipsoid. Also, write code to obtain
correlation heatmap between PC scores and comment on the appropriateness
of the map.
[10 marks]
PC <- data.frame(Species = iris$Species, pcaIris$x[,1], pcaIris$x[,2])
names(PC) <- c("Species", "PC1", "PC2")
names(PC)
## [1] "Species" "PC1" "PC2"
ggplot(PC, aes(x = PC1, y = PC2, color = Species )) +
geom_point() +
stat_ellipse() +
xlab("First Principal Component PC1") +
ylab("Second Principal Component PC2") +
ggtitle("The first Two Principal Components of R's iris dataset")
# The plot shows how the species cluster in terms of the first and second principal components: setosa forms a distinct cluster, while versicolor and virginica overlap slightly.
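The task also asks for a correlation heatmap between the PC scores. A minimal sketch follows, assuming ggplot2 for the tiles; the names `pcCor`, `pcCorLong` and `pcHeatmap` and the colour choices are illustrative. Principal component scores are uncorrelated by construction, so the heatmap should show 1 on the diagonal and approximately 0 elsewhere. It therefore confirms that the components are orthogonal but reveals little else about the data.

```r
library(ggplot2)

# Refit the PCA so the example runs on its own
pcaIris <- prcomp(iris[, -5], scale. = TRUE)

# Correlation matrix of the PC scores, reshaped to long format for plotting
pcCor <- cor(pcaIris$x)
pcCorLong <- as.data.frame(as.table(pcCor))
names(pcCorLong) <- c("Row", "Column", "Correlation")

pcHeatmap <- ggplot(pcCorLong, aes(x = Row, y = Column, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  ggtitle("Correlation heatmap of PC scores")
print(pcHeatmap)
```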
(c). Run the codes to make the scatter plot, mark flowers using
ellipsoids and comment on the feature distribution.
[4 marks]
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species )) +
geom_point() +
stat_ellipse() +
xlab("Length of Sepal") +
ylab("Length of Petal") +
ggtitle("Sepal Length vs Petal Length")
# The plot shows how the species cluster in terms of sepal length and petal length: setosa forms a distinct cluster, while versicolor and virginica overlap slightly.
Exercise 4. In this task, you are required to analyze the Animals dataset from the MASS package. This dataset contains brain weight (in grams) and body weight (in kilograms) for 28 different animal species. The three largest animals are dinosaurs, whose measurements are obviously the result of scientific modeling rather than precise measurements.
A scatter plot given below fails to describe any obvious relationship between brain weight and body weight variables. You are required to apply appropriate power transformations to the variables to obtain more interpretable plot and describe the obtained relationship. To this end, undertake the following tasks.
data("Animals")
str(Animals)
## 'data.frame': 28 obs. of 2 variables:
## $ body : num 1.35 465 36.33 27.66 1.04 ...
## $ brain: num 8.1 423 119.5 115 5.5 ...
qplot(brain, body, data = Animals)
Task-1. Check whether each of the variables has normal distribution. Your response should be based on an appropriate statistical test as well as smoothed histogram plots. [10 marks]
#Code for the above exercise.
# The Shapiro-Wilk test is used to test for normality
shapiro.test(Animals$body)
##
## Shapiro-Wilk normality test
##
## data: Animals$body
## W = 0.27831, p-value = 1.115e-10
# W = 0.27831, p-value = 1.115e-10. Since the p-value is less than 0.05, normality is rejected: the body variable is not normally distributed.
shapiro.test(Animals$brain)
##
## Shapiro-Wilk normality test
##
## data: Animals$brain
## W = 0.45173, p-value = 3.763e-09
# W = 0.45173, p-value = 3.763e-09. Since the p-value is less than 0.05, normality is rejected: the brain variable is not normally distributed.
hist(Animals$body, freq = FALSE)
lines(density(Animals$body))
hist(Animals$brain, freq = FALSE)
lines(density(Animals$brain))
Task-2. A power transformation of a variable X consists of raising X
to the power lambda. Using an appropriate statistical test and/or plot,
find best lambda values needed for transforming each of the variables
requiring power transformation.
[10 marks]
# To find an appropriate lambda value for Brain column
lbdVal = boxcox(Animals$brain~1,plotit=TRUE, lambda=seq(-1,1,0.01))
title(main="Box-Cox Plot for Brain")
# Find the optimal lambda value (the one that maximises the log-likelihood)
lbdValOpt = lbdVal$x[which.max(lbdVal$y)]
lbdValOpt
## [1] 0.08
# The optimal lambda value for brain is 0.08
lbdVal2 = boxcox(Animals$body~1,plotit=TRUE, lambda=seq(-1.5,1.5,0.01))
title(main="Box-Cox Plot for Body ")
# Find the optimal lambda value (the one that maximises the log-likelihood)
lbdVal2Opt = lbdVal2$x[which.max(lbdVal2$y)]
lbdVal2Opt
## [1] 0.01
# The optimal lambda value for body is 0.01
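Both optimal lambda values are close to zero, and the Box-Cox family reduces to the natural logarithm as lambda approaches zero. A log transformation may therefore be a simpler and more interpretable alternative. This is offered as a suggestion, not as the approach taken in this submission. The sketch below works on a fresh copy of the data, `animalsRaw`, so that it does not modify `Animals`; all object names are illustrative.

```r
# Fresh, untransformed copy of the data taken directly from MASS
animalsRaw <- MASS::Animals

# Shapiro-Wilk normality tests on the log-transformed variables
logBodyTest <- shapiro.test(log(animalsRaw$body))
logBrainTest <- shapiro.test(log(animalsRaw$brain))
print(c(body = logBodyTest$p.value, brain = logBrainTest$p.value))
```

A p-value above 0.05 here would mean that normality of the log-transformed variable is not rejected at the 5% level.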
Task-3. Apply power transformation and verify whether transformed variables have a normal distribution through statistical test as well as smoothed histogram plots. [10 marks]
#Power transformation for body
Animals$body=Animals$body^0.01
shapiro.test(Animals$body)
##
## Shapiro-Wilk normality test
##
## data: Animals$body
## W = 0.98501, p-value = 0.9486
# W = 0.98501 and p-value = 0.9486. Since the p-value is above 0.05, normality is not rejected: the transformed body variable is consistent with a normal distribution.
#Power transformation for brain
Animals$brain = Animals$brain^0.08
shapiro.test(Animals$brain)
##
## Shapiro-Wilk normality test
##
## data: Animals$brain
## W = 0.97237, p-value = 0.6454
# W = 0.97237 and p-value = 0.6454. Since the p-value is above 0.05, normality is not rejected: the transformed brain variable is consistent with a normal distribution.
hist(Animals$body, freq = FALSE)
lines(density(Animals$body))
hist(Animals$brain, freq = FALSE)
lines(density(Animals$brain))
Task-4. Create a scatter plot of the transformed data. Based on the visual inspection of the plot, provide your interpretation of the relationship between brain weight and body weight variables. You may like to add an appropriate smoothed line curve to your plot to help in interpretation. [10 marks]
ggplot(Animals, aes(x=body, y=brain)) + geom_point() + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
# In the plot, most of the points lie close to the fitted straight line. Hence, the relationship between the transformed brain weight and body weight is approximately linear and positive.
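The visual impression of linearity can be backed by a correlation coefficient and a fitted linear model. The sketch below rebuilds the transformed variables from a fresh copy of the data, using the lambda values found in Task-2, so that it runs on its own; `animalsT`, `transformedCor` and `linearFit` are illustrative names.

```r
# Rebuild the power-transformed data from the untransformed MASS copy
animalsRaw <- MASS::Animals
animalsT <- data.frame(body = animalsRaw$body^0.01,
                       brain = animalsRaw$brain^0.08)

# Pearson correlation between the transformed variables
transformedCor <- cor(animalsT$body, animalsT$brain)
print(transformedCor)

# Straight-line fit matching geom_smooth(method = lm)
linearFit <- lm(brain ~ body, data = animalsT)
print(coef(linearFit))
print(summary(linearFit)$r.squared)
```

A positive slope with a high R-squared would support the linear interpretation. Points lying well away from the line are worth checking against the three dinosaur records mentioned in the exercise.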