
UNIVERSITY OF ULSTER

FACULTY OF COMPUTING, ENGINEERING AND BUILT ENVIRONMENT

COURSEWORK SUBMISSION SHEET

This sheet must be completed in full and attached to the front of each item of assessment before submission to Module Coordinator/Instructor, Professor Girijesh Prasad, via Blackboard Learn.

Student’s Name. OLATUNDE AYOOLA IBRAHIM

Registration No. B00869150

Course Title. MSc Data Science

Module Code/Title. COM736: Data Validation and Visualisation

Instructor. Professor Girijesh Prasad

Date Due. Friday 25th March 2022

Submitted work is subject to the following assessment policies:

Coursework must be submitted by the specified date.
Students may seek prior consent from the Course Director to submit coursework after the official deadline; such requests must be accompanied by a satisfactory explanation, and in the case of illness by a medical certificate.
Coursework submitted without consent after the deadline will not normally be accepted and will therefore receive a mark of zero.

I declare that this is all my own work and that any material I have referred to has been accurately referenced. I have read the University’s policy on plagiarism and understand the definition of plagiarism. If it is shown that material has been plagiarised, or I have otherwise attempted to obtain an unfair advantage for myself or others, I understand that I may face sanctions in accordance with the policies and procedures of the University. A mark of zero may be awarded and the reason for that mark will be recorded on my file.


# Load all required libraries
library(ggplot2)
library(gridExtra)
library(MASS)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x dplyr::select()  masks MASS::select()

library(e1071)

COM736: Data Validation and Visualisation

CRN: 17946

Coursework 1

This item of coursework contributes 50% of the overall module marks. The solutions to all of the following exercises must be submitted to the module assessment area on Blackboard, as a lab-based assignment, by the end of the day on Friday of the ninth week (i.e. 25th March 2022), contributing to your portfolio of evidence relating to Data Validation and Visualisation exercises. You may use this file to present your functioning code along with program outputs through this R Markdown document (http://rmarkdown.rstudio.com).

Exercise 1. The dataset mpg is part of the ggplot2 package. It contains a subset of the fuel economy data that the Environmental Protection Agency (EPA) makes available on https://fueleconomy.gov/. It contains only car models which had a new release every year between 1999 and 2008; this was used as a proxy for the popularity of the car. It is a dataframe with 234 rows and 11 variables: manufacturer name (manufacturer), model name (model), engine displacement (displ), year of manufacture (year), number of cylinders (cyl), type of transmission (trans), type of drive train (drv), city miles per gallon (cty), highway miles per gallon (hwy), fuel type (fl) and type of car (class). Applying an appropriate R data visualisation method on the mpg data, perform the following tasks.

(a). Write code that displays a graph which plots in the order of decreasing medians of the vehicle’s miles-per-gallon on highway (hwy) against their manufacturers. Plot the graph and list the manufacturers in the order of fuel efficiency of their vehicles. Using the graph, find out which companies produce the most and the least fuel efficient vehicles.
[8 marks]

# Code for the above exercise
#(a)
data(mpg)
str(mpg)

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

# Order manufacturers by decreasing median hwy; negating hwy makes reorder() sort in descending order
fetchMedianManufacturer = with(mpg, reorder(manufacturer, -hwy, median))
fetchMedianManufacturer

##   [1] audi       audi       audi       audi       audi       audi      
##   [7] audi       audi       audi       audi       audi       audi      
##  [13] audi       audi       audi       audi       audi       audi      
##  [19] chevrolet  chevrolet  chevrolet  chevrolet  chevrolet  chevrolet 
##  [25] chevrolet  chevrolet  chevrolet  chevrolet  chevrolet  chevrolet 
##  [31] chevrolet  chevrolet  chevrolet  chevrolet  chevrolet  chevrolet 
##  [37] chevrolet  dodge      dodge      dodge      dodge      dodge     
##  [43] dodge      dodge      dodge      dodge      dodge      dodge     
##  [49] dodge      dodge      dodge      dodge      dodge      dodge     
##  [55] dodge      dodge      dodge      dodge      dodge      dodge     
##  [61] dodge      dodge      dodge      dodge      dodge      dodge     
##  [67] dodge      dodge      dodge      dodge      dodge      dodge     
##  [73] dodge      dodge      ford       ford       ford       ford      
##  [79] ford       ford       ford       ford       ford       ford      
##  [85] ford       ford       ford       ford       ford       ford      
##  [91] ford       ford       ford       ford       ford       ford      
##  [97] ford       ford       ford       honda      honda      honda     
## [103] honda      honda      honda      honda      honda      honda     
## [109] hyundai    hyundai    hyundai    hyundai    hyundai    hyundai   
## [115] hyundai    hyundai    hyundai    hyundai    hyundai    hyundai   
## [121] hyundai    hyundai    jeep       jeep       jeep       jeep      
## [127] jeep       jeep       jeep       jeep       land rover land rover
## [133] land rover land rover lincoln    lincoln    lincoln    mercury   
## [139] mercury    mercury    mercury    nissan     nissan     nissan    
## [145] nissan     nissan     nissan     nissan     nissan     nissan    
## [151] nissan     nissan     nissan     nissan     pontiac    pontiac   
## [157] pontiac    pontiac    pontiac    subaru     subaru     subaru    
## [163] subaru     subaru     subaru     subaru     subaru     subaru    
## [169] subaru     subaru     subaru     subaru     subaru     toyota    
## [175] toyota     toyota     toyota     toyota     toyota     toyota    
## [181] toyota     toyota     toyota     toyota     toyota     toyota    
## [187] toyota     toyota     toyota     toyota     toyota     toyota    
## [193] toyota     toyota     toyota     toyota     toyota     toyota    
## [199] toyota     toyota     toyota     toyota     toyota     toyota    
## [205] toyota     toyota     toyota     volkswagen volkswagen volkswagen
## [211] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [217] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [223] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## [229] volkswagen volkswagen volkswagen volkswagen volkswagen volkswagen
## attr(,"scores")
##       audi  chevrolet      dodge       ford      honda    hyundai       jeep 
##      -26.0      -23.0      -17.0      -18.0      -32.0      -26.5      -18.5 
## land rover    lincoln    mercury     nissan    pontiac     subaru     toyota 
##      -16.5      -17.0      -18.0      -26.0      -26.0      -26.0      -26.0 
## volkswagen 
##      -29.0 
## 15 Levels: honda volkswagen hyundai audi nissan pontiac subaru ... land rover

#using boxplot for visualisation
boxplot(hwy ~ fetchMedianManufacturer, data = mpg,
        xlab = "Manufacturers", ylab = "Miles/Gallon on Highway",
        main = "hwy vs manufacturers in order of decreasing median ", varwidth = TRUE,
        col = "green")

# Fuel efficiency decreases from Honda (highest median hwy, 32 mpg) to Land Rover (lowest median hwy, 16.5 mpg). The order, from most to least efficient, is given below.

# honda volkswagen hyundai audi nissan pontiac subaru toyota chevrolet jeep ... land rover
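
The factor levels printed above truncate the middle of the ordering. A minimal sketch of one way to list every manufacturer with its median is shown below; it assumes ggplot2 is installed, and the name `medianHwy` is introduced here purely for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Median highway mileage per manufacturer
medianHwy <- aggregate(hwy ~ manufacturer, data = mpg, FUN = median)

# Sort from most to least fuel efficient
medianHwy <- medianHwy[order(-medianHwy$hwy), ]
print(medianHwy, row.names = FALSE)
```

Manufacturers with equal medians keep their original (alphabetical) order in this table, so ties should be read as ties rather than as a ranking.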

(b). Write code that displays a graph which plots in the order of decreasing medians of the vehicle’s miles-per-gallon on highway (hwy) against the type of car (class). Plot the graph and list the classes of vehicle in the order of their fuel efficiency.
[8 marks]

#(b)
# Order classes by decreasing median hwy; negating hwy makes reorder() sort in descending order
fetchMedianClass = with(mpg, reorder(class, -hwy, median))
fetchMedianClass

##   [1] compact    compact    compact    compact    compact    compact   
##   [7] compact    compact    compact    compact    compact    compact   
##  [13] compact    compact    compact    midsize    midsize    midsize   
##  [19] suv        suv        suv        suv        suv        2seater   
##  [25] 2seater    2seater    2seater    2seater    suv        suv       
##  [31] suv        suv        midsize    midsize    midsize    midsize   
##  [37] midsize    minivan    minivan    minivan    minivan    minivan   
##  [43] minivan    minivan    minivan    minivan    minivan    minivan   
##  [49] pickup     pickup     pickup     pickup     pickup     pickup    
##  [55] pickup     pickup     pickup     suv        suv        suv       
##  [61] suv        suv        suv        suv        pickup     pickup    
##  [67] pickup     pickup     pickup     pickup     pickup     pickup    
##  [73] pickup     pickup     suv        suv        suv        suv       
##  [79] suv        suv        suv        suv        suv        pickup    
##  [85] pickup     pickup     pickup     pickup     pickup     pickup    
##  [91] subcompact subcompact subcompact subcompact subcompact subcompact
##  [97] subcompact subcompact subcompact subcompact subcompact subcompact
## [103] subcompact subcompact subcompact subcompact subcompact subcompact
## [109] midsize    midsize    midsize    midsize    midsize    midsize   
## [115] midsize    subcompact subcompact subcompact subcompact subcompact
## [121] subcompact subcompact suv        suv        suv        suv       
## [127] suv        suv        suv        suv        suv        suv       
## [133] suv        suv        suv        suv        suv        suv       
## [139] suv        suv        suv        compact    compact    midsize   
## [145] midsize    midsize    midsize    midsize    midsize    midsize   
## [151] suv        suv        suv        suv        midsize    midsize   
## [157] midsize    midsize    midsize    suv        suv        suv       
## [163] suv        suv        suv        subcompact subcompact subcompact
## [169] subcompact compact    compact    compact    compact    suv       
## [175] suv        suv        suv        suv        suv        midsize   
## [181] midsize    midsize    midsize    midsize    midsize    midsize   
## [187] compact    compact    compact    compact    compact    compact   
## [193] compact    compact    compact    compact    compact    compact   
## [199] suv        suv        pickup     pickup     pickup     pickup    
## [205] pickup     pickup     pickup     compact    compact    compact   
## [211] compact    compact    compact    compact    compact    compact   
## [217] compact    compact    compact    compact    compact    subcompact
## [223] subcompact subcompact subcompact subcompact subcompact midsize   
## [229] midsize    midsize    midsize    midsize    midsize    midsize   
## attr(,"scores")
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##      -25.0      -27.0      -27.0      -23.0      -17.0      -26.0      -17.5 
## Levels: compact midsize subcompact 2seater minivan suv pickup

#using boxplot for visualisation
boxplot(hwy ~ fetchMedianClass, data = mpg,
        xlab = "Class", ylab = "Miles/Gallon on Highway",
        main = "hwy vs class in order of decreasing median ", varwidth = TRUE,
        col = "blue")

# Fuel efficiency decreases from compact (median hwy 27 mpg, tied with midsize) to pickup (median hwy 17 mpg). The order, from most to least efficient, is given below.
# compact midsize subcompact 2seater minivan suv pickup

(c). Draw a bar chart of manufacturers in terms of numbers of different types of cars manufactured. Based on this, comment on classes of vehicles manufactured by the companies producing the most and the least fuel efficient vehicles and possible reason(s) for highest/lowest fuel efficiency.
[4 marks]

# Code for the above exercise
#(c)

ggplot(mpg, aes(x= manufacturer, fill = class)) + geom_bar() + ggtitle("Bar chart of manufacturers based on product class")

# From the chart, SUVs appear to be among the least fuel-efficient vehicles: the least fuel-economical manufacturer (Land Rover) produces only SUVs in this dataset.
# Subcompact and compact cars are among the most fuel-economical classes, and they make up most of the vehicles from the manufacturers with the most efficient vehicles. Honda, for example, appears here only with subcompact cars. A likely reason is that smaller, lighter vehicles generally need less fuel.
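
The stacked bars can be hard to read for manufacturers with few vehicles. As a hedged complement to the chart, the sketch below cross-tabulates manufacturer against class so the counts can be read directly; `classCounts` is an illustrative name, not part of the original code.

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate manufacturer against vehicle class
classCounts <- table(mpg$manufacturer, mpg$class)

# Rows for the most and the least fuel-efficient manufacturers
classCounts[c("honda", "land rover"), ]
```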

Exercise 2. The diamonds dataset within R’s ggplot2 contains 10 columns (price, carat, cut, color, clarity, length(x), width(y), depth(z), depth percentage, top width) for 53940 different diamonds. Using this dataset, carry out the following tasks.

(a). Write code to plot histograms for carat and price. Plot these graphs and comment on their shapes.
[6 marks]

#Code for the above exercise.
data("diamonds")
str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

caratPlot=ggplot(diamonds, aes(x=carat)) + geom_histogram()
pricePlot=ggplot(diamonds, aes(x=price)) + geom_histogram()
grid.arrange(caratPlot, pricePlot, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

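# Both histograms are right-skewed (positively skewed): most diamonds have a low carat and a low price, with a long tail extending towards high values. This agrees with summary(diamonds), where the mean exceeds the median for both variables.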
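
The direction of skew can also be checked numerically. The following is a minimal sketch, assuming the e1071 package loaded earlier is used for its `skewness()` function; the names `skewCarat` and `skewPrice` are illustrative. A positive value indicates a longer right tail and a negative value a longer left tail.

```r
library(ggplot2)  # diamonds dataset
library(e1071)    # skewness()

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail
skewCarat <- skewness(diamonds$carat)
skewPrice <- skewness(diamonds$price)
c(carat = skewCarat, price = skewPrice)

# For a right-skewed variable the mean typically lies above the median
c(mean = mean(diamonds$price), median = median(diamonds$price))
```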

(b). Write code to plot bar charts of cut proportioned in terms of color and again bar charts of cuts proportioned in terms of clarity. Comment on how proportions of diamonds change in terms of clarity and colour under different cut categories. [6 marks]

cutFillColor = ggplot(diamonds, aes(x= cut, fill = color)) + geom_bar() + ggtitle("Barchart of cut proportioned in terms of color")

cutFillClarity = ggplot(diamonds, aes(x= cut, fill = clarity)) +geom_bar() + ggtitle("Barchart of cut proportioned in terms of clarity") 

grid.arrange(cutFillColor,cutFillClarity, ncol=2)

# Fair cut: the colors are spread fairly evenly, whereas IF clarity is rare, if present at all.
# Good cut: colors E and F are slightly more common than the rest, whereas I1, SI2 and SI1 clarity dominate the category.
# Very Good cut: color J is scarce compared with the remaining colors, and IF clarity is also rare.
# Premium cut: IF clarity is still the rarest, while all colors are well represented.
# Ideal cut: the number of color D diamonds is almost equal to the number of I1 clarity diamonds.
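
The charts above stack raw counts, so the bar heights differ between cuts and shares are hard to compare. One possible alternative, sketched here under the assumption that within-cut proportions are what the task intends, is `position = "fill"`, which rescales every bar to a height of 1. The names `cutPropColor` and `propColorByCut` are illustrative.

```r
library(ggplot2)  # diamonds dataset

# position = "fill" rescales each bar to height 1, so segments show proportions
cutPropColor <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Proportion of each color within cut")
print(cutPropColor)

# The same proportions as a table: each row (cut) sums to 1
propColorByCut <- prop.table(table(diamonds$cut, diamonds$color), margin = 1)
round(propColorByCut, 3)
```

Replacing `color` with `clarity` in both calls gives the corresponding view for clarity.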

(c). Write code to display an appropriate graph that facilitates the investigation of a three-way relationship between cut, carat and price. Plot the graph. What inferences can you draw regarding the three way relationship? [8 marks]

qplot(x = carat, y = price, data = diamonds, color = cut)

# The main inference is that price increases with carat for every cut category: the points for each cut trend towards the upper right corner in a roughly linear way, which means that the higher the carat, the higher the price.
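
With more than 50,000 points in a single panel, the cut categories overplot each other. The sketch below is one hedged way to separate them: one panel per cut, semi-transparent points, and log scales, which often bring the carat-price trend closer to a straight line. A per-cut table of medians is added for a numerical comparison. The names `cutPanels` and `medianByCut` are illustrative.

```r
library(ggplot2)  # diamonds dataset

# One panel per cut reduces overplotting; alpha makes dense regions visible
cutPanels <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~ cut)
print(cutPanels)

# Median carat and price per cut, to compare the groups numerically
medianByCut <- aggregate(cbind(carat, price) ~ cut, data = diamonds, FUN = median)
medianByCut
```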

Exercise 3. Before deciding about selecting a particular machine learning technique for a data science problem, it is important to study the data distribution particularly through visualization. However, visualizing a multivariate data with two or more variables is difficult in a two dimensional plot. In this exercise, you are required to study the R’s iris dataset which is a multivariate data consisting of four features or properties (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) characterizing three species of iris flower (setosa, versicolor, and virginica). The principal component analysis (PCA) is a technique that can help facilitate visualization of a multivariate data distribution. The first two principal components (PC1 and PC2) obtained after applying PCA, can explain the majority of variations in the data. In order to study the data variability in iris data-set, perform the following tasks.

(a). Write code to obtain PC scores.
[6 marks]

#Code for the above exercise.
data("iris")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

pcaIris = prcomp(iris[,-5], scale. = TRUE)
names(pcaIris)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

head(pcaIris$x)

##            PC1        PC2         PC3          PC4
## [1,] -2.257141 -0.4784238  0.12727962  0.024087508
## [2,] -2.074013  0.6718827  0.23382552  0.102662845
## [3,] -2.356335  0.3407664 -0.04405390  0.028282305
## [4,] -2.291707  0.5953999 -0.09098530 -0.065735340
## [5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
## [6,] -2.068701 -1.4842053 -0.02687825  0.006586116
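
The exercise text states that the first two principal components explain the majority of the variation. A minimal sketch of how this could be checked from the component standard deviations follows; `varExplained` is an illustrative name, and the PCA is refitted so the block runs on its own.

```r
# PCA on the four numeric iris features, standardised as in the code above
pcaIris <- prcomp(iris[, -5], scale. = TRUE)

# Proportion of total variance explained by each component
varExplained <- pcaIris$sdev^2 / sum(pcaIris$sdev^2)
round(varExplained, 4)

# Cumulative proportion; the second entry covers PC1 and PC2 together
cumsum(varExplained)
```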

(b). Write code to obtain a scatter plot representing PC1 vs. PC2, wherein data clusters corresponding to three flower types are clearly marked using possibly an ellipsoid. Also, write code to obtain correlation heatmap between PC scores and comment on the appropriateness of the map.
[10 marks]

PC <- data.frame(Species = iris$Species, pcaIris$x[,1], pcaIris$x[,2])
names(PC) <- c("Species", "PC1", "PC2")
names(PC)

## [1] "Species" "PC1"     "PC2"

ggplot(PC, aes(x = PC1, y = PC2, color = Species )) +
  geom_point() +
  stat_ellipse() +
  xlab("First Principal Component PC1") + 
  ylab("Second Principal Component PC2") + 
  ggtitle("The first Two Principal Components of R's iris dataset")

# The plot shows how the species cluster in terms of the first and second principal components: setosa forms a distinct cluster, while versicolor and virginica overlap slightly.
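
Part (b) also asks for a correlation heatmap between the PC scores. The sketch below is one possible way to draw it with ggplot2; the names `pcCor`, `pcCorLong` and `pcHeatmap` are illustrative. Because principal component scores are uncorrelated by construction, the map is expected to show 1 on the diagonal and values of approximately 0 elsewhere, so it mainly serves as a check that the PCA behaved as intended rather than as a source of new insight.

```r
library(ggplot2)

pcaIris <- prcomp(iris[, -5], scale. = TRUE)

# Correlation matrix of the PC scores (4 x 4)
pcCor <- cor(pcaIris$x)

# Reshape to long format: one row per pair of components
pcCorLong <- as.data.frame(as.table(pcCor))
names(pcCorLong) <- c("Row", "Column", "Correlation")

pcHeatmap <- ggplot(pcCorLong, aes(x = Row, y = Column, fill = Correlation)) +
  geom_tile() +
  geom_text(aes(label = round(Correlation, 2))) +
  scale_fill_gradient2(limits = c(-1, 1)) +
  ggtitle("Correlation between PC scores")
print(pcHeatmap)
```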

(c). Run the codes to make the scatter plot, mark flowers using ellipsoids and comment on the feature distribution.
[4 marks]

ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species )) +
  geom_point() +
  stat_ellipse() +
  xlab("Length of Sepal") + 
  ylab("Length of Petal") + 
  ggtitle("Sepal Length vs Petal Length")

# The plot shows how the species cluster in terms of sepal length and petal length: setosa forms a distinct cluster, while versicolor and virginica overlap slightly.

Exercise 4. In this task, you are required to analyze the Animals dataset from the MASS package. This dataset contains brain weight (in grams) and body weight (in kilograms) for 28 different animal species. The three largest animals are dinosaurs, whose measurements are obviously the result of scientific modeling rather than precise measurements.

The scatter plot given below fails to show any obvious relationship between the brain weight and body weight variables. You are required to apply appropriate power transformations to the variables to obtain a more interpretable plot and describe the obtained relationship. To this end, undertake the following tasks.

data("Animals")
str(Animals)

## 'data.frame':    28 obs. of  2 variables:
##  $ body : num  1.35 465 36.33 27.66 1.04 ...
##  $ brain: num  8.1 423 119.5 115 5.5 ...

qplot(brain, body, data = Animals)

Task-1. Check whether each of the variables has normal distribution. Your response should be based on an appropriate statistical test as well as smoothed histogram plots. [10 marks]

#Code for the above exercise.
# The Shapiro-Wilk test is used to test for normality

shapiro.test(Animals$body)

## 
##  Shapiro-Wilk normality test
## 
## data:  Animals$body
## W = 0.27831, p-value = 1.115e-10

# W = 0.27831, p-value = 1.115e-10. Since the p-value is less than 0.05, normality is rejected: the body variable is not normally distributed.
shapiro.test(Animals$brain)

## 
##  Shapiro-Wilk normality test
## 
## data:  Animals$brain
## W = 0.45173, p-value = 3.763e-09

# W = 0.45173, p-value = 3.763e-09. Since the p-value is less than 0.05, normality is rejected: the brain variable is not normally distributed.

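# Density-scaled histograms with a kernel density curve (smoothed histogram)
hist(Animals$body, freq = FALSE)
lines(density(Animals$body))

hist(Animals$brain, freq = FALSE)
lines(density(Animals$brain))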

Task-2. A power transformation of a variable X consists of raising X to the power lambda. Using an appropriate statistical test and/or plot, find best lambda values needed for transforming each of the variables requiring power transformation.
[10 marks]

# To find an appropriate lambda value for Brain column

lbdVal = boxcox(Animals$brain~1,plotit=TRUE, lambda=seq(-1,1,0.01))
title(main="Box-Cox Plot for Brain")

# To find the optimal lambda value.
lbdValOpt = lbdVal$x[which.max(lbdVal$y)]
lbdValOpt

## [1] 0.08

# The optimal lambda value for brain is 0.08

lbdVal2 = boxcox(Animals$body~1,plotit=TRUE, lambda=seq(-1.5,1.5,0.01))
title(main="Box-Cox Plot for Body ")

# Find The Optimal lambda value.
lbdVal2Opt = lbdVal2$x[which.max(lbdVal2$y)]
lbdVal2Opt

## [1] 0.01

# The optimal lambda value for body is 0.01
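
A single best lambda hides how flat the log-likelihood is around its maximum. The sketch below, shown for the body variable only, derives the approximate 95% interval that `boxcox()` marks on its plot: all lambda values whose log-likelihood lies within `qchisq(0.95, 1) / 2` of the maximum. The names `bc`, `lambdaHat` and `lambdaCI` are illustrative. If the interval contains 0, a log transformation may be a simpler and more interpretable choice than a very small power.

```r
library(MASS)  # boxcox() and the Animals dataset

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(body ~ 1, data = MASS::Animals,
             lambda = seq(-1.5, 1.5, 0.01), plotit = FALSE)

# Lambda that maximises the log-likelihood
lambdaHat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
inInterval <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambdaCI <- range(bc$x[inInterval])

lambdaHat
lambdaCI
```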

Task-3. Apply power transformation and verify whether transformed variables have a normal distribution through statistical test as well as smoothed histogram plots. [10 marks]

#Power transformation for body

Animals$body=Animals$body^0.01
shapiro.test(Animals$body)

## 
##  Shapiro-Wilk normality test
## 
## data:  Animals$body
## W = 0.98501, p-value = 0.9486

# W = 0.98501 and p-value = 0.9486. Since the p-value is above 0.05, normality is not rejected: the transformed body variable is consistent with a normal distribution.

#Power transformation for brain
Animals$brain = Animals$brain^0.08
shapiro.test(Animals$brain)

## 
##  Shapiro-Wilk normality test
## 
## data:  Animals$brain
## W = 0.97237, p-value = 0.6454

# W = 0.97237 and p-value = 0.6454. Since the p-value is above 0.05, normality is not rejected: the transformed brain variable is consistent with a normal distribution.

hist(Animals$body, freq = FALSE)
lines(density(Animals$body))

hist(Animals$brain, freq = FALSE)
lines(density(Animals$brain))

Task-4. Create a scatter plot of the transformed data. Based on the visual inspection of the plot, provide your interpretation of the relationship between brain weight and body weight variables. You may like to add an appropriate smoothed line curve to your plot to help in interpretation. [10 marks]

ggplot(Animals, aes(x=body, y=brain)) + geom_point() + geom_smooth(method=lm)

## `geom_smooth()` using formula 'y ~ x'

# Most of the points lie close to the fitted straight line, so the relationship between the transformed brain weight and body weight is approximately linear and positive: heavier animals tend to have heavier brains.
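
The exercise text notes that the three largest animals are dinosaurs. The sketch below is a hedged way to quantify how much they affect the linear fit. It works on a fresh, untransformed copy of the data and uses a log transformation as a stand-in for the small lambda values found above (the log is the limiting case of the Box-Cox family as lambda approaches 0). The names `animals`, `fitAll`, `noDino` and `fitNoDino` are illustrative.

```r
library(MASS)  # Animals dataset

animals <- MASS::Animals  # fresh, untransformed copy

# Linear fit on the log-log scale using all 28 species
fitAll <- lm(log(brain) ~ log(body), data = animals)

# Drop the three heaviest animals (the dinosaurs) and refit
noDino <- animals[order(-animals$body)[-(1:3)], ]
fitNoDino <- lm(log(brain) ~ log(body), data = noDino)

# Compare the share of variance explained by each fit
c(all = summary(fitAll)$r.squared, noDino = summary(fitNoDino)$r.squared)
```

Comparing the two R-squared values indicates how strongly the dinosaur measurements pull the fitted line away from the remaining species.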