Required packages

The required packages to generate this report are listed below.

# This is the R chunk for the required packages
library(readr)
library(dplyr)
Registered S3 method overwritten by 'dplyr':
  method           from
  print.rowwise_df     

Attaching package: 㤼㸱dplyr㤼㸲

The following objects are masked from 㤼㸱package:stats㤼㸲:

    filter, lag

The following objects are masked from 㤼㸱package:base㤼㸲:

    intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
RStudio Community is a great place to get help:
https://community.rstudio.com/c/tidyverse
Registered S3 methods overwritten by 'htmltools':
  method               from         
  print.html           tools:rstudio
  print.shiny.tag      tools:rstudio
  print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Attaching package: 㤼㸱Hmisc㤼㸲

The following objects are masked from 㤼㸱package:dplyr㤼㸲:

    src, summarize

The following objects are masked from 㤼㸱package:base㤼㸲:

    format.pval, units
library(ggplot2)
library(forecast)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
This is forecast 8.12 
  Use suppressPackageStartupMessages() to eliminate package startup messages.

Executive Summary

This report talks about the preprocessing of selected prices data from April 1990 to April 2020. The required columns with all observations from the main datasets are subsetted and merged to form one dataframe. The data is untidy, so it is tidied and the variables’ datatypes are checked. Converions in data types for requisite variables are done. The variables with special and missing values are manipulated. New variables are created in the dataframe and outliers are sccanned for every numeric variable and settled. The choosen variable is transformed to a normal distribution for further analysis that could be done on that particular variable.

Data

The open datasets are collected from kaggle website.

The first data meat_prices taken from https://www.kaggle.com/kianwee/meat-prices-19902020 reflects changing prices of meat with their percentages for past three decades. It has 31 observations with 11 variables. The prices indicated price per kilogram. The second data rawmaterial_prices is from https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020 that details about the changes in prices of non-edible raw materials in the same time period of first data. It comprises of 361 observations of 25 variables.
The data are obtained from indexmundi and both the datasets have character and numeric variables and are untidy.

Two subsets are created with variables month, beef price, lamb price, pork price from meat_prices and month, coarse wool price, fine wool price from rawmaterial_prices. These two subsets are merged by month to form merge_data comprising 361 observations of six variables.

The output of head(merge_data) gives details about variables in it. The data has 1 character variable Month and 5 numeric(dbl) variables.

# This is the R chunk for the Data Section
Parsed with column specification:
cols(
  Month = col_character(),
  `Chicken Price` = col_double(),
  `Chicken price % Change` = col_character(),
  `Beef Price` = col_double(),
  `Beef price % Change` = col_character(),
  `Lamb price` = col_double(),
  `Lamb price % Change` = col_character(),
  `Pork Price` = col_double(),
  `Pork price % Change` = col_character(),
  `Salmon Price` = col_double(),
  `Salmon price % Change` = col_character()
)
head(meat_prices)

rawmaterial_prices <- read_csv("C:/Users/intel/Desktop/rawmaterial prices.csv")
Parsed with column specification:
cols(
  .default = col_character(),
  `Coarse wool Price` = col_number(),
  `Copra Price` = col_number(),
  `Cotton Price` = col_double(),
  `Fine wool Price` = col_number(),
  `Hard log Price` = col_double(),
  `Hard sawnwood Price` = col_double(),
  `Hide Price` = col_double(),
  `Plywood Price` = col_double(),
  `Rubber Price` = col_double(),
  `Softlog Price` = col_double(),
  `Soft sawnwood Price` = col_double(),
  `Wood pulp Price` = col_double()
)
See spec(...) for full column specifications.
head(rawmaterial_prices)

#subsetting and merging the 2 datasets#
cattle_prices <- meat_prices[, c(1,4,6,8)]
wool_prices <- rawmaterial_prices[, c(1,2,8)]
merge_data <- left_join(cattle_prices, wool_prices, by ="Month")
head(merge_data)
NA

##Tidy & Manipulate Data I

merge_data is untidy because the column Month has more than one information. The column is split into two seperate columns - Month and year to make it tidy resulting in tidy_data of 361 observations with 7 variables. head(tidy_data) shows the description about variables in tidy_data.

# This is the R chunk for the Tidy & Manipulate Data I 

tidy_data <- merge_data %>% separate(Month, into = c("Month", "Year"), sep = "-")
head(tidy_data)
NA

Understand

tidy_data has year as a character which is converted to Year. The variable Month is converted as a factor. The prices are decimal values and thus is double i.e numeric type variable.

# This is the R chunk for the Understand Section

str(tidy_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   361 obs. of  7 variables:
 $ Month            : chr  "Apr" "May" "Jun" "Jul" ...
 $ Year             : chr  "90" "90" "90" "90" ...
 $ Beef Price       : num  2.47 2.48 2.5 2.56 2.65 2.7 2.59 2.53 2.55 2.59 ...
 $ Lamb price       : num  2.63 2.71 2.72 2.77 2.66 2.7 2.77 2.74 2.65 2.63 ...
 $ Pork Price       : num  99.6 115.7 121.5 96.2 94 ...
 $ Coarse wool Price: num  482 447 441 418 418 ...
 $ Fine wool Price  : num  1072 1057 898 896 951 ...
attributes(tidy_data)
$names
[1] "Month"             "Year"              "Beef Price"        "Lamb price"       
[5] "Pork Price"        "Coarse wool Price" "Fine wool Price"  

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
 [23]  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44
 [45]  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66
 [67]  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88
 [89]  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110
[111] 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
[133] 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
[155] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
[177] 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
[221] 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
[243] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
[265] 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
[287] 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308
[309] 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
[331] 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
[353] 353 354 355 356 357 358 359 360 361

$class
[1] "tbl_df"     "tbl"        "data.frame"
tidy_data$Year <- as.numeric(tidy_data$Year)
str(tidy_data$Year)
 num [1:361] 90 90 90 90 90 90 90 90 90 91 ...
tidy_data$Month <- tidy_data$Month %>% as.factor()

tidy_data$Month <- factor(tidy_data$Month,levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), 
                         labels = (c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
str(tidy_data$Month)
 Factor w/ 12 levels "Jan","Feb","Mar",..: 4 5 6 7 8 9 10 11 12 1 ...
levels(tidy_data$Month)
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Scan I

tidy_data has both special and missiong values - NaN and NAs. Lamb prices and pork prices have NaN values which are then imputed as NA values. Missing values are present in lamb prices, pork prices, coarse wool prices and fine wool prices. More than 5% of the data in the variables are Na values. Deleting them would impact the analysis. Since, they are quantitative variables, the NA values are imputed with the mean values of those variables. tidy_data now has no NA values.

# This is the R chunk for the Scan I
#Dealing with special values#
which(is.nan(tidy_data$Month))
integer(0)
which(is.nan(tidy_data$Year))
integer(0)
which(is.nan(tidy_data$`Beef Price`))
integer(0)
which(is.nan(tidy_data$`Lamb price`))
 [1] 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357
[23] 358 359 360 361
which(is.nan(tidy_data$`Pork Price`))
 [1] 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349
[23] 350 351 352 353 354 355 356 357 358 359 360 361
which(is.nan((tidy_data$`Coarse wool Price`)))
integer(0)
which(is.nan(tidy_data$`Fine wool Price`))
integer(0)
tidy_data$`Lamb price`[is.nan(tidy_data$`Lamb price`)] <- NA
tidy_data$`Pork Price`[is.nan(tidy_data$`Pork Price`)] <- NA

#Dealing with missing values#
colSums(is.na(tidy_data))
            Month              Year        Beef Price        Lamb price        Pork Price 
                0                 0                 0                26                34 
Coarse wool Price   Fine wool Price 
               34                34 
tidy_data$`Lamb price` <- tidy_data$`Lamb price` %>%  impute(fun = mean)
tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>%  impute(fun = mean)
tidy_data$`Coarse wool Price` <- tidy_data$`Coarse wool Price` %>%  impute(fun = mean)
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>%  impute(fun = mean)

colSums(is.na(tidy_data))
            Month              Year        Beef Price        Lamb price        Pork Price 
                0                 0                 0                 0                 0 
Coarse wool Price   Fine wool Price 
                0                 0 

Tidy & Manipulate Data II

A new column wool_price_diff is mutated to find the difference between coarse wool and fine wool prices in every month for the three decades. Another column cattle_meat_prices is created which outlines the total meat prices of beef, lamb and pork for every month from 1990 to 2020.

# This is the R chunk for the Tidy & Manipulate Data II 

tidy_data <- tidy_data %>% mutate(wool_price_diff = `Fine wool Price` - `Coarse wool Price`)
tidy_data$cattle_meat_prices = tidy_data$`Beef Price`+tidy_data$`Lamb price`+tidy_data$`Pork Price`
head(tidy_data)
NA

Scan II

After imputing the missing values, the imputed variables are needed to be converted into numeric to work on for outliers. Out of all the numeric variables, pork prices, fine wool prices, wool price diff and cattle meat prices has outliers which could be seen in the boxplot. The dots outside the boxplot indicate the outliers. Though the outliers are less than 5% in each of these columns, deleting them is not preffered as every observation carries some information that would influence the samples in the further analysis. Capping the outliers would replace the outliers with nearest percentiles and doesn’t extremely affect the final analysis. After capping the boxplots are more clear with no outliers.

# This is the R chunk for the Scan II
str(tidy_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   361 obs. of  9 variables:
 $ Month             : Factor w/ 12 levels "Jan","Feb","Mar",..: 4 5 6 7 8 9 10 11 12 1 ...
 $ Year              : num  90 90 90 90 90 90 90 90 90 91 ...
 $ Beef Price        : num  2.47 2.48 2.5 2.56 2.65 2.7 2.59 2.53 2.55 2.59 ...
 $ Lamb price        : 'impute' num  2.63 2.71 2.72 2.77 2.66 2.7 2.77 2.74 2.65 2.63 ...
  ..- attr(*, "imputed")= int  336 337 338 339 340 341 342 343 344 345 ...
 $ Pork Price        : 'impute' num  99.6 115.7 121.5 96.2 94 ...
  ..- attr(*, "imputed")= int  328 329 330 331 332 333 334 335 336 337 ...
 $ Coarse wool Price : 'impute' num  482 447 441 418 418 ...
  ..- attr(*, "imputed")= int  328 329 330 331 332 333 334 335 336 337 ...
 $ Fine wool Price   : 'impute' num  1072 1057 898 896 951 ...
  ..- attr(*, "imputed")= int  328 329 330 331 332 333 334 335 336 337 ...
 $ wool_price_diff   : 'impute' num  589 610 457 477 533 ...
  ..- attr(*, "imputed")= int  328 329 330 331 332 333 334 335 336 337 ...
 $ cattle_meat_prices: 'impute' num  104.7 120.9 126.7 101.5 99.3 ...
  ..- attr(*, "imputed")= int  336 337 338 339 340 341 342 343 344 345 ...
tidy_data$`Lamb price` <- as.numeric(tidy_data$`Lamb price`)
tidy_data$`Pork Price` <- as.numeric(tidy_data$`Pork Price`)
tidy_data$`Coarse wool Price` <- as.numeric(tidy_data$`Coarse wool Price`)
tidy_data$`Fine wool Price`<- as.numeric(tidy_data$`Fine wool Price`)
tidy_data$wool_price_diff <- as.numeric(tidy_data$wool_price_diff)
tidy_data$cattle_meat_prices <- as.numeric(tidy_data$cattle_meat_prices)

boxplot(tidy_data$Year, main = "year")

boxplot(tidy_data$`Beef Price`, main = "beef price")

boxplot(tidy_data$`Lamb price`, main = "lamb price")

boxplot(tidy_data$`Pork Price`, main = "pork price")

boxplot(tidy_data$`Coarse wool Price`, main = "coarse wool price")

boxplot(tidy_data$`Fine wool Price`, main = "fine wool price")

boxplot(tidy_data$wool_price_diff, main = "wool price difference")

boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")


cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>% cap()
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>% cap()
tidy_data$wool_price_diff <- tidy_data$wool_price_diff %>% cap()
tidy_data$cattle_meat_prices <- tidy_data$cattle_meat_prices %>% cap()

boxplot(tidy_data$`Pork Price`, main = "pork")

boxplot(tidy_data$`Fine wool Price`, main = "fine wool")

boxplot(tidy_data$wool_price_diff, main = "wool price difference")

boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")

Transform

The last step in data manipulation is transforming it. In tidy_data, the wool_price_diff has right skewed distribution. The data has to be converted to normal distribution for futher hypothesis testing or analysis to be done. In this step, BoxCox transformation is done on wool_price_diff to transform it to normal distribution. It is viewed in the histogram.

# This is the R chunk for the Transform Section

hist(tidy_data$wool_price_diff)


boxcox_wool_price_diff <- BoxCox(tidy_data$wool_price_diff,lambda = "auto")
boxcox_wool_price_diff
  [1] 130.676224 130.676224 128.607765 132.859772 144.327692 142.651779 139.073185 148.662202
  [9] 146.269626 151.450040  87.009723  83.443367  76.488135 104.888209 124.265143 113.163045
 [17] 103.590002  93.392185  81.383946 101.414126 106.626667  98.738210  96.785914  99.761532
 [25]  97.168559  92.497635  86.093642  76.990143  73.306625  74.771996  68.672373  65.546669
 [33]  69.312690  64.183505  57.884094  57.886864  40.650110  49.775450  52.988494  52.497126
 [41]  51.423396  44.304236  51.829060  61.335424  53.884507  74.461345  85.824017  76.520963
 [49]  86.083931  98.489950 112.930305 107.700795 112.033308 123.127407 114.311103 123.990171
 [57] 127.570414 128.580167 121.568866 109.688410 104.537998  98.844874  95.078841  72.217909
 [65]  79.401805  75.409774  48.300589  59.175819  51.599025  48.752798  32.313692  50.799895
 [73]  63.722511  94.333338  82.361600  81.710264 101.427917 101.421022  85.166988  68.069449
 [81]  72.673056  86.117919  90.668775  92.054025  96.571039 102.238196 105.222137 104.565301
 [89] 102.419239 105.791496 101.375048  94.742784  87.411007  78.221631  75.191455  79.222158
 [97]  80.943220  81.542215  80.707674  87.377188  69.578692  64.170120  49.629784  62.708179
[105]  63.435172  66.013994  70.301969 100.322634 112.175567 100.871177 113.698846 117.064884
[113] 111.096001 110.977815 110.167218 110.995657 112.166677 126.589987 130.676224 137.623135
[121] 143.270524 143.217290 131.378727 125.906447 122.210789 119.013663 118.114893 115.343045
[129] 116.582576 117.664649 118.293956 110.240991 123.448221 114.286810 106.649270  94.109540
[137]  86.083931  64.405569  44.004352  49.501481  48.073961  56.702731  42.156812  33.474435
[145]  37.087008  40.194847  25.118679  15.408952  20.004536  26.479094  35.259951  32.865707
[153]  31.940607  32.276783  21.118207  24.321146  25.173254  21.776715   9.680800  -6.701597
[161]   9.968440  14.464967  27.753911  26.008985  31.610035  45.260026  43.365996  46.816673
[169]  55.597410  53.435699  54.490425  59.236311  64.864942  69.424869  71.353848  68.247889
[177]  60.897371  61.758886  64.851603  64.336034  64.894284  61.707362  49.822033  49.007656
[185]  51.469482  49.317583  47.018581  44.152863  41.500034  53.074043  65.820294  60.728401
[193]  64.025498  71.392611  68.428795  63.955826  64.301257  64.250418  59.645465  55.268618
[201]  54.184914  62.926569  60.351746  65.899920  68.156064  68.944423  68.766585  61.994629
[209]  63.768124  62.843017  87.106475  87.427915  88.702553  97.385323  95.976964  93.677931
[217]  95.735709  88.976437  84.542462  82.506929  74.952586  79.059860  75.077135  67.464745
[225]  65.711421  58.207913  51.345595  57.048246  60.788376  58.531137  50.199994  56.390053
[233]  59.919545  62.527345  69.056800  67.980170  66.008690  57.595750  56.546464  56.362108
[241]  62.054195  56.666467  57.056597  52.084578  54.377339  68.161312  76.621942 113.038937
[249] 112.348862 145.382267 152.305292 136.236661 123.727897 135.594831 133.359213 121.607803
[257] 117.815523 121.200924 111.314422 101.910213  86.020798  75.018682  60.540171  68.436656
[265]  56.616244  43.241005  34.533607  29.079533  37.228046  38.900878  51.302356  56.431963
[273]  57.309708  54.012106  52.253738  45.491185  39.074424  31.061097  15.374946 -12.456048
[281]  11.567127  37.330510  33.858432  38.964015  40.833612  27.651642  17.686629   4.222882
[289]  -8.892777  19.544879  12.498149  11.436943  19.914443  27.081629  30.532915  30.734312
[297]  27.609288  27.063865  25.474552  25.325862  41.759450  47.687588  42.150661  26.174371
[305]  27.262608  29.093418  31.876603  36.033261  36.904038  35.474948  25.810794  24.372614
[313]  28.895348  42.721476  33.653370  21.091269  21.765285  35.832336  59.200568  60.736581
[321]  67.256627  84.447185 102.478800 115.202108 119.281544 119.207515 102.355084  74.862523
[329]  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523
[337]  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523
[345]  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523
[353]  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523  74.862523
[361]  74.862523
attr(,"lambda")
[1] 0.7469761
hist(boxcox_wool_price_diff)

The final tidy_data is cleaned and ready for analysis.

---
title: "MATH2349 Semester 1, 2020"
author: "JAHNAVI POKKUNOORI s3801330"
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
---

## Required packages 

The required packages to generate this report are listed below.

```{r}
# This is the R chunk for the required packages
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(ggplot2)
library(forecast)

```


## Executive Summary 

This report talks about the preprocessing of selected prices data from April 1990 to April 2020. The required columns with all observations from the main datasets are subsetted and merged to form one dataframe. The data is untidy, so it is tidied and the variables' datatypes are checked. Converions in data types for requisite variables are done. The variables with special and missing values are manipulated. New variables are created in the dataframe and outliers are sccanned for every numeric variable and settled. The choosen variable is transformed to a normal distribution for further analysis that could be done on that particular variable. 


## Data 

The open datasets are collected from kaggle website. 

The first data meat_prices taken from https://www.kaggle.com/kianwee/meat-prices-19902020 reflects changing prices of meat with their percentages for past three  decades. It has 31 observations with 11 variables. The prices indicated price per kilogram. 
The second data rawmaterial_prices is from https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020 that details about the changes in prices of non-edible raw materials in the same time period of first data. It comprises of 361 observations of 25 variables.  
The data are obtained from indexmundi and both the datasets have character and numeric variables and are untidy.

Two subsets are created with variables month, beef price, lamb price, pork price from meat_prices and month, coarse wool price, fine wool price from rawmaterial_prices. These two subsets are merged by month to form merge_data comprising 361 observations of six variables.

The output of head(merge_data) gives details about variables in it. The data has 1 character variable Month and 5 numeric(dbl) variables. 


```{r}
# This is the R chunk for the Data Section
meat_prices <- read_csv("C:/Users/intel/Desktop/meat prices.csv")
head(meat_prices)

rawmaterial_prices <- read_csv("C:/Users/intel/Desktop/rawmaterial prices.csv")
head(rawmaterial_prices)

#subsetting and merging the 2 datasets#
cattle_prices <- meat_prices[, c(1,4,6,8)]
wool_prices <- rawmaterial_prices[, c(1,2,8)]
merge_data <- left_join(cattle_prices, wool_prices, by ="Month")
head(merge_data)

```

##Tidy & Manipulate Data I

merge_data is untidy because the column Month has more than one information. The column is split into two seperate columns - Month and year to make it tidy resulting in tidy_data of 361 observations with 7 variables. head(tidy_data) shows the description about variables in tidy_data.  

```{r}
# This is the R chunk for the Tidy & Manipulate Data I 

tidy_data <- merge_data %>% separate(Month, into = c("Month", "Year"), sep = "-")
head(tidy_data)

```


##	Understand

tidy_data has year as a character which is converted to Year. The variable Month is converted as a factor. The prices are decimal values and thus is double i.e numeric type variable.

```{r}
# This is the R chunk for the Understand Section

str(tidy_data)
attributes(tidy_data)

tidy_data$Year <- as.numeric(tidy_data$Year)
str(tidy_data$Year)

tidy_data$Month <- tidy_data$Month %>% as.factor()

tidy_data$Month <- factor(tidy_data$Month,levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), 
                         labels = (c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
str(tidy_data$Month)
levels(tidy_data$Month)

```

##	Scan I 

tidy_data has both special and missiong values - NaN and NAs. Lamb prices and pork prices have NaN values which are then imputed as NA values. 
Missing values are present in lamb prices, pork prices, coarse wool prices and fine wool prices.
More than 5% of the data in the variables are Na values. Deleting them would impact the analysis. Since, they are quantitative variables, the NA values are imputed with the mean values of those variables. tidy_data now has no NA values.

```{r}
# This is the R chunk for the Scan I
#Dealing with special values#
which(is.nan(tidy_data$Month))
which(is.nan(tidy_data$Year))
which(is.nan(tidy_data$`Beef Price`))
which(is.nan(tidy_data$`Lamb price`))
which(is.nan(tidy_data$`Pork Price`))
which(is.nan((tidy_data$`Coarse wool Price`)))
which(is.nan(tidy_data$`Fine wool Price`))

tidy_data$`Lamb price`[is.nan(tidy_data$`Lamb price`)] <- NA
tidy_data$`Pork Price`[is.nan(tidy_data$`Pork Price`)] <- NA

#Dealing with missing values#
colSums(is.na(tidy_data))

tidy_data$`Lamb price` <- tidy_data$`Lamb price` %>%  impute(fun = mean)
tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>%  impute(fun = mean)
tidy_data$`Coarse wool Price` <- tidy_data$`Coarse wool Price` %>%  impute(fun = mean)
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>%  impute(fun = mean)

colSums(is.na(tidy_data))

```


##	Tidy & Manipulate Data II 

A new column wool_price_diff is mutated to find the difference between coarse wool and fine wool prices in every month for the three decades.
Another column cattle_meat_prices is created which outlines the total meat prices of beef, lamb and pork for every month from 1990 to 2020.

```{r}
# This is the R chunk for the Tidy & Manipulate Data II 

tidy_data <- tidy_data %>% mutate(wool_price_diff = `Fine wool Price` - `Coarse wool Price`)
tidy_data$cattle_meat_prices = tidy_data$`Beef Price`+tidy_data$`Lamb price`+tidy_data$`Pork Price`
head(tidy_data)

```


##	Scan II

After imputing the missing values, the imputed variables are needed to be converted into numeric to work on for outliers. Out of all the numeric variables, pork prices, fine wool prices, wool price diff and cattle meat prices has outliers which could be seen in the boxplot.
The dots outside the boxplot indicate the outliers.
Though the outliers are less than 5% in each of these columns, deleting them is not preffered as every observation carries some information that would influence the samples in the further analysis. Capping the outliers would replace the outliers with nearest percentiles and doesn't extremely affect the final analysis.
After capping the boxplots are more clear with no outliers. 

```{r}
# This is the R chunk for the Scan II
str(tidy_data)

tidy_data$`Lamb price` <- as.numeric(tidy_data$`Lamb price`)
tidy_data$`Pork Price` <- as.numeric(tidy_data$`Pork Price`)
tidy_data$`Coarse wool Price` <- as.numeric(tidy_data$`Coarse wool Price`)
tidy_data$`Fine wool Price`<- as.numeric(tidy_data$`Fine wool Price`)
tidy_data$wool_price_diff <- as.numeric(tidy_data$wool_price_diff)
tidy_data$cattle_meat_prices <- as.numeric(tidy_data$cattle_meat_prices)

boxplot(tidy_data$Year, main = "year")
boxplot(tidy_data$`Beef Price`, main = "beef price")
boxplot(tidy_data$`Lamb price`, main = "lamb price")
boxplot(tidy_data$`Pork Price`, main = "pork price")
boxplot(tidy_data$`Coarse wool Price`, main = "coarse wool price")
boxplot(tidy_data$`Fine wool Price`, main = "fine wool price")
boxplot(tidy_data$wool_price_diff, main = "wool price difference")
boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>% cap()
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>% cap()
tidy_data$wool_price_diff <- tidy_data$wool_price_diff %>% cap()
tidy_data$cattle_meat_prices <- tidy_data$cattle_meat_prices %>% cap()

boxplot(tidy_data$`Pork Price`, main = "pork")
boxplot(tidy_data$`Fine wool Price`, main = "fine wool")
boxplot(tidy_data$wool_price_diff, main = "wool price difference")
boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")

```

##	Transform 

The last step in data manipulation is transforming it. In tidy_data, the wool_price_diff has right skewed distribution. The data has to be converted to normal distribution for futher hypothesis testing or analysis to be done. In this step, BoxCox transformation is done on wool_price_diff to transform it to normal distribution. It is viewed in the histogram.

```{r}
# This is the R chunk for the Transform Section

hist(tidy_data$wool_price_diff)

boxcox_wool_price_diff <- BoxCox(tidy_data$wool_price_diff,lambda = "auto")
boxcox_wool_price_diff

hist(boxcox_wool_price_diff)
```


The final tidy_data is cleaned and ready for analysis.