The required packages to generate this report are listed below.
# This is the R chunk for the required packages
library(readr)
library(dplyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: 㤼㸱dplyr㤼㸲
The following objects are masked from 㤼㸱package:stats㤼㸲:
filter, lag
The following objects are masked from 㤼㸱package:base㤼㸲:
intersect, setdiff, setequal, union
library(tidyr)
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
RStudio Community is a great place to get help:
https://community.rstudio.com/c/tidyverse
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: 㤼㸱Hmisc㤼㸲
The following objects are masked from 㤼㸱package:dplyr㤼㸲:
src, summarize
The following objects are masked from 㤼㸱package:base㤼㸲:
format.pval, units
library(ggplot2)
library(forecast)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
This is forecast 8.12
Use suppressPackageStartupMessages() to eliminate package startup messages.
This report talks about the preprocessing of selected prices data from April 1990 to April 2020. The required columns with all observations from the main datasets are subsetted and merged to form one dataframe. The data is untidy, so it is tidied and the variables’ datatypes are checked. Converions in data types for requisite variables are done. The variables with special and missing values are manipulated. New variables are created in the dataframe and outliers are sccanned for every numeric variable and settled. The choosen variable is transformed to a normal distribution for further analysis that could be done on that particular variable.
The open datasets are collected from kaggle website.
The first data meat_prices taken from https://www.kaggle.com/kianwee/meat-prices-19902020 reflects changing prices of meat with their percentages for past three decades. It has 31 observations with 11 variables. The prices indicated price per kilogram. The second data rawmaterial_prices is from https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020 that details about the changes in prices of non-edible raw materials in the same time period of first data. It comprises of 361 observations of 25 variables.
The data are obtained from indexmundi and both the datasets have character and numeric variables and are untidy.
Two subsets are created with variables month, beef price, lamb price, pork price from meat_prices and month, coarse wool price, fine wool price from rawmaterial_prices. These two subsets are merged by month to form merge_data comprising 361 observations of six variables.
The output of head(merge_data) gives details about variables in it. The data has 1 character variable Month and 5 numeric(dbl) variables.
# This is the R chunk for the Data Section
Parsed with column specification:
cols(
Month = [31mcol_character()[39m,
`Chicken Price` = [32mcol_double()[39m,
`Chicken price % Change` = [31mcol_character()[39m,
`Beef Price` = [32mcol_double()[39m,
`Beef price % Change` = [31mcol_character()[39m,
`Lamb price` = [32mcol_double()[39m,
`Lamb price % Change` = [31mcol_character()[39m,
`Pork Price` = [32mcol_double()[39m,
`Pork price % Change` = [31mcol_character()[39m,
`Salmon Price` = [32mcol_double()[39m,
`Salmon price % Change` = [31mcol_character()[39m
)
head(meat_prices)
rawmaterial_prices <- read_csv("C:/Users/intel/Desktop/rawmaterial prices.csv")
Parsed with column specification:
cols(
.default = col_character(),
`Coarse wool Price` = [32mcol_number()[39m,
`Copra Price` = [32mcol_number()[39m,
`Cotton Price` = [32mcol_double()[39m,
`Fine wool Price` = [32mcol_number()[39m,
`Hard log Price` = [32mcol_double()[39m,
`Hard sawnwood Price` = [32mcol_double()[39m,
`Hide Price` = [32mcol_double()[39m,
`Plywood Price` = [32mcol_double()[39m,
`Rubber Price` = [32mcol_double()[39m,
`Softlog Price` = [32mcol_double()[39m,
`Soft sawnwood Price` = [32mcol_double()[39m,
`Wood pulp Price` = [32mcol_double()[39m
)
See spec(...) for full column specifications.
head(rawmaterial_prices)
#subsetting and merging the 2 datasets#
cattle_prices <- meat_prices[, c(1,4,6,8)]
wool_prices <- rawmaterial_prices[, c(1,2,8)]
merge_data <- left_join(cattle_prices, wool_prices, by ="Month")
head(merge_data)
NA
##Tidy & Manipulate Data I
merge_data is untidy because the column Month has more than one information. The column is split into two seperate columns - Month and year to make it tidy resulting in tidy_data of 361 observations with 7 variables. head(tidy_data) shows the description about variables in tidy_data.
# This is the R chunk for the Tidy & Manipulate Data I
tidy_data <- merge_data %>% separate(Month, into = c("Month", "Year"), sep = "-")
head(tidy_data)
NA
tidy_data has year as a character which is converted to Year. The variable Month is converted as a factor. The prices are decimal values and thus is double i.e numeric type variable.
# This is the R chunk for the Understand Section
str(tidy_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 361 obs. of 7 variables:
$ Month : chr "Apr" "May" "Jun" "Jul" ...
$ Year : chr "90" "90" "90" "90" ...
$ Beef Price : num 2.47 2.48 2.5 2.56 2.65 2.7 2.59 2.53 2.55 2.59 ...
$ Lamb price : num 2.63 2.71 2.72 2.77 2.66 2.7 2.77 2.74 2.65 2.63 ...
$ Pork Price : num 99.6 115.7 121.5 96.2 94 ...
$ Coarse wool Price: num 482 447 441 418 418 ...
$ Fine wool Price : num 1072 1057 898 896 951 ...
attributes(tidy_data)
$names
[1] "Month" "Year" "Beef Price" "Lamb price"
[5] "Pork Price" "Coarse wool Price" "Fine wool Price"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
[45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
[67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
[89] 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
[111] 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
[133] 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
[155] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
[177] 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
[199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
[221] 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
[243] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
[265] 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
[287] 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308
[309] 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
[331] 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
[353] 353 354 355 356 357 358 359 360 361
$class
[1] "tbl_df" "tbl" "data.frame"
tidy_data$Year <- as.numeric(tidy_data$Year)
str(tidy_data$Year)
num [1:361] 90 90 90 90 90 90 90 90 90 91 ...
tidy_data$Month <- tidy_data$Month %>% as.factor()
tidy_data$Month <- factor(tidy_data$Month,levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
labels = (c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")))
str(tidy_data$Month)
Factor w/ 12 levels "Jan","Feb","Mar",..: 4 5 6 7 8 9 10 11 12 1 ...
levels(tidy_data$Month)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
tidy_data has both special and missiong values - NaN and NAs. Lamb prices and pork prices have NaN values which are then imputed as NA values. Missing values are present in lamb prices, pork prices, coarse wool prices and fine wool prices. More than 5% of the data in the variables are Na values. Deleting them would impact the analysis. Since, they are quantitative variables, the NA values are imputed with the mean values of those variables. tidy_data now has no NA values.
# This is the R chunk for the Scan I
#Dealing with special values#
which(is.nan(tidy_data$Month))
integer(0)
which(is.nan(tidy_data$Year))
integer(0)
which(is.nan(tidy_data$`Beef Price`))
integer(0)
which(is.nan(tidy_data$`Lamb price`))
[1] 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357
[23] 358 359 360 361
which(is.nan(tidy_data$`Pork Price`))
[1] 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349
[23] 350 351 352 353 354 355 356 357 358 359 360 361
which(is.nan((tidy_data$`Coarse wool Price`)))
integer(0)
which(is.nan(tidy_data$`Fine wool Price`))
integer(0)
tidy_data$`Lamb price`[is.nan(tidy_data$`Lamb price`)] <- NA
tidy_data$`Pork Price`[is.nan(tidy_data$`Pork Price`)] <- NA
#Dealing with missing values#
colSums(is.na(tidy_data))
Month Year Beef Price Lamb price Pork Price
0 0 0 26 34
Coarse wool Price Fine wool Price
34 34
tidy_data$`Lamb price` <- tidy_data$`Lamb price` %>% impute(fun = mean)
tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>% impute(fun = mean)
tidy_data$`Coarse wool Price` <- tidy_data$`Coarse wool Price` %>% impute(fun = mean)
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>% impute(fun = mean)
colSums(is.na(tidy_data))
Month Year Beef Price Lamb price Pork Price
0 0 0 0 0
Coarse wool Price Fine wool Price
0 0
A new column wool_price_diff is mutated to find the difference between coarse wool and fine wool prices in every month for the three decades. Another column cattle_meat_prices is created which outlines the total meat prices of beef, lamb and pork for every month from 1990 to 2020.
# This is the R chunk for the Tidy & Manipulate Data II
tidy_data <- tidy_data %>% mutate(wool_price_diff = `Fine wool Price` - `Coarse wool Price`)
tidy_data$cattle_meat_prices = tidy_data$`Beef Price`+tidy_data$`Lamb price`+tidy_data$`Pork Price`
head(tidy_data)
NA
After imputing the missing values, the imputed variables are needed to be converted into numeric to work on for outliers. Out of all the numeric variables, pork prices, fine wool prices, wool price diff and cattle meat prices has outliers which could be seen in the boxplot. The dots outside the boxplot indicate the outliers. Though the outliers are less than 5% in each of these columns, deleting them is not preffered as every observation carries some information that would influence the samples in the further analysis. Capping the outliers would replace the outliers with nearest percentiles and doesn’t extremely affect the final analysis. After capping the boxplots are more clear with no outliers.
# This is the R chunk for the Scan II
str(tidy_data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 361 obs. of 9 variables:
$ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 4 5 6 7 8 9 10 11 12 1 ...
$ Year : num 90 90 90 90 90 90 90 90 90 91 ...
$ Beef Price : num 2.47 2.48 2.5 2.56 2.65 2.7 2.59 2.53 2.55 2.59 ...
$ Lamb price : 'impute' num 2.63 2.71 2.72 2.77 2.66 2.7 2.77 2.74 2.65 2.63 ...
..- attr(*, "imputed")= int 336 337 338 339 340 341 342 343 344 345 ...
$ Pork Price : 'impute' num 99.6 115.7 121.5 96.2 94 ...
..- attr(*, "imputed")= int 328 329 330 331 332 333 334 335 336 337 ...
$ Coarse wool Price : 'impute' num 482 447 441 418 418 ...
..- attr(*, "imputed")= int 328 329 330 331 332 333 334 335 336 337 ...
$ Fine wool Price : 'impute' num 1072 1057 898 896 951 ...
..- attr(*, "imputed")= int 328 329 330 331 332 333 334 335 336 337 ...
$ wool_price_diff : 'impute' num 589 610 457 477 533 ...
..- attr(*, "imputed")= int 328 329 330 331 332 333 334 335 336 337 ...
$ cattle_meat_prices: 'impute' num 104.7 120.9 126.7 101.5 99.3 ...
..- attr(*, "imputed")= int 336 337 338 339 340 341 342 343 344 345 ...
tidy_data$`Lamb price` <- as.numeric(tidy_data$`Lamb price`)
tidy_data$`Pork Price` <- as.numeric(tidy_data$`Pork Price`)
tidy_data$`Coarse wool Price` <- as.numeric(tidy_data$`Coarse wool Price`)
tidy_data$`Fine wool Price`<- as.numeric(tidy_data$`Fine wool Price`)
tidy_data$wool_price_diff <- as.numeric(tidy_data$wool_price_diff)
tidy_data$cattle_meat_prices <- as.numeric(tidy_data$cattle_meat_prices)
boxplot(tidy_data$Year, main = "year")
boxplot(tidy_data$`Beef Price`, main = "beef price")
boxplot(tidy_data$`Lamb price`, main = "lamb price")
boxplot(tidy_data$`Pork Price`, main = "pork price")
boxplot(tidy_data$`Coarse wool Price`, main = "coarse wool price")
boxplot(tidy_data$`Fine wool Price`, main = "fine wool price")
boxplot(tidy_data$wool_price_diff, main = "wool price difference")
boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
tidy_data$`Pork Price` <- tidy_data$`Pork Price` %>% cap()
tidy_data$`Fine wool Price` <- tidy_data$`Fine wool Price` %>% cap()
tidy_data$wool_price_diff <- tidy_data$wool_price_diff %>% cap()
tidy_data$cattle_meat_prices <- tidy_data$cattle_meat_prices %>% cap()
boxplot(tidy_data$`Pork Price`, main = "pork")
boxplot(tidy_data$`Fine wool Price`, main = "fine wool")
boxplot(tidy_data$wool_price_diff, main = "wool price difference")
boxplot(tidy_data$cattle_meat_prices, main = "cattle meat prices")
The last step in data manipulation is transforming it. In tidy_data, the wool_price_diff has right skewed distribution. The data has to be converted to normal distribution for futher hypothesis testing or analysis to be done. In this step, BoxCox transformation is done on wool_price_diff to transform it to normal distribution. It is viewed in the histogram.
# This is the R chunk for the Transform Section
hist(tidy_data$wool_price_diff)
boxcox_wool_price_diff <- BoxCox(tidy_data$wool_price_diff,lambda = "auto")
boxcox_wool_price_diff
[1] 130.676224 130.676224 128.607765 132.859772 144.327692 142.651779 139.073185 148.662202
[9] 146.269626 151.450040 87.009723 83.443367 76.488135 104.888209 124.265143 113.163045
[17] 103.590002 93.392185 81.383946 101.414126 106.626667 98.738210 96.785914 99.761532
[25] 97.168559 92.497635 86.093642 76.990143 73.306625 74.771996 68.672373 65.546669
[33] 69.312690 64.183505 57.884094 57.886864 40.650110 49.775450 52.988494 52.497126
[41] 51.423396 44.304236 51.829060 61.335424 53.884507 74.461345 85.824017 76.520963
[49] 86.083931 98.489950 112.930305 107.700795 112.033308 123.127407 114.311103 123.990171
[57] 127.570414 128.580167 121.568866 109.688410 104.537998 98.844874 95.078841 72.217909
[65] 79.401805 75.409774 48.300589 59.175819 51.599025 48.752798 32.313692 50.799895
[73] 63.722511 94.333338 82.361600 81.710264 101.427917 101.421022 85.166988 68.069449
[81] 72.673056 86.117919 90.668775 92.054025 96.571039 102.238196 105.222137 104.565301
[89] 102.419239 105.791496 101.375048 94.742784 87.411007 78.221631 75.191455 79.222158
[97] 80.943220 81.542215 80.707674 87.377188 69.578692 64.170120 49.629784 62.708179
[105] 63.435172 66.013994 70.301969 100.322634 112.175567 100.871177 113.698846 117.064884
[113] 111.096001 110.977815 110.167218 110.995657 112.166677 126.589987 130.676224 137.623135
[121] 143.270524 143.217290 131.378727 125.906447 122.210789 119.013663 118.114893 115.343045
[129] 116.582576 117.664649 118.293956 110.240991 123.448221 114.286810 106.649270 94.109540
[137] 86.083931 64.405569 44.004352 49.501481 48.073961 56.702731 42.156812 33.474435
[145] 37.087008 40.194847 25.118679 15.408952 20.004536 26.479094 35.259951 32.865707
[153] 31.940607 32.276783 21.118207 24.321146 25.173254 21.776715 9.680800 -6.701597
[161] 9.968440 14.464967 27.753911 26.008985 31.610035 45.260026 43.365996 46.816673
[169] 55.597410 53.435699 54.490425 59.236311 64.864942 69.424869 71.353848 68.247889
[177] 60.897371 61.758886 64.851603 64.336034 64.894284 61.707362 49.822033 49.007656
[185] 51.469482 49.317583 47.018581 44.152863 41.500034 53.074043 65.820294 60.728401
[193] 64.025498 71.392611 68.428795 63.955826 64.301257 64.250418 59.645465 55.268618
[201] 54.184914 62.926569 60.351746 65.899920 68.156064 68.944423 68.766585 61.994629
[209] 63.768124 62.843017 87.106475 87.427915 88.702553 97.385323 95.976964 93.677931
[217] 95.735709 88.976437 84.542462 82.506929 74.952586 79.059860 75.077135 67.464745
[225] 65.711421 58.207913 51.345595 57.048246 60.788376 58.531137 50.199994 56.390053
[233] 59.919545 62.527345 69.056800 67.980170 66.008690 57.595750 56.546464 56.362108
[241] 62.054195 56.666467 57.056597 52.084578 54.377339 68.161312 76.621942 113.038937
[249] 112.348862 145.382267 152.305292 136.236661 123.727897 135.594831 133.359213 121.607803
[257] 117.815523 121.200924 111.314422 101.910213 86.020798 75.018682 60.540171 68.436656
[265] 56.616244 43.241005 34.533607 29.079533 37.228046 38.900878 51.302356 56.431963
[273] 57.309708 54.012106 52.253738 45.491185 39.074424 31.061097 15.374946 -12.456048
[281] 11.567127 37.330510 33.858432 38.964015 40.833612 27.651642 17.686629 4.222882
[289] -8.892777 19.544879 12.498149 11.436943 19.914443 27.081629 30.532915 30.734312
[297] 27.609288 27.063865 25.474552 25.325862 41.759450 47.687588 42.150661 26.174371
[305] 27.262608 29.093418 31.876603 36.033261 36.904038 35.474948 25.810794 24.372614
[313] 28.895348 42.721476 33.653370 21.091269 21.765285 35.832336 59.200568 60.736581
[321] 67.256627 84.447185 102.478800 115.202108 119.281544 119.207515 102.355084 74.862523
[329] 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523
[337] 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523
[345] 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523
[353] 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523 74.862523
[361] 74.862523
attr(,"lambda")
[1] 0.7469761
hist(boxcox_wool_price_diff)
The final tidy_data is cleaned and ready for analysis.