This assignment is a coding exercise using the R programming language. The code creates synthetic datasets for a car sales company. The outputs of the code include columns, values, datasets, graphs, and other results. These results, the workings, and the code itself are explained in plain text where appropriate.
Contents:
knitr::opts_chunk$set(echo = TRUE)
The code chunk below loads the packages required to produce this report.
#install.packages("MASS") #if not yet installed, install
library(readr) # for reading files
suppressPackageStartupMessages(library(dplyr)) #to not show attached messages
suppressPackageStartupMessages(library(MASS)) #to not show attached messages
library(magrittr)
The following lists describe each variable generated for datasets 1 and 2:
Dataset 1:
* Order_Number: holds the unique order number for each order in dataset 1.
* Qty: the quantity of vehicles being ordered.
* Price: the price of the vehicle being ordered.
* Order_dates: the date on which the car/vehicle was ordered.
* Sold_status: a logical variable; TRUE means the car has been sold, FALSE means it has not yet been sold.
* Prdct_Code: the unique code of the car model. It is a shortened form of "Product_Code", used to fit the report page width.
Dataset 2:
* Prdct_Code: the variable common to datasets 1 and 2.
* Car_Model: the name of the car model.
* MPG: the miles per gallon of the car.
* Cylndrs: the number of cylinders in the car's engine.
* Displmnt: the engine displacement of the car.
* HrsPwr: the engine power of the car, measured in horsepower.
* Weight: the weight of the car in kilograms.
* Origin: the region of origin of the car.
Number of Rows and Columns: Two synthetic datasets, datasets 1 and 2, are created, as shown in the merge section below. Each dataset has between 50 and 100 rows (90 rows, to be exact), and each has 5-10 variables (dataset 1 has 6 variables and dataset 2 has 8).
Common Variable: As can be seen in the two synthetic datasets below, the common variable is the product code, shortened to Prdct_Code so that all columns fit the page width on a single line.
Multiple Data Types: Each synthetic dataset includes multiple data types. Dataset 1 has the numeric types integer and double, as well as date, logical, and character. Dataset 2 has character, integer, double, and factor types.
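As an illustrative check (not part of the original workflow), the row/column counts and data types described above can be confirmed once the two datasets have been assembled further below, for example:
# Quick checks of the claims above (run after dataset1 and dataset2 are created)
dim(dataset1)            # expected: 90 rows and 6 columns
dim(dataset2)            # expected: 90 rows and 8 columns
sapply(dataset1, class)  # the data type (class) of each column of dataset 1
sapply(dataset2, class)  # the data type (class) of each column of dataset 2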
Generating list of columns/data: The following code chunks generate each of the columns of the datasets.
The headings above the code chunks and the comments within them explain the data generation in more detail.
The code chunk below generates the order numbers of dataset 1. The order numbers form a vector running from 1200 to 1289, giving 90 rows.
Order_Number <- as.character(c(1200:1289))
#Order_Number # If need be, then to uncomment this line to print
The code chunk below generates the order quantity, which is the number of cars a customer orders. The code sets the lower and upper endpoints (the sampling range) and a seed for reproducibility, and then generates 90 sample values for the quantity. Each value is converted to an integer before being stored in the variable Qty. The print statement for Qty is commented out, but it can be uncommented to print the head or the full set of quantity values.
# Set the parameters:
lower <- 1 # A lower endpoint for order_quantity
upper <- 5 # Choose an upper endpoint for order_quantity
n <- 90 # Set the sample size/rows
SEED <- 345 # Specify a seed
set.seed(SEED) # Set the seed
Order_Quantity <- runif(n, lower, upper) # Generate the synthetic data
Qty <- as.integer(Order_Quantity) # Convert to numeric integer
# Qty #If need be, then to uncomment this code to print
The code below generates a sample of 90 cars.
# List of car model names to be randomly sampled to produce 90 rows
car_models <- c("Toyota Corolla", "Toyota Caldina 7A","Toyota Caldina 2000",
"Toyota Caldina ZT", "Mazda Speed Axela",
"Subaru Levorg", "Toyota Corolla Van", "Toyoto Altis E",
"Toyota Altizza RS200", "Mazda Tribute", "Mitsubshi Airtrek",
"Toyota Crown","Nissan Xtrail", "Toyota Carina SI MyRoad",
"Toyota White Caldina","Toyota Caldina 1AZ", "Black Caldina",
"Mazda Tribute FB-X 4WD")
# Generate a sample of 90 cars
set.seed(123) # For reproducibility
sample_size <- 90
car_sample <- sample(car_models, sample_size, replace = TRUE)
#print(car_sample) #If need be, then to uncomment this code to print
The code chunks below generate the product codes for the above sample of cars. The product codes are unique to each unique car model. That is, ninety (90) car samples are generated while ensuring that no row is empty, and a data frame is then built that assigns a product code to each car model. Finally, the two columns of the data frame, car model and product code, are stored separately in the variables Car_Model and Product_Code, respectively, to be used during the merging process later on.
# 90 cars generated, and (ensuring non-empty car values)
set.seed(123) # For reproducibility
sample_size <- 90
non_empty_car_models <- car_models[car_models != ""] # Remove empty values
car_sample <- sample(non_empty_car_models, sample_size, replace = TRUE)
# A 5-character product code (prefixed with "PC") is generated for
# each unique model of car
unique_car_models <- unique(car_sample)
product_codes <- sprintf("PC%03d", seq_along(unique_car_models))
# Create a dataframe with product codes assigned to each car model
car_df <- data.frame(Car_Model = car_sample,
Product_Code = product_codes[match(car_sample,
unique_car_models)])
# Separating car models and product codes into their own variables
# from the data frame, car_df
Car_Model <- car_df$Car_Model
Product_Code <- car_df$Product_Code
# This is a shortened version of the long form "Product_Code"
# to enable all columns to show A4-width-wise for the dataset 2.
Prdct_Code <- Product_Code
#print(Car_Model) # If need be, then to uncomment this code to print
#print(Prdct_Code) # If need be, then to uncomment this code to print
Here, the price column is generated. The price of a car ranges from a minimum of thirty thousand (30,000) to a maximum of one hundred and twenty thousand (120,000). The 90 samples are generated with a seed of 445 so that the same values can be reproduced in the future. Finally, the price is converted to a double and rounded to two (2) decimal places before the result is stored in the variable 'Price'. Like the data generated above, the price data is not printed at this point, mainly to keep to the page limit of this report; the print statement is commented out below, but all values are displayed when the full datasets 1 and 2 are printed later, during the merging stage.
# Set the parameters:
lowerPrice <- 30000 # A lower endpoint for price
upperPrice <- 120000 # An upper endpoint for price
n <- 90 # sample size/rows
SEED_Price <- 445 # Specify a seed
set.seed(SEED_Price) # Set the seed
Order_Price <- runif(n, lowerPrice, upperPrice) # Generate the synthetic data
# Convert to double, then put to 2 decimal places
Price <- round(as.double(Order_Price),2)
#print(Price) #If need be, then to uncomment this code to print
The code chunk below shows the workings for the randomly generated dates. These 90 date values are randomly picked between the 1st of January 2021 and the 31st of December 2022. As with the variables above, the print statement for the order dates is commented out so they are not printed at this point, but they appear when all the columns are combined into their datasets.
# Generate a sample of 90 random dates between 2021 and 2022
Order_dates <- sample(seq(as.Date("2021-01-01"),
as.Date("2022-12-31"), by="days"), 90)
# Print the sample of random dates
#print(Order_dates) #If need be, then to uncomment this code to print
The code chunk below samples 90 rows of the sold status. The values are sampled as "Yes" or "No" and then converted to logical TRUE/FALSE.
# Sample of 90 random "Yes" and "No" values
sold <- sample(c("Yes", "No"), 90, replace = TRUE)
Sold_status <- as.logical(sold == "Yes") # Convert to logical datatype
# Printing line
# print(Sold_status) #If need be, then to uncomment this code to print
The columns generated above belong to dataset 1 (data frame 1). The next set of variables to be generated is for dataset 2.
The code chunk below generates the MPG column, which holds integer values for the miles per gallon of each car on sale. As above, the sampling process specifies the minimum, the maximum, a seed, and a sample size of 90 values. The MPG values are then converted to integers and stored in the MPG variable for later merging with the other dataset 2 variables.
# Set the parameters:
lower <- 9 # Lower endpoint for Miles Per Gallon (MPG)
upper <- 30 # Upper endpoint Miles Per Gallon (MPG)
n <- 90 # Set the sample size/rows
SEED <- 347 # Specify a seed
# generate
set.seed(SEED) # Set the seed
MPG <- runif(n, lower, upper) # Generate the synthetic data
MPG <- as.integer(MPG) # Convert to numeric integer
#MPG #If need be, then to uncomment this code to print
The code below generates 90 random cylinder values. The sampling process is similar to that of the variables above.
set.seed(234) # Set seed for reproducibility
values <- c(3, 4, 6, 8) # Values to sample from
# Generate 90 random values from the set with replacement
Cylinders <- sample(values, 90, replace = TRUE)
# Shorten the var name to fit table in one page width
Cylndrs <- Cylinders
#print(Cylinders) #If need be, then to uncomment this code to print
The code below generates the weight values for dataset 2.
# Set the parameters:
lower_Weight <- 1613 # Lower endpoint for Weight
upper_Weight <- 4732 # Upper endpoint for Weight
n <- 90 # Set the sample size/rows
SEED_Weight <- 112 # Specify a seed
set.seed(SEED_Weight) # Set the seed
Weight <- runif(n, lower_Weight, upper_Weight) # Generate the synthetic data
Weight <- as.integer(Weight) # Convert to numeric integer
#Weight #If need be, then to uncomment this code to print
The code chunk below samples 90 values for the origin of the cars. The region of origin of a car is either US, Japan, or Europe. The values are converted to a factor before being stored in the Origin variable.
set.seed(822) # Set seed for reproducibility
values <- c("US", "Japan", "Europe") # Values to sample from
# Generate 90 random values from the set with replacement
Origin <- sample(values, 90, replace = TRUE)
Origin <- as.factor(Origin) # Convert to factor and store in Origin
#print(Origin) #If need be, then to uncomment this code to print
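The data frame for dataset 2 below also uses the columns Displmnt (displacement) and HrsPwr (horsepower), whose generation code is not shown above. A sketch following the same sampling pattern as the other numeric columns is given here; the endpoints and seeds are assumptions, so it will not reproduce the exact values that appear in the outputs further below.
# Set the parameters (the endpoints and seeds below are assumed values):
lower_Displmnt <- 87    # Assumed lower endpoint for Displacement
upper_Displmnt <- 455   # Assumed upper endpoint for Displacement
lower_HrsPwr <- 110     # Assumed lower endpoint for Horsepower
upper_HrsPwr <- 150     # Assumed upper endpoint for Horsepower
n <- 90                 # Set the sample size/rows
set.seed(348)           # Assumed seed for reproducibility
Displmnt <- as.integer(runif(n, lower_Displmnt, upper_Displmnt)) # Displacement
set.seed(349)           # Assumed seed for reproducibility
HrsPwr <- as.integer(runif(n, lower_HrsPwr, upper_HrsPwr))       # Horsepower
#print(Displmnt) #If need be, then to uncomment this code to print
#print(HrsPwr) #If need be, then to uncomment this code to print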
Now the code below combines the variables generated above into dataset 1. Dataset 1 is then previewed using the head function and also written to an external CSV file named dataset1.csv.
# dataset 1
dataset1 <- data.frame(Order_Number,
Qty,
Price,
Order_dates,
Sold_status,
Prdct_Code)
write_csv(dataset1, "dataset1.csv") # write the dataset1 to an external csv file
head(dataset1) # print some rows of the dataset for preview purposes
The code chunk below introduces three (3) missing values (NA) into the Price variable of dataset 1.
rows_to_insert_missingvalues <- c(1, 2, 3)
dataset1$Price[rows_to_insert_missingvalues] <- NA
head(dataset1, n = 10)
The output above shows that rows 1, 2, and 3 now have missing values (NA).
The code chunk below introduces outliers into rows 6 and 10 of the Qty and Price variables of dataset 1. The outliers for Qty in rows 6 and 10 are 40 and 100, respectively, and the outliers for Price in rows 6 and 10 are 1,000,000 and 200, respectively. Previewing the first 10 rows of dataset1 shows the inserted outliers in the data frame.
dataset1$Qty[6] <- 40
dataset1$Price[6] <- 1000000
dataset1$Qty[10] <- 100
dataset1$Price[10] <- 200
head(dataset1, n = 10)
# Convert into a data frame and store in the variable dataset2
dataset2 <- data.frame(Prdct_Code,
Car_Model,
MPG,
Cylndrs,
Displmnt,
HrsPwr,
Weight,
Origin)
write_csv(dataset2, "dataset2.csv")#write generated dataset to external csv file
head(dataset2) # preview the dataset2
The code chunk below inserts missing values (NA) into the cylinder column (Cylndrs) of dataset2, in rows 2, 4, and 6. Finally, a preview is printed showing the missing values in that column.
rows_to_insert_3missingvalues <- c(2, 4, 6)
dataset2$Cylndrs[rows_to_insert_3missingvalues] <- NA
head(dataset2, n = 10)
The code chunk below inserts outliers into the following columns of dataset2: MPG, Cylndrs, Displmnt, HrsPwr, and Weight, at rows 7, 10, 1, 10, and 10, respectively. The code chunk also inserts a second set of outliers into other rows (14, 20, 30, 20, and 20, respectively) of the same columns. Finally, a preview of dataset2 shows the inserted outliers.
dataset2$MPG[7] <- 50
dataset2$Cylndrs[10] <- 12
dataset2$Displmnt[1] <- 800
dataset2$HrsPwr[10] <- 800
dataset2$Weight[10] <- 7000
dataset2$MPG[14] <- 60
dataset2$Cylndrs[20] <- 14
dataset2$Displmnt[30] <- 900
dataset2$HrsPwr[20] <- 900
dataset2$Weight[20] <- 8000
head(dataset2, n = 20)
The following code chunks in this MERGE section show various types of merge being applied. The full_merge_dataset is created by merging dataset1 and dataset2 on the common product code column (Prdct_Code). The "all = TRUE" parameter indicates that a full outer join is performed, meaning that all rows from both datasets are combined. The merged result is then previewed with 20 rows.
full_merge_dataset <- merge(dataset1, dataset2, by = "Prdct_Code", all = TRUE)
# write and store the full_merge_dataset to an external csv file
write_csv(full_merge_dataset, "full_merge_dataset.csv")
head(full_merge_dataset, n = 20) # preview
The inner_merge code below performs an inner join, which means the result contains only the rows of dataset1 whose "Prdct_Code" value has a match in dataset2.
inner_merge <- merge(dataset1, dataset2, by = "Prdct_Code")
# write and store the inner_merge dataset to an external csv file
write_csv(inner_merge, "inner_merge.csv")
head(inner_merge, n = 20)
Left join and right join are the other two types of merge; they are not applied in the rest of this report, but a brief sketch is given below.
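For completeness (not part of the original workflow), here is how those joins could be written with the same merge() call; the results are only referenced in the commented attribute checks further down.
# Left join: keep every row of dataset1, matching dataset2 rows where possible
left_merge <- merge(dataset1, dataset2, by = "Prdct_Code", all.x = TRUE)
# Right join: keep every row of dataset2, matching dataset1 rows where possible
right_merge <- merge(dataset1, dataset2, by = "Prdct_Code", all.y = TRUE)
#head(left_merge) #If need be, then to uncomment this code to preview
#head(right_merge) #If need be, then to uncomment this code to preview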
The data type of the variable dataset1 is data frame (data.frame), as shown in the output below.
# Checking the data type of the dataset1 variable
class(dataset1)
## [1] "data.frame"
The structure of data frame 1, shown below, lists its columns with their names, data types, and some sample values.
# Explore Data Structures of dataframe1
str(dataset1)
## 'data.frame': 90 obs. of 6 variables:
## $ Order_Number: chr "1200" "1201" "1202" "1203" ...
## $ Qty : num 1 2 2 3 2 40 2 4 2 100 ...
## $ Price : num NA NA NA 48309 62853 ...
## $ Order_dates : Date, format: "2021-07-15" "2021-02-18" ...
## $ Sold_status : logi FALSE TRUE FALSE TRUE FALSE TRUE ...
## $ Prdct_Code : chr "PC001" "PC002" "PC003" "PC004" ...
The structure of data frame 2, shown below, lists its columns with their names, data types, and some sample values.
# Explore Data Structures of dataframe2
str(dataset2)
## 'data.frame': 90 obs. of 8 variables:
## $ Prdct_Code: chr "PC001" "PC002" "PC003" "PC004" ...
## $ Car_Model : chr "Toyota White Caldina" "Toyota Carina SI MyRoad" "Toyota Caldina 2000" "Mazda Tribute" ...
## $ MPG : num 17 12 11 23 10 23 50 21 26 13 ...
## $ Cylndrs : num 3 NA 4 NA 4 NA 3 8 3 12 ...
## $ Displmnt : num 800 268 341 112 89 254 388 118 404 161 ...
## $ HrsPwr : num 113 136 110 122 117 117 126 112 143 800 ...
## $ Weight : num 2787 4488 4706 4660 2350 ...
## $ Origin : Factor w/ 3 levels "Europe","Japan",..: 2 3 3 1 1 1 2 3 3 3 ...
The code chunk below checks the attributes of the merged data. The attributes of the full merge of the two datasets are shown in the output below. The column names are given as "Prdct_Code", "Order_Number", and so on. The row names (row numbers) are listed under row.names. Finally, the class of the object is shown as data.frame.
# This chunk checks the attributes in the data
#Checking attributes:
attributes(full_merge_dataset)
## $names
## [1] "Prdct_Code" "Order_Number" "Qty" "Price" "Order_dates"
## [6] "Sold_status" "Car_Model" "MPG" "Cylndrs" "Displmnt"
## [11] "HrsPwr" "Weight" "Origin"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342
## [343] 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## [361] 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
## [379] 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396
## [397] 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
## [415] 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## [433] 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450
## [451] 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468
## [469] 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486
## [487] 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504
## [505] 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522
## [523] 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540
##
## $class
## [1] "data.frame"
# Below are the attributes of the other objects, which can be uncommented to check
#attributes(inner_merge)
#attributes(left_merge)
# attributes(right_merge)
# attributes(dataset1)
# attributes(dataset2)
# attributes(Price)
# attributes(Qty)
# attributes(Weight)
# attributes(Origin)
# attributes(Prdct_Code)
The following conversions have already been applied above where appropriate, after the sampling step for each column of the two synthetic datasets. They are collected here in one place for readability, to show how the conversions were implemented. They do not need to be re-executed.
# Proper data type conversions, shown here but already applied above,
# so they do not need to be executed again.
#Order_Number <- as.character(Order_Number)
#Qty <- as.integer(Order_Quantity) # Quantity Convert to numeric integer
#Price <- round(as.double(Order_Price),2) # Price Convert to double, then
#round to 2 decimal places
#Order_dates <- sample(seq(as.Date("2021-01-01"), # Conversion of
# as.Date("2022-12-31"), by="days"), 90) # date
#Sold_status <- as.logical(sold == "Yes") # Convert to logical datatype
#MPG <- as.integer(MPG) # Miles per gallon (MPG) converted
#to numeric integer
#Weight <- as.integer(Weight) # Convert Weight to integer
The following code converts the Origin variable of dataset 2 into a factor ordered by the distance from each car's origin to where it is currently being sold. The output shows the resulting values and the factor levels.
origin <- dataset2$Origin
# specify labels and order
the_levels <- c("Japan", "Europe", "US")
the_labels <- c("Far", "Farther", "Farthest")
# factor the origin variable with labels and order
factor_Origin_Distance <- factor(origin, levels = the_levels, labels = the_labels)
# Printing factor variable
print(factor_Origin_Distance)
## [1] Far Farthest Farthest Farther Farther Farther Far Farthest
## [9] Farthest Farthest Farther Farther Farthest Far Far Far
## [17] Far Farthest Farthest Farthest Farther Farthest Farthest Far
## [25] Farthest Farther Farther Farther Far Farther Farthest Farther
## [33] Far Farther Far Far Farthest Farther Far Far
## [41] Farthest Farther Far Farthest Far Farthest Farthest Farther
## [49] Farther Far Farther Far Farther Farther Farther Farthest
## [57] Far Farther Farthest Farthest Far Farthest Far Far
## [65] Far Farthest Farther Far Farther Farther Farthest Far
## [73] Farthest Far Far Farthest Farthest Farther Farther Farther
## [81] Far Far Far Far Farthest Farther Far Far
## [89] Farther Farthest
## Levels: Far Farther Farthest
The following code chunk uses the mutate function to create a new column called Sales. The Sales value is obtained by multiplying the quantity (Qty) by the Price. The output shows the extra Sales column.
# Create/mutate a new variable from existing variables and
# print the resulting data frame.
# Mutate to create a new column named Sales, which is Quantity multiplied by Price
dataset1 <- dataset1 %>%
  mutate(Sales = Qty * Price)
head(dataset1)
The code below uses complete.cases() to check whether each row of dataset 1 is complete. A row that returns FALSE has one or more missing values; TRUE means no missing values were found. The output below shows that rows 1, 2, and 3 have missing values.
# Looking for complete cases
rows_that_are_complete <- complete.cases(dataset1)
print(rows_that_are_complete)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE TRUE
# Totaling the number of missing values in dataset 1
number_of_missing_value <- sum(is.na(dataset1))
print(paste("The nunmber of missing values in the dataset 1 is:",
number_of_missing_value))
## [1] "The nunmber of missing values in the dataset 1 is: 6"
# Totaling the number of missing values in dataset 2
number_of_missing_value <- sum(is.na(dataset2))
print(paste("The nunmber of missing values in the dataset 2 is:",
number_of_missing_value))
## [1] "The nunmber of missing values in the dataset 2 is: 3"
# Totaling the number of missing values in the full merged dataset
number_of_missing_value_in_full <- sum(is.na(full_merge_dataset))
print(paste("The number of missing values in the full merged dataset is:",
            number_of_missing_value_in_full))
## [1] "The number of missing values in the full merged dataset is: 45"
Similarly, the sum function above can be used to count the total number of missing values in the other merged datasets. The code below looks for the columns and rows of dataset 1 that contain missing values. If the output shows TRUE, the row has a missing value in the column being searched; if it shows FALSE, the row does not.
The search for missing values below shows that the Price and Sales columns of dataset 1 have missing values in rows 1, 2, and 3.
# Check for missing values in a row in the searched column of dataset 1
# the columns price and sales below have missing values on rows 1,2,3.
# Since each executable line uses the same variable name, for ease of printing,
# each line should be run one at a time.
missingValues <- is.na(dataset1$Price)
#missingValues <- is.na(dataset1$Sales)
# After testing, the following columns do not have missing values when
# uncommented one at a time
#missingValues <- is.na(dataset1$Order_Number)
#missingValues <- is.na(dataset1$Qty)
#missingValues <- is.na(dataset1$Order_dates)
#missingValues <- is.na(dataset1$Sold_status)
#missingValues <- is.na(dataset1$Prdct_Code)
# Print TRUE/FALSE. If TRUE, then row has missing value. FALSE, no missing value.
print(missingValues)
## [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE
The code above can also be executed for dataset 2 in a similar fashion, as sketched below. The section after that scans for outliers.
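A minimal sketch of such a check for dataset 2 (not part of the original code) could be:
# Count the missing values per column of dataset 2;
# only Cylndrs should report missing values (rows 2, 4 and 6 were set to NA)
colSums(is.na(dataset2))
# Or, row by row for a single column, as was done for dataset 1:
#missingValues2 <- is.na(dataset2$Cylndrs)
#print(missingValues2)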
The histogram below presents the visual concentration of the prices. It also reveals the outlier, which is the bar farthest from the rest of the graph.
# histogram of the Prices of the vehicles
hist(dataset1$Price, main = "Histogram of Price")
Imagine the range of price values being very wide: how could the outlying values then be identified precisely? Below, the IQR method (Tukey's fences) is implemented to identify and pull out the outlier from the Price column of dataset 1.
# computing lower and upper fence for possible outliers
# also removing missing values during calculations
quantile1 <- quantile(dataset1$Price, 0.25, na.rm = TRUE)
quantile3 <- quantile(dataset1$Price, 0.75, na.rm = TRUE)
inter_quantile_range <- quantile3 - quantile1
low_fence <- quantile1 - 1.5 * inter_quantile_range
up_fence <- quantile3 + 1.5 * inter_quantile_range
# detecting any outliers below the lower fence and beyond the upper fence
outliers <- dataset1$Price[dataset1$Price < low_fence | dataset1$Price > up_fence]
# Concatenating and printing the formatted text and the outlier values
cat("The outliers are:", as.integer(outliers), "\n")
## The outliers are: NA NA NA 1000000
Dismissing the NA values in the output above, the outlier among the vehicle prices in dataset 1 is one million (1,000,000).
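The NA entries appear because comparing an NA price against the fences returns NA. A small alternative sketch (not the original code) drops them directly by indexing with which():
# which() keeps only the TRUE comparisons, so NA prices are dropped
price_outliers <- dataset1$Price[which(dataset1$Price < low_fence |
                                         dataset1$Price > up_fence)]
cat("The outliers are:", as.integer(price_outliers), "\n")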
In addition, to count how many times each of the outliers above occurs, the following cat() function prints the corresponding frequency for the valid outlier in the dataset 1 Price column, which is one (1).
# Tallying how many times each outlier value occurs,
# using the table() function.
num_of_outlier_occurancies <- table(dataset1$Price)[as.character(outliers)]
cat("The corresponding frequency for each of the above outlier is:",
num_of_outlier_occurancies, "\n")
## The corresponding frequency for each of the above outlier is: NA NA NA 1
The frequency of one (1) above indicates that the price outlier of one million (1,000,000) occurs once.
The histogram below is used to check for outliers in the Sales values. The Sales outlier is 4e+07, which written out in full is 40,000,000.
# Histogram of the Sales of the vehicles
hist(dataset1$Sales, main = "Histogram of Sales")
Below are the transformation steps. First, the variables that feed into the transformation, Qty and Price, are pre-processed. The variable to be transformed is Sales, which is derived from Qty and Price, so Qty and Price must first be free of missing values and outliers. This is done by substituting the median for the missing values and outliers in the quantity and the price.
dataset1$Price
## [1] NA NA NA 48308.77 62852.63 1000000.00
## [7] 44102.38 73212.57 46976.22 200.00 35617.65 42803.48
## [13] 61115.03 80788.63 49291.00 73967.92 59886.19 75831.54
## [19] 38058.87 52390.19 38255.97 93535.89 68576.78 101834.88
## [25] 86647.12 103351.09 76506.94 65215.50 32862.37 104767.51
## [31] 85701.12 85833.69 61520.09 66309.10 32239.33 56476.37
## [37] 52316.62 69187.61 111542.67 89660.38 108836.07 113750.39
## [43] 72742.17 34705.22 75740.68 40591.13 68738.70 47929.37
## [49] 48961.15 86182.62 82498.70 44210.84 56307.54 116910.16
## [55] 30241.15 43575.14 42208.20 109638.65 114342.48 93378.66
## [61] 96103.88 79117.56 39197.31 95961.15 113242.94 57382.38
## [67] 114170.65 84182.49 48738.85 118048.10 84070.19 58232.92
## [73] 49432.55 118796.26 46387.53 66210.76 43456.75 89756.09
## [79] 102344.58 99499.15 38685.92 80991.45 111833.43 66468.47
## [85] 54797.73 75044.17 32334.74 68334.81 112989.76 31181.38
# Select the Qty rows in which the median will replace the outliers
rows <- c(6, 10)
# Compute the median of Qty
median_qty <- median(dataset1$Qty, na.rm = TRUE)
# Go through the rows and insert the median value
for (row in rows) {
  dataset1[row, "Qty"] <- median_qty
}
# Select the Price rows in which the median will replace the missing values and outliers
rows <- c(1, 2, 3, 6, 10)
# computing the median of the Price
median_price <- median(dataset1$Price, na.rm = TRUE)
# go through the rows and insert the value of the median
for (row in rows) {
dataset1[row, "Price"] <- median_price
}
Once the quantity and the price have been successfully imputed above, the Sales variable can be recalculated and inserted into dataset 1, which the code below does using the mutate function.
dataset1 <- dataset1 %>%
mutate(Sales = Qty * Price)
Check the Sales data to confirm that the missing values and the outliers have been removed. There should now be no outliers in rows 1, 2, 3, 6, and 10. As can be seen in the output below, these rows now hold the updated values 68576.78, 137153.56, 137153.56, 205730.34, and 205730.34, respectively.
print(dataset1$Sales)
## [1] 68576.78 137153.56 137153.56 144926.31 125705.26 205730.34 88204.76
## [8] 292850.28 93952.44 205730.34 71235.30 171213.92 61115.03 242365.89
## [15] 197164.00 73967.92 239544.76 75831.54 152235.48 209560.76 153023.88
## [22] 374143.56 68576.78 101834.88 86647.12 413404.36 229520.82 65215.50
## [29] 131449.48 314302.53 85701.12 171667.38 184560.27 265236.40 128957.32
## [36] 169429.11 52316.62 276750.44 334628.01 179320.76 326508.21 455001.56
## [43] 290968.68 34705.22 75740.68 121773.39 68738.70 95858.74 146883.45
## [50] 258547.86 329994.80 88421.68 112615.08 350730.48 30241.15 174300.56
## [57] 84416.40 109638.65 228684.96 93378.66 96103.88 79117.56 156789.24
## [64] 191922.30 113242.94 57382.38 114170.65 84182.49 97477.70 354144.30
## [71] 168140.38 174698.76 197730.20 475185.04 139162.59 66210.76 130370.25
## [78] 179512.18 409378.32 298497.45 116057.76 323965.80 447333.72 66468.47
## [85] 54797.73 225132.51 97004.22 205004.43 225979.52 31181.38
A transformation is now applied to the Sales variable. Below is the histogram of the original Sales data, which is skewed to the right.
hist(dataset1$Sales,
     main = "Histogram of Sales",
     xlab = "Sales")
For the Sales data, a square root transformation is applied because the histogram above is skewed to the right.
sqrt_sales <- sqrt(dataset1$Sales) # applying square root transformation
The histogram appears as follows:
hist(sqrt_sales, breaks = 7,
main = "Histogram of the Square Root of Sales",
xlab = "Square Root of Sales")
The square root transformation has reshaped the original histogram so that it is now approximately normally distributed, closer to a bell shape. This shows that most of the transformed sales values lie closer to the mean and the median than in the original skewed histogram.
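As an optional numeric complement to this visual check (an addition, not part of the original analysis), a Shapiro-Wilk test from base R could be applied to the transformed sales:
# A large p-value (well above 0.05) would be consistent with the
# approximate normality suggested by the histogram above
shapiro.test(sqrt_sales)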
The following two summaries display the details of the two datasets. The variables have various attributes such as length, class, mode, minimum, maximum, median, mean, 1st and 3rd quartiles, and the number of FALSE and TRUE values.
summary(dataset1)
## Order_Number Qty Price Order_dates
## Length:90 Min. :1.000 Min. : 30241 Min. :2021-01-24
## Class :character 1st Qu.:1.000 1st Qu.: 48794 1st Qu.:2021-06-14
## Mode :character Median :3.000 Median : 68577 Median :2021-12-19
## Mean :2.478 Mean : 70877 Mean :2021-12-23
## 3rd Qu.:3.750 3rd Qu.: 88907 3rd Qu.:2022-06-27
## Max. :4.000 Max. :118796 Max. :2022-12-29
## Sold_status Prdct_Code Sales
## Mode :logical Length:90 Min. : 30241
## FALSE:42 Class :character 1st Qu.: 88259
## TRUE :48 Mode :character Median :145905
## Mean :173716
## 3rd Qu.:228009
## Max. :475185
summary(dataset2)
## Prdct_Code Car_Model MPG Cylndrs
## Length:90 Length:90 Min. : 9.00 Min. : 3.000
## Class :character Class :character 1st Qu.:14.25 1st Qu.: 4.000
## Mode :character Mode :character Median :20.50 Median : 4.000
## Mean :20.12 Mean : 5.218
## 3rd Qu.:24.00 3rd Qu.: 6.000
## Max. :60.00 Max. :14.000
## NA's :3
## Displmnt HrsPwr Weight Origin
## Min. : 87.0 Min. :110.0 Min. :1699 Europe:29
## 1st Qu.:155.0 1st Qu.:125.0 1st Qu.:2565 Japan :32
## Median :260.0 Median :135.0 Median :3361 US :29
## Mean :270.2 Mean :150.8 Mean :3410
## 3rd Qu.:364.8 3rd Qu.:144.8 3rd Qu.:4253
## Max. :900.0 Max. :900.0 Max. :8000
##
The summaries above of the two datasets conclude assignment 2.
References: https://www.kaggle.com/datasets/ankanhore545/classic-cars
Thank you.