INTRODUCTION:

This assignment is a coding one using the R programming language. The codes create synthetic datasets to work with for a car sales company. The ouputs from the codes are columns, values, datasets, graphs and other results. These results, workings, and the codes are explained where appropriate in plain texts.

Contents:

1. Setup
1. Data descriptions
1. Generating Synthetic columns and datasets
1. Merge the data
1. Understanding the data
1. Manipulating
1. Scan I
1. Scan II
1. Transform
1. Summary

0. SETUP

knitr::opts_chunk$set(echo = TRUE)

The code chunk below is used for loading packages required to produce the report

#install.packages("MASS") #if not yet installed, install
library(readr)  # for reading files
suppressPackageStartupMessages(library(dplyr)) #to not show attached messages
suppressPackageStartupMessages(library(MASS))  #to not show attached messages
library(magrittr)

1. DATA DESCRIPTIONS

The following list describes each variable that is generated for the dataset 1 and 2:

Dataset 1: * Order_Number: This variable holds the unique orders for the dataset 1. * Qty: This is the variable that holds the quantity of an item or vehicle that is being ordered. * Price: This variable stores the price of a vehicle that is being ordered. * Order_dates: Variable that stores the date of the order of a car/vehicle. * Sold_status: This is a logical variable with a TRUE for car sold, FALSE for car not yet sold out. * Prdct_Code: This variable stores the unique code of the car model. It is shortened form of “Product_Code”, to fit report page width.

Dataset 2: * Prdct_Code: This variable is the common variable between the two datasets, 1 and 2. * MPG: This variable stores the miles per gallon data. * Cylndrs: This variable stores the number of Cylinders that each car or vehicle has in its engine. * Displmnt: This variable stores the displacement volume of air released as a result of works from the engine. * HrsPwr: The variable stores the amount of power measured in horsepower of the engine of a car vehicle. * Weight: This variable stores the measure of the weight in kilograms of the car. * Origin: This variable holds the name of the origin of the car.

2. GENERATING SYNTHETIC COLUMNS and DATASETS

Number of Rows and Columns: Two synthetic datasets are created as will be seen under the merged section below. They are datasets 1 and 2. Each dataset has between 50 and 100 rows (90 rows to be exact), and each has 5-10 variables (dataset 1 has 6, and dataset 2 has 8 variables).

Common Variable: As can be seen in the two synthetic datasets 1 and 2 below, the common variable, is the product code, which is shortened as Prdct_Code, to fit the page width display with all the other columns in a single lie.

Multiple Data Types: Each synthetic datasets include multiple data types. Dataset 1 has numerical types of integer and double. And it also have the other types including date, logical, and character. Dataset 2 has the data types of character, integer, double, and factor.

Generating list of columns/data: The following code chunks will generate each of the columns of the datasets..

The headings above the code chunks and comments within the code chunks explain in more detail the data generations.

Order_Numbers: Generating a list of Order_Numbers

This code chunk below generates the order Number of the dataset 1. The order number is a vector that starts from 1200 to 1289, to make 90 rows.

Order_Number <- as.character(c(1200:1289))

#Order_Number # If need be, then to uncomment this line to print

Order Quantity: Generate a list of Order Quantity

The code chunk below generates the order quantity, which is the number of cars that a customer orders. The code sets the lower and upper endpoints, which is the range of sample; the seed, which will enable reproducibility, and then it generate 90 samples of values for the quantity. It then converts each values to a numerical one (integer) before it stores them in the variable Qty. The Qty is commented but it can be uncommented if needed to print the head or the full number of quantity values.

# Set the parameters: 
lower <- 1        # A lower endpoint for order_quantity 
upper <- 5        # Choose an upper endpoint for order_quantity
n <- 90           # Set the sample size/rows
SEED <- 345       # Specify a seed

set.seed(SEED)    # Set the seed
Order_Quantity <- runif(n, lower, upper)  # Generate the synthetic data
Qty <- as.integer(Order_Quantity)         # Convert to numeric integer
# Qty   #If need be, then to uncomment this code to print

Car Names: Generating a List of Car Names

The codes below generates a sample of 90 cars.

# list of car model names to be rndomly mized and generate to 90 rows
car_models <- c("Toyota Corolla", "Toyota Caldina 7A","Toyota Caldina 2000", 
                "Toyota Caldina ZT", "Mazda Speed Axela",
                "Subaru Levorg", "Toyota Corolla Van", "Toyoto Altis E",
                "Toyota Altizza RS200", "Mazda Tribute", "Mitsubshi Airtrek",
                "Toyota Crown","Nissan Xtrail", "Toyota Carina SI MyRoad", 
                "Toyota White Caldina","Toyota Caldina 1AZ", "Black Caldina", 
                "Mazda Tribute FB-X 4WD")

# Generate a sample of 90 cars
set.seed(123)  # For reproducibility
sample_size <- 90
car_sample <- sample(car_models, sample_size, replace = TRUE)
#print(car_sample) #If need be, then to uncomment this code to print

Product Codes: Generate a list of unique product codes for the sample cars.

The section of code chunks below generates the list of product codes for the above sample of cars. The product codes are unique to each unique sample of car. That is, ninety (90) sample of cars are generated while making sure each row must not be empty. And then a dataframe with a product code is assigned to each car model. Finally the two columns of car model and product code of the dataframe are store seperatly in their own variables of Car_Model and Product_Code, respectively, to be used during the merging process later on.

# 90 cars generated, and (ensuring non-empty car values)
set.seed(123)  # For reproducibility
sample_size <- 90
non_empty_car_models <- car_models[car_models != ""]  # Remove empty values
car_sample <- sample(non_empty_car_models, sample_size, replace = TRUE)

# A 5-character product code is generared (including the character "PC") for 
# each unique model of car
unique_car_models <- unique(car_sample)
product_codes <- sprintf("PC%03d", seq_along(unique_car_models))

# Create a dataframe with product codes assigned to each car model
car_df <- data.frame(Car_Model = car_sample,
                     Product_Code = product_codes[match(car_sample,
                                                        unique_car_models)])
# Seperating car models and product codes into their own variables
# from the data frame, car_df
Car_Model <- car_df$Car_Model
Product_Code <- car_df$Product_Code

# This is a shortened version of the long form "Product_Code"
# to enable all columns to show A4-width-wise for the dataset 2.
Prdct_Code <- Product_Code 

#print(car_Model)    # If need be, then to uncomment this code to print
#print(Prdct_Code) # If need be, then to uncomment this code to print

Price: Generate a price list of the vehicles

Here, below the column name price is generated. The range of prices of the cars starts with the minimum price of thirty thousands (30 000) to the maximum price of one hundred and twenty thousand (120 000). The 90 samples to be generated has the seed of 445 to be used for reproducing the same data values in the future. Finally the price is converted to a numerical double value and rounded to two (2) decimal places before storing the result in the variable ‘Price’. Like the generated data above, the price data below is not printed yet at this time, especially just to keep to the page limit of this report. Thus the variable is commented below but they will all be displayed when printing the whole dataset 1 and 2, later, during the merging stage.

# Set the parameters: 

lowerPrice <- 30000  # A lower endpoint for price
upperPrice <- 120000 # An upper endpoint for price
n <- 90              # sample size/rows
SEED_Price <- 445    # Specify a seed

set.seed(SEED_Price) # Set the seed 
Order_Price <- runif(n, lowerPrice, upperPrice) # Generate the synthetic data 

# Convert to double, then put to 2 decimal places
Price <- round(as.double(Order_Price),2)

#print(Price)               #If need be, then to uncomment this code to print

Order Dates: Generate Order Dates

The code chunk below shows the workings for the randomly generated dates. These 90 date values are randomly picked starting from the 1st of January in the year 2021 and up to the 31st of December 2022. Like the above variables, the order dates variable is commented so as not to be printed at this time, but will be so during the combination of all the columns in their datasets.

# Generate a sample of 90 random dates between 2010 and 2022
Order_dates <- sample(seq(as.Date("2021-01-01"),
                          as.Date("2022-12-31"), by="days"), 90)

# Print the sample of random dates
#print(Order_dates)           #If need be, then to uncomment this code to print

Sold_Status: Generate Sold_Status

The code chunks below samples out 90 rows of the logical sold status. The sold status values takes on the logical binary form of YES or NO.

# Sample of 90 random "Yes" and "No" values
sold <- sample(c("Yes", "No"), 90, replace = TRUE)

Sold_status <- as.logical(sold == "Yes") # Convert to logical datatype

# Printing line
# print(Sold_status)      #If need be, then to uncomment this code to print

The above generated columns are for those of the dataset 1 or dataframe 1. Now the next set of variables to be generated will be for the dataset 2.

Generate Dataset 2

Miles Per Gallon (MPG): Generate a list of Miles Per Gallon data

The code chunks below generated the list of MPG, which stands for the interger values for miles per gallon for the car on selling.Like above the sampling process below takes the form of specifying the minimum, maximum, seed, and the number of samples of 90 values. The MPG is then converted to integer and stored in the MPG variable for later merging with other variables for the dataset 2.

# Set the parameters: 
lower <- 9  # Lower endpoint for Miles Per Gallon (MPG)
upper <- 30 # Upper endpoint Miles Per Gallon (MPG)
n <- 90     # Set the sample size/rows
SEED <- 347 # Specify a seed

# generate 
set.seed(SEED)                 # Set the seed
MPG <- runif(n, lower, upper)  # Generate the synthetic data
MPG <- as.integer(MPG)         # Convert to numeric integer
#MPG                      #If need be, then to uncomment this code to print

Cylinders: Generate Cylinders data list

The code below generates the number of random cylinder values. The sampling process is similar to the above mentioned variables as well.

set.seed(234)               # Set seed for reproducibility
values <- c(3, 4, 6, 8)     # Values to sample from

# Generate 90 random values from the set with replacement
Cylinders <- sample(values, 90, replace = TRUE)

# Shorten the var name to fit table in one page width
Cylndrs <- Cylinders

#print(Cylinders)          #If need be, then to uncomment this code to print

Weight: Generate a list of Weight

The code below comes up with the list of weight data values for the dataset 2.

# Set the parameters: 
lower_Weight <- 1613             # Lower endpoint for Weight
upper_Weight <- 4732             # Upper endpoint for Weight
n <- 90                          # Set the sample size/rows
SEED_Weight <- 112               # Specify a seed

set.seed(SEED_Weight)                            # Set the seed
Weight <- runif(n, lower_Weight, upper_Weight)   # Generate the synthetic data
Weight <- as.integer(Weight)                     # Convert to numeric integer
#Weight                   #If need be, then to uncomment this code to print

Origin: Generate a list of Origin data

The code chunk below samples a 90 rows of values of the origin of the cars. The values for the country of origion for the cars are either, US, Japan, or Europe. The values are then converted to factor before stored in the Origin variable.

set.seed(822)    # Set seed for reproducibility

values <- c("US", "Japan", "Europe")   # Values to sample from

# Generate 90 random values from the set with replacement
Origin <- sample(values, 90, replace = TRUE)

Origin <- as.factor(Origin) # Connvert to factor and then stored in the Origin

#print(Origin)      #If need be, then to uncomment this code to print

Correlated Random Data: Synthetic datasets include correlated random data

The code chunks below generated the correlared random data for the variables displacement (Displmnt) and horsepower (HrsPwr).

# correlation coefficient
correlation <- 0.8

# min and max values for Displacement
min_displacement <- 78
max_displacement <- 450

# within the specified range, generate Displacement values 
displacement <- runif(90, min_displacement, max_displacement)

# Correlated Horsepower values are generated based on Displacement values
scaled_displacement <- (displacement - min_displacement) / 
  (max_displacement - min_displacement)
mean_horsepower <- 132
sd_horsepower <- 23
error <- rnorm(90, mean = 0, sd = sd_horsepower * sqrt(1 - correlation^2))
horsepower <- mean_horsepower + correlation * scaled_displacement + error

# A dataframe is created to store the two variables
correlated_data <- data.frame(Displmnt = displacement, Hrspwr = horsepower)

correlated_data <- round((correlated_data),0) # put to zero decimal places

Displmnt <- correlated_data$Displmnt
HrsPwr <- correlated_data$Hrspwr

#print(correlated_data)     #If need be, then to uncomment this code to print

Dataset 1: Create Synthesized dataset 1

Now the code below combines the selected variable above for the dataset 1. Then the dataset 1 is previewd using the head function. The dataset1 is also written to an external CSV file named, dataset1.csv.

# dataset 1
dataset1 <- data.frame(Order_Number,
                      Qty,
                      Price,
                      Order_dates,
                      Sold_status,
                      Prdct_Code)

write_csv(dataset1, "dataset1.csv") # write the dataset1 to an external csv file

head(dataset1) # print some rows of the dataset for preview purposes

Missing Values: Introduce Synthetic Dataset 1 with 3 missing values.

The code chunk below introduced three (3) missing values into the dataset 1 for the price variable. The missing values takes the form of ‘NA’.

rows_to_insert_missingvalues <- c(1, 2, 3)
dataset1$Price[rows_to_insert_missingvalues] <- NA

head(dataset1, n = 10)

The output above shows rows 1, 2 and 3 now hav missing values (NA).

Outlier: Introduce numeric variable to at least one outlier.Here: 2 outliers

The code chunk below introduces outliers to the rows: 6 and 10 for the variables of Qty and Price of the dataset 1. The outliers for the rows 6 and 10 for the variable Qty are 40 and 1 000 000, respectively. And the outliers for the rows 6 and 10 for the variable Price are 100 and 200, respectively. Upon previewing the dataset1 with 10 rows only, it will show the inserted outliers onto the dataframe.

dataset1$Qty[6] <- 40
dataset1$Price[6] <- 1000000

dataset1$Qty[10] <- 100
dataset1$Price[10] <- 200
head(dataset1, n = 10)

Dataset 2: Create Synthesized dataset 2

# convert into dataframe and store in the variable dataseet2
dataset2 <- data.frame(Prdct_Code,
                        Car_Model,
                        MPG,
                        Cylndrs,
                        Displmnt,
                        HrsPwr,
                        Weight,
                        Origin)

write_csv(dataset2, "dataset2.csv")#write generated dataset to external csv file

head(dataset2) # preview the dataset2

Dataset2 Missing Values: Introduce Synthetic Dataset2 with 3 missing values.

The code chunk below inserted missing values into the column named Cylinder (Cylndrs) of the dataset2. The missing values are inserted into the rows 2, 4, 6. The missing values are NA. Finally the code prints a preview showing these missing values in the dataset for the column Cylinder.

rows_to_insert_3missingvalues <- c(2, 4, 6)
dataset2$Cylndrs[rows_to_insert_3missingvalues] <- NA

head(dataset2, n = 10)

outliers for dataset 2: Introduce numeric variable to at least one outlier.

Here: 2 outliers/var

The code chunk below inserted outliers for the following columns in dataset2: MPG, Displmnt, Hrspwr, and Weight. The rows in which the outliers are inserted into are: 7, 10, 1, 10, 10, respectively, for the columns. The below code chunk also shows a second set of outliers introduced to various other rows (14, 20, 30, 20, 20) for the above mentioned columns. Finally, a preview of the changes of outliers is shown on the output dataset2.

dataset2$MPG[7] <- 50
dataset2$Cylndrs[10] <- 12
dataset2$Displmnt[1] <- 800
dataset2$HrsPwr[10] <- 800
dataset2$Weight[10] <- 7000

dataset2$MPG[14] <- 60
dataset2$Cylndrs[20] <- 14
dataset2$Displmnt[30] <- 900
dataset2$HrsPwr[20] <- 900
dataset2$Weight[20] <- 8000

head(dataset2, n = 20)

3. MERGE

Full Outer Join

Now the following code chunks in thise MERGE section show various types of merge functions being applied. The full_merged_Datasets shows the merging of the dataframes of the dataset1 and dataset2 using the common column of the Product code (Prdct_Code). The “all = TRUE” parameter indicates that the full outer join is performed. This means that the two datasets will combine all rows from both datasets. The merge is then printed for preview by 10 rows.

full_merge_dataset <- merge(dataset1, dataset2, by = "Prdct_Code", all = TRUE)

# write and store the full_merge_dataset to an external csv file
write_csv(full_merge_dataset, "full_merge_dataset.csv") 

head(full_merge_dataset, n = 20) # preview

Inner Join

The “inner_merged_Dataset” section of code, below, is an inner join, which means that the result of the join will contain only the rows from the dataset1 where there is a match on the “Prdct_Code” column with the dataset 2.

inner_merge <- merge(dataset1, dataset2, by = "Prdct_Code")

# write and store the inner_merge dataset to an external csv file
write_csv(inner_merge, "inner_merge.csv") 

head(inner_merge, n = 20)

Left join, and right join merge are the other two types of join, but are not going to be shown here for now.

4. UNDERSTAND

Checking the data of the variables

The data type of the variale dataset1 is of data frame (data.frame) as shown on the output below.

# Checking the data type of the dataset1 variable
class(dataset1)

## [1] "data.frame"

Exploring the data structures of the dataframes or dataset 1 and 2.

The structure of dataframe 1 below is made up of its various columns, which includes their names, data types, and some sample values, as shown by the output.

# Explore Data Structures of dataframe1
str(dataset1)

## 'data.frame':    90 obs. of  6 variables:
##  $ Order_Number: chr  "1200" "1201" "1202" "1203" ...
##  $ Qty         : num  1 2 2 3 2 40 2 4 2 100 ...
##  $ Price       : num  NA NA NA 48309 62853 ...
##  $ Order_dates : Date, format: "2021-07-15" "2021-02-18" ...
##  $ Sold_status : logi  FALSE TRUE FALSE TRUE FALSE TRUE ...
##  $ Prdct_Code  : chr  "PC001" "PC002" "PC003" "PC004" ...

The structure of dataframe 2 below is made up of its various columns, which includes their names, data types, and some sample values, shown by the output.

# Explore Data Structures of dataframe2
str(dataset2)

## 'data.frame':    90 obs. of  8 variables:
##  $ Prdct_Code: chr  "PC001" "PC002" "PC003" "PC004" ...
##  $ Car_Model : chr  "Toyota White Caldina" "Toyota Carina SI MyRoad" "Toyota Caldina 2000" "Mazda Tribute" ...
##  $ MPG       : num  17 12 11 23 10 23 50 21 26 13 ...
##  $ Cylndrs   : num  3 NA 4 NA 4 NA 3 8 3 12 ...
##  $ Displmnt  : num  800 268 341 112 89 254 388 118 404 161 ...
##  $ HrsPwr    : num  113 136 110 122 117 117 126 112 143 800 ...
##  $ Weight    : num  2787 4488 4706 4660 2350 ...
##  $ Origin    : Factor w/ 3 levels "Europe","Japan",..: 2 3 3 1 1 1 2 3 3 3 ...

Checks the attributes of the data types

Below code chunks, checks the attributes of the data types. The attributes of the full merged of the two datasets are shown on the output below. The names of the columns are given as: “Prdct_Code”, “Order_Number”, etc. Also the names of the rows or the row numbers have been shown as row.names below. Finally the data type of the variable is shown as the dataframe (i.e data.frame)

# This chunk checks the attributes in the data 
#Checking attributes:
attributes(full_merge_dataset)

## $names
##  [1] "Prdct_Code"   "Order_Number" "Qty"          "Price"        "Order_dates" 
##  [6] "Sold_status"  "Car_Model"    "MPG"          "Cylndrs"      "Displmnt"    
## [11] "HrsPwr"       "Weight"       "Origin"      
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342
## [343] 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## [361] 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378
## [379] 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396
## [397] 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
## [415] 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432
## [433] 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450
## [451] 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468
## [469] 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486
## [487] 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504
## [505] 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522
## [523] 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540
## 
## $class
## [1] "data.frame"

# Below, is the attributes of other data types which can be uncommented to check
#attributes(inner_merge)
#attributes(inner_merge)
#attributes(left_merge)
# attributes(right_merge)
# attributes(dataset1)
# attributes(dataset2)
# attributes(Price)
# attributes(Qty)
# attributes(Weight)
# attributes(Origin)
# attributes(Prdct_Code)

Proper data type conversions displayed

The following conversions have already been applied above where appropriate. They have been applied after the sampling generation of each column for the synthesizing of the two datasets 1 and 2 above. Below shows the display of the conversions only of the data types in one place for ease of readability of how conversions above have been implemented.They do not need to be re-executed.

# Proper data type conversions shown but applied already
# So no need to be excuted again here.

#Order_Number <- as.character(Order_Number)
#Qty <- as.integer(Order_Quantity)        # Quantity Convert to numeric integer
#Price <- round(as.double(Order_Price),2) # Price Convert to double, then 
                                                #round to 2 decimal places

#Order_dates <- sample(seq(as.Date("2021-01-01"),               # Conversion of
#                          as.Date("2022-12-31"), by="days"), 90)     # date

#Sold_status <- as.logical(sold == "Yes") # Convert to logical datatype

#MPG <- as.integer(MPG)                   # Miles per gallon (MPG) converted
                                               #to numeric integer

#Weight <- as.integer(Weight)             # COnvert Weight to integer

Factor variable labelled and/or ordered.

The following code factors the Origin variable in dataset 2, to order it from the car’s origin’s distance to where it is currently selling.The output shows the results of the labels and the labels.

origin <- dataset2$Origin

# specify labels and order
the_levels <- c("Japan", "Europe", "US")
the_labels <- c("Far", "Farther", "Farthest")

# factor the origin variable with labels and order
factor_Origin_Distance <- factor(origin, levels = the_levels, labels = the_labels)

# Printing factor variable
print(factor_Origin_Distance)

##  [1] Far      Farthest Farthest Farther  Farther  Farther  Far      Farthest
##  [9] Farthest Farthest Farther  Farther  Farthest Far      Far      Far     
## [17] Far      Farthest Farthest Farthest Farther  Farthest Farthest Far     
## [25] Farthest Farther  Farther  Farther  Far      Farther  Farthest Farther 
## [33] Far      Farther  Far      Far      Farthest Farther  Far      Far     
## [41] Farthest Farther  Far      Farthest Far      Farthest Farthest Farther 
## [49] Farther  Far      Farther  Far      Farther  Farther  Farther  Farthest
## [57] Far      Farther  Farthest Farthest Far      Farthest Far      Far     
## [65] Far      Farthest Farther  Far      Farther  Farther  Farthest Far     
## [73] Farthest Far      Far      Farthest Farthest Farther  Farther  Farther 
## [81] Far      Far      Far      Far      Farthest Farther  Far      Far     
## [89] Farther  Farthest
## Levels: Far Farther Farthest

5. MANIPULATE DATA

Create/mutate at least one variable from the existing ones.

The following code chunk creates a new column by using the function mutate. The new column is called, Sales. The “Sales” data is obtained by multiplying the quantity (Qty) by the Price. The output shows the extra column, Sales.

# Create/mutate at least one variable from existing variables

# Print the resulting data frame
#mutate here

# Mutate to create a new column named,Sales,which is Quantity multiply by Price
dataset1 <- dataset1 %>%
  mutate(Sales = dataset1$Qty * dataset1$Price)

head(dataset1)

6. SCAN I

Complete Cases

The “scanCompleteRows” r code below scans for whether a row is completed or not. If a row returns ‘FALSE’, it means the row has missing value(s). If ‘TRUE’, it indicates non missing values is identified. After execution, the output below shows that rows 1, 2 and 3 has missing values.

# Looking for complete cases
rows_that_are_complete <- complete.cases(dataset1)
print(rows_that_are_complete)

##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Counting the number of missing values in the dataset 1

# Totaling the number of missing values in dataset 1
number_of_missing_value <- sum(is.na(dataset1))
print(paste("The nunmber of missing values in the dataset 1 is:", 
            number_of_missing_value))

## [1] "The nunmber of missing values in the dataset 1 is: 6"

Counting the number of missing values in the dataset 2

# Totaling the number of missing values in dataset 2
number_of_missing_value <- sum(is.na(dataset2))
print(paste("The nunmber of missing values in the dataset 2 is:", 
            number_of_missing_value))

## [1] "The nunmber of missing values in the dataset 2 is: 3"

Counting the number of missing values in the full merged dataset

# Totaling the number of missing values in dataset 2
number_of_missing_value_in_full <- sum(is.na(full_merge_dataset))
print(paste("The nunmber of missing values in the full merged dataset is:", 
            number_of_missing_value_in_full))

## [1] "The nunmber of missing values in the full merged dataset is: 45"

Similarly, the above sum function can be used to count the total number of missing values in other merged datasets above. The function below tries to find which columns and rows of the dataset 1 has the missing values. If the output shows TRUE, then the row has missing value as per the column searched on. If the output shows FALSE, then it means the row doe not have a missing value.

The search for missinng values below shows that the columns price and sales of the dataset 1 have missing values on the rows, 1, 2, and 3.

Identify the column and the row with missing values in the dataset 1

# Check for missing values in a row in the searched column of dataset 1
# the columns price and sales below have missing values on rows 1,2,3.
# Since each executable line has same variale name, for ease of printing,
# each line can only be run one-at-a-time.

missingValues <- is.na(dataset1$Price)
#missingValues <- is.na(dataset1$Sales)

# after testing, the following columns, do not have missing values once
                                        # uncommented one-at-a-time
#missingValues <- is.na(dataset1$Order_Number)
#missingValues <- is.na(dataset1$Qty)
#missingValues <- is.na(dataset1$Order_dates)
#missingValues <- is.na(dataset1$Sold_status)
#missingValues <- is.na(dataset1$Prdct_Code)

# Print TRUE/FALSE. If TRUE, then row has missing value. FALSE, no missing value.
print(missingValues)

##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE

Also the above code can also be executed for the dataset 2 in a similar fashion. The next section below attempts to scan for outliers.

7. SCAN II

Identifying Outlier using Histogram for the column, Price

The histogram below presents the visual concentration of the prices. It also identifies the outlier, which is the bar furtherest from the rest in the graph.

# histogram of the Prices of the vehicles
hist(dataset1$Price, main = "Histogram of Price")

Imagine if the range of data values for the prices is very wide, how can the numerical values be accurately identified? Here below, implementation of the IQR Method (Tukey’s Fence) is able to identify and pull out the outlier from the price column in the dataset 1.

# computing lower and upper fence for possible outliers
# also removing missing values during calculations
quantile1 <- quantile(dataset1$Price, 0.25, na.rm = TRUE)
quantile3 <- quantile(dataset1$Price, 0.75, na.rm = TRUE)
inter_quantile_range <- quantile3 - quantile1
low_fence <- quantile1 - 1.5 * inter_quantile_range
up_fence <- quantile3 + 1.5 * inter_quantile_range

# detecting any outliers below the lower fence and beyond the upper fence 
outliers <- dataset1$Price[dataset1$Price < low_fence | dataset1$Price > up_fence]

#Concatenating and printing the formated text and outlier values
cat("The outliers are:", as.integer(outliers), "\n")

## The outliers are: NA NA NA 1000000

Dismissing the NA values in the output above, the outlier of the prices of the vehicles in the dataset 1 is, one million (1 000 000).

In addition, to count the number of times each of the outlier above has occured, the following cat() function prints the corresponding value for the above valid outlier for the dataset 1 price column, which is one (1).

# tallying the number of times of occurance of the values of the outliers,
# where the following table() function is used.
num_of_outlier_occurancies <- table(dataset1$Price)[as.character(outliers)]

cat("The corresponding frequency for each of the above outlier is:", 
    num_of_outlier_occurancies, "\n")

## The corresponding frequency for each of the above outlier is: NA NA NA 1

The above outlier frequency of one (1) indicates that the outlier of the price of one million (1000000) has occured 1 time.

Identifying Outlier using Histogram for the column, Sales

The histogram below is used to checking foroutliers for the Sales data values. The sales outlier is 4e+07, which the longest form is 40 000 000.

# histogram of the Prices of the vehicles
hist(dataset1$Sales, main = "Histogram of Sales")

8. TRANSFORM

Below is the transformation processes. Firstly, pre-processing of the affected variables to be transformed is processed. These are price and quantity. The variable to be transformed is the Sales. Quantity and Price makes gives rise to the Sales variable. So quantity and price and made sure below not to have any missing values. This involves inserting the median values to substitute the missing values for the quantity and the price.

dataset1$Price

##  [1]         NA         NA         NA   48308.77   62852.63 1000000.00
##  [7]   44102.38   73212.57   46976.22     200.00   35617.65   42803.48
## [13]   61115.03   80788.63   49291.00   73967.92   59886.19   75831.54
## [19]   38058.87   52390.19   38255.97   93535.89   68576.78  101834.88
## [25]   86647.12  103351.09   76506.94   65215.50   32862.37  104767.51
## [31]   85701.12   85833.69   61520.09   66309.10   32239.33   56476.37
## [37]   52316.62   69187.61  111542.67   89660.38  108836.07  113750.39
## [43]   72742.17   34705.22   75740.68   40591.13   68738.70   47929.37
## [49]   48961.15   86182.62   82498.70   44210.84   56307.54  116910.16
## [55]   30241.15   43575.14   42208.20  109638.65  114342.48   93378.66
## [61]   96103.88   79117.56   39197.31   95961.15  113242.94   57382.38
## [67]  114170.65   84182.49   48738.85  118048.10   84070.19   58232.92
## [73]   49432.55  118796.26   46387.53   66210.76   43456.75   89756.09
## [79]  102344.58   99499.15   38685.92   80991.45  111833.43   66468.47
## [85]   54797.73   75044.17   32334.74   68334.81  112989.76   31181.38

# select rows to insert median to replace the outliers
rows <- c(6, 10)

# computing the median
median <- median(dataset1$Qty, na.rm = TRUE)

# go through the rows and insert the value of the median
for (row in rows) {
  dataset1[row, "Qty"] <- median
}

# select rows of the Price column to insert median to replace the outliers
rows <- c(1, 2, 3, 6, 10)

# computing the median of the Price
median_price <- median(dataset1$Price, na.rm = TRUE)

# go through the rows and insert the value of the median
for (row in rows) {
  dataset1[row, "Price"] <- median_price
}

Mutate

Once the quantity and the price has been successfully imputed above, the sales variable is able to be inserted into the dataset one and is worked out by the coode below using the mutate function.

dataset1 <- dataset1 %>%
  mutate(Sales = Qty * Price)

Check the Sales data, if the missing values and the outliers have been removed. There should now be no outliers on the rows 1, 2, 3, 6, and 10. Now these rows as can be seen in the output below have been filled with the updates values of have 68576.78, 137153.56, 137153.56, 205730.34, and 205730.34, respectively.

print(dataset1$Sales)

##  [1]  68576.78 137153.56 137153.56 144926.31 125705.26 205730.34  88204.76
##  [8] 292850.28  93952.44 205730.34  71235.30 171213.92  61115.03 242365.89
## [15] 197164.00  73967.92 239544.76  75831.54 152235.48 209560.76 153023.88
## [22] 374143.56  68576.78 101834.88  86647.12 413404.36 229520.82  65215.50
## [29] 131449.48 314302.53  85701.12 171667.38 184560.27 265236.40 128957.32
## [36] 169429.11  52316.62 276750.44 334628.01 179320.76 326508.21 455001.56
## [43] 290968.68  34705.22  75740.68 121773.39  68738.70  95858.74 146883.45
## [50] 258547.86 329994.80  88421.68 112615.08 350730.48  30241.15 174300.56
## [57]  84416.40 109638.65 228684.96  93378.66  96103.88  79117.56 156789.24
## [64] 191922.30 113242.94  57382.38 114170.65  84182.49  97477.70 354144.30
## [71] 168140.38 174698.76 197730.20 475185.04 139162.59  66210.76 130370.25
## [78] 179512.18 409378.32 298497.45 116057.76 323965.80 447333.72  66468.47
## [85]  54797.73 225132.51  97004.22 205004.43 225979.52  31181.38

Transformation is applied to the variable Sales. Below is the histogram of the origional skewed to the right histogram sales.

hist(dataset1$Sales, 
      main = "Histogram of Sales", 
      xlab = "Miles per gallon")

summary(dataset1)

##  Order_Number            Qty            Price         Order_dates        
##  Length:90          Min.   :1.000   Min.   : 30241   Min.   :2021-01-24  
##  Class :character   1st Qu.:1.000   1st Qu.: 48794   1st Qu.:2021-06-14  
##  Mode  :character   Median :3.000   Median : 68577   Median :2021-12-19  
##                     Mean   :2.478   Mean   : 70877   Mean   :2021-12-23  
##                     3rd Qu.:3.750   3rd Qu.: 88907   3rd Qu.:2022-06-27  
##                     Max.   :4.000   Max.   :118796   Max.   :2022-12-29  
##  Sold_status      Prdct_Code            Sales       
##  Mode :logical   Length:90          Min.   : 30241  
##  FALSE:42        Class :character   1st Qu.: 88259  
##  TRUE :48        Mode  :character   Median :145905  
##                                     Mean   :173716  
##                                     3rd Qu.:228009  
##                                     Max.   :475185

For the sales data, the square root transformation is applied due to the histogram above is skewed to the right.

sqrt_sales <- sqrt(dataset1$Sales) # applying square root transformation

The histogram appears as follows:

hist(sqrt_sales, breaks = 7,
     main = "Histogram of the Square Root of Sales", 
     xlab = "Square Root of Sales")

Above square root of sales has transformed the origional histogram and now it is approximately normally distributed, closer to bell-shaped histogram. This shows that most of the sales data values are closer to the mean and the median than that of the original skewed histogram.

9. SUMMARY STATISTICS

The following two summaries display the details for the two datasets. The variables have various attributes like lenght, class, mode, median, minimum, maximum, 1st and 3rd quartiles, quartile, and number of FALSE and TRUE.

summary(dataset1)

##  Order_Number            Qty            Price         Order_dates        
##  Length:90          Min.   :1.000   Min.   : 30241   Min.   :2021-01-24  
##  Class :character   1st Qu.:1.000   1st Qu.: 48794   1st Qu.:2021-06-14  
##  Mode  :character   Median :3.000   Median : 68577   Median :2021-12-19  
##                     Mean   :2.478   Mean   : 70877   Mean   :2021-12-23  
##                     3rd Qu.:3.750   3rd Qu.: 88907   3rd Qu.:2022-06-27  
##                     Max.   :4.000   Max.   :118796   Max.   :2022-12-29  
##  Sold_status      Prdct_Code            Sales       
##  Mode :logical   Length:90          Min.   : 30241  
##  FALSE:42        Class :character   1st Qu.: 88259  
##  TRUE :48        Mode  :character   Median :145905  
##                                     Mean   :173716  
##                                     3rd Qu.:228009  
##                                     Max.   :475185

summary(dataset2)

##   Prdct_Code         Car_Model              MPG           Cylndrs      
##  Length:90          Length:90          Min.   : 9.00   Min.   : 3.000  
##  Class :character   Class :character   1st Qu.:14.25   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Median :20.50   Median : 4.000  
##                                        Mean   :20.12   Mean   : 5.218  
##                                        3rd Qu.:24.00   3rd Qu.: 6.000  
##                                        Max.   :60.00   Max.   :14.000  
##                                                        NA's   :3       
##     Displmnt         HrsPwr          Weight        Origin  
##  Min.   : 87.0   Min.   :110.0   Min.   :1699   Europe:29  
##  1st Qu.:155.0   1st Qu.:125.0   1st Qu.:2565   Japan :32  
##  Median :260.0   Median :135.0   Median :3361   US    :29  
##  Mean   :270.2   Mean   :150.8   Mean   :3410              
##  3rd Qu.:364.8   3rd Qu.:144.8   3rd Qu.:4253              
##  Max.   :900.0   Max.   :900.0   Max.   :8000              
##

The final information in the summaries above of the two datasets have now completed this assignment 2.

Referrnces: https://www.kaggle.com/datasets/ankanhore545/classic-cars

Thank you.

Data Wrangling Assessment Task 2: Creating and pre-processing synthetic data

Creating Synthetic Datasets for Car Sales Company

Teti Dei