Assessment 2: Creating and pre-processing synthetic data

Introduction

We are trying to simulate a real case scenario, as if we were Data Scientist for a company, in this case Uber; by generating and pre-processing realistic synthetic data, and dealing with random and messy data, with missing values and outliers.

In addition, as data is often drawn from several sources, we are going to combine at least these synthetic data sets.

Setup

Inserting and loading the packages you need to produce the report here:

knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE)

# Install script, should installing packages be needed.
#install.packages("dplyr")
#install.packages("magrittr")
#install.packages("outliers")
#install.packages("tidyr")
#install.packages("Hmisc")
#install.packages("knitr")
#install.packages("MVN")
#install.packages("readr")
#install.packages("here")
#install.packages("openxlsx")

# Accessing the necessary libraries
library(dplyr)        # For data wrangling # For mutating a new column
library(magrittr)     # For pipes
library(outliers) 
library(tidyr)    # For reading and writing data 
library(Hmisc) # for imputing values
library(MVN)
library(knitr) # Useful for creating nice tables
library(readr) # Useful for importing and exporting data
library(here)   # For specifying file paths
library(openxlsx)

STEP 1: Generate the data.

In order to generate the required data, in three dataframes, we are going to use a number of methods.

As some of this methods will, without giving them a set SEED, create a random value each time they run; for us to be able to perform some studies, we will set a a SEED value for all of the generators.

# Set a new value for SEED 
SEED <- 303 

# Set the seed
set.seed(SEED)

For the driver dataframe, we are going to generate a number of variables, for 50 drivers, which will act as our observations:

driver_id : with a numeric value, this is a simple index generator from 1-50.
exp : driver experience in months, which will be generated randomly for the last 10 years, thus from 0 for new drivers to 120 months. In this case, we will be using the runif function.
driver_rating : randomly from 1 to 5 stars (doubles), as given by the member. In this case, we will be using the rnorm function, as the distribution will better resemble a skewed distribution around an average value.
car_type : a discrete selection of values. LOOK INTO MAKING THE DISTRIBUTION SKEWED TOWARDS HATCH AND SEDAN.
n_rides_d : number of rides provided; as it should be a function of the time spent as an Uber driver, we will create this data correlating it from exp, with an average of rides per month of 500, starting obviously at 0 rides and with an error of 15%.

# value settings
nd <- 50 # number of drivers
md <- 4.1 # mean driver rating
sdd <- 0.5 # standard deviation for the driver rating


# variables.
driver_id <- c(1:50)

exp <- as.integer(runif(nd, 0, 120)) # making it an integer value

driver_rating <- rnorm(50, mean = md, sd = sdd) %>% 
                            round(digits = 2)

car_type <- sample(c("hatch", "sedan", "SUV", "minivan", NA), 
                   nd, 
                   replace = TRUE, 
                   prob = c(0.3, 0.4, 0.15, 0.05, 0.1))

# for the number of rides
rpm <- 500 # rides per month
rerror <- 75 # 15% ride error

error1 <- rnorm(nd, mean = rpm, sd = rerror) # generating random error

# Include the error within the linear model 
y_vec <- function(x) {
              rpm * x + error1 
} # creating the function to correlate the data

n_rides_d <- as.integer(y_vec(exp)) # as we cannot have parts of a trip

# driver dataframe
driver <- data.frame(driver_id = driver_id,
                     exp = exp,
                     avg_driver_rating = driver_rating,
                     car_type = car_type,
                     n_rides_d = n_rides_d)

head(driver)

For the member dataframe, we are going to generate a number of variables, for a sample of 150 members, which will act as our observations:

user_id : with a numeric value, this is a simple index generator from 1-150.
member_since : with a random year, for the last 10 years, taking into consideration that this is a sample set, we are assuming a normal distribution, instead of the known distribution correlated with the company’s growth.
member_rating : member rating as given by the drivers, we will be using the rnorm function, as the distribution will better resemble a skewed distribution.
premium : whether a member has selected premium membership only, here we will induce a NA, for those who have not chosen.

# value settings
nm <- 150 # number of members
mm <- 4.3 # mean member rating
sdm <- 0.61 # standard deviation for the member rating

# variables
user_id <- c(1:nm)

member_since <- round(runif(nm, 2012, 2022), digits = 2) # we will have to transform this to a date format

member_rating <- rnorm(nm, mean = mm, sd = sdm) %>% 
                            round(digits = 2)

premium <- sample(c("Y", "N", NA), nm, replace = TRUE, prob = c(0.2, 0.6, 0.2))


# member dataframe 
member <- data.frame(member_id = user_id,
                     member_since = member_since,
                     avg_member_rating = member_rating,
                     premium = premium)

head(member)

For the rides dataframe, we are going to generate a number of variables, which contain elements from the previous tables, plus some intrinsic to this one, for 300 rides, which will act as our observations:

driver_id : in this case it will randomly pick out of the 50 drivers with replacement, using the sample function.
customer_id : it will randomly pick out of the 150 members with replacement.
driver_ride_rating : generating a rating for the driver during the ride, randomly from 1 to 5 stars (numeric), as given by the member, this cannot be missing as it is required for a driver to provide at rating to close the job. A rating probability, similar to the averages above, has been provided.
member_ride_rating : same as above, although rider rates member, there is an opportunity here to introduce missing values, as members can forget to provide ratings. A rating probability has been provided, relative to the averages on previous tables.
ride_distance : correlated with the cost, the booking fee, plus an error of time spent,in km.
notes : a selection from the ride
ride_fee : in dollars. This will be created after the dataframe has been emsembled from the previous variables, to showcase a different methodology, fees vary between $8-$150.

nr <-  300 # number of rides
ratings <- factor(c(1, 2, 3, 4, 5), ordered = TRUE) # factor ratings
ratings_NA <- c("1", "2", "3", "4", "5", NA) # char values

# variables
d_id  <- sample(driver_id, nr, replace = TRUE)

m_id  <- sample(user_id, nr, replace = TRUE)

driver_ride_rating <- sample(ratings, nr, replace = TRUE, prob = c(0.02, 0.04, 0.06, 0.28, 0.7))

member_ride_rating <- sample(ratings_NA, nr, replace = TRUE, prob = c(0.02, 0.04, 0.14, 0.3, 0.3, 0.2))

#ride_fee <- round(runif(nr, 8, 150), digits = 2) # a normal distribution provides too larger values
avdfee <- 27.54 # average ride fee
sdfee <- 60 # standard deviation

ride_fee <- rnorm(nr, mean = avdfee, sd = sdfee) %>% # for a better value spread
                    round(digits = 2) %>% 
                    abs() # to eliminate possible negatives


# to generate the ride distance, we have a number of factors to input
bf <- -5.5 # booking fee offset
ppk <- 0.54 # km per dollar
error2 <- 3 # km or the time equivalent


y_d <- function(x) {
              bf + ppk * x + error2 
}

ride_distance <- round(y_d(ride_fee), digits = 2)


notes <- sample(c("Good navigation", "Good driving", "Clean Vehicle", NA), nr, replace = TRUE)


# ride dataframe
rides <- data.frame(driver_id = d_id,
                   member_id = m_id,
                   driver_ride_rating = driver_ride_rating,
                   member_ride_rating = member_ride_rating,
                   fee = ride_fee,
                   ride_distance = ride_distance,
                   notes = notes)

head(rides)

STEP 2: Merge your synthetic datasets.

We are going to extend the rides dataset with information from the other two tables by performing two left joins. First, we are going to create a copy of the rides set, in order to be able to go back to it, should be need to. Secondly, we are going to join the driver and member tables, to be able to have all the information we need in one table, by selecting certain columns. For this we are going to use the driver_id and member_id columns to perform the joins, which R automatically assigns as the columns to join by, given that they are name in the same way.

rides_joined <- rides # creating a copy of rides

rides_joined %<>% 
  left_join(driver %>% 
              select(1:4)) %>% # leaving the number of rides behind
  left_join(member)

trips <- rides_joined

head(trips)

To showcase the requirements:

str(trips)

## 'data.frame':    300 obs. of  13 variables:
##  $ driver_id         : int  13 23 18 17 11 27 33 7 25 11 ...
##  $ member_id         : int  40 46 52 108 31 60 6 103 113 140 ...
##  $ driver_ride_rating: Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 5 5 5 4 5 5 5 5 4 ...
##  $ member_ride_rating: chr  "4" "5" "3" "4" ...
##  $ fee               : num  66.01 8.49 116.51 211.09 66.74 ...
##  $ ride_distance     : num  33.15 2.08 60.42 111.49 33.54 ...
##  $ notes             : chr  "Clean Vehicle" "Clean Vehicle" "Good navigation" "Clean Vehicle" ...
##  $ exp               : int  12 99 28 49 108 96 93 63 99 108 ...
##  $ avg_driver_rating : num  4.38 3.34 4.5 4.4 3.46 4.2 3.65 3.66 3.3 3.46 ...
##  $ car_type          : chr  "SUV" "hatch" "hatch" "SUV" ...
##  $ member_since      : num  2022 2020 2019 2020 2016 ...
##  $ avg_member_rating : num  3.11 3.92 4.88 4.07 4.82 4.3 4.21 3.6 4.19 4.04 ...
##  $ premium           : chr  "Y" "N" "Y" "N" ...

Let’s change a few of the variables into their correct formats.

trips$member_ride_rating  %<>%  as.integer() %>% 
                                as.factor()

rides$member_ride_rating  %<>%  as.integer() %>% # in case we need to use it separately
                                as.factor()

str(trips$member_ride_rating)

##  Factor w/ 5 levels "1","2","3","4",..: 4 5 3 4 4 5 NA 3 5 4 ...

The premium column can be changed to an ordered factor too:

trips$premium  %<>%  as.factor() %>% 
                      as.ordered()

member$premium  %<>%  as.factor() %>% 
                      as.ordered()

str(trips$premium)

##  Ord.factor w/ 2 levels "N"<"Y": 2 1 2 1 2 2 NA NA NA 1 ...

STEP 3: Create/mutate at least one variable from the existing ones.

n_rides_m : We were going to create a number of rides variable, at the member table, based in the time as a member. (UNABLE TO GENERATE DESIRED RESULT)

In the ride table: - ride_time : in minutes, will be created based on the distance, a given trip average speed and traffic and navigation errors.

'
I was trying something here, but it is not working out
#### number of rides for the member table
# first transforming the number to a date time format.
avridepw <- 6 # rides per month
error3 <- 5 # 

member %<>% 
  mutate(n_rides_m = (avridepw * 12 * (2022 - member_since) + 
                      #error3 * 12 * (2022 - member_since[member$premium == "Y"])+
                        #I do not know how to add error to the premium members
                        runif(1, min = 0, max = 10) * (2022 - member_since)))
  
member$n_rides_m %<>%  as.numeric() '

## [1] "\nI was trying something here, but it is not working out\n#### number of rides for the member table\n# first transforming the number to a date time format.\navridepw <- 6 # rides per month\nerror3 <- 5 # \n\nmember %<>% \n  mutate(n_rides_m = (avridepw * 12 * (2022 - member_since) + \n                      #error3 * 12 * (2022 - member_since[member$premium == \"Y\"])+\n                        #I do not know how to add error to the premium members\n                        runif(1, min = 0, max = 10) * (2022 - member_since)))\n  \nmember$n_rides_m %<>%  as.numeric() "

# for the ride time, we have an average speed of 40km/h and a traffic or congestion error of 20%
avs <- 3/2 # in min/km
error3 <- 0.05 # in min/km
trips %<>% 
  mutate(ride_time = (avs * ride_distance + error3 * ride_distance) 
         %>% as.integer())

head(trips)

STEP 4: Scan all variables for missing values.

Given that we are going to deal with the NAs in bulk for each one of the tables, using the colSums function will give us a easier overview of the tasks at hand.

trips dataset

colSums(is.na(trips))

##          driver_id          member_id driver_ride_rating member_ride_rating 
##                  0                  0                  0                 76 
##                fee      ride_distance              notes                exp 
##                  0                  0                 70                  0 
##  avg_driver_rating           car_type       member_since  avg_member_rating 
##                  0                 16                  0                  0 
##            premium          ride_time 
##                 76                  0

We can see the macro effect, which small amounts of NAs in the tables used to join datasets, have overall.

member table

colSums(is.na(member))

##         member_id      member_since avg_member_rating           premium 
##                 0                 0                 0                38

Given that most people using premium would have chosen that option in advance, we are going to assume that all of the missing values are from customers that forgot to complete their profile, thus we will impute the value “N”, and update all of the columns with those values missing.

member$premium <- impute(member$premium, fun = mode)
trips$premium <- impute(trips$premium, fun = mode)

colSums(is.na(member)) # to show that it has been fixed

##         member_id      member_since avg_member_rating           premium 
##                 0                 0                 0                 0

driver table

colSums(is.na(driver))

##         driver_id               exp avg_driver_rating          car_type 
##                 0                 0                 0                 3 
##         n_rides_d 
##                 0

In order deal with these missing values, we are going to impute a new factor.

driver %<>% 
  mutate(car_type = case_when(is.na(car_type) ~ "unknown", 
                               TRUE ~ car_type)) 
trips %<>% 
  mutate(car_type = case_when(is.na(car_type) ~ "unknown", 
                               TRUE ~ car_type)) 
colSums(is.na(driver))

##         driver_id               exp avg_driver_rating          car_type 
##                 0                 0                 0                 0 
##         n_rides_d 
##                 0

Looking again at the overall missing values at the trips table,

colSums(is.na(trips))

##          driver_id          member_id driver_ride_rating member_ride_rating 
##                  0                  0                  0                 76 
##                fee      ride_distance              notes                exp 
##                  0                  0                 70                  0 
##  avg_driver_rating           car_type       member_since  avg_member_rating 
##                  0                  0                  0                  0 
##            premium          ride_time 
##                  0                  0

We can see that we are missing some notes and some member ratings, for this we are going to use similar methods than above. For :

member_ride_rating we will do with imputing the mode, as the ride did not raise any negative or extreme positive feelings, to provide a rating; given that it is set as a character type.
notes we will rewrite the missing responses with “Left blank”.

rides$member_ride_rating <- impute(rides$member_ride_rating, fun = mode)
trips$member_ride_rating <- impute(trips$member_ride_rating, fun = mode)

colSums(is.na(member))

##         member_id      member_since avg_member_rating           premium 
##                 0                 0                 0                 0

trips %<>% 
  mutate(notes = case_when(is.na(notes) ~ "Left blank", 
                               TRUE ~ notes))

rides %<>% 
  mutate(notes = case_when(is.na(notes) ~ "Left blank", 
                               TRUE ~ notes))

colSums(is.na(member)) # for the check

##         member_id      member_since avg_member_rating           premium 
##                 0                 0                 0                 0

STEP 5: Scan all numeric variables for outliers.

Use any suitable technique to deal with the outliers.

First we start by having a look at the dataset’s summary.

summary(trips)

## 
##  76 values imputed to 4 
## 
## 
##  76 values imputed to N

##    driver_id       member_id      driver_ride_rating member_ride_rating
##  Min.   : 1.00   Min.   :  1.00   1:  6              1:  9             
##  1st Qu.:13.00   1st Qu.: 39.00   2: 11              2:  7             
##  Median :24.00   Median : 76.00   3: 16              3: 46             
##  Mean   :25.05   Mean   : 76.73   4: 68              4:169             
##  3rd Qu.:36.00   3rd Qu.:115.00   5:199              5: 69             
##  Max.   :50.00   Max.   :150.00                                        
##       fee         ride_distance       notes                exp        
##  Min.   :  0.09   Min.   : -2.45   Length:300         Min.   :  4.00  
##  1st Qu.: 19.52   1st Qu.:  8.04   Class :character   1st Qu.: 28.00  
##  Median : 45.73   Median : 22.20   Mode  :character   Median : 49.00  
##  Mean   : 53.09   Mean   : 26.17                      Mean   : 57.68  
##  3rd Qu.: 77.57   3rd Qu.: 39.39                      3rd Qu.: 93.00  
##  Max.   :211.09   Max.   :111.49                      Max.   :117.00  
##  avg_driver_rating   car_type          member_since  avg_member_rating premium
##  Min.   :2.730     Length:300         Min.   :2012   Min.   :2.780     N:229  
##  1st Qu.:3.650     Class :character   1st Qu.:2014   1st Qu.:3.810     Y: 71  
##  Median :4.060     Mode  :character   Median :2017   Median :4.285            
##  Mean   :4.085                        Mean   :2017   Mean   :4.275            
##  3rd Qu.:4.490                        3rd Qu.:2019   3rd Qu.:4.720            
##  Max.   :5.380                        Max.   :2022   Max.   :5.790            
##    ride_time     
##  Min.   : -3.00  
##  1st Qu.: 12.00  
##  Median : 34.00  
##  Mean   : 40.09  
##  3rd Qu.: 60.25  
##  Max.   :172.00

We can see that we need to change a couple of things, which are problems that have arised at the data generation stage:

1.- the maximum average ratings are about the maximum rating of 5.

trips$avg_member_rating[trips$avg_member_rating > 5] <- mean(trips$avg_member_rating)

member$avg_member_rating[member$avg_member_rating > 5] <- mean(member$avg_member_rating)

trips$avg_driver_rating[trips$avg_driver_rating > 5] <- mean(trips$avg_driver_rating)

driver$avg_member_rating[driver$avg_driver_rating > 5] <- mean(driver$avg_driver_rating)

summary(trips$avg_driver_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.730   3.650   4.060   4.052   4.440   4.950

2.-Negative riding distance

trips$ride_distance[trips$ride_distance < 0.01] <- mean(trips$ride_distance)

rides$ride_distance[rides$ride_distance > 0.01] <- mean(rides$ride_distance)

trips$ride_time[trips$ride_time < 0.01] <- mean(trips$ride_time)

summary(trips$ride_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   40.09   42.73   60.25  172.00

Now we can apply some of the methods to find outliers, starting with boxplots for discrete variables, like the ride ratings:

trips$driver_ride_rating %>%  
  boxplot(main = "Boxplot of driver ratings", ylab = "rating", col = "grey")

Here we can see that the outliers are linked to negative ratings, which comes to show that the distributions used to generate the data resemble daily life, as drivers would eventualy get pushed out of the app, if consistently delivered poor ratings.

trips$member_ride_rating %>%  
  boxplot(main = "Boxplot of member ratings", ylab = "rating", col = "grey")

This one is a little harder to read, as the quatiles and the mean are collapsed in the same rating, might mean that the distribution used to generate the data is too centered around the mean value.

And also use a this method for a continuous variable.

boxtf <- trips$ride_distance %>%  
  boxplot(main = "Trip fee boxplot", ylab = "Trip fee ($)", col = "grey")

This one gives us a neat number of outliers, which seem to be three, and to do with the most expensive trips. We can extract the outliers directly from the boxplot, by using the list structure of the boxplot function, which helps us finding them easily.

boxtf

## $stats
##          [,1]
## [1,]  0.11000
## [2,] 10.91000
## [3,] 26.16883
## [4,] 39.47500
## [5,] 82.08000
## 
## $n
## [1] 300
## 
## $conf
##          [,1]
## [1,] 23.56310
## [2,] 28.77457
## 
## $out
## [1] 111.49  86.32  97.62  84.81  88.91
## 
## $group
## [1] 1 1 1 1 1
## 
## $names
## [1] ""

The fourth element of the list gives us the three outliers for this variable. Given that the trip distance and time are correlated by design with the fee, we should find similar distributions, although we had made some changes to the distribution before.

boxtd <- trips$ride_distance %>%  
  boxplot(main = "Trip distance boxplot", ylab = "Trip distance (km)", col = "grey")

This one is showing more outliers than the previous one, this could be because of the error induced in the correlation.

boxtd[[4]] # from the 4th position in the list

## [1] 111.49  86.32  97.62  84.81  88.91

Let’s look at the trip time:

boxtt <- trips$ride_time %>%  
  boxplot(main = "Trip time boxplot", ylab = "Trip time (min)", col = "grey")

Looking into the values, we can see that they are trips over two hours.

boxtt[[4]]

## [1] 172 127 133 151 125 131 137

In this instances, we are going to use value capping, to replace the outliers, with the upper limit, which can be found in the respective boxplot list, under the stats section in position 5.

trips$fee[boxtf[[4]]] <- boxtf$stats[[5]]
trips$ride_distance[boxtd[[4]]] <- boxtd$stats[[5]]
trips$ride_time[boxtt[[4]]] <- boxtt$stats[[5]]

trips$ride_time %>%  
  boxplot(main = "Trip time boxplot without original outliers", ylab = "Trip time (min)", col = "grey")

This shows that by compressing this upper outliers, the IQR has changed and pushed new points outside the fence.

Let’s look at continuous variables with the aid of the z-score method and QQ-plots. Looking at the Average ratings:

hist(trips$avg_driver_rating, breaks = 20, main = "Avg driver rating histogram", xlab = NULL)

We can see the distribution does not fit a Gaussian bell smoothly, this is because in its conception, the values were skewed to the left. This, this methodology will be better applied after transforming the data. But let’s see this on display following the creation of the z-scores and the QQ plot.

z.scores <- trips$avg_driver_rating %>%  
  outliers::scores(type = "z")

z.scores %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.69869 -0.82063  0.01633  0.00000  0.79206  1.83316

Note, we can observe the histogram of these z-scores against the theoretical normal distribution to check for normality:

hist(z.scores, freq = FALSE, breaks = 20, main = "Avg driver rating z-scores histogram with theoretical normal", xlab = NULL)

x <- seq(-10, 10, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

If the data was approximately Normal, the Q-Q plot should be an approximately straight line. Note the deviation in both tails of the distribution, and the clustering of data:

qqnorm(z.scores, main = "Q-Q Plot of driver ratings")
qqline(z.scores, col = "red", lwd = 1, lty = 2)

We can see there is large departures at values below -1 and above 1.5; with the line loosing smoothness due to clustering. This confirms, that even though the data was created with a normal distribution function, the parameters selected have provided a diversion from normality.

STEP 6: Apply a data transformation to at least one of the variables.

Given that we know, the data for the average ratings is skewed to the left, a logarithmic transformation might aid visualising the qq-plots. By creating a vector with the transforms column values, we can later on decide, which to add to the dataset.

log_avg_d_rat <- trips$avg_driver_rating %>% log()

hist(log_avg_d_rat, breaks = 20, main = "Log avg driver rating histogram", xlab = NULL)

It seems this transformation further skews the distribution to the left. For this reason lets try.

sqr_avg_d_rat <- (trips$avg_driver_rating)^2

hist(sqr_avg_d_rat, breaks = 20, main = "Sqr avg driver rating histogram", xlab = NULL)

This seems to center the distribution better, although the clustering of values is quite a lot more visible this way.

If we try the z-scores and QQ plot with this distribution.

sqrz.scores <- sqr_avg_d_rat %>%  
  outliers::scores(type = "z")

sqrz.scores %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.3347 -0.8460 -0.0442  0.0000  0.7750  1.9897

hist(sqrz.scores, freq = FALSE, breaks = 20, main = "Sqr avg driver rating z-scores histogram with theoretical normal", xlab = NULL)

x <- seq(-10, 10, by = 0.001) 
lines(x = x, y = dnorm(x), col = "red")

Now that the distribution seems to approximate more to the Normal, the Q-Q plot should be an approximately straight line.

qqnorm(sqrz.scores, main = "Q-Q Plot of sqr driver ratings")
qqline(sqrz.scores, col = "red", lwd = 1, lty = 2)

Interesting that the end result is very similar to the original prior transformation.

STEP 7: Summary statistics.

Providing the summary statistics (mean, median, minimum, maximum, standard deviation) for each numeric variable grouped by one of the qualitative (categorical) variable.

Let’s start with the avg_member_rating for grouped by premium members:

sum_stats1 <- trips %>% 
  group_by(premium) %>%
  summarise(Min_Rating = min(avg_member_rating), 
            Median_Rating = median(avg_member_rating), 
            Mean_Rating = mean(avg_member_rating), 
            SD_Rating = sd(avg_member_rating),
            Max_Rating = max(avg_member_rating)
            )
sum_stats1

Interestingly, without adding a positive tendency towards premium members, to the creation of the values seem to have that tendency on their own.

Now it is the turn for the avg_driver_rating for grouped by car_type:

sum_stats2 <- trips %>% 
  group_by(car_type) %>%
  summarise(Min_Rating = min(avg_driver_rating), 
            Median_Rating = median(avg_driver_rating), 
            Mean_Rating = mean(avg_driver_rating), 
            SD_Rating = sd(avg_driver_rating),
            Max_Rating = max(avg_driver_rating)
            )
sum_stats2

SOmething similar happens here with the type of vehicle and the ratings, there should be a tendency to have better ratings, the better the car, and although that was not imputed in the creation of the data, it seems to follow that trend.

Let’s look at the fee for grouped by premium members:

sum_stats3 <- trips %>% 
  group_by(premium) %>%
  summarise(Min_Fee = min(fee), 
            Median_Fee = median(fee), 
            Mean_Fee = mean(fee), 
            SD_Fee = sd(fee),
            Max_Fee = max(fee)
            )
sum_stats3

The tendency for this values is obviously contrary to the expected, plus it still shows that there is some more work to be done to reflect the actual minimun values.

One last summary for the ride_time for grouped by notes:

sum_stats4 <- trips %>% 
  group_by(notes) %>%
  summarise(Min_time = min(ride_time), 
            Median_time = median(ride_time), 
            Mean_time = mean(ride_time), 
            SD_time = sd(ride_time),
            Max_time = max(ride_time)
            )
sum_stats4

A better standard deviation, and or error, need to be selected when generating this values.

Saving requirements

Include your report as a .pdf, the synthetic dataset files (.xlsx) produced at step 1 and the merged dataset file (.xlsx) produced at step 2, in a single submission for Assessment 2 via Canvas. For help on how to save your R markdown report as a pdf, see the Getting Started Guide in Orientation week. Write your plain text here.

'
# Create and format workbook, create worksheets 
wb1 <- openxlsx::createWorkbook()
modifyBaseFont(wb1, fontSize = 8, fontName = "Arial") 

addWorksheet(wb1, sheetName = "driver") 
addWorksheet(wb1, sheetName = "member")
addWorksheet(wb1, sheetName = "rides")

# Tab titles 
tab1_title <- "Table 1: Drivers" 
tab2_title <- "Table 2: Members" 
tab3_title <- "Table 3: Rides"

writeData(wb1 = wb1, 
          sheet = "driver", 
          x = tab1_title)
writeData(wb1 = wb1, 
          sheet = "member", 
          x = tab2_title)
writeData(wb1 = wb1, 
          sheet = "rides", 
          x = tab3_title)

# Add the data 
writeData(wb1, 
          sheet = "driver", 
          x = driver, 
          startRow = 3, 
          headerStyle = createStyle(textDecoration = "Bold"), 
          borders = "surrounding", 
          borderStyle = "thick", 
          withFilter = c(TRUE))
writeData(wb1, 
          sheet = "member", 
          x = member, 
          startRow = 3, 
          headerStyle = createStyle(textDecoration = "Bold"), 
          borders = "surrounding", 
          borderStyle = "thick", 
          withFilter = c(TRUE))
writeData(wb1, 
          sheet = "rides", 
          x = rides, 
          startRow = 3, 
          headerStyle = createStyle(textDecoration = "Bold"), 
          borders = "surrounding", 
          borderStyle = "thick", 
          withFilter = c(TRUE))

# Save out to Excel 
saveWorkbook(wb1, 
             here("MATH2405_codes/output", "Original Synthetic Data.xlsx"), 
             overwrite = TRUE) 
'

## [1] "\n# Create and format workbook, create worksheets \nwb1 <- openxlsx::createWorkbook()\nmodifyBaseFont(wb1, fontSize = 8, fontName = \"Arial\") \n\naddWorksheet(wb1, sheetName = \"driver\") \naddWorksheet(wb1, sheetName = \"member\")\naddWorksheet(wb1, sheetName = \"rides\")\n\n# Tab titles \ntab1_title <- \"Table 1: Drivers\" \ntab2_title <- \"Table 2: Members\" \ntab3_title <- \"Table 3: Rides\"\n\nwriteData(wb1 = wb1, \n          sheet = \"driver\", \n          x = tab1_title)\nwriteData(wb1 = wb1, \n          sheet = \"member\", \n          x = tab2_title)\nwriteData(wb1 = wb1, \n          sheet = \"rides\", \n          x = tab3_title)\n\n# Add the data \nwriteData(wb1, \n          sheet = \"driver\", \n          x = driver, \n          startRow = 3, \n          headerStyle = createStyle(textDecoration = \"Bold\"), \n          borders = \"surrounding\", \n          borderStyle = \"thick\", \n          withFilter = c(TRUE))\nwriteData(wb1, \n          sheet = \"member\", \n          x = member, \n          startRow = 3, \n          headerStyle = createStyle(textDecoration = \"Bold\"), \n          borders = \"surrounding\", \n          borderStyle = \"thick\", \n          withFilter = c(TRUE))\nwriteData(wb1, \n          sheet = \"rides\", \n          x = rides, \n          startRow = 3, \n          headerStyle = createStyle(textDecoration = \"Bold\"), \n          borders = \"surrounding\", \n          borderStyle = \"thick\", \n          withFilter = c(TRUE))\n\n# Save out to Excel \nsaveWorkbook(wb1, \n             here(\"MATH2405_codes/output\", \"Original Synthetic Data.xlsx\"), \n             overwrite = TRUE) \n"

# write to a csv file in the output sub-directory

write_csv(member, 
          path = here("/Users/samuelklettnavarro/PY4E/MATH2405_codes/output", "member.csv"))
write_csv(driver, 
          path = here("/Users/samuelklettnavarro/PY4E/MATH2405_codes/output", "driver.csv"))
write_csv(rides, 
          path = here("/Users/samuelklettnavarro/PY4E/MATH2405_codes/output", "rides.csv"))
write_csv(rides_joined, path = here("/Users/samuelklettnavarro/PY4E/MATH2405_codes/output", "rides_joined.csv"))

References

-STHDA (Unknown), ‘Saving Data into R Data Format: RDS and RDATA’, Statistical tools for high-throughput data analysis. Accessed on 27 March 2022 http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata