SQL Test

Question 1

Certain analyses could be complicated if the investigator does not consider the differences between Uber’s products, such as price, routing method, the demand for and supply of riders and drivers, the demographics of typical riders and drivers, and customer satisfaction. Average price and profit will be wrong if the investigator does not weight by the proportion of drivers driving each product, or does not take separate averages per product. Average distance will be impacted given that customers likely call different services for different destinations (e.g. more customers may take longer rides or airport trips in the more spacious uberXL). Analytics around demand and supply will be confounded by the density of different products across geographic locations (e.g. there is likely a dearth of UberBLACKs in lower-income areas). Similarly, the general demand for certain products will vary, with customers likely favoring the cheapest option; customer preference will also vary with geographic location and the supply of drivers in the area.
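To make the weighting point concrete, here is a toy sketch in R (the numbers are purely illustrative, not Uber data): the naive mean of per-product average fares differs from the mean weighted by each product’s share of trips.

library(data.table)

# hypothetical per-product average fares and trip shares, for illustration only
products <- data.table(product   = c("uberX", "uberXL", "UberBLACK"),
                       avgFare   = c(12, 20, 45),
                       tripShare = c(0.75, 0.15, 0.10))

mean(products$avgFare)               # naive mean of per-product averages
products[, sum(avgFare * tripShare)] # average weighted by each product's trip share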

These differences shape several important questions: which drivers drive the most, which earn the most, how much is earned per customer across drivers, and how many passengers a driver picks up within a shift. Answering these questions without accounting for product would inflate the figures or grossly misrepresent the true nature of Uber’s business. Without isolating product type, any analytics gleaned from the cross-dispatch database will be biased.

Question 2

First, I make some sample data to help me think through the queries and the asks. Once that is done, I write a query that gives the number of rides completed in NYC in March 2014, grouped by week. The query returns no records, but that is only an artifact of the random data generation.

# mock drivers table (one row per ride request)
drivers <- data.table(driverId = sample(x = 20, size = 100, replace = T), 
                      time = sample(x = 0:23, size = 100, replace = T),
                      date = sample(x = c("03/01/2012", "03/02/2012", "03/03/2012", 
                                          "03/04/2012", "03/05/2012", "03/06/2012", "03/07/2012" ), 
                                    size = 100, replace = T),
                      type = sample(x = c("uberX", "uberXL", "UberBLACK", "UberSUV"), size = 100, replace = T),
                      fare = sample(x = 7:50, size = 100, replace = T),
                      numPassenger = sample(x = 5, size = 100, replace = T),
                      status = c("canceled", "completed"))

drivers <- drivers[order(driverId, date)]

set.seed(8675309)
cities <- data.table(id = sample(x = 2, size = 10000, replace = T), 
                     timezone = sample(x = c("America/Los_Angeles", 
                                             "America/New_York"), size = 10000, replace = T))
cities[timezone == "America/New_York", name := "new_york"]
cities[timezone == "America/Los_Angeles", name := "san_francisco"]

trips <- data.table(request_at = randomDates(10000),
                    driver_id = sample(x = 5, size = 10000, replace = T),
                    city_id = sample(x = 2, size = 10000, replace = T),
                    fare = sample(x = 8:50, size = 10000, replace = T),
                    status = c("canceled", "completed"),
                    request_vehicle_view_id = sample(x = 10, size = 10000, replace = T))

vehicle_views <- data.table(id = sample(x = 10, size = 10000, replace = T),
                            city_id = sample(x = 1:2, size = 10000, replace = T),
                            name = sample(x = c("uberX", "uberXL", "UberBLACK", "UberSUV"), 
                                          size = 10000, replace = T))

driver_shifts <- data.table(city_id= sample(x = 2, size = 10000, replace = T), 
                            driver_id = sample(x = 5, size = 10000, replace = T),
                            seconds_on_shift = sample(x = 43200, size = 10000, replace = T),
                            occurred_at = randomDates(10000))

sqldf("SELECT driver_id, 
COUNT(*) as trips_completed_nyc, 
STRFTIME('%W', request_at) AS week
FROM trips
WHERE city_id IN (SELECT id FROM cities WHERE name = 'new_york')
AND status = 'completed'
AND request_at BETWEEN '2014-03-01' AND '2014-03-31'
GROUP BY week
ORDER BY week desc;")
## [1] trips_completed_nyc week
## <0 rows> (or 0-length row.names)

Question 3

Supply hours could be inflated for cross-dispatched drivers if hours are being double counted. For example, if the app logs time separately for a driver who drives both uberX and uberXL, the total time could be overstated. More likely, though, the hours are being inflated because there is no filter for completed trips in the driver_times table, as there is in the driver_fares table. The same holds true for the count of drivers: the final aggregation could be capturing drivers who logged time but never successfully completed a trip. This has ramifications for avg_fares_per_hour, as it may be counting more hours spent driving than hours spent collecting fares.

Alongside the problems cited already, avg_fares_per_hour will be skewed because it takes a mean of means rather than a true mean: it will be skewed high if proportionally more drivers engage in high-fare driving (e.g. UberBLACK) and skewed low if the converse is true. To get a true mean, the hours and fares for drivers who completed trips during the time window in NYC should each be summed across all drivers, and the fare total divided by the hour total (with the sample chosen using the same filter parameters). The figure will also overestimate the amount made per driver because the calculation is agnostic of vehicle view or product type; a driver driving UberBLACK earns higher rates than one driving uberX. Conversely, the figure could underestimate how much some drivers are making since, presumably, most drivers serve as standard uberX vehicles. Aggregate fares may also be inflated because drivers who cross-dispatch may be double counted. The proportion of each product type in the data set would dictate the direction of the error.
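To make this concrete, here is a sketch against my sample tables from Question 2 (standing in for driver_fares and driver_times, so the table and column names here are my own assumptions): it sums fares and hours across qualifying drivers first and divides once, rather than averaging per-driver ratios.

sqldf("
  SELECT f.total_fares / h.total_hours AS true_avg_fares_per_hour
  FROM (SELECT SUM(fare) * 1.0 AS total_fares
        FROM trips
        WHERE status = 'completed') f,
       (SELECT SUM(seconds_on_shift) / 3600.0 AS total_hours
        FROM driver_shifts
        WHERE driver_id IN (SELECT driver_id FROM trips WHERE status = 'completed')) h;
")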

Bonus

The driver_id inside the COUNT needs to be qualified with the driver_times table name (or an alias), e.g. COUNT(driver_times.driver_id). Since the INNER JOINs bring in multiple driver_id columns, an unqualified driver_id is ambiguous, and one must specify which driver_id is being referenced.
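As an illustration using my sample tables from Question 2 (not the original query, whose tables I don’t have), the join below would make an unqualified driver_id ambiguous, so the COUNT names the table explicitly:

sqldf("SELECT COUNT(DISTINCT driver_shifts.driver_id) AS num_drivers
FROM driver_shifts
INNER JOIN trips ON trips.driver_id = driver_shifts.driver_id
WHERE trips.status = 'completed';")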

Analysis Test

Load the data and pre-process

Before I work with the data, I want to clean it up and get a baseline understanding. Given the structure of the raw data, I’m making the assumption that the timestamps are stored in UTC and need to be converted to Eastern time. First, I convert the vector to a datetime, zero out the seconds and minutes (since they will not aid in the analysis) without a) rounding or b) substringing to the point of changing the data type, and update the timezone. This gets the data down to date/hour specificity without overly complicating it. I also add vectors for the isolated date, week, hour, hour squared (in case the trends are not linear), the day of the week, a dummy variable for weekends, and the number of client logins per date/hour combination. These variables will assist the analyses as I move forward.

# Load file as a data.table and save a raw copy of the file
logins <- data.table(time = fromJSON(paste(readLines("logins.json"), collapse="")))
loginsRaw <- logins

# coerce to date/time object, get rid of the seconds and minutes without a) rounding or b) substringing to the point of changing the data type, and update timezone
logins[, time := ymd_hms(time)]
minute(logins$time) <- 0
second(logins$time) <- 0
attr(logins$time, "tzone") <- "EST"

## make a column just for date, weeks, and hours which will come in handy later
logins[, date := as.Date(substr(time, 1, 10))]
logins[, week := week(time)]
logins[, hour := hour(time)]
logins[, hourSq := hour^2]

## get weekdays, make flag for weekends 
logins[, weekday := weekdays(time)]
logins[, weekend := "weekday"]
logins[weekday %in% c("Saturday", "Sunday"), weekend := "weekend"]
logins[weekday== "Friday" & hour >= 5, weekend := "weekend"]

## set order and get counts 
logins <- logins[order(time)]
logins[, clientLogins := .N, by = time]

I inspect the data before moving forward by looking at the distributions (where applicable) and creating a correlation matrix to see if any correlations jump out. Already, I can see that client logins are highly, positively correlated with the index (a proxy for date), which means that logins are rising as time passes. There is also a looser, but still notable, association with hour: as it gets later, there are more logins. Relatedly, there appear to be more logins on the weekend.

# Number of logins by weekday, hour, and week
logins[, .N, by = weekday][order(N)]
##      weekday    N
## 1:   Tuesday 1939
## 2:    Monday 2039
## 3: Wednesday 2421
## 4:  Thursday 3116
## 5:    Sunday 3431
## 6:    Friday 3795
## 7:  Saturday 5706
logins[, .N, by = hour][order(N)]
##     hour    N
##  1:    4  247
##  2:    3  263
##  3:    5  325
##  4:    2  378
##  5:    6  471
##  6:    7  628
##  7:    1  671
##  8:    9  678
##  9:   11  691
## 10:    8  702
## 11:   10  708
## 12:   12  762
## 13:   13  795
## 14:   14  813
## 15:   15  883
## 16:    0  926
## 17:   16 1119
## 18:   23 1235
## 19:   17 1473
## 20:   22 1661
## 21:   19 1667
## 22:   20 1711
## 23:   18 1729
## 24:   21 1911
##     hour    N
logins[, .N, by = week][order(N)]
##     week    N
##  1:   18  596
##  2:    9 1322
##  3:   11 2065
##  4:   10 2228
##  5:   14 2301
##  6:   12 2337
##  7:   13 2443
##  8:   15 2745
##  9:   16 3146
## 10:   17 3264
head(logins[, .N, by = date][order(N)])
##          date   N
## 1: 2012-02-29 123
## 2: 2012-04-30 165
## 3: 2012-03-13 169
## 4: 2012-03-12 187
## 5: 2012-03-19 201
## 6: 2012-04-02 216
toPlot <- logins[, .(time = as.factor(time),
                     week = week,
                     hour = hour,
                     weekday = as.factor(weekday),
                     weekend = weekend,
                     clientLogins = clientLogins)]


# Facet-wrapped histograms for a quick at-a-glance view of all numeric variables
toPlot %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram() + theme_minimal() +
  labs(title = "Distribution across data")

# create an index to act as a proxy for the passage of time (rows are ordered chronologically)
logins[, index := seq_len(.N)]

logins[weekend == "weekend", weekendNum := 1]
logins[weekend == "weekday", weekendNum := 0]
# Get a correlation plot 
corData <- cor(logins[, .(index, week, hour, weekendNum, clientLogins)])

corrplot(corData, 
         type = "upper", 
         order = "hclust", 
         tl.col = "black", 
         tl.srt = 45)

## drop index field and weekendNum column
logins[, `:=`(index = NULL,
              weekendNum = NULL)]

Get weather data

In my experience, my decision to take an Uber has been significantly impacted by two factors: 1) weather and 2) other options for transit. To address the first point, using a free API from DarkSky, I gathered daily temperature and rainfall patterns for the dates of interest. While it would be ideal to get hourly data, the number of requests required exceeded the free query limit. To the second point, despite my best efforts, I could not find accessible records of NYC transit maintenance and delays. Ideally, I would also get the average minutes of delay and the number of subway lines/bus routes that were not operating. Once I pull the data, I calculate an average temperature by day and merge it onto the master logins data.

### I've commented out the API query since it will fail for anyone w/o a key and for me since I've exhausted my requests
## create a frame for the weather data
# weather <- data.table(time = "", 
#                       temperatureHigh = "", 
#                       temperatureLow = "",
#                       precipIntensity = "")

# get a unique list of dates
# dates <- unique(as.POSIXct(loginsRaw$time))
# 
# # loop through all dates
# for (i in seq(length(dates))) {
# then <- get_forecast_for(43.2672, -73.9801, dates[i], add_headers=TRUE)
# then <- as.data.table(then$daily)
# then <- then[, .(time, temperatureHigh, temperatureLow, precipIntensity)]
# weather <- rbind(then, weather,fill = T)
# }
# 
# # save to desktop to ensure that data isn't lost when api credential is maxed out
# write.csv(x = weather, file = "weather.csv")

## pull data in, get average temperature, and subset
weather <- data.table(read.csv("weather.csv"))
weather[, temperature := (temperatureHigh+temperatureLow)/2]
weather <- weather[, .(date = as.Date(time), temperature, precipIntensity)]

# join weather data onto the logins data
setkey(weather, date)
setkey(logins, date)
logins <- weather[logins]

Prepare data and plot

First, I want to generate a pure breakdown of logins by hour across all dates. There seems to be a lull during the day and an uptick during evening hours. Immediately, I wonder how these patterns change with the weekend and the weather. To assess these factors, I create factor variables for weather (whether it was cold) and for time of day (morning, day, night, or late night).

For this set of date/hour analyses (as opposed to the hour-only and week-only analyses later), I’m being more critical because a) there is more data, b) there is more variance, and c) there are more factors that could account for that variance. I’ve also employed a set of interactive plots to help discern where trends are happening and to illuminate why.

## aggregate data
hourlyDayBreakdown <- unique(logins[, .(`client logins` = .N,
                                        temperature = temperature,
                                        weekend = weekend,
                                        hour = hour), 
                                    by = time][order(time)])
## make a time of day factor
hourlyDayBreakdown[, timeOfDay := "day"]
hourlyDayBreakdown[hour %in% c(5:10), timeOfDay := "morning"]
hourlyDayBreakdown[hour %in% c(19:23), timeOfDay := "night"]
hourlyDayBreakdown[hour %in% c(0:4), timeOfDay := "late night"]

## make a weather proxy
hourlyDayBreakdown[, weather := "temperate"]
hourlyDayBreakdown[temperature<= 50, weather := "cold"]

## Plots
# 1) Time of day
plot_ly(data = hourlyDayBreakdown, x = ~time, y = ~`client logins`, color = ~timeOfDay) %>% 
  layout(title = "Logins over Time by Time of Day", xaxis = list(title = "Time"), yaxis = list(title = "Client Logins"))
# 2) Weekend
plot_ly(data = hourlyDayBreakdown, x = ~time, y = ~`client logins`, color = ~as.factor(weekend)) %>% 
  layout(title = "Logins over Time by Weekend Status", xaxis = list(title = "Time"), yaxis = list(title = "Client Logins"))
# 3) Weather
plot_ly(data = hourlyDayBreakdown, x = ~time, y = ~`client logins`, color = ~weather) %>% 
  layout(title = "Logins over Time by Weather", xaxis = list(title = "Time"), yaxis = list(title = "Client Logins"))

Creating a line of best fit using linear models

Prior to plotting the line on the graph, I want to create a true best-fit line using a standard linear regression. One can see that the overall trend in logins is rising, but the rise has not been constant and is rather gradual. The first line I generate is a simple y ~ x regression of client logins on the date-hour timestamp. Finding that the \(r^{2}\) is low and the root mean squared error (RMSE) is high, I want to employ a different model that takes other factors into account.

I employ a stepwise regression. stepAIC searches the candidate models available within the data and keeps the one with the lowest AIC (\(AIC = 2k - 2\ln(\hat{L})\), where \(k\) is the number of parameters and \(\hat{L}\) is the maximized likelihood), which balances goodness of fit against model complexity, a proxy for the bias-variance tradeoff. The chosen model retains only the informative variables and discards the rest to do away with noise. Using the step model, the RMSE drops by roughly 6 points and the \(r^{2}\) rises to over 50%. Testing this model using an ANOVA, I see large F statistics and very low p-values for all variables (temperature being the weakest), but the outcome seems suspiciously accurate. Moreover, the plot of residuals vs. fitted values shows a slight parabolic shape; despite adding \(hour^{2}\) to the model, I suspect there is another non-linear relationship not being accounted for. Although the residuals roughly follow a straight line in the normal Q-Q plot, they are not quite normally distributed and are clustered. There also appear to be some influential outliers skewing the fit.

In looking at the success of the fit, the differences between the predicted and actual values have a mean of roughly 0 and a median of about 0.25, so the model is not systematically over- or under-predicting, even though individual errors can be large. However, because the data wasn’t divided into training and test sets, these in-sample numbers may be flattered by overfitting. While I could do further cross-validation (e.g. K-fold), I believe the data should be split into training and test sets before furthering validation. I do employ a Tukey HSD test to ensure that the F-statistic is not overstating the significance of differences across the date/hour combinations. The HSD confirms that each combination has a significant impact on client logins, barring roughly 3% of the comparisons.
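For reference, a minimal sketch of what that K-fold check could look like (the fold assignment is my own, and the formula uses the predictors the stepwise model ends up retaining below):

## 5-fold cross-validation sketch; not part of the fitted output shown below
set.seed(8675309)
folds <- sample(rep(1:5, length.out = nrow(logins)))
cvRMSE <- sapply(1:5, function(k) {
  fit  <- lm(clientLogins ~ date + temperature + precipIntensity + hour + hourSq + weekday,
             data = logins[folds != k])
  pred <- predict(fit, newdata = logins[folds == k])
  sqrt(mean((logins[folds == k]$clientLogins - pred)^2, na.rm = TRUE))
})
mean(cvRMSE)  # average out-of-fold RMSE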

lm1 <- lm(data = logins, formula = clientLogins ~ time)
summary(lm1)
## 
## Call:
## lm(formula = clientLogins ~ time, data = logins)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31.02 -13.31  -5.53  10.23  75.21 
## 
## Coefficients:
##                     Estimate       Std. Error t value            Pr(>|t|)
## (Intercept) -3180.1991978332   103.6084253906   -30.7 <0.0000000000000002
## time            0.0000024051     0.0000000777    30.9 <0.0000000000000002
##                
## (Intercept) ***
## time        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 22445 degrees of freedom
## Multiple R-squared:  0.0409, Adjusted R-squared:  0.0409 
## F-statistic:  958 on 1 and 22445 DF,  p-value: <0.0000000000000002
sqrt(mean(lm1$residuals^2))
## [1] 18.06
## Choose a model using the stepwise method (ignoring the time variable, since date and hour capture it, and weekend, since that's accounted for by weekday)
lmFull <- lm(data = logins[, .(date, temperature, precipIntensity, week, hour, hourSq, weekday, clientLogins)], 
             formula = clientLogins ~ .)
lmStep <- stepAIC(lmFull, direction = "both", 
                  trace = T)
## Start:  AIC=110691
## clientLogins ~ date + temperature + precipIntensity + week + 
##     hour + hourSq + weekday
## 
## 
## Step:  AIC=110691
## clientLogins ~ date + temperature + precipIntensity + hour + 
##     hourSq + weekday
## 
##                   Df Sum of Sq     RSS    AIC
## <none>                         3174670 110691
## - temperature      1      5788 3180458 110729
## - precipIntensity  1     77345 3252016 111226
## - hour             1    149503 3324174 111716
## - date             1    220150 3394820 112185
## - hourSq           1    461286 3635956 113717
## - weekday          6   1831086 5005756 120845
## look at performance and fit of the model
summary(lmStep)
## 
## Call:
## lm(formula = clientLogins ~ date + temperature + precipIntensity + 
##     hour + hourSq + weekday, data = logins[, .(date, temperature, 
##     precipIntensity, week, hour, hourSq, weekday, clientLogins)])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.13  -6.93  -0.26   6.68  53.37 
## 
## Coefficients:
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)      -2892.17497    73.89601  -39.14 < 0.0000000000000002 ***
## date                 0.18869     0.00480   39.34 < 0.0000000000000002 ***
## temperature          0.05512     0.00864    6.38        0.00000000018 ***
## precipIntensity    311.93465    13.37909   23.32 < 0.0000000000000002 ***
## hour                -1.55556     0.04799  -32.41 < 0.0000000000000002 ***
## hourSq               0.10875     0.00191   56.94 < 0.0000000000000002 ***
## weekdayMonday      -14.24294     0.33477  -42.55 < 0.0000000000000002 ***
## weekdaySaturday     12.54024     0.25391   49.39 < 0.0000000000000002 ***
## weekdaySunday       -2.64680     0.29228   -9.06 < 0.0000000000000002 ***
## weekdayThursday     -6.28293     0.28954  -21.70 < 0.0000000000000002 ***
## weekdayTuesday     -11.72478     0.33438  -35.06 < 0.0000000000000002 ***
## weekdayWednesday    -9.71557     0.31659  -30.69 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.9 on 22312 degrees of freedom
##   (123 observations deleted due to missingness)
## Multiple R-squared:  0.584,  Adjusted R-squared:  0.583 
## F-statistic: 2.84e+03 on 11 and 22312 DF,  p-value: <0.0000000000000002
sqrt(mean(lmStep$residuals^2))
## [1] 11.93
plot(lmStep)

## testing the reliability of the model
anova <- aov(lm(data = logins, formula = clientLogins ~ as.factor(time))) 
postHoc <- TukeyHSD(anova)
postHoc <- data.frame(postHoc$`as.factor(time)`)
length(which(postHoc$`p.adj`>0))
## [1] 33030
## report the share of date/hour comparisons that are significant at the 5% level
if (sum(postHoc$p.adj > 0.05) == 0) {
  print("All date/time combinations have statistically significant unique impacts on the number of client logins")
} else {
  paste0(round(sum(postHoc$p.adj <= 0.05)/nrow(postHoc), 2)*100, 
         "% of date/time combinations have statistically significant unique impacts on the number of client logins")
}

Making a line of better fit

Now that I’ve seen some weakness and potential bias in my lines, I wonder if I can do better. Given the concern about overfitting, I split the data into training and test sets. Using the same stepwise regression technique, I train a model on the training subset. When I evaluate it against the test set, I find much the same results, which suggests that my model was not badly overfit (though it is likely still biased due to omitted variables).

## split data
trainData <- floor(0.75 * nrow(logins))
set.seed(8675309)
trainData <- sample(seq_len(nrow(logins)), size = trainData)
train <- logins[trainData]
test <- logins[-trainData]
nrow(train) + nrow(test) == nrow(logins) # gut check
## [1] TRUE
## same model
lmFull <- lm(data = train[, .(date, temperature, precipIntensity, week, hour, hourSq, weekday, weekend, clientLogins)], 
             formula = clientLogins ~ .)
lmStepTrain <- stepAIC(lmFull, direction = "both", 
                  trace = T)
## Start:  AIC=82579
## clientLogins ~ date + temperature + precipIntensity + week + 
##     hour + hourSq + weekday + weekend
## 
## 
## Step:  AIC=82579
## clientLogins ~ date + temperature + precipIntensity + hour + 
##     hourSq + weekday + weekend
## 
##                   Df Sum of Sq     RSS   AIC
## <none>                         2322138 82579
## - temperature      1      4356 2326494 82609
## - weekend          1     25065 2347203 82757
## - precipIntensity  1     55138 2377276 82970
## - hour             1    129132 2451270 83483
## - date             1    163940 2486077 83719
## - hourSq           1    362628 2684766 85006
## - weekday          6    501623 2823761 85841
preds <- predict(lmStepTrain,newdata = test, interval = "prediction")

test <- cbind(test, preds)
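
As a quick check on the claim that the held-out fit matches the training fit, the out-of-sample RMSE can be computed directly from the point predictions (the fit column that predict() appends); a minimal sketch:

## out-of-sample RMSE on the held-out test set
sqrt(mean((test$clientLogins - test$fit)^2, na.rm = TRUE))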

Plotting and comparing

Now, we can plot this line of best fit. For comparison, I’ve added a linear model, a curve (using gam for smoothing), the model generated using the train and test sets (which includes the prediction intervals from fitting the model on the test data), and the final, most successful model (the stepwise regression fit on the full data set), which accounts for date, temperature, precipitation, weekday, hour, and \(hour^{2}\).
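ggplotRegression() below is a small plotting helper from my setup (not shown in this write-up); a minimal version, assumed to scatter the response against the model’s first predictor with the fitted line and adjusted \(r^{2}\) in the title, might look like:

library(ggplot2)

ggplotRegression <- function(fit) {
  # plot the response against the model's first predictor, overlay the lm fit,
  # and report the adjusted R^2 in the title
  ggplot(fit$model, aes(x = .data[[names(fit$model)[2]]],
                        y = .data[[names(fit$model)[1]]])) +
    geom_point() +
    stat_smooth(method = "lm") +
    labs(title = paste("Adj. R^2 =", signif(summary(fit)$adj.r.squared, 3)),
         x = names(fit$model)[2], y = names(fit$model)[1]) +
    theme_minimal()
}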

ggplotRegression(fit = lm1) + labs(title = "Model 1: Logins ~ Time")

# curve (gam with smoothing)
ggplot(data = logins, mapping = aes(x = time, y = clientLogins)) + 
  geom_smooth(se = T) + 
  geom_point()

# 3. Add prediction intervals using the model on the test
p <- ggplot(test, aes(time, clientLogins)) +
  geom_point() +
  stat_smooth(method = "lm") + theme_minimal()

p + geom_line(aes(y = mean(lwr, na.rm = T)), color = "red", linetype = "dashed") +
    geom_line(aes(y = mean(upr, na.rm = T)), color = "red", linetype = "dashed") + 
    labs(title = "Model 1: Logins ~ Time w/ Prediction Intervals")

# Plot 4: Our winning model, the stepwise
ggplotRegression(fit = lmStep)

Analyses by week

Immediately upon graphing the data, I saw that the line was skewed by week 18 because the data cuts off two days in (meaning there were far fewer logins for that week). For this reason, I omit it from the second graph. While I get a high RMSE here, I’m less concerned because I’m not intending to use this line in a predictive capacity, but rather to glean trends. Moreover, the model has a high \(r^{2}\), but the proportion of variance in client logins explained by week would surely drop if other variables were included in the model. In order to truly predict, I would need a larger sample size than I have (n = 9). The trend here is clear: as time progresses, more users are logging in. Therefore, I’m more interested in descriptive statistics, and the ANOVA below confirms that the weekly trend is significant.

## inspection of week 18
logins[, uniqueN(date), by = week]
##     week V1
##  1:    9  4
##  2:   10  7
##  3:   11  7
##  4:   12  7
##  5:   13  7
##  6:   14  7
##  7:   15  7
##  8:   16  7
##  9:   17  7
## 10:   18  2
max(logins[week==18]$date)
## [1] "2012-04-30"
min(logins[week==18]$date)
## [1] "2012-04-29"
## subset data
weeklyBreakdown <- logins[, .(clientLogin = .N),
                          by = week][order(week)]

## assess 
sqrt(mean(lm(data = weeklyBreakdown[week != 18], formula = clientLogin ~ week)$residuals^2))
## [1] 209.7
anova(lm(data = weeklyBreakdown[week != 18], formula = clientLogin ~ week))
## Analysis of Variance Table
## 
## Response: clientLogin
##           Df  Sum Sq Mean Sq F value  Pr(>F)    
## week       1 2338795 2338795    41.4 0.00036 ***
## Residuals  7  395754   56536                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## plot again
ggplotRegression(fit = lm(data = weeklyBreakdown[week != 18], formula = clientLogin ~ week))

Analysis by hour

Similarly to the weekly case, for hour I see a high \(r^{2}\), which is a good indicator, but also a high RMSE, which would generally be a bad sign if I were concerned about predictive capacity. However, the line fits the data well and shows a generally positive trend: the later it gets, the more client logins occur. The median login hour is around 5 PM, which shows that the app is most frequently used toward the middle-to-latter half of the day. The most common hour for clients to log in is 9 PM, which holds true across the months; by April, the rate of 9 PM logins has risen even further. I was curious to see the distribution of logins by hour, which is included in the boxplot. Because there was inconsistent variance across hours and the trend is non-linear, I once again employed an HSD post-hoc analysis to ensure the F-statistic was not overstating significance, and the post-hoc shows that all hours are significant.

hourlyBreakdown <- logins[, .(clientLogin = .N,
                              date = date,
                              weekend = weekend,
                              hourSq = hour^2),
                          by = hour][order(hour)]

## did patterns change by month?
head(hourlyBreakdown[, sum(clientLogin), by = .(hour, month(date))][order(V1,decreasing = T)])
##    hour month      V1
## 1:   21     4 1899534
## 2:   21     3 1681680
## 3:   18     4 1580306
## 4:   20     4 1577542
## 5:   19     4 1478629
## 6:   18     3 1409135
median(hourlyBreakdown$hour)
## [1] 17
boxplot(hourlyBreakdown$hour)

## assess the fit
sqrt(mean(lm(data = hourlyBreakdown, formula = clientLogin ~ hour + hourSq)$residuals^2))
## [1] 255.6
summary(lm(data = hourlyBreakdown, formula = clientLogin ~ hour + I(hourSq)))
## 
## Call:
## lm(formula = clientLogin ~ hour + I(hourSq), data = hourlyBreakdown)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -676.1 -147.4  -34.7  204.4  364.3 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 574.0242     5.6577  101.46 < 0.0000000000000002 ***
## hour         -7.2277     1.0091   -7.16     0.00000000000082 ***
## I(hourSq)     2.8418     0.0402   70.67 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256 on 22444 degrees of freedom
## Multiple R-squared:  0.738,  Adjusted R-squared:  0.738 
## F-statistic: 3.16e+04 on 2 and 22444 DF,  p-value: <0.0000000000000002
aovHour <- aov(lm(data = hourlyBreakdown, formula = clientLogin ~ as.factor(hour)))
postHocHour <- TukeyHSD(aovHour, "as.factor(hour)", ordered = T)
postHocHour <- data.frame(postHocHour$`as.factor(hour)`)
sum(postHocHour$p.adj > 0.05) # hour-to-hour comparisons that are NOT significant at the 5% level
## [1] 0
## plot
ggplotRegression(fit = lm(data = hourlyBreakdown, formula = clientLogin ~ hour + I(hour^2)))