Loading Library

library(tidyverse)
library(lubridate)
library(stats)
library(factoextra)

Part I: Collect & Explore Data

Question 1

Loading in data and naming the dataset trips

trips <- read_csv('https://s3.amazonaws.com/notredame.analytics.data/trips.csv')

Getting a look at the data using the glimpse() and head() functions

glimpse(trips)
## Rows: 123
## Columns: 12
## $ ID               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Start            <chr> "1/2/19 13:30", "1/2/19 13:30", "1/2/19 13:45", "1/2…
## $ End              <chr> "1/2/19 13:30", "1/2/19 13:45", "1/2/19 14:00", "1/2…
## $ Duration         <dbl> 360, 660, 180, 300, 240, 240, 1020, 420, 300, 420, 6…
## $ Distance         <dbl> 1.0, 1.5, 0.7, 0.8, 0.7, 0.9, 3.7, 0.9, 0.8, 0.8, 1.…
## $ Fare             <dbl> 6.25, 8.50, 5.00, 5.75, 5.25, 5.75, 13.50, 6.50, 5.5…
## $ Total            <dbl> 6.25, 11.00, 7.50, 5.75, 7.75, 8.25, 16.00, 8.00, 8.…
## $ PaymentType      <chr> "Cash", "Credit Card", "Credit Card", "Cash", "Credi…
## $ PickupLatitude   <dbl> 41.89204, 41.88099, 41.89204, 41.89204, 41.89204, 41…
## $ PickupLongitude  <dbl> -87.63186, -87.63275, -87.63186, -87.63186, -87.6318…
## $ DropoffLatitude  <dbl> 41.86790, 41.89503, 41.88499, 41.87926, 41.88099, 41…
## $ DropoffLongitude <dbl> -87.64296, -87.61971, -87.62099, -87.64265, -87.6327…
head(trips)
## # A tibble: 6 x 12
##      ID Start End   Duration Distance  Fare Total PaymentType PickupLatitude
##   <dbl> <chr> <chr>    <dbl>    <dbl> <dbl> <dbl> <chr>                <dbl>
## 1     1 1/2/… 1/2/…      360      1    6.25  6.25 Cash                  41.9
## 2     2 1/2/… 1/2/…      660      1.5  8.5  11    Credit Card           41.9
## 3     3 1/2/… 1/2/…      180      0.7  5     7.5  Credit Card           41.9
## 4     4 1/2/… 1/2/…      300      0.8  5.75  5.75 Cash                  41.9
## 5     5 1/2/… 1/2/…      240      0.7  5.25  7.75 Credit Card           41.9
## 6     6 1/2/… 1/2/…      240      0.9  5.75  8.25 Credit Card           41.9
## # … with 3 more variables: PickupLongitude <dbl>, DropoffLatitude <dbl>,
## #   DropoffLongitude <dbl>

Question 2

Need to convert Start and End to Datetime data type using the parse_date_time() From the glimpse() and head() above I can see that that the format for both Start and End are Month-Date-Year-Hour-Minute therefore when I convert data type I will use order = “mdy HM”

trips <- trips %>%
mutate(Start = parse_date_time(Start,"mdy HM"),
End = parse_date_time(End,"mdy HM")) 

glimpse(trips)
## Rows: 123
## Columns: 12
## $ ID               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Start            <dttm> 2019-01-02 13:30:00, 2019-01-02 13:30:00, 2019-01-0…
## $ End              <dttm> 2019-01-02 13:30:00, 2019-01-02 13:45:00, 2019-01-0…
## $ Duration         <dbl> 360, 660, 180, 300, 240, 240, 1020, 420, 300, 420, 6…
## $ Distance         <dbl> 1.0, 1.5, 0.7, 0.8, 0.7, 0.9, 3.7, 0.9, 0.8, 0.8, 1.…
## $ Fare             <dbl> 6.25, 8.50, 5.00, 5.75, 5.25, 5.75, 13.50, 6.50, 5.5…
## $ Total            <dbl> 6.25, 11.00, 7.50, 5.75, 7.75, 8.25, 16.00, 8.00, 8.…
## $ PaymentType      <chr> "Cash", "Credit Card", "Credit Card", "Cash", "Credi…
## $ PickupLatitude   <dbl> 41.89204, 41.88099, 41.89204, 41.89204, 41.89204, 41…
## $ PickupLongitude  <dbl> -87.63186, -87.63275, -87.63186, -87.63186, -87.6318…
## $ DropoffLatitude  <dbl> 41.86790, 41.89503, 41.88499, 41.87926, 41.88099, 41…
## $ DropoffLongitude <dbl> -87.64296, -87.61971, -87.62099, -87.64265, -87.6327…

Question 3

Creating a \(\underline{\small{\text{Distribution Visualization}}}\).

hist(hour(trips$Start), 
     main = 'Trips by Hour',
     xlab = 'Hour')

From the Graph above, we can see that there is not an obvious concentration at and time of the day. However, the times stop around 4 PM

Question 4

Creating a \(\underline{\small{\text{Relationship Visualization}}}\) to see the relationship between Fare and Distance

trips %>%
  ggplot() + 
  geom_point(aes(x=Fare, y=Distance))
## Warning: Removed 9 rows containing missing values (geom_point).

We Can see from above there is a strong Positive relationship to the Fare and Distance

Question 5

Creating another graph of the relationship above

trips %>%
  ggplot() + 
  geom_point(aes(x=Fare, y=Duration))
## Warning: Removed 9 rows containing missing values (geom_point).

The relationship between Duration and Fare is less linear. There seems to be a positive strong relationship at first. Based on the graph above, during the first 15 minutes of the ride the fare increases in a strong linear relationship with Fare, after 15 minutes there is a much weaker postive linear relationship.

Part II: Prepare the Data

Question 1

Getting a quick look at the values that are missing.

summary(trips)
##        ID            Start                          End                     
##  Min.   :  1.0   Min.   :2019-01-02 09:30:00   Min.   :2019-01-02 09:30:00  
##  1st Qu.: 31.5   1st Qu.:2019-01-09 07:22:30   1st Qu.:2019-01-09 07:30:00  
##  Median : 62.0   Median :2019-01-14 15:45:00   Median :2019-01-14 15:45:00  
##  Mean   : 62.0   Mean   :2019-01-14 12:46:20   Mean   :2019-01-14 12:58:10  
##  3rd Qu.: 92.5   3rd Qu.:2019-01-22 10:52:30   3rd Qu.:2019-01-22 11:07:30  
##  Max.   :123.0   Max.   :2019-01-24 15:00:00   Max.   :2019-01-24 15:45:00  
##                                                                             
##     Duration       Distance          Fare           Total       
##  Min.   :   0   Min.   : 0.30   Min.   : 4.50   Min.   : 4.500  
##  1st Qu.: 360   1st Qu.: 0.80   1st Qu.: 6.00   1st Qu.: 7.750  
##  Median : 480   Median : 1.20   Median : 7.00   Median : 9.375  
##  Mean   : 721   Mean   : 3.68   Mean   :12.82   Mean   :16.276  
##  3rd Qu.: 750   3rd Qu.: 1.90   3rd Qu.: 9.00   3rd Qu.:11.938  
##  Max.   :2820   Max.   :21.00   Max.   :50.75   Max.   :66.200  
##                                 NA's   :9       NA's   :9       
##  PaymentType        PickupLatitude  PickupLongitude  DropoffLatitude
##  Length:123         Min.   :41.86   Min.   :-87.91   Min.   :41.79  
##  Class :character   1st Qu.:41.88   1st Qu.:-87.64   1st Qu.:41.88  
##  Mode  :character   Median :41.89   Median :-87.63   Median :41.88  
##                     Mean   :41.89   Mean   :-87.66   Mean   :41.89  
##                     3rd Qu.:41.89   3rd Qu.:-87.63   3rd Qu.:41.89  
##                     Max.   :41.98   Max.   :-87.62   Max.   :41.99  
##                                                                     
##  DropoffLongitude
##  Min.   :-87.90  
##  1st Qu.:-87.64  
##  Median :-87.63  
##  Mean   :-87.65  
##  3rd Qu.:-87.62  
##  Max.   :-87.61  
## 
trips %>%
  filter(is.na(Fare)| is.na(Total))
## # A tibble: 9 x 12
##      ID Start               End                 Duration Distance  Fare Total
##   <dbl> <dttm>              <dttm>                 <dbl>    <dbl> <dbl> <dbl>
## 1    11 2019-01-02 12:15:00 2019-01-02 12:30:00      600      1.3    NA    NA
## 2    34 2019-01-09 09:15:00 2019-01-09 09:15:00      300      0.4    NA    NA
## 3    49 2019-01-10 08:15:00 2019-01-10 08:45:00     1680     17.6    NA    NA
## 4    53 2019-01-14 13:30:00 2019-01-14 13:30:00      360      1.1    NA    NA
## 5    77 2019-01-17 13:45:00 2019-01-17 14:00:00      660      1.7    NA    NA
## 6    95 2019-01-22 15:45:00 2019-01-22 15:45:00      240      0.5    NA    NA
## 7    97 2019-01-22 10:30:00 2019-01-22 10:45:00      540      1.3    NA    NA
## 8   110 2019-01-23 07:30:00 2019-01-23 07:45:00      240      0.7    NA    NA
## 9   120 2019-01-24 10:15:00 2019-01-24 10:15:00      300      0.8    NA    NA
## # … with 5 more variables: PaymentType <chr>, PickupLatitude <dbl>,
## #   PickupLongitude <dbl>, DropoffLatitude <dbl>, DropoffLongitude <dbl>

\(\huge{\text{Conversion Chart}}\)

Average Speed (miles/hour) Price per Mile ($/mile) Price per Hour ($/hour)
0-4 15.6 49.8
5-9 6.5 51.6
10-19 6 65
20-29 2.5 65
30-30 2.5 91
40+ 2.4 117

Need to replace the NA Values in Fare and Total.
In order to achieve this I used nested ifelse() functions inside a mutate(). I first used an ifelse() to only perform the mutate() when Fare was NA. Within that first ifelse(), I then nested a series of ifelse() functions to determine the average miles per hour using Distance/hour (I created hour to be (Duration/60)/60). The final argument then returns Fare if none of the conditions in my ifelse() are true this will save Fare as either my new calculation or the original Fare amount.

To check that it worked I ran summary()

trips <- trips %>%
  mutate(hour = (Duration/60)/60) %>%
  mutate(Fare = ifelse(is.na(Fare),
                       ifelse(((Distance/hour >= 0) & (Distance/hour <5)),
                        ((15.6 * Distance) + (49.8 * hour))/2,
                       ifelse(((Distance/hour >= 5) & (Distance/hour <10)), 
                         ((6.5 * Distance) + (51.6 * hour))/2,
                       ifelse(((Distance/hour >= 10) & (Distance/hour <20)),
                         ((6 * Distance) + (65 * hour))/2,
                       ifelse(((Distance/hour >= 20) & (Distance/hour <30)),
                         ((2.5 * Distance) + (65 * hour))/2,
                       ifelse(((Distance/hour >= 30) & (Distance/hour <40)),
                         ((2.5 * Distance) + (91 * hour))/2,
                       ifelse((Distance/hour >= 40),
                         ((2.4 * Distance) + (117 * hour))/2,0)))))),Fare),
         Total = ifelse(is.na(Total),(Fare*.05)+Fare,Total)) %>%
  select(-hour)

summary(trips)
##        ID            Start                          End                     
##  Min.   :  1.0   Min.   :2019-01-02 09:30:00   Min.   :2019-01-02 09:30:00  
##  1st Qu.: 31.5   1st Qu.:2019-01-09 07:22:30   1st Qu.:2019-01-09 07:30:00  
##  Median : 62.0   Median :2019-01-14 15:45:00   Median :2019-01-14 15:45:00  
##  Mean   : 62.0   Mean   :2019-01-14 12:46:20   Mean   :2019-01-14 12:58:10  
##  3rd Qu.: 92.5   3rd Qu.:2019-01-22 10:52:30   3rd Qu.:2019-01-22 11:07:30  
##  Max.   :123.0   Max.   :2019-01-24 15:00:00   Max.   :2019-01-24 15:45:00  
##     Duration       Distance          Fare            Total       
##  Min.   :   0   Min.   : 0.30   Min.   : 3.345   Min.   : 3.512  
##  1st Qu.: 360   1st Qu.: 0.80   1st Qu.: 5.875   1st Qu.: 7.375  
##  Median : 480   Median : 1.20   Median : 7.000   Median : 9.250  
##  Mean   : 721   Mean   : 3.68   Mean   :12.644   Mean   :15.890  
##  3rd Qu.: 750   3rd Qu.: 1.90   3rd Qu.: 9.000   3rd Qu.:11.625  
##  Max.   :2820   Max.   :21.00   Max.   :50.750   Max.   :66.200  
##  PaymentType        PickupLatitude  PickupLongitude  DropoffLatitude
##  Length:123         Min.   :41.86   Min.   :-87.91   Min.   :41.79  
##  Class :character   1st Qu.:41.88   1st Qu.:-87.64   1st Qu.:41.88  
##  Mode  :character   Median :41.89   Median :-87.63   Median :41.88  
##                     Mean   :41.89   Mean   :-87.66   Mean   :41.89  
##                     3rd Qu.:41.89   3rd Qu.:-87.63   3rd Qu.:41.89  
##                     Max.   :41.98   Max.   :-87.62   Max.   :41.99  
##  DropoffLongitude
##  Min.   :-87.90  
##  1st Qu.:-87.64  
##  Median :-87.63  
##  Mean   :-87.65  
##  3rd Qu.:-87.62  
##  Max.   :-87.61

Question 2

Creating a subset of trips containing only PickupLatitude and PickupLongitude. I saved this new dataset to trips_scaled.

trips_scaled <- trips %>%
  select(PickupLatitude,PickupLongitude)

Question 3

Min-Max Normalization on the trips_scaled Dataset

using the formula:
\(V^1=\frac{V-min_f}{max_f-min_f}*(upper-lower)+lower\)
found on pg. 87 of \(Practical\space Machine\space Learning\space in\space R\)

trips_scaled<- trips_scaled %>%
  mutate(
    PickupLongitude =
      (((PickupLongitude-min(PickupLongitude))/(max(PickupLongitude)-min(PickupLongitude))) *(5-1)+1),
    PickupLatitude = 
      (((PickupLatitude-min(PickupLatitude))/(max(PickupLatitude)-min(PickupLatitude))) *(5-1)+1))

summary(trips_scaled)
##  PickupLatitude  PickupLongitude
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.716   1st Qu.:4.656  
##  Median :1.858   Median :4.792  
##  Mean   :2.101   Mean   :4.438  
##  3rd Qu.:2.082   3rd Qu.:4.804  
##  Max.   :5.000   Max.   :5.000

Part III: Segment Customers

Question 1

Creating Kmeans Clustering based on the PickupLatitude and PickupLongitude Columns
us.

trips_kmeans <- kmeans(trips_scaled, centers=5, nstart = 25)

Question 2

Creating visualization with the fviz_cluster()

fviz_cluster(trips_kmeans,
             data = trips_scaled,
             stand = FALSE,
             ggtheme = theme_minimal())

Question 3

Creating the Charts for cluster analysis

Elbow

fviz_nbclust(trips_scaled, kmeans, method = "wss")

Silhouette

fviz_nbclust(trips_scaled, kmeans, method = "silhouette")

Gap-Stat

fviz_nbclust(trips_scaled, kmeans, method = "gap_stat")

Based on the Graphs above 2 is the optimal amount of clusteres to use for our data.

Question 4

Recreating Kmeans Clustering based on the PickupLatitude and PickupLongitude Columns
us.

trips_kmeans2 <- kmeans(trips_scaled, centers=2, nstart = 100)

Recreating visualization with the fviz_cluster()

fviz_cluster(trips_kmeans2,
             data = trips_scaled,
             stand = FALSE,
             ggtheme = theme_minimal())

Part IV: Interpret the Results

Question 1

trips <- trips %>%
  mutate(cluster = trips_kmeans2$cluster)

Question 2

summarize clusters

trips %>%
  group_by(cluster) %>%
  mutate(Duration_minutes = Duration/60) %>%
  summarize(Duration = mean(Duration_minutes), 
            Distance = mean(Distance), 
            Fare = mean(Fare), 
            Total = mean(Total),
            count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 6
##   cluster Duration Distance  Fare Total count
##     <int>    <dbl>    <dbl> <dbl> <dbl> <int>
## 1       1    35.9     16.8  42.5   56.4    11
## 2       2     9.67     2.39  9.71  11.9   112
summary(trips)
##        ID            Start                          End                     
##  Min.   :  1.0   Min.   :2019-01-02 09:30:00   Min.   :2019-01-02 09:30:00  
##  1st Qu.: 31.5   1st Qu.:2019-01-09 07:22:30   1st Qu.:2019-01-09 07:30:00  
##  Median : 62.0   Median :2019-01-14 15:45:00   Median :2019-01-14 15:45:00  
##  Mean   : 62.0   Mean   :2019-01-14 12:46:20   Mean   :2019-01-14 12:58:10  
##  3rd Qu.: 92.5   3rd Qu.:2019-01-22 10:52:30   3rd Qu.:2019-01-22 11:07:30  
##  Max.   :123.0   Max.   :2019-01-24 15:00:00   Max.   :2019-01-24 15:45:00  
##     Duration       Distance          Fare            Total       
##  Min.   :   0   Min.   : 0.30   Min.   : 3.345   Min.   : 3.512  
##  1st Qu.: 360   1st Qu.: 0.80   1st Qu.: 5.875   1st Qu.: 7.375  
##  Median : 480   Median : 1.20   Median : 7.000   Median : 9.250  
##  Mean   : 721   Mean   : 3.68   Mean   :12.644   Mean   :15.890  
##  3rd Qu.: 750   3rd Qu.: 1.90   3rd Qu.: 9.000   3rd Qu.:11.625  
##  Max.   :2820   Max.   :21.00   Max.   :50.750   Max.   :66.200  
##  PaymentType        PickupLatitude  PickupLongitude  DropoffLatitude
##  Length:123         Min.   :41.86   Min.   :-87.91   Min.   :41.79  
##  Class :character   1st Qu.:41.88   1st Qu.:-87.64   1st Qu.:41.88  
##  Mode  :character   Median :41.89   Median :-87.63   Median :41.88  
##                     Mean   :41.89   Mean   :-87.66   Mean   :41.89  
##                     3rd Qu.:41.89   3rd Qu.:-87.63   3rd Qu.:41.89  
##                     Max.   :41.98   Max.   :-87.62   Max.   :41.99  
##  DropoffLongitude    cluster     
##  Min.   :-87.90   Min.   :1.000  
##  1st Qu.:-87.64   1st Qu.:2.000  
##  Median :-87.63   Median :2.000  
##  Mean   :-87.65   Mean   :1.911  
##  3rd Qu.:-87.62   3rd Qu.:2.000  
##  Max.   :-87.61   Max.   :2.000

Question 3

If I were a cab driver I would choose cluster 2. All around there are better statistics for this cluster. There are far more trips to choose from, additionally the Duration is shorter, with the Distance being shorting. This means that the drivers can get a lot of short trips going short distances, this will maxamize their time for the day to get the most money.