Loading Library
library(tidyverse)
library(lubridate)
library(stats)
library(factoextra)
Loading in data and naming the dataset trips
trips <- read_csv('https://s3.amazonaws.com/notredame.analytics.data/trips.csv')
Getting a look at the data using the glimpse() and head() functions
glimpse(trips)
## Rows: 123
## Columns: 12
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Start <chr> "1/2/19 13:30", "1/2/19 13:30", "1/2/19 13:45", "1/2…
## $ End <chr> "1/2/19 13:30", "1/2/19 13:45", "1/2/19 14:00", "1/2…
## $ Duration <dbl> 360, 660, 180, 300, 240, 240, 1020, 420, 300, 420, 6…
## $ Distance <dbl> 1.0, 1.5, 0.7, 0.8, 0.7, 0.9, 3.7, 0.9, 0.8, 0.8, 1.…
## $ Fare <dbl> 6.25, 8.50, 5.00, 5.75, 5.25, 5.75, 13.50, 6.50, 5.5…
## $ Total <dbl> 6.25, 11.00, 7.50, 5.75, 7.75, 8.25, 16.00, 8.00, 8.…
## $ PaymentType <chr> "Cash", "Credit Card", "Credit Card", "Cash", "Credi…
## $ PickupLatitude <dbl> 41.89204, 41.88099, 41.89204, 41.89204, 41.89204, 41…
## $ PickupLongitude <dbl> -87.63186, -87.63275, -87.63186, -87.63186, -87.6318…
## $ DropoffLatitude <dbl> 41.86790, 41.89503, 41.88499, 41.87926, 41.88099, 41…
## $ DropoffLongitude <dbl> -87.64296, -87.61971, -87.62099, -87.64265, -87.6327…
head(trips)
## # A tibble: 6 x 12
## ID Start End Duration Distance Fare Total PaymentType PickupLatitude
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1/2/… 1/2/… 360 1 6.25 6.25 Cash 41.9
## 2 2 1/2/… 1/2/… 660 1.5 8.5 11 Credit Card 41.9
## 3 3 1/2/… 1/2/… 180 0.7 5 7.5 Credit Card 41.9
## 4 4 1/2/… 1/2/… 300 0.8 5.75 5.75 Cash 41.9
## 5 5 1/2/… 1/2/… 240 0.7 5.25 7.75 Credit Card 41.9
## 6 6 1/2/… 1/2/… 240 0.9 5.75 8.25 Credit Card 41.9
## # … with 3 more variables: PickupLongitude <dbl>, DropoffLatitude <dbl>,
## # DropoffLongitude <dbl>
Need to convert Start and End to Datetime data type using the parse_date_time() From the glimpse() and head() above I can see that that the format for both Start and End are Month-Date-Year-Hour-Minute therefore when I convert data type I will use order = “mdy HM”
trips <- trips %>%
mutate(Start = parse_date_time(Start,"mdy HM"),
End = parse_date_time(End,"mdy HM"))
glimpse(trips)
## Rows: 123
## Columns: 12
## $ ID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Start <dttm> 2019-01-02 13:30:00, 2019-01-02 13:30:00, 2019-01-0…
## $ End <dttm> 2019-01-02 13:30:00, 2019-01-02 13:45:00, 2019-01-0…
## $ Duration <dbl> 360, 660, 180, 300, 240, 240, 1020, 420, 300, 420, 6…
## $ Distance <dbl> 1.0, 1.5, 0.7, 0.8, 0.7, 0.9, 3.7, 0.9, 0.8, 0.8, 1.…
## $ Fare <dbl> 6.25, 8.50, 5.00, 5.75, 5.25, 5.75, 13.50, 6.50, 5.5…
## $ Total <dbl> 6.25, 11.00, 7.50, 5.75, 7.75, 8.25, 16.00, 8.00, 8.…
## $ PaymentType <chr> "Cash", "Credit Card", "Credit Card", "Cash", "Credi…
## $ PickupLatitude <dbl> 41.89204, 41.88099, 41.89204, 41.89204, 41.89204, 41…
## $ PickupLongitude <dbl> -87.63186, -87.63275, -87.63186, -87.63186, -87.6318…
## $ DropoffLatitude <dbl> 41.86790, 41.89503, 41.88499, 41.87926, 41.88099, 41…
## $ DropoffLongitude <dbl> -87.64296, -87.61971, -87.62099, -87.64265, -87.6327…
Creating a \(\underline{\small{\text{Distribution Visualization}}}\).
hist(hour(trips$Start),
main = 'Trips by Hour',
xlab = 'Hour')
From the Graph above, we can see that there is not an obvious concentration at and time of the day. However, the times stop around 4 PM
Creating a \(\underline{\small{\text{Relationship Visualization}}}\) to see the relationship between Fare and Distance
trips %>%
ggplot() +
geom_point(aes(x=Fare, y=Distance))
## Warning: Removed 9 rows containing missing values (geom_point).
We Can see from above there is a strong Positive relationship to the Fare and Distance
Creating another graph of the relationship above
trips %>%
ggplot() +
geom_point(aes(x=Fare, y=Duration))
## Warning: Removed 9 rows containing missing values (geom_point).
The relationship between Duration and Fare is less linear. There seems to be a positive strong relationship at first. Based on the graph above, during the first 15 minutes of the ride the fare increases in a strong linear relationship with Fare, after 15 minutes there is a much weaker postive linear relationship.
Getting a quick look at the values that are missing.
summary(trips)
## ID Start End
## Min. : 1.0 Min. :2019-01-02 09:30:00 Min. :2019-01-02 09:30:00
## 1st Qu.: 31.5 1st Qu.:2019-01-09 07:22:30 1st Qu.:2019-01-09 07:30:00
## Median : 62.0 Median :2019-01-14 15:45:00 Median :2019-01-14 15:45:00
## Mean : 62.0 Mean :2019-01-14 12:46:20 Mean :2019-01-14 12:58:10
## 3rd Qu.: 92.5 3rd Qu.:2019-01-22 10:52:30 3rd Qu.:2019-01-22 11:07:30
## Max. :123.0 Max. :2019-01-24 15:00:00 Max. :2019-01-24 15:45:00
##
## Duration Distance Fare Total
## Min. : 0 Min. : 0.30 Min. : 4.50 Min. : 4.500
## 1st Qu.: 360 1st Qu.: 0.80 1st Qu.: 6.00 1st Qu.: 7.750
## Median : 480 Median : 1.20 Median : 7.00 Median : 9.375
## Mean : 721 Mean : 3.68 Mean :12.82 Mean :16.276
## 3rd Qu.: 750 3rd Qu.: 1.90 3rd Qu.: 9.00 3rd Qu.:11.938
## Max. :2820 Max. :21.00 Max. :50.75 Max. :66.200
## NA's :9 NA's :9
## PaymentType PickupLatitude PickupLongitude DropoffLatitude
## Length:123 Min. :41.86 Min. :-87.91 Min. :41.79
## Class :character 1st Qu.:41.88 1st Qu.:-87.64 1st Qu.:41.88
## Mode :character Median :41.89 Median :-87.63 Median :41.88
## Mean :41.89 Mean :-87.66 Mean :41.89
## 3rd Qu.:41.89 3rd Qu.:-87.63 3rd Qu.:41.89
## Max. :41.98 Max. :-87.62 Max. :41.99
##
## DropoffLongitude
## Min. :-87.90
## 1st Qu.:-87.64
## Median :-87.63
## Mean :-87.65
## 3rd Qu.:-87.62
## Max. :-87.61
##
trips %>%
filter(is.na(Fare)| is.na(Total))
## # A tibble: 9 x 12
## ID Start End Duration Distance Fare Total
## <dbl> <dttm> <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 11 2019-01-02 12:15:00 2019-01-02 12:30:00 600 1.3 NA NA
## 2 34 2019-01-09 09:15:00 2019-01-09 09:15:00 300 0.4 NA NA
## 3 49 2019-01-10 08:15:00 2019-01-10 08:45:00 1680 17.6 NA NA
## 4 53 2019-01-14 13:30:00 2019-01-14 13:30:00 360 1.1 NA NA
## 5 77 2019-01-17 13:45:00 2019-01-17 14:00:00 660 1.7 NA NA
## 6 95 2019-01-22 15:45:00 2019-01-22 15:45:00 240 0.5 NA NA
## 7 97 2019-01-22 10:30:00 2019-01-22 10:45:00 540 1.3 NA NA
## 8 110 2019-01-23 07:30:00 2019-01-23 07:45:00 240 0.7 NA NA
## 9 120 2019-01-24 10:15:00 2019-01-24 10:15:00 300 0.8 NA NA
## # … with 5 more variables: PaymentType <chr>, PickupLatitude <dbl>,
## # PickupLongitude <dbl>, DropoffLatitude <dbl>, DropoffLongitude <dbl>
\(\huge{\text{Conversion Chart}}\)
| Average Speed (miles/hour) | Price per Mile ($/mile) | Price per Hour ($/hour) |
|---|---|---|
| 0-4 | 15.6 | 49.8 |
| 5-9 | 6.5 | 51.6 |
| 10-19 | 6 | 65 |
| 20-29 | 2.5 | 65 |
| 30-30 | 2.5 | 91 |
| 40+ | 2.4 | 117 |
Need to replace the NA Values in Fare and Total.
In order to achieve this I used nested ifelse() functions inside a mutate(). I first used an ifelse() to only perform the mutate() when Fare was NA. Within that first ifelse(), I then nested a series of ifelse() functions to determine the average miles per hour using Distance/hour (I created hour to be (Duration/60)/60). The final argument then returns Fare if none of the conditions in my ifelse() are true this will save Fare as either my new calculation or the original Fare amount.
To check that it worked I ran summary()
trips <- trips %>%
mutate(hour = (Duration/60)/60) %>%
mutate(Fare = ifelse(is.na(Fare),
ifelse(((Distance/hour >= 0) & (Distance/hour <5)),
((15.6 * Distance) + (49.8 * hour))/2,
ifelse(((Distance/hour >= 5) & (Distance/hour <10)),
((6.5 * Distance) + (51.6 * hour))/2,
ifelse(((Distance/hour >= 10) & (Distance/hour <20)),
((6 * Distance) + (65 * hour))/2,
ifelse(((Distance/hour >= 20) & (Distance/hour <30)),
((2.5 * Distance) + (65 * hour))/2,
ifelse(((Distance/hour >= 30) & (Distance/hour <40)),
((2.5 * Distance) + (91 * hour))/2,
ifelse((Distance/hour >= 40),
((2.4 * Distance) + (117 * hour))/2,0)))))),Fare),
Total = ifelse(is.na(Total),(Fare*.05)+Fare,Total)) %>%
select(-hour)
summary(trips)
## ID Start End
## Min. : 1.0 Min. :2019-01-02 09:30:00 Min. :2019-01-02 09:30:00
## 1st Qu.: 31.5 1st Qu.:2019-01-09 07:22:30 1st Qu.:2019-01-09 07:30:00
## Median : 62.0 Median :2019-01-14 15:45:00 Median :2019-01-14 15:45:00
## Mean : 62.0 Mean :2019-01-14 12:46:20 Mean :2019-01-14 12:58:10
## 3rd Qu.: 92.5 3rd Qu.:2019-01-22 10:52:30 3rd Qu.:2019-01-22 11:07:30
## Max. :123.0 Max. :2019-01-24 15:00:00 Max. :2019-01-24 15:45:00
## Duration Distance Fare Total
## Min. : 0 Min. : 0.30 Min. : 3.345 Min. : 3.512
## 1st Qu.: 360 1st Qu.: 0.80 1st Qu.: 5.875 1st Qu.: 7.375
## Median : 480 Median : 1.20 Median : 7.000 Median : 9.250
## Mean : 721 Mean : 3.68 Mean :12.644 Mean :15.890
## 3rd Qu.: 750 3rd Qu.: 1.90 3rd Qu.: 9.000 3rd Qu.:11.625
## Max. :2820 Max. :21.00 Max. :50.750 Max. :66.200
## PaymentType PickupLatitude PickupLongitude DropoffLatitude
## Length:123 Min. :41.86 Min. :-87.91 Min. :41.79
## Class :character 1st Qu.:41.88 1st Qu.:-87.64 1st Qu.:41.88
## Mode :character Median :41.89 Median :-87.63 Median :41.88
## Mean :41.89 Mean :-87.66 Mean :41.89
## 3rd Qu.:41.89 3rd Qu.:-87.63 3rd Qu.:41.89
## Max. :41.98 Max. :-87.62 Max. :41.99
## DropoffLongitude
## Min. :-87.90
## 1st Qu.:-87.64
## Median :-87.63
## Mean :-87.65
## 3rd Qu.:-87.62
## Max. :-87.61
Creating a subset of trips containing only PickupLatitude and PickupLongitude. I saved this new dataset to trips_scaled.
trips_scaled <- trips %>%
select(PickupLatitude,PickupLongitude)
Min-Max Normalization on the trips_scaled Dataset
using the formula:
\(V^1=\frac{V-min_f}{max_f-min_f}*(upper-lower)+lower\)
found on pg. 87 of \(Practical\space Machine\space Learning\space in\space R\)
trips_scaled<- trips_scaled %>%
mutate(
PickupLongitude =
(((PickupLongitude-min(PickupLongitude))/(max(PickupLongitude)-min(PickupLongitude))) *(5-1)+1),
PickupLatitude =
(((PickupLatitude-min(PickupLatitude))/(max(PickupLatitude)-min(PickupLatitude))) *(5-1)+1))
summary(trips_scaled)
## PickupLatitude PickupLongitude
## Min. :1.000 Min. :1.000
## 1st Qu.:1.716 1st Qu.:4.656
## Median :1.858 Median :4.792
## Mean :2.101 Mean :4.438
## 3rd Qu.:2.082 3rd Qu.:4.804
## Max. :5.000 Max. :5.000
Creating Kmeans Clustering based on the PickupLatitude and PickupLongitude Columns
us.
trips_kmeans <- kmeans(trips_scaled, centers=5, nstart = 25)
Creating visualization with the fviz_cluster()
fviz_cluster(trips_kmeans,
data = trips_scaled,
stand = FALSE,
ggtheme = theme_minimal())
Creating the Charts for cluster analysis
Elbow
fviz_nbclust(trips_scaled, kmeans, method = "wss")
Silhouette
fviz_nbclust(trips_scaled, kmeans, method = "silhouette")
Gap-Stat
fviz_nbclust(trips_scaled, kmeans, method = "gap_stat")
Based on the Graphs above 2 is the optimal amount of clusteres to use for our data.
Recreating Kmeans Clustering based on the PickupLatitude and PickupLongitude Columns
us.
trips_kmeans2 <- kmeans(trips_scaled, centers=2, nstart = 100)
Recreating visualization with the fviz_cluster()
fviz_cluster(trips_kmeans2,
data = trips_scaled,
stand = FALSE,
ggtheme = theme_minimal())
trips <- trips %>%
mutate(cluster = trips_kmeans2$cluster)
summarize clusters
trips %>%
group_by(cluster) %>%
mutate(Duration_minutes = Duration/60) %>%
summarize(Duration = mean(Duration_minutes),
Distance = mean(Distance),
Fare = mean(Fare),
Total = mean(Total),
count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 6
## cluster Duration Distance Fare Total count
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 35.9 16.8 42.5 56.4 11
## 2 2 9.67 2.39 9.71 11.9 112
summary(trips)
## ID Start End
## Min. : 1.0 Min. :2019-01-02 09:30:00 Min. :2019-01-02 09:30:00
## 1st Qu.: 31.5 1st Qu.:2019-01-09 07:22:30 1st Qu.:2019-01-09 07:30:00
## Median : 62.0 Median :2019-01-14 15:45:00 Median :2019-01-14 15:45:00
## Mean : 62.0 Mean :2019-01-14 12:46:20 Mean :2019-01-14 12:58:10
## 3rd Qu.: 92.5 3rd Qu.:2019-01-22 10:52:30 3rd Qu.:2019-01-22 11:07:30
## Max. :123.0 Max. :2019-01-24 15:00:00 Max. :2019-01-24 15:45:00
## Duration Distance Fare Total
## Min. : 0 Min. : 0.30 Min. : 3.345 Min. : 3.512
## 1st Qu.: 360 1st Qu.: 0.80 1st Qu.: 5.875 1st Qu.: 7.375
## Median : 480 Median : 1.20 Median : 7.000 Median : 9.250
## Mean : 721 Mean : 3.68 Mean :12.644 Mean :15.890
## 3rd Qu.: 750 3rd Qu.: 1.90 3rd Qu.: 9.000 3rd Qu.:11.625
## Max. :2820 Max. :21.00 Max. :50.750 Max. :66.200
## PaymentType PickupLatitude PickupLongitude DropoffLatitude
## Length:123 Min. :41.86 Min. :-87.91 Min. :41.79
## Class :character 1st Qu.:41.88 1st Qu.:-87.64 1st Qu.:41.88
## Mode :character Median :41.89 Median :-87.63 Median :41.88
## Mean :41.89 Mean :-87.66 Mean :41.89
## 3rd Qu.:41.89 3rd Qu.:-87.63 3rd Qu.:41.89
## Max. :41.98 Max. :-87.62 Max. :41.99
## DropoffLongitude cluster
## Min. :-87.90 Min. :1.000
## 1st Qu.:-87.64 1st Qu.:2.000
## Median :-87.63 Median :2.000
## Mean :-87.65 Mean :1.911
## 3rd Qu.:-87.62 3rd Qu.:2.000
## Max. :-87.61 Max. :2.000
If I were a cab driver I would choose cluster 2. All around there are better statistics for this cluster. There are far more trips to choose from, additionally the Duration is shorter, with the Distance being shorting. This means that the drivers can get a lot of short trips going short distances, this will maxamize their time for the day to get the most money.