For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
We are going to use tidyverse a collection of R packages designed for data science.
## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()
mydata = read.csv("data/Taxi_Trips_Sample.csv")
mydata = na.omit(mydata)
head(mydata)
## Trip.ID
## 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4 3e7d6efe43222b0ebc698583916674c648dd4520
## 5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8 3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
## Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8 5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1 8/19/14 14:45 8/19/14 14:45 480 0.15
## 2 9/23/13 17:15 9/23/13 17:15 420 0.00
## 4 5/10/13 20:00 5/10/13 20:45 2340 13.80
## 5 2/21/16 19:15 2/21/16 19:15 300 0.70
## 8 3/8/14 15:45 3/8/14 16:30 2220 13.30
## 14 5/5/14 9:30 5/5/14 9:45 900 0.10
## Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1 17031280100 17031839100 28
## 2 17031081800 17031281900 8
## 4 17031980000 17031060400 76
## 5 17031081500 17031081500 8
## 8 17031980100 17031242300 56
## 14 17031832600 17031081401 7
## Dropoff.Community.Area Fare Tips Tolls Extras Trip.Total
## 1 32 $7.05 $0.00 $0.00 $1.50 $8.55
## 2 28 $6.05 $0.00 $0.00 $0.00 $6.05
## 4 6 $31.25 $0.00 $0.00 $3.00 $34.25
## 5 8 $5.50 $0.00 $0.00 $0.00 $5.50
## 8 24 $30.45 $6.49 $0.00 $2.00 $38.94
## 14 8 $10.45 $0.00 $0.00 $0.00 $10.45
## Payment.Type Company Pickup.Centroid.Latitude
## 1 Cash 41.88530
## 2 Cash Taxi Affiliation Services 41.89322
## 4 Cash 41.97907
## 5 Cash 41.89251
## 8 Credit Card 41.78600
## 14 Cash Taxi Affiliation Services 41.91475
## Pickup.Centroid.Longitude Pickup.Centroid.Location
## 1 -87.64281 POINT (-87.642808 41.8853)
## 2 -87.63784 POINT (-87.637844 41.893216)
## 4 -87.90304 POINT (-87.90304 41.979071)
## 5 -87.62621 POINT (-87.626215 41.892508)
## 8 -87.75093 POINT (-87.750934 41.785999)
## 14 -87.65401 POINT (-87.654007 41.914747)
## Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1 41.88099 -87.63275
## 2 41.87926 -87.64265
## 4 41.95067 -87.66654
## 5 41.89251 -87.62621
## 8 41.89951 -87.67960
## 14 41.89503 -87.61971
## Dropoff.Centroid..Location Community.Areas
## 1 POINT (-87.632746 41.880994) 29
## 2 POINT (-87.642649 41.879255) 37
## 4 POINT (-87.666536 41.950673) 75
## 5 POINT (-87.626215 41.892508) 37
## 8 POINT (-87.6796 41.899507) 53
## 14 POINT (-87.619711 41.895033) 68
fare = mydata$Fare
tips = mydata$Tips
tolls = mydata$Tolls
extras = mydata$Extras
head(fare)
## [1] $7.05 $6.05 $31.25 $5.50 $30.45 $10.45
## 892 Levels: $0.00 $0.01 $0.03 $0.05 $0.10 $0.11 $0.28 ... $99.99
head(tips)
## [1] $0.00 $0.00 $0.00 $0.00 $6.49 $0.00
## 1081 Levels: $0.00 $0.01 $0.02 $0.03 $0.04 $0.05 $0.06 ... $9.98
head(tolls)
## [1] $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
## 25 Levels: $0.00 $0.60 $0.95 $1.00 $1.11 $1.20 $1.50 ... $82.50
head(extras)
## [1] $1.50 $0.00 $3.00 $0.00 $2.00 $0.00
## 183 Levels: $0.00 $0.01 $0.03 $0.08 $0.10 $0.11 $0.20 ... $9.50
** What happens when we try to use the function? **
mean(fare)
## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tips)
## Warning in mean.default(tips): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tolls)
## Warning in mean.default(tolls): argument is not numeric or logical:
## returning NA
## [1] NA
mean(extras)
## Warning in mean.default(extras): argument is not numeric or logical:
## returning NA
## [1] NA
To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.
** substitute comma with “” **
# VARIABLE <- sub("," , "", VARIABLE)
fare_clean = sub(pattern = ",", replacement = "", x = fare)
head(fare_clean)
## [1] "$7.05 " "$6.05 " "$31.25 " "$5.50 " "$30.45 " "$10.45 "
tips_clean = sub(pattern = ",", replacement = "", x = tips)
head(tips_clean)
## [1] "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$6.49 " "$0.00 "
tolls_clean = sub(pattern = ",", replacement = "", x = tolls)
head(tolls_clean)
## [1] "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$0.00 "
extras_clean = sub(pattern = ",", replacement = "", x = extras)
head(extras_clean)
## [1] "$1.50 " "$0.00 " "$3.00 " "$0.00 " "$2.00 " "$0.00 "
** substitute dollar sign with “” **
fare_clean = sub(pattern = "\\$", replacement = "", x = fare)
head(fare_clean)
## [1] "7.05 " "6.05 " "31.25 " "5.50 " "30.45 " "10.45 "
tips_clean = sub(pattern = "\\$", replacement = "", x = tips)
head(tips_clean)
## [1] "0.00 " "0.00 " "0.00 " "0.00 " "6.49 " "0.00 "
tolls_clean = sub(pattern = "\\$", replacement = "", x = tolls)
head(tolls_clean)
## [1] "0.00 " "0.00 " "0.00 " "0.00 " "0.00 " "0.00 "
extras_clean = sub(pattern = "\\$", replacement = "", x = extras)
head(extras_clean)
## [1] "1.50 " "0.00 " "3.00 " "0.00 " "2.00 " "0.00 "
** character to numeric - as.numeric()**
fare = as.numeric(fare_clean)
## Warning: NAs introduced by coercion
tips = as.numeric(tips_clean)
tolls = as.numeric(tolls_clean)
extras = as.numeric(extras_clean)
** mean with NA removed **
# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
fare_mean = mean(fare, na.rm = TRUE)
fare_mean
## [1] 10.75363
tips_mean = mean(tips, na.rm = TRUE)
tips_mean
## [1] 1.096804
tolls_mean = mean(tolls, na.rm = TRUE)
tolls_mean
## [1] 0.005630209
extras_mean = mean(extras, na.rm = TRUE)
extras_mean
## [1] 0.7317954
** Clean variables to dataset **
# mydata$VARIABLE <- VARIABLE
mydata$Fare = fare
mydata$Tips = tips
mydata$Tolls = tolls
mydata$Extras = extras
head(mydata)
## Trip.ID
## 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4 3e7d6efe43222b0ebc698583916674c648dd4520
## 5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8 3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
## Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8 5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1 8/19/14 14:45 8/19/14 14:45 480 0.15
## 2 9/23/13 17:15 9/23/13 17:15 420 0.00
## 4 5/10/13 20:00 5/10/13 20:45 2340 13.80
## 5 2/21/16 19:15 2/21/16 19:15 300 0.70
## 8 3/8/14 15:45 3/8/14 16:30 2220 13.30
## 14 5/5/14 9:30 5/5/14 9:45 900 0.10
## Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1 17031280100 17031839100 28
## 2 17031081800 17031281900 8
## 4 17031980000 17031060400 76
## 5 17031081500 17031081500 8
## 8 17031980100 17031242300 56
## 14 17031832600 17031081401 7
## Dropoff.Community.Area Fare Tips Tolls Extras Trip.Total Payment.Type
## 1 32 7.05 0.00 0 1.5 $8.55 Cash
## 2 28 6.05 0.00 0 0.0 $6.05 Cash
## 4 6 31.25 0.00 0 3.0 $34.25 Cash
## 5 8 5.50 0.00 0 0.0 $5.50 Cash
## 8 24 30.45 6.49 0 2.0 $38.94 Credit Card
## 14 8 10.45 0.00 0 0.0 $10.45 Cash
## Company Pickup.Centroid.Latitude
## 1 41.88530
## 2 Taxi Affiliation Services 41.89322
## 4 41.97907
## 5 41.89251
## 8 41.78600
## 14 Taxi Affiliation Services 41.91475
## Pickup.Centroid.Longitude Pickup.Centroid.Location
## 1 -87.64281 POINT (-87.642808 41.8853)
## 2 -87.63784 POINT (-87.637844 41.893216)
## 4 -87.90304 POINT (-87.90304 41.979071)
## 5 -87.62621 POINT (-87.626215 41.892508)
## 8 -87.75093 POINT (-87.750934 41.785999)
## 14 -87.65401 POINT (-87.654007 41.914747)
## Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1 41.88099 -87.63275
## 2 41.87926 -87.64265
## 4 41.95067 -87.66654
## 5 41.89251 -87.62621
## 8 41.89951 -87.67960
## 14 41.89503 -87.61971
## Dropoff.Centroid..Location Community.Areas
## 1 POINT (-87.632746 41.880994) 29
## 2 POINT (-87.642649 41.879255) 37
## 4 POINT (-87.666536 41.950673) 75
## 5 POINT (-87.626215 41.892508) 37
## 8 POINT (-87.6796 41.899507) 53
## 14 POINT (-87.619711 41.895033) 68
** Save clean data **
# write_csv(mydata, path = "data/mydata_clean.csv")
write.csv(mydata, file = "data/clean_Taxi_Trips_Sample.csv")
In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.
mydata = read.csv("data/clean_Taxi_Trips_Sample.csv")
trip_seconds = mydata$Trip.Seconds
head(mydata)
## X Trip.ID
## 1 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2 2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 3 4 3e7d6efe43222b0ebc698583916674c648dd4520
## 4 5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 5 8 3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 6 14 3e7d706fe60689dc646772c09e128e19ab022472
## Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 3 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 4 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 5 5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 6 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1 8/19/14 14:45 8/19/14 14:45 480 0.15
## 2 9/23/13 17:15 9/23/13 17:15 420 0.00
## 3 5/10/13 20:00 5/10/13 20:45 2340 13.80
## 4 2/21/16 19:15 2/21/16 19:15 300 0.70
## 5 3/8/14 15:45 3/8/14 16:30 2220 13.30
## 6 5/5/14 9:30 5/5/14 9:45 900 0.10
## Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1 17031280100 17031839100 28
## 2 17031081800 17031281900 8
## 3 17031980000 17031060400 76
## 4 17031081500 17031081500 8
## 5 17031980100 17031242300 56
## 6 17031832600 17031081401 7
## Dropoff.Community.Area Fare Tips Tolls Extras Trip.Total Payment.Type
## 1 32 7.05 0.00 0 1.5 $8.55 Cash
## 2 28 6.05 0.00 0 0.0 $6.05 Cash
## 3 6 31.25 0.00 0 3.0 $34.25 Cash
## 4 8 5.50 0.00 0 0.0 $5.50 Cash
## 5 24 30.45 6.49 0 2.0 $38.94 Credit Card
## 6 8 10.45 0.00 0 0.0 $10.45 Cash
## Company Pickup.Centroid.Latitude
## 1 41.88530
## 2 Taxi Affiliation Services 41.89322
## 3 41.97907
## 4 41.89251
## 5 41.78600
## 6 Taxi Affiliation Services 41.91475
## Pickup.Centroid.Longitude Pickup.Centroid.Location
## 1 -87.64281 POINT (-87.642808 41.8853)
## 2 -87.63784 POINT (-87.637844 41.893216)
## 3 -87.90304 POINT (-87.90304 41.979071)
## 4 -87.62621 POINT (-87.626215 41.892508)
## 5 -87.75093 POINT (-87.750934 41.785999)
## 6 -87.65401 POINT (-87.654007 41.914747)
## Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1 41.88099 -87.63275
## 2 41.87926 -87.64265
## 3 41.95067 -87.66654
## 4 41.89251 -87.62621
## 5 41.89951 -87.67960
## 6 41.89503 -87.61971
## Dropoff.Centroid..Location Community.Areas
## 1 POINT (-87.632746 41.880994) 29
## 2 POINT (-87.642649 41.879255) 37
## 3 POINT (-87.666536 41.950673) 75
## 4 POINT (-87.626215 41.892508) 37
## 5 POINT (-87.6796 41.899507) 53
## 6 POINT (-87.619711 41.895033) 68
** calculate the average **
mean(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 720.8911
** calculate the standard deviation **
sd(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 1005.201
** calculate the min **
min(trip_seconds, na.rm = TRUE)
## [1] 0
** calculate the max **
max(trip_seconds, na.rm = TRUE)
## [1] 74340
summary(mydata)
## X Trip.ID
## Min. : 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed: 1
## 1st Qu.:24772 3e7d6e5c4e87f01a475c8200b33777e85497da89: 1
## Median :49770 3e7d6efe43222b0ebc698583916674c648dd4520: 1
## Mean :49852 3e7d6f001e9bcda8478a489cb53293d26328ac85: 1
## 3rd Qu.:74846 3e7d6f35332ed1069a218b63cef47c3be42a26fb: 1
## Max. :99999 3e7d706fe60689dc646772c09e128e19ab022472: 1
## (Other) :60705
## Taxi.ID
## 4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24: 45
## 1b2865284986db761b656ee64ad35fa46df8f7da125dde86ddb79d798917d2a8ac1bc1e7865a340f797d7cc7ba275d9397e4ce03c88da96b088fab3088c07dc8: 38
## ad33dffdd6cd00795ea1a00a6a6db1a38482075d532b55e712741e9b4a2541375fcf642d01d35b51646d2a07b49376f167b5ddd1f7b0a5354afe07f514108365: 37
## 0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703: 36
## 21e6058fef096e0d45f7b9d47433974c9c4e1820c3e5bee61670db5e58eb32e58432f1b6cf09adf73fe8a36a25cfe880f2e797158b2c890ec52603c3635d759f: 36
## 2a29bcc02d98eb9e748a8427ffd773934047f77e20f70325b01f515ba28cb6e379194d8d65b7783c6791f390ea0d9833daf89a537be9be1db5b81e1db23a07c5: 36
## (Other) :60483
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds
## 2/27/15 8:45 : 7 7/25/14 18:45: 8 Min. : 0.0
## 2/5/15 19:15 : 7 2/28/14 20:45: 7 1st Qu.: 300.0
## 4/25/14 18:45 : 7 3/11/16 20:15: 7 Median : 540.0
## 5/17/14 23:30 : 7 10/6/13 1:00 : 6 Mean : 720.9
## 7/25/14 18:45 : 7 11/1/14 1:15 : 6 3rd Qu.: 840.0
## 10/15/14 11:00: 6 2/5/15 19:45 : 6 Max. :74340.0
## (Other) :60670 (Other) :60671
## Trip.Miles Pickup.Census.Tract Dropoff.Census.Tract
## Min. : 0.000 Min. :1.703e+10 Min. :1.703e+10
## 1st Qu.: 0.000 1st Qu.:1.703e+10 1st Qu.:1.703e+10
## Median : 0.900 Median :1.703e+10 Median :1.703e+10
## Mean : 2.478 Mean :1.703e+10 Mean :1.703e+10
## 3rd Qu.: 1.900 3rd Qu.:1.703e+10 3rd Qu.:1.703e+10
## Max. :1770.000 Max. :1.703e+10 Max. :1.703e+10
##
## Pickup.Community.Area Dropoff.Community.Area Fare
## Min. : 1.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 5.65
## Median : 8.00 Median : 8.00 Median : 7.25
## Mean :21.58 Mean :21.19 Mean : 10.75
## 3rd Qu.:32.00 3rd Qu.:32.00 3rd Qu.: 10.45
## Max. :77.00 Max. :77.00 Max. :162.44
## NA's :5
## Tips Tolls Extras Trip.Total
## Min. : 0.000 Min. : 0.00000 Min. : 0.0000 $7.25 : 1540
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.0000 $6.25 : 1455
## Median : 0.000 Median : 0.00000 Median : 0.0000 $6.65 : 1361
## Mean : 1.097 Mean : 0.00563 Mean : 0.7318 $7.05 : 1253
## 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.: 1.0000 $6.45 : 1251
## Max. :108.100 Max. :50.00000 Max. :213.0000 $7.45 : 1228
## NA's :1 NA's :1 NA's :1 (Other):52623
## Payment.Type Company
## Cash :39135 :20771
## Credit Card:21147 Taxi Affiliation Services :20403
## Dispute : 37 Dispatch Taxi Affiliation : 6685
## No Charge : 305 Blue Ribbon Taxi Association Inc.: 4589
## Pcard : 9 Choice Taxi Association : 3784
## Prcard : 3 Northwest Management LLC : 2236
## Unknown : 75 (Other) : 2243
## Pickup.Centroid.Latitude Pickup.Centroid.Longitude
## Min. :41.74 Min. :-87.90
## 1st Qu.:41.88 1st Qu.:-87.64
## Median :41.89 Median :-87.63
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.90 3rd Qu.:-87.62
## Max. :42.02 Max. :-87.58
##
## Pickup.Centroid.Location Dropoff.Centroid.Latitude
## POINT (-87.632746 41.880994): 8461 Min. :41.74
## POINT (-87.620993 41.884987): 4963 1st Qu.:41.88
## POINT (-87.626215 41.892508): 3787 Median :41.89
## POINT (-87.631864 41.892042): 3647 Mean :41.90
## POINT (-87.90304 41.979071) : 3016 3rd Qu.:41.90
## POINT (-87.642649 41.879255): 2780 Max. :42.02
## (Other) :34057
## Dropoff.Centroid.Longitude Dropoff.Centroid..Location
## Min. :-87.90 POINT (-87.632746 41.880994): 7602
## 1st Qu.:-87.64 POINT (-87.620993 41.884987): 4394
## Median :-87.63 POINT (-87.631864 41.892042): 3062
## Mean :-87.65 POINT (-87.626215 41.892508): 3057
## 3rd Qu.:-87.62 POINT (-87.642649 41.879255): 2696
## Max. :-87.58 POINT (-87.90304 41.979071) : 2435
## (Other) :37465
## Community.Areas
## Min. : 1.00
## 1st Qu.:37.00
## Median :37.00
## Mean :40.71
## 3rd Qu.:38.00
## Max. :76.00
##
To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.
** quantile calculation for the give feature**
quantile(trip_seconds, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 300 540 840 74340
** Lower and upper quantile calculation **
# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]
lowerq = quantile(trip_seconds, na.rm = TRUE)[2]
upperq = quantile(trip_seconds, na.rm = TRUE)[4]
Interquantile calculation
# iqr = upperq - lowerq
iqr = upperq - lowerq
The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.
** Calculation the upper threshold **
# upper_threshold = (iqr * 1.5) + upperq
upper_threshold = (iqr * 1.5) + upperq
** Calculation the lower threshold **
# lower_threshold = lowerq - (iqr * 1.5)
lower_threshold = lowerq - (iqr * 1.5)
** Identify outliers **
# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE > lower_threshold][1:10]
trip_seconds[ trip_seconds > upper_threshold][1:10]
## [1] 2340 2220 3300 1680 1860 1980 3360 4500 2760 1860
trip_seconds[ trip_seconds < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
** Finding outliers records **
# mydata[ VARIABLE > upper_threshold, ]
# mydata[ VARIABLE > lower_threshold, ]
# mydata[ trip_seconds > upper_threshold, ]
count( mydata[ trip_seconds > upper_threshold, ] )
## # A tibble: 1 x 1
## n
## <int>
## 1 4837
There are 4,837 outliers in this data set.
It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds
# p <- ggplot(data = Scoring, aes(x = "", y = Finrat)) + geom_boxplot() + coord_flip()
# p
p = ggplot(data = mydata, aes(x = "", y = trip_seconds)) + geom_boxplot() + coord_flip()
p
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released
knitr::include_graphics("imgs/taxidata.png")
The file size is 38.1MB. There are 24 columns and 10,000 rows of data. The data set contains both quantitative and categorical data types. Some of the fields include TaxiID, trip miles and payment type.
summary(mydata)
## X Trip.ID
## Min. : 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed: 1
## 1st Qu.:24772 3e7d6e5c4e87f01a475c8200b33777e85497da89: 1
## Median :49770 3e7d6efe43222b0ebc698583916674c648dd4520: 1
## Mean :49852 3e7d6f001e9bcda8478a489cb53293d26328ac85: 1
## 3rd Qu.:74846 3e7d6f35332ed1069a218b63cef47c3be42a26fb: 1
## Max. :99999 3e7d706fe60689dc646772c09e128e19ab022472: 1
## (Other) :60705
## Taxi.ID
## 4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24: 45
## 1b2865284986db761b656ee64ad35fa46df8f7da125dde86ddb79d798917d2a8ac1bc1e7865a340f797d7cc7ba275d9397e4ce03c88da96b088fab3088c07dc8: 38
## ad33dffdd6cd00795ea1a00a6a6db1a38482075d532b55e712741e9b4a2541375fcf642d01d35b51646d2a07b49376f167b5ddd1f7b0a5354afe07f514108365: 37
## 0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703: 36
## 21e6058fef096e0d45f7b9d47433974c9c4e1820c3e5bee61670db5e58eb32e58432f1b6cf09adf73fe8a36a25cfe880f2e797158b2c890ec52603c3635d759f: 36
## 2a29bcc02d98eb9e748a8427ffd773934047f77e20f70325b01f515ba28cb6e379194d8d65b7783c6791f390ea0d9833daf89a537be9be1db5b81e1db23a07c5: 36
## (Other) :60483
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds
## 2/27/15 8:45 : 7 7/25/14 18:45: 8 Min. : 0.0
## 2/5/15 19:15 : 7 2/28/14 20:45: 7 1st Qu.: 300.0
## 4/25/14 18:45 : 7 3/11/16 20:15: 7 Median : 540.0
## 5/17/14 23:30 : 7 10/6/13 1:00 : 6 Mean : 720.9
## 7/25/14 18:45 : 7 11/1/14 1:15 : 6 3rd Qu.: 840.0
## 10/15/14 11:00: 6 2/5/15 19:45 : 6 Max. :74340.0
## (Other) :60670 (Other) :60671
## Trip.Miles Pickup.Census.Tract Dropoff.Census.Tract
## Min. : 0.000 Min. :1.703e+10 Min. :1.703e+10
## 1st Qu.: 0.000 1st Qu.:1.703e+10 1st Qu.:1.703e+10
## Median : 0.900 Median :1.703e+10 Median :1.703e+10
## Mean : 2.478 Mean :1.703e+10 Mean :1.703e+10
## 3rd Qu.: 1.900 3rd Qu.:1.703e+10 3rd Qu.:1.703e+10
## Max. :1770.000 Max. :1.703e+10 Max. :1.703e+10
##
## Pickup.Community.Area Dropoff.Community.Area Fare
## Min. : 1.00 Min. : 1.00 Min. : 0.00
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.: 5.65
## Median : 8.00 Median : 8.00 Median : 7.25
## Mean :21.58 Mean :21.19 Mean : 10.75
## 3rd Qu.:32.00 3rd Qu.:32.00 3rd Qu.: 10.45
## Max. :77.00 Max. :77.00 Max. :162.44
## NA's :5
## Tips Tolls Extras Trip.Total
## Min. : 0.000 Min. : 0.00000 Min. : 0.0000 $7.25 : 1540
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.0000 $6.25 : 1455
## Median : 0.000 Median : 0.00000 Median : 0.0000 $6.65 : 1361
## Mean : 1.097 Mean : 0.00563 Mean : 0.7318 $7.05 : 1253
## 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.: 1.0000 $6.45 : 1251
## Max. :108.100 Max. :50.00000 Max. :213.0000 $7.45 : 1228
## NA's :1 NA's :1 NA's :1 (Other):52623
## Payment.Type Company
## Cash :39135 :20771
## Credit Card:21147 Taxi Affiliation Services :20403
## Dispute : 37 Dispatch Taxi Affiliation : 6685
## No Charge : 305 Blue Ribbon Taxi Association Inc.: 4589
## Pcard : 9 Choice Taxi Association : 3784
## Prcard : 3 Northwest Management LLC : 2236
## Unknown : 75 (Other) : 2243
## Pickup.Centroid.Latitude Pickup.Centroid.Longitude
## Min. :41.74 Min. :-87.90
## 1st Qu.:41.88 1st Qu.:-87.64
## Median :41.89 Median :-87.63
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.90 3rd Qu.:-87.62
## Max. :42.02 Max. :-87.58
##
## Pickup.Centroid.Location Dropoff.Centroid.Latitude
## POINT (-87.632746 41.880994): 8461 Min. :41.74
## POINT (-87.620993 41.884987): 4963 1st Qu.:41.88
## POINT (-87.626215 41.892508): 3787 Median :41.89
## POINT (-87.631864 41.892042): 3647 Mean :41.90
## POINT (-87.90304 41.979071) : 3016 3rd Qu.:41.90
## POINT (-87.642649 41.879255): 2780 Max. :42.02
## (Other) :34057
## Dropoff.Centroid.Longitude Dropoff.Centroid..Location
## Min. :-87.90 POINT (-87.632746 41.880994): 7602
## 1st Qu.:-87.64 POINT (-87.620993 41.884987): 4394
## Median :-87.63 POINT (-87.631864 41.892042): 3062
## Mean :-87.65 POINT (-87.626215 41.892508): 3057
## 3rd Qu.:-87.62 POINT (-87.642649 41.879255): 2696
## Max. :-87.58 POINT (-87.90304 41.979071) : 2435
## (Other) :37465
## Community.Areas
## Min. : 1.00
## 1st Qu.:37.00
## Median :37.00
## Mean :40.71
## 3rd Qu.:38.00
## Max. :76.00
##
There is relationship integrity in this data set between fare, tips, tolls, extras and trip total, because if any fare, tips, tools on their own or combined together exceed trip total then there would be a violation of relational busienss logic. This data set also has field integrity because each column has the appropriate values in it. For example, trip total contains a numerical value while payment type contains a non-defined value.
knitr::include_graphics("imgs/taxistarschema.png")