Notebook Instructions


Load Packages in R/RStudio

We are going to use tidyverse a collection of R packages designed for data science.

## Loading required package: tidyverse
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Task 1: Data Prepartion and Cleaning


1A) Read the csv file into R Studio and display the dataset.

  • Name your dataset ‘mydata’ so it easy to work with.

  • Commands: read_csv() head() mean() sub() as.numeric()

Extract the assigned features (columns) to perform some analytics

mydata = read.csv("data/Taxi_Trips_Sample.csv")

mydata = na.omit(mydata)

head(mydata)
##                                     Trip.ID
## 1  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2  3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4  3e7d6efe43222b0ebc698583916674c648dd4520
## 5  3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8  3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
##                                                                                                                             Taxi.ID
## 1  b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2  bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4  0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5  e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8  5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
##    Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1         8/19/14 14:45      8/19/14 14:45          480       0.15
## 2         9/23/13 17:15      9/23/13 17:15          420       0.00
## 4         5/10/13 20:00      5/10/13 20:45         2340      13.80
## 5         2/21/16 19:15      2/21/16 19:15          300       0.70
## 8          3/8/14 15:45       3/8/14 16:30         2220      13.30
## 14          5/5/14 9:30        5/5/14 9:45          900       0.10
##    Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1          17031280100          17031839100                    28
## 2          17031081800          17031281900                     8
## 4          17031980000          17031060400                    76
## 5          17031081500          17031081500                     8
## 8          17031980100          17031242300                    56
## 14         17031832600          17031081401                     7
##    Dropoff.Community.Area    Fare   Tips  Tolls Extras Trip.Total
## 1                      32  $7.05  $0.00  $0.00  $1.50      $8.55 
## 2                      28  $6.05  $0.00  $0.00  $0.00      $6.05 
## 4                       6 $31.25  $0.00  $0.00  $3.00     $34.25 
## 5                       8  $5.50  $0.00  $0.00  $0.00      $5.50 
## 8                      24 $30.45  $6.49  $0.00  $2.00     $38.94 
## 14                      8 $10.45  $0.00  $0.00  $0.00     $10.45 
##    Payment.Type                   Company Pickup.Centroid.Latitude
## 1          Cash                                           41.88530
## 2          Cash Taxi Affiliation Services                 41.89322
## 4          Cash                                           41.97907
## 5          Cash                                           41.89251
## 8   Credit Card                                           41.78600
## 14         Cash Taxi Affiliation Services                 41.91475
##    Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                  -87.64281   POINT (-87.642808 41.8853)
## 2                  -87.63784 POINT (-87.637844 41.893216)
## 4                  -87.90304  POINT (-87.90304 41.979071)
## 5                  -87.62621 POINT (-87.626215 41.892508)
## 8                  -87.75093 POINT (-87.750934 41.785999)
## 14                 -87.65401 POINT (-87.654007 41.914747)
##    Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                   41.88099                  -87.63275
## 2                   41.87926                  -87.64265
## 4                   41.95067                  -87.66654
## 5                   41.89251                  -87.62621
## 8                   41.89951                  -87.67960
## 14                  41.89503                  -87.61971
##      Dropoff.Centroid..Location Community.Areas
## 1  POINT (-87.632746 41.880994)              29
## 2  POINT (-87.642649 41.879255)              37
## 4  POINT (-87.666536 41.950673)              75
## 5  POINT (-87.626215 41.892508)              37
## 8    POINT (-87.6796 41.899507)              53
## 14 POINT (-87.619711 41.895033)              68
fare = mydata$Fare
tips = mydata$Tips
tolls = mydata$Tolls
extras = mydata$Extras

head(fare)
## [1] $7.05   $6.05   $31.25  $5.50   $30.45  $10.45 
## 892 Levels:  $0.00  $0.01  $0.03  $0.05  $0.10  $0.11  $0.28  ... $99.99
head(tips)
## [1] $0.00  $0.00  $0.00  $0.00  $6.49  $0.00 
## 1081 Levels:  $0.00  $0.01  $0.02  $0.03  $0.04  $0.05  $0.06  ... $9.98
head(tolls)
## [1] $0.00  $0.00  $0.00  $0.00  $0.00  $0.00 
## 25 Levels:  $0.00  $0.60  $0.95  $1.00  $1.11  $1.20  $1.50  ... $82.50
head(extras)
## [1] $1.50  $0.00  $3.00  $0.00  $2.00  $0.00 
## 183 Levels:  $0.00  $0.01  $0.03  $0.08  $0.10  $0.11  $0.20  ... $9.50

Find the average for each assigned feature

** What happens when we try to use the function? **

mean(fare)
## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tips)
## Warning in mean.default(tips): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tolls)
## Warning in mean.default(tolls): argument is not numeric or logical:
## returning NA
## [1] NA
mean(extras)
## Warning in mean.default(extras): argument is not numeric or logical:
## returning NA
## [1] NA
  • To resolve the error, check if the feature data type is correct

  • Notice that comma ‘1,234’ in some values

  • Also the data type and dollar sign ‘$’ symbol

  • To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.

1B) Substitute the characters causing the issue with empty character (“”) on all features

** substitute comma with “” **

# VARIABLE <- sub("," , "", VARIABLE)

fare_clean = sub(pattern = ",", replacement = "", x = fare)
head(fare_clean)
## [1] "$7.05 "  "$6.05 "  "$31.25 " "$5.50 "  "$30.45 " "$10.45 "
tips_clean = sub(pattern = ",", replacement = "", x = tips)
head(tips_clean)
## [1] "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$6.49 " "$0.00 "
tolls_clean = sub(pattern = ",", replacement = "", x = tolls)
head(tolls_clean)
## [1] "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$0.00 " "$0.00 "
extras_clean = sub(pattern = ",", replacement = "", x = extras)
head(extras_clean)
## [1] "$1.50 " "$0.00 " "$3.00 " "$0.00 " "$2.00 " "$0.00 "

** substitute dollar sign with “” **

fare_clean = sub(pattern = "\\$", replacement = "", x = fare)
head(fare_clean)
## [1] "7.05 "  "6.05 "  "31.25 " "5.50 "  "30.45 " "10.45 "
tips_clean = sub(pattern = "\\$", replacement = "", x = tips)
head(tips_clean)
## [1] "0.00 " "0.00 " "0.00 " "0.00 " "6.49 " "0.00 "
tolls_clean = sub(pattern = "\\$", replacement = "", x = tolls)
head(tolls_clean)
## [1] "0.00 " "0.00 " "0.00 " "0.00 " "0.00 " "0.00 "
extras_clean = sub(pattern = "\\$", replacement = "", x = extras)
head(extras_clean)
## [1] "1.50 " "0.00 " "3.00 " "0.00 " "2.00 " "0.00 "

1C) Change the features data type from character to numeric

** character to numeric - as.numeric()**

fare = as.numeric(fare_clean)
## Warning: NAs introduced by coercion
tips = as.numeric(tips_clean)

tolls = as.numeric(tolls_clean)

extras = as.numeric(extras_clean)

** mean with NA removed **

# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)

fare_mean = mean(fare, na.rm = TRUE)
fare_mean
## [1] 10.75363
tips_mean = mean(tips, na.rm = TRUE)
tips_mean
## [1] 1.096804
tolls_mean = mean(tolls, na.rm = TRUE)
tolls_mean
## [1] 0.005630209
extras_mean = mean(extras, na.rm = TRUE)
extras_mean
## [1] 0.7317954

1D) Assign new clean variables to dataset and save your clean new data

** Clean variables to dataset **

# mydata$VARIABLE <- VARIABLE

mydata$Fare = fare


mydata$Tips = tips


mydata$Tolls = tolls


mydata$Extras = extras
head(mydata)
##                                     Trip.ID
## 1  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2  3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4  3e7d6efe43222b0ebc698583916674c648dd4520
## 5  3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8  3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
##                                                                                                                             Taxi.ID
## 1  b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2  bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4  0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5  e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8  5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
##    Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1         8/19/14 14:45      8/19/14 14:45          480       0.15
## 2         9/23/13 17:15      9/23/13 17:15          420       0.00
## 4         5/10/13 20:00      5/10/13 20:45         2340      13.80
## 5         2/21/16 19:15      2/21/16 19:15          300       0.70
## 8          3/8/14 15:45       3/8/14 16:30         2220      13.30
## 14          5/5/14 9:30        5/5/14 9:45          900       0.10
##    Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1          17031280100          17031839100                    28
## 2          17031081800          17031281900                     8
## 4          17031980000          17031060400                    76
## 5          17031081500          17031081500                     8
## 8          17031980100          17031242300                    56
## 14         17031832600          17031081401                     7
##    Dropoff.Community.Area  Fare Tips Tolls Extras Trip.Total Payment.Type
## 1                      32  7.05 0.00     0    1.5     $8.55          Cash
## 2                      28  6.05 0.00     0    0.0     $6.05          Cash
## 4                       6 31.25 0.00     0    3.0    $34.25          Cash
## 5                       8  5.50 0.00     0    0.0     $5.50          Cash
## 8                      24 30.45 6.49     0    2.0    $38.94   Credit Card
## 14                      8 10.45 0.00     0    0.0    $10.45          Cash
##                      Company Pickup.Centroid.Latitude
## 1                                            41.88530
## 2  Taxi Affiliation Services                 41.89322
## 4                                            41.97907
## 5                                            41.89251
## 8                                            41.78600
## 14 Taxi Affiliation Services                 41.91475
##    Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                  -87.64281   POINT (-87.642808 41.8853)
## 2                  -87.63784 POINT (-87.637844 41.893216)
## 4                  -87.90304  POINT (-87.90304 41.979071)
## 5                  -87.62621 POINT (-87.626215 41.892508)
## 8                  -87.75093 POINT (-87.750934 41.785999)
## 14                 -87.65401 POINT (-87.654007 41.914747)
##    Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                   41.88099                  -87.63275
## 2                   41.87926                  -87.64265
## 4                   41.95067                  -87.66654
## 5                   41.89251                  -87.62621
## 8                   41.89951                  -87.67960
## 14                  41.89503                  -87.61971
##      Dropoff.Centroid..Location Community.Areas
## 1  POINT (-87.632746 41.880994)              29
## 2  POINT (-87.642649 41.879255)              37
## 4  POINT (-87.666536 41.950673)              75
## 5  POINT (-87.626215 41.892508)              37
## 8    POINT (-87.6796 41.899507)              53
## 14 POINT (-87.619711 41.895033)              68

** Save clean data **

# write_csv(mydata, path = "data/mydata_clean.csv")

write.csv(mydata, file = "data/clean_Taxi_Trips_Sample.csv")

Task 2: Outlier Detection


In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

mydata = read.csv("data/clean_Taxi_Trips_Sample.csv")
trip_seconds = mydata$Trip.Seconds
head(mydata)
##    X                                  Trip.ID
## 1  1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2  2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 3  4 3e7d6efe43222b0ebc698583916674c648dd4520
## 4  5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 5  8 3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 6 14 3e7d706fe60689dc646772c09e128e19ab022472
##                                                                                                                            Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 3 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 4 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 5 5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 6 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
##   Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1        8/19/14 14:45      8/19/14 14:45          480       0.15
## 2        9/23/13 17:15      9/23/13 17:15          420       0.00
## 3        5/10/13 20:00      5/10/13 20:45         2340      13.80
## 4        2/21/16 19:15      2/21/16 19:15          300       0.70
## 5         3/8/14 15:45       3/8/14 16:30         2220      13.30
## 6          5/5/14 9:30        5/5/14 9:45          900       0.10
##   Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1         17031280100          17031839100                    28
## 2         17031081800          17031281900                     8
## 3         17031980000          17031060400                    76
## 4         17031081500          17031081500                     8
## 5         17031980100          17031242300                    56
## 6         17031832600          17031081401                     7
##   Dropoff.Community.Area  Fare Tips Tolls Extras Trip.Total Payment.Type
## 1                     32  7.05 0.00     0    1.5     $8.55          Cash
## 2                     28  6.05 0.00     0    0.0     $6.05          Cash
## 3                      6 31.25 0.00     0    3.0    $34.25          Cash
## 4                      8  5.50 0.00     0    0.0     $5.50          Cash
## 5                     24 30.45 6.49     0    2.0    $38.94   Credit Card
## 6                      8 10.45 0.00     0    0.0    $10.45          Cash
##                     Company Pickup.Centroid.Latitude
## 1                                           41.88530
## 2 Taxi Affiliation Services                 41.89322
## 3                                           41.97907
## 4                                           41.89251
## 5                                           41.78600
## 6 Taxi Affiliation Services                 41.91475
##   Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                 -87.64281   POINT (-87.642808 41.8853)
## 2                 -87.63784 POINT (-87.637844 41.893216)
## 3                 -87.90304  POINT (-87.90304 41.979071)
## 4                 -87.62621 POINT (-87.626215 41.892508)
## 5                 -87.75093 POINT (-87.750934 41.785999)
## 6                 -87.65401 POINT (-87.654007 41.914747)
##   Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                  41.88099                  -87.63275
## 2                  41.87926                  -87.64265
## 3                  41.95067                  -87.66654
## 4                  41.89251                  -87.62621
## 5                  41.89951                  -87.67960
## 6                  41.89503                  -87.61971
##     Dropoff.Centroid..Location Community.Areas
## 1 POINT (-87.632746 41.880994)              29
## 2 POINT (-87.642649 41.879255)              37
## 3 POINT (-87.666536 41.950673)              75
## 4 POINT (-87.626215 41.892508)              37
## 5   POINT (-87.6796 41.899507)              53
## 6 POINT (-87.619711 41.895033)              68

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

mean(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 720.8911

** calculate the standard deviation **

sd(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 1005.201

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

min(trip_seconds, na.rm = TRUE)
## [1] 0

** calculate the max **

max(trip_seconds, na.rm = TRUE)
## [1] 74340
summary(mydata)
##        X                                             Trip.ID     
##  Min.   :    1   3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed:    1  
##  1st Qu.:24772   3e7d6e5c4e87f01a475c8200b33777e85497da89:    1  
##  Median :49770   3e7d6efe43222b0ebc698583916674c648dd4520:    1  
##  Mean   :49852   3e7d6f001e9bcda8478a489cb53293d26328ac85:    1  
##  3rd Qu.:74846   3e7d6f35332ed1069a218b63cef47c3be42a26fb:    1  
##  Max.   :99999   3e7d706fe60689dc646772c09e128e19ab022472:    1  
##                  (Other)                                 :60705  
##                                                                                                                              Taxi.ID     
##  4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24:   45  
##  1b2865284986db761b656ee64ad35fa46df8f7da125dde86ddb79d798917d2a8ac1bc1e7865a340f797d7cc7ba275d9397e4ce03c88da96b088fab3088c07dc8:   38  
##  ad33dffdd6cd00795ea1a00a6a6db1a38482075d532b55e712741e9b4a2541375fcf642d01d35b51646d2a07b49376f167b5ddd1f7b0a5354afe07f514108365:   37  
##  0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703:   36  
##  21e6058fef096e0d45f7b9d47433974c9c4e1820c3e5bee61670db5e58eb32e58432f1b6cf09adf73fe8a36a25cfe880f2e797158b2c890ec52603c3635d759f:   36  
##  2a29bcc02d98eb9e748a8427ffd773934047f77e20f70325b01f515ba28cb6e379194d8d65b7783c6791f390ea0d9833daf89a537be9be1db5b81e1db23a07c5:   36  
##  (Other)                                                                                                                         :60483  
##      Trip.Start.Timestamp     Trip.End.Timestamp  Trip.Seconds    
##  2/27/15 8:45  :    7     7/25/14 18:45:    8    Min.   :    0.0  
##  2/5/15 19:15  :    7     2/28/14 20:45:    7    1st Qu.:  300.0  
##  4/25/14 18:45 :    7     3/11/16 20:15:    7    Median :  540.0  
##  5/17/14 23:30 :    7     10/6/13 1:00 :    6    Mean   :  720.9  
##  7/25/14 18:45 :    7     11/1/14 1:15 :    6    3rd Qu.:  840.0  
##  10/15/14 11:00:    6     2/5/15 19:45 :    6    Max.   :74340.0  
##  (Other)       :60670     (Other)      :60671                     
##    Trip.Miles       Pickup.Census.Tract Dropoff.Census.Tract
##  Min.   :   0.000   Min.   :1.703e+10   Min.   :1.703e+10   
##  1st Qu.:   0.000   1st Qu.:1.703e+10   1st Qu.:1.703e+10   
##  Median :   0.900   Median :1.703e+10   Median :1.703e+10   
##  Mean   :   2.478   Mean   :1.703e+10   Mean   :1.703e+10   
##  3rd Qu.:   1.900   3rd Qu.:1.703e+10   3rd Qu.:1.703e+10   
##  Max.   :1770.000   Max.   :1.703e+10   Max.   :1.703e+10   
##                                                             
##  Pickup.Community.Area Dropoff.Community.Area      Fare       
##  Min.   : 1.00         Min.   : 1.00          Min.   :  0.00  
##  1st Qu.: 8.00         1st Qu.: 8.00          1st Qu.:  5.65  
##  Median : 8.00         Median : 8.00          Median :  7.25  
##  Mean   :21.58         Mean   :21.19          Mean   : 10.75  
##  3rd Qu.:32.00         3rd Qu.:32.00          3rd Qu.: 10.45  
##  Max.   :77.00         Max.   :77.00          Max.   :162.44  
##                                               NA's   :5       
##       Tips             Tolls              Extras           Trip.Total   
##  Min.   :  0.000   Min.   : 0.00000   Min.   :  0.0000   $7.25  : 1540  
##  1st Qu.:  0.000   1st Qu.: 0.00000   1st Qu.:  0.0000   $6.25  : 1455  
##  Median :  0.000   Median : 0.00000   Median :  0.0000   $6.65  : 1361  
##  Mean   :  1.097   Mean   : 0.00563   Mean   :  0.7318   $7.05  : 1253  
##  3rd Qu.:  2.000   3rd Qu.: 0.00000   3rd Qu.:  1.0000   $6.45  : 1251  
##  Max.   :108.100   Max.   :50.00000   Max.   :213.0000   $7.45  : 1228  
##  NA's   :1         NA's   :1          NA's   :1          (Other):52623  
##       Payment.Type                                Company     
##  Cash       :39135                                    :20771  
##  Credit Card:21147   Taxi Affiliation Services        :20403  
##  Dispute    :   37   Dispatch Taxi Affiliation        : 6685  
##  No Charge  :  305   Blue Ribbon Taxi Association Inc.: 4589  
##  Pcard      :    9   Choice Taxi Association          : 3784  
##  Prcard     :    3   Northwest Management LLC         : 2236  
##  Unknown    :   75   (Other)                          : 2243  
##  Pickup.Centroid.Latitude Pickup.Centroid.Longitude
##  Min.   :41.74            Min.   :-87.90           
##  1st Qu.:41.88            1st Qu.:-87.64           
##  Median :41.89            Median :-87.63           
##  Mean   :41.90            Mean   :-87.65           
##  3rd Qu.:41.90            3rd Qu.:-87.62           
##  Max.   :42.02            Max.   :-87.58           
##                                                    
##                  Pickup.Centroid.Location Dropoff.Centroid.Latitude
##  POINT (-87.632746 41.880994): 8461       Min.   :41.74            
##  POINT (-87.620993 41.884987): 4963       1st Qu.:41.88            
##  POINT (-87.626215 41.892508): 3787       Median :41.89            
##  POINT (-87.631864 41.892042): 3647       Mean   :41.90            
##  POINT (-87.90304 41.979071) : 3016       3rd Qu.:41.90            
##  POINT (-87.642649 41.879255): 2780       Max.   :42.02            
##  (Other)                     :34057                                
##  Dropoff.Centroid.Longitude                Dropoff.Centroid..Location
##  Min.   :-87.90             POINT (-87.632746 41.880994): 7602       
##  1st Qu.:-87.64             POINT (-87.620993 41.884987): 4394       
##  Median :-87.63             POINT (-87.631864 41.892042): 3062       
##  Mean   :-87.65             POINT (-87.626215 41.892508): 3057       
##  3rd Qu.:-87.62             POINT (-87.642649 41.879255): 2696       
##  Max.   :-87.58             POINT (-87.90304 41.979071) : 2435       
##                             (Other)                     :37465       
##  Community.Areas
##  Min.   : 1.00  
##  1st Qu.:37.00  
##  Median :37.00  
##  Mean   :40.71  
##  3rd Qu.:38.00  
##  Max.   :76.00  
## 

2C) Quantile calculation to find outliers of the assigned feature (column)

  • To find the outliers we are going to look at the upper and lower limits

  • An outlier is value that “lies outside” most of the other values in a set of data.

  • A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the give feature**

quantile(trip_seconds, na.rm = TRUE)
##    0%   25%   50%   75%  100% 
##     0   300   540   840 74340

** Lower and upper quantile calculation **

# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]

lowerq = quantile(trip_seconds, na.rm = TRUE)[2]
upperq = quantile(trip_seconds, na.rm = TRUE)[4]

Interquantile calculation

# iqr = upperq - lowerq
iqr = upperq - lowerq

2D) Finding upper and lower thresholds

  • The threshold is the boundaries that determine if a value is an outlier.

  • If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculation the upper threshold **

# upper_threshold = (iqr * 1.5) + upperq 

upper_threshold = (iqr * 1.5) + upperq

** Calculation the lower threshold **

# lower_threshold = lowerq - (iqr * 1.5)

lower_threshold = lowerq - (iqr * 1.5)

** Identify outliers **

# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE > lower_threshold][1:10]

trip_seconds[ trip_seconds > upper_threshold][1:10]
##  [1] 2340 2220 3300 1680 1860 1980 3360 4500 2760 1860
trip_seconds[ trip_seconds < lower_threshold][1:10]
##  [1] NA NA NA NA NA NA NA NA NA NA

** Finding outliers records **

# mydata[ VARIABLE > upper_threshold, ]
# mydata[ VARIABLE > lower_threshold, ]

# mydata[ trip_seconds > upper_threshold, ]

Are there any outliers, if not explain the lack of outliers? if any explain what the outliers represent and how many records are outliers? Use the count() function to find the number of outliers.

count( mydata[ trip_seconds > upper_threshold, ] )
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4837

There are 4,837 outliers in this data set.

Plotting Outliers

  • It can also be useful to visualize the data using a box and whisker plot.

  • The boxplot supports the IQR also shows the upper and lower thresholds

# p <- ggplot(data = Scoring, aes(x = "", y = Finrat)) + geom_boxplot() + coord_flip()
# p 

p = ggplot(data = mydata, aes(x = "", y = trip_seconds)) + geom_boxplot() + coord_flip()
p


Task 3: Data Modeling - Chicago Taxi Data


3A) Go and explore the interactive dashboard. Take a screenshot of something that you find interesting in the dashboard and write a short summary of your findings

knitr::include_graphics("imgs/taxidata.png")

3B) Write the Metadata for the Taxi data (description), note the size of the file its dimension (number of rows and columns). Use the function summary() to help you identify unique entities, fields, number of rows and data types.

The file size is 38.1MB. There are 24 columns and 10,000 rows of data. The data set contains both quantitative and categorical data types. Some of the fields include TaxiID, trip miles and payment type.

summary(mydata)
##        X                                             Trip.ID     
##  Min.   :    1   3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed:    1  
##  1st Qu.:24772   3e7d6e5c4e87f01a475c8200b33777e85497da89:    1  
##  Median :49770   3e7d6efe43222b0ebc698583916674c648dd4520:    1  
##  Mean   :49852   3e7d6f001e9bcda8478a489cb53293d26328ac85:    1  
##  3rd Qu.:74846   3e7d6f35332ed1069a218b63cef47c3be42a26fb:    1  
##  Max.   :99999   3e7d706fe60689dc646772c09e128e19ab022472:    1  
##                  (Other)                                 :60705  
##                                                                                                                              Taxi.ID     
##  4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24:   45  
##  1b2865284986db761b656ee64ad35fa46df8f7da125dde86ddb79d798917d2a8ac1bc1e7865a340f797d7cc7ba275d9397e4ce03c88da96b088fab3088c07dc8:   38  
##  ad33dffdd6cd00795ea1a00a6a6db1a38482075d532b55e712741e9b4a2541375fcf642d01d35b51646d2a07b49376f167b5ddd1f7b0a5354afe07f514108365:   37  
##  0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703:   36  
##  21e6058fef096e0d45f7b9d47433974c9c4e1820c3e5bee61670db5e58eb32e58432f1b6cf09adf73fe8a36a25cfe880f2e797158b2c890ec52603c3635d759f:   36  
##  2a29bcc02d98eb9e748a8427ffd773934047f77e20f70325b01f515ba28cb6e379194d8d65b7783c6791f390ea0d9833daf89a537be9be1db5b81e1db23a07c5:   36  
##  (Other)                                                                                                                         :60483  
##      Trip.Start.Timestamp     Trip.End.Timestamp  Trip.Seconds    
##  2/27/15 8:45  :    7     7/25/14 18:45:    8    Min.   :    0.0  
##  2/5/15 19:15  :    7     2/28/14 20:45:    7    1st Qu.:  300.0  
##  4/25/14 18:45 :    7     3/11/16 20:15:    7    Median :  540.0  
##  5/17/14 23:30 :    7     10/6/13 1:00 :    6    Mean   :  720.9  
##  7/25/14 18:45 :    7     11/1/14 1:15 :    6    3rd Qu.:  840.0  
##  10/15/14 11:00:    6     2/5/15 19:45 :    6    Max.   :74340.0  
##  (Other)       :60670     (Other)      :60671                     
##    Trip.Miles       Pickup.Census.Tract Dropoff.Census.Tract
##  Min.   :   0.000   Min.   :1.703e+10   Min.   :1.703e+10   
##  1st Qu.:   0.000   1st Qu.:1.703e+10   1st Qu.:1.703e+10   
##  Median :   0.900   Median :1.703e+10   Median :1.703e+10   
##  Mean   :   2.478   Mean   :1.703e+10   Mean   :1.703e+10   
##  3rd Qu.:   1.900   3rd Qu.:1.703e+10   3rd Qu.:1.703e+10   
##  Max.   :1770.000   Max.   :1.703e+10   Max.   :1.703e+10   
##                                                             
##  Pickup.Community.Area Dropoff.Community.Area      Fare       
##  Min.   : 1.00         Min.   : 1.00          Min.   :  0.00  
##  1st Qu.: 8.00         1st Qu.: 8.00          1st Qu.:  5.65  
##  Median : 8.00         Median : 8.00          Median :  7.25  
##  Mean   :21.58         Mean   :21.19          Mean   : 10.75  
##  3rd Qu.:32.00         3rd Qu.:32.00          3rd Qu.: 10.45  
##  Max.   :77.00         Max.   :77.00          Max.   :162.44  
##                                               NA's   :5       
##       Tips             Tolls              Extras           Trip.Total   
##  Min.   :  0.000   Min.   : 0.00000   Min.   :  0.0000   $7.25  : 1540  
##  1st Qu.:  0.000   1st Qu.: 0.00000   1st Qu.:  0.0000   $6.25  : 1455  
##  Median :  0.000   Median : 0.00000   Median :  0.0000   $6.65  : 1361  
##  Mean   :  1.097   Mean   : 0.00563   Mean   :  0.7318   $7.05  : 1253  
##  3rd Qu.:  2.000   3rd Qu.: 0.00000   3rd Qu.:  1.0000   $6.45  : 1251  
##  Max.   :108.100   Max.   :50.00000   Max.   :213.0000   $7.45  : 1228  
##  NA's   :1         NA's   :1          NA's   :1          (Other):52623  
##       Payment.Type                                Company     
##  Cash       :39135                                    :20771  
##  Credit Card:21147   Taxi Affiliation Services        :20403  
##  Dispute    :   37   Dispatch Taxi Affiliation        : 6685  
##  No Charge  :  305   Blue Ribbon Taxi Association Inc.: 4589  
##  Pcard      :    9   Choice Taxi Association          : 3784  
##  Prcard     :    3   Northwest Management LLC         : 2236  
##  Unknown    :   75   (Other)                          : 2243  
##  Pickup.Centroid.Latitude Pickup.Centroid.Longitude
##  Min.   :41.74            Min.   :-87.90           
##  1st Qu.:41.88            1st Qu.:-87.64           
##  Median :41.89            Median :-87.63           
##  Mean   :41.90            Mean   :-87.65           
##  3rd Qu.:41.90            3rd Qu.:-87.62           
##  Max.   :42.02            Max.   :-87.58           
##                                                    
##                  Pickup.Centroid.Location Dropoff.Centroid.Latitude
##  POINT (-87.632746 41.880994): 8461       Min.   :41.74            
##  POINT (-87.620993 41.884987): 4963       1st Qu.:41.88            
##  POINT (-87.626215 41.892508): 3787       Median :41.89            
##  POINT (-87.631864 41.892042): 3647       Mean   :41.90            
##  POINT (-87.90304 41.979071) : 3016       3rd Qu.:41.90            
##  POINT (-87.642649 41.879255): 2780       Max.   :42.02            
##  (Other)                     :34057                                
##  Dropoff.Centroid.Longitude                Dropoff.Centroid..Location
##  Min.   :-87.90             POINT (-87.632746 41.880994): 7602       
##  1st Qu.:-87.64             POINT (-87.620993 41.884987): 4394       
##  Median :-87.63             POINT (-87.631864 41.892042): 3062       
##  Mean   :-87.65             POINT (-87.626215 41.892508): 3057       
##  3rd Qu.:-87.62             POINT (-87.642649 41.879255): 2696       
##  Max.   :-87.58             POINT (-87.90304 41.979071) : 2435       
##                             (Other)                     :37465       
##  Community.Areas
##  Min.   : 1.00  
##  1st Qu.:37.00  
##  Median :37.00  
##  Mean   :40.71  
##  3rd Qu.:38.00  
##  Max.   :76.00  
## 

3C) Write a description of any relational business logic and field integrity

There is relationship integrity in this data set between fare, tips, tolls, extras and trip total, because if any fare, tips, tools on their own or combined together exceed trip total then there would be a violation of relational busienss logic. This data set also has field integrity because each column has the appropriate values in it. For example, trip total contains a numerical value while payment type contains a non-defined value.

3D) Draw a star schema of the Chicago Taxi Data

knitr::include_graphics("imgs/taxistarschema.png")