For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
We are going to use the tidyverse, a collection of R packages designed for data science.
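If the tidyverse is not yet installed and attached, it can be loaded as sketched below (installing is a one-time step); the messages that follow were printed when the package was loaded.
# Install once if needed: install.packages("tidyverse")
library(tidyverse)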
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv(), head(), mean(), sub(), as.numeric()
mydata <- read_csv('Taxi_Trips_Sample.csv')
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Trip Seconds` = col_integer(),
## `Trip Miles` = col_double(),
## `Pickup Census Tract` = col_double(),
## `Dropoff Census Tract` = col_double(),
## `Pickup Community Area` = col_integer(),
## `Dropoff Community Area` = col_integer(),
## `Pickup Centroid Latitude` = col_double(),
## `Pickup Centroid Longitude` = col_double(),
## `Dropoff Centroid Latitude` = col_double(),
## `Dropoff Centroid Longitude` = col_double(),
## `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
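The parsing message above shows the column types readr guessed. They can be inspected with spec(), or overridden with col_types when reading; a sketch using the readr functions attached above:
# Inspect the full column specification that readr guessed
spec(mydata)
# Or force a type explicitly at read time, e.g.:
# mydata <- read_csv('Taxi_Trips_Sample.csv', col_types = cols(`Trip Seconds` = col_integer()))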
fare = mydata$Fare
tips = mydata$Tips
tolls = mydata$Tolls
extras = mydata$Extras
head(fare)
## [1] "$7.05" "$6.05" "$7.05" "$31.25" "$5.50" "$9.25"
head(tips)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
head(tolls)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
head(extras)
## [1] "$1.50" "$0.00" "$0.00" "$3.00" "$0.00" "$1.00"
** What happens when we try to use the mean() function? **
mean(fare)
## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA
## [1] NA
To resolve the error, check whether the feature’s data type is correct.
Notice the comma in some values, e.g. ‘1,234’.
Also note the data type and the dollar sign ‘$’ symbol.
To remove a character, in this case the comma in “1,234”, we substitute it with an empty string.
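A small caution: sub() replaces only the first match in each value, while gsub() replaces every match, which matters for values containing more than one comma; a quick illustration:
sub(pattern = ",", replacement = "", x = "1,234,567")    # only the first comma is removed
gsub(pattern = ",", replacement = "", x = "1,234,567")   # every comma is removed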
** substitute comma with “” **
# VARIABLE <- sub("," , "", VARIABLE)
fare = sub(pattern = ",", replacement = "", x = fare)
head(fare)
## [1] "$7.05" "$6.05" "$7.05" "$31.25" "$5.50" "$9.25"
tips = sub(pattern = ",", replacement = "", x = tips)
head(tips)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
tolls = sub(pattern = ",", replacement = "", x = tolls)
head(tolls)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
extras = sub(pattern = ",", replacement = "", x = extras)
head(extras)
## [1] "$1.50" "$0.00" "$0.00" "$3.00" "$0.00" "$1.00"
** substitute dollar sign with “” **
fare = sub(pattern = "\\$", replacement = "", x = fare)
head(fare)
## [1] "7.05" "6.05" "7.05" "31.25" "5.50" "9.25"
tips = sub(pattern = "\\$", replacement = "", x = tips)
head(tips)
## [1] "0.00" "0.00" "0.00" "0.00" "0.00" "0.00"
tolls = sub(pattern = "\\$", replacement = "", x = tolls)
head(tolls)
## [1] "0.00" "0.00" "0.00" "0.00" "0.00" "0.00"
extras = sub(pattern = "\\$", replacement = "", x = extras)
head(extras)
## [1] "1.50" "0.00" "0.00" "3.00" "0.00" "1.00"
** character to numeric - as.numeric() **
fare = as.numeric(fare)
tips = as.numeric(tips)
tolls = as.numeric(tolls)
extras = as.numeric(extras)
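As an aside, readr::parse_number() can drop the ‘$’ and ‘,’ and convert to numeric in a single step; a sketch applied to the raw Fare column (before it is overwritten below):
# One-step alternative to sub() + as.numeric()
fare_alt <- readr::parse_number(mydata$Fare)
head(fare_alt)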
** mean with NA removed **
# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
fare_mean = mean(fare, na.rm = TRUE)
fare_mean
## [1] 12.56912
tips_mean = mean(tips, na.rm = TRUE)
tips_mean
## [1] 1.249043
tolls_mean = mean(tolls, na.rm = TRUE)
tolls_mean
## [1] 0.007330247
extras_mean = mean(extras, na.rm = TRUE)
extras_mean
## [1] 0.7976405
** Assign the cleaned variables back to the dataset **
# mydata$VARIABLE <- VARIABLE
mydata$Fare = fare
mydata$Tips = tips
mydata$Tolls = tolls
mydata$Extras = extras
head(mydata)
## # A tibble: 6 x 24
## `Trip ID` `Taxi ID` `Trip Start Tim… `Trip End Times… `Trip Seconds`
## <chr> <chr> <chr> <chr> <int>
## 1 3e7d6d8cc… b47c583b142… 08/19/2014 02:4… 08/19/2014 02:4… 480
## 2 3e7d6e5c4… bc1c0381e3b… 09/23/2013 05:1… 09/23/2013 05:1… 420
## 3 3e7d6e69c… f529487ccf3… 02/16/2014 10:1… 02/16/2014 10:3… 420
## 4 3e7d6efe4… 0f831bff43d… 05/10/2013 08:0… 05/10/2013 08:4… 2340
## 5 3e7d6f001… e5274d6c103… 02/21/2016 07:1… 02/21/2016 07:1… 300
## 6 3e7d6f2a0… 329d9f0b72c… 12/10/2015 05:3… 12/10/2015 06:0… 1020
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## # Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## # Area` <int>, `Dropoff Community Area` <int>, Fare <dbl>, Tips <dbl>,
## # Tolls <dbl>, Extras <dbl>, `Trip Total` <chr>, `Payment Type` <chr>,
## # Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## # Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## # Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## # Location` <chr>, `Community Areas` <int>
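For reference, the whole cleaning step can also be written as one dplyr call on a freshly read copy of the data; a sketch using mutate_at(), which matches the dplyr version attached above (newer dplyr versions would use across()):
# Hypothetical one-call alternative: clean all four money columns at once
raw <- read_csv('Taxi_Trips_Sample.csv')
clean <- raw %>% mutate_at(vars(Fare, Tips, Tolls, Extras), readr::parse_number)
head(clean$Fare)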
** Save clean data **
write_csv(mydata, path = "data/mydata_clean.csv")
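Note that write_csv() will fail if the target folder does not exist; it can be created first, as in this sketch:
# Create the data/ folder if it is not already there
dir.create("data", showWarnings = FALSE)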
In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.
mydata = read.csv("data/mydata_clean.csv")
trip_seconds = mydata$Trip.Seconds
** calculate the average **
mean(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 739.2153
** calculate the standard deviation **
sd(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 1167.484
** calculate the min **
min(trip_seconds, na.rm = TRUE)
## [1] 0
** calculate the max **
max(trip_seconds, na.rm = TRUE)
## [1] 74340
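The same four statistics can also be computed in a single dplyr call; a sketch (the column name Trip.Seconds is how read.csv renders ‘Trip Seconds’):
mydata %>%
  summarise(mean_seconds = mean(Trip.Seconds, na.rm = TRUE),
            sd_seconds   = sd(Trip.Seconds, na.rm = TRUE),
            min_seconds  = min(Trip.Seconds, na.rm = TRUE),
            max_seconds  = max(Trip.Seconds, na.rm = TRUE))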
To find the outliers we are going to look at the upper and lower limits.
An outlier is a value that “lies outside” most of the other values in a set of data.
One method for finding the upper and lower thresholds uses the interquartile range (IQR).
** quantile calculation for the given feature **
quantile(trip_seconds, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 300 540 900 74340
** Lower and upper quartile calculation **
lowerq = quantile(trip_seconds, na.rm = TRUE)[2]
upperq = quantile(trip_seconds, na.rm = TRUE)[4]
** Interquartile range calculation **
iqr = upperq - lowerq
The thresholds are the boundaries that determine whether a value is an outlier.
If a value falls above the upper threshold or below the lower threshold, it is an outlier.
** Calculating the upper threshold **
upper_threshold = (iqr * 1.5) + upperq
** Calculating the lower threshold **
lower_threshold = lowerq - (iqr * 1.5)
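As a sanity check, base R’s IQR() and an explicit quantile() call give the same building blocks; a sketch:
IQR(trip_seconds, na.rm = TRUE)                              # should equal upperq - lowerq
quantile(trip_seconds, probs = c(0.25, 0.75), na.rm = TRUE)  # lowerq and upperq directly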
** Identify outliers **
# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE < lower_threshold][1:10]
trip_seconds[ trip_seconds > upper_threshold][1:10]
## [1] 2340 2220 3300 2520 2220 1860 1980 NA 3360 4500
trip_seconds[ trip_seconds < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
** Finding outlier records **
#mydata[ trip_seconds > upper_threshold, ][1:10]
#mydata[ trip_seconds < lower_threshold, ][1:10]
count( mydata[ trip_seconds > upper_threshold, ] )
## # A tibble: 1 x 1
## n
## <int>
## 1 8160
count( mydata[ trip_seconds < lower_threshold, ] )
## # A tibble: 1 x 1
## n
## <int>
## 1 1327
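Note that logical subsetting with NA comparisons keeps rows of NAs, so the counts above likely include the rows where Trip Seconds is missing. A sketch that counts only the non-missing outliers:
sum(trip_seconds > upper_threshold, na.rm = TRUE)
sum(trip_seconds < lower_threshold, na.rm = TRUE)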
It can also be useful to visualize the data using a box and whisker plot.
The boxplot is built on the IQR and also shows the upper and lower thresholds as its whiskers, with outliers drawn as individual points.
p = ggplot(data = mydata, aes(x = "", y = trip_seconds)) + geom_boxplot() + coord_flip()
p
## Warning: Removed 1327 rows containing non-finite values (stat_boxplot).
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released
knitr::include_graphics("data/img.png")
summary(mydata)
## Trip.ID
## 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed: 1
## 3e7d6e5c4e87f01a475c8200b33777e85497da89: 1
## 3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c: 1
## 3e7d6efe43222b0ebc698583916674c648dd4520: 1
## 3e7d6f001e9bcda8478a489cb53293d26328ac85: 1
## 3e7d6f2a03527d63dc01b95e829fdfdd706102da: 1
## (Other) :99993
## Taxi.ID
## aebf720288b80a8ee36860541db64951c696c749f1a392d312fa4d2a8cd3f95dfb0be580fda7eb63455f809a1be9b3acad19a3ca167073126d0350b50f30741a: 58
## 4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24: 57
## f737a9a31b07650672910268d7cceb9c06a379c0e75070c0dc0366db8132b06ba2800c5e63c5e56f821a591fc78a92c1c60fb5f48e01aa02e62ff10d18ececd0: 55
## 1158f25979ad78fd3dafc867a540ad761b65922c312e6170ccee63c3f14adea37317d3cf4e2053d2bdb1531d17670872e0411e496905ef9cb4821e0e96056139: 53
## 0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703: 52
## (Other) :99720
## NA's : 4
## Trip.Start.Timestamp Trip.End.Timestamp
## 07/25/2014 06:45:00 PM: 9 02/10/2014 10:30:00 AM: 9
## 02/05/2015 07:15:00 PM: 8 02/05/2015 07:45:00 PM: 8
## 02/27/2015 08:45:00 AM: 8 03/03/2014 06:45:00 PM: 8
## 04/25/2014 06:45:00 PM: 8 03/22/2014 08:15:00 PM: 8
## 09/18/2013 07:30:00 PM: 8 03/24/2016 07:30:00 PM: 8
## 03/15/2014 07:00:00 PM: 7 (Other) :99942
## (Other) :99951 NA's : 16
## Trip.Seconds Trip.Miles Pickup.Census.Tract
## Min. : 0.0 Min. : 0.000 Min. :1.703e+10
## 1st Qu.: 300.0 1st Qu.: 0.000 1st Qu.:1.703e+10
## Median : 540.0 Median : 0.900 Median :1.703e+10
## Mean : 739.2 Mean : 2.686 Mean :1.703e+10
## 3rd Qu.: 900.0 3rd Qu.: 2.400 3rd Qu.:1.703e+10
## Max. :74340.0 Max. :1830.000 Max. :1.703e+10
## NA's :1327 NA's :1 NA's :38042
## Dropoff.Census.Tract Pickup.Community.Area Dropoff.Community.Area
## Min. :1.703e+10 Min. : 1.00 Min. : 1.00
## 1st Qu.:1.703e+10 1st Qu.: 8.00 1st Qu.: 8.00
## Median :1.703e+10 Median : 8.00 Median :14.00
## Mean :1.703e+10 Mean :22.04 Mean :21.14
## 3rd Qu.:1.703e+10 3rd Qu.:32.00 3rd Qu.:32.00
## Max. :1.703e+10 Max. :77.00 Max. :77.00
## NA's :38775 NA's :15534 NA's :17532
## Fare Tips Tolls Extras
## Min. : 0.00 Min. : 0.000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 8.05 Median : 0.000 Median : 0.00000 Median : 0.000
## Mean : 12.57 Mean : 1.249 Mean : 0.00733 Mean : 0.798
## 3rd Qu.: 13.05 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.: 1.000
## Max. :9825.04 Max. :108.100 Max. :82.50000 Max. :3452.340
## NA's :1 NA's :1 NA's :1 NA's :1
## Trip.Total Payment.Type
## $7.25 : 2010 Cash :60760
## $6.25 : 1908 Credit Card:38322
## $3.25 : 1889 Dispute : 58
## $6.65 : 1762 No Charge : 622
## $8.25 : 1729 Pcard : 18
## (Other):90700 Prcard : 6
## NA's : 1 Unknown : 213
## Company Pickup.Centroid.Latitude
## Taxi Affiliation Services :29911 Min. :41.66
## Dispatch Taxi Affiliation : 9417 1st Qu.:41.88
## Blue Ribbon Taxi Association Inc. : 6766 Median :41.89
## Choice Taxi Association : 5185 Mean :41.90
## Chicago Elite Cab Corp. (Chicago Carriag: 5091 3rd Qu.:41.92
## (Other) : 8218 Max. :42.02
## NA's :35411 NA's :15533
## Pickup.Centroid.Longitude Pickup.Centroid.Location
## Min. :-87.91 POINT (-87.632746 41.880994): 8572
## 1st Qu.:-87.66 POINT (-87.620993 41.884987): 5034
## Median :-87.63 POINT (-87.633308 41.899602): 3850
## Mean :-87.66 POINT (-87.626215 41.892508): 3832
## 3rd Qu.:-87.63 POINT (-87.631864 41.892042): 3692
## Max. :-87.54 (Other) :59486
## NA's :15533 NA's :15533
## Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## Min. :41.67 Min. :-87.91
## 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.89 Median :-87.63
## Mean :41.90 Mean :-87.66
## 3rd Qu.:41.92 3rd Qu.:-87.63
## Max. :42.02 Max. :-87.53
## NA's :17376 NA's :17376
## Dropoff.Centroid..Location Community.Areas
## POINT (-87.632746 41.880994): 7644 Min. : 1.00
## POINT (-87.620993 41.884987): 4412 1st Qu.:37.00
## POINT (-87.626215 41.892508): 3073 Median :37.00
## POINT (-87.631864 41.892042): 3072 Mean :41.18
## POINT (-87.655998 41.944227): 2850 3rd Qu.:38.00
## (Other) :61572 Max. :77.00
## NA's :17376 NA's :15533
Integrity constraints apply to all of the money-related fields, since each of them contributes to the total cost: if the individual costs, summed together, exceeded or fell short of the recorded total, that would be a violation. Field integrity is also visible in that each type of value has the correct format, such as full words in the company name and currency values in the cost fields.
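A sketch of such a check, parsing the raw Trip Total string and comparing it with the sum of the money components (illustrative only; a small tolerance absorbs rounding):
# Hypothetical integrity check: Fare + Tips + Tolls + Extras should match Trip Total
trip_total <- readr::parse_number(as.character(mydata$Trip.Total))
component_sum <- mydata$Fare + mydata$Tips + mydata$Tolls + mydata$Extras
sum(abs(trip_total - component_sum) > 0.01, na.rm = TRUE)    # number of apparent violations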
knitr::include_graphics("data/img2.png")