For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
We are going to use tidyverse a collection of R packages designed for data science.
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()
mydata <- read_csv(file="Taxi_Trips_Sample.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Trip Seconds` = col_integer(),
## `Trip Miles` = col_double(),
## `Pickup Census Tract` = col_double(),
## `Dropoff Census Tract` = col_double(),
## `Pickup Community Area` = col_integer(),
## `Dropoff Community Area` = col_integer(),
## `Pickup Centroid Latitude` = col_double(),
## `Pickup Centroid Longitude` = col_double(),
## `Dropoff Centroid Latitude` = col_double(),
## `Dropoff Centroid Longitude` = col_double(),
## `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
head(mydata)
## # A tibble: 6 x 24
## `Trip ID` `Taxi ID` `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
## <chr> <chr> <chr> <chr> <int>
## 1 3e7d6d8cc~ b47c583b142~ 08/19/2014 02:4~ 08/19/2014 02:4~ 480
## 2 3e7d6e5c4~ bc1c0381e3b~ 09/23/2013 05:1~ 09/23/2013 05:1~ 420
## 3 3e7d6e69c~ f529487ccf3~ 02/16/2014 10:1~ 02/16/2014 10:3~ 420
## 4 3e7d6efe4~ 0f831bff43d~ 05/10/2013 08:0~ 05/10/2013 08:4~ 2340
## 5 3e7d6f001~ e5274d6c103~ 02/21/2016 07:1~ 02/21/2016 07:1~ 300
## 6 3e7d6f2a0~ 329d9f0b72c~ 12/10/2015 05:3~ 12/10/2015 06:0~ 1020
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## # Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## # Area` <int>, `Dropoff Community Area` <int>, Fare <chr>, Tips <chr>,
## # Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment Type` <chr>,
## # Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## # Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## # Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## # Location` <chr>, `Community Areas` <int>
mydata <- na.omit(mydata)
Fare <- mydata$Fare
Tips <- mydata$Tips
Tolls <- mydata$Tolls
Extras <- mydata$Extras
** What happens when we try to use the function? **
mean(Fare)
## Warning in mean.default(Fare): argument is not numeric or logical:
## returning NA
## [1] NA
mean(Tips)
## Warning in mean.default(Tips): argument is not numeric or logical:
## returning NA
## [1] NA
mean(Tolls)
## Warning in mean.default(Tolls): argument is not numeric or logical:
## returning NA
## [1] NA
mean(Extras)
## Warning in mean.default(Extras): argument is not numeric or logical:
## returning NA
## [1] NA
To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.
** substitute comma with “” **
# VARIABLE <- sub("," , "", VARIABLE)
Fare <- sub(",", "", Fare)
Tips <- sub(",", "", Tips)
Tolls <- sub(",", "", Tolls)
Extras <- sub(",", "", Extras)
** substitute dollar sign with “” **
Fare <- sub("$", "", Fare)
Tips <- sub("$", "", Tips)
Tolls <- sub("$", "", Tolls)
Extras <- sub("$", "", Extras)
** character to numeric - as.numeric()**
Fare <- as.numeric(Fare)
## Warning: NAs introduced by coercion
Tips <- as.numeric(Tips)
## Warning: NAs introduced by coercion
Tolls <-as.numeric(Tolls)
## Warning: NAs introduced by coercion
Extras <- as.numeric(Extras)
## Warning: NAs introduced by coercion
** mean with NA removed **
# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
Fare_Mean <- mean(Fare, na.rm = TRUE)
Tips_Mean <- mean(Tips, na.rm = TRUE)
Tolls_Mean <- mean(Tolls, na.rm = TRUE)
Extras_Mean <- mean(Extras, na.rm = TRUE)
** Clean variables to dataset **
# mydata$VARIABLE <- VARIABLE
mydata$Fare <- Fare
mydata$Tips <- Tips
mydata$Tolls <- Tolls
mydata$Extras <- Extras
** Save clean data **
# write_csv(mydata, path = "data/mydata_clean.csv")
write_csv(mydata, path = "data/mydata_clean.csv")
In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.
mydata <- read_csv(file= "data/mydata_clean.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Trip Seconds` = col_integer(),
## `Trip Miles` = col_double(),
## `Pickup Census Tract` = col_double(),
## `Dropoff Census Tract` = col_double(),
## `Pickup Community Area` = col_integer(),
## `Dropoff Community Area` = col_integer(),
## `Pickup Centroid Latitude` = col_double(),
## `Pickup Centroid Longitude` = col_double(),
## `Dropoff Centroid Latitude` = col_double(),
## `Dropoff Centroid Longitude` = col_double(),
## `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
Trip_Seconds <- mydata$`Trip Seconds`
** calculate the average **
mean_TripSeconds <- mean(mydata$"Trip Seconds")
mean_TripSeconds
## [1] 709.3322
** calculate the standard deviation **
sd_TripSeconds <- sd(mydata$"Trip Seconds")
sd_TripSeconds
## [1] 659.8013
** calculate the min **
min_TripSeconds <- min(mydata$"Trip Seconds")
min_TripSeconds
## [1] 0
** calculate the max **
max_TripSeconds <- max(mydata$"Trip Seconds")
max_TripSeconds
## [1] 8940
To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.
** quantile calculation for the give feature**
quantile(Trip_Seconds, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 0 360 540 840 8940
** Lower and upper quantile calculation **
# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]
lowerq = quantile(Trip_Seconds)[2]
upperq = quantile(Trip_Seconds)[4]
lowerq
## 25%
## 360
upperq
## 75%
## 840
Interquantile calculation
# iqr = upperq - lowerq
iqr = upperq - lowerq
iqr
## 75%
## 480
The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.
** Calculation the upper threshold **
# upper_threshold = (iqr * 1.5) + upperq
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 1560
** Calculation the lower threshold **
# lower_threshold = lowerq - (iqr * 1.5)
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -360
** Identify outliers **
# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE > lower_threshold][1:10]
Trip_Seconds[Trip_Seconds > upper_threshold][1:10]
## [1] 1680 1860 1980 3360 4500 2760 1860 2700 2940 3300
Trip_Seconds[Trip_Seconds < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
** Finding outliers records **
# mydata[ VARIABLE > upper_threshold, ][1:10]
# mydata[ VARIABLE > lower_threshold, ][1:10]
mydata[Trip_Seconds > upper_threshold, ]
## # A tibble: 3,385 x 24
## `Trip ID` `Taxi ID` `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
## <chr> <chr> <chr> <chr> <int>
## 1 3e7d78b99~ b80ae25954~ 06/06/2014 03:1~ 06/06/2014 03:4~ 1680
## 2 3e7d851f8~ 259ec57ee4~ 09/18/2014 12:0~ 09/18/2014 12:3~ 1860
## 3 3e7d87103~ 78fb99d332~ 03/24/2013 04:1~ 03/24/2013 04:4~ 1980
## 4 3e7d8b799~ 74605d6aa0~ 07/23/2015 04:0~ 07/23/2015 04:4~ 3360
## 5 3e7d8f714~ 97c6416993~ 09/30/2015 04:4~ 09/30/2015 06:0~ 4500
## 6 3e7d91de6~ 0de60e5618~ 10/30/2015 08:0~ 10/30/2015 08:4~ 2760
## 7 3e7d96aa7~ b918186326~ 02/05/2014 09:3~ 02/05/2014 10:0~ 1860
## 8 3e7d99a42~ 5cfc2dce21~ 08/19/2015 03:4~ 08/19/2015 04:3~ 2700
## 9 3e7d9d2f5~ 24b87270f3~ 06/08/2016 04:4~ 06/08/2016 05:3~ 2940
## 10 3e7da54c9~ 8d13012158~ 02/14/2014 07:4~ 02/14/2014 08:4~ 3300
## # ... with 3,375 more rows, and 19 more variables: `Trip Miles` <dbl>,
## # `Pickup Census Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup
## # Community Area` <int>, `Dropoff Community Area` <int>, Fare <chr>,
## # Tips <chr>, Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment
## # Type` <chr>, Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup
## # Centroid Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff
## # Centroid Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff
## # Centroid Location` <chr>, `Community Areas` <int>
Yes there are outliers, they represent data above the upper and lower threshhold limits. There are 3385 outliers that serve to skew the data.
count(mydata[Trip_Seconds > upper_threshold, ])
## # A tibble: 1 x 1
## n
## <int>
## 1 3385
It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds
# p <- ggplot(data = Scoring, aes(x = "", y = Finrat)) + geom_boxplot() + coord_flip()
# p
p <- ggplot(data = mydata, aes(x = "", y = Trip_Seconds)) + geom_boxplot() + coord_flip()
p
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released
While the company, Uber was started in 2009, it is not until 2014 than the ridesharing industry began to seriously impact the taxi industry. It is as if it took several years for the public to trust the service.
The data consists of 24 columns and 100,000 rows of data. The excel file is 38, 7272 KB It is interesting to note that the median trip miles is only 540 seconds, or 9 minutes, a shorter range than one might think. Whereas the first quartile is only 6 minutes (360 seconds). Unique entities incluce Trip ID, Taxi ID, and Community Areas. Tip ID, Taxi ID, Trip Start Timestamp, Trip End TimeStamp, Fare, Trip Total, Payment Type, Company, Pickup Centroid Location and Dropoff Centroid Location are character values. The length of Trip ID, Trip Total, Payment Type, Company PickUp Centroid Locatoin,and Dropoff Centroid location is 39,939.
Relational logic includes requiring that the length of the trip is positive in number of seconds and miles. One cannot travel for negative distance or time. If the start time is after the end time or the number of miles is negative, then there may have been a mistake in the data entry. Field intengrity involves all value: Fare, Tips, Tools, Extras, and Trip Total are in dollars, and all values in Trip Miles are numeric and in miles.
knitr::include_graphics('erdplus-diagram.png')