For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
We are going to use tidyverse a collection of R packages designed for data science.
## Loading required package: tidyverse
## -- Attaching packages ------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()
mydata = read_csv("data/Taxi_Trips_Sample.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Trip Seconds` = col_integer(),
## `Trip Miles` = col_double(),
## `Pickup Census Tract` = col_double(),
## `Dropoff Census Tract` = col_double(),
## `Pickup Community Area` = col_integer(),
## `Dropoff Community Area` = col_integer(),
## `Pickup Centroid Latitude` = col_double(),
## `Pickup Centroid Longitude` = col_double(),
## `Dropoff Centroid Latitude` = col_double(),
## `Dropoff Centroid Longitude` = col_double(),
## `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
mydata = na.omit(mydata)
head(mydata)
## # A tibble: 6 x 24
## `Trip ID` `Taxi ID` `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
## <chr> <chr> <chr> <chr> <int>
## 1 3e7d6e5c4~ bc1c0381e3b~ 09/23/2013 05:1~ 09/23/2013 05:1~ 420
## 2 3e7d706fe~ 631de3a4ad3~ 05/05/2014 09:3~ 05/05/2014 09:4~ 900
## 3 3e7d70d4a~ 623c452c15d~ 03/26/2016 01:3~ 03/26/2016 01:3~ 300
## 4 3e7d7115e~ 288345b3a02~ 05/13/2015 10:3~ 05/13/2015 10:4~ 540
## 5 3e7d712c0~ dd9b35afedd~ 06/17/2016 04:0~ 06/17/2016 04:3~ 1140
## 6 3e7d71a6e~ 19a9ad5b3b7~ 03/15/2013 10:1~ 03/15/2013 10:4~ 1200
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## # Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## # Area` <int>, `Dropoff Community Area` <int>, Fare <chr>, Tips <chr>,
## # Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment Type` <chr>,
## # Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## # Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## # Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## # Location` <chr>, `Community Areas` <int>
#assigning and viewing variables
fare = mydata$Fare
head(fare)
## [1] "$6.05" "$10.45" "$6.25" "$7.25" "$13.25" "$35.25"
tips = mydata$Tips
head(tips)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$2.00" "$11.15"
tolls = mydata$Tolls
head(tolls)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
extras = mydata$Extras
head(extras)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$2.00"
** What happens when we try to use the function? **
mean(fare)
## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tips)
## Warning in mean.default(tips): argument is not numeric or logical:
## returning NA
## [1] NA
mean(tolls)
## Warning in mean.default(tolls): argument is not numeric or logical:
## returning NA
## [1] NA
mean(extras)
## Warning in mean.default(extras): argument is not numeric or logical:
## returning NA
## [1] NA
#non-numeric values do not allow for mean calculation. Characters in vars gets in the way
To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.
** substitute comma with “” **
# VARIABLE <- sub("," , "", VARIABLE)
#no commas in variables
** substitute dollar sign with “” **
# use \\ because of special characters
fare <- sub("\\$" , "", fare)
tips <- sub("\\$" , "", tips)
tolls <- sub("\\$" , "", tolls)
extras <- sub("\\$" , "", extras)
head(fare)
## [1] "6.05" "10.45" "6.25" "7.25" "13.25" "35.25"
** character to numeric - as.numeric()**
fare <- as.numeric(fare)
tips <- as.numeric(tips)
tolls <- as.numeric(tolls)
extras <- as.numeric(extras)
** mean with NA removed **
# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
fare_mean = mean(fare)
fare_mean
## [1] 10.73069
tips_mean = mean(tips)
tips_mean
## [1] 1.105968
tolls_mean = mean(tolls)
tolls_mean
## [1] 0.0003457773
extras_mean = mean(extras)
extras_mean
## [1] 0.7472891
** Clean variables to dataset **
# mydata$VARIABLE <- VARIABLE
mydata$Fare <- fare
mydata$Tips <- tips
mydata$Tolls <- tolls
mydata$Extras <- extras
** Save clean data **
# write_csv(mydata, path = "data/mydata_clean.csv")
write_csv (mydata, path = "data/mydata_clean.csv")
In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.
#reread clean data to be safe.
mydata <- read_csv("data/mydata_clean.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Trip ID` = col_character(),
## `Taxi ID` = col_character(),
## `Trip Start Timestamp` = col_character(),
## `Trip End Timestamp` = col_character(),
## `Trip Seconds` = col_integer(),
## `Pickup Community Area` = col_integer(),
## `Dropoff Community Area` = col_integer(),
## `Trip Total` = col_character(),
## `Payment Type` = col_character(),
## Company = col_character(),
## `Pickup Centroid Location` = col_character(),
## `Dropoff Centroid Location` = col_character(),
## `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
duration = mydata$`Trip Seconds`
head(duration)
## [1] 420 900 300 540 1140 1200
** calculate the average **
duration_mean = mean(duration)
duration_mean
## [1] 709.3322
** calculate the standard deviation **
duration_sd = sd(duration)
duration_sd
## [1] 659.8013
** calculate the min **
min_duration = min(duration)
min_duration
## [1] 0
** calculate the max **
max_duration = max(duration)
max_duration
## [1] 8940
To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.
** quantile calculation for the give feature**
quantile(duration)
## 0% 25% 50% 75% 100%
## 0 360 540 840 8940
** Lower and upper quantile calculation **
lowerq = quantile(duration)[2]
upperq = quantile(duration)[4]
lowerq
## 25%
## 360
upperq
## 75%
## 840
Interquantile calculation
# iqr = upperq - lowerq
iqr = upperq - lowerq
iqr
## 75%
## 480
The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.
** Calculation the upper threshold **
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 1560
** Calculation the lower threshold **
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -360
** Identify outliers **
# VARIABLE[ VARIABLE > upper_threshold]
# VARIABLE[ VARIABLE < lower_threshold]
#find values above threshold
duration[ duration > upper_threshold][1:10]
## [1] 1680 1860 1980 3360 4500 2760 1860 2700 2940 3300
#find values below threshold
duration[duration < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
** Finding outliers records **
# mydata[ VARIABLE > upper_threshold, ]
# mydata[ VARIABLE > lower_threshold, ]
mydata[ duration > upper_threshold, ]
## # A tibble: 3,385 x 24
## `Trip ID` `Taxi ID` `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
## <chr> <chr> <chr> <chr> <int>
## 1 3e7d78b99~ b80ae25954~ 06/06/2014 03:1~ 06/06/2014 03:4~ 1680
## 2 3e7d851f8~ 259ec57ee4~ 09/18/2014 12:0~ 09/18/2014 12:3~ 1860
## 3 3e7d87103~ 78fb99d332~ 03/24/2013 04:1~ 03/24/2013 04:4~ 1980
## 4 3e7d8b799~ 74605d6aa0~ 07/23/2015 04:0~ 07/23/2015 04:4~ 3360
## 5 3e7d8f714~ 97c6416993~ 09/30/2015 04:4~ 09/30/2015 06:0~ 4500
## 6 3e7d91de6~ 0de60e5618~ 10/30/2015 08:0~ 10/30/2015 08:4~ 2760
## 7 3e7d96aa7~ b918186326~ 02/05/2014 09:3~ 02/05/2014 10:0~ 1860
## 8 3e7d99a42~ 5cfc2dce21~ 08/19/2015 03:4~ 08/19/2015 04:3~ 2700
## 9 3e7d9d2f5~ 24b87270f3~ 06/08/2016 04:4~ 06/08/2016 05:3~ 2940
## 10 3e7da54c9~ 8d13012158~ 02/14/2014 07:4~ 02/14/2014 08:4~ 3300
## # ... with 3,375 more rows, and 19 more variables: `Trip Miles` <dbl>,
## # `Pickup Census Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup
## # Community Area` <int>, `Dropoff Community Area` <int>, Fare <dbl>,
## # Tips <dbl>, Tolls <dbl>, Extras <dbl>, `Trip Total` <chr>, `Payment
## # Type` <chr>, Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup
## # Centroid Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff
## # Centroid Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff
## # Centroid Location` <chr>, `Community Areas` <int>
mydata[ duration < lower_threshold, ]
## # A tibble: 0 x 24
## # ... with 24 variables: `Trip ID` <chr>, `Taxi ID` <chr>, `Trip Start
## # Timestamp` <chr>, `Trip End Timestamp` <chr>, `Trip Seconds` <int>,
## # `Trip Miles` <dbl>, `Pickup Census Tract` <dbl>, `Dropoff Census
## # Tract` <dbl>, `Pickup Community Area` <int>, `Dropoff Community
## # Area` <int>, Fare <dbl>, Tips <dbl>, Tolls <dbl>, Extras <dbl>, `Trip
## # Total` <chr>, `Payment Type` <chr>, Company <chr>, `Pickup Centroid
## # Latitude` <dbl>, `Pickup Centroid Longitude` <dbl>, `Pickup Centroid
## # Location` <chr>, `Dropoff Centroid Latitude` <dbl>, `Dropoff Centroid
## # Longitude` <dbl>, `Dropoff Centroid Location` <chr>, `Community
## # Areas` <int>
There are many outliers in this dataset. 3,385 are above the threshold. This may be because it is less frequent for people to take taxis over great distance. In a city it is uncommon to have to to take trips outside the city limits so this may be why there is such a small iqr.
It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds
# p <- ggplot(data = mydata, aes(x = "", y = VARIABLE)) + geom_boxplot() + coord_flip()
#no x values because we are using box plots
p <- ggplot( data = mydata, aes(x = "", y = duration)) + geom_boxplot() + coord_flip()
p
Now, we will look at Chicago taxi data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released
##Size of file: 38.7KB. The file includes 24 col and 100K rows. It was cleaned before it was released to exclude unrealistic information such as trips that were less than one second.
summary(mydata)
## Trip ID Taxi ID Trip Start Timestamp
## Length:39939 Length:39939 Length:39939
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Trip End Timestamp Trip Seconds Trip Miles Pickup Census Tract
## Length:39939 Min. : 0.0 Min. : 0.00 Min. :1.703e+10
## Class :character 1st Qu.: 360.0 1st Qu.: 0.00 1st Qu.:1.703e+10
## Mode :character Median : 540.0 Median : 0.50 Median :1.703e+10
## Mean : 709.3 Mean : 1.95 Mean :1.703e+10
## 3rd Qu.: 840.0 3rd Qu.: 1.50 3rd Qu.:1.703e+10
## Max. :8940.0 Max. :1115.30 Max. :1.703e+10
## Dropoff Census Tract Pickup Community Area Dropoff Community Area
## Min. :1.703e+10 Min. : 1.00 Min. : 1.00
## 1st Qu.:1.703e+10 1st Qu.: 8.00 1st Qu.: 8.00
## Median :1.703e+10 Median : 8.00 Median : 8.00
## Mean :1.703e+10 Mean :21.58 Mean :21.19
## 3rd Qu.:1.703e+10 3rd Qu.:32.00 3rd Qu.:32.00
## Max. :1.703e+10 Max. :77.00 Max. :77.00
## Fare Tips Tolls Extras
## Min. : 0.00 Min. : 0.000 Min. :0.0000000 Min. : 0.0000
## 1st Qu.: 5.65 1st Qu.: 0.000 1st Qu.:0.0000000 1st Qu.: 0.0000
## Median : 7.25 Median : 0.000 Median :0.0000000 Median : 0.0000
## Mean : 10.73 Mean : 1.106 Mean :0.0003458 Mean : 0.7473
## 3rd Qu.: 10.45 3rd Qu.: 2.000 3rd Qu.:0.0000000 3rd Qu.: 1.0000
## Max. :106.25 Max. :50.000 Max. :3.0000000 Max. :213.0000
## Trip Total Payment Type Company
## Length:39939 Length:39939 Length:39939
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Pickup Centroid Latitude Pickup Centroid Longitude
## Min. :41.79 Min. :-87.90
## 1st Qu.:41.88 1st Qu.:-87.64
## Median :41.89 Median :-87.63
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.90 3rd Qu.:-87.62
## Max. :42.02 Max. :-87.58
## Pickup Centroid Location Dropoff Centroid Latitude
## Length:39939 Min. :41.78
## Class :character 1st Qu.:41.88
## Mode :character Median :41.89
## Mean :41.90
## 3rd Qu.:41.90
## Max. :42.02
## Dropoff Centroid Longitude Dropoff Centroid Location Community Areas
## Min. :-87.90 Length:39939 Min. : 1.00
## 1st Qu.:-87.64 Class :character 1st Qu.:37.00
## Median :-87.63 Mode :character Median :37.00
## Mean :-87.65 Mean :40.63
## 3rd Qu.:-87.62 3rd Qu.:38.00
## Max. :-87.58 Max. :76.00
** Business logic - Pickup Times must be before dropoff times. Taxi ID cannot be used in multiple trips at the same time. Community area for dropoff or pickups must correspond to one communuty area in the Community areas field. Field integrity - Numeric fields (monetary related, time related, lat/long) may not contain numbers. **