Business Discovery and Data Sourcing

Notebook Instructions

For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse a collection of R packages designed for data science.

Info: https://www.tidyverse.org/

## Loading required package: tidyverse

## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Task 1: Data Prepartion and Cleaning

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()

Extract the assigned features (columns) to perform some analytics

mydata <- read_csv(file="Taxi_Trips_Sample.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Trip Seconds` = col_integer(),
##   `Trip Miles` = col_double(),
##   `Pickup Census Tract` = col_double(),
##   `Dropoff Census Tract` = col_double(),
##   `Pickup Community Area` = col_integer(),
##   `Dropoff Community Area` = col_integer(),
##   `Pickup Centroid Latitude` = col_double(),
##   `Pickup Centroid Longitude` = col_double(),
##   `Dropoff Centroid Latitude` = col_double(),
##   `Dropoff Centroid Longitude` = col_double(),
##   `Community Areas` = col_integer()
## )

## See spec(...) for full column specifications.

head(mydata)

## # A tibble: 6 x 24
##   `Trip ID`  `Taxi ID`    `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
##   <chr>      <chr>        <chr>            <chr>                     <int>
## 1 3e7d6d8cc~ b47c583b142~ 08/19/2014 02:4~ 08/19/2014 02:4~            480
## 2 3e7d6e5c4~ bc1c0381e3b~ 09/23/2013 05:1~ 09/23/2013 05:1~            420
## 3 3e7d6e69c~ f529487ccf3~ 02/16/2014 10:1~ 02/16/2014 10:3~            420
## 4 3e7d6efe4~ 0f831bff43d~ 05/10/2013 08:0~ 05/10/2013 08:4~           2340
## 5 3e7d6f001~ e5274d6c103~ 02/21/2016 07:1~ 02/21/2016 07:1~            300
## 6 3e7d6f2a0~ 329d9f0b72c~ 12/10/2015 05:3~ 12/10/2015 06:0~           1020
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## #   Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## #   Area` <int>, `Dropoff Community Area` <int>, Fare <chr>, Tips <chr>,
## #   Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment Type` <chr>,
## #   Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## #   Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## #   Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## #   Location` <chr>, `Community Areas` <int>

mydata <- na.omit(mydata)
Fare <- mydata$Fare
Tips <- mydata$Tips
Tolls <- mydata$Tolls
Extras <- mydata$Extras

Find the average for each assigned feature

** What happens when we try to use the function? **

mean(Fare)

## Warning in mean.default(Fare): argument is not numeric or logical:
## returning NA

## [1] NA

mean(Tips)

## Warning in mean.default(Tips): argument is not numeric or logical:
## returning NA

## [1] NA

mean(Tolls)

## Warning in mean.default(Tolls): argument is not numeric or logical:
## returning NA

## [1] NA

mean(Extras)

## Warning in mean.default(Extras): argument is not numeric or logical:
## returning NA

## [1] NA

To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.

1B) Substitute the characters causing the issue with empty character (“”) on all features

** substitute comma with “” **

# VARIABLE <- sub("," , "", VARIABLE)
Fare <- sub(",", "", Fare)
Tips <- sub(",", "", Tips)
Tolls <- sub(",", "", Tolls)
Extras <- sub(",", "", Extras)

** substitute dollar sign with “” **

Fare <- sub("$", "", Fare)
Tips <- sub("$", "", Tips)
Tolls <- sub("$", "", Tolls)
Extras <- sub("$", "", Extras)

1C) Change the features data type from character to numeric

** character to numeric - as.numeric()**

Fare <- as.numeric(Fare)

## Warning: NAs introduced by coercion

Tips <- as.numeric(Tips)

## Warning: NAs introduced by coercion

Tolls <-as.numeric(Tolls)

## Warning: NAs introduced by coercion

Extras <- as.numeric(Extras)

## Warning: NAs introduced by coercion

** mean with NA removed **

# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)

Fare_Mean <- mean(Fare, na.rm = TRUE)
Tips_Mean <- mean(Tips, na.rm = TRUE)
Tolls_Mean <- mean(Tolls, na.rm = TRUE)
Extras_Mean <- mean(Extras, na.rm = TRUE)

1D) Assign new clean variables to dataset and save your clean new data

** Clean variables to dataset **

# mydata$VARIABLE <- VARIABLE

mydata$Fare <- Fare
mydata$Tips <- Tips
mydata$Tolls <- Tolls
mydata$Extras <- Extras

** Save clean data **

# write_csv(mydata, path = "data/mydata_clean.csv")

write_csv(mydata, path = "data/mydata_clean.csv")

Task 2: Outlier Detection

Commands: mean() sd() min() max() quantile()

In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

mydata <- read_csv(file= "data/mydata_clean.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Trip Seconds` = col_integer(),
##   `Trip Miles` = col_double(),
##   `Pickup Census Tract` = col_double(),
##   `Dropoff Census Tract` = col_double(),
##   `Pickup Community Area` = col_integer(),
##   `Dropoff Community Area` = col_integer(),
##   `Pickup Centroid Latitude` = col_double(),
##   `Pickup Centroid Longitude` = col_double(),
##   `Dropoff Centroid Latitude` = col_double(),
##   `Dropoff Centroid Longitude` = col_double(),
##   `Community Areas` = col_integer()
## )

## See spec(...) for full column specifications.

Trip_Seconds <- mydata$`Trip Seconds`

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

mean_TripSeconds <- mean(mydata$"Trip Seconds")
mean_TripSeconds

## [1] 709.3322

** calculate the standard deviation **

sd_TripSeconds <- sd(mydata$"Trip Seconds")
sd_TripSeconds

## [1] 659.8013

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

min_TripSeconds <- min(mydata$"Trip Seconds")
min_TripSeconds

## [1] 0

** calculate the max **

max_TripSeconds <- max(mydata$"Trip Seconds")
max_TripSeconds

## [1] 8940

2C) Quantile calculation to find outliers of the assigned feature (column)

To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the give feature**

quantile(Trip_Seconds, na.rm = TRUE)

##   0%  25%  50%  75% 100% 
##    0  360  540  840 8940

** Lower and upper quantile calculation **

# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]

lowerq = quantile(Trip_Seconds)[2]
upperq = quantile(Trip_Seconds)[4]

lowerq

## 25% 
## 360

upperq

## 75% 
## 840

Interquantile calculation

# iqr = upperq - lowerq
iqr = upperq - lowerq
iqr

## 75% 
## 480

2D) Finding upper and lower thresholds

The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculation the upper threshold **

# upper_threshold = (iqr * 1.5) + upperq 
upper_threshold = (iqr * 1.5) + upperq
upper_threshold

##  75% 
## 1560

** Calculation the lower threshold **

# lower_threshold = lowerq - (iqr * 1.5)
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##  25% 
## -360

** Identify outliers **

# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE > lower_threshold][1:10]
Trip_Seconds[Trip_Seconds > upper_threshold][1:10]

##  [1] 1680 1860 1980 3360 4500 2760 1860 2700 2940 3300

Trip_Seconds[Trip_Seconds < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

** Finding outliers records **

# mydata[ VARIABLE > upper_threshold, ][1:10]
# mydata[ VARIABLE > lower_threshold, ][1:10]
mydata[Trip_Seconds > upper_threshold, ]

## # A tibble: 3,385 x 24
##    `Trip ID`  `Taxi ID`   `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
##    <chr>      <chr>       <chr>            <chr>                     <int>
##  1 3e7d78b99~ b80ae25954~ 06/06/2014 03:1~ 06/06/2014 03:4~           1680
##  2 3e7d851f8~ 259ec57ee4~ 09/18/2014 12:0~ 09/18/2014 12:3~           1860
##  3 3e7d87103~ 78fb99d332~ 03/24/2013 04:1~ 03/24/2013 04:4~           1980
##  4 3e7d8b799~ 74605d6aa0~ 07/23/2015 04:0~ 07/23/2015 04:4~           3360
##  5 3e7d8f714~ 97c6416993~ 09/30/2015 04:4~ 09/30/2015 06:0~           4500
##  6 3e7d91de6~ 0de60e5618~ 10/30/2015 08:0~ 10/30/2015 08:4~           2760
##  7 3e7d96aa7~ b918186326~ 02/05/2014 09:3~ 02/05/2014 10:0~           1860
##  8 3e7d99a42~ 5cfc2dce21~ 08/19/2015 03:4~ 08/19/2015 04:3~           2700
##  9 3e7d9d2f5~ 24b87270f3~ 06/08/2016 04:4~ 06/08/2016 05:3~           2940
## 10 3e7da54c9~ 8d13012158~ 02/14/2014 07:4~ 02/14/2014 08:4~           3300
## # ... with 3,375 more rows, and 19 more variables: `Trip Miles` <dbl>,
## #   `Pickup Census Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup
## #   Community Area` <int>, `Dropoff Community Area` <int>, Fare <chr>,
## #   Tips <chr>, Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment
## #   Type` <chr>, Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup
## #   Centroid Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff
## #   Centroid Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff
## #   Centroid Location` <chr>, `Community Areas` <int>

Are there any outliers, if not explain the lack of outliers? if any explain what the outliers represent and how many records are outliers? Use the count() function to find the number of outliers.

Yes there are outliers, they represent data above the upper and lower threshhold limits. There are 3385 outliers that serve to skew the data.

count(mydata[Trip_Seconds > upper_threshold, ])

## # A tibble: 1 x 1
##       n
##   <int>
## 1  3385

Plotting Outliers

It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds

# p <- ggplot(data = Scoring, aes(x = "", y = Finrat)) + geom_boxplot() + coord_flip()
# p 
p <- ggplot(data = mydata, aes(x = "", y = Trip_Seconds)) + geom_boxplot() + coord_flip()
p

Task 3: Data Modeling - Chicago Taxi Data

3A) Go and explore the interactive dashboard. Take a screenshot of something that you find interesting in the dashboard and write a short summary of your findings

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released

While the company, Uber was started in 2009, it is not until 2014 than the ridesharing industry began to seriously impact the taxi industry. It is as if it took several years for the public to trust the service.

3B) Write the Metadata for the Taxi data (description), note the size of the file its dimension (number of rows and columns). Use the function summary() to help you identify unique entities, fields, number of rows and data types.

The data consists of 24 columns and 100,000 rows of data. The excel file is 38, 7272 KB It is interesting to note that the median trip miles is only 540 seconds, or 9 minutes, a shorter range than one might think. Whereas the first quartile is only 6 minutes (360 seconds). Unique entities incluce Trip ID, Taxi ID, and Community Areas. Tip ID, Taxi ID, Trip Start Timestamp, Trip End TimeStamp, Fare, Trip Total, Payment Type, Company, Pickup Centroid Location and Dropoff Centroid Location are character values. The length of Trip ID, Trip Total, Payment Type, Company PickUp Centroid Locatoin,and Dropoff Centroid location is 39,939.

3C) Write a description of any relational business logic and field integrity

Relational logic includes requiring that the length of the trip is positive in number of seconds and miles. One cannot travel for negative distance or time. If the start time is after the end time or the number of miles is negative, then there may have been a mistake in the data entry. Field intengrity involves all value: Fare, Tips, Tools, Extras, and Trip Total are in dollars, and all values in Trip Miles are numeric and in miles.

3D) Draw a star schema of the Chicago Taxi Data

Modeling Tool: https://erdplus.com/#/standalone

 knitr::include_graphics('erdplus-diagram.png')