Business Discovery and Data Sourcing

Notebook Instructions

For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse a collection of R packages designed for data science.

Info: https://www.tidyverse.org/

## Loading required package: tidyverse

## -- Attaching packages ------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Task 1: Data Prepartion and Cleaning

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()

Extract the assigned features (columns) to perform some analytics

mydata = read_csv("data/Taxi_Trips_Sample.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Trip Seconds` = col_integer(),
##   `Trip Miles` = col_double(),
##   `Pickup Census Tract` = col_double(),
##   `Dropoff Census Tract` = col_double(),
##   `Pickup Community Area` = col_integer(),
##   `Dropoff Community Area` = col_integer(),
##   `Pickup Centroid Latitude` = col_double(),
##   `Pickup Centroid Longitude` = col_double(),
##   `Dropoff Centroid Latitude` = col_double(),
##   `Dropoff Centroid Longitude` = col_double(),
##   `Community Areas` = col_integer()
## )

## See spec(...) for full column specifications.

mydata = na.omit(mydata)

head(mydata)

## # A tibble: 6 x 24
##   `Trip ID`  `Taxi ID`    `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
##   <chr>      <chr>        <chr>            <chr>                     <int>
## 1 3e7d6e5c4~ bc1c0381e3b~ 09/23/2013 05:1~ 09/23/2013 05:1~            420
## 2 3e7d706fe~ 631de3a4ad3~ 05/05/2014 09:3~ 05/05/2014 09:4~            900
## 3 3e7d70d4a~ 623c452c15d~ 03/26/2016 01:3~ 03/26/2016 01:3~            300
## 4 3e7d7115e~ 288345b3a02~ 05/13/2015 10:3~ 05/13/2015 10:4~            540
## 5 3e7d712c0~ dd9b35afedd~ 06/17/2016 04:0~ 06/17/2016 04:3~           1140
## 6 3e7d71a6e~ 19a9ad5b3b7~ 03/15/2013 10:1~ 03/15/2013 10:4~           1200
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## #   Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## #   Area` <int>, `Dropoff Community Area` <int>, Fare <chr>, Tips <chr>,
## #   Tolls <chr>, Extras <chr>, `Trip Total` <chr>, `Payment Type` <chr>,
## #   Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## #   Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## #   Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## #   Location` <chr>, `Community Areas` <int>

#assigning and viewing variables
fare = mydata$Fare
head(fare)

## [1] "$6.05"  "$10.45" "$6.25"  "$7.25"  "$13.25" "$35.25"

tips = mydata$Tips
head(tips)

## [1] "$0.00"  "$0.00"  "$0.00"  "$0.00"  "$2.00"  "$11.15"

tolls = mydata$Tolls
head(tolls)

## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"

extras = mydata$Extras
head(extras)

## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$2.00"

Find the average for each assigned feature

** What happens when we try to use the function? **

mean(fare)

## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA

## [1] NA

mean(tips)

## Warning in mean.default(tips): argument is not numeric or logical:
## returning NA

## [1] NA

mean(tolls)

## Warning in mean.default(tolls): argument is not numeric or logical:
## returning NA

## [1] NA

mean(extras)

## Warning in mean.default(extras): argument is not numeric or logical:
## returning NA

## [1] NA

#non-numeric values do not allow for mean calculation. Characters in vars gets in the way

To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.

1B) Substitute the characters causing the issue with empty character (“”) on all features

** substitute comma with “” **

# VARIABLE <- sub("," , "", VARIABLE)

#no commas in variables

** substitute dollar sign with “” **

# use \\ because of special characters 
fare <- sub("\\$" , "", fare)
tips <- sub("\\$" , "", tips)
tolls <- sub("\\$" , "", tolls)
extras <- sub("\\$" , "", extras)

head(fare)

## [1] "6.05"  "10.45" "6.25"  "7.25"  "13.25" "35.25"

1C) Change the features data type from character to numeric

** character to numeric - as.numeric()**

fare <- as.numeric(fare)
tips <- as.numeric(tips)
tolls <- as.numeric(tolls)
extras <- as.numeric(extras)

** mean with NA removed **

# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)

fare_mean = mean(fare)
fare_mean

## [1] 10.73069

tips_mean = mean(tips)
tips_mean

## [1] 1.105968

tolls_mean = mean(tolls)
tolls_mean

## [1] 0.0003457773

extras_mean = mean(extras)
extras_mean

## [1] 0.7472891

1D) Assign new clean variables to dataset and save your clean new data

** Clean variables to dataset **

# mydata$VARIABLE <- VARIABLE

mydata$Fare <- fare
mydata$Tips <- tips
mydata$Tolls <- tolls
mydata$Extras <- extras

** Save clean data **

# write_csv(mydata, path = "data/mydata_clean.csv")

write_csv (mydata, path = "data/mydata_clean.csv")

Task 2: Outlier Detection

Commands: mean() sd() min() max() quantile()

In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

#reread clean data to be safe.
mydata <- read_csv("data/mydata_clean.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Trip ID` = col_character(),
##   `Taxi ID` = col_character(),
##   `Trip Start Timestamp` = col_character(),
##   `Trip End Timestamp` = col_character(),
##   `Trip Seconds` = col_integer(),
##   `Pickup Community Area` = col_integer(),
##   `Dropoff Community Area` = col_integer(),
##   `Trip Total` = col_character(),
##   `Payment Type` = col_character(),
##   Company = col_character(),
##   `Pickup Centroid Location` = col_character(),
##   `Dropoff Centroid  Location` = col_character(),
##   `Community Areas` = col_integer()
## )

## See spec(...) for full column specifications.

duration = mydata$`Trip Seconds`
head(duration)

## [1]  420  900  300  540 1140 1200

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

duration_mean = mean(duration)
duration_mean

## [1] 709.3322

** calculate the standard deviation **

duration_sd = sd(duration)
duration_sd

## [1] 659.8013

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

min_duration = min(duration)
min_duration

## [1] 0

** calculate the max **

max_duration = max(duration)
max_duration

## [1] 8940

2C) Quantile calculation to find outliers of the assigned feature (column)

To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the give feature**

quantile(duration)

##   0%  25%  50%  75% 100% 
##    0  360  540  840 8940

** Lower and upper quantile calculation **

lowerq = quantile(duration)[2]
upperq = quantile(duration)[4]

lowerq

## 25% 
## 360

upperq

## 75% 
## 840

Interquantile calculation

# iqr = upperq - lowerq
iqr = upperq - lowerq
iqr

## 75% 
## 480

2D) Finding upper and lower thresholds

The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculation the upper threshold **

upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##  75% 
## 1560

** Calculation the lower threshold **

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##  25% 
## -360

** Identify outliers **

# VARIABLE[ VARIABLE > upper_threshold]
# VARIABLE[ VARIABLE < lower_threshold]

#find values above threshold
duration[ duration > upper_threshold][1:10]

##  [1] 1680 1860 1980 3360 4500 2760 1860 2700 2940 3300

#find values below threshold
duration[duration < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

** Finding outliers records **

# mydata[ VARIABLE > upper_threshold, ]
# mydata[ VARIABLE > lower_threshold, ]

mydata[ duration > upper_threshold, ]

## # A tibble: 3,385 x 24
##    `Trip ID`  `Taxi ID`   `Trip Start Tim~ `Trip End Times~ `Trip Seconds`
##    <chr>      <chr>       <chr>            <chr>                     <int>
##  1 3e7d78b99~ b80ae25954~ 06/06/2014 03:1~ 06/06/2014 03:4~           1680
##  2 3e7d851f8~ 259ec57ee4~ 09/18/2014 12:0~ 09/18/2014 12:3~           1860
##  3 3e7d87103~ 78fb99d332~ 03/24/2013 04:1~ 03/24/2013 04:4~           1980
##  4 3e7d8b799~ 74605d6aa0~ 07/23/2015 04:0~ 07/23/2015 04:4~           3360
##  5 3e7d8f714~ 97c6416993~ 09/30/2015 04:4~ 09/30/2015 06:0~           4500
##  6 3e7d91de6~ 0de60e5618~ 10/30/2015 08:0~ 10/30/2015 08:4~           2760
##  7 3e7d96aa7~ b918186326~ 02/05/2014 09:3~ 02/05/2014 10:0~           1860
##  8 3e7d99a42~ 5cfc2dce21~ 08/19/2015 03:4~ 08/19/2015 04:3~           2700
##  9 3e7d9d2f5~ 24b87270f3~ 06/08/2016 04:4~ 06/08/2016 05:3~           2940
## 10 3e7da54c9~ 8d13012158~ 02/14/2014 07:4~ 02/14/2014 08:4~           3300
## # ... with 3,375 more rows, and 19 more variables: `Trip Miles` <dbl>,
## #   `Pickup Census Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup
## #   Community Area` <int>, `Dropoff Community Area` <int>, Fare <dbl>,
## #   Tips <dbl>, Tolls <dbl>, Extras <dbl>, `Trip Total` <chr>, `Payment
## #   Type` <chr>, Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup
## #   Centroid Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff
## #   Centroid Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff
## #   Centroid Location` <chr>, `Community Areas` <int>

mydata[ duration < lower_threshold, ]

## # A tibble: 0 x 24
## # ... with 24 variables: `Trip ID` <chr>, `Taxi ID` <chr>, `Trip Start
## #   Timestamp` <chr>, `Trip End Timestamp` <chr>, `Trip Seconds` <int>,
## #   `Trip Miles` <dbl>, `Pickup Census Tract` <dbl>, `Dropoff Census
## #   Tract` <dbl>, `Pickup Community Area` <int>, `Dropoff Community
## #   Area` <int>, Fare <dbl>, Tips <dbl>, Tolls <dbl>, Extras <dbl>, `Trip
## #   Total` <chr>, `Payment Type` <chr>, Company <chr>, `Pickup Centroid
## #   Latitude` <dbl>, `Pickup Centroid Longitude` <dbl>, `Pickup Centroid
## #   Location` <chr>, `Dropoff Centroid Latitude` <dbl>, `Dropoff Centroid
## #   Longitude` <dbl>, `Dropoff Centroid Location` <chr>, `Community
## #   Areas` <int>

Are there any outliers? How many?

There are many outliers in this dataset. 3,385 are above the threshold. This may be because it is less frequent for people to take taxis over great distance. In a city it is uncommon to have to to take trips outside the city limits so this may be why there is such a small iqr.

Plotting Outliers

It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds

# p <- ggplot(data = mydata, aes(x = "", y = VARIABLE)) + geom_boxplot() + coord_flip()

#no x values because we are using box plots
p <- ggplot( data = mydata, aes(x = "", y = duration)) + geom_boxplot() + coord_flip()

p

Task 3: Data Modeling - Chicago Taxi Data

Now, we will look at Chicago taxi data.

3A) Go and explore the interactive dashboard and read the description of the data. Take an screenshot of something that you find interesting in the dashboard data

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released

3B) Describe the Taxi data (Note the size of the file, the number of columns and of rows). Use the functions summary() to describe the data, identify unique entities, fields and summarize.

##Size of file: 38.7KB. The file includes 24 col and 100K rows. It was cleaned before it was released to exclude unrealistic information such as trips that were less than one second.

summary(mydata)

##    Trip ID            Taxi ID          Trip Start Timestamp
##  Length:39939       Length:39939       Length:39939        
##  Class :character   Class :character   Class :character    
##  Mode  :character   Mode  :character   Mode  :character    
##                                                            
##                                                            
##                                                            
##  Trip End Timestamp  Trip Seconds      Trip Miles      Pickup Census Tract
##  Length:39939       Min.   :   0.0   Min.   :   0.00   Min.   :1.703e+10  
##  Class :character   1st Qu.: 360.0   1st Qu.:   0.00   1st Qu.:1.703e+10  
##  Mode  :character   Median : 540.0   Median :   0.50   Median :1.703e+10  
##                     Mean   : 709.3   Mean   :   1.95   Mean   :1.703e+10  
##                     3rd Qu.: 840.0   3rd Qu.:   1.50   3rd Qu.:1.703e+10  
##                     Max.   :8940.0   Max.   :1115.30   Max.   :1.703e+10  
##  Dropoff Census Tract Pickup Community Area Dropoff Community Area
##  Min.   :1.703e+10    Min.   : 1.00         Min.   : 1.00         
##  1st Qu.:1.703e+10    1st Qu.: 8.00         1st Qu.: 8.00         
##  Median :1.703e+10    Median : 8.00         Median : 8.00         
##  Mean   :1.703e+10    Mean   :21.58         Mean   :21.19         
##  3rd Qu.:1.703e+10    3rd Qu.:32.00         3rd Qu.:32.00         
##  Max.   :1.703e+10    Max.   :77.00         Max.   :77.00         
##       Fare             Tips            Tolls               Extras        
##  Min.   :  0.00   Min.   : 0.000   Min.   :0.0000000   Min.   :  0.0000  
##  1st Qu.:  5.65   1st Qu.: 0.000   1st Qu.:0.0000000   1st Qu.:  0.0000  
##  Median :  7.25   Median : 0.000   Median :0.0000000   Median :  0.0000  
##  Mean   : 10.73   Mean   : 1.106   Mean   :0.0003458   Mean   :  0.7473  
##  3rd Qu.: 10.45   3rd Qu.: 2.000   3rd Qu.:0.0000000   3rd Qu.:  1.0000  
##  Max.   :106.25   Max.   :50.000   Max.   :3.0000000   Max.   :213.0000  
##   Trip Total        Payment Type         Company         
##  Length:39939       Length:39939       Length:39939      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  Pickup Centroid Latitude Pickup Centroid Longitude
##  Min.   :41.79            Min.   :-87.90           
##  1st Qu.:41.88            1st Qu.:-87.64           
##  Median :41.89            Median :-87.63           
##  Mean   :41.90            Mean   :-87.65           
##  3rd Qu.:41.90            3rd Qu.:-87.62           
##  Max.   :42.02            Max.   :-87.58           
##  Pickup Centroid Location Dropoff Centroid Latitude
##  Length:39939             Min.   :41.78            
##  Class :character         1st Qu.:41.88            
##  Mode  :character         Median :41.89            
##                           Mean   :41.90            
##                           3rd Qu.:41.90            
##                           Max.   :42.02            
##  Dropoff Centroid Longitude Dropoff Centroid  Location Community Areas
##  Min.   :-87.90             Length:39939               Min.   : 1.00  
##  1st Qu.:-87.64             Class :character           1st Qu.:37.00  
##  Median :-87.63             Mode  :character           Median :37.00  
##  Mean   :-87.65                                        Mean   :40.63  
##  3rd Qu.:-87.62                                        3rd Qu.:38.00  
##  Max.   :-87.58                                        Max.   :76.00

3C) Identify any relational business logic and field integrity

** Business logic - Pickup Times must be before dropoff times. Taxi ID cannot be used in multiple trips at the same time. Community area for dropoff or pickups must correspond to one communuty area in the Community areas field. Field integrity - Numeric fields (monetary related, time related, lat/long) may not contain numbers. **

3D) Draw a star schema of the Chicago Taxi Data

Modeling Tool: https://erdplus.com/#/standalone