Notebook Instructions


Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science.

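The chunk that produced the messages below is not shown in this notebook; a minimal load call would look like the following sketch (the “Loading required package” line suggests require() was used, and library(tidyverse) behaves the same way).

# Load the tidyverse meta-package (ggplot2, dplyr, readr, tidyr, etc.)
require(tidyverse)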
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Task 1: Data Preparation and Cleaning


1A) Read the CSV file into RStudio and display the dataset.

  • Name your dataset ‘mydata’ so it is easy to work with.

  • Commands: read_csv(), head(), mean(), sub(), as.numeric()

Extract the assigned features (columns) to perform some analytics

mydata <- read_csv('Taxi_Trips_Sample.csv')
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Trip Seconds` = col_integer(),
##   `Trip Miles` = col_double(),
##   `Pickup Census Tract` = col_double(),
##   `Dropoff Census Tract` = col_double(),
##   `Pickup Community Area` = col_integer(),
##   `Dropoff Community Area` = col_integer(),
##   `Pickup Centroid Latitude` = col_double(),
##   `Pickup Centroid Longitude` = col_double(),
##   `Dropoff Centroid Latitude` = col_double(),
##   `Dropoff Centroid Longitude` = col_double(),
##   `Community Areas` = col_integer()
## )
## See spec(...) for full column specifications.
fare = mydata$Fare
tips = mydata$Tips
tolls = mydata$Tolls
extras = mydata$Extras
head(fare)
## [1] "$7.05"  "$6.05"  "$7.05"  "$31.25" "$5.50"  "$9.25"
head(tips)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
head(tolls)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
head(extras)
## [1] "$1.50" "$0.00" "$0.00" "$3.00" "$0.00" "$1.00"

Find the average for each assigned feature

** What happens when we try to use the mean() function? **

mean(fare)
## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA
## [1] NA
  • To resolve the error, check if the feature data type is correct

  • Notice the comma in some values, such as ‘1,234’

  • Also note the character data type and the dollar sign ‘$’ symbol

  • To remove a character, in this case the comma in “1,234”, we must substitute it with an empty string.

1B) Substitute the characters causing the issue with an empty string (“”) in all features

** substitute comma with “” **

# VARIABLE <- sub("," , "", VARIABLE)
fare = sub(pattern = ",", replacement = "", x = fare)
head(fare)
## [1] "$7.05"  "$6.05"  "$7.05"  "$31.25" "$5.50"  "$9.25"
tips = sub(pattern = ",", replacement = "", x = tips)
head(tips)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
tolls = sub(pattern = ",", replacement = "", x = tolls)
head(tolls)
## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"
extras = sub(pattern = ",", replacement = "", x = extras)
head(extras)
## [1] "$1.50" "$0.00" "$0.00" "$3.00" "$0.00" "$1.00"

** substitute dollar sign with “” **

fare = sub(pattern = "\\$", replacement = "", x = fare)
head(fare)
## [1] "7.05"  "6.05"  "7.05"  "31.25" "5.50"  "9.25"
tips = sub(pattern = "\\$", replacement = "", x = tips)
head(tips)
## [1] "0.00" "0.00" "0.00" "0.00" "0.00" "0.00"
tolls = sub(pattern = "\\$", replacement = "", x = tolls)
head(tolls)
## [1] "0.00" "0.00" "0.00" "0.00" "0.00" "0.00"
extras = sub(pattern = "\\$", replacement = "", x = extras)
head(extras)
## [1] "1.50" "0.00" "0.00" "3.00" "0.00" "1.00"

1C) Change the features data type from character to numeric

** character to numeric - as.numeric() **

fare = as.numeric(fare)
tips = as.numeric(tips)
tolls = as.numeric(tolls)
extras = as.numeric(extras)

** mean with NA removed **

# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
fare_mean = mean(fare, na.rm = TRUE)
fare_mean
## [1] 12.56912
tips_mean = mean(tips, na.rm = TRUE)
tips_mean
## [1] 1.249043
tolls_mean = mean(tolls, na.rm = TRUE)
tolls_mean
## [1] 0.007330247
extras_mean = mean(extras, na.rm = TRUE)
extras_mean
## [1] 0.7976405
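
All four means can also be computed in one call; a minimal sketch using the cleaned vectors from 1C:

# sapply() applies mean() to each vector and passes na.rm = TRUE along to it
sapply(list(fare = fare, tips = tips, tolls = tolls, extras = extras),
       mean, na.rm = TRUE)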

1D) Assign the new clean variables to the dataset and save your new clean data

** Clean variables to dataset **

# mydata$VARIABLE <- VARIABLE
mydata$Fare = fare
mydata$Tips = tips
mydata$Tolls = tolls
mydata$Extras = extras
head(mydata)
## # A tibble: 6 x 24
##   `Trip ID`  `Taxi ID`    `Trip Start Tim… `Trip End Times… `Trip Seconds`
##   <chr>      <chr>        <chr>            <chr>                     <int>
## 1 3e7d6d8cc… b47c583b142… 08/19/2014 02:4… 08/19/2014 02:4…            480
## 2 3e7d6e5c4… bc1c0381e3b… 09/23/2013 05:1… 09/23/2013 05:1…            420
## 3 3e7d6e69c… f529487ccf3… 02/16/2014 10:1… 02/16/2014 10:3…            420
## 4 3e7d6efe4… 0f831bff43d… 05/10/2013 08:0… 05/10/2013 08:4…           2340
## 5 3e7d6f001… e5274d6c103… 02/21/2016 07:1… 02/21/2016 07:1…            300
## 6 3e7d6f2a0… 329d9f0b72c… 12/10/2015 05:3… 12/10/2015 06:0…           1020
## # ... with 19 more variables: `Trip Miles` <dbl>, `Pickup Census
## #   Tract` <dbl>, `Dropoff Census Tract` <dbl>, `Pickup Community
## #   Area` <int>, `Dropoff Community Area` <int>, Fare <dbl>, Tips <dbl>,
## #   Tolls <dbl>, Extras <dbl>, `Trip Total` <chr>, `Payment Type` <chr>,
## #   Company <chr>, `Pickup Centroid Latitude` <dbl>, `Pickup Centroid
## #   Longitude` <dbl>, `Pickup Centroid Location` <chr>, `Dropoff Centroid
## #   Latitude` <dbl>, `Dropoff Centroid Longitude` <dbl>, `Dropoff Centroid
## #   Location` <chr>, `Community Areas` <int>

** Save clean data **

write_csv(mydata, path = "data/mydata_clean.csv")

Task 2: Outlier Detection


In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

mydata = read.csv("data/mydata_clean.csv")
trip_seconds = mydata$Trip.Seconds

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

mean(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 739.2153

** calculate the standard deviation **

sd(mydata$Trip.Seconds, na.rm = TRUE)
## [1] 1167.484

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

min(trip_seconds, na.rm = TRUE)
## [1] 0

** calculate the max **

max(trip_seconds, na.rm = TRUE)
## [1] 74340

2C) Quantile calculation to find outliers of the assigned feature (column)

  • To find the outliers we are going to look at the upper and lower limits

  • An outlier is a value that “lies outside” most of the other values in a set of data.

  • A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the given feature **

quantile(trip_seconds, na.rm = TRUE)
##    0%   25%   50%   75%  100% 
##     0   300   540   900 74340

** Lower and upper quantile calculation **

lowerq = quantile(trip_seconds, na.rm = TRUE)[2]
upperq = quantile(trip_seconds, na.rm = TRUE)[4]

Interquartile range (IQR) calculation

iqr = upperq - lowerq

2D) Finding upper and lower thresholds

  • The thresholds are the boundaries that determine whether a value is an outlier.

  • If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculating the upper threshold **

upper_threshold = (iqr * 1.5) + upperq 

** Calculating the lower threshold **

lower_threshold = lowerq - (iqr * 1.5)
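
For reference, with the quartiles printed above (25% = 300 and 75% = 900) the IQR is 600, so the thresholds work out as follows. Because Trip Seconds cannot be negative (the minimum in 2B is 0), no genuine value can fall below the lower threshold.

upper_threshold  # (600 * 1.5) + 900 = 1800
lower_threshold  # 300 - (600 * 1.5) = -600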

** Identify outliers **

# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE < lower_threshold][1:10]
trip_seconds[ trip_seconds > upper_threshold][1:10]
##  [1] 2340 2220 3300 2520 2220 1860 1980   NA 3360 4500
trip_seconds[ trip_seconds < lower_threshold][1:10]
##  [1] NA NA NA NA NA NA NA NA NA NA

** Finding outlier records **

#mydata[ trip_seconds > upper_threshold, ][1:10]
#mydata[ trip_seconds < lower_threshold, ][1:10]
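
An equivalent tidyverse way to pull out the same records, shown as a sketch (dplyr::filter() drops rows where the condition evaluates to NA):

# First ten records above and below the thresholds, using the re-read column name
head(filter(mydata, Trip.Seconds > upper_threshold), 10)
head(filter(mydata, Trip.Seconds < lower_threshold), 10)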

Are there any outliers? If not, explain the lack of outliers; if there are, explain what the outliers represent and how many records are outliers. Use the count() function to find the number of outliers.

count( mydata[ trip_seconds > upper_threshold, ] )
## # A tibble: 1 x 1
##       n
##   <int>
## 1  8160
count( mydata[ trip_seconds < lower_threshold, ] )
## # A tibble: 1 x 1
##       n
##   <int>
## 1  1327
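
Note that comparing NA against a threshold returns NA, and base subsetting keeps those rows, so the 1327 records reported below the lower threshold match the number of missing Trip Seconds values (the same count the boxplot warning below removes) rather than genuine outliers. An NA-safe count, as a sketch:

# Sum the logical vectors while ignoring missing values
sum(trip_seconds > upper_threshold, na.rm = TRUE)
sum(trip_seconds < lower_threshold, na.rm = TRUE)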

Plotting Outliers

  • It can also be useful to visualize the data using a box and whisker plot.

  • The boxplot is built on the IQR and also shows the upper and lower thresholds

p = ggplot(data = mydata, aes(x = "", y = trip_seconds)) + geom_boxplot() + coord_flip()
p
## Warning: Removed 1327 rows containing non-finite values (stat_boxplot).
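
If it helps, the thresholds from 2D can be overlaid on the boxplot; a sketch building on the plot object p:

# Dashed reference lines at the lower and upper IQR thresholds
p + geom_hline(yintercept = c(lower_threshold, upper_threshold),
               linetype = "dashed", colour = "red")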


Task 3: Data Modeling - Chicago Taxi Data


3A) Go and explore the interactive dashboard. Take a screenshot of something that you find interesting in the dashboard and write a short summary of your findings

knitr::include_graphics("data/img.png")

3B) Write the metadata for the taxi data (a description), noting the size of the file and its dimensions (number of rows and columns). Use the function summary() to help you identify unique entities, fields, the number of rows, and the data types.

summary(mydata)
##                                      Trip.ID     
##  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed:    1  
##  3e7d6e5c4e87f01a475c8200b33777e85497da89:    1  
##  3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c:    1  
##  3e7d6efe43222b0ebc698583916674c648dd4520:    1  
##  3e7d6f001e9bcda8478a489cb53293d26328ac85:    1  
##  3e7d6f2a03527d63dc01b95e829fdfdd706102da:    1  
##  (Other)                                 :99993  
##                                                                                                                              Taxi.ID     
##  aebf720288b80a8ee36860541db64951c696c749f1a392d312fa4d2a8cd3f95dfb0be580fda7eb63455f809a1be9b3acad19a3ca167073126d0350b50f30741a:   58  
##  4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24:   57  
##  f737a9a31b07650672910268d7cceb9c06a379c0e75070c0dc0366db8132b06ba2800c5e63c5e56f821a591fc78a92c1c60fb5f48e01aa02e62ff10d18ececd0:   55  
##  1158f25979ad78fd3dafc867a540ad761b65922c312e6170ccee63c3f14adea37317d3cf4e2053d2bdb1531d17670872e0411e496905ef9cb4821e0e96056139:   53  
##  0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703:   52  
##  (Other)                                                                                                                         :99720  
##  NA's                                                                                                                            :    4  
##              Trip.Start.Timestamp              Trip.End.Timestamp
##  07/25/2014 06:45:00 PM:    9     02/10/2014 10:30:00 AM:    9   
##  02/05/2015 07:15:00 PM:    8     02/05/2015 07:45:00 PM:    8   
##  02/27/2015 08:45:00 AM:    8     03/03/2014 06:45:00 PM:    8   
##  04/25/2014 06:45:00 PM:    8     03/22/2014 08:15:00 PM:    8   
##  09/18/2013 07:30:00 PM:    8     03/24/2016 07:30:00 PM:    8   
##  03/15/2014 07:00:00 PM:    7     (Other)               :99942   
##  (Other)               :99951     NA's                  :   16   
##   Trip.Seconds       Trip.Miles       Pickup.Census.Tract
##  Min.   :    0.0   Min.   :   0.000   Min.   :1.703e+10  
##  1st Qu.:  300.0   1st Qu.:   0.000   1st Qu.:1.703e+10  
##  Median :  540.0   Median :   0.900   Median :1.703e+10  
##  Mean   :  739.2   Mean   :   2.686   Mean   :1.703e+10  
##  3rd Qu.:  900.0   3rd Qu.:   2.400   3rd Qu.:1.703e+10  
##  Max.   :74340.0   Max.   :1830.000   Max.   :1.703e+10  
##  NA's   :1327      NA's   :1          NA's   :38042      
##  Dropoff.Census.Tract Pickup.Community.Area Dropoff.Community.Area
##  Min.   :1.703e+10    Min.   : 1.00         Min.   : 1.00         
##  1st Qu.:1.703e+10    1st Qu.: 8.00         1st Qu.: 8.00         
##  Median :1.703e+10    Median : 8.00         Median :14.00         
##  Mean   :1.703e+10    Mean   :22.04         Mean   :21.14         
##  3rd Qu.:1.703e+10    3rd Qu.:32.00         3rd Qu.:32.00         
##  Max.   :1.703e+10    Max.   :77.00         Max.   :77.00         
##  NA's   :38775        NA's   :15534         NA's   :17532         
##       Fare              Tips             Tolls              Extras        
##  Min.   :   0.00   Min.   :  0.000   Min.   : 0.00000   Min.   :   0.000  
##  1st Qu.:   6.00   1st Qu.:  0.000   1st Qu.: 0.00000   1st Qu.:   0.000  
##  Median :   8.05   Median :  0.000   Median : 0.00000   Median :   0.000  
##  Mean   :  12.57   Mean   :  1.249   Mean   : 0.00733   Mean   :   0.798  
##  3rd Qu.:  13.05   3rd Qu.:  2.000   3rd Qu.: 0.00000   3rd Qu.:   1.000  
##  Max.   :9825.04   Max.   :108.100   Max.   :82.50000   Max.   :3452.340  
##  NA's   :1         NA's   :1         NA's   :1          NA's   :1         
##    Trip.Total         Payment.Type  
##  $7.25  : 2010   Cash       :60760  
##  $6.25  : 1908   Credit Card:38322  
##  $3.25  : 1889   Dispute    :   58  
##  $6.65  : 1762   No Charge  :  622  
##  $8.25  : 1729   Pcard      :   18  
##  (Other):90700   Prcard     :    6  
##  NA's   :    1   Unknown    :  213  
##                                      Company      Pickup.Centroid.Latitude
##  Taxi Affiliation Services               :29911   Min.   :41.66           
##  Dispatch Taxi Affiliation               : 9417   1st Qu.:41.88           
##  Blue Ribbon Taxi Association Inc.       : 6766   Median :41.89           
##  Choice Taxi Association                 : 5185   Mean   :41.90           
##  Chicago Elite Cab Corp. (Chicago Carriag: 5091   3rd Qu.:41.92           
##  (Other)                                 : 8218   Max.   :42.02           
##  NA's                                    :35411   NA's   :15533           
##  Pickup.Centroid.Longitude                 Pickup.Centroid.Location
##  Min.   :-87.91            POINT (-87.632746 41.880994): 8572      
##  1st Qu.:-87.66            POINT (-87.620993 41.884987): 5034      
##  Median :-87.63            POINT (-87.633308 41.899602): 3850      
##  Mean   :-87.66            POINT (-87.626215 41.892508): 3832      
##  3rd Qu.:-87.63            POINT (-87.631864 41.892042): 3692      
##  Max.   :-87.54            (Other)                     :59486      
##  NA's   :15533             NA's                        :15533      
##  Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
##  Min.   :41.67             Min.   :-87.91            
##  1st Qu.:41.88             1st Qu.:-87.66            
##  Median :41.89             Median :-87.63            
##  Mean   :41.90             Mean   :-87.66            
##  3rd Qu.:41.92             3rd Qu.:-87.63            
##  Max.   :42.02             Max.   :-87.53            
##  NA's   :17376             NA's   :17376             
##                 Dropoff.Centroid..Location Community.Areas
##  POINT (-87.632746 41.880994): 7644        Min.   : 1.00  
##  POINT (-87.620993 41.884987): 4412        1st Qu.:37.00  
##  POINT (-87.626215 41.892508): 3073        Median :37.00  
##  POINT (-87.631864 41.892042): 3072        Mean   :41.18  
##  POINT (-87.655998 41.944227): 2850        3rd Qu.:38.00  
##  (Other)                     :61572        Max.   :77.00  
##  NA's                        :17376        NA's   :15533
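
In addition to summary(), a few base functions report the dimensions and column types directly; a minimal sketch:

dim(mydata)   # number of rows and columns
nrow(mydata)  # rows only
ncol(mydata)  # columns only
str(mydata)   # compact view of each column and its data type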

3C) Write a description of any relational business logic and field integrity

Relational integrity lies in the money-related fields: Fare, Tips, Tolls, and Extras should add up to the Trip Total, so a record whose individual costs sum to more or less than the total would violate that business rule. Field integrity is visible in that each field holds values in the correct format, such as full words for the company name and currency values for the costs.
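
A minimal sketch of checking that money-related rule, assuming Trip Total still carries the "$" and "," characters shown in the summary above:

# Parse Trip Total and compare it with the sum of its components
trip_total <- as.numeric(gsub("[$,]", "", as.character(mydata$Trip.Total)))
components <- mydata$Fare + mydata$Tips + mydata$Tolls + mydata$Extras
# Count records where the parts differ from the total by more than a cent
sum(abs(trip_total - components) > 0.01, na.rm = TRUE)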

3D) Draw a star schema of the Chicago Taxi Data

knitr::include_graphics("data/img2.png")