Business Discovery and Data Sourcing

Notebook Instructions

For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse a collection of R packages designed for data science.

Info: https://www.tidyverse.org/

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Task 1: Data Prepartion and Cleaning

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()

Extract the assigned features (columns) to perform some analytics

mydata <- read.csv("Taxi_Trips_Sample.csv")
mydata = na.omit(mydata)
head(mydata)

##                                     Trip.ID
## 1  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2  3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4  3e7d6efe43222b0ebc698583916674c648dd4520
## 5  3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8  3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
##                                                                                                                             Taxi.ID
## 1  b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2  bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4  0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5  e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8  5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
##      Trip.Start.Timestamp     Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1  08/19/2014 02:45:00 PM 08/19/2014 02:45:00 PM          480       0.15
## 2  09/23/2013 05:15:00 PM 09/23/2013 05:15:00 PM          420       0.00
## 4  05/10/2013 08:00:00 PM 05/10/2013 08:45:00 PM         2340      13.80
## 5  02/21/2016 07:15:00 PM 02/21/2016 07:15:00 PM          300       0.70
## 8  03/08/2014 03:45:00 PM 03/08/2014 04:30:00 PM         2220      13.30
## 14 05/05/2014 09:30:00 AM 05/05/2014 09:45:00 AM          900       0.10
##    Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1          17031280100          17031839100                    28
## 2          17031081800          17031281900                     8
## 4          17031980000          17031060400                    76
## 5          17031081500          17031081500                     8
## 8          17031980100          17031242300                    56
## 14         17031832600          17031081401                     7
##    Dropoff.Community.Area   Fare  Tips Tolls Extras Trip.Total
## 1                      32  $7.05 $0.00 $0.00  $1.50      $8.55
## 2                      28  $6.05 $0.00 $0.00  $0.00      $6.05
## 4                       6 $31.25 $0.00 $0.00  $3.00     $34.25
## 5                       8  $5.50 $0.00 $0.00  $0.00      $5.50
## 8                      24 $30.45 $6.49 $0.00  $2.00     $38.94
## 14                      8 $10.45 $0.00 $0.00  $0.00     $10.45
##    Payment.Type                   Company Pickup.Centroid.Latitude
## 1          Cash                                           41.88530
## 2          Cash Taxi Affiliation Services                 41.89322
## 4          Cash                                           41.97907
## 5          Cash                                           41.89251
## 8   Credit Card                                           41.78600
## 14         Cash Taxi Affiliation Services                 41.91475
##    Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                  -87.64281   POINT (-87.642808 41.8853)
## 2                  -87.63784 POINT (-87.637844 41.893216)
## 4                  -87.90304  POINT (-87.90304 41.979071)
## 5                  -87.62621 POINT (-87.626215 41.892508)
## 8                  -87.75093 POINT (-87.750934 41.785999)
## 14                 -87.65401 POINT (-87.654007 41.914747)
##    Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                   41.88099                  -87.63275
## 2                   41.87926                  -87.64265
## 4                   41.95067                  -87.66654
## 5                   41.89251                  -87.62621
## 8                   41.89951                  -87.67960
## 14                  41.89503                  -87.61971
##      Dropoff.Centroid..Location Community.Areas
## 1  POINT (-87.632746 41.880994)              29
## 2  POINT (-87.642649 41.879255)              37
## 4  POINT (-87.666536 41.950673)              75
## 5  POINT (-87.626215 41.892508)              37
## 8    POINT (-87.6796 41.899507)              53
## 14 POINT (-87.619711 41.895033)              68

fare = mydata$Fare
tips = mydata$Tips
tolls = mydata$Tolls
extras = mydata$Extras
head(fare)

## [1] $7.05  $6.05  $31.25 $5.50  $30.45 $10.45
## 892 Levels:  $0.00 $0.01 $0.03 $0.05 $0.10 $0.11 $0.28 $0.30 ... $99.99

head(tips)

## [1] $0.00 $0.00 $0.00 $0.00 $6.49 $0.00
## 1081 Levels:  $0.00 $0.01 $0.02 $0.03 $0.04 $0.05 $0.06 $0.08 ... $9.98

head(tolls)

## [1] $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
## 25 Levels:  $0.00 $0.60 $0.95 $1.00 $1.11 $1.20 $1.50 $1.90 ... $82.50

head(extras)

## [1] $1.50 $0.00 $3.00 $0.00 $2.00 $0.00
## 183 Levels:  $0.00 $0.01 $0.03 $0.08 $0.10 $0.11 $0.20 $0.34 ... $9.50

Find the average for each assigned feature

** What happens when we try to use the function? **

mean(fare)

## Warning in mean.default(fare): argument is not numeric or logical:
## returning NA

## [1] NA

mean(tips)

## Warning in mean.default(tips): argument is not numeric or logical:
## returning NA

## [1] NA

mean(tolls)

## Warning in mean.default(tolls): argument is not numeric or logical:
## returning NA

## [1] NA

mean(extras)

## Warning in mean.default(extras): argument is not numeric or logical:
## returning NA

## [1] NA

To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.

1B) Substitute the characters causing the issue with empty character (“”) on all features

** substitute comma with “” **

fare_clean <- sub(pattern = "," , replacement = "", x = fare)
head(fare_clean)

## [1] "$7.05"  "$6.05"  "$31.25" "$5.50"  "$30.45" "$10.45"

tips_clean <- sub(pattern = "," , replacement =  "", x = tips)
head(tips_clean)

## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$6.49" "$0.00"

tolls_clean <- sub(pattern = "," , replacement =  "", x = tolls)
head(tolls_clean)

## [1] "$0.00" "$0.00" "$0.00" "$0.00" "$0.00" "$0.00"

extras_clean <- sub(pattern = "," , replacement =  "", x = extras)
head(extras_clean)

## [1] "$1.50" "$0.00" "$3.00" "$0.00" "$2.00" "$0.00"

** substitute dollar sign with “” **

fare_clean <- sub(pattern = "\\$" , replacement =  "", x = fare_clean)
head(fare_clean)

## [1] "7.05"  "6.05"  "31.25" "5.50"  "30.45" "10.45"

tips_clean <- sub(pattern = "\\$" , replacement =  "", x = tips_clean)
head(tips_clean)

## [1] "0.00" "0.00" "0.00" "0.00" "6.49" "0.00"

tolls_clean <- sub(pattern = "\\$" , replacement =  "", x = tolls_clean)
head(tolls_clean)

## [1] "0.00" "0.00" "0.00" "0.00" "0.00" "0.00"

extras_clean <- sub(pattern = "\\$" , replacement =  "", x = extras_clean)
head(extras_clean)

## [1] "1.50" "0.00" "3.00" "0.00" "2.00" "0.00"

1C) Change the features data type from character to numeric

** character to numeric - as.numeric()**

fare_clean <- as.numeric(fare_clean)
tips_clean <- as.numeric(tips_clean)
tolls_clean <- as.numeric(tolls_clean)
extras_clean <- as.numeric(extras_clean)

** mean with NA removed **

fare_clean <- mean(fare_clean , na.rm = TRUE)
fare_clean

## [1] 10.95067

tips_clean <- mean(tips_clean , na.rm = TRUE)
tips_clean

## [1] 1.096804

tolls_clean <- mean(tolls_clean , na.rm = TRUE)
tolls_clean

## [1] 0.005630209

extras_clean <- mean(extras_clean , na.rm = TRUE)
extras_clean

## [1] 0.7317954

1D) Assign new clean variables to dataset and save your clean new data

** Clean variables to dataset **

mydata$Fare <- fare
mydata$Tips <- tips
mydata$Tolls <- tolls
mydata$Extras <- extras
head(mydata)

##                                     Trip.ID
## 1  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2  3e7d6e5c4e87f01a475c8200b33777e85497da89
## 4  3e7d6efe43222b0ebc698583916674c648dd4520
## 5  3e7d6f001e9bcda8478a489cb53293d26328ac85
## 8  3e7d6f35332ed1069a218b63cef47c3be42a26fb
## 14 3e7d706fe60689dc646772c09e128e19ab022472
##                                                                                                                             Taxi.ID
## 1  b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2  bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 4  0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5  e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 8  5cb1d0f30a10dc0c80525ef9cbf1fd1a0f2c57e7ae2a5efca20c69229e52ada226b39071f69d9e8bef81722dc96234dca57b26694b6ec4773eb79b1beeac5c83
## 14 631de3a4ad35b7d7d75949b33ad59e09820b51b3f58225df9f3f1c9aa65554e97ba9d6f868c94b5c2f170cf086a0ca7089dc2f6dab2c9096283f482659071447
##      Trip.Start.Timestamp     Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1  08/19/2014 02:45:00 PM 08/19/2014 02:45:00 PM          480       0.15
## 2  09/23/2013 05:15:00 PM 09/23/2013 05:15:00 PM          420       0.00
## 4  05/10/2013 08:00:00 PM 05/10/2013 08:45:00 PM         2340      13.80
## 5  02/21/2016 07:15:00 PM 02/21/2016 07:15:00 PM          300       0.70
## 8  03/08/2014 03:45:00 PM 03/08/2014 04:30:00 PM         2220      13.30
## 14 05/05/2014 09:30:00 AM 05/05/2014 09:45:00 AM          900       0.10
##    Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1          17031280100          17031839100                    28
## 2          17031081800          17031281900                     8
## 4          17031980000          17031060400                    76
## 5          17031081500          17031081500                     8
## 8          17031980100          17031242300                    56
## 14         17031832600          17031081401                     7
##    Dropoff.Community.Area   Fare  Tips Tolls Extras Trip.Total
## 1                      32  $7.05 $0.00 $0.00  $1.50      $8.55
## 2                      28  $6.05 $0.00 $0.00  $0.00      $6.05
## 4                       6 $31.25 $0.00 $0.00  $3.00     $34.25
## 5                       8  $5.50 $0.00 $0.00  $0.00      $5.50
## 8                      24 $30.45 $6.49 $0.00  $2.00     $38.94
## 14                      8 $10.45 $0.00 $0.00  $0.00     $10.45
##    Payment.Type                   Company Pickup.Centroid.Latitude
## 1          Cash                                           41.88530
## 2          Cash Taxi Affiliation Services                 41.89322
## 4          Cash                                           41.97907
## 5          Cash                                           41.89251
## 8   Credit Card                                           41.78600
## 14         Cash Taxi Affiliation Services                 41.91475
##    Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                  -87.64281   POINT (-87.642808 41.8853)
## 2                  -87.63784 POINT (-87.637844 41.893216)
## 4                  -87.90304  POINT (-87.90304 41.979071)
## 5                  -87.62621 POINT (-87.626215 41.892508)
## 8                  -87.75093 POINT (-87.750934 41.785999)
## 14                 -87.65401 POINT (-87.654007 41.914747)
##    Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                   41.88099                  -87.63275
## 2                   41.87926                  -87.64265
## 4                   41.95067                  -87.66654
## 5                   41.89251                  -87.62621
## 8                   41.89951                  -87.67960
## 14                  41.89503                  -87.61971
##      Dropoff.Centroid..Location Community.Areas
## 1  POINT (-87.632746 41.880994)              29
## 2  POINT (-87.642649 41.879255)              37
## 4  POINT (-87.666536 41.950673)              75
## 5  POINT (-87.626215 41.892508)              37
## 8    POINT (-87.6796 41.899507)              53
## 14 POINT (-87.619711 41.895033)              68

** Save clean data **

write_csv(mydata, path = "data/mydata_clean.csv")

Task 2: Outlier Detection

Commands: mean() sd() min() max() quantile()

In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

trip_seconds = mydata$Trip.Seconds
head(trip_seconds)

## [1]  480  420 2340  300 2220  900

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

mean(trip_seconds)

## [1] 720.8911

** calculate the standard deviation **

sd(trip_seconds)

## [1] 1005.201

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

min(trip_seconds)

## [1] 0

** calculate the max **

max(trip_seconds)

## [1] 74340

2C) Quantile calculation to find outliers of the assigned feature (column)

To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the give feature**

quantile(trip_seconds)

##    0%   25%   50%   75%  100% 
##     0   300   540   840 74340

** Lower and upper quantile calculation **

lowerq = quantile(trip_seconds)[2]
lowerq

## 25% 
## 300

upperq = quantile(trip_seconds)[4]
upperq

## 75% 
## 840

Interquantile calculation

iqr = upperq - lowerq
iqr

## 75% 
## 540

2D) Finding upper and lower thresholds

The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculation the upper threshold **

upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##  75% 
## 1650

** Calculation the lower threshold **

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##  25% 
## -510

** Identify outliers **

trip_seconds[ trip_seconds > upper_threshold][1:10]

##  [1] 2340 2220 3300 1680 1860 1980 3360 4500 2760 1860

trip_seconds[ trip_seconds < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

** Finding outliers records **

count(mydata[ trip_seconds > upper_threshold, ][1:10])

## # A tibble: 1 x 1
##       n
##   <int>
## 1  4837

count(mydata[ trip_seconds < lower_threshold, ][1:10])

## # A tibble: 1 x 1
##       n
##   <int>
## 1     0

Are there any outliers, if not explain the lack of outliers? if any explain what the outliers represent and how many records are outliers? Use the count() function to find the number of outliers.

There are 4837 outliers, all above the upper threshold. Outliers are irregular data points that skew from the regular pattern of the rest of the data.

Plotting Outliers

It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds

p <- ggplot(data = mydata, aes(x = "", y = trip_seconds)) + geom_boxplot() + coord_flip()
p

Task 3: Data Modeling - Chicago Taxi Data

3A) Go and explore the interactive dashboard. Take a screenshot of something that you find interesting in the dashboard and write a short summary of your findings

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released

knitr::include_graphics("img01.png")

This graph confirms that the expectation that most taxi trips are under 10 miles. Traveling 100 miles plus would be a bad economic choice and would be more time consuming than taking a plane; that is why there is a huge drop in between 10 and 100 miles.

3B) Write the Metadata for the Taxi data (description), note the size of the file its dimension (number of rows and columns). Use the function summary() to help you identify unique entities, fields, number of rows and data types.

summary(mydata)

##                                      Trip.ID     
##  3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed:    1  
##  3e7d6e5c4e87f01a475c8200b33777e85497da89:    1  
##  3e7d6efe43222b0ebc698583916674c648dd4520:    1  
##  3e7d6f001e9bcda8478a489cb53293d26328ac85:    1  
##  3e7d6f35332ed1069a218b63cef47c3be42a26fb:    1  
##  3e7d706fe60689dc646772c09e128e19ab022472:    1  
##  (Other)                                 :60705  
##                                                                                                                              Taxi.ID     
##  4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24:   45  
##  1b2865284986db761b656ee64ad35fa46df8f7da125dde86ddb79d798917d2a8ac1bc1e7865a340f797d7cc7ba275d9397e4ce03c88da96b088fab3088c07dc8:   38  
##  ad33dffdd6cd00795ea1a00a6a6db1a38482075d532b55e712741e9b4a2541375fcf642d01d35b51646d2a07b49376f167b5ddd1f7b0a5354afe07f514108365:   37  
##  0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703:   36  
##  21e6058fef096e0d45f7b9d47433974c9c4e1820c3e5bee61670db5e58eb32e58432f1b6cf09adf73fe8a36a25cfe880f2e797158b2c890ec52603c3635d759f:   36  
##  2a29bcc02d98eb9e748a8427ffd773934047f77e20f70325b01f515ba28cb6e379194d8d65b7783c6791f390ea0d9833daf89a537be9be1db5b81e1db23a07c5:   36  
##  (Other)                                                                                                                         :60483  
##              Trip.Start.Timestamp              Trip.End.Timestamp
##  02/05/2015 07:15:00 PM:    7     07/25/2014 06:45:00 PM:    8   
##  02/27/2015 08:45:00 AM:    7     02/28/2014 08:45:00 PM:    7   
##  04/25/2014 06:45:00 PM:    7     03/11/2016 08:15:00 PM:    7   
##  05/17/2014 11:30:00 PM:    7     02/05/2015 07:45:00 PM:    6   
##  07/25/2014 06:45:00 PM:    7     02/07/2014 09:00:00 PM:    6   
##  02/07/2014 08:45:00 PM:    6     04/10/2015 09:45:00 AM:    6   
##  (Other)               :60670     (Other)               :60671   
##   Trip.Seconds       Trip.Miles       Pickup.Census.Tract
##  Min.   :    0.0   Min.   :   0.000   Min.   :1.703e+10  
##  1st Qu.:  300.0   1st Qu.:   0.000   1st Qu.:1.703e+10  
##  Median :  540.0   Median :   0.900   Median :1.703e+10  
##  Mean   :  720.9   Mean   :   2.478   Mean   :1.703e+10  
##  3rd Qu.:  840.0   3rd Qu.:   1.900   3rd Qu.:1.703e+10  
##  Max.   :74340.0   Max.   :1770.000   Max.   :1.703e+10  
##                                                          
##  Dropoff.Census.Tract Pickup.Community.Area Dropoff.Community.Area
##  Min.   :1.703e+10    Min.   : 1.00         Min.   : 1.00         
##  1st Qu.:1.703e+10    1st Qu.: 8.00         1st Qu.: 8.00         
##  Median :1.703e+10    Median : 8.00         Median : 8.00         
##  Mean   :1.703e+10    Mean   :21.58         Mean   :21.19         
##  3rd Qu.:1.703e+10    3rd Qu.:32.00         3rd Qu.:32.00         
##  Max.   :1.703e+10    Max.   :77.00         Max.   :77.00         
##                                                                   
##       Fare            Tips           Tolls           Extras     
##  $6.25  : 2162   $0.00  :40362   $0.00  :60694   $0.00  :34234  
##  $5.25  : 2041   $2.00  : 7355   $50.00 :    5   $1.00  :13299  
##  $5.85  : 1835   $3.00  : 2472   $1.50  :    3   $2.00  : 6253  
##  $5.65  : 1829   $1.00  : 2234   $2.00  :    2   $1.50  : 3483  
##  $5.45  : 1817   $4.00  :  595   $3.00  :    2   $3.00  : 1376  
##  $6.05  : 1804   $5.00  :  520          :    1   $4.00  :  657  
##  (Other):49223   (Other): 7173   (Other):    4   (Other): 1409  
##    Trip.Total         Payment.Type  
##  $7.25  : 1540   Cash       :39135  
##  $6.25  : 1455   Credit Card:21147  
##  $6.65  : 1361   Dispute    :   37  
##  $7.05  : 1253   No Charge  :  305  
##  $6.45  : 1251   Pcard      :    9  
##  $7.45  : 1228   Prcard     :    3  
##  (Other):52623   Unknown    :   75  
##                               Company      Pickup.Centroid.Latitude
##                                   :20771   Min.   :41.74           
##  Taxi Affiliation Services        :20403   1st Qu.:41.88           
##  Dispatch Taxi Affiliation        : 6685   Median :41.89           
##  Blue Ribbon Taxi Association Inc.: 4589   Mean   :41.90           
##  Choice Taxi Association          : 3784   3rd Qu.:41.90           
##  Northwest Management LLC         : 2236   Max.   :42.02           
##  (Other)                          : 2243                           
##  Pickup.Centroid.Longitude                 Pickup.Centroid.Location
##  Min.   :-87.90            POINT (-87.632746 41.880994): 8461      
##  1st Qu.:-87.64            POINT (-87.620993 41.884987): 4963      
##  Median :-87.63            POINT (-87.626215 41.892508): 3787      
##  Mean   :-87.65            POINT (-87.631864 41.892042): 3647      
##  3rd Qu.:-87.62            POINT (-87.90304 41.979071) : 3016      
##  Max.   :-87.58            POINT (-87.642649 41.879255): 2780      
##                            (Other)                     :34057      
##  Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
##  Min.   :41.74             Min.   :-87.90            
##  1st Qu.:41.88             1st Qu.:-87.64            
##  Median :41.89             Median :-87.63            
##  Mean   :41.90             Mean   :-87.65            
##  3rd Qu.:41.90             3rd Qu.:-87.62            
##  Max.   :42.02             Max.   :-87.58            
##                                                      
##                 Dropoff.Centroid..Location Community.Areas
##  POINT (-87.632746 41.880994): 7602        Min.   : 1.00  
##  POINT (-87.620993 41.884987): 4394        1st Qu.:37.00  
##  POINT (-87.631864 41.892042): 3062        Median :37.00  
##  POINT (-87.626215 41.892508): 3057        Mean   :40.71  
##  POINT (-87.642649 41.879255): 2696        3rd Qu.:38.00  
##  POINT (-87.90304 41.979071) : 2435        Max.   :76.00  
##  (Other)                     :37465

The data is about all the variables that go into taxi rides and the data is both quantitative and qualitative. There are 24 columns and 10,000 rows including the column titles. The data provides valuable data that could be used to look for patterns and correlations amongst the variables included. Some of the data includes miles driven, tips, tolls, extras, latitude and longitude, and much more.

3C) Write a description of any relational business logic and field integrity

There are many relationships that can be drawn from the data, but the most noteworthy would be the relationship between costs. The total cost relies on the fare, tips, tolls, and extras. The value of these variables cumulatively make up the total cost. Another important relationship would be the fare and both the trip miles and the trip seconds. Taxis’ payment method is based on the total distance and the total time in the taxi.

3D) Draw a star schema of the Chicago Taxi Data

Modeling Tool: https://erdplus.com/#/standalone

knitr::include_graphics("erdplus-diagram.png")