ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

Question IIA: What are some other ways to clean this data?

You can delete all the blanks, duplicates or miscodes if cannot be imputed

Question IIB: How does Excel treat the missing values and the “$” symbols?

Excel will leave out missing values or treat it like a 0 if its asked to get the average. Excel can recognize the numbers of $ sign. Missing values can be included using the IF function.


PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of columns and of rows does the file have? What is the primary key? What are other attributes’ names?

Size of the file is: 39,656,331 bytes (39.7 MB) Number of rows: 99,999 (excluding attribute row) Number of columns: 24 Primary Key: TripID Attributes: Trip ID Taxi ID Trip Start Timestamp Trip End Timestamp Trip Seconds Trip Miles Pickup Census Tract Dropoff Census Tract Pickup Community Area Dropoff Community Area Fare Tips Tolls Extras Trip Total Payment Type Company Pickup Centroid Latitude Pickup Centroid Longitude Pickup Centroid Location Dropoff Centroid Latitude Dropoff Centroid Longitude Dropoff Centroid Location Community Areas

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’.

There cant be the same trip start time and trip end time, it cant possibly happen so the trip seconds data bust be false

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

  • A Fact table for Trip
  • A Dimension table for Taxi Company with its relevant attributes
  • A Dimension table for Payment Type with its relevant attributes
  • A Dimension table for Total Fare with its relevant attributes

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 2.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age) 
## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"

## Warning in axis(side = side, at = at, labels = labels, ...): no font could
## be found for family "Arial"