ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

Question IIA: What are some other ways to clean this data?

Another way to clean the data is to make sure the data is accurate. We can do this by cross checking fields that have relationships with each other and make sure they are correct. #Question IIB: How does Excel treat the missing values and the “$” symbols? Missing values in excel are treated as 0. ————-

PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of columns and of rows does the file have? What is the primary key? What are other attributes’ names?

The size of the file is 38,727 KB. There are a total of 100,000 rows and 24 columns. The primary key in Trip ID. Some of the attributes are Taxi ID, distance traveled, Taxi company, payment type, and total fare.

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’. A way to check this information is to cross reference it with the pickup and dropoff locations. If the trip seconds is 0 and there was a distance that was traveled then their is a relationship integrity violation. ## TASK 10 C: Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

  • A Fact table for Trip
  • A Dimension table for Taxi Company with its relevant attributes
  • A Dimension table for Payment Type with its relevant attributes
  • A Dimension table for Total Fare with its relevant attributes

#================================== #EXTRA information: Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   26   32   41   73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 3.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age)