ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

Question IIA: What are some other ways to clean this data?

You can clean it up by replacing commas and dollar signs with blanks. Additionally, the “as.numeric” function allows the removal of blank values. #Question IIB: How does Excel treat the missing values and the “\(" symbols? It is not a numeric value so if there are "\)” in the data it will not be able to compute the mean or do various other functions.


PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of columns and of rows does the file have? What is the primary key? What are other attributes’ names?

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’.

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

  • A Fact table for Trip
  • A Dimension table for Taxi Company with its relevant attributes
  • A Dimension table for Payment Type with its relevant attributes
  • A Dimension table for Total Fare with its relevant attributes
c

c

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 2.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age)