ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “$” symbols?

# Task 6:
#STEP II1a: 
newdata = read.csv(file="data/scoring_original.csv")
head(newdata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769
#STEP II1b:
#newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
#head(newdata)

#STEP II1c:
price = newdata$Price

#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
#meanChecking = mean(checking)
# Task 7:
#STEP II2: substitute comma, dollar sign with blank in all of checking data. 
price = sub(",","", price) #Substitute comma with blank
price = sub("\\$","", price) #Substitute $ with blank
# Task 8:
#STEP II3: Convert values to numeric to remove any NA and calculate the mean
price = as.numeric(price) #Convert values to numeric to remove any NA
## Warning: NAs introduced by coercion
meanPrice = mean(price,na.rm=TRUE)
meanPrice
## [1] 1462.48

Task 9:

Question IIA: To clean the data we can delete the blank data, duplicates, miscode, bad business logic, or “dirty” records.

Question IIB: Excel treats the missing values as zeros and treats the “$” symbols as a character.


PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? 38,727 KB

How many number of columns and of rows does the file have? 24 columns & 100,000 rows.

What is the primary key? Trip ID

What are other attributes’ names? Taxi ID, trip start timestamp, trip end time stamp, trip seconds, trip miles, pick up census tract, dropoff census tract, pickup community area, dropoff community area, fare, tips, tolls, estras, trip total, payment type, company, pickup centroid latitude, pickup centroid longitude, pickup centroid location, dropoff centroid latitude, dropoff centroid longitude, dropoff centroid location, community areas.

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’. Trip Seconds and Total Fares. If the trips are longer, the fares should be higher; if the trips are shorter, the fares should be lower. And if the trip is 0 seconds, there should not be a charge at all since the trip didn’t happen.

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

  • A Fact table for Trip
  • A Dimension table for Taxi Company with its relevant attributes
  • A Dimension table for Payment Type with its relevant attributes
  • A Dimension table for Total Fare with its relevant attributes

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 2.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age)