ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

Task 5

IA: No, the maximum value is 68 which is less than the upperthreshold of 70.043. IB: No, the minimum value is 18 which is larger than the lowerthreshold of 4.125. IC: There are no outliers. =============================================

PART II: DATA PREPARATION

We use file creditriskorg.csv to provide an example related to data preparation. This is the original dataset and contains missing values. We will focus on the data in the Checking column. There are some problems with this data. Firstly: missing values in this column which is quite common as most datasets are not perfect. Secondly: there are commas, while R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly: there are ‘$’ symbols throughout the file which is not a numerical symbol either. We need to prepare the Checking column data to fix all the problems mentioned above, before we can analyze the Checking data. In this case for example we want to find the mean of the Checking data.

Below are the steps to prepare the Checking data.

We must read the file which has the Checking column and extract the Checking column.

#STEP II1a: Read the file creditriskorg.csv 
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.

#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low
#STEP II1c: Extract the Checking colum.
checking = newdata$Checking
#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
meanChecking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA

There’s an error. These errors are related to the problems mentioned above (under PART II).

In order to remove the comma and dollar sign in the number e.g. “$1,234”, we must substitute them with just an empty space. The sub function replaces these symbols with something els, e.g. blank or empty space.

#STEP II2: substitute comma, dollar sign with blank in all of checking data. 
checking = sub(",","",checking)   #Substitute comma with blank
checking = sub("\\$","",checking) #Substitute $ with blank

In order to calculate we have to convert all Checking values to numeric to remove any NA

#STEP II3: Convert values to numeric to remove any NA and calculate the mean
checking = as.numeric(checking) #Convert values to numeric to remove any NA
## Warning: NAs introduced by coercion
meanChecking = mean(checking,na.rm=TRUE)
meanChecking
## [1] 2559.805

You will use a different file: scoring_origial.csv for the following three tasks.

TASK 6: Repeat STEP II1a-II1d to read file scoring-original.csv and extract its Price column

TASK 7: Repeat STEP II2 to remove the comma and $ sign in the Price data

TASK 8: Repeat STEP II3 to remove any NA and calculate the mean of Price

TASK 9: Answer the following questions:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “$” symbols?

#TASK 6 
newdata = read.csv(file="data/scoring_original.csv")
head(newdata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769
price = newdata$Price
#TASK 7 
price = sub(",","",price)   
price = sub("\\$","",price)
#TASK 8 
price = as.numeric(price) 
## Warning: NAs introduced by coercion
meanPrice = mean(price,na.rm=TRUE)
meanPrice
## [1] 1462.48

Task 9

Question IIA: What are some other ways to clean this data?

There are other functions in excel that can be used to clean the data. Examples of these are Remove Duplicates or Find and Replace. #Question IIB: How does Excel treat the missing values and the “$” symbols?
Excel ignores the missing data and $, it sometimes fills empty cells with things such as #N/A or #NULL. ————-

PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of columns and of rows does the file have? What is the primary key? What are other attributes’ names?

The size of the file is 37.8 MB. The file has 24 columns and 99,999 rows. The primary key is Trip ID. Other attributes are Taxi ID, Trip StartTimestamp, Trip End Timestamp, Trip Seconds, Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, Fare, Tips, Tolls, Extras, Trip Total, Payment Type, Company, Pickup centroid Latitude, Pickup centroid Longitute, Pickup Centroid Location, Dropoff Centroid Location, Dropoff centroid Latitude, Dropoff centroid Longitude, Community Areas.

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’.

A relational business logic integrity check would be if the trip seconds attribute exceeds the time allotted between the Trip Start Timestamp and Trip End Timestamp.

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

  • A Fact table for Trip
  • A Dimension table for Taxi Company with its relevant attributes
  • A Dimension table for Payment Type with its relevant attributes
  • A Dimension table for Total Fare with its relevant attributes

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 2.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age)