ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEEDS TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

3. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODE.

4. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON RPUBS, MAKE SURE TO RUN EACH CODE CHUNK BY CLICKING THE GREEN ARROW AT THE RIGHT OF THE CHUNK.

5. FULL POINTS WILL ONLY BE GIVEN FOR TASKS WHOSE CODE RUNS SUCCESSFULLY (ERROR-FREE) AND, FOR THE LAST TASK, ONLY IF YOU ANSWER BOTH QUESTIONS CORRECTLY.

TASK 1: Repeat STEP I1 to read file Scoring.csv and extract its Age column.

mydata = read.csv(file="data/Scoring.csv")
myage = mydata$Age

TASK 2: Repeat STEP I2 to calculate the mean and spread and display their values.

meanAge = mean(myage)
meanAge
## [1] 37.08412
spreadAge = sd(myage)
spreadAge
## [1] 10.98637

TASK 3: Repeat STEP I3 to calculate the upper and lower thresholds and display their values.

upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge
## [1] 70.04322
lowerTAge
## [1] 4.125023

TASK 4: Repeat STEP I4 to calculate the maximum and minimum and display their values.

maxAge = max(myage)
maxAge
## [1] 68
minAge = min(myage)
minAge
## [1] 18

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC.

QUESTION IA: Are there any outliers on the upper side of the Age data? ANSWER IA: No, because the maximum value (68) does not exceed the upper threshold (70.04322).

QUESTION IB: Are there any outliers on the lower side of the Age data? ANSWER IB: No, because the minimum value (18) is not below the lower threshold (4.125023).

QUESTION IC: How many outliers are there and which data (if any)? ANSWER IC: There are zero outliers, since no Age value falls above the upper threshold or below the lower threshold.
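The same conclusion can also be checked programmatically. Below is a minimal sketch (assuming the myage, upperTAge, and lowerTAge objects created in TASKS 1-3; the name outliersAge is just for illustration) that lists and counts any Age values outside the thresholds.

outliersAge = myage[myage > upperTAge | myage < lowerTAge]  #Values outside the two thresholds.
outliersAge                                                 #Display any flagged values (none expected here).
length(outliersAge)                                         #Count the outliers.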

=============================================

PART II: DATA PREPARATION

We use the file creditriskorg.csv to provide an example of data preparation. This is the original dataset and it contains missing values. We will focus on the data in the Checking column, which has several problems. First, the column contains missing values, which is quite common since most datasets are not perfect. Second, the values contain commas, and R does not recognize that 1,234 is equivalent to 1234. Lastly, the values contain ‘$’ symbols, which are not numeric either. We need to prepare the Checking column to fix all of these problems before we can analyze the Checking data. In this case, the analysis we want to do is finding the mean of the Checking data.
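As a small illustration of the problem (not one of the assignment steps), this is what R does when asked to treat such a value as a number:

as.numeric("$1,234")   #Returns NA with a coercion warning because of the '$' and ','.
as.numeric("1234")     #Returns 1234 once the symbols are removed.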

Below are the steps to prepare the Checking data.

First, we read the file that contains the Checking column and then extract that column.

#STEP II1a: Read the file creditriskorg.csv 
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names have been shifted down by one row because of an empty first line in the file. So we must use the skip argument and set header to TRUE when reading the file.

#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low
#STEP II1c: Extract the Checking column.
checking = newdata$Checking
#STEP II1d: Calculate the mean for Checking using the mean() function. What happens when we try to use the function?
meanChecking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA

There’s a warning and the result is NA. This is caused by the problems mentioned above (under PART II): in R, the comma (,) and dollar ($) signs aren’t recognized as part of numeric values. Therefore, we have to remove them.

In order to remove the comma and dollar signs in a value such as “$1,234”, we substitute them with an empty string. The sub() function replaces the first occurrence of a pattern in each value with something else, here an empty string.

#STEP II2: substitute comma and dollar signs with blank in all of checking data. 
checking = sub(",","",checking)   #Substitute comma with blank.
checking = sub("\\$","",checking) #Substitute $ with blank.
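Note that sub() replaces only the first match in each value, which is enough here because each Checking amount contains at most one comma and one dollar sign. For values with several commas (for example "$1,234,567"), gsub() would be needed instead; a small illustration:

gsub("[,$]", "", "$1,234,567")   #gsub() removes every ',' and '$', returning "1234567".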

In order to calculate the mean, we convert the cleaned Checking values to numeric (entries that cannot be parsed become NA) and then exclude the NA values when computing the mean.

#STEP II3: Convert values to numeric, handle any NA, and calculate the mean.
checking = as.numeric(checking)             #Convert values to numeric; unparsable entries become NA.
## Warning: NAs introduced by coercion
meanChecking = mean(checking,na.rm=TRUE)    #Compute the mean value.   
meanChecking                                #Display the mean value.
## [1] 2559.805
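Optionally, we can check how many Checking values actually became NA (and were therefore excluded by na.rm=TRUE):

sum(is.na(checking))   #Number of Checking values that are missing after the conversion.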

Use a different file, scoring_original.csv, and complete the following four tasks.

TASK 6: Repeat STEPs II1a and II1c to read file scoring_original.csv and extract its Price column.

newdata = read.csv(file="data/scoring_original.csv")
head(newdata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769
price = newdata$Price

TASK 7: Repeat STEP II2 to remove the comma and $ signs in the Price data.

price = sub(",","",price)
price = sub("\\$","",price)

TASK 8: Repeat STEP II3 to remove any NA, calculate the mean of Price data, and display the mean value.

price = as.numeric(price)
## Warning: NAs introduced by coercion
meanPrice = mean(price,na.rm=TRUE)
meanPrice
## [1] 1462.48

TASK 9: Answer the following questions by referring to the lecture slides and other resources:

Question IIA: What are some other ways to clean this data?

Answer: Other approaches include data cleansing, transformation, outlier detection, imputation of missing values, data reduction, abstraction, filtering, and deriving new attributes.
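As one concrete illustration, here is a minimal sketch of mean imputation applied to the Checking data prepared in PART II (it assumes the numeric checking vector from STEP II3; the name checkingImputed is just for illustration, and median imputation would work the same way).

checkingImputed = checking                                               #Copy the cleaned Checking data.
checkingImputed[is.na(checkingImputed)] = mean(checking, na.rm=TRUE)     #Replace each NA with the mean of the observed values.
mean(checkingImputed)                                                    #The mean is unchanged, but no NA values remain.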

Question IIB: How does Excel treat the missing values and the “$” symbols? Answer: Excel treats the missing values as non-existent with the ### symbols. The “$” symbol is what helps Excel read the data as a price. Values without the “$” symbol are treated as missing data.


==================================

PART III: DATA MODELING

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open the Taxi_Trips_Sample.csv file located in the data folder in RStudio or Excel. Answer these questions: What is the size of the file? How many fields and records does the file have? What is the primary key? What are the other attributes’ names (give at least three)?

The size of the file is 38,727 KB. It has 100,000 records and 24 fields. The primary key is Trip ID. Other attributes include Trip Miles, Tips, and Tolls.
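These figures can also be verified in R. Below is a sketch, assuming the file sits at data/Taxi_Trips_Sample.csv relative to the working directory; note that file.size() reports bytes, so the value may differ slightly from the KB figure shown by the operating system.

taxi = read.csv(file="data/Taxi_Trips_Sample.csv")  #Read the sample taxi data (assumed path).
file.size("data/Taxi_Trips_Sample.csv")             #File size in bytes.
nrow(taxi)                                          #Number of records.
ncol(taxi)                                          #Number of fields.
names(taxi)[1:4]                                    #A few attribute names, including the primary key.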

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’ in relation to other field(s).

A relational business logic integrity check for ‘Trip Seconds’ would flag records where ‘Trip Seconds’ is greater than zero but ‘Trip Miles’ is zero. Another check would flag records where the difference between ‘Trip End Timestamp’ and ‘Trip Start Timestamp’ (converted to seconds) does not agree with ‘Trip Seconds’.
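A sketch of how these checks could be written in R is shown below. The file path, the column names Trip.Seconds, Trip.Miles, Trip.Start.Timestamp, and Trip.End.Timestamp, and the timestamp format are assumptions and may need to be adjusted to the actual file; the tolerance in the second check allows for the rounding of the published timestamps.

taxi = read.csv(file="data/Taxi_Trips_Sample.csv")                            #Assumed path to the sample file.
#Check 1: a trip that reports time but no distance is suspicious.
check1 = taxi$Trip.Seconds > 0 & taxi$Trip.Miles == 0
sum(check1, na.rm=TRUE)                                                       #Number of records flagged by check 1.
#Check 2: the recorded duration should agree with the start and end timestamps.
start = as.POSIXct(taxi$Trip.Start.Timestamp, format="%m/%d/%Y %I:%M:%S %p")  #Assumed timestamp format.
end   = as.POSIXct(taxi$Trip.End.Timestamp,   format="%m/%d/%Y %I:%M:%S %p")
check2 = abs(as.numeric(difftime(end, start, units="secs")) - taxi$Trip.Seconds) > 900
sum(check2, na.rm=TRUE)                                                       #Records flagged by check 2 (15-minute tolerance).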

TASK 10 C:

Using https://erdplus.com/#/standalone, draw a star-like schema using at least the following tables, and insert the image of the star schema in this file. Use the fields in the Taxi_Trips_Sample.csv file to determine the required attributes for each dimension.

==================================

EXTRA information:

Another method for finding upper and lower thresholds, discussed in introductory statistics courses, involves the interquartile range. Follow along below to see how we first calculate the interquartile range.