A. Remember to always set your working directory to the source file location. Click on Session -> Set Working Directory -> To Source File Location
B. For your assignment you will be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.
We use file creditrisk.csv to provide an example related to data outliers. We use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to detect the outliers is to determine the UPPER and LOWER THRESHOLDS using formula mean (+ or -) 3 * standard deviation.
Below is the steps on how to determine the outliers in the Age data: 1. Read the data file and extract the Age column to get the Age data. 2. Calculate the mean and standard deviation of the data in Age column. 3. Calculate the upper threshold using formula mean + 3 * standard deviation, and calculate the lower threshold using formula mean - 3 * standard deviation. 4. Calculate the maximum and minimum values of the Age data. 5. Compare the maximum and minimum values to the upper and lower thresholds to identify any potential outliers. If the data value falls above the upper threshold or below the lower threshold, it is an outlier. Finally, to determine how many outliers in the Age data, look at Age data by opening the related file, sort the Age column and count the number of data values which are above the upper threshold, and count the number of data values which are below the lower thresholds. Add these two counts to get the number of outliers in the Age data.
Let’s determine the outliers in the Age data using R.
We must write codes to read the file which has the Age column, extract the Age column, calculate the mean, standard deviation, maximum value, and minimum value for the Age column.
#STEP I1: Read the file creditrisk.csv and extract column Age in the file and name the extracted variable as myage.
mydata = read.csv(file="data/creditrisk.csv")
myage = mydata$Age
#STEP I2: Calculate the average or mean and standard deviation or SD of myage (the Age data) and display their values. Refer to Worksheet lab_02 for the correct commands.
meanAge = mean(myage) #Calculate the mean of Age data and hold it in variable meanAge.
meanAge #Display the mean value.
## [1] 34.39765
spreadAge = sd(myage) #Calculate the SD of Age data and hold it in variable spreadAge.
spreadAge #Display the SD value.
## [1] 11.04513
#STEP I3: Calculate the upper and lower thresholds of myage (the Age data) and display their values. The upperTAge is the variable to hold the upper threshold and the lowerTAge is the variable to hold the lower threshold.
upperTAge = meanAge + 3 * spreadAge #Calculate the upper threshold of the Age data.
lowerTAge = meanAge - 3 * spreadAge #Calculate the lower threshold of the Age data.
upperTAge #Display the upper threshold value.
## [1] 67.53302
lowerTAge #Display the lower threshold value.
## [1] 1.262269
#STEP I4: Calculate the maximum and minimum values of myage (the Age data) and display their values. The command to find the maximum is max(variable) and the command to find the minimum is min(variable) where variable in this case is the extracted variable myage.
maxAge = max(myage) #Calculate the maximum value of Age data and hold it in variable maxAge.
maxAge #Display the maximum value.
## [1] 73
minAge = min(myage) #Calculate the minimum value of Age data and hold it in variable minAge.
minAge #Display the minimum value.
## [1] 18
STEP I5: Compare the maximum value with upper threshold and compare the minimum value with lower threshold to determine any potential outliers. QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: YES, since the maximum value is 73 which is bigger than the upper threshold 67.53302.
QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, since the minimum value is 18 which is NOT smaller than the lower threshold.
QUESTION IC: How many outliers are there and which data? ANSWER IC: ONE outlier, where Age is 73.
mydata = read.csv(file="data/Scoring.csv")
myage = mydata$Age
meanAge = mean(myage)
meanAge
## [1] 37.08412
spreadAge = sd(myage)
spreadAge
## [1] 10.98637
upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge
## [1] 70.04322
lowerTAge
## [1] 4.125023
maxAge = max(myage)
maxAge
## [1] 68
minAge = min(myage)
minAge
## [1] 18
QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: No, because the maximum value is 68 which is not bigger than the upper threshold 70.04322
QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, because the minimum value is 18 which is not smaller than the lower threshold 4.125023
QUESTION IC: How many outliers are there and which data (if any)? ANSWER IC: zero outliers
=============================================
We use file creditriskorg.csv to provide an example related to data preparation. This is the original dataset and contains missing values. We will focus on the data in the Checking column. There are some problems with this data. Firstly: missing values in this column which is quite common as most datasets are not perfect. Secondly: there are commas, while R does not recognize that 1,234 is equivalent to 1234. Lastly: there are ‘$’ symbols throughout the file which is not a numerical symbol either. We need to prepare the Checking column data to fix all the problems mentioned above, before we can analyze the Checking data. In this case for example, the analysis that we want to do is finding the mean of the Checking data.
Below are the steps to prepare the Checking data.
We must read the file which has the Checking column and extract the Checking column.
#STEP II1a: Read the file creditriskorg.csv
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
## X X.1 X.2 X.3 X.4
## 1 Loan Purpose Checking Savings Months Customer Months Employed
## 2 Small Appliance $- $739.00 13 12
## 3 Furniture $- $1,230.00 25 0
## 4 New Car $- $389.00 19 119
## 5 Furniture $638.00 $347.00 13 14
## 6 Education $963.00 $4,754.00 40 45
## X.5 X.6 X.7 X.8 X.9 X.10 X.11
## 1 Gender Marital Status Age Housing Years Job Credit Risk
## 2 M Single 23 Own 3 Unskilled Low
## 3 M Divorced 32 Own 1 Skilled High
## 4 M Single 38 Own 4 Management High
## 5 M Single 36 Own 2 Unskilled High
## 6 M Single 31 Rent 3 Skilled Low
We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.
#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
head(newdata)
## Loan.Purpose Checking Savings Months.Customer Months.Employed
## 1 Small Appliance $- $739.00 13 12
## 2 Furniture $- $1,230.00 25 0
## 3 New Car $- $389.00 19 119
## 4 Furniture $638.00 $347.00 13 14
## 5 Education $963.00 $4,754.00 40 45
## 6 Furniture $2,827.00 $- 11 13
## Gender Marital.Status Age Housing Years Job Credit.Risk
## 1 M Single 23 Own 3 Unskilled Low
## 2 M Divorced 32 Own 1 Skilled High
## 3 M Single 38 Own 4 Management High
## 4 M Single 36 Own 2 Unskilled High
## 5 M Single 31 Rent 3 Skilled Low
## 6 M Married 25 Own 1 Skilled Low
#STEP II1c: Extract the Checking colum.
checking = newdata$Checking
#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
meanChecking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA
There’s an error. These errors are related to the problems mentioned above (under PART II); i.e. in R the comma (,) and dollar ($) signs aren’t recognized as part of the numeric values. Therefore, we have to remmove them.
In order to remove the comma and dollar signs in the number e.g. “$1,234”, we must substitute them with just an empty space. The sub function replaces these symbols with something else, e.g. blank or empty space.
#STEP II2: substitute comma and dollar signs with blank in all of checking data.
checking = sub(",","",checking) #Substitute comma with blank.
checking = sub("\\$","",checking) #Substitute $ with blank.
In order to calculate the mean we have to convert all Checking values to numeric to remove any NA value.
#STEP II3: Convert values to numeric to remove any NA and calculate the mean.
checking = as.numeric(checking) #Convert values to numeric to remove any NA.
## Warning: NAs introduced by coercion
meanChecking = mean(checking,na.rm=TRUE) #Compute the mean value.
meanChecking #Display the mean value.
## [1] 2559.805
newdata = read.csv(file="data/scoring_original.csv")
head(newdata)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance $73K
## 2 good 17 rent 60 58 widow no_rec fixed $48K
## 3 bad 10 owner 36 46 married yes_rec freelance $90K
## 4 good 0 rent 60 24 single no_rec fixed $63K
## 5 good 0 rent 36 26 single no_rec fixed $46K
## 6 good 1 owner 60 36 married no_rec fixed $75K
## Income Assets Debt Amount Price Finrat Savings
## 1 $129K 0 0 $800.00 $846.00 94.56265 4.200000
## 2 $131K 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 $200K 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 $182K 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 $107K 0 0 $310.00 $910.00 34.06593 7.083871
## 6 $214K 3500 0 $650.00 $1,645.00 39.51368 12.830769
price = newdata$Price
price = sub(",","",price)
price = sub("\\$","",price)
price = as.numeric(price)
## Warning: NAs introduced by coercion
meanPrice = mean(price,na.rm=TRUE)
meanPrice
## [1] 1462.48
Answer: cleansing, transformation, detection, imputation, reduction, abstraction, filtering, derivation, and attributes.
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of fields and records does the file have? What is the primary key? What are other attributes’ names (give at least three)?
Size of the file is 38,727 KB. Number of records is 100000 and number of fields is 24. The primary key is Trip ID. Other attributes are Trip Miles, Tips, and Tolls
Define a relational business logic integrity check for the column field ‘Trip Seconds’ in relation to other field(s).
A relational business logic integrity check for ‘Trip Seconds’ would be if the time of ‘Trip Seconds’ exists when the ‘Trip Miles’ is zero. It could also be when the difference between ‘Trip End’ and ‘Trip Start’ is less or greater than ‘Trip Seconds’ even if conversion is necessary.
Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file. Use the fields in the Taxi_Trips_Sample.csv file to determine the required attributes for each dimension.
Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..