A. Remember to always set your working directory to the source file location. Click on Session -> Set Working Directory -> To Source File Location
B. For your assignment you will be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.
We use file creditrisk.csv to provide an example related to data outliers. We use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to detect the outliers is to determine the UPPER and LOWER THRESHOLDS using formula mean (+ or -) 3 * standard deviation.
To use the above formula for Age data in the above file, we firstly must calculate the mean and standard deviation of the data in Age column. The results would be two values which are the upper and lower thresholds. Secondly, we determine the maximum and minimum values of the Age data. Thirdly, compare the maximum and minimum values to the upper and minimum thresholds to identify any potential thresholds. Finally, to determine how many Age data are the outliers, look at Age data by opening the related file, sort the Age column and compare the data to the upper and lower thresholds. If an Age data value falls above the upper threshold or below the lower threshold, it is an outlier.
Below are the steps to determine outliers for Age column.
We must read the file which has the Age column, extract the Age column, calculate the mean, standard deviation, maximum value, and minimum value for the Age column using R.
#STEP I1: Read the file creditrisk.csv and extract column Age in the file and name the extracted variable as age
mydata = read.csv(file="data/creditrisk.csv")
age = mydata$Age
#STEP I2: Calculate the average or mean and standard deviation or SD of age below. Refer to Worksheet lab_02 for the correct commands.
meanAge = mean(age) #Calculate the mean and hold it in variable meanAge
meanAge #Display the mean value
## [1] 34.39765
spreadAge = sd(age) #Calculate the SD and hold it in variable spreadAge
spreadAge #Display the SD value
## [1] 11.04513
#STEP I3: Calculate the maximum and minimum of age below. The command to find the maximum is max(variable) and the command to find the minimum is min(variable) where variable in this case is the extracted variable age
maxAge = max(age) #Calculate the maximum and hold it in variable maxAge
maxAge #Display the minimum value
## [1] 73
minAge = min(age) #Calculate the minimum and hold it in variable minAge
minAge #Display the minimum value
## [1] 18
#STEP I4: Calculate the upper and lower thresholds. upperTAge is the variable to hold the upper threshold for the Age data, and lowerTAge is the variable to hold the lower threshold for the Age data.
upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge
## [1] 67.53302
lowerTAge
## [1] 1.262269
STEP I5: Compare the maximum value with upper threshold and the minimum value with lower threshold to determine any potential outliers. QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: YES, since the maximum value is 73 which is bigger than the upperthreshold 67.53302.
QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, since the minimu value is 18 which is NOT smaller than the lower threshold.
QUESTION IC: How many outliers are there and which data? ANSWER IC: ONE outlier, where Age is 73.
#Task1:
Scoring = read.csv(file="data/Scoring.csv")
age = Scoring$Age
#Task2:
meanAge = mean(age)
meanAge
## [1] 37.08412
spreadAge = sd(age)
spreadAge
## [1] 10.98637
#Task3:
maxAge = max(age)
maxAge
## [1] 68
minAge = min(age)
minAge
## [1] 18
#Task4:
upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge
## [1] 70.04322
lowerTAge
## [1] 4.125023
A:No, the maximum value is 68 is below the upper threshold. B:No, the minimum value is 18 which is above the lower threshold. C:No outliers.
=============================================
We use file creditriskorg.csv to provide an example related to data preparation. This is the original dataset and contains missing values. We will focus on the data in the Checking column. There are some problems with this data. Firstly: missing values in this column which is quite common as most datasets are not perfect. Secondly: there are commas, while R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly: there are ‘$’ symbols throughout the file which is not a numerical symbol either. We need to prepare the Checking column data to fix all the problems mentioned above, before we can analyze the Checking data. In this case for example we want to find the mean of the Checking data.
Below are the steps to prepare the Checking data.
We must read the file which has the Checking column and extract the Checking column.
#STEP II1a: Read the file creditriskorg.csv
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
## X X.1 X.2 X.3 X.4
## 1 Loan Purpose Checking Savings Months Customer Months Employed
## 2 Small Appliance $- $739.00 13 12
## 3 Furniture $- $1,230.00 25 0
## 4 New Car $- $389.00 19 119
## 5 Furniture $638.00 $347.00 13 14
## 6 Education $963.00 $4,754.00 40 45
## X.5 X.6 X.7 X.8 X.9 X.10 X.11
## 1 Gender Marital Status Age Housing Years Job Credit Risk
## 2 M Single 23 Own 3 Unskilled Low
## 3 M Divorced 32 Own 1 Skilled High
## 4 M Single 38 Own 4 Management High
## 5 M Single 36 Own 2 Unskilled High
## 6 M Single 31 Rent 3 Skilled Low
We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.
#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
head(newdata)
## Loan.Purpose Checking Savings Months.Customer Months.Employed
## 1 Small Appliance $- $739.00 13 12
## 2 Furniture $- $1,230.00 25 0
## 3 New Car $- $389.00 19 119
## 4 Furniture $638.00 $347.00 13 14
## 5 Education $963.00 $4,754.00 40 45
## 6 Furniture $2,827.00 $- 11 13
## Gender Marital.Status Age Housing Years Job Credit.Risk
## 1 M Single 23 Own 3 Unskilled Low
## 2 M Divorced 32 Own 1 Skilled High
## 3 M Single 38 Own 4 Management High
## 4 M Single 36 Own 2 Unskilled High
## 5 M Single 31 Rent 3 Skilled Low
## 6 M Married 25 Own 1 Skilled Low
#STEP II1c: Extract the Checking colum.
checking = newdata$Checking
#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
meanChecking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA
There’s an error. These errors are related to the problems mentioned above (under PART II).
In order to remove the comma and dollar sign in the number e.g. “$1,234”, we must substitute them with just an empty space. The sub function replaces these symbols with something els, e.g. blank or empty space.
#STEP II2: substitute comma, dollar sign with blank in all of checking data.
checking = sub(",","",checking) #Substitute comma with blank
checking = sub("\\$","",checking) #Substitute $ with blank
In order to calculate we have to convert all Checking values to numeric to remove any NA
#STEP II3: Convert values to numeric to remove any NA and calculate the mean
checking = as.numeric(checking) #Convert values to numeric to remove any Na
## Warning: NAs introduced by coercion
meanChecking = mean(checking,na.rm=TRUE)
meanChecking
## [1] 2559.805
#Task6:
newdata = read.csv(file="data/Scoring_Original.csv")
head(newdata)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance $73K
## 2 good 17 rent 60 58 widow no_rec fixed $48K
## 3 bad 10 owner 36 46 married yes_rec freelance $90K
## 4 good 0 rent 60 24 single no_rec fixed $63K
## 5 good 0 rent 36 26 single no_rec fixed $46K
## 6 good 1 owner 60 36 married no_rec fixed $75K
## Income Assets Debt Amount Price Finrat Savings
## 1 $129K 0 0 $800.00 $846.00 94.56265 4.200000
## 2 $131K 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 $200K 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 $182K 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 $107K 0 0 $310.00 $910.00 34.06593 7.083871
## 6 $214K 3500 0 $650.00 $1,645.00 39.51368 12.830769
price = newdata$Price
#Task7:
price = sub(",","",price)
price = sub("\\$","",price)
#Task8:
price = as.numeric(price)
## Warning: NAs introduced by coercion
meanPrice= mean(price,na.rm=TRUE)
meanPrice
## [1] 1462.48
You can clean it up by replacing commas and dollar signs with blanks. Additionally, the “as.numeric” function allows the removal of blank values. #Question IIB: How does Excel treat the missing values and the “\(" symbols? It is not a numeric value so if there are "\)” in the data it will not be able to compute the mean or do various other functions.
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of columns and of rows does the file have? What is the primary key? What are other attributes’ names?
Define a relational business logic integrity check for the column field ‘Trip Seconds’.
Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.
c
Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..
quantile(age)
## 0% 25% 50% 75% 100%
## 18 28 36 45 68
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
## 75%
## 70.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25%
## 2.5
It can also be useful to visualize the data using a box and whisker plot.
boxplot(age)