About

This worksheet includes basic steps to compute a simple signal-to-noise ratio and than it focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.


Task 1

Next, read the csv file into R Studio. It can be useful to name your data to create a shortcut to it. Here we will label the data, ‘mydata’. To see the top head data in the console, one can ‘call’ it using the function ‘head’ and referring to it by its given shortcut name.

mydata = read.csv(file="data/creditrisk.csv")
head(mydata)
##      Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance        0     739              13              12      M
## 2       Furniture        0    1230              25               0      M
## 3         New Car        0     389              19             119      M
## 4       Furniture      638     347              13              14      M
## 5       Education      963    4754              40              45      M
## 6       Furniture     2827       0              11              13      M
##   Marital.Status Age Housing Years        Job Credit.Risk
## 1         Single  23     Own     3  Unskilled         Low
## 2       Divorced  32     Own     1    Skilled        High
## 3         Single  38     Own     4 Management        High
## 4         Single  36     Own     2  Unskilled        High
## 5         Single  31    Rent     3    Skilled         Low
## 6        Married  25     Own     1    Skilled         Low

To capture, or extract, the checking and savings columns and perform some analytics on them, we must first be able to extract the columns from the data separately. Using the ‘$’ sign following the label for the data extracts a specific column. For convenience, we relabel the extracted data.

Below, we have extracted the checking column.

#Extracting the Checking Column
checking = mydata$Checking 

#Calling the Checking Column to display top head values
head(checking)
## [1]    0    0    0  638  963 2827

Now, fill in the code to extract and call the savings column.

#Extracting the Savings Column
saving = mydata$Savings

#Calling the Savings Column
head(saving)
## [1]  739 1230  389  347 4754    0

In order to calculate the mean, or the average by hand of the checkings columns, one can add each individual entry and divide by the total number or rows. This would take much time, but thankfully, R has a command for this.

We have done an example using the checkings column. Compute the same using the savings column.

#Using the 'mean' function on checking to calculate the checking average and naming the average 'meanChecking'
meanChecking = mean(checking)

#Calling the average
meanChecking
## [1] 1048.014
#Find the average of the savings column and name the average of the savings meanSavings
meanSavings = mean(saving)

#Call meanSavings
meanSavings
## [1] 1812.562

Next, compute the standard deviation or spread of both the checkings and savings columns.

#Computing the standard deviation of standard deviation
spreadChecking = sd(checking)
spreadChecking
## [1] 3147.183
#Find the standard deviation of savings 
spreadSavings = sd(saving)
spreadSavings
## [1] 3597.285

Now, to compute the SNR, the signal to noise ratio, a formula is created because there is no built in function.

SNR is the mean, or average, divided by the spread.

#Compute the snr of Checking and name it snr_Checking
snr_Checking = meanChecking/spreadChecking

#Call snr_Checking
snr_Checking
## [1] 0.3330006
#Find the snr of the savings and name it snr_Saving
snr_Savings = meanSavings/spreadSavings

#Call snr_Saving
snr_Savings
## [1] 0.5038695

Of the Checking and Savings, which has a higher SNR? Why do you think that is?


Task 2: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must extract the column “Age” and find the values that are asked for.

#Name the extracted variable
age = mydata$Age 
#Calculate the average age below. Refer to Worksheet 2 for the correct command.
meanage = mean(age)
meanage
## [1] 34.39765
#Calculate standard deviation of age below. Refer to Worksheet 2 for the correct command. 
spreadage=sd(age)
spreadage
## [1] 11.04513
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable. 
MaxAge = max(age)
MaxAge
## [1] 73
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
MinAge = min(age)
MinAge
## [1] 18

Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.

#Use the formula above to calculate the upper and lower threshold
UpperOutlier = meanage + 3*spreadage
UpperOutlier
## [1] 67.53302
LowerOtlier = meanage - 3*spreadage
LowerOtlier
## [1] 1.262269

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.

boxplot(age)

outside <- length(age[age >= UpperOutlier])
outside
## [1] 1

Yes, there is one outlier (the max). We can see that in the vector “outside”. I knew there weren’t any entries lower than the lower threshold, since the minimum age in the set is 18, but there is one individual above the upper threshold.


Task 3: Data Preparation

Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.

newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.

newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low

To calculate the mean for Checking in R, follow Worksheet 2. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?

checking = newdata$Checking
MeanChecking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA
MeanChecking
## [1] NA

To resolve the error, we must understand where it is coming from and correct for. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerical symbol either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

#substitute comma with blank in all of checking.  Below are examples using a hypothetical variable name 'new'.
# Example new = sub(",","",new)
newchecking = sub(",","",checking)

#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
newchecking = sub("\\$","",newchecking)

#Convert values to numeric to remove any NA
# Example new = as.numeric(new)
newchecking = as.numeric(newchecking)
## Warning: NAs introduced by coercion
#Calculate mean of checking with NA removed 
# Example mean(new,na.rm=TRUE)
MeanChecking = mean(newchecking,NA.rm=TRUE)
cleanChecking <- newchecking[is.na(newchecking) == FALSE]
MeanChecking <- mean(cleanChecking)
MeanChecking
## [1] 2559.805

What are some other ways to clean this data? How about Excel? How does Excel treat the missing values and the “$” symbols?

I used the cleanChecking line to clear out the NA values so you could get the actual mean. In Excel, you can format cells so that it ignores $ and commas.