About

This worksheet includes three main tasks in data modeling (a key step to understand the data), basic steps to compute a simple signal-to-noise ratio, and data exploration to identify trends and patterns using Watson Analytics.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.


Task 1

To begin the Lab, examine the content of the csv file ‘creditrisk.csv’ by opening the file in RStudio. You can use File -> Import Dataset for that purpose.

Create a simple star relational schema in ERDPlus standalone feature https://erdplus.com/#/standalone, take a screenshot of the image, and upload it below.

To add a picture, use the directions found in Lab00. Below are steps and an example to create a simple star relational schema in ERDPlus.

Based off the material discussed in class, I chose to include all of the categorical data as dimension tables. I organized the fact table in order of rows in Excel to keep the structure consistent.

Steps to create an star relation schema using erdplus.
From the drop-down option select New Start Schema
Example of how to create an start schema using erdplus
Completed Star schema example

Finally export the diagram as an image.


Task 2

Next, read the csv file into R Studio. It can be useful to name your data to create a shortcut to it. Here we will label the data, ‘mydata’. To see the top head data in the console, one can ‘call’ it using the function ‘head’ and referring to it by its given shortcut name.

mydata = read.csv(file="data/creditrisk.csv")
head(mydata)
##      Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance        0     739              13              12      M
## 2       Furniture        0    1230              25               0      M
## 3         New Car        0     389              19             119      M
## 4       Furniture      638     347              13              14      M
## 5       Education      963    4754              40              45      M
## 6       Furniture     2827       0              11              13      M
##   Marital.Status Age Housing Years        Job Credit.Risk
## 1         Single  23     Own     3  Unskilled         Low
## 2       Divorced  32     Own     1    Skilled        High
## 3         Single  38     Own     4 Management        High
## 4         Single  36     Own     2  Unskilled        High
## 5         Single  31    Rent     3    Skilled         Low
## 6        Married  25     Own     1    Skilled         Low

To capture, or extract, the checking and savings columns and perform some analytics on them, we must first be able to extract the columns from the data separately. Using the ‘$’ sign following the label for the data extracts a specific column. For convenience, we relabel the extracted data.

Below, we have extracted the checking column.

#Extracting the Checking Column
checking = mydata$Checking 

#Calling the Checking Column to display top head values
head(checking)
## [1]    0    0    0  638  963 2827

Now, fill in the code to extract and call the savings column.

#Extracting the Savings Column
savings=mydata$Savings

#Calling the Savings Column
head(savings)
## [1]  739 1230  389  347 4754    0

In order to calculate the mean, or the average by hand of the checkings columns, one can add each individual entry and divide by the total number or rows. This would take much time, but thankfully, R has a command for this.

We have done an example using the checkings column. Compute the same using the savings column.

#Using the 'mean' function on checking to calculate the checking average and naming the average 'meanChecking'
meanChecking = mean(checking)

#Calling the average
meanChecking
## [1] 1048.014
#Find the average of the savings column and name the average of the savings meanSavings
meanSavings=mean(savings)

#Call meanSavings
meanSavings
## [1] 1812.562

Next, compute the standard deviation or spread of both the checkings and savings columns.

#Computing the standard deviation of standard deviation
spreadChecking = sd(checking)

#Find the standard deviation of savings 
spreadSavings= sd(savings)

Now, to compute the SNR, the signal to noise ratio, a formula is created because there is no built in function.

SNR is the mean, or average, divided by the spread.

#Compute the snr of Checking and name it snr_Checking
snr_Checking=meanChecking/spreadChecking

#Call snr_Checking
snr_Checking
## [1] 0.3330006
#Find the snr of the savings and name it snr_Saving
snr_Saving=meanSavings/spreadSavings

#Call snr_Saving
snr_Saving
## [1] 0.5038695

Of the Checking and Savings, which has a higher SNR? Why do you think that is?

The higher SNR would be savings because it has a higher ratio, making it easier to trust. As discussed in class, the more spread the data is the higher volume of noise. This means there is a larger amount of junk data in the checkings column.


Task 3

Login to Watson Analytics and upload the file creditrisk.csv to your account. Use Explore to find patterns in the data. Consider for example trend of ‘Months Employed over Age by Gender’. Save your work and upload any screenshot(s) here. Refer to Task 1 on how to upload a photo. For every uploaded screenshot share your observations on general data trends and data behavior. Any screenshot without observations will be dismissed.

#Since the Checkings has a lower SNR, I wanted to figure out its relationship with the other variables in the table to better understand why the data was filled with so much noise. This image depicts which age group has the highest Checking account. I singled out 25 because it was the largest in the data set, and 36 because of its relationship to the Credit Risk graph depicted below.

#This graph continues to hold Checking and age constant, with my independent variable being Credit Risk. The left handed side is High Credit Risk, while the right handed side is Low. Therefore, I found that 36 year olds have a higher credit risks than 25 year olds. I found this to be very odd, because I figured 25 year olds would not have enough financial credability, so I decided to examine things further in the following image.

#Through deeper analysis of this word chart I found that the mean of Unskilled workers is 31.6, meaning that there are more unskilled workers in the 36 age range than the 25 range. This explains why the 36 year olds have a higher credit risk, because they are not in a skilled profession.

The last thing I wanted to answer is, What are these age groups obtaining a loan for? This way the business can figure out which purposes can be paid back the fastest. I decided to limit the purposes to the three most popular. In this graph the first set of bars on the left corresponds with furniture, the second with a new car, and the third being small appliances. It makes sense the 25 year olds are purchasing furniture because they are probably trying to furnish their first official apartment, while 36 year olds are at the point in their life of upgrading their car. This data suggests that the New Car loans will take the longest to pay back because they of the high risk 36 year olds have in relation to their checking account. It makes sense that indiviuals who are 25 can pay back their furniture loans because furniture does not cost nearly as much as a new car.