This worksheet includes three main tasks in data modeling (a key step to understand the data), basic steps to compute a simple signal-to-noise ratio, and data exploration to identify trends and patterns using Watson Analytics.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
To begin the Lab, examine the content of the csv file ‘creditrisk.csv’ by opening the file in RStudio. You can use File -> Import Dataset for that purpose.
Create a simple star relational schema in ERDPlus standalone feature https://erdplus.com/#/standalone, take a screenshot of the image, and upload it below.
To add a picture, use the directions found in Lab00. Below are steps and an example to create a simple star relational schema in ERDPlus.
Finally export the diagram as an image.
This star schema divides the data into four dimensions: demographics, finances, ownership and employment. People’s demographics, such as their family size, can impact their need for a loan. Additionally, financial well-being and type of employment affect the risk involved with giving out a loan. Ownership shows people’s comfort and experience with significant costs, specifically with housing.
Next, read the csv file into R Studio. It can be useful to name your data to create a shortcut to it. Here we will label the data, ‘mydata’. To see the top head data in the console, one can ‘call’ it using the function ‘head’ and referring to it by its given shortcut name.
mydata = read.csv(file="data/creditrisk.csv")
head(mydata)
## Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance 0 739 13 12 M
## 2 Furniture 0 1230 25 0 M
## 3 New Car 0 389 19 119 M
## 4 Furniture 638 347 13 14 M
## 5 Education 963 4754 40 45 M
## 6 Furniture 2827 0 11 13 M
## Marital.Status Age Housing Years Job Credit.Risk
## 1 Single 23 Own 3 Unskilled Low
## 2 Divorced 32 Own 1 Skilled High
## 3 Single 38 Own 4 Management High
## 4 Single 36 Own 2 Unskilled High
## 5 Single 31 Rent 3 Skilled Low
## 6 Married 25 Own 1 Skilled Low
To capture, or extract, the checking and savings columns and perform some analytics on them, we must first be able to extract the columns from the data separately. Using the ‘$’ sign following the label for the data extracts a specific column. For convenience, we relabel the extracted data.
Below, we have extracted the checking column.
#Extracting the Checking Column
checking = mydata$Checking
#Calling the Checking Column to display top head values
head(checking)
## [1] 0 0 0 638 963 2827
Now, fill in the code to extract and call the savings column.
#Extracting the Savings Column
savings = mydata$Savings
#Calling the Savings Column
head(savings)
## [1] 739 1230 389 347 4754 0
In order to calculate the mean, or the average by hand of the checkings columns, one can add each individual entry and divide by the total number or rows. This would take much time, but thankfully, R has a command for this.
We have done an example using the checkings column. Compute the same using the savings column.
#Using the 'mean' function on checking to calculate the checking average and naming the average 'meanChecking'
meanChecking = mean(checking)
#Calling the average
meanChecking
## [1] 1048.014
#Find the average of the savings column and name the average of the savings meanSavings
meanSavings = mean(savings)
#Call meanSavings
meanSavings
## [1] 1812.562
Next, compute the standard deviation or spread of both the checkings and savings columns.
#Computing the standard deviation of standard deviation
spreadChecking = sd(checking)
sd(checking)
## [1] 3147.183
#Find the standard deviation of savings
spreadSavings = sd(savings)
sd(savings)
## [1] 3597.285
Now, to compute the SNR, the signal to noise ratio, a formula is created because there is no built in function.
SNR is the mean, or average, divided by the spread.
#Compute the snr of Checking and name it snr_Checking
snr_Checking = meanChecking/spreadChecking
#Call snr_Checking
snr_Checking
## [1] 0.3330006
#Find the snr of the savings and name it snr_Saving
snr_Savings = meanSavings/spreadSavings
#Call snr_Saving
snr_Savings
## [1] 0.5038695
Of the Checking and Savings, which has a higher SNR? Why do you think that is?
Savings (0.5038695) has a higher signal to noise ratio than Checking (0.330006). I think this is the case because the Checking column has more zero values and possible outliers than the Savings column, and that may be skewing the Checking data.
Login to Watson Analytics and upload the file creditrisk.csv to your account. Use Explore to find patterns in the data. Consider for example trend of ‘Months Employed over Age by Gender’. Save your work and upload any screenshot(s) here. Refer to Task 1 on how to upload a photo. For every uploaded screenshot share your observations on general data trends and data behavior. Any screenshot without observations will be dismissed.
This analysis shows that women are generally employed for a significantly lower amount of time than men, across all ages. More men and women, overall, remain employed for the longest time during their 20s and 30s. However, there are continuous spikes and depressions in the data for both genders, likely because there isn’t data for every single age, and the data that is there has a lot of variation. Additionally, here is little data for both genders around age 54-56, but the general trend implies that less months are worked as indidivuals age.
This diagram analyzes gender and credit risk. The general trend seen is that slightly more women have a higher credit risk, and slightly more men have a lower credit risk. Overall, more data was given for men, which is why the mens’ circle is larger than the females. It seems that there isn’t a lot of correlation between credit risk and gender because both genders are very close to a 50-50 breakdown.
This line graph shows the trends in age and amount of money saved in checking accounts. The highest savings are seen around 25 and 35 years of age. The general trend shows that there is a decrease in checking account balances as as individual gets older, after around 21 years of age. Data for individuals younger than 21 years shows similar trends in checking balance to those in their late 50s and early 60s. Those 21 years and younger may start taking out loans gradually to pay for education as they get out of school. Those in their late 50s and early 60s may not need to take out loans as they near retirement. However, this may just be the case because of less data for those age groups.
This word analysis shows that most loans are taken out for small appliances, new cars and furniture. It was interesting to see education as one of the less likely loan purposes, because of how popular student loans are.
This analysis shows comparisons between savings, marital status and credit risk. It appears that single individuals with low credit risk have the highest savings. Married people have lowest savings, with a fairly equal share of high and low credit risk. Divorced individuals have about average amoutns in savings and also have a fairly even split between high and low credit risk. This data would suggest that single individuals are in the best position of having average to high savings and a high credit risk. For married and divorced individuals, their credit risk does not seem to correllate with the amount had in savings.