This worksheet includes three main tasks in data modeling (a key step to understand the data), basic steps to compute a simple signal-to-noise ratio, and some examples for basic R operations.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
To begin the Lab, examine the content of the csv file ‘creditrisk.csv’ in folder bsad_lab02 >> data
Create a simple star relational schema for the data in file ‘creditrisk.csv’ in ERDPlus standalone feature https://erdplus.com/#/standalone.
Export the image of the star schema, save it in your wokring directory, and then upload it to R Studio.
Below your picture, add brief comments about the logic you used to create the star schema (explain your steps and thoughts).
To upload a picture in R Studio, use the directions found in Lab00.
Below are steps and an example to create a simple star relational schema in ERDPlus.
Finally export the diagram as an image.
For the credit risk, we have Loan, customer and credit risk which goes to dimension table. The unique key from dimension table turns to forigen key in the fact table. Since the nunmber in records in fact table will increase by 1 if we have 1 new loan, loan id will also be included in fact table.
Next, read the csv file into R Studio. It can be useful to name your data to create a shortcut to it. Here we will label the data, ‘mydata’. To see the top head data in the console, one can ‘call’ it using the function ‘head’ and referring to it by its given shortcut name.
mydata = read.csv(file="data/creditrisk.csv")
head(mydata)
## Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance 0 739 13 12 M
## 2 Furniture 0 1230 25 0 M
## 3 New Car 0 389 19 119 M
## 4 Furniture 638 347 13 14 M
## 5 Education 963 4754 40 45 M
## 6 Furniture 2827 0 11 13 M
## Marital.Status Age Housing Years Job Credit.Risk
## 1 Single 23 Own 3 Unskilled Low
## 2 Divorced 32 Own 1 Skilled High
## 3 Single 38 Own 4 Management High
## 4 Single 36 Own 2 Unskilled High
## 5 Single 31 Rent 3 Skilled Low
## 6 Married 25 Own 1 Skilled Low
To capture, or extract, the checking and savings columns and perform some analytics on them, we must first be able to extract the columns from the data separately. Using the ‘$’ sign following the label for the data extracts a specific column. For convenience, we relabel the extracted data.
Below, we have extracted the Checking column.
#Extracting the Checking Column and naming it 'checking'
checking = mydata$Checking
#Calling 'checking' to display top head values
head(checking)
## [1] 0 0 0 638 963 2827
Now, fill in the code to extract and call the savings column.
#Extract the Savings Column and name it 'savings'
savings = mydata$Savings
#Call 'savings' to display top head values
head(savings)
## [1] 739 1230 389 347 4754 0
In order to calculate the mean, or the average by hand of the Checkings columns, one can add each individual entry and divide by the total number or rows. This would take much time, but thankfully, R has a command for this.
We have done an example using the Checking column. Compute the same using the Savings column where requested.
#Using the 'mean' function on 'checking' to calculate the average of Checking column and naming the average 'meanChecking'
meanChecking = mean(checking)
#Calling meanChecking
meanChecking
## [1] 1048.014
#Calculate the average of the Savings column and name the average 'meanSavings'
meanSavings = mean(savings)
#Call meanSavings
meanSavings
## [1] 1812.562
Next, compute the standard deviation / spread of both the Checkings and Savings columns. We have done a example using the Checking column. Compute the same using the Savings column where requested.
#Computing the standard deviation of Checking column and naming the standard deviation 'spreadChecking'
spreadChecking = sd(checking)
#Calling 'spreadChecking'
spreadChecking
## [1] 3147.183
#Compute the standard deviation of Savings column and name the standard deviation 'spreadSavings'
spreadSavings = sd(savings)
#Call 'spreadSavings'
spreadSavings
## [1] 3597.285
Now, to compute the SNR, the signal to noise ratio, a formula is created because there is no built in function.
SNR is the mean/average, divided by the spread/standard deviation
We have done a example using the Checking column. Compute the same using the Savings column where requested.
#Compute the snr of Checking column and name the snr 'snr_Checking'
snr_Checking = meanChecking/spreadChecking
#Call snr_Checking
snr_Checking
## [1] 0.3330006
#Find the snr of the Savings column and name the snr 'snr_Savings'
snr_Savings = meanSavings/spreadSavings
#Call snr_Savings
snr_Savings
## [1] 0.5038695
Of the Checking and Savings, which has a higher SNR? Why do you think that is?
In this task we solve some exmples about R operations. Read the solved example, and then do the requested task.
Solved example
# extracting column Loan Purpose from file 'creditrisk.csv'. Remember that we already read the csv file into R Studio and labeled the data, 'mydata'. We will name the column Loan Purpose 'my_LoanPurpose'
my_LoanPurpose = mydata$Loan.Purpose
#extracting the column Gender and naming it my_Gender
my_Gender = mydata$Gender
#Create a data frame from my_LoanPurpose and my_Gender and naming the data frame my_data_frame_1
my_data_frame_1 = data.frame(my_LoanPurpose,my_Gender)
#Calling my_data_frame_1 to show top head values
head(my_data_frame_1)
## my_LoanPurpose my_Gender
## 1 Small Appliance M
## 2 Furniture M
## 3 New Car M
## 4 Furniture M
## 5 Education M
## 6 Furniture M
#we will sum the value in the 1st row in column Months Employed with the value of the 4th row in column Months Customer
mydata$Months.Employed[1]+mydata$Months.Customer[4]
## [1] 25
Now you have to do the following
1- Insert a chunck of R code and do the following
#2- Extract column Age and name it my_Age
my_Age = mydata$Age
#3- Create a data frame from my_Gender (you already created this variable) and my_Age. Name the data frame my_data_frame_2
my_data_frame_2 = data.frame(my_Gender,my_Age)
#4- call my_data_frame_2 to show top head values
head(my_data_frame_2)
## my_Gender my_Age
## 1 M 23
## 2 M 32
## 3 M 38
## 4 M 36
## 5 M 31
## 6 M 25
#5- Sum the value in the 5th row in column Years with the value of the 2nd row in column Months Customer
mydata$Years[5]+mydata$Months.Customer[2]
## [1] 28