Business Analytics Lab Worksheet 01

About

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. The Credit Risk data set records the credit risk of each individual, along with the loan they have taken out and other features of the individual.

Capabilities

Through R, RStudio supports a wide range of statistical and graphical techniques, such as linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, time-series plots, and maps.

Setup

After downloading the bdad_lab01 zip folder, find it in your Downloads folder, right-click it, and select ‘extract’. This gives you a new unzipped folder. Next, set this folder as the working directory. To do this, open RStudio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. This option uses the location of the file currently open in the editor, so open the lab file from the unzipped folder first. Now, follow the worksheet directions to complete the lab.
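The same setup can also be done from the console. The sketch below is one possible way to do it, not part of the original worksheet. The path in `lab_dir` is hypothetical and depends on where you extracted the folder; the `data/creditrisk.csv` path is the one used later in this lab.

```r
# Hypothetical location of the unzipped lab folder; adjust to your machine.
lab_dir <- "~/Downloads/bdad_lab01"

# Only switch directories if the folder actually exists.
if (dir.exists(lab_dir)) {
  setwd(lab_dir)
}

# Print the current working directory to confirm where R is looking.
getwd()

# TRUE if the lab's data file is reachable from the working directory.
file.exists("data/creditrisk.csv")
```

If the last line returns FALSE, the working directory is probably not the unzipped lab folder.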

Task 1

To begin the lab, follow Task 1 as outlined in the worksheet. Examine the contents of the CSV file in Excel, create a simple star relational schema in ERDPlus, take a screenshot of the schema, and upload it below.

To add a picture, use the directions found in Lab 0. Below is an example of what the simple star relational schema should look like.

Task 2

Next, read the CSV file into RStudio. It is useful to give the data a name so that you can refer to it later. Here we name the data ‘mydata’. To see the data in the console, ‘call’ it by its name. Because the data set is large, we use ‘head’ to show only the first six rows.

#Reading the csv file and naming the data 'mydata'
mydata = read.csv(file="data/creditrisk.csv")
#Calling the first six rows of mydata
head(mydata)

##      Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance        0     739              13              12      M
## 2       Furniture        0    1230              25               0      M
## 3         New Car        0     389              19             119      M
## 4       Furniture      638     347              13              14      M
## 5       Education      963    4754              40              45      M
## 6       Furniture     2827       0              11              13      M
##   Marital.Status Age Housing Years        Job Credit.Risk
## 1         Single  23     Own     3  Unskilled         Low
## 2       Divorced  32     Own     1    Skilled        High
## 3         Single  38     Own     4 Management        High
## 4         Single  36     Own     2  Unskilled        High
## 5         Single  31    Rent     3    Skilled         Low
## 6        Married  25     Own     1    Skilled         Low
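The column names in the output above contain dots, such as Loan.Purpose. A likely explanation, assuming the headers in the CSV file contain spaces, is that `read.csv` converts headers into valid R names by default. The sketch below shows this on a small made-up CSV passed in through the `text` argument; the two rows are invented for illustration and are not the lab data.

```r
# A tiny made-up CSV (not the lab data) whose first header contains a space.
toy_csv <- "Loan Purpose,Checking,Savings
Furniture,0,1230
New Car,0,389"

# read.csv turns "Loan Purpose" into the valid R name "Loan.Purpose".
toy <- read.csv(text = toy_csv)

# Column names after reading.
names(toy)

# Number of rows and columns.
dim(toy)

# Compact overview of each column and its type.
str(toy)
```

Running `names(mydata)` and `str(mydata)` on the lab data gives the same kind of overview.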

To perform analytics on the checking and savings columns, we must first extract each column from the data. Writing the ‘$’ sign after the name of the data, followed by a column name, extracts that column. For convenience, we give the extracted column its own name.

Below, we have extracted the checking column.

#Extracting the Checking Column
checking = mydata$Checking 

#Calling the Checking Column
checking

##   [1]     0     0     0   638   963  2827     0     0  6509   966     0
##  [12]     0   322     0   396     0   652   708   207   287     0   101
##  [23]     0     0     0   141     0  2484   237     0   335  3565     0
##  [34] 16647     0     0     0   940     0     0   218     0 16935   664
##  [45]   150     0   216     0     0     0   265  4256   870   162     0
##  [56]     0     0   461     0     0     0   580     0     0     0     0
##  [67]   758   399   513     0     0   565     0     0     0   166  9783
##  [78]   674     0 15328     0   713     0     0     0     0     0   303
##  [89]   900     0  1257     0   273   522     0     0     0     0   514
## [100]   457  5133     0   644   305  9621     0     0     0     0     0
## [111]  6851 13496   509     0 19155     0     0   374     0   828     0
## [122]   829     0     0   939     0   889   876   893 12760     0     0
## [133]   959     0     0     0     0   698     0     0     0 12974     0
## [144]   317     0     0     0   192     0     0     0     0     0   942
## [155]     0  3329     0     0     0     0     0     0   339     0     0
## [166]     0   105     0   216   113   109     0     0  8176     0   468
## [177]  7885     0     0     0     0     0     0     0     0     0   734
## [188]     0     0   172   644     0   617     0   586     0     0     0
## [199]     0     0   522   585  5588     0   352     0  2715   560   895
## [210]   305     0     0     0  8948     0     0     0     0     0   483
## [221]     0     0     0   663   624     0     0   152     0     0   498
## [232]     0   156  1336     0     0     0  2641     0     0     0     0
## [243]     0   887     0     0     0     0 18408   497     0   946   986
## [254]  8122     0   778   645     0   682 19812     0     0   859     0
## [265]     0     0     0     0     0   795     0     0     0     0   852
## [276]     0     0   425     0     0     0 11072     0   219  8060     0
## [287]     0     0     0  1613   757     0     0   977   197     0     0
## [298]     0     0     0   256   296     0     0     0   298     0  8636
## [309]     0     0 19766     0     0     0     0  4089     0   271   949
## [320]     0   911     0     0     0     0   271     0     0     0     0
## [331]  4802   177     0     0   996   705     0     0  5960     0   759
## [342]     0   651   257   955     0  8249     0   956   382     0   842
## [353]  3111     0     0  2846   231     0 17366     0   332   242     0
## [364]   929     0     0     0     0     0     0     0   646   538     0
## [375]     0     0     0   135  2472     0 10417   211 16630     0   642
## [386]     0   296   898   478   315   122     0     0     0   670   444
## [397]  3880   819     0     0     0     0     0     0     0     0     0
## [408]   161     0     0   789   765     0     0   983     0     0   798
## [419]     0   193   497     0     0     0     0

Now, fill in the code to extract and call the savings column.

#Extracting the Savings Column

#Calling the Savings Column

To calculate the mean, or average, of the checking column by hand, one would add up every entry and divide by the total number of rows. This would take a long time, but thankfully R has a function for this.

We have done an example using the checking column. Compute the same using the savings column.

#Using the 'mean' function on checking to calculate the checking average and naming the average 'meanChecking'
meanChecking = mean(checking)

#Calling the average
meanChecking

## [1] 1048.014
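To see that ‘mean’ matches the calculation by hand, the sketch below compares the two on a short made-up vector (the name `x` and its values are invented for illustration).

```r
# A short made-up vector.
x <- c(2, 4, 6, 8)

# Mean by hand: add every entry, then divide by the number of entries.
by_hand <- sum(x) / length(x)

# Mean using the built-in function.
built_in <- mean(x)

# Both are 5.
by_hand
built_in
```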

#Find the average of the savings column and name it 'meanSavings'

#Call mean savings

Next, compute the standard deviation, or spread, of both the checking and savings columns.

#Computing the standard deviation of checking and naming it 'spreadChecking'
spreadChecking = sd(checking)

#Find the standard deviation of savings
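The ‘sd’ function computes the sample standard deviation, which divides by n - 1 rather than n. The sketch below checks this against a calculation by hand on a short made-up vector (the names and values are invented for illustration).

```r
# A short made-up vector.
x <- c(2, 4, 6, 8)
n <- length(x)

# Standard deviation by hand: square root of the sum of squared
# deviations from the mean, divided by n - 1.
sd_by_hand <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Standard deviation using the built-in function.
sd_built_in <- sd(x)

# TRUE: the two results agree.
all.equal(sd_by_hand, sd_built_in)
```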

Now, compute the SNR, the signal-to-noise ratio. There is no built-in function for it, so we write the formula ourselves.

SNR is the mean, or average, divided by the standard deviation, or spread.

#Compute the snr of Checking and name it snr_Checking
snr_Checking = meanChecking/spreadChecking

#Call snr_Checking
snr_Checking

## [1] 0.3330006
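Since the same formula is applied to more than one column, it can be wrapped in a small function. The sketch below is one possible way to do this. The function name `snr` and the two made-up vectors are assumptions for illustration and are not part of the lab data.

```r
# Hypothetical helper: SNR is the mean divided by the standard deviation.
snr <- function(x) {
  mean(x) / sd(x)
}

# Two made-up vectors with the same mean (10) but different spread.
steady <- c(9, 10, 11)
spiky  <- c(0, 0, 30)

# Values close to the mean give a high SNR (10 here).
snr(steady)

# Values far from the mean give a low SNR (about 0.577 here).
snr(spiky)
```

A function like this avoids retyping the formula, but the step-by-step version in the worksheet gives the same result.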

#Find the snr of the savings and name it snr_Saving

#Call snr_Saving

Which has the higher SNR, Checking or Savings? Why do you think that is?

Task 3

After using Watson Analytics to find patterns in the data, save your work and upload a screenshot here. Refer to Task 1 for how to upload a picture.