About
This worksheet includes three main tasks: data modeling (a key step to understand the data), basic steps to compute a simple signal-to-noise ratio, and data exploration to identify trends & patterns using Watson Analytics.
Setup
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
Note
For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. For clarity, tasks and questions to be completed are highlighted in red (visible in preview) and numbered according to their placement in the task section. You will often need to add your own code chunk.
Execute all code chunks, preview, publish, and submit link on Sakai.
Task 1: Data Modeling
To begin the lab, examine the contents of the CSV file ‘creditrisk.csv’ by opening it in RStudio. You can also view the file separately in Excel, or use File -> Import Dataset in RStudio.
An important early phase when working with any data is modeling. Whether we are dealing with structured or unstructured data, data modeling is an exercise that demonstrates our understanding of the data, not just in terms of how the data is related, but also from the business perspective. Many database modeling tools are available for creating relational schemas. For this lab we will work with the tool ERDPlus, available at https://erdplus.com. An animated view of how to create star schemas is on the web page.
##### 1) Create a star relational schema of the data using ERDPlus standalone feature https://erdplus.com/standalone, take a screenshot of the image, and add it below. Consider using one fact table for loan, one dimension table for customer profile, and one dimension table for credit risk (8pts)
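To add a picture, use the directions found in Lab01. Below are the steps to create a simple star relational schema in ERDPlus.
From the Menu drop-down, select New Star Schema.
Add the Fact and Dimension tables as needed. For each table, make sure you identify the primary key. Connect the Dimension tables to the Fact table as demonstrated in the animated view.
Once completed, select Export Image to save your work as an image file to include in this lab worksheet.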
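The star layout suggested in question 1 can also be sketched in R as a rough illustration of how a fact table points to its dimension tables through keys. Every table name, column name, and value below is a made-up assumption for illustration, not the actual content of ‘creditrisk.csv’.

```r
# Hypothetical star schema: all names and values are assumptions for illustration.

# Dimension table: customer profile, primary key CustomerID
dim_customer <- data.frame(
  CustomerID = c(1, 2),
  Age        = c(23, 35),
  Job        = c("Skilled", "Unskilled")
)

# Dimension table: credit risk, primary key RiskID
dim_credit_risk <- data.frame(
  RiskID     = c(1, 2),
  CreditRisk = c("Low", "High")
)

# Fact table: one row per loan, with a foreign key to each dimension
fact_loan <- data.frame(
  LoanID     = c(101, 102, 103),
  CustomerID = c(1, 2, 1),
  RiskID     = c(1, 2, 1),
  Amount     = c(5000, 1200, 800)
)

# Joining the fact table to each dimension through its key
loans <- merge(fact_loan, dim_customer, by = "CustomerID")
loans <- merge(loans, dim_credit_risk, by = "RiskID")
loans
```

Each dimension table has exactly one row per key value, while the fact table may repeat a key; that one-to-many relationship is what the connecting lines in the ERDPlus diagram express.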

Task 2: Signal-to-Noise Ratio
Next, read the CSV file into RStudio. It is useful to give your data a name so that you can refer to it later. Here we will name the data ‘mydata’. To see the first few rows in the console, pass that name to the function ‘head’.
mydata = read.csv(file="creditrisk.csv")
head(mydata)
To perform some analytics on the checking and savings columns, we must first extract each column from the data. Placing the ‘$’ sign after the name of the data, followed by a column name, extracts that column. For convenience, we give the extracted column its own name. Below, we extract the checking column.
#Extracting the Checking Column
checking = mydata$Checking
#Calling the Checking Column to display top head values
head(checking)
[1] 0 0 0 638 963 2827
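As a small aside, the ‘$’ extraction can be tried on a toy data frame without the CSV file. The sketch below assumes a stand-in data frame named `toy` (a hypothetical name) built from the first six checking and savings values printed in this worksheet, and shows that ‘$’ and `[[ ]]` return the same vector.

```r
# 'toy' is a hypothetical stand-in for mydata, built from the six head values
# shown in this worksheet; it is not the full creditrisk.csv data.
toy <- data.frame(
  Checking = c(0, 0, 0, 638, 963, 2827),
  Savings  = c(739, 1230, 389, 347, 4754, 0)
)

# Two equivalent ways to extract one column as a plain vector
by_dollar  <- toy$Checking
by_bracket <- toy[["Checking"]]

# TRUE when both approaches return exactly the same vector
identical(by_dollar, by_bracket)
```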
##### 2A) Repeat the above code chunk here to extract the savings column instead. Be careful to use a different variable name (2pts)
#Extracting the Savings Column
savings = mydata$Savings
#Calling the savings Column to display top head values
head(savings)
[1] 739 1230 389 347 4754 0
To calculate the mean, or average, of the checking column by hand, one adds up every row entry and divides by the total number of rows. Thankfully, R has a built-in function for this. Here is an example using the checking column.
#Using the 'mean' function on checking to calculate the checking average and naming the average 'meanChecking'
meanChecking = mean(checking)
#Calling the average
meanChecking
[1] 1048.014
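To make the "by hand" description concrete, here is a minimal sketch that applies it to only the six checking values printed by `head(checking)` earlier. The variable names `checking_head`, `manual_mean`, and `builtin_mean` are introduced here for illustration. Because it uses six values rather than the whole column, the result differs from the 1048.014 shown above.

```r
# First six checking values, copied from the head(checking) output
checking_head <- c(0, 0, 0, 638, 963, 2827)

# Mean by hand: sum of the entries divided by the number of entries
manual_mean <- sum(checking_head) / length(checking_head)

# The built-in function gives the same result
builtin_mean <- mean(checking_head)

manual_mean    # 738
builtin_mean   # 738
```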
We similarly compute the standard deviation, or spread, of the checking column.
#Computing the standard deviation of checking
spreadChecking = sd(checking)
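For reference, `sd` computes the sample standard deviation, which divides by n - 1 rather than n. The sketch below reproduces that calculation by hand on the six checking values printed earlier; the names `checking_head`, `deviations`, and `manual_sd` are illustrative assumptions, not part of the lab.

```r
# First six checking values, copied from the head(checking) output
checking_head <- c(0, 0, 0, 638, 963, 2827)

n <- length(checking_head)

# Distance of each value from the mean
deviations <- checking_head - mean(checking_head)

# Sample standard deviation: square root of the summed squared deviations over (n - 1)
manual_sd <- sqrt(sum(deviations^2) / (n - 1))

# TRUE when the hand calculation agrees with the built-in sd()
all.equal(manual_sd, sd(checking_head))
```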
Now we compute the SNR, the signal-to-noise ratio. Base R has no built-in function for it, so we write the formula ourselves: SNR is the mean, or average, divided by the spread.
#Compute the snr of Checking and name it snr_Checking
snr_Checking = meanChecking/spreadChecking
#Call snr_Checking
snr_Checking
[1] 0.3330006
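Since the same formula is applied to more than one column, it could be wrapped in a small helper. The function name `snr` and its `na.rm` argument below are hypothetical conveniences, not something the lab requires; the sketch is demonstrated on the six checking values printed earlier rather than on the full column.

```r
# Hypothetical helper (not part of base R or the lab): mean divided by standard deviation
snr <- function(x, na.rm = FALSE) {
  mean(x, na.rm = na.rm) / sd(x, na.rm = na.rm)
}

# First six checking values, copied from the head(checking) output
checking_head <- c(0, 0, 0, 638, 963, 2827)

# Same result as computing mean(checking_head) / sd(checking_head) directly
snr(checking_head)
```

With the full data loaded, `snr(mydata$Checking)` would be expected to match the `snr_Checking` value computed step by step above.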
##### 2B) Repeat the calculations from the above code chunks here, introducing new variables for the savings column, to derive the corresponding SNR (6pts)
#Using the 'mean' function on savings to calculate the savings average and naming the average 'meanSavings'
meanSavings = mean(savings)
#Calling the average
meanSavings
[1] 1812.562
#Computing the standard deviation of Savings
spreadsavings = sd(savings)
#Compute the snr of Savings and name it snr_savings
snr_savings = meanSavings/spreadsavings
#Call snr_savings
snr_savings
[1] 0.5038695
##### 2C) Of the checking and savings data, which one has a higher SNR? What does it mean in terms of possible data quality? (4pts)
Savings. Since snr_savings = 0.5038695 and snr_Checking = 0.3330006, savings has the higher SNR. A higher SNR means the mean (the signal) is larger relative to the spread (the noise), so the signal in the savings data is easier to separate from the noise, which suggests better data quality than for checking. Both ratios are below 1, however, so in both columns the spread is larger than the mean.