For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to the task section.
We are going to use tidyverse a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. *Info:* https://www.tidyverse.org/
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#with read_csv hit tab to see files in a folder. head shows top few rows
#use easy to remember variable names. case sensitive commands
mydata = read.csv("data/Scoring.csv")
head(mydata)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance 73
## 2 good 17 rent 60 58 widow no_rec fixed 48
## 3 bad 10 owner 36 46 married yes_rec freelance 90
## 4 good 0 rent 60 24 single no_rec fixed 63
## 5 good 0 rent 36 26 single no_rec fixed 46
## 6 good 1 owner 60 36 married no_rec fixed 75
## Income Assets Debt Amount Price Finrat Savings
## 1 129 0 0 $800.00 $846.00 94.56265 4.200000
## 2 131 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 200 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 182 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 107 0 0 $310.00 $910.00 34.06593 7.083871
## 6 214 3500 0 $650.00 $1,645.00 39.51368 12.830769
To extract the features (columns) from the dataset, use the name of the dataset follow by ‘$’ sign and the name the specific column.
Extract the first feature (column)
#Extracting the Expenses Column
expenses = mydata$Expenses
#Calling the Expenses Column
expenses[1:150]
## [1] 73 48 90 63 46 75 75 35 90 90 60 60 75 75 35 75 35
## [18] 65 45 35 46 45 105 74 45 60 75 75 75 45 45 75 75 35
## [35] 45 45 75 70 45 44 75 35 45 75 35 35 75 48 60 60 75
## [52] 105 75 60 85 35 45 60 90 70 90 35 75 75 75 75 60 45
## [69] 93 60 60 45 75 75 86 90 45 70 45 60 60 60 35 75 35
## [86] 47 60 40 60 35 35 101 95 140 75 45 35 120 89 45 73 75
## [103] 78 35 75 76 90 45 35 79 45 90 35 60 75 35 75 45 35
## [120] 75 57 75 66 114 45 35 75 35 90 75 45 75 60 65 75 60
## [137] 45 60 45 47 60 90 75 119 35 105 60 35 45 60
Now, use the same procedure to extract the other feature
#Extracting the feature (column) - Savings
savings <- mydata$Savings
#Calling the feature (column)
savings[1:150]
## [1] 4.2000000 4.9800000 1.9800000 7.9333333 7.0838710 12.8307692
## [7] 1.8750000 2.7000000 0.8500000 -0.4000000 2.7130435 3.3784615
## [13] 3.9600000 5.5440000 0.6750000 1.4933333 4.7200000 4.6000000
## [19] 3.7090909 4.8960000 2.4400000 15.2181818 0.2000000 4.1684211
## [25] 4.3555556 3.6000000 6.3000000 3.5200000 0.4000000 3.1418182
## [31] 6.9300000 6.5600000 2.7000000 4.4181818 2.5371429 7.8240000
## [37] 2.4000000 1.6484211 4.0000000 1.6984615 4.1333333 12.3157895
## [43] 3.2160000 -0.1600000 4.4347826 4.3714286 5.1360000 6.4000000
## [49] 4.3200000 2.8125000 1.0000000 9.8000000 4.5200000 2.1600000
## [55] 0.7200000 3.9000000 3.3000000 3.8000000 18.7500000 5.7377049
## [61] 2.6181818 2.4500000 2.7600000 2.7720000 2.2666667 3.1800000
## [67] 6.2500000 8.8235294 2.4900000 7.1200000 2.8200000 12.4000000
## [73] 13.3200000 0.6171429 1.3200000 9.8400000 0.0000000 3.7028571
## [79] 2.9333333 0.3600000 5.8909091 0.4235294 13.6800000 7.5000000
## [85] 0.6240000 7.3000000 2.7349229 0.5769231 5.1600000 2.2105263
## [91] 2.8187919 14.9760000 8.5950000 7.2000000 4.5000000 13.3920000
## [97] 2.8421053 2.1120000 3.6000000 0.4160000 3.1090909 3.8000000
## [103] 5.6492308 2.5846154 14.4000000 3.5200000 11.7500000 2.8500000
## [109] 7.6114286 6.6947368 2.2720000 8.2835821 1.5120000 13.8268657
## [115] 0.6600000 7.4250000 3.8347826 2.4757895 13.7142857 5.0000000
## [121] 3.8666667 1.8333333 6.8307692 15.0545455 9.1764706 2.5714286
## [127] 4.8387097 1.8461538 2.2320000 4.1052632 0.6428571 4.8461538
## [133] 7.5000000 11.5714286 7.7076923 4.6666667 2.8695652 2.4545455
## [139] 2.9040000 5.4750000 1.2600000 1.6421053 5.0040000 20.2353333
## [145] 3.0461538 -0.1270588 16.2857143 2.2000000 4.6000000 14.4000000
# Calculate the feature averate
meanExpenses = mean(expenses)
# Inspect the variable with the calculated mean
meanExpenses
## [1] 55.60144
Repeat the same procedure for the other feature
# Calculate the feature averate
meanSavings = mean(savings)
# Inspect the variable with the calculated mean
meanSavings
## [1] 3.860083
Compute the standard deviation for the first feature
#Computing the standard deviation
spreadExpenses = sd(expenses)
# Inspect the variable with the calculated sd
spreadExpenses
## [1] 19.52084
Compute the standard deviation for the second feature
# Calculate the feature standard deviation
spreadSavings = sd(savings)
# Inspect the variable with the calculated standard deviation
spreadSavings
## [1] 3.726292
#Compute the snr of Checking and name it snr_Checking
snrExpenses = meanExpenses/spreadExpenses
#Call snr_Checking
snrExpenses
## [1] 2.848312
# Find the SNR of the second feature
snrSavings = meanSavings/spreadSavings
# Inspect the variable with the calculated SNR
snrSavings
## [1] 1.035905
Expenses has a higher SNR value than savings. This could be due to the fact that savings has a wider range in values than expenses and varies between positive and negative values whereas expenses tends to have a smaller variety of whole numbers.
head(mydata)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance 73
## 2 good 17 rent 60 58 widow no_rec fixed 48
## 3 bad 10 owner 36 46 married yes_rec freelance 90
## 4 good 0 rent 60 24 single no_rec fixed 63
## 5 good 0 rent 36 26 single no_rec fixed 46
## 6 good 1 owner 60 36 married no_rec fixed 75
## Income Assets Debt Amount Price Finrat Savings
## 1 129 0 0 $800.00 $846.00 94.56265 4.200000
## 2 131 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 200 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 182 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 107 0 0 $310.00 $910.00 34.06593 7.083871
## 6 214 3500 0 $650.00 $1,645.00 39.51368 12.830769
Below is an example of what the simple star relational schema should look like.
Example of how to create an start schema using erdplus
Example of how to export the final start schema on erdplus
Completed Star Schema Example
Here we are going to familiarize with watson analytics, you should have access to the portal below.
https://watson.analytics.ibmcloud.com