Notebook Instructions


Task 1

Installing packages in R/RStudio.

We are going to use tidyverse a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. *Info:* https://www.tidyverse.org/

## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

1A) Read the csv file into R Studio and display the dataset.

  • Name your dataset ‘mydata’ so it easy to work with.
  • Commands: read_csv() head()
#with read_csv hit tab to see files in a folder. head shows top few rows
#use easy to remember variable names. case sensitive commands
mydata = read.csv("data/Scoring.csv")
head(mydata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance       73
## 2   good        17  rent   60  58   widow  no_rec     fixed       48
## 3    bad        10 owner   36  46 married yes_rec freelance       90
## 4   good         0  rent   60  24  single  no_rec     fixed       63
## 5   good         0  rent   36  26  single  no_rec     fixed       46
## 6   good         1 owner   60  36 married  no_rec     fixed       75
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1    129      0    0   $800.00    $846.00  94.56265  4.200000
## 2    131      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3    200   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4    182   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5    107      0    0   $310.00    $910.00  34.06593  7.083871
## 6    214   3500    0   $650.00  $1,645.00  39.51368 12.830769

Extract the assigned features (columns) to perform some analytics

To extract the features (columns) from the dataset, use the name of the dataset follow by ‘$’ sign and the name the specific column.

Extract the first feature (column)

#Extracting the Expenses Column

expenses = mydata$Expenses

#Calling the Expenses Column
expenses[1:150]
##   [1]  73  48  90  63  46  75  75  35  90  90  60  60  75  75  35  75  35
##  [18]  65  45  35  46  45 105  74  45  60  75  75  75  45  45  75  75  35
##  [35]  45  45  75  70  45  44  75  35  45  75  35  35  75  48  60  60  75
##  [52] 105  75  60  85  35  45  60  90  70  90  35  75  75  75  75  60  45
##  [69]  93  60  60  45  75  75  86  90  45  70  45  60  60  60  35  75  35
##  [86]  47  60  40  60  35  35 101  95 140  75  45  35 120  89  45  73  75
## [103]  78  35  75  76  90  45  35  79  45  90  35  60  75  35  75  45  35
## [120]  75  57  75  66 114  45  35  75  35  90  75  45  75  60  65  75  60
## [137]  45  60  45  47  60  90  75 119  35 105  60  35  45  60

Now, use the same procedure to extract the other feature

#Extracting the feature (column) - Savings

savings <- mydata$Savings

#Calling the  feature (column)
savings[1:150]
##   [1]  4.2000000  4.9800000  1.9800000  7.9333333  7.0838710 12.8307692
##   [7]  1.8750000  2.7000000  0.8500000 -0.4000000  2.7130435  3.3784615
##  [13]  3.9600000  5.5440000  0.6750000  1.4933333  4.7200000  4.6000000
##  [19]  3.7090909  4.8960000  2.4400000 15.2181818  0.2000000  4.1684211
##  [25]  4.3555556  3.6000000  6.3000000  3.5200000  0.4000000  3.1418182
##  [31]  6.9300000  6.5600000  2.7000000  4.4181818  2.5371429  7.8240000
##  [37]  2.4000000  1.6484211  4.0000000  1.6984615  4.1333333 12.3157895
##  [43]  3.2160000 -0.1600000  4.4347826  4.3714286  5.1360000  6.4000000
##  [49]  4.3200000  2.8125000  1.0000000  9.8000000  4.5200000  2.1600000
##  [55]  0.7200000  3.9000000  3.3000000  3.8000000 18.7500000  5.7377049
##  [61]  2.6181818  2.4500000  2.7600000  2.7720000  2.2666667  3.1800000
##  [67]  6.2500000  8.8235294  2.4900000  7.1200000  2.8200000 12.4000000
##  [73] 13.3200000  0.6171429  1.3200000  9.8400000  0.0000000  3.7028571
##  [79]  2.9333333  0.3600000  5.8909091  0.4235294 13.6800000  7.5000000
##  [85]  0.6240000  7.3000000  2.7349229  0.5769231  5.1600000  2.2105263
##  [91]  2.8187919 14.9760000  8.5950000  7.2000000  4.5000000 13.3920000
##  [97]  2.8421053  2.1120000  3.6000000  0.4160000  3.1090909  3.8000000
## [103]  5.6492308  2.5846154 14.4000000  3.5200000 11.7500000  2.8500000
## [109]  7.6114286  6.6947368  2.2720000  8.2835821  1.5120000 13.8268657
## [115]  0.6600000  7.4250000  3.8347826  2.4757895 13.7142857  5.0000000
## [121]  3.8666667  1.8333333  6.8307692 15.0545455  9.1764706  2.5714286
## [127]  4.8387097  1.8461538  2.2320000  4.1052632  0.6428571  4.8461538
## [133]  7.5000000 11.5714286  7.7076923  4.6666667  2.8695652  2.4545455
## [139]  2.9040000  5.4750000  1.2600000  1.6421053  5.0040000 20.2353333
## [145]  3.0461538 -0.1270588 16.2857143  2.2000000  4.6000000 14.4000000

1B) Compute the mean and standard deviation of the assigned features (columns)

  • Commands: mean() sd()
  • Use the mean() function on feature to calculate the average
  • Name the result mean and the feature name. For example meanChecking
# Calculate the feature averate
meanExpenses = mean(expenses)

# Inspect the variable with the calculated mean
meanExpenses
## [1] 55.60144

Repeat the same procedure for the other feature

# Calculate the feature averate

meanSavings = mean(savings)

# Inspect the variable with the calculated mean

meanSavings
## [1] 3.860083

1C) Compute the standard deviation or spread of the two features

  • Commands: sd()

Compute the standard deviation for the first feature

#Computing the standard deviation
spreadExpenses = sd(expenses)

# Inspect the variable with the calculated sd
spreadExpenses
## [1] 19.52084

Compute the standard deviation for the second feature

# Calculate the feature standard deviation

spreadSavings = sd(savings)

# Inspect the variable with the calculated standard deviation
spreadSavings
## [1] 3.726292

1D) Compute the signal to noise ratio (SNR) using the given formula:

  • SNR: Is the average (mean) divided by the spread (sd).
#Compute the snr of Checking and name it snr_Checking
snrExpenses = meanExpenses/spreadExpenses

#Call snr_Checking
snrExpenses
## [1] 2.848312
# Find the SNR of the second feature

snrSavings = meanSavings/spreadSavings

# Inspect the variable with the calculated SNR

snrSavings
## [1] 1.035905

Of the two features which has a higher SNR? Why do you think that is? Write your answer below.

Expenses has a higher SNR value than savings. This could be due to the fact that savings has a wider range in values than expenses and varies between positive and negative values whereas expenses tends to have a smaller variety of whole numbers.


Task 2

2A) Examine the content of the csv file ‘Scoring.csv’ by opening the file in RStudio and display the first rows of the dataset.

head(mydata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance       73
## 2   good        17  rent   60  58   widow  no_rec     fixed       48
## 3    bad        10 owner   36  46 married yes_rec freelance       90
## 4   good         0  rent   60  24  single  no_rec     fixed       63
## 5   good         0  rent   36  26  single  no_rec     fixed       46
## 6   good         1 owner   60  36 married  no_rec     fixed       75
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1    129      0    0   $800.00    $846.00  94.56265  4.200000
## 2    131      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3    200   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4    182   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5    107      0    0   $310.00    $910.00  34.06593  7.083871
## 6    214   3500    0   $650.00  $1,645.00  39.51368 12.830769

2B) Create an star schema using the website erdplus stanalone feature: https://erdplus.com/#/standalone

Below is an example of what the simple star relational schema should look like.

Example of how to create an start schema using erdplus

Example of how to export the final start schema on erdplus

Completed Star Schema Example

2C) Create a code chunk and display the star schema diagram

Task 3

Here we are going to familiarize with watson analytics, you should have access to the portal below.

https://watson.analytics.ibmcloud.com

3A) Login into Watson Analytics and upload the assigned dataset. Take an screenshot of watson’s Data section showing the quality of the dataset

3B) Use Watson Discovery capabilities to find insights in the dataset

3C) Save your work and upload a screenshot of your finding below.

Data quality is 63%

This image shows that married people tend to have more expenses, regardless of age. Married peoplpe between 40 and 55 appear to have the most expenses in terms of outliers.