Notebook Instructions


Task 1

Installing packages in R/RStudio.

We are going to use the tidyverse, a collection of R packages designed for data science. All of its packages share an underlying design philosophy, grammar, and data structures.

## Loading required package: tidyverse
## ── Attaching packages ────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

1A) Read the CSV file into RStudio and display the dataset.

  • Name your dataset ‘mydata’ so it is easy to work with.
  • Commands: read_csv(), head()
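
The two commands above can be sketched as follows. This is a minimal illustration, not the assigned solution: the temporary file, the Savings column, and all values are made up so that the chunk runs on its own. With the real data, pass the path of the assigned CSV file to read_csv() instead.

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Hypothetical stand-in for the assigned file: a tiny CSV with made-up values.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Checking,Savings",
             "1200,300",
             "850,1500",
             "430,90"), csv_path)

# Read the file into a tibble called mydata, then show its first rows.
mydata <- read_csv(csv_path)
head(mydata)
```

head() shows the first six rows by default, so on a three-row file it prints everything.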

Extract the assigned features (columns) to perform some analytics

To extract the features (columns) from the dataset, use the name of the dataset followed by the ‘$’ sign and the name of the specific column.
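
A small sketch of the ‘$’ syntax, assuming a dataset named mydata with a column called Checking. The data frame below is a made-up stand-in (the Savings column and all values are hypothetical), so the column names in the assigned dataset may differ.

```r
# Made-up stand-in for mydata; the assigned dataset will have different values.
mydata <- data.frame(Checking = c(1200, 850, 430),
                     Savings  = c(300, 1500, 90))

# Extract the Checking column into its own variable.
Checking <- mydata$Checking

# Call (print) the variable to inspect its contents.
Checking
```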

Extract the first feature (column)

#Extracting the Checking Column

#Calling the Checking Column

Now, use the same procedure to extract the other feature

#Extracting the feature (column)

#Calling the feature (column)

1B) Compute the mean and standard deviation of the assigned features (columns)

  • Commands: mean(), sd()
  • Use the mean() function on the feature to calculate its average.
  • Name the result ‘mean’ followed by the feature name, for example meanChecking.

# Calculate the feature average

# Inspect the variable with the calculated mean
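
As an illustration of the naming convention, here is a minimal sketch that assumes a numeric vector called Checking. The values are made up so the chunk is self-contained; in the notebook, Checking would be the column extracted earlier.

```r
# Made-up values standing in for the extracted Checking column.
Checking <- c(1200, 850, 430)

# Calculate the average and store it under the suggested name.
meanChecking <- mean(Checking)

# Inspect the stored value.
meanChecking
```

If the real column contains missing values, mean() returns NA unless na.rm = TRUE is supplied.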

Repeat the same procedure for the other feature

# Calculate the feature average

# Inspect the variable with the calculated mean

1C) Compute the standard deviation or spread of the two features

  • Commands: sd()

Compute the standard deviation for the first feature

#Computing the standard deviation

# Inspect the variable with the calculated sd
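
The same pattern works for the spread. The sketch below again uses made-up values for Checking and borrows the name spreadChecking from the SNR step later in this task.

```r
# Made-up values standing in for the extracted Checking column.
Checking <- c(1200, 850, 430)

# sd() returns the sample standard deviation (it divides by n - 1).
spreadChecking <- sd(Checking)

# Inspect the stored value.
spreadChecking
```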

Compute the standard deviation for the second feature

# Calculate the feature standard deviation

# Inspect the variable with the calculated standard deviation

1D) Compute the signal to noise ratio (SNR) using the given formula:

  • SNR: the average (mean) divided by the spread (sd).

#Compute the SNR of Checking and name it snr_Checking (meanChecking/spreadChecking)

#Call snr_Checking

# Find the SNR of the second feature

# Inspect the variable with the calculated SNR
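
Putting the formula together, a minimal sketch under the same assumptions as before (made-up Checking values, with meanChecking and spreadChecking computed as in 1B and 1C):

```r
# Made-up values standing in for the extracted Checking column.
Checking <- c(1200, 850, 430)

# Mean and spread, as computed in the earlier steps.
meanChecking   <- mean(Checking)
spreadChecking <- sd(Checking)

# Signal-to-noise ratio: the mean divided by the standard deviation.
snr_Checking <- meanChecking / spreadChecking

# Inspect the stored value.
snr_Checking
```

A larger SNR means the typical value is large relative to how much the values vary around it.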

Of the two features, which has the higher SNR? Why do you think that is? Write your answer below.


Task 2

2A) Examine the content of the CSV file ‘Scoring.csv’ by opening the file in RStudio and displaying the first rows of the dataset.

2B) Create a star schema using the standalone feature of the ERDPlus website: https://erdplus.com/#/standalone

Below is an example of what the simple star relational schema should look like.

Example of how to create a star schema using ERDPlus

Example of how to export the final star schema on ERDPlus

Completed Star Schema Example

2C) Create a code chunk and display the star schema diagram
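
One possible way to do this is with knitr::include_graphics(), which embeds an image file in the knitted notebook. The sketch below assumes the diagram was exported from ERDPlus as an image; the file name star_schema.png is hypothetical and should be replaced with the actual path.

```r
# Hypothetical file name: use the path of the image exported from ERDPlus.
schema_path <- "star_schema.png"

# include_graphics() embeds the image when the notebook is knitted.
# The file.exists() guard keeps the chunk from failing if the image is missing.
if (file.exists(schema_path)) {
  knitr::include_graphics(schema_path)
} else {
  message("Image not found: ", schema_path)
}
```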


Task 3

Here we are going to get familiar with Watson Analytics. You should have access to the portal below. https://watson.analytics.ibmcloud.com

3A) Log in to Watson Analytics and upload the assigned dataset. Take a screenshot of Watson’s Data section showing the quality of the dataset.

3B) Use Watson Discovery capabilities to find insights in the dataset. Take a screenshot of the discovery section.

3C) Save your work, upload a screenshot of something that you find below, and explain the output.