Business Discovery and Data Sourcing

Notebook Instructions

For your assignment you may be using different dataset than what is included here.
Always read carefully the instructions on Sakai.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse a collection of R packages designed for data science.

Info: https://www.tidyverse.org/

## Loading required package: tidyverse

## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Task 1: Data Prepartion and Cleaning

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()

Extract the assigned features (columns) to perform some analytics

Find the average for each assigned feature

** What happens when we try to use the function? **

To resolve the error, check if the feature data type is correct
Notice that comma ‘1,234’ in some values
Also the data type and dollar sign ‘$’ symbol
To remove the any character in this case the comma from “1,234”. We must substitute it with just an empty space.

1B) Substitute the characters causing the issue with empty character (“”) on all features

** substitute comma with “” **

# VARIABLE <- sub("," , "", VARIABLE)

** substitute dollar sign with “” **

1C) Change the features data type from character to numeric

** character to numeric - as.numeric()**

** mean with NA removed **

# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)

1D) Assign new clean variables to dataset and save your clean new data

** Clean variables to dataset **

# mydata$VARIABLE <- VARIABLE

** Save clean data **

# write_csv(mydata, path = "data/mydata_clean.csv")

Task 2: Outlier Detection

Commands: mean() sd() min() max() quantile()

In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.

2A) Extract the assigned features (columns) to perform some analytics

Calculate the average and standard deviation of the assigned feature (column)

** calculate the average **

** calculate the standard deviation **

2B) Calculate the min and max of the assigned feature (column)

** calculate the min **

** calculate the max **

2C) Quantile calculation to find outliers of the assigned feature (column)

To find the outliers we are going to look at the upper and lower limits
An outlier is value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.

** quantile calculation for the give feature**

** Lower and upper quantile calculation **

# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]

Interquantile calculation

# iqr = upperq - lowerq

2D) Finding upper and lower thresholds

The threshold is the boundaries that determine if a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.

** Calculation the upper threshold **

# upper_threshold = (iqr * 1.5) + upperq

** Calculation the lower threshold **

# lower_threshold = lowerq - (iqr * 1.5)

** Identify outliers **

# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE > lower_threshold][1:10]

** Finding outliers records **

# mydata[ VARIABLE > upper_threshold, ][1:10]
# mydata[ VARIABLE > lower_threshold, ][1:10]

Are there any outliers, if not explain the lack of outliers? if any explain what the outliers represent and how many records are outliers? Use the count() function to find the number of outliers.

Plotting Outliers

It can also be useful to visualize the data using a box and whisker plot.
The boxplot supports the IQR also shows the upper and lower thresholds

# p <- ggplot(data = Scoring, aes(x = "", y = Finrat)) + geom_boxplot() + coord_flip()
# p

Task 3: Data Modeling - Chicago Taxi Data

3A) Go and explore the interactive dashboard. Take a screenshot of something that you find interesting in the dashboard and write a short summary of your findings

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released

3B) Write the Metadata for the Taxi data (description), note the size of the file its dimension (number of rows and columns). Use the function summary() to help you identify unique entities, fields, number of rows and data types.

3C) Write a description of any relational business logic and field integrity

3D) Draw a star schema of the Chicago Taxi Data

Modeling Tool: https://erdplus.com/#/standalone