ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

A. Remember to always set your working directory to the source file location. Click on Session -> Set Working Directory -> To Source File Location

B. For your assignment you will be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.

PART I: DATA OUTLIERS

We use file creditrisk.csv to provide an example related to data outliers. We use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to detect the outliers is to determine the UPPER and LOWER THRESHOLDS using formula mean (+ or -) 3 * standard deviation.

To use the above formula for Age data in the above file, we firstly must calculate the mean and standard deviation of the data in Age column. The results would be two values which are the upper and lower thresholds. Secondly, we determine the maximum and minimum values of the Age data. Thirdly, compare the maximum and minimum values to the upper and lower thresholds to identify any potential thresholds. Finally, to determine how many Age data are the outliers, look at Age data by opening the related file, sort the Age column and compare the data to the upper and lower thresholds. If an Age data value falls above the upper threshold or below the lower threshold, it is an outlier.

Below are the steps to determine outliers for Age column.

We must read the file which has the Age column, extract the Age column, calculate the mean, standard deviation, maximum value, and minimum value for the Age column using R.

#STEP I1: Read the file creditrisk.csv and extract column Age in the file and name the extracted variable as age
mydata = read.csv(file="data/creditrisk.csv") 
age = mydata$Age

#STEP I2: Calculate the average or mean and standard deviation or SD of age below. Refer to Worksheet lab_02 for the correct commands.
meanAge = mean(age) #Calculate the mean and hold it in variable meanAge
meanAge             #Display the mean value

## [1] 34.39765

spreadAge = sd(age) #Calculate the SD and hold it in variable spreadAge
spreadAge           #Display the SD value

## [1] 11.04513

#STEP I3: Calculate the maximum and minimum of age below. The command to find the maximum is max(variable) and the command to find the minimum is min(variable) where variable in this case is the extracted variable age 
maxAge = max(age)   #Calculate the maximum and hold it in variable maxAge
maxAge              #Display the minimum value

## [1] 73

minAge = min(age)   #Calculate the minimum and hold it in variable minAge
minAge              #Display the minimum value

## [1] 18

#STEP I4: Calculate the upper and lower thresholds. upperTAge is the variable to hold the upper threshold for the Age data, and lowerTAge is the variable to hold the lower threshold for the Age data.
upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge

## [1] 67.53302

lowerTAge

## [1] 1.262269

STEP I5: Compare the maximum value with upper threshold and the minimum value with lower threshold to determine any potential outliers. QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: YES, since the maximum value is 73 which is bigger than the upperthreshold 67.53302.

QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, since the minimu value is 18 which is NOT smaller than the lower threshold.

QUESTION IC: How many outliers are there and which data? ANSWER IC: ONE outlier, where Age is 73.

You will use a different file: Scoring.csv for the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column

TASK 2: Repeat STEP I2 to calculate the mean and spread

TASK 3: Repeat STEP I3 to calculate the maximum and minimum

TASK 4: Repeat STEP I4 to calculate the upper and lower thresholds

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC

# Task 1:

#STEP I1: 
mydata = read.csv(file="data/Scoring.csv") 
age = mydata$Age

# Task 2:

#STEP I2: 
meanAge = mean(age) #Calculate the mean and hold it in variable meanAge
meanAge             #Display the mean value

## [1] 37.08412

spreadAge = sd(age) #Calculate the SD and hold it in variable spreadAge
spreadAge           #Display the SD value

## [1] 10.98637

# Task 3:

#STEP I3:  
maxAge = max(age)   #Calculate the maximum and hold it in variable maxAge
maxAge              #Display the minimum value

## [1] 68

minAge = min(age)   #Calculate the minimum and hold it in variable minAge
minAge              #Display the minimum value

## [1] 18

# Task 4:

#STEP I4: Calculate the upper and lower thresholds. upperTAge is the variable to hold the upper threshold for the Age data, and lowerTAge is the variable to hold the lower threshold for the Age data.
upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge

## [1] 70.04322

lowerTAge

## [1] 4.125023

Task 5:

STEP I5: Compare the maximum value with upper threshold and the minimum value with lower threshold to determine any potential outliers. QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: No, the maximum value, 68, is not bigger than the upper threshold.

QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, since the minimum value, 18, is not smaller than the lower threshold.

QUESTION IC: How many outliers are there and which data? ANSWER IC: None.

=============================================

PART II: DATA PREPARATION

We use file creditriskorg.csv to provide an example related to data preparation. This is the original dataset and contains missing values. We will focus on the data in the Checking column. There are some problems with this data. Firstly: missing values in this column which is quite common as most datasets are not perfect. Secondly: there are commas, while R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly: there are ‘$’ symbols throughout the file which is not a numerical symbol either. We need to prepare the Checking column data to fix all the problems mentioned above, before we can analyze the Checking data. In this case for example we want to find the mean of the Checking data.

Below are the steps to prepare the Checking data.

We must read the file which has the Checking column and extract the Checking column.

#STEP II1a: Read the file creditriskorg.csv 
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)

##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.

#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)

##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low

#STEP II1c: Extract the Checking colum.
checking = newdata$Checking

#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
#meanChecking = mean(checking)

There’s an error. These errors are related to the problems mentioned above (under PART II).

In order to remove the comma and dollar sign in the number e.g. “$1,234”, we must substitute them with just an empty space. The sub function replaces these symbols with something els, e.g. blank or empty space.

#STEP II2: substitute comma, dollar sign with blank in all of checking data. 
checking = sub(",","",checking)   #Substitute comma with blank
checking = sub("\\$","",checking) #Substitute $ with blank

In order to calculate the mean we have to convert all Checking values to numeric to remove any NA

#STEP II3: Convert values to numeric to remove any NA and calculate the mean
checking = as.numeric(checking) #Convert values to numeric to remove any NA

## Warning: NAs introduced by coercion

meanChecking = mean(checking,na.rm=TRUE)
meanChecking

## [1] 2559.805

You will use a different file: scoring_origial.csv for the following three tasks.

TASK 6: Repeat STEP II1a-II1d to read file scoring-original.csv and extract its Price column

TASK 8: Repeat STEP II3 to remove any NA and calculate the mean of Price

TASK 9: Answer the following questions:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “$” symbols?

# Task 6:
#STEP II1a: 
newdata = read.csv(file="data/scoring_original.csv")
head(newdata)

##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769

#STEP II1b:
#newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
#head(newdata)

#STEP II1c:
price = newdata$Price

#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
#meanChecking = mean(checking)

# Task 7:
#STEP II2: substitute comma, dollar sign with blank in all of checking data. 
price = sub(",","", price) #Substitute comma with blank
price = sub("\\$","", price) #Substitute $ with blank

# Task 8:
#STEP II3: Convert values to numeric to remove any NA and calculate the mean
price = as.numeric(price) #Convert values to numeric to remove any NA

## Warning: NAs introduced by coercion

meanPrice = mean(price,na.rm=TRUE)
meanPrice

## [1] 1462.48

Task 9:

Question IIA: To clean the data we can delete the blank data, duplicates, miscode, bad business logic, or “dirty” records.

Question IIB: Excel treats the missing values as zeros and treats the “$” symbols as a character.

PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? 38,727 KB

How many number of columns and of rows does the file have? 24 columns & 100,000 rows.

What is the primary key? Trip ID

What are other attributes’ names? Taxi ID, trip start timestamp, trip end time stamp, trip seconds, trip miles, pick up census tract, dropoff census tract, pickup community area, dropoff community area, fare, tips, tolls, estras, trip total, payment type, company, pickup centroid latitude, pickup centroid longitude, pickup centroid location, dropoff centroid latitude, dropoff centroid longitude, dropoff centroid location, community areas.

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’. Trip Seconds and Total Fares. If the trips are longer, the fares should be higher; if the trips are shorter, the fares should be lower. And if the trip is 0 seconds, there should not be a charge at all since the trip didn’t happen.

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file.

A Fact table for Trip
A Dimension table for Taxi Company with its relevant attributes
A Dimension table for Payment Type with its relevant attributes
A Dimension table for Total Fare with its relevant attributes

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age)

##   0%  25%  50%  75% 100% 
##   18   28   36   45   68

lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold

##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold

## 25% 
## 2.5

It can also be useful to visualize the data using a box and whisker plot.

boxplot(age)

Business Analytics Lab Worksheet 03 (bsad_lab03)

CME Group Foundation Business Analytics Lab

Marina Petrushin

February 14, 2018

ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

PART I: DATA OUTLIERS

You will use a different file: Scoring.csv for the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column

TASK 2: Repeat STEP I2 to calculate the mean and spread

TASK 3: Repeat STEP I3 to calculate the maximum and minimum

TASK 4: Repeat STEP I4 to calculate the upper and lower thresholds

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC

PART II: DATA PREPARATION

You will use a different file: scoring_origial.csv for the following three tasks.

TASK 6: Repeat STEP II1a-II1d to read file scoring-original.csv and extract its Price column

TASK 8: Repeat STEP II3 to remove any NA and calculate the mean of Price

TASK 9: Answer the following questions:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “$” symbols?

PART III: Data Modeling

TASK 10 A:

TASK 10 B:

TASK 10 C:

==================================

EXTRA information:

Business Analytics Lab Worksheet 03 (bsad_lab03)

CME Group Foundation Business Analytics Lab

Marina Petrushin

February 14, 2018

ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

3. THERE ARE TEN TASKS IN THIS ASSIGNMENT WHICH ARE RELATED TO DATA OUTLIERS, DATA PREPARATION, AND DATA MODELING. THIS LAB REQUIRES THE USE OF MICROSOFT EXCEL, R, AND ERDPLUS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

PART I: DATA OUTLIERS

You will use a different file: Scoring.csv for the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column

TASK 2: Repeat STEP I2 to calculate the mean and spread

TASK 3: Repeat STEP I3 to calculate the maximum and minimum

TASK 4: Repeat STEP I4 to calculate the upper and lower thresholds

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC

PART II: DATA PREPARATION

You will use a different file: scoring_origial.csv for the following three tasks.

TASK 6: Repeat STEP II1a-II1d to read file scoring-original.csv and extract its Price column

TASK 7: Repeat STEP II2 to remove the comma and $ sign in the Price data

TASK 8: Repeat STEP II3 to remove any NA and calculate the mean of Price

TASK 9: Answer the following questions:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “$” symbols?

PART III: Data Modeling

TASK 10 A:

TASK 10 B:

TASK 10 C:

==================================

EXTRA information: