ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

A. Remember to always set your working directory to the source file location. Click on Session -> Set Working Directory -> To Source File Location

B. For your assignment you will be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.

PART I: DATA OUTLIERS

We use file creditrisk.csv to provide an example related to data outliers. We use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to detect the outliers is to determine the UPPER and LOWER THRESHOLDS using formula mean (+ or -) 3 * standard deviation.

Below is the steps on how to determine the outliers in the Age data: 1. Read the data file and extract the Age column to get the Age data. 2. Calculate the mean and standard deviation of the data in Age column. 3. Calculate the upper threshold using formula mean + 3 * standard deviation, and calculate the lower threshold using formula mean - 3 * standard deviation. 4. Calculate the maximum and minimum values of the Age data. 5. Compare the maximum and minimum values to the upper and lower thresholds to identify any potential outliers. If the data value falls above the upper threshold or below the lower threshold, it is an outlier. Finally, to determine how many outliers in the Age data, look at Age data by opening the related file, sort the Age column and count the number of data values which are above the upper threshold, and count the number of data values which are below the lower thresholds. Add these two counts to get the number of outliers in the Age data.

Let’s determine the outliers in the Age data using R.

We must write codes to read the file which has the Age column, extract the Age column, calculate the mean, standard deviation, maximum value, and minimum value for the Age column.

#STEP I1: Read the file creditrisk.csv and extract column Age in the file and name the extracted variable as myage.
mydata = read.csv(file="data/creditrisk.csv") 
myage = mydata$Age

#STEP I2: Calculate the average or mean and standard deviation or SD of myage (the Age data) and display their values. Refer to Worksheet lab_02 for the correct commands.
meanAge = mean(myage)   #Calculate the mean of Age data and hold it in variable meanAge.
meanAge                 #Display the mean value.

## [1] 34.39765

spreadAge = sd(myage)   #Calculate the SD of Age data and hold it in variable spreadAge.
spreadAge               #Display the SD value.

## [1] 11.04513

#STEP I3: Calculate the upper and lower thresholds of myage (the Age data) and display their values. The upperTAge is the variable to hold the upper threshold and the lowerTAge is the variable to hold the lower threshold.
upperTAge = meanAge + 3 * spreadAge  #Calculate the upper threshold of the Age data.
lowerTAge = meanAge - 3 * spreadAge  #Calculate the lower threshold of the Age data.
upperTAge                            #Display the upper threshold value.

## [1] 67.53302

lowerTAge                            #Display the lower threshold value.

## [1] 1.262269

#STEP I4: Calculate the maximum and minimum values of myage (the Age data) and display their values. The command to find the maximum is max(variable) and the command to find the minimum is min(variable) where variable in this case is the extracted variable myage.
maxAge = max(myage)   #Calculate the maximum value of Age data and hold it in variable maxAge.
maxAge                #Display the maximum value.

## [1] 73

minAge = min(myage)   #Calculate the minimum value of Age data and hold it in variable minAge.
minAge                #Display the minimum value.

## [1] 18

STEP I5: Compare the maximum value with upper threshold and compare the minimum value with lower threshold to determine any potential outliers. QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: YES, since the maximum value is 73 which is bigger than the upper threshold 67.53302.

QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, since the minimum value is 18 which is NOT smaller than the lower threshold.

QUESTION IC: How many outliers are there and which data? ANSWER IC: ONE outlier, where Age is 73.

Use a different file: Scoring.csv and complete the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column.

mydata = read.csv(file="data/Scoring.csv")
myage = mydata$Age

TASK 2: Repeat STEP I2 to calculate the mean and spread and display their values.

meanAge = mean(myage)
meanAge

## [1] 37.08412

spreadAge = sd(myage)
spreadAge

## [1] 10.98637

TASK 3: Repeat STEP I3 to calculate the upper and lower thresholds and display their values.

upperTAge = meanAge + 3 * spreadAge
lowerTAge = meanAge - 3 * spreadAge
upperTAge

## [1] 70.04322

lowerTAge

## [1] 4.125023

TASK 4: Repeat STEP I4 to calculate the maximum and minimum and display their values.

maxAge = max(myage)
maxAge

## [1] 68

minAge = min(myage)
minAge

## [1] 18

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC.

QUESTION IA: Are there any outliers on the upper side of the Age data? ANWER IA: No, because the maximum value is 68 which is not bigger than the upper threshold 70.04322

QUESTION IB: Are there any outliers on the lower side of the Age data? ANWER IB: No, because the minimum value is 18 which is not smaller than the lower threshold 4.125023

QUESTION IC: How many outliers are there and which data (if any)? ANSWER IC: zero outliers

=============================================

PART II: DATA PREPARATION

We use file creditriskorg.csv to provide an example related to data preparation. This is the original dataset and contains missing values. We will focus on the data in the Checking column. There are some problems with this data. Firstly: missing values in this column which is quite common as most datasets are not perfect. Secondly: there are commas, while R does not recognize that 1,234 is equivalent to 1234. Lastly: there are ‘$’ symbols throughout the file which is not a numerical symbol either. We need to prepare the Checking column data to fix all the problems mentioned above, before we can analyze the Checking data. In this case for example, the analysis that we want to do is finding the mean of the Checking data.

Below are the steps to prepare the Checking data.

We must read the file which has the Checking column and extract the Checking column.

#STEP II1a: Read the file creditriskorg.csv 
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)

##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.

#STEP II1b:
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)

##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low

#STEP II1c: Extract the Checking colum.
checking = newdata$Checking

#STEP II1d: Calculate the mean for Checking use the mean() function. What happens when we try to use the function?
meanChecking = mean(checking)

## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA

There’s an error. These errors are related to the problems mentioned above (under PART II); i.e. in R the comma (,) and dollar ($) signs aren’t recognized as part of the numeric values. Therefore, we have to remmove them.

In order to remove the comma and dollar signs in the number e.g. “$1,234”, we must substitute them with just an empty space. The sub function replaces these symbols with something else, e.g. blank or empty space.

#STEP II2: substitute comma and dollar signs with blank in all of checking data. 
checking = sub(",","",checking)   #Substitute comma with blank.
checking = sub("\\$","",checking) #Substitute $ with blank.

In order to calculate the mean we have to convert all Checking values to numeric to remove any NA value.

#STEP II3: Convert values to numeric to remove any NA and calculate the mean.
checking = as.numeric(checking)             #Convert values to numeric to remove any NA.

## Warning: NAs introduced by coercion

meanChecking = mean(checking,na.rm=TRUE)    #Compute the mean value.   
meanChecking                                #Display the mean value.

## [1] 2559.805

Use a different file: scoring_origial.csv and complete the following four tasks.

TASK 6: Repeat STEPs II1a and II1c to read file scoring-original.csv and extract its Price column.

newdata = read.csv(file="data/scoring_original.csv")
head(newdata)

##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769

price = newdata$Price

TASK 7: Repeat STEP II2 to remove the comma and $ signs in the Price data.

price = sub(",","",price)
price = sub("\\$","",price)

TASK 8: Repeat STEP II3 to remove any NA, calculate the mean of Price data, and display the mean value.

price = as.numeric(price)

## Warning: NAs introduced by coercion

meanPrice = mean(price,na.rm=TRUE)
meanPrice

## [1] 1462.48

TASK 9: Answer the following questions by referring to lectures slides and other resources:

Question IIA: What are some other ways to clean this data?

Answer: cleansing, transformation, detection, imputation, reduction, abstraction, filtering, derivation, and attributes.

Question IIB: How does Excel treat the missing values and the “$" symbols? Answer: Excel treats the missing values as non-existent with the ### symbols. The "$” symbol is what helps excel read the data as a price. Values without the “$”" symbol are treated as missing data.

PART III: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

TASK 10 A:

Open in RStudio or Excel the Taxi_Trips_Sample.csv file located in the data folder. Answer these questions: What is the size of the file? How many number of fields and records does the file have? What is the primary key? What are other attributes’ names (give at least three)?

Size of the file is 38,727 KB. Number of records is 100000 and number of fields is 24. The primary key is Trip ID. Other attributes are Trip Miles, Tips, and Tolls

TASK 10 B:

Define a relational business logic integrity check for the column field ‘Trip Seconds’ in relation to other field(s).

A relational business logic integrity check for ‘Trip Seconds’ would be if the time of ‘Trip Seconds’ exists when the ‘Trip Miles’ is zero. It could also be when the difference between ‘Trip End’ and ‘Trip Start’ is less or greater than ‘Trip Seconds’ even if conversion is necessary.

TASK 10 C:

Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables and insert the image of the star schema in this file. Use the fields in the Taxi_Trips_Sample.csv file to determine the required attributes for each dimension.

A Dimension table for Trip with at least five attributes (including the primary key)..
A Dimension table for Billing with at least five attributes (including the primary key).
A Fact table for Taxi Company.

==================================

EXTRA information:

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

Bsad Lab03

CME Group Foundation Business Analytics Lab

Ahad Sami

2/12/19

ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

PART I: DATA OUTLIERS

Use a different file: Scoring.csv and complete the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column.

TASK 2: Repeat STEP I2 to calculate the mean and spread and display their values.

TASK 3: Repeat STEP I3 to calculate the upper and lower thresholds and display their values.

TASK 4: Repeat STEP I4 to calculate the maximum and minimum and display their values.

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC.

PART II: DATA PREPARATION

Use a different file: scoring_origial.csv and complete the following four tasks.

TASK 6: Repeat STEPs II1a and II1c to read file scoring-original.csv and extract its Price column.

TASK 7: Repeat STEP II2 to remove the comma and $ signs in the Price data.

TASK 8: Repeat STEP II3 to remove any NA, calculate the mean of Price data, and display the mean value.

TASK 9: Answer the following questions by referring to lectures slides and other resources:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “\(" symbols? Answer: Excel treats the missing values as non-existent with the ### symbols. The "\)” symbol is what helps excel read the data as a price. Values without the “$”" symbol are treated as missing data.

PART III: Data Modeling

TASK 10 A:

TASK 10 B:

TASK 10 C:

==================================

EXTRA information:

Bsad Lab03

CME Group Foundation Business Analytics Lab

Ahad Sami

2/12/19

ATTENTION:

1. READ EVERY SINGLE LINE BELOW TO GET A COMPLETE UNDERSTANDING OF WHAT NEED TO BE DONE IN THIS ASSIGNMENT.

2. SET UP THE DIRECTORY AS INSTRUCTED IN THE “SETUP” BELOW BEFORE YOU PROCEED WITH THIS ASSIGNMENT TO AVOID ANY ERRORS.

3. THERE ARE TEN TASKS IN THIS ASSIGNMENT WHICH ARE RELATED TO DATA OUTLIERS, DATA PREPARATION, AND DATA MODELING. THIS LAB REQUIRES THE USE OF MICROSOFT EXCEL, R, AND ERDPLUS.

4. DO THE TASKS SEQUENTIALLY TO AVOID ERRORS WHEN YOU RUN THE CODES.

5. BEFORE YOU KNIT YOUR RMD FILE AND PUBLISH IT ON THE RPUBS, MAKE SURE TO RUN EACH CHUNK OF YOUR CODES BY CLICKING THE GREEN ARROW ON THE RIGHT OF THE CODE CHUNK.

6. FULL POINTS WILL ONLY BE GIVEN TO TASKS WHICH CODES ARE RUN SUCCESSFULLY (ERROR-FREE), FOR THE LAST TASK - IF YOU ANSWER CORRECTLY BOTH QUESTIONS

7. PUBLISH YOUR WORK TO RPUBS BY NAMING IT “bsad_lab03”, COPY YOUR RPUBS LINK TO SAKAI AND SUBMIT THE LINK.

Setup

PART I: DATA OUTLIERS

Use a different file: Scoring.csv and complete the following five tasks.

TASK 1:Repeat STEP I1 to read file Scoring.csv and extract its Age column.

TASK 2: Repeat STEP I2 to calculate the mean and spread and display their values.

TASK 3: Repeat STEP I3 to calculate the upper and lower thresholds and display their values.

TASK 4: Repeat STEP I4 to calculate the maximum and minimum and display their values.

TASK 5: Repeat STEP I5 to answer QUESTIONS IA, IB, and IC.

PART II: DATA PREPARATION

Use a different file: scoring_origial.csv and complete the following four tasks.

TASK 6: Repeat STEPs II1a and II1c to read file scoring-original.csv and extract its Price column.

TASK 7: Repeat STEP II2 to remove the comma and $ signs in the Price data.

TASK 8: Repeat STEP II3 to remove any NA, calculate the mean of Price data, and display the mean value.

TASK 9: Answer the following questions by referring to lectures slides and other resources:

Question IIA: What are some other ways to clean this data?

Question IIB: How does Excel treat the missing values and the “\(" symbols? Answer: Excel treats the missing values as non-existent with the ### symbols. The "\)” symbol is what helps excel read the data as a price. Values without the “$”" symbol are treated as missing data.

PART III: Data Modeling

TASK 10 A:

TASK 10 B:

TASK 10 C:

==================================

EXTRA information: