Data Preparation & Flow (lab04)

About

This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read the instructions on Sakai carefully. For clarity, tasks/questions to be completed/answered are highlighted in red (visible in preview) and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.

Execute all code chunks, preview, publish, and submit the link on Sakai.

Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file="creditrisk.csv") 

#Name the extracted variable
age = mydata$Age

##### 1A) Fill in the code chunk below to calculate and display each result. Refer to previous worksheets and on-line help for some commands

#Calculate the average age below. 
mean(age)

## [1] 34.39765

#Calculate the standard deviation of age below. 
sd(age)

## [1] 11.04513

#Calculate the maximum of age below. Look in Help to find the right command for the maximum
max(age)

## [1] 73

#Calculate the minimum of age below. Look in Help to find the right command for the minimum
min(age)

## [1] 18

Next, we will use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the mean (+ or -) 3 * standard deviation.

mean(age)+(3*sd(age))

## [1] 67.53302

mean(age)-(3*sd(age))

## [1] 1.262269

#There are outliers on the high side: the maximum age is 73, which is above the upper limit of about 67.5, so there is at least one outlier and possibly more. There are none on the low side, since the lower limit is about 1.26 and the minimum age is 18.

##### 1B) Write and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Display your results. Based on the limits do you think there are outliers? Explain your answer
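
The limits from the mean (+ or -) 3 * standard deviation formula can also be used to count and list the outlying values directly, rather than only comparing them with the maximum and minimum. The sketch below assumes a small made-up vector, `age_demo`, standing in for the Age column; it is not the creditrisk.csv data.

```r
# Hypothetical ages standing in for mydata$Age (not the real data)
age_demo <- c(rep(c(20, 25, 30, 35, 40), times = 6), 95)

# Limits from the mean +/- 3 standard deviations rule
upper_limit <- mean(age_demo) + 3 * sd(age_demo)
lower_limit <- mean(age_demo) - 3 * sd(age_demo)

# Logical vector: TRUE where a value lies outside the limits
is_outlier <- age_demo > upper_limit | age_demo < lower_limit

sum(is_outlier)        # number of outliers: 1
age_demo[is_outlier]   # the outlying value(s): 95
```

Replacing `age_demo` with your own `age` variable should give the count for the real column.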

Another similar method to find the upper and lower thresholds, discussed in many introductory statistics courses, involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range.

quantile(age)

##   0%  25%  50%  75% 100% 
##   18   26   32   41   73

lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

To calculate the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold

##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold

## 25% 
## 3.5
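
The threshold calculations above can be combined into one short chunk that also extracts the values falling outside the thresholds. This is a sketch on a made-up vector, `age_demo` (an assumption, not the worksheet data). It selects the quartiles by name instead of by position and uses the built-in `IQR()` function as a cross-check.

```r
# Hypothetical ages (not the creditrisk.csv data)
age_demo <- c(18, 22, 26, 28, 32, 35, 41, 44, 52, 73)

# Ask only for the two quartiles we need and pick them by name
q <- quantile(age_demo, probs = c(0.25, 0.75))
lower_q <- q[["25%"]]
upper_q <- q[["75%"]]

# Interquartile range; IQR(age_demo) returns the same number
iqr_demo <- upper_q - lower_q

# Thresholds at 1.5 * IQR beyond the quartiles
upper_thr <- upper_q + 1.5 * iqr_demo
lower_thr <- lower_q - 1.5 * iqr_demo

# Values outside the thresholds are the potential outliers
iqr_outliers <- age_demo[age_demo > upper_thr | age_demo < lower_thr]
iqr_outliers   # 73
```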

A good way to understand the above calculations is to visualize the results using a box and whisker plot. The top and bottom ends of the box correspond to the upper and lower quartiles. The median is marked by a bold line. The whiskers are the lines extending from the upper and lower quartiles toward the upper and lower thresholds; in R's boxplot() they stop at the most extreme data value that is still within the thresholds. Any point beyond the whiskers is a potential outlier.

boxplot(age)

##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B?

#It seems there are several outliers in the upper range of the data, but none in the lower range, as predicted. The points are difficult to count on my screen, so I cannot say for sure, but there is certainly more than one, maybe around 5.
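
When the points on the plot are hard to count by eye, the values that a box plot would draw as outliers can be listed with `boxplot.stats()`. The sketch below uses a made-up vector, `age_demo` (an assumption, not the worksheet data). Note that `boxplot.stats()` computes the quartiles as hinges, so on some data sets its thresholds may differ slightly from those obtained with `quantile()`.

```r
# Hypothetical ages (not the creditrisk.csv data)
age_demo <- c(18, 22, 26, 28, 32, 35, 41, 44, 52, 73)

# Statistics that boxplot() uses to draw the box, whiskers, and points
stats_demo <- boxplot.stats(age_demo)

stats_demo$out           # values drawn as individual points: 73
length(stats_demo$out)   # how many there are: 1
```

Running the same two lines on the real `age` variable should give an exact count for question 1C.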

Task 2: Data Preparation

Next, we will read the file creditriskorg.csv into R as provided in its original form. Unlike the cleaned file creditrisk.csv, the original dataset will require some data preparation.

newdata = read.csv(file="creditriskorg.csv")
head(newdata)

##                 X       X.1         X.2             X.3             X.4    X.5
## 1    Loan Purpose Checking      Savings Months Customer Months Employed Gender
## 2 Small Appliance     $-       $739.00               13              12      M
## 3       Furniture     $-     $1,230.00               25               0      M
## 4         New Car     $-       $389.00               19             119      M
## 5       Furniture  $638.00     $347.00               13              14      M
## 6       Education  $963.00   $4,754.00               40              45      M
##              X.6 X.7     X.8   X.9       X.10        X.11
## 1 Marital Status Age Housing Years        Job Credit Risk
## 2         Single  23     Own     3  Unskilled         Low
## 3       Divorced  32     Own     1    Skilled        High
## 4         Single  38     Own     4 Management        High
## 5         Single  36     Own     2  Unskilled        High
## 6         Single  31    Rent     3    Skilled         Low

We observe a new line is inserted with the header labels X, X.1, ... and that the true column headers are shifted down. This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.

newdata = read.csv(file="creditriskorg.csv",skip=1) 
head(newdata)

##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low
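
The effect of `skip=1` can be reproduced without the data file by reading a few lines of CSV text. This is a sketch that assumes the empty first row is stored as a line of bare commas, which is how spreadsheets commonly export an empty row; the three columns and their values are made up for illustration.

```r
# Hypothetical CSV content: an empty first row, then the real header
csv_text <- ",,\nLoan Purpose,Age,Credit Risk\nFurniture,32,High\nNew Car,38,Low\n"

# Without skip, the empty row is taken as the header
raw_demo <- read.csv(text = csv_text)
names(raw_demo)     # "X" "X.1" "X.2"

# With skip = 1, the true header row is used
clean_demo <- read.csv(text = csv_text, skip = 1)
names(clean_demo)   # "Loan.Purpose" "Age" "Credit.Risk"

# Age is now read as a number instead of text
is.numeric(clean_demo$Age)   # TRUE
```

The spaces in the header labels become dots because `read.csv()` converts column names to valid R names by default.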

Next we want to extract the Checking column and then find the average of checking, similar to what we did in the previous lab. When we try to execute the code chunk below, notice that R issues a warning and returns NA instead of a mean.

checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)

## Warning in mean.default(checking): argument is not numeric or logical: returning
## NA

## [1] NA

To resolve the problem, we must first understand its source. There are missing values in the csv file, represented by the symbol $-. Missing data is quite common, as most datasets are not perfect. Additionally, the numbers in the spreadsheet contain commas, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, and ‘$’ is not a numeric character either. As a result, R reads the column as text rather than as numbers.

To fix this we need to do some data cleanup. For this we will use the R function sub() to replace unwanted symbols with something else. For example, to remove the comma in the number “1,234”, we can substitute it with an empty string. Note that sub() replaces only the first match in each value, while gsub() replaces every match. Below is a sequence of commands to help with the cleanup of the data in the Checking column and the eventual calculation of the mean.

#substitute comma with blank in all of checking  
checking= sub(",","",checking)

#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking = sub("\\$","",checking)

#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Available)
checking = as.numeric(checking)

## Warning: NAs introduced by coercion

#Calculate mean of checking with all NAs removed 
mean(checking,na.rm=TRUE)

## [1] 2559.805
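
The same cleanup can be done in one step with `gsub()`, which replaces every match rather than only the first one. This matters for values with more than one comma. The sketch below uses made-up values, `raw_demo` (an assumption, not the Checking column), and removes both symbols with the character class `[$,]`.

```r
# Hypothetical raw values, including a "$-" entry and a two-comma number
raw_demo <- c(" $1,230.00 ", " $-   ", " $1,234,567.00 ", " $638.00 ")

# sub() removes only the first comma, so the second one survives
sub(",", "", raw_demo[3])   # " $1234,567.00 "

# gsub() removes every "$" and every "," in each value
cleaned <- gsub("[$,]", "", raw_demo)

# Trim spaces and convert; "-" cannot be converted and becomes NA
amounts <- suppressWarnings(as.numeric(trimws(cleaned)))
amounts                       # 1230 NA 1234567 638

# Mean with the NA removed
mean(amounts, na.rm = TRUE)   # 412145
```

`suppressWarnings()` is used here only to hide the expected "NAs introduced by coercion" warning; leaving it out gives the same values.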

##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use a different variable naming

savings = newdata$Savings
savings = sub(",","",savings)
savings = sub("\\$","",savings)
savings = as.numeric(savings)

## Warning: NAs introduced by coercion

mean(savings,na.rm=TRUE)

## [1] 2122.146

##### 2B) Calculate now the mean of the Checking column in Excel using the Excel function Average. Compare the two results, from Excel and R and share your insights

##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol $-? Are they included in the calculation of the mean or excluded? Explain your answer
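
One way to reason about questions 2B and 2C is to compute the mean twice in R: once with the missing values excluded and once with them counted as 0. Whichever result matches the Excel average suggests how Excel treated the $- cells. The sketch below uses made-up numbers, `checking_demo`; the idea that the cells might be stored as 0 is an assumption to verify in your own workbook.

```r
# Hypothetical cleaned values; NA stands for the "$-" entries
checking_demo <- c(NA, NA, 638, 963, 2827)

# Mean with missing values excluded (what na.rm = TRUE does)
mean_excluded <- mean(checking_demo, na.rm = TRUE)
mean_excluded   # 1476

# Mean with missing values counted as 0
checking_zero <- ifelse(is.na(checking_demo), 0, checking_demo)
mean_as_zero <- mean(checking_zero)
mean_as_zero    # 885.6
```

Counting the missing values as 0 keeps them in the denominator, so it always gives a mean that is lower than or equal to the one with NAs removed when all values are non-negative.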

Task 3: Data Modeling

Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:

Chicago Divvy Data: https://www.divvybikes.com/data

First you need to download the archived (zipped) file ‘Divvy_Trips_2019_Q3.zip’ from the Divvy data site. Once downloaded, unarchive (unzip) the file to extract its content. You will then obtain the corresponding file ‘Divvy_Trips_2019_Q3.csv’. The file is relatively big, and you will quickly come to appreciate some of the challenges of working with big files in Excel.

###### 3A) Open in RStudio or Excel the file Divvy_Trips_2019_Q3.csv. What is the size of the file (measured in bytes), and how many columns and rows does it have? Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values)

#There are 12 columns and 501 rows, 500 of them with data.

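#Trip ID is the unique identifier: every trip has its own value. Bike ID, From Station ID, and To Station ID identify bikes and stations, but their values repeat across trips. The file size is 13,367 KB, or roughly 13.4 to 13.7 million bytes, depending on whether 1 KB is counted as 1,000 or 1,024 bytes.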
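
The file size, the dimensions, and the uniqueness of each column can all be checked in R. The sketch below writes a tiny made-up table to a temporary file so that it can run on its own. The table, the path `demo_path`, and the column names are assumptions for illustration, not the Divvy file.

```r
# Hypothetical miniature trip table (column names are assumptions)
trips_demo <- data.frame(
  trip_id         = c(101, 102, 103, 104),
  bikeid          = c(7, 7, 9, 12),
  from_station_id = c(1, 2, 1, 3),
  to_station_id   = c(2, 1, 3, 1)
)

# Write it to a temporary csv file so there is a file to inspect
demo_path <- tempfile(fileext = ".csv")
write.csv(trips_demo, demo_path, row.names = FALSE)

file.size(demo_path)   # size of the file in bytes

trips <- read.csv(demo_path)
nrow(trips)            # number of data rows (header not counted): 4
ncol(trips)            # number of columns: 4

# A column can be a unique identifier only if no value repeats
is_unique <- sapply(trips, function(col) anyDuplicated(col) == 0)
names(trips)[is_unique]   # "trip_id"
```

For the real file, replace `demo_path` with the path to Divvy_Trips_2019_Q3.csv and skip the two lines that create the demo table.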

###### 3B) Define a relational business logic integrity rule for the column field Trip Duration

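#Trip Duration must be greater than 0. The unit of measure (seconds, minutes, or hours) should be confirmed from the field descriptions on the Divvy data page.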
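
An integrity rule like this can be tested against the data. The sketch below shows one possible form of the rule on a made-up table: the duration must be greater than 0 and must equal the end time minus the start time. The column names, the values, the second condition, and the assumption that duration is stored in seconds are all hypothetical and should be checked against the actual file.

```r
# Hypothetical trips; the column names and the unit (seconds) are assumptions
trips_demo <- data.frame(
  trip_id      = c(101, 102, 103),
  start_time   = as.POSIXct(c("2019-07-01 08:00:00", "2019-07-01 09:00:00",
                              "2019-07-01 10:00:00"), tz = "UTC"),
  end_time     = as.POSIXct(c("2019-07-01 08:10:00", "2019-07-01 09:00:00",
                              "2019-07-01 10:30:00"), tz = "UTC"),
  tripduration = c(600, 0, 1500)
)

# Elapsed time between start and end, in seconds
elapsed <- as.numeric(difftime(trips_demo$end_time, trips_demo$start_time,
                               units = "secs"))

# Rule part 1: duration must be greater than 0
rule_positive <- trips_demo$tripduration > 0

# Rule part 2: duration must match end time minus start time
rule_matches <- trips_demo$tripduration == elapsed

# Rows that break either part of the rule
violations <- trips_demo[!(rule_positive & rule_matches), ]
violations$trip_id   # 102 103
```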

###### 3C) Using https://erdplus.com/#/standalone draw a star schema and a relational schema. Include image captures of your schemas here

Relational Schema

Star Schema