About

The focus of this lab is on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Important Note

For your assignment you may be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file="data/creditrisk.csv") 
head(mydata)
##      Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
## 1 Small Appliance        0     739              13              12      M
## 2       Furniture        0    1230              25               0      M
## 3         New Car        0     389              19             119      M
## 4       Furniture      638     347              13              14      M
## 5       Education      963    4754              40              45      M
## 6       Furniture     2827       0              11              13      M
##   Marital.Status Age Housing Years        Job Credit.Risk
## 1         Single  23     Own     3  Unskilled         Low
## 2       Divorced  32     Own     1    Skilled        High
## 3         Single  38     Own     4 Management        High
## 4         Single  36     Own     2  Unskilled        High
## 5         Single  31    Rent     3    Skilled         Low
## 6        Married  25     Own     1    Skilled         Low
#Name the extracted variable
age = mydata$Age 
#Calculate the average age below. Refer to Worksheet 2 for the correct command.
meanAge=mean(age)
meanAge
## [1] 34.39765
#Calculate standard deviation of age below. Refer to Worksheet 2 for the correct command. 
stdAge=sd(age)
stdAge
## [1] 11.04513
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.  
maxAge=max(age)
maxAge
## [1] 73
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable. 
minAge=min(age)
minAge
## [1] 18

Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.

#Use the formula above to calculate the upper and lower threshold
upper=meanAge+3*stdAge
upper
## [1] 67.53302
lower=meanAge-3*stdAge
lower
## [1] 1.262269

Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   26   32   41   73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 3.5

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.

boxplot(age) 


Task 2: Data Preparation

Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.

newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
head(newdata)
##   Small.Appliance         X..    X.739.00 X13 X12 M   Single X23  Own X3
## 1       Furniture       $-     $1,230.00   25   0 M Divorced  32  Own  1
## 2         New Car       $-       $389.00   19 119 M   Single  38  Own  4
## 3       Furniture    $638.00     $347.00   13  14 M   Single  36  Own  2
## 4       Education    $963.00   $4,754.00   40  45 M   Single  31 Rent  3
## 5       Furniture  $2,827.00        $-     11  13 M  Married  25  Own  1
## 6         New Car       $-       $229.00   13  16 M  Married  26  Own  3
##    Unskilled  Low  X X.1
## 1    Skilled High NA Age
## 2 Management High NA    
## 3  Unskilled High NA    
## 4    Skilled  Low NA    
## 5    Skilled  Low NA    
## 6  Unskilled  Low NA

We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.

newdata=read.csv(file="data/creditriskorg.csv")
head(newdata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk  X X.1
## 1      M         Single  23     Own     3  Unskilled         Low NA    
## 2      M       Divorced  32     Own     1    Skilled        High NA Age
## 3      M         Single  38     Own     4 Management        High NA    
## 4      M         Single  36     Own     2  Unskilled        High NA    
## 5      M         Single  31    Rent     3    Skilled         Low NA    
## 6      M        Married  25     Own     1    Skilled         Low NA

To calculate the mean for Checking in R, follow Worksheet 2. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?

checking = newdata$Checking
meanChecking=mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA
meanChecking
## [1] NA

To resolve the error, we must understand where it is coming from and correct for. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerical symbol either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

checking = newdata$Checking
#substitute comma with blank in all of checking.  Below are examples using a hypothetical variable name 'new'.
# Example new = sub(",","",new)
checking=sub(",","",checking)

#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking=sub("\\$","",checking)

#Convert values to numeric to remove any NA
# Example new = as.numeric(new)
checking=as.numeric(checking)
## Warning: NAs introduced by coercion
#Calculate mean of checking with NA removed 
# Example mean(new,na.rm=TRUE)
mean(checking,na.rm=TRUE)
## [1] 2559.805
meanChecking
## [1] NA

What are some other ways to clean this data? How about Excel? How does Excel treat the missing values and the “$” symbols?


Task 3: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

– Open in RStudio or Excel the taxi trips sample csv file located in the data folder. Note the size of the file, the number of columns and of rows here. Identify the unique entities, and fields in the data. The size of the file is 39.7 MB. There is 24 columns and 1000000 rows including header. The unique entity is trip id. The rest of the headers are the fields.

mydata = read.csv(file="data/Taxi_Trips_Sample.csv") 
head(mydata)
##                                    Trip.ID
## 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 3 3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c
## 4 3e7d6efe43222b0ebc698583916674c648dd4520
## 5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 6 3e7d6f2a03527d63dc01b95e829fdfdd706102da
##                                                                                                                            Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 3 f529487ccf3a5d538cd246342379d54314e90dc6c573f94a72f5c2238189c5131e12c1e493c71ccdaed6751d13f53fa1d8b51a1591c48891dc6beb3f9df3a18f
## 4 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 6 329d9f0b72ce0fea6c2cc7ea3347924c11d98702e5cc39eacf6252038304bb019457c905f38066a2f9aa4b8732ac30ea22d8740e177a314ef7b0327ad5766cee
##     Trip.Start.Timestamp     Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1 08/19/2014 02:45:00 PM 08/19/2014 02:45:00 PM          480       0.15
## 2 09/23/2013 05:15:00 PM 09/23/2013 05:15:00 PM          420       0.00
## 3 02/16/2014 10:15:00 PM 02/16/2014 10:30:00 PM          420       1.70
## 4 05/10/2013 08:00:00 PM 05/10/2013 08:45:00 PM         2340      13.80
## 5 02/21/2016 07:15:00 PM 02/21/2016 07:15:00 PM          300       0.70
## 6 12/10/2015 05:30:00 PM 12/10/2015 06:00:00 PM         1020       1.40
##   Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1         17031280100          17031839100                    28
## 2         17031081800          17031281900                     8
## 3                  NA                   NA                     6
## 4         17031980000          17031060400                    76
## 5         17031081500          17031081500                     8
## 6                  NA                   NA                    NA
##   Dropoff.Community.Area   Fare  Tips Tolls Extras Trip.Total Payment.Type
## 1                     32  $7.05 $0.00 $0.00  $1.50      $8.55         Cash
## 2                     28  $6.05 $0.00 $0.00  $0.00      $6.05         Cash
## 3                      4  $7.05 $0.00 $0.00  $0.00      $7.05         Cash
## 4                      6 $31.25 $0.00 $0.00  $3.00     $34.25         Cash
## 5                      8  $5.50 $0.00 $0.00  $0.00      $5.50         Cash
## 6                     NA  $9.25 $0.00 $0.00  $1.00     $10.25         Cash
##                     Company Pickup.Centroid.Latitude
## 1                                           41.88530
## 2 Taxi Affiliation Services                 41.89322
## 3 Taxi Affiliation Services                 41.94423
## 4                                           41.97907
## 5                                           41.89251
## 6                                                 NA
##   Pickup.Centroid.Longitude     Pickup.Centroid.Location
## 1                 -87.64281   POINT (-87.642808 41.8853)
## 2                 -87.63784 POINT (-87.637844 41.893216)
## 3                 -87.65600 POINT (-87.655998 41.944227)
## 4                 -87.90304  POINT (-87.90304 41.979071)
## 5                 -87.62621 POINT (-87.626215 41.892508)
## 6                        NA                             
##   Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1                  41.88099                  -87.63275
## 2                  41.87926                  -87.64265
## 3                  41.97517                  -87.68752
## 4                  41.95067                  -87.66654
## 5                  41.89251                  -87.62621
## 6                        NA                         NA
##     Dropoff.Centroid..Location Community.Areas
## 1 POINT (-87.632746 41.880994)              29
## 2 POINT (-87.642649 41.879255)              37
## 3 POINT (-87.687516 41.975171)              57
## 4 POINT (-87.666536 41.950673)              75
## 5 POINT (-87.626215 41.892508)              37
## 6                                           NA

– Define a relational business logic integrity check for the column field ‘Trip Seconds’.

The trip start time and end time equals to the trip seconds which means the difference between start time and end time cannot be greater than the trip second.

– Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables: