About

The focus of this lab is on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully, complete the tasks, and answer any questions. Submit your work to RPubs as detailed in previous notes.

Important Note

For your assignment you may be using different data sets than those included in this worksheet demo. Make sure to read the instructions on Sakai carefully.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we read in the file again, extract the Age column, and compute the requested values.

#Read File
mydata = read.csv(file="data/creditrisk.csv") 

#Name the extracted variable
age = mydata$Age 
#Calculate the average age below. Refer to Worksheet 2 for the correct command.
MeanAgeCR=mean(age)
MeanAgeCR
## [1] 34.39765
#Calculate standard deviation of age below. Refer to Worksheet 2 for the correct command. 
SDAgeCR=sd(age)
SDAgeCR
## [1] 11.04513
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.  
MaxAgeCR=max(age)
MaxAgeCR
## [1] 73
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.  
MinAgeCR=min(age)
MinAgeCR
## [1] 18

Next, use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower thresholds is to take the mean plus or minus 3 times the standard deviation. Try using this formula to find the upper and lower limits for age.

#Use the formula above to calculate the upper and lower threshold
UpperThCR=MeanAgeCR+(3*SDAgeCR)
UpperThCR
## [1] 67.53302
LowerThCR=MeanAgeCR-(3*SDAgeCR)
LowerThCR
## [1] 1.262269

Another method for finding the upper and lower thresholds, discussed in introductory statistics courses, uses the interquartile range (IQR). Follow along below to see how we first calculate the interquartile range.

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   26   32   41   73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

The formulas below calculate the thresholds: the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 3.5

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below is consistent with the IQR of 15 and the thresholds we computed. Yes, there are 5 outliers, all of them above the upper threshold of 63.5.

boxplot(age) 
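As a quick check, not part of the original worksheet, we can count the values beyond the IQR-based thresholds directly:

#Count observations outside the thresholds; these correspond to the
#points drawn past the whiskers in the boxplot above (should return 5)
sum(age > upperthreshold | age < lowerthreshold)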

Task 1: Data Outliers (scoring.csv)

#Read File 
mydata = read.csv(file="data/scoring.csv")

#Name the extracted variable
AgeScoring=mydata$Age
#Calculate the average below.
MeanAgeSc= mean(AgeScoring)
MeanAgeSc
## [1] 37.08412
#Calculate standard deviation of age below.
SDAgeSc= sd(AgeScoring)
SDAgeSc
## [1] 10.98637
#Calculate the maximum of age below.
MaxAgeSc= max(AgeScoring)
MaxAgeSc
## [1] 68
#Calculate the minimum of age below.
MinAgeSc= min(AgeScoring)
MinAgeSc
## [1] 18
#Calculate upper threshold
UpperThSc= MeanAgeSc+(3*SDAgeSc)
UpperThSc
## [1] 70.04322
#Calculate lower threshold
LowerThSc= MeanAgeSc-(3*SDAgeSc)
LowerThSc
## [1] 4.125023
#Finding Interquartile Range
quantile(AgeScoring)
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerqSc = quantile(AgeScoring)[2]
upperqSc = quantile(AgeScoring)[4]
iqrSc = upperqSc - lowerqSc
#Calculate upper threshold
upperthresholdSc = (iqrSc*1.5)+upperqSc
upperthresholdSc
##  75% 
## 70.5
#Calculate lower threshold
lowerthresholdSc = lowerqSc - (iqrSc*1.5)
lowerthresholdSc
## 25% 
## 2.5
boxplot(AgeScoring)

Are there any outliers? How many? No, there aren’t any. The maximum (68) and minimum (18) both fall within the thresholds.
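The same check, again not part of the original worksheet, confirms the count is zero here:

#No observations fall outside the IQR-based thresholds for this data set
sum(AgeScoring > upperthresholdSc | AgeScoring < lowerthresholdSc)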

Task 2: Data Preparation

Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.

newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low

We observe that the column names are shifted down a row because of an empty first line in the file. To fix this, we must use the skip argument and set header to TRUE in read.csv.

newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",") 
head(newdata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low

To calculate the mean for Checking in R, follow Worksheet 2: extract the Checking column first and then find the average using R’s built-in function. What happens when we try to use it?

checking = newdata$Checking 
#Calculate average for Checking
Checking = mean(checking)
## Warning in mean.default(checking): argument is not numeric or logical:
## returning NA
Checking
## [1] NA

To resolve the error, we must understand where it comes from and correct for it. There are missing values in the csv file, which is quite common, as most datasets are not perfect. Additionally, there are commas within the values, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, which are not numeric either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with an empty string.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

#substitute comma with blank in all of checking.  Below are examples using a hypothetical variable name 'new'.
# Example new = sub(",","",new)
checking = sub(",","",checking)
#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking= sub("\\$","",checking)
#Convert values to numeric to remove any NA
# Example new = as.numeric(new)
checking = as.numeric(checking)
## Warning: NAs introduced by coercion
#Calculate mean of checking with NA removed 
# Example mean(new,na.rm=TRUE)
MeanChecking=mean(checking,na.rm=TRUE)
MeanChecking
## [1] 2559.805
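One caveat worth noting, though it is not covered in the worksheet: sub replaces only the first match in each string, so a value with more than one comma, such as “1,234,567”, would need gsub instead. A minimal illustration:

#sub() removes only the first comma; gsub() removes all of them
sub(",", "", "1,234,567")   #returns "1234,567"
gsub(",", "", "1,234,567")  #returns "1234567"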

What are some other ways to clean this data? How about Excel? How does Excel treat the missing values and the “$” symbols? We can also clean data in Excel: for example, removing duplicate rows, removing spaces and nonprinting text, fixing dates, times, numbers, and signs, and merging and splitting columns. The main disadvantage of Excel is the time the work takes, especially when the file is large. When you type “$” in Excel, the characters that follow are treated as text instead of as a numeric value. As for missing values, R represents them as NA, which is neither a string nor a number but an indicator of missingness. Excel, on the other hand, treats the missing values here as 0. If we calculate the average of the Checking column in the creditriskorg.csv file in Excel, the result we get is 1048.0141, because the zero-valued entries are included in the average instead of being excluded as missing.
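A small sketch, not in the original worksheet, of how this distinction plays out in R, depending on whether missing values are excluded or treated as 0:

#NA propagates through mean() unless explicitly removed
x = c(1200, NA, 800)
mean(x)                       #NA
mean(x, na.rm = TRUE)         #1000, the average over observed values only
mean(ifelse(is.na(x), 0, x))  #666.67, what treating blanks as 0 yields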

Task 2: Data Preparation (scoring_original.csv)

#Read the file into R.
newdataSc = read.csv(file="data/scoring_original.csv")
head(newdataSc)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769
PriceSc = newdataSc$Price
#Calculate average for Price
MeanPrice = mean(PriceSc)
## Warning in mean.default(PriceSc): argument is not numeric or logical:
## returning NA
MeanPrice
## [1] NA
#Substitute comma with blank in all of price. 
PriceSc = sub(",","",PriceSc)

#Substitute dollar sign with blank in all of price
PriceSc = sub("\\$","", PriceSc)

#Convert values to numeric to remove any NA
PriceSc = as.numeric(PriceSc)
## Warning: NAs introduced by coercion
#Calculate mean of PriceSc with NA removed
MeanPrice=mean(PriceSc,na.rm=TRUE)
MeanPrice
## [1] 1462.48

Task 3: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

– Open the taxi trips sample csv file, located in the data folder, in RStudio or Excel. Note the size of the file and the number of rows and columns. Identify the unique entities and fields in the data.

##Read file 
TaxiTripsSample = read.csv(file="data/taxi_trips_sample.csv")
str(TaxiTripsSample)
## 'data.frame':    99999 obs. of  24 variables:
##  $ Trip.ID                   : Factor w/ 99999 levels "3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Taxi.ID                   : Factor w/ 6164 levels "0008de7a146802839c9e6059f482d292ebdae13c5c31dd6e5983a80882e2a5dbcd6ea098c2fcd56f34ce02645eb94c6b39512e930483774"| __truncated__,..: 4264 4466 5890 335 5491 1199 207 2234 639 1106 ...
##  $ Trip.Start.Timestamp      : Factor w/ 64107 levels "01/01/2013 01:00:00 PM",..: 43393 49615 7884 23880 8923 61522 36223 11713 37705 29843 ...
##  $ Trip.End.Timestamp        : Factor w/ 64181 levels "","01/01/2013 01:00:00 AM",..: 43465 49683 7924 23952 8959 61606 36292 11780 37776 29890 ...
##  $ Trip.Seconds              : int  480 420 420 2340 300 1020 360 2220 1020 780 ...
##  $ Trip.Miles                : num  0.15 0 1.7 13.8 0.7 1.4 0.1 13.3 8 7.7 ...
##  $ Pickup.Census.Tract       : num  1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
##  $ Dropoff.Census.Tract      : num  1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
##  $ Pickup.Community.Area     : int  28 8 6 76 8 NA 7 56 NA 11 ...
##  $ Dropoff.Community.Area    : int  32 28 4 6 8 NA 8 24 NA 76 ...
##  $ Fare                      : Factor w/ 892 levels "","$0.00","$0.01",..: 734 642 734 329 554 860 857 320 175 162 ...
##  $ Tips                      : Factor w/ 1081 levels "","$0.00","$0.01",..: 2 2 2 2 2 2 50 839 2 756 ...
##  $ Tolls                     : Factor w/ 25 levels "","$0.00","$0.60",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Extras                    : Factor w/ 183 levels "","$0.00","$0.01",..: 19 2 2 99 2 12 2 55 12 21 ...
##  $ Trip.Total                : Factor w/ 2702 levels "","$0.00","$0.01",..: 2472 2030 2280 1105 1704 28 17 1265 548 1055 ...
##  $ Payment.Type              : Factor w/ 7 levels "Cash","Credit Card",..: 1 1 1 1 1 1 2 2 1 2 ...
##  $ Company                   : Factor w/ 98 levels "","0118 - 42111 Godfrey S.Awir",..: 1 97 97 1 1 1 97 1 1 1 ...
##  $ Pickup.Centroid.Latitude  : num  41.9 41.9 41.9 42 41.9 ...
##  $ Pickup.Centroid.Longitude : num  -87.6 -87.6 -87.7 -87.9 -87.6 ...
##  $ Pickup.Centroid.Location  : Factor w/ 309 levels "","POINT (-87.540936 41.663671)",..: 80 66 119 308 44 1 98 292 1 301 ...
##  $ Dropoff.Centroid.Latitude : num  41.9 41.9 42 42 41.9 ...
##  $ Dropoff.Centroid.Longitude: num  -87.6 -87.6 -87.7 -87.7 -87.6 ...
##  $ Dropoff.Centroid..Location: Factor w/ 357 levels "","POINT (-87.534903 41.707311)",..: 65 88 256 178 51 1 67 227 1 357 ...
##  $ Community.Areas           : int  29 37 57 75 37 NA 68 53 NA 11 ...

Size of File: 32.78 MB
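To confirm the size from within R (a quick sketch, not part of the original worksheet):

#File size in megabytes
file.size("data/taxi_trips_sample.csv") / 1024^2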

##Rows
nrow(TaxiTripsSample)
## [1] 99999
##Columns
ncol(TaxiTripsSample)
## [1] 24

Unique entities: Trip ID and Taxi ID.

Fields: Trip Start Timestamp, Trip End Timestamp, Trip Seconds, Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, Fare, Tips, Tolls, Extras, Trip Total, Payment Type, Company, Pickup Centroid Latitude, Pickup Centroid Longitude, Pickup Centroid Location, Dropoff Centroid Latitude, Dropoff Centroid Longitude, Dropoff Centroid Location, Community Areas.

– Define a relational business-logic integrity check for the ‘Trip Seconds’ field. The data description page states that, to protect riders’ privacy, some information was altered; times, in particular, were rounded to the nearest 15 minutes. With that in mind, we would expect violations in the relationship between Trip Seconds and the Trip Start and End Timestamps. The first entry, for example, starts at 02:45 PM and ends at 02:45 PM, yet Trip Seconds records 480 seconds (8 minutes), a duration that is not reflected in the timestamp columns.
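A minimal sketch of such a check in R, not part of the original worksheet, assuming the timestamp format shown in the str() output above:

#Parse the timestamps and compare the implied duration with Trip Seconds
start = as.POSIXct(as.character(TaxiTripsSample$Trip.Start.Timestamp),
                   format = "%m/%d/%Y %I:%M:%S %p")
end = as.POSIXct(as.character(TaxiTripsSample$Trip.End.Timestamp),
                 format = "%m/%d/%Y %I:%M:%S %p")
implied = as.numeric(difftime(end, start, units = "secs"))
#Because each timestamp is rounded to a 15-minute mark, the implied
#duration can differ from Trip Seconds by up to 900 seconds without
#violating the rule; flag anything beyond that
violations = abs(TaxiTripsSample$Trip.Seconds - implied) > 900
sum(violations, na.rm = TRUE)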

– Using https://erdplus.com/#/standalone draw a star-like schema with at least the following tables: