The focus of this lab is on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than what is included in this worksheet demo. Make sure to read carefully the instructions on Sakai.
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
mydata = read.csv(file="data/creditrisk.csv")
#Name the extracted variable
age = mydata$Age
mydata1 = read.csv(file="data/Scoring.csv")
age1 = mydata1$Age
#Calculate the average age below. Refer to Worksheet 2 for the correct command.
MeanAge=mean(age1)
#Calculate standard deviation of age below. Refer to Worksheet 2 for the correct command.
SDage=sd(age1)
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
MaxA=max(age1)
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
MinA=min(age1)
Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.
#Use the formula above to calculate the upper and lower threshold
AgeLower= MeanAge - 3 * SDage
AgeUpper= MeanAge + 3 * SDage
Another similar method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..
quantile(age1)
## 0% 25% 50% 75% 100%
## 18 28 36 45 68
lowerq = quantile(age1)[2]
upperq = quantile(age1)[4]
iqr = upperq - lowerq
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
## 75%
## 70.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25%
## 2.5
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
boxplot(age)
Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
## X X.1 X.2 X.3 X.4
## 1 Loan Purpose Checking Savings Months Customer Months Employed
## 2 Small Appliance $- $739.00 13 12
## 3 Furniture $- $1,230.00 25 0
## 4 New Car $- $389.00 19 119
## 5 Furniture $638.00 $347.00 13 14
## 6 Education $963.00 $4,754.00 40 45
## X.5 X.6 X.7 X.8 X.9 X.10 X.11
## 1 Gender Marital Status Age Housing Years Job Credit Risk
## 2 M Single 23 Own 3 Unskilled Low
## 3 M Divorced 32 Own 1 Skilled High
## 4 M Single 38 Own 4 Management High
## 5 M Single 36 Own 2 Unskilled High
## 6 M Single 31 Rent 3 Skilled Low
newdata1 = read.csv(file="data/scoring_original.csv")
head(newdata1)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance $73K
## 2 good 17 rent 60 58 widow no_rec fixed $48K
## 3 bad 10 owner 36 46 married yes_rec freelance $90K
## 4 good 0 rent 60 24 single no_rec fixed $63K
## 5 good 0 rent 36 26 single no_rec fixed $46K
## 6 good 1 owner 60 36 married no_rec fixed $75K
## Income Assets Debt Amount Price Finrat Savings
## 1 $129K 0 0 $800.00 $846.00 94.56265 4.200000
## 2 $131K 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 $200K 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 $182K 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 $107K 0 0 $310.00 $910.00 34.06593 7.083871
## 6 $214K 3500 0 $650.00 $1,645.00 39.51368 12.830769
We observe that the column names are shifted down below because of the empty line. So, we must make sure to use the command skip and set the header to true.
newdata1 = read.csv(file="data/scoring_original.csv",skip=0,header=TRUE,sep=",")
head(newdata1)
## Status Seniority Home Time Age Marital Records Job Expenses
## 1 good 9 rent 60 30 married no_rec freelance $73K
## 2 good 17 rent 60 58 widow no_rec fixed $48K
## 3 bad 10 owner 36 46 married yes_rec freelance $90K
## 4 good 0 rent 60 24 single no_rec fixed $63K
## 5 good 0 rent 36 26 single no_rec fixed $46K
## 6 good 1 owner 60 36 married no_rec fixed $75K
## Income Assets Debt Amount Price Finrat Savings
## 1 $129K 0 0 $800.00 $846.00 94.56265 4.200000
## 2 $131K 0 0 $1,000.00 $1,658.00 60.31363 4.980000
## 3 $200K 3000 0 $2,000.00 $2,985.00 67.00168 1.980000
## 4 $182K 2500 0 $900.00 $1,325.00 67.92453 7.933333
## 5 $107K 0 0 $310.00 $910.00 34.06593 7.083871
## 6 $214K 3500 0 $650.00 $1,645.00 39.51368 12.830769
newdata = read.csv(file="data/creditriskorg.csv",skip=1,header=TRUE,sep=",")
head(newdata)
## Loan.Purpose Checking Savings Months.Customer Months.Employed
## 1 Small Appliance $- $739.00 13 12
## 2 Furniture $- $1,230.00 25 0
## 3 New Car $- $389.00 19 119
## 4 Furniture $638.00 $347.00 13 14
## 5 Education $963.00 $4,754.00 40 45
## 6 Furniture $2,827.00 $- 11 13
## Gender Marital.Status Age Housing Years Job Credit.Risk
## 1 M Single 23 Own 3 Unskilled Low
## 2 M Divorced 32 Own 1 Skilled High
## 3 M Single 38 Own 4 Management High
## 4 M Single 36 Own 2 Unskilled High
## 5 M Single 31 Rent 3 Skilled Low
## 6 M Married 25 Own 1 Skilled Low
To calculate the mean for Checking in R, follow Worksheet 2. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?
checking = newdata$Checking
price = newdata1$Price
To resolve the error, we must understand where it is coming from and correct for. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerical symbol either.
The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.
As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.
#substitute comma with blank in all of checking. Below are examples using a hypothetical variable name 'new'.
# Example new = sub(",","",new)
price = sub(",","",price)
#substitute dollar sign with blank in all of checking
# Example new = sub("\\$","",new)
price = sub("\\$","",price)
#Convert values to numeric to remove any NA
# Example new = as.numeric(new)
price = as.numeric(price)
## Warning: NAs introduced by coercion
#Calculate mean of checking with NA removed
# Example: it is better to do soemthing like this
# MeanNew = mean(new, na.rm = TRUE)
MeanPrice = mean(price, na.rm = TRUE)
MeanPrice
## [1] 1462.48
What are some other ways to clean this data? How about Excel? How does Excel treat the missing values and the “$” symbols?
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
– Open in RStudio or Excel the taxi trips sample csv file located in the data folder. Note the size of the file, the number of columns and of rows here. Identify the unique entities, and fields in the data.
TaxiTripsData = read.csv(file="data/Taxi_Trips_Sample.csv")
head(TaxiTripsData)
## Trip.ID
## 1 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed
## 2 3e7d6e5c4e87f01a475c8200b33777e85497da89
## 3 3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c
## 4 3e7d6efe43222b0ebc698583916674c648dd4520
## 5 3e7d6f001e9bcda8478a489cb53293d26328ac85
## 6 3e7d6f2a03527d63dc01b95e829fdfdd706102da
## Taxi.ID
## 1 b47c583b142d75b42882975eaab19c6cb98d82686016576cce6e305b1b99eb16aacfb9a21ff61c84873a6c3dde282756c162c538c8b69554fd8f811f3a8f60a2
## 2 bc1c0381e3bca623e6c04f3410f7b67201a9fc85c6b66d0f420a88099d38448f9b9874e246da49cf2ef32ea3d027eec9c5b484fe77dbfc033c389b5576ac66bd
## 3 f529487ccf3a5d538cd246342379d54314e90dc6c573f94a72f5c2238189c5131e12c1e493c71ccdaed6751d13f53fa1d8b51a1591c48891dc6beb3f9df3a18f
## 4 0f831bff43d83f396f2e4950126c6137dcdb60fb4c8580ffe860203747a83a789b22f2f9e4fbdd0dd8ed8c310366d8935228ddbcadf708fb9691ca5dd1b6c802
## 5 e5274d6c103515af3ce705182d0bbbea7ca077a6f23b1736254f2de8ba3e1687dd77f5fb541b7f00b1ebc24cfde54caf5a9562f046a0559acbfe1e7159e17c1a
## 6 329d9f0b72ce0fea6c2cc7ea3347924c11d98702e5cc39eacf6252038304bb019457c905f38066a2f9aa4b8732ac30ea22d8740e177a314ef7b0327ad5766cee
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds Trip.Miles
## 1 8/19/14 14:45 8/19/14 14:45 480 0.15
## 2 9/23/13 17:15 9/23/13 17:15 420 0.00
## 3 2/16/14 22:15 2/16/14 22:30 420 1.70
## 4 5/10/13 20:00 5/10/13 20:45 2340 13.80
## 5 2/21/16 19:15 2/21/16 19:15 300 0.70
## 6 12/10/15 17:30 12/10/15 18:00 1020 1.40
## Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
## 1 17031280100 17031839100 28
## 2 17031081800 17031281900 8
## 3 NA NA 6
## 4 17031980000 17031060400 76
## 5 17031081500 17031081500 8
## 6 NA NA NA
## Dropoff.Community.Area Fare Tips Tolls Extras Trip.Total Payment.Type
## 1 32 $7.05 $0.00 $0.00 $1.50 $8.55 Cash
## 2 28 $6.05 $0.00 $0.00 $0.00 $6.05 Cash
## 3 4 $7.05 $0.00 $0.00 $0.00 $7.05 Cash
## 4 6 $31.25 $0.00 $0.00 $3.00 $34.25 Cash
## 5 8 $5.50 $0.00 $0.00 $0.00 $5.50 Cash
## 6 NA $9.25 $0.00 $0.00 $1.00 $10.25 Cash
## Company Pickup.Centroid.Latitude
## 1 41.88530
## 2 Taxi Affiliation Services 41.89322
## 3 Taxi Affiliation Services 41.94423
## 4 41.97907
## 5 41.89251
## 6 NA
## Pickup.Centroid.Longitude Pickup.Centroid.Location
## 1 -87.64281 POINT (-87.642808 41.8853)
## 2 -87.63784 POINT (-87.637844 41.893216)
## 3 -87.65600 POINT (-87.655998 41.944227)
## 4 -87.90304 POINT (-87.90304 41.979071)
## 5 -87.62621 POINT (-87.626215 41.892508)
## 6 NA
## Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude
## 1 41.88099 -87.63275
## 2 41.87926 -87.64265
## 3 41.97517 -87.68752
## 4 41.95067 -87.66654
## 5 41.89251 -87.62621
## 6 NA NA
## Dropoff.Centroid..Location Community.Areas X X.1 X.2 X.3
## 1 POINT (-87.632746 41.880994) 29 NA NA NA
## 2 POINT (-87.642649 41.879255) 37 NA NA NA
## 3 POINT (-87.687516 41.975171) 57 NA NA NA
## 4 POINT (-87.666536 41.950673) 75 NA NA NA
## 5 POINT (-87.626215 41.892508) 37 NA NA NA
## 6 NA NA NA NA
summary(TaxiTripsData)
## Trip.ID
## 3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed: 1
## 3e7d6e5c4e87f01a475c8200b33777e85497da89: 1
## 3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c: 1
## 3e7d6efe43222b0ebc698583916674c648dd4520: 1
## 3e7d6f001e9bcda8478a489cb53293d26328ac85: 1
## 3e7d6f2a03527d63dc01b95e829fdfdd706102da: 1
## (Other) :99993
## Taxi.ID
## aebf720288b80a8ee36860541db64951c696c749f1a392d312fa4d2a8cd3f95dfb0be580fda7eb63455f809a1be9b3acad19a3ca167073126d0350b50f30741a: 58
## 4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24: 57
## f737a9a31b07650672910268d7cceb9c06a379c0e75070c0dc0366db8132b06ba2800c5e63c5e56f821a591fc78a92c1c60fb5f48e01aa02e62ff10d18ececd0: 55
## 1158f25979ad78fd3dafc867a540ad761b65922c312e6170ccee63c3f14adea37317d3cf4e2053d2bdb1531d17670872e0411e496905ef9cb4821e0e96056139: 53
## 0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703: 52
## (Other) :99720
## NA's : 4
## Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds
## 7/25/14 18:45 : 9 : 16 Min. : 0.0
## 2/27/15 8:45 : 8 2/10/14 10:30: 9 1st Qu.: 300.0
## 2/5/15 19:15 : 8 2/5/15 19:45 : 8 Median : 540.0
## 4/25/14 18:45 : 8 3/22/14 20:15: 8 Mean : 739.2
## 9/18/13 19:30 : 8 3/24/16 19:30: 8 3rd Qu.: 900.0
## 10/10/14 17:30: 7 3/3/14 18:45 : 8 Max. :74340.0
## (Other) :99951 (Other) :99942 NA's :1327
## Trip.Miles Pickup.Census.Tract Dropoff.Census.Tract
## Min. : 0.000 Min. :1.703e+10 Min. :1.703e+10
## 1st Qu.: 0.000 1st Qu.:1.703e+10 1st Qu.:1.703e+10
## Median : 0.900 Median :1.703e+10 Median :1.703e+10
## Mean : 2.686 Mean :1.703e+10 Mean :1.703e+10
## 3rd Qu.: 2.400 3rd Qu.:1.703e+10 3rd Qu.:1.703e+10
## Max. :1830.000 Max. :1.703e+10 Max. :1.703e+10
## NA's :1 NA's :38042 NA's :38775
## Pickup.Community.Area Dropoff.Community.Area Fare
## Min. : 1.00 Min. : 1.00 $6.25 : 2892
## 1st Qu.: 8.00 1st Qu.: 8.00 $5.25 : 2699
## Median : 8.00 Median :14.00 $3.25 : 2629
## Mean :22.04 Mean :21.14 $5.85 : 2390
## 3rd Qu.:32.00 3rd Qu.:32.00 $5.65 : 2389
## Max. :77.00 Max. :77.00 $6.05 : 2367
## NA's :15534 NA's :17532 (Other):84633
## Tips Tolls Extras Trip.Total
## $0.00 :63911 $0.00 :99932 $0.00 :62102 $7.25 : 2010
## $2.00 :10382 $1.90 : 13 $1.00 :18344 $6.25 : 1908
## $3.00 : 3769 $1.50 : 12 $2.00 : 8888 $3.25 : 1889
## $1.00 : 3162 $50.00 : 8 $1.50 : 4635 $6.65 : 1762
## $5.00 : 1004 $3.00 : 7 $3.00 : 2052 $8.25 : 1729
## $4.00 : 991 $2.00 : 5 $4.00 : 1134 $7.05 : 1658
## (Other):16780 (Other): 22 (Other): 2844 (Other):89043
## Payment.Type Company
## Cash :60760 :35411
## Credit Card:38322 Taxi Affiliation Services :29911
## Dispute : 58 Dispatch Taxi Affiliation : 9417
## No Charge : 622 Blue Ribbon Taxi Association Inc. : 6766
## Pcard : 18 Choice Taxi Association : 5185
## Prcard : 6 Chicago Elite Cab Corp. (Chicago Carriag: 5091
## Unknown : 213 (Other) : 8218
## Pickup.Centroid.Latitude Pickup.Centroid.Longitude
## Min. :41.66 Min. :-87.91
## 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.89 Median :-87.63
## Mean :41.90 Mean :-87.66
## 3rd Qu.:41.92 3rd Qu.:-87.63
## Max. :42.02 Max. :-87.54
## NA's :15533 NA's :15533
## Pickup.Centroid.Location Dropoff.Centroid.Latitude
## :15533 Min. :41.67
## POINT (-87.632746 41.880994): 8572 1st Qu.:41.88
## POINT (-87.620993 41.884987): 5034 Median :41.89
## POINT (-87.633308 41.899602): 3850 Mean :41.90
## POINT (-87.626215 41.892508): 3832 3rd Qu.:41.92
## POINT (-87.631864 41.892042): 3692 Max. :42.02
## (Other) :59486 NA's :17376
## Dropoff.Centroid.Longitude Dropoff.Centroid..Location
## Min. :-87.91 :17376
## 1st Qu.:-87.66 POINT (-87.632746 41.880994): 7644
## Median :-87.63 POINT (-87.620993 41.884987): 4412
## Mean :-87.66 POINT (-87.626215 41.892508): 3073
## 3rd Qu.:-87.63 POINT (-87.631864 41.892042): 3072
## Max. :-87.53 POINT (-87.655998 41.944227): 2850
## NA's :17376 (Other) :61572
## Community.Areas X X.1 X.2
## Min. : 1.00 Mode:logical Mode:logical Mode:logical
## 1st Qu.:37.00 NA's:99999 NA's:99999 NA's:99999
## Median :37.00
## Mean :41.18
## 3rd Qu.:38.00
## Max. :77.00
## NA's :15533
## X.3
## :99998
## fare can be dollar sold as transalsion in the fact table : 1
##
##
##
##
##
str(TaxiTripsData)
## 'data.frame': 99999 obs. of 28 variables:
## $ Trip.ID : Factor w/ 99999 levels "3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Taxi.ID : Factor w/ 6164 levels "0008de7a146802839c9e6059f482d292ebdae13c5c31dd6e5983a80882e2a5dbcd6ea098c2fcd56f34ce02645eb94c6b39512e9304837746d4e289b6236c2c5"| __truncated__,..: 4264 4466 5890 335 5491 1199 207 2234 639 1106 ...
## $ Trip.Start.Timestamp : Factor w/ 64107 levels "1/1/13 0:00",..: 54999 61512 19542 35388 20760 14370 47940 29083 49559 47158 ...
## $ Trip.End.Timestamp : Factor w/ 64181 levels "","1/1/13 0:15",..: 55098 61614 19569 35464 20791 14388 48006 29154 49637 47204 ...
## $ Trip.Seconds : int 480 420 420 2340 300 1020 360 2220 1020 780 ...
## $ Trip.Miles : num 0.15 0 1.7 13.8 0.7 1.4 0.1 13.3 8 7.7 ...
## $ Pickup.Census.Tract : num 1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
## $ Dropoff.Census.Tract : num 1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
## $ Pickup.Community.Area : int 28 8 6 76 8 NA 7 56 NA 11 ...
## $ Dropoff.Community.Area : int 32 28 4 6 8 NA 8 24 NA 76 ...
## $ Fare : Factor w/ 892 levels "","$0.00","$0.01",..: 735 643 735 329 554 861 858 321 175 162 ...
## $ Tips : Factor w/ 1081 levels "","$0.00","$0.01",..: 2 2 2 2 2 2 50 839 2 756 ...
## $ Tolls : Factor w/ 25 levels "","$0.00","$0.60",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Extras : Factor w/ 183 levels "","$0.00","$0.01",..: 19 2 2 100 2 12 2 55 12 21 ...
## $ Trip.Total : Factor w/ 2702 levels "","$0.00","$0.01",..: 2472 2031 2281 1106 1704 30 19 1265 548 1056 ...
## $ Payment.Type : Factor w/ 7 levels "Cash","Credit Card",..: 1 1 1 1 1 1 2 2 1 2 ...
## $ Company : Factor w/ 98 levels "","0118 - 42111 Godfrey S.Awir",..: 1 97 97 1 1 1 97 1 1 1 ...
## $ Pickup.Centroid.Latitude : num 41.9 41.9 41.9 42 41.9 ...
## $ Pickup.Centroid.Longitude : num -87.6 -87.6 -87.7 -87.9 -87.6 ...
## $ Pickup.Centroid.Location : Factor w/ 309 levels "","POINT (-87.540936 41.663671)",..: 80 66 119 308 44 1 98 292 1 301 ...
## $ Dropoff.Centroid.Latitude : num 41.9 41.9 42 42 41.9 ...
## $ Dropoff.Centroid.Longitude: num -87.6 -87.6 -87.7 -87.7 -87.6 ...
## $ Dropoff.Centroid..Location: Factor w/ 357 levels "","POINT (-87.534903 41.707311)",..: 65 88 256 178 51 1 67 227 1 357 ...
## $ Community.Areas : int 29 37 57 75 37 NA 68 53 NA 11 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
## $ X.2 : logi NA NA NA NA NA NA ...
## $ X.3 : Factor w/ 2 levels "","fare can be dollar sold as transalsion in the fact table ": 1 1 1 1 1 1 1 1 1 1 ...
– Define a relational business logic integrity check for the column field ‘Trip Seconds’.
– Using https://erdplus.com/#/standalone draw a star like schema using at least the following tables: