Today the lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Make sure to download the folder titled ‘bsad_lab02’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
mydata= read.csv(file = "data/creditrisk.csv")
#Age
age = mydata$Age
#Calculate the average age below.
age_mean= mean(age)
age_mean
[1] 34.39765
#Calculate standard deviation of age below.
age_sd= sd(age)
age_sd
[1] 11.04513
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
age_max= max(age)
age_max
[1] 73
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
age_min = min(age)
age_min
[1] 18
Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation
. Try using this formula to find the upper and lower limit for age.
#Use the formula above to calculate the upper and lower threshold
age_lower = age_mean - (3) * age_sd
age_upper = age_mean + (3) * age_sd
age_upper
[1] 67.53302
age_lower
[1] 1.262269
A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..
quantile(age)
0% 25% 50% 75% 100%
18 26 32 41 73
iqr = upperq - lowerq
Error: object 'upperq' not found
lowerq=quantile(age)[3]
upperq=quantile(age)[4]
iqr= upperq - lowerq
iqr
75%
9
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
75%
54.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
50%
18.5
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
age[age>upperthreshold]
[1] 57 59 64 56 63 56 65 56 67 60 56 64 73 63 63 65 59 55 56 56 66 57 58 67 56 59 65 62
mydata[age>upperthreshold,]
age[age<lowerthreshold]
[1] 18
boxplot(age,horizontal = TRUE)
Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.
mydata = read.csv(file = "data/creditriskorg.csv")
head(mydata)
#tail(mydata)
We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.
mydata = read.csv("data/creditriskorg.csv",skip = 1)
head(mydata)
#str(mydata)
summary(mydata)
Loan.Purpose Checking Savings Months.Customer Months.Employed Gender
Small Appliance:105 $- :251 $- : 62 Min. : 5.0 Min. : 0.0 F:135
New Car :104 $216.00 : 2 $127.00 : 3 1st Qu.:13.0 1st Qu.: 6.0 M:290
Furniture : 85 $271.00 : 2 $836.00 : 3 Median :19.0 Median : 20.0
Business : 44 $296.00 : 2 $904.00 : 3 Mean :22.9 Mean : 31.9
Used Car : 40 $305.00 : 2 $922.00 : 3 3rd Qu.:28.0 3rd Qu.: 47.0
Education : 23 $497.00 : 2 $104.00 : 2 Max. :73.0 Max. :119.0
(Other) : 24 (Other) :164 (Other) :349
Marital.Status Age Housing Years Job Credit.Risk
Divorced:156 Min. :18.0 Other: 52 Min. :1.00 Management: 54 High:211
Married : 36 1st Qu.:26.0 Own :292 1st Qu.:2.00 Skilled :271 Low :214
Single :233 Median :32.0 Rent : 81 Median :3.00 Unemployed: 11
Mean :34.4 Mean :2.84 Unskilled : 89
3rd Qu.:41.0 3rd Qu.:4.00
Max. :73.0 Max. :4.00
To calculate the mean for Checking in R, follow Worksheet 1. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?
#Extracting the Checking Column
checking = mydata$Checking
#Calling the Checking
checking
[1] $- $- $- $638.00 $963.00 $2,827.00 $-
[8] $- $6,509.00 $966.00 $- $- $322.00 $-
[15] $396.00 $- $652.00 $708.00 $207.00 $287.00 $-
[22] $101.00 $- $- $- $141.00 $- $2,484.00
[29] $237.00 $- $335.00 $3,565.00 $- $16,647.00 $-
[36] $- $- $940.00 $- $- $218.00 $-
[43] $16,935.00 $664.00 $150.00 $- $216.00 $- $-
[50] $- $265.00 $4,256.00 $870.00 $162.00 $- $-
[57] $- $461.00 $- $- $- $580.00 $-
[64] $- $- $- $758.00 $399.00 $513.00 $-
[71] $- $565.00 $- $- $- $166.00 $9,783.00
[78] $674.00 $- $15,328.00 $- $713.00 $- $-
[85] $- $- $- $303.00 $900.00 $- $1,257.00
[92] $- $273.00 $522.00 $- $- $- $-
[99] $514.00 $457.00 $5,133.00 $- $644.00 $305.00 $9,621.00
[106] $- $- $- $- $- $6,851.00 $13,496.00
[113] $509.00 $- $19,155.00 $- $- $374.00 $-
[120] $828.00 $- $829.00 $- $- $939.00 $-
[127] $889.00 $876.00 $893.00 $12,760.00 $- $- $959.00
[134] $- $- $- $- $698.00 $- $-
[141] $- $12,974.00 $- $317.00 $- $- $-
[148] $192.00 $- $- $- $- $- $942.00
[155] $- $3,329.00 $- $- $- $- $-
[162] $- $339.00 $- $- $- $105.00 $-
[169] $216.00 $113.00 $109.00 $- $- $8,176.00 $-
[176] $468.00 $7,885.00 $- $- $- $- $-
[183] $- $- $- $- $734.00 $- $-
[190] $172.00 $644.00 $- $617.00 $- $586.00 $-
[197] $- $- $- $- $522.00 $585.00 $5,588.00
[204] $- $352.00 $- $2,715.00 $560.00 $895.00 $305.00
[211] $- $- $- $8,948.00 $- $- $-
[218] $- $- $483.00 $- $- $- $663.00
[225] $624.00 $- $- $152.00 $- $- $498.00
[232] $- $156.00 $1,336.00 $- $- $- $2,641.00
[239] $- $- $- $- $- $887.00 $-
[246] $- $- $- $18,408.00 $497.00 $- $946.00
[253] $986.00 $8,122.00 $- $778.00 $645.00 $- $682.00
[260] $19,812.00 $- $- $859.00 $- $- $-
[267] $- $- $- $795.00 $- $- $-
[274] $- $852.00 $- $- $425.00 $- $-
[281] $- $11,072.00 $- $219.00 $8,060.00 $- $-
[288] $- $- $1,613.00 $757.00 $- $- $977.00
[295] $197.00 $- $- $- $- $- $256.00
[302] $296.00 $- $- $- $298.00 $- $8,636.00
[309] $- $- $19,766.00 $- $- $- $-
[316] $4,089.00 $- $271.00 $949.00 $- $911.00 $-
[323] $- $- $- $271.00 $- $- $-
[330] $- $4,802.00 $177.00 $- $- $996.00 $705.00
[337] $- $- $5,960.00 $- $759.00 $- $651.00
[344] $257.00 $955.00 $- $8,249.00 $- $956.00 $382.00
[351] $- $842.00 $3,111.00 $- $- $2,846.00 $231.00
[358] $- $17,366.00 $- $332.00 $242.00 $- $929.00
[365] $- $- $- $- $- $- $-
[372] $646.00 $538.00 $- $- $- $- $135.00
[379] $2,472.00 $- $10,417.00 $211.00 $16,630.00 $- $642.00
[386] $- $296.00 $898.00 $478.00 $315.00 $122.00 $-
[393] $- $- $670.00 $444.00 $3,880.00 $819.00 $-
[400] $- $- $- $- $- $- $-
[407] $- $161.00 $- $- $789.00 $765.00 $-
[414] $- $983.00 $- $- $798.00 $- $193.00
[421] $497.00 $- $- $- $-
168 Levels: $- $1,257.00 $1,336.00 $1,613.00 $10,417.00 $101.00 ... $996.00
To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.
The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.
As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.
checking[1:6]
[1] $- $- $- $638.00 $963.00 $2,827.00
168 Levels: $- $1,257.00 $1,336.00 $1,613.00 $10,417.00 $101.00 ... $996.00
clean= checking[1:10]
#substitute comma with blank in all of checking
clean = sub(",","",clean)
#substitute dollar sign with blank in all of checking
clean = sub("\\$","",clean)
class(clean)
[1] "character"
#numeric convert
clean = as.numeric(clean)
NAs introduced by coercion
class(clean)
[1] "numeric"
#mean with NA removed
clean
[1] NA NA NA 638 963 2827 NA NA 6509 966
#substitute comma with blank in all of checking
checking = sub(",","", checking)
#substitute dollar sign with blank in all of checking
checking = sub("\\$","",checking)
#numeric convert
checking =as.numeric(checking)
NAs introduced by coercion
#mean with NA removed
checking
[1] NA NA NA 638 963 2827 NA NA 6509 966 NA NA 322 NA 396
[16] NA 652 708 207 287 NA 101 NA NA NA 141 NA 2484 237 NA
[31] 335 3565 NA 16647 NA NA NA 940 NA NA 218 NA 16935 664 150
[46] NA 216 NA NA NA 265 4256 870 162 NA NA NA 461 NA NA
[61] NA 580 NA NA NA NA 758 399 513 NA NA 565 NA NA NA
[76] 166 9783 674 NA 15328 NA 713 NA NA NA NA NA 303 900 NA
[91] 1257 NA 273 522 NA NA NA NA 514 457 5133 NA 644 305 9621
[106] NA NA NA NA NA 6851 13496 509 NA 19155 NA NA 374 NA 828
[121] NA 829 NA NA 939 NA 889 876 893 12760 NA NA 959 NA NA
[136] NA NA 698 NA NA NA 12974 NA 317 NA NA NA 192 NA NA
[151] NA NA NA 942 NA 3329 NA NA NA NA NA NA 339 NA NA
[166] NA 105 NA 216 113 109 NA NA 8176 NA 468 7885 NA NA NA
[181] NA NA NA NA NA NA 734 NA NA 172 644 NA 617 NA 586
[196] NA NA NA NA NA 522 585 5588 NA 352 NA 2715 560 895 305
[211] NA NA NA 8948 NA NA NA NA NA 483 NA NA NA 663 624
[226] NA NA 152 NA NA 498 NA 156 1336 NA NA NA 2641 NA NA
[241] NA NA NA 887 NA NA NA NA 18408 497 NA 946 986 8122 NA
[256] 778 645 NA 682 19812 NA NA 859 NA NA NA NA NA NA 795
[271] NA NA NA NA 852 NA NA 425 NA NA NA 11072 NA 219 8060
[286] NA NA NA NA 1613 757 NA NA 977 197 NA NA NA NA NA
[301] 256 296 NA NA NA 298 NA 8636 NA NA 19766 NA NA NA NA
[316] 4089 NA 271 949 NA 911 NA NA NA NA 271 NA NA NA NA
[331] 4802 177 NA NA 996 705 NA NA 5960 NA 759 NA 651 257 955
[346] NA 8249 NA 956 382 NA 842 3111 NA NA 2846 231 NA 17366 NA
[361] 332 242 NA 929 NA NA NA NA NA NA NA 646 538 NA NA
[376] NA NA 135 2472 NA 10417 211 16630 NA 642 NA 296 898 478 315
[391] 122 NA NA NA 670 444 3880 819 NA NA NA NA NA NA NA
[406] NA NA 161 NA NA 789 765 NA NA 983 NA NA 798 NA 193
[421] 497 NA NA NA NA
What are some other ways to clean this data in R? How about in Excel?
mean(checking,na.rm = TRUE)
[1] 2559.805
sum(checking,na.rm = TRUE)/length(checking)
[1] 1048.014
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.
mydata= read.csv("data/Taxi_Trips_Sample.csv")
head(mydata)
#str (mydata)
summary(mydata)
Trip.ID
3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed: 1
3e7d6e5c4e87f01a475c8200b33777e85497da89: 1
3e7d6e69c1d6755d9e7484a453cd93a3ee9fed4c: 1
3e7d6efe43222b0ebc698583916674c648dd4520: 1
3e7d6f001e9bcda8478a489cb53293d26328ac85: 1
3e7d6f2a03527d63dc01b95e829fdfdd706102da: 1
(Other) :99993
Taxi.ID
aebf720288b80a8ee36860541db64951c696c749f1a392d312fa4d2a8cd3f95dfb0be580fda7eb63455f809a1be9b3acad19a3ca167073126d0350b50f30741a: 58
4f189764b8d9b6f71f7936ab414cac07634be0a00790ca179f9460521b7c9c3e5e102f5ba4e1c9cd18cdd9856dbf4f66ae8f13d8c82f8d2d4872f74b96938a24: 57
f737a9a31b07650672910268d7cceb9c06a379c0e75070c0dc0366db8132b06ba2800c5e63c5e56f821a591fc78a92c1c60fb5f48e01aa02e62ff10d18ececd0: 55
1158f25979ad78fd3dafc867a540ad761b65922c312e6170ccee63c3f14adea37317d3cf4e2053d2bdb1531d17670872e0411e496905ef9cb4821e0e96056139: 53
0861cb74337c620cb9ec639af7dc3aa99173b768caf750a2fd1ff17a8d9db86cad36772c7ff6ddaf2fda48de41bc82981145fe46693ed147d86ae194ee15c703: 52
(Other) :99720
NA's : 4
Trip.Start.Timestamp Trip.End.Timestamp Trip.Seconds
07/25/2014 06:45:00 PM: 9 : 16 Min. : 0.0
02/05/2015 07:15:00 PM: 8 02/10/2014 10:30:00 AM: 9 1st Qu.: 300.0
02/27/2015 08:45:00 AM: 8 02/05/2015 07:45:00 PM: 8 Median : 540.0
04/25/2014 06:45:00 PM: 8 03/03/2014 06:45:00 PM: 8 Mean : 739.2
09/18/2013 07:30:00 PM: 8 03/22/2014 08:15:00 PM: 8 3rd Qu.: 900.0
03/15/2014 07:00:00 PM: 7 03/24/2016 07:30:00 PM: 8 Max. :74340.0
(Other) :99951 (Other) :99942 NA's :1327
Trip.Miles Pickup.Census.Tract Dropoff.Census.Tract Pickup.Community.Area
Min. : 0.000 Min. :1.703e+10 Min. :1.703e+10 Min. : 1.00
1st Qu.: 0.000 1st Qu.:1.703e+10 1st Qu.:1.703e+10 1st Qu.: 8.00
Median : 0.900 Median :1.703e+10 Median :1.703e+10 Median : 8.00
Mean : 2.686 Mean :1.703e+10 Mean :1.703e+10 Mean :22.04
3rd Qu.: 2.400 3rd Qu.:1.703e+10 3rd Qu.:1.703e+10 3rd Qu.:32.00
Max. :1830.000 Max. :1.703e+10 Max. :1.703e+10 Max. :77.00
NA's :1 NA's :38042 NA's :38775 NA's :15534
Dropoff.Community.Area Fare Tips Tolls Extras
Min. : 1.00 $6.25 : 2892 $0.00 :63911 $0.00 :99932 $0.00 :62102
1st Qu.: 8.00 $5.25 : 2699 $2.00 :10382 $1.90 : 13 $1.00 :18344
Median :14.00 $3.25 : 2629 $3.00 : 3769 $1.50 : 12 $2.00 : 8888
Mean :21.14 $5.85 : 2390 $1.00 : 3162 $50.00 : 8 $1.50 : 4635
3rd Qu.:32.00 $5.65 : 2389 $5.00 : 1004 $3.00 : 7 $3.00 : 2052
Max. :77.00 $6.05 : 2367 $4.00 : 991 $2.00 : 5 $4.00 : 1134
NA's :17532 (Other):84633 (Other):16780 (Other): 22 (Other): 2844
Trip.Total Payment.Type Company
$7.25 : 2010 Cash :60760 :35411
$6.25 : 1908 Credit Card:38322 Taxi Affiliation Services :29911
$3.25 : 1889 Dispute : 58 Dispatch Taxi Affiliation : 9417
$6.65 : 1762 No Charge : 622 Blue Ribbon Taxi Association Inc. : 6766
$8.25 : 1729 Pcard : 18 Choice Taxi Association : 5185
$7.05 : 1658 Prcard : 6 Chicago Elite Cab Corp. (Chicago Carriag: 5091
(Other):89043 Unknown : 213 (Other) : 8218
Pickup.Centroid.Latitude Pickup.Centroid.Longitude Pickup.Centroid.Location
Min. :41.66 Min. :-87.91 :15533
1st Qu.:41.88 1st Qu.:-87.66 POINT (-87.632746 41.880994): 8572
Median :41.89 Median :-87.63 POINT (-87.620993 41.884987): 5034
Mean :41.90 Mean :-87.66 POINT (-87.633308 41.899602): 3850
3rd Qu.:41.92 3rd Qu.:-87.63 POINT (-87.626215 41.892508): 3832
Max. :42.02 Max. :-87.54 POINT (-87.631864 41.892042): 3692
NA's :15533 NA's :15533 (Other) :59486
Dropoff.Centroid.Latitude Dropoff.Centroid.Longitude Dropoff.Centroid..Location
Min. :41.67 Min. :-87.91 :17376
1st Qu.:41.88 1st Qu.:-87.66 POINT (-87.632746 41.880994): 7644
Median :41.89 Median :-87.63 POINT (-87.620993 41.884987): 4412
Mean :41.90 Mean :-87.66 POINT (-87.626215 41.892508): 3073
3rd Qu.:41.92 3rd Qu.:-87.63 POINT (-87.631864 41.892042): 3072
Max. :42.02 Max. :-87.53 POINT (-87.655998 41.944227): 2850
NA's :17376 NA's :17376 (Other) :61572
Community.Areas
Min. : 1.00
1st Qu.:37.00
Median :37.00
Mean :41.18
3rd Qu.:38.00
Max. :77.00
NA's :15533
#Extracting the Fare Column
fare = mydata$Fare
#Calling the Fare Column
fare
[1] $7.05 $6.05 $7.05 $31.25 $5.50 $9.25 $9.05 $30.45 $18.25 $17.25 $8.05
[12] $21.25 $6.85 $10.45 $7.45 $6.25 $7.45 $7.25 $10.05 $13.25 $35.25 $11.65
[23] $3.25 $9.05 $14.25 $15.85 $14.85 $6.85 $38.85 $5.65 $6.45 $15.45 $3.25
[34] $9.75 $14.85 $16.25 $3.45 $6.50 $5.45 $9.65 $107.45 $13.45 $35.04 $6.25
[45] $13.65 $4.85 $5.05 $9.00 $13.05 $5.45 $10.25 $6.45 $16.25 $8.65 $5.65
[56] $7.05 $8.25 $9.65 $5.25 $7.00 $6.65 $5.85 $9.44 $5.50 $9.25 $16.85
[67] $4.65 $9.65 $39.25 $10.65 $4.84 $18.65 $5.85 $6.45 $18.85 $7.05 $7.25
[78] $9.25 $7.45 $4.25 $17.45 $4.50 $35.25 $7.45 $9.45 $11.45 $9.65 $4.25
[89] $4.25 $6.65 $13.45 $7.50 $5.65 $4.84 $7.45 $4.25 $38.00 $11.50 $8.75
[100] $8.00 $4.45 $14.25 $6.65 $7.05 $12.65 $4.45 $9.85 $5.05 $9.25 $31.45
[111] $6.65 $10.65 $10.85 $5.25 $11.25 $5.45 $6.85 $6.45 $7.85 $17.25 $8.65
[122] $8.85 $5.05 $7.65 $6.45 $11.05 $8.65 $8.05 $37.45 $8.50 $16.85 $4.84
[133] $6.65 $9.45 $7.05 $7.25 $13.85 $15.45 $4.25 $12.45 $23.85 $6.85 $24.25
[144] $38.25 $7.05 $17.85 $7.25 $37.05 $5.65 $6.85 $10.75 $5.25 $5.25 $13.45
[155] $36.65 $9.25 $7.45 $8.25 $10.25 $35.05 $7.25 $14.00 $4.85 $8.25 $7.05
[166] $8.85 $5.85 $7.45 $6.00 $16.25 $3.45 $5.50 $9.25 $6.85 $7.05 $8.65
[177] $8.05 $5.65 $5.25 $3.25 $36.25 $6.65 $8.05 $4.45 $8.45 $5.45 $23.85
[188] $7.05 $15.05 $6.75 $5.75 $9.05 $12.45 $9.65 $8.45 $7.45 $31.45 $6.65
[199] $10.25 $7.65 $7.65 $7.25 $6.45 $90.95 $20.45 $5.75 $3.25 $3.45 $9.44
[210] $9.75 $7.00 $8.85 $44.85 $7.65 $10.25 $10.25 $5.25 $8.25 $8.85 $6.05
[221] $4.05 $9.25 $13.05 $32.25 $6.05 $3.25 $6.25 $22.50 $6.25 $13.45 $5.85
[232] $5.45 $4.85 $11.65 $4.85 $25.65 $5.05 $6.25 $7.25 $4.05 $4.25 $5.00
[243] $17.25 $16.00 $12.05 $44.75 $12.65 $5.45 $48.65 $5.45 $4.25 $4.65 $6.25
[254] $7.45 $5.85 $10.45 $7.45 $4.65 $58.45 $22.85 $11.00 $14.45 $6.25 $7.05
[265] $19.45 $11.45 $11.85 $12.25 $7.05 $4.65 $4.50 $6.45 $8.25 $15.25 $35.04
[276] $3.45 $36.45 $20.45 $4.65 $6.25 $36.05 $5.05 $16.85 $8.25 $5.75 $12.45
[287] $7.65 $10.45 $7.25 $4.25 $8.25 $7.05 $6.05 $11.85 $21.05 $14.00 $6.75
[298] $20.05 $8.44 $46.50 $4.25 $38.85 $7.85 $4.85 $7.05 $5.85 $4.45 $13.65
[309] $8.45 $33.05 $5.05 $10.85 $5.05 $11.65 $5.75 $4.65 $5.25 $7.65 $3.85
[320] $4.45 $36.05 $6.65 $6.25 $13.05 $4.75 $4.45 $37.05 $10.45 $30.25 $24.45
[331] $8.25 $17.05 $8.65 $13.25 $20.75 $6.00 $28.65 $7.05 $5.45 $15.25 $15.00
[342] $6.00 $36.85 $7.75 $8.85 $7.25 $12.65 $14.05 $15.45 $52.50 $6.45 $16.65
[353] $8.25 $5.65 $27.05 $12.25 $10.25 $5.05 $32.05 $8.25 $7.05 $19.65 $29.45
[364] $3.25 $7.85 $4.65 $6.25 $8.65 $5.85 $9.85 $8.05 $16.75 $3.25 $4.65
[375] $3.25 $9.25 $7.25 $10.85 $7.85 $9.85 $9.65 $40.25 $8.05 $10.25 $11.75
[386] $23.85 $30.05 $7.05 $5.25 $8.85 $6.65 $7.05 $4.65 $7.50 $9.25 $17.25
[397] $9.85 $5.45 $32.05 $34.25 $9.45 $13.05 $5.65 $3.25 $34.85 $13.05 $4.85
[408] $3.25 $6.05 $10.25 $14.65 $16.25 $13.25 $14.65 $7.45 $67.00 $7.45 $5.45
[419] $6.25 $10.05 $6.85 $4.65 $12.50 $6.45 $11.25 $5.25 $9.75 $30.25 $7.85
[430] $6.50 $5.45 $10.25 $6.25 $7.45 $6.05 $5.45 $8.50 $5.45 $5.25 $8.44
[441] $10.85 $3.25 $7.05 $9.25 $8.50 $17.85 $4.85 $0.00 $17.64 $7.85 $4.05
[452] $7.25 $35.65 $15.05 $6.25 $7.25 $39.45 $37.25 $5.85 $9.85 $26.85 $3.25
[463] $7.85 $6.15 $5.25 $64.25 $6.05 $8.25 $35.85 $10.45 $12.75 $4.45 $7.00
[474] $7.05 $5.85 $37.45 $7.45 $5.85 $6.65 $3.25 $16.00 $6.45 $8.25 $0.00
[485] $10.05 $9.65 $26.85 $5.25 $10.75 $5.05 $5.65 $8.65 $9.25 $7.85 $5.05
[496] $7.05 $10.50 $6.45 $38.45 $32.85 $8.44 $4.45 $4.25 $7.05 $11.45 $10.45
[507] $37.45 $18.25 $13.25 $15.65 $8.05 $17.25 $35.65 $7.45 $6.05 $6.75 $3.25
[518] $6.50 $10.65 $5.45 $9.85 $6.50 $11.45 $7.85 $12.75 $9.75 $11.65 $33.65
[529] $9.25 $5.25 $37.05 $7.45 $42.75 $5.85 $19.85 $7.05 $7.05 $5.85 $12.25
[540] $7.05 $5.25 $12.65 $11.25 $10.85 $7.05 $28.85 $45.00 $16.05 $5.85 $6.65
[551] $5.25 $5.25 $40.45 $5.00 $16.45 $7.45 $12.65 $13.25 $5.25 $11.65 $45.65
[562] $5.85 $6.00 $6.50 $17.45 $2.85 $7.45 $4.50 $8.65 $6.05 $8.45 $35.25
[573] $5.45 $7.75 $8.05 $9.05 $7.85 $5.65 $7.25 $7.25 $20.85 $6.85 $6.05
[584] $6.00 $10.00 $14.45 $12.45 $36.85 $8.25 $26.85 $11.65 $7.00 $8.05 $89.85
[595] $6.25 $37.05 $8.25 $31.25 $9.50 $6.45 $33.75 $5.45 $29.85 $8.25 $5.50
[606] $13.45 $48.50 $5.45 $4.85 $7.05 $38.65 $5.85 $9.65 $7.25 $10.75 $34.65
[617] $8.25 $6.25 $11.45 $19.45 $8.45 $7.85 $7.65 $8.65 $6.85 $8.65 $5.45
[628] $6.05 $17.25 $10.65 $7.25 $12.45 $19.45 $9.00 $10.45 $8.00 $22.65 $28.85
[639] $3.45 $5.45 $8.05 $6.45 $9.65 $12.45 $5.25 $36.65 $11.25 $37.25 $5.45
[650] $4.05 $45.85 $11.75 $11.45 $11.05 $5.85 $24.05 $44.25 $14.85 $15.85 $5.85
[661] $22.65 $6.45 $6.25 $6.05 $30.65 $6.45 $6.25 $10.25 $8.45 $12.00 $7.45
[672] $4.05 $6.25 $4.45 $8.25 $9.05 $6.45 $4.25 $8.65 $6.05 $9.44 $6.45
[683] $30.05 $27.25 $6.85 $6.50 $8.25 $35.45 $10.25 $10.05 $12.65 $5.25 $6.45
[694] $4.25 $5.85 $11.65 $4.65 $15.45 $6.65 $11.25 $5.05 $9.45 $4.65 $11.45
[705] $6.05 $5.25 $5.65 $18.25 $7.45 $14.65 $7.85 $6.25 $41.05 $15.05 $5.25
[716] $6.05 $5.65 $18.05 $36.85 $8.45 $10.45 $5.45 $8.85 $7.75 $15.65 $13.25
[727] $6.85 $8.45 $12.45 $8.25 $9.00 $8.05 $5.45 $8.65 $4.85 $11.65 $5.75
[738] $23.65 $12.50 $14.05 $6.65 $8.65 $39.25 $6.85 $10.05 $16.45 $5.45 $10.45
[749] $5.45 $8.85 $6.05 $9.65 $82.25 $4.85 $6.25 $8.65 $10.65 $5.05 $8.25
[760] $5.25 $14.75 $5.65 $8.65 $42.00 $6.05 $7.05 $7.25 $38.00 $6.65 $9.05
[771] $6.25 $30.65 $6.50 $7.05 $25.85 $7.05 $10.85 $9.05 $6.00 $8.44 $6.85
[782] $5.05 $9.25 $6.50 $5.00 $8.85 $12.05 $11.25 $5.85 $10.45 $5.45 $6.25
[793] $12.85 $8.25 $7.85 $6.05 $8.25 $39.75 $5.25 $20.65 $8.45 $6.45 $6.85
[804] $8.85 $10.85 $25.45 $5.85 $18.45 $5.65 $12.65 $4.65 $7.75 $7.45 $36.25
[815] $6.50 $38.25 $7.85 $7.85 $13.25 $4.85 $4.65 $6.50 $3.25 $6.45 $7.25
[826] $20.45 $29.05 $9.25 $7.25 $8.00 $4.84 $5.25 $8.65 $5.25 $35.25 $5.65
[837] $5.25 $36.50 $29.05 $10.85 $11.25 $8.00 $6.05 $37.65 $9.05 $12.05 $4.75
[848] $10.25 $7.05 $5.65 $26.25 $85.25 $5.45 $13.50 $10.00 $7.05 $6.25 $15.25
[859] $4.45 $4.84 $35.00 $5.25 $6.05 $8.45 $5.25 $8.85 $7.05 $5.65 $5.65
[870] $7.85 $5.45 $7.85 $10.25 $11.50 $6.85 $41.05 $5.65 $8.05 $7.00 $7.25
[881] $5.65 $36.50 $10.50 $4.65 $4.65 $10.05 $8.05 $6.45 $4.45 $27.85 $5.45
[892] $7.85 $4.85 $12.05 $4.65 $9.05 $6.45 $12.25 $7.05 $5.25 $9.75 $4.65
[903] $7.25 $3.25 $9.00 $5.25 $4.84 $17.85 $8.25 $12.05 $4.65 $14.25 $7.05
[914] $4.50 $34.85 $7.45 $12.05 $3.45 $7.45 $5.65 $8.85 $8.85 $5.05 $9.25
[925] $5.65 $4.85 $9.65 $30.45 $12.85 $7.75 $6.25 $13.45 $4.45 $35.65 $17.75
[936] $6.85 $39.45 $9.05 $5.25 $11.75 $6.85 $11.65 $7.25 $7.45 $12.85 $7.45
[947] $3.25 $8.50 $7.05 $4.00 $8.25 $6.45 $5.65 $5.65 $5.65 $6.45 $5.65
[958] $7.85 $22.25 $12.25 $37.45 $3.25 $36.45 $21.85 $10.05 $5.85 $5.85 $10.45
[969] $15.45 $6.65 $15.05 $34.45 $18.45 $8.25 $5.25 $10.75 $13.85 $7.05 $15.65
[980] $5.25 $33.00 $3.25 $17.45 $5.25 $4.05 $5.50 $9.05 $5.05 $10.25 $6.25
[991] $5.85 $9.85 $15.05 $10.45 $5.45 $27.45 $11.50 $6.75 $4.85 $6.45
[ reached getOption("max.print") -- omitted 98999 entries ]
892 Levels: $0.00 $0.01 $0.03 $0.05 $0.10 $0.11 $0.28 $0.30 $0.32 $0.34 $0.42 $0.60 $1.00 ... $99.99
#Extracting the Tips Column
tips= mydata$Tips
#Calling the Tips Column
tips
[1] $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $1.00 $6.49 $0.00 $5.45 $0.00 $4.25 $1.71
[14] $0.00 $0.00 $0.00 $0.00 $0.00 $2.01 $2.00 $11.15 $0.00 $0.00 $2.00 $0.00 $0.00
[27] $0.00 $0.00 $8.17 $0.00 $0.00 $0.00 $0.00 $3.00 $0.00 $4.05 $0.00 $2.00 $1.00
[40] $2.41 $21.49 $0.00 $7.41 $0.00 $3.40 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[53] $0.00 $0.00 $4.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[66] $3.00 $0.00 $2.00 $7.85 $0.00 $0.00 $5.55 $0.00 $0.00 $0.00 $3.00 $0.00 $0.00
[79] $2.00 $0.00 $0.00 $0.00 $7.45 $0.00 $2.36 $0.00 $0.00 $0.00 $0.00 $1.10 $2.69
[92] $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $3.25 $2.00 $0.00 $0.00 $2.85 $0.00 $2.00
[105] $0.00 $1.00 $2.00 $0.00 $3.05 $9.70 $0.00 $2.00 $0.00 $0.00 $0.00 $2.00 $1.50
[118] $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $1.35 $0.00 $9.36 $1.70
[131] $0.00 $0.00 $2.00 $2.00 $0.00 $0.00 $0.00 $3.25 $0.00 $0.00 $5.37 $0.00 $12.00
[144] $0.00 $0.00 $4.70 $2.00 $8.00 $0.00 $1.00 $0.00 $0.00 $0.00 $0.00 $6.00 $0.00
[157] $1.00 $2.00 $0.00 $0.00 $0.00 $3.10 $0.00 $2.00 $3.00 $1.77 $2.15 $0.00 $1.20
[170] $0.00 $0.00 $2.00 $0.00 $3.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $7.65 $0.00
[183] $0.00 $0.00 $4.00 $0.00 $6.45 $0.00 $0.00 $1.69 $1.15 $3.00 $0.00 $2.00 $3.00
[196] $1.49 $8.35 $0.00 $2.05 $0.00 $2.00 $0.00 $0.00 $18.19 $0.00 $0.00 $0.00 $0.00
[209] $0.00 $1.95 $4.00 $2.00 $9.55 $0.00 $0.00 $2.05 $0.10 $2.00 $0.00 $2.00 $1.01
[222] $0.00 $0.00 $6.85 $0.00 $0.00 $1.25 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $2.30
[235] $2.00 $0.00 $0.00 $0.00 $2.00 $0.00 $1.00 $0.00 $2.00 $3.40 $0.00 $0.00 $0.00
[248] $0.00 $5.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00
[261] $2.00 $3.60 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $1.00 $3.65
[274] $3.00 $0.00 $0.00 $0.00 $6.00 $2.00 $0.00 $5.00 $0.00 $3.35 $0.00 $0.00 $2.65
[287] $0.00 $0.00 $0.00 $0.00 $0.00 $2.05 $0.00 $0.00 $0.00 $2.00 $2.00 $0.00 $0.00
[300] $9.30 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $6.95 $0.00 $2.00
[313] $2.00 $2.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $3.00 $0.00 $0.00 $2.00
[326] $0.00 $2.00 $0.00 $6.45 $4.00 $0.00 $0.00 $0.00 $2.65 $4.15 $2.00 $5.00 $0.00
[339] $0.00 $0.00 $3.75 $0.00 $9.55 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $13.12 $0.00
[352] $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $6.80 $3.00 $0.00 $0.00 $0.00 $0.00
[365] $0.00 $0.00 $2.00 $0.00 $0.00 $1.50 $0.00 $3.50 $0.00 $0.00 $0.00 $1.50 $0.00
[378] $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $2.30 $2.35 $0.00 $6.40 $2.00 $0.00 $0.00
[391] $0.00 $0.00 $0.00 $0.00 $0.00 $3.65 $0.00 $0.00 $6.80 $9.55 $2.00 $0.00 $0.00
[404] $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $2.05 $0.00 $3.25 $0.00 $2.93 $1.00 $0.00
[417] $0.00 $3.00 $0.00 $2.00 $1.37 $0.00 $0.00 $0.00 $1.50 $1.00 $0.00 $6.00 $0.00
[430] $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $3.00 $0.00 $0.00 $0.00 $1.63 $0.00
[443] $2.00 $1.85 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $7.50 $0.00 $0.00
[456] $0.00 $4.00 $7.85 $0.00 $0.00 $0.00 $0.00 $2.00 $2.00 $0.00 $0.00 $0.00 $2.00
[469] $7.57 $2.00 $0.00 $1.00 $1.50 $0.00 $0.00 $7.89 $0.00 $3.00 $0.00 $0.00 $3.20
[482] $2.55 $0.00 $0.00 $2.00 $0.00 $5.35 $0.00 $3.20 $0.00 $0.00 $3.00 $3.00 $0.15
[495] $0.00 $3.00 $0.00 $2.00 $6.00 $8.84 $0.00 $0.00 $1.00 $2.95 $3.00 $0.00 $10.36
[508] $0.00 $0.00 $0.00 $0.00 $3.45 $7.70 $3.00 $2.00 $0.00 $0.00 $2.00 $4.00 $1.00
[521] $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $11.40 $0.00 $9.35
[534] $0.00 $4.37 $0.00 $1.41 $0.00 $3.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[547] $0.00 $0.00 $2.00 $0.00 $1.20 $0.00 $13.30 $2.00 $2.47 $0.00 $0.00 $0.00 $0.00
[560] $0.00 $9.13 $0.00 $0.00 $0.00 $6.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00
[573] $0.00 $0.00 $0.00 $3.00 $0.00 $0.00 $2.00 $2.03 $0.00 $0.00 $0.00 $0.00 $2.00
[586] $2.17 $0.00 $5.00 $0.00 $0.00 $2.33 $0.00 $0.00 $0.00 $2.00 $0.00 $1.65 $7.15
[599] $1.50 $2.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $9.80 $1.00 $2.00 $0.00 $7.73
[612] $0.00 $0.00 $1.00 $0.00 $8.00 $0.00 $0.50 $2.29 $0.00 $2.00 $0.00 $0.00 $2.00
[625] $0.00 $0.00 $0.00 $0.00 $0.00 $2.13 $0.00 $0.00 $0.00 $0.00 $2.00 $0.00 $7.60
[638] $0.00 $0.00 $3.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.80 $7.85 $0.50 $0.00
[651] $0.00 $2.35 $0.00 $0.00 $2.00 $6.50 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $0.00
[664] $0.00 $5.00 $0.00 $0.00 $2.00 $2.00 $3.45 $1.00 $0.00 $0.00 $0.00 $0.00 $0.00
[677] $0.00 $0.00 $0.00 $0.00 $2.00 $3.00 $0.00 $5.00 $0.00 $0.00 $2.00 $7.49 $0.00
[690] $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $2.00 $0.00 $2.00
[703] $0.00 $0.00 $1.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[716] $1.00 $1.00 $4.51 $7.95 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[729] $3.00 $0.00 $0.00 $2.01 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $4.00 $2.00 $0.00
[742] $0.00 $7.85 $0.00 $0.00 $0.00 $1.00 $0.00 $0.00 $0.00 $3.00 $2.00 $18.05 $0.00
[755] $0.00 $0.00 $3.60 $0.00 $0.00 $2.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $3.00
[768] $0.00 $0.00 $0.00 $4.00 $6.49 $1.00 $2.00 $5.37 $0.00 $0.00 $0.00 $0.00 $2.00
[781] $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $2.00 $1.15 $2.00 $0.00 $0.00 $0.00
[794] $2.00 $2.00 $2.95 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.17 $5.05
[807] $0.00 $0.00 $0.00 $0.00 $0.00 $3.00 $4.00 $7.65 $2.00 $0.00 $0.00 $1.00 $0.00
[820] $4.00 $2.00 $0.00 $0.00 $1.29 $1.75 $0.00 $6.41 $0.00 $0.00 $0.00 $0.00 $0.00
[833] $2.00 $2.00 $0.00 $0.00 $1.00 $8.10 $5.00 $2.15 $0.00 $0.00 $0.00 $10.15 $1.00
[846] $0.00 $2.00 $0.00 $1.00 $0.00 $5.25 $0.00 $0.00 $2.60 $0.00 $1.41 $0.00 $0.00
[859] $0.00 $0.00 $0.00 $2.00 $0.00 $1.90 $0.00 $0.00 $0.00 $3.00 $0.00 $0.00 $0.00
[872] $3.99 $0.00 $3.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $1.00 $0.00
[885] $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $3.00
[898] $0.00 $0.00 $0.00 $3.00 $1.25 $1.45 $0.00 $0.10 $1.00 $0.00 $0.00 $0.00 $2.40
[911] $0.00 $0.00 $0.00 $0.00 $7.37 $0.00 $2.70 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00
[924] $0.00 $2.00 $1.00 $0.00 $0.00 $0.00 $2.00 $0.00 $2.65 $1.00 $7.53 $0.00 $0.00
[937] $0.00 $0.00 $0.00 $0.00 $1.14 $2.91 $2.00 $0.00 $0.00 $0.00 $0.00 $0.00 $1.41
[950] $0.00 $1.65 $2.00 $2.00 $0.00 $0.00 $0.00 $2.00 $0.00 $0.00 $0.00 $9.85 $0.00
[963] $3.00 $4.35 $2.00 $2.00 $1.50 $2.60 $0.00 $2.00 $1.25 $0.00 $0.00 $0.00 $0.00
[976] $0.00 $3.00 $0.00 $3.10 $2.00 $7.40 $0.00 $0.00 $0.00 $0.00 $0.00 $1.81 $2.00
[989] $1.00 $1.00 $0.00 $2.00 $0.00 $0.00 $2.00 $5.00 $2.60 $0.00 $0.00 $0.00
[ reached getOption("max.print") -- omitted 98999 entries ]
1081 Levels: $0.00 $0.01 $0.02 $0.03 $0.04 $0.05 $0.06 $0.08 $0.09 $0.10 $0.11 $0.13 $0.15 ... $9.98
Mean of Fares
mydata = read.csv(file = "data/Taxi_Sample_Trips.csv")
miles = mydata$Trip.Miles
mean_miles = mean(Miles)
mean_Miles
Define a relational business logic for the column field ‘Trip Seconds’.
‘Trip Seconds’ could be related to the following other columns:
‘Fare’ and ‘Trip Miles’ The longer the time for ‘Trip Seconds’ then the higher the fare will be. The longer the time for ‘Trip Seconds’ then the higher the amount of miles recorded will be. These will allow a taxi company business to see the relation that if there are more trip seconds, then they can expect more fares and miles driven.
Using https://erdplus.com/#/standalone draw a star schema using the following three tables: