Today the lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Make sure to download the folder titled ‘bsad_lab02’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
mydata = read.csv(file="data/Scoring.csv")
mydata
#Name the extracted variable
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
age = mydata$Age
age
[1] 30 58 46 24 26 36 44 27 32 41 34 29 30 37 21
[16] 68 52 68 36 31 25 22 45 41 51 54 43 43 23 29
[31] 23 51 39 35 62 56 32 36 64 31 42 61 32 43 26
[46] 49 47 27 52 28 31 32 51 37 29 26 27 63 55 28
[61] 31 27 40 35 43 34 37 53 46 36 46 43 31 43 38
[76] 37 46 55 56 55 52 29 39 38 24 28 53 25 62 29
[91] 32 21 38 56 41 46 54 51 52 57 33 33 29 23 33
[106] 24 28 27 33 32 39 51 39 29 32 40 47 66 39 50
[121] 25 27 54 42 35 23 52 22 48 25 26 28 62 33 41
[136] 29 37 31 50 42 33 54 58 40 31 35 64 29 33 35
[151] 28 59 25 23 30 36 25 22 45 47 36 36 60 59 58
[166] 37 32 26 40 55 30 43 28 47 29 48 34 53 44 49
[181] 51 34 42 34 46 31 31 33 30 27 48 33 44 48 57
[196] 34 38 35 52 28 46 56 52 35 45 29 22 33 41 37
[211] 46 19 22 31 44 48 31 53 36 46 37 34 36 64 27
[226] 26 53 50 21 40 49 40 37 25 40 22 49 34 49 43
[241] 42 28 30 25 22 27 26 28 35 51 24 34 34 49 28
[256] 57 63 37 27 39 64 49 49 45 27 30 33 40 30 30
[271] 25 40 50 23 28 35 27 19 39 39 55 32 54 58 37
[286] 36 28 38 26 41 42 46 52 30 47 66 54 65 25 31
[301] 23 35 47 28 50 25 39 46 36 39 25 49 52 34 24
[316] 38 24 33 31 63 23 61 28 58 41 38 27 55 40 59
[331] 47 38 25 28 49 38 52 37 42 54 30 26 49 36 40
[346] 61 59 49 52 35 25 30 28 42 29 32 45 52 26 33
[361] 21 34 27 44 51 33 35 38 28 26 34 27 32 43 30
[376] 42 26 20 24 27 35 18 35 33 39 36 29 23 34 57
[391] 38 52 42 44 37 34 43 42 24 36 52 34 34 31 36
[406] 50 33 54 40 29 31 32 51 31 46 35 33 25 31 43
[421] 30 40 37 22 24 27 32 43 57 59 60 25 28 40 36
[436] 27 33 44 34 28 39 47 28 37 20 36 28 20 32 19
[451] 21 23 47 37 51 31 49 27 34 28 29 51 34 42 46
[466] 40 32 26 31 47 45 26 41 31 23 34 34 38 55 54
[481] 24 37 43 26 59 54 27 28 53 48 30 28 34 33 55
[496] 40 28 31 56 38 25 35 41 31 52 33 30 37 54 32
[511] 23 39 28 43 50 41 27 25 35 31 43 28 22 42 41
[526] 29 27 32 27 37 59 52 25 24 51 39 33 31 30 24
[541] 31 39 42 37 38 42 23 40 36 27 60 45 65 40 40
[556] 33 28 33 51 31 36 23 33 51 31 32 50 52 34 25
[571] 28 36 27 26 24 47 40 42 52 26 45 47 25 50 44
[586] 48 19 48 46 50 30 30 39 27 52 42 22 43 36 23
[601] 40 50 52 38 35 34 34 49 37 33 34 42 32 19 45
[616] 41 25 19 59 56 42 38 22 44 31 49 55 55 38 60
[631] 58 39 34 21 37 25 39 22 35 50 49 30 32 37 24
[646] 63 25 23 22 52 39 29 43 35 27 34 39 51 35 33
[661] 30 25 53 26 61 63 30 43 35 49 34 26 41 30 51
[676] 51 40 53 33 24 25 44 33 27 24 32 32 36 27 55
[691] 39 36 38 37 36 49 27 37 49 28 27 47 44 29 35
[706] 47 31 23 46 50 26 36 28 36 61 46 25 22 38 57
[721] 26 24 51 26 46 37 55 38 24 27 43 52 57 36 62
[736] 19 32 43 34 34 34 30 40 42 49 40 24 35 36 62
[751] 24 44 57 25 30 65 32 43 48 25 36 47 22 47 42
[766] 40 40 43 38 41 55 32 42 23 41 50 49 62 43 23
[781] 33 25 50 37 27 47 22 39 46 19 24 43 42 32 27
[796] 34 43 21 36 23 40 25 38 58 41 39 19 31 28 49
[811] 47 42 25 48 52 37 45 30 38 50 29 52 42 37 36
[826] 33 41 23 30 27 33 23 41 52 58 46 28 38 39 24
[841] 41 51 56 24 49 21 34 38 58 39 60 23 37 48 26
[856] 23 34 46 51 45 55 39 33 52 26 47 31 50 39 36
[871] 47 45 40 36 43 44 55 32 26 33 27 24 29 20 46
[886] 62 49 27 52 26 43 34 37 37 39 26 34 30 26 58
[901] 36 40 49 55 25 33 38 38 33 42 47 25 42 52 36
[916] 28 33 46 22 22 23 29 29 37 30 52 46 51 28 33
[931] 39 33 40 59 32 45 38 26 29 21 31 32 40 29 33
[946] 29 57 23 36 38 45 51 42 22 26 51 33 43 32 52
[961] 65 32 29 25 22 53 47 34 23 54 22 46 42 45 49
[976] 28 22 27 33 52 34 43 40 31 31 27 42 35 42 24
[991] 31 53 59 38 49 42 54 24 39 48
[ reached getOption("max.print") -- omitted 3446 entries ]
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
meanAge = mean(age)
meanAge
[1] 37.08412
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command.
sd_age = sd(age)
sd_age
[1] 10.98637
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
maxAge = max(age)
maxAge
[1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
minAge = min(age)
minAge
[1] 18
Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.
#Use the formula above to calculate the upper and lower threshold
upperthreshholdAge = mean(age + 3 * sd(age))
upperthreshholdAge
[1] 70.04322
lowerthreshholdAge = mean(age - 3 * sd(age))
lowerthreshholdAge
[1] 4.125023
A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..
quantile(age)
0% 25% 50% 75% 100%
18 28 36 45 68
iqr = 45 - 28
iqr
[1] 17
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + 45
upperthreshold
[1] 70.5
Below is the lower threshold:
lowerthreshold = 28 - (iqr * 1.5)
lowerthreshold
[1] 2.5
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
boxplot(age)
Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.
mydata = read.csv(file="data/creditriskorg.csv")
mydata
We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.
Checking = mydata$X.1
Checking
[1] Checking $- $-
[4] $- $638.00 $963.00
[7] $2,827.00 $- $-
[10] $6,509.00 $966.00 $-
[13] $- $322.00 $-
[16] $396.00 $- $652.00
[19] $708.00 $207.00 $287.00
[22] $- $101.00 $-
[25] $- $- $141.00
[28] $- $2,484.00 $237.00
[31] $- $335.00 $3,565.00
[34] $- $16,647.00 $-
[37] $- $- $940.00
[40] $- $- $218.00
[43] $- $16,935.00 $664.00
[46] $150.00 $- $216.00
[49] $- $- $-
[52] $265.00 $4,256.00 $870.00
[55] $162.00 $- $-
[58] $- $461.00 $-
[61] $- $- $580.00
[64] $- $- $-
[67] $- $758.00 $399.00
[70] $513.00 $- $-
[73] $565.00 $- $-
[76] $- $166.00 $9,783.00
[79] $674.00 $- $15,328.00
[82] $- $713.00 $-
[85] $- $- $-
[88] $- $303.00 $900.00
[91] $- $1,257.00 $-
[94] $273.00 $522.00 $-
[97] $- $- $-
[100] $514.00 $457.00 $5,133.00
[103] $- $644.00 $305.00
[106] $9,621.00 $- $-
[109] $- $- $-
[112] $6,851.00 $13,496.00 $509.00
[115] $- $19,155.00 $-
[118] $- $374.00 $-
[121] $828.00 $- $829.00
[124] $- $- $939.00
[127] $- $889.00 $876.00
[130] $893.00 $12,760.00 $-
[133] $- $959.00 $-
[136] $- $- $-
[139] $698.00 $- $-
[142] $- $12,974.00 $-
[145] $317.00 $- $-
[148] $- $192.00 $-
[151] $- $- $-
[154] $- $942.00 $-
[157] $3,329.00 $- $-
[160] $- $- $-
[163] $- $339.00 $-
[166] $- $- $105.00
[169] $- $216.00 $113.00
[172] $109.00 $- $-
[175] $8,176.00 $- $468.00
[178] $7,885.00 $- $-
[181] $- $- $-
[184] $- $- $-
[187] $- $734.00 $-
[190] $- $172.00 $644.00
[193] $- $617.00 $-
[196] $586.00 $- $-
[199] $- $- $-
[202] $522.00 $585.00 $5,588.00
[205] $- $352.00 $-
[208] $2,715.00 $560.00 $895.00
[211] $305.00 $- $-
[214] $- $8,948.00 $-
[217] $- $- $-
[220] $- $483.00 $-
[223] $- $- $663.00
[226] $624.00 $- $-
[229] $152.00 $- $-
[232] $498.00 $- $156.00
[235] $1,336.00 $- $-
[238] $- $2,641.00 $-
[241] $- $- $-
[244] $- $887.00 $-
[247] $- $- $-
[250] $18,408.00 $497.00 $-
[253] $946.00 $986.00 $8,122.00
[256] $- $778.00 $645.00
[259] $- $682.00 $19,812.00
[262] $- $- $859.00
[265] $- $- $-
[268] $- $- $-
[271] $795.00 $- $-
[274] $- $- $852.00
[277] $- $- $425.00
[280] $- $- $-
[283] $11,072.00 $- $219.00
[286] $8,060.00 $- $-
[289] $- $- $1,613.00
[292] $757.00 $- $-
[295] $977.00 $197.00 $-
[298] $- $- $-
[301] $- $256.00 $296.00
[304] $- $- $-
[307] $298.00 $- $8,636.00
[310] $- $- $19,766.00
[313] $- $- $-
[316] $- $4,089.00 $-
[319] $271.00 $949.00 $-
[322] $911.00 $- $-
[325] $- $- $271.00
[328] $- $- $-
[331] $- $4,802.00 $177.00
[334] $- $- $996.00
[337] $705.00 $- $-
[340] $5,960.00 $- $759.00
[343] $- $651.00 $257.00
[346] $955.00 $- $8,249.00
[349] $- $956.00 $382.00
[352] $- $842.00 $3,111.00
[355] $- $- $2,846.00
[358] $231.00 $- $17,366.00
[361] $- $332.00 $242.00
[364] $- $929.00 $-
[367] $- $- $-
[370] $- $- $-
[373] $646.00 $538.00 $-
[376] $- $- $-
[379] $135.00 $2,472.00 $-
[382] $10,417.00 $211.00 $16,630.00
[385] $- $642.00 $-
[388] $296.00 $898.00 $478.00
[391] $315.00 $122.00 $-
[394] $- $- $670.00
[397] $444.00 $3,880.00 $819.00
[400] $- $- $-
[403] $- $- $-
[406] $- $- $-
[409] $161.00 $- $-
[412] $789.00 $765.00 $-
[415] $- $983.00 $-
[418] $- $798.00 $-
[421] $193.00 $497.00 $-
[424] $- $- $-
169 Levels: $- $1,257.00 ... Checking
To calculate the mean for Checking in R, follow Worksheet 1. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?
meanChecking = mean(Checking)
argument is not numeric or logical: returning NA
meanChecking
[1] NA
To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.
The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.
As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.
#substitute comma with blank in all of checking
comma = sub(",") = ("")
Error in sub(",") = ("") :
target of assignment expands to non-language object
What are some other ways to clean this data in R? How about in Excel?
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.
mydata = read.csv(file="data/Taxi_Trips_Sample.csv")
mydata
str(mydata)
'data.frame': 99999 obs. of 24 variables:
$ Trip.ID : Factor w/ 99999 levels "3e7d6d8ccf1425ae1dcd584f5c3ca303cf6362ed",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Taxi.ID : Factor w/ 6164 levels "0008de7a146802839c9e6059f482d292ebdae13c5c31dd6e5983a80882e2a5dbcd6ea098c2fcd56f34ce02645eb94c6b39512e930483774"| __truncated__,..: 4264 4466 5890 335 5491 1199 207 2234 639 1106 ...
$ Trip.Start.Timestamp : Factor w/ 64107 levels "01/01/2013 01:00:00 PM",..: 43393 49615 7884 23880 8923 61522 36223 11713 37705 29843 ...
$ Trip.End.Timestamp : Factor w/ 64181 levels "","01/01/2013 01:00:00 AM",..: 43465 49683 7924 23952 8959 61606 36292 11780 37776 29890 ...
$ Trip.Seconds : int 480 420 420 2340 300 1020 360 2220 1020 780 ...
$ Trip.Miles : num 0.15 0 1.7 13.8 0.7 1.4 0.1 13.3 8 7.7 ...
$ Pickup.Census.Tract : num 1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
$ Dropoff.Census.Tract : num 1.7e+10 1.7e+10 NA 1.7e+10 1.7e+10 ...
$ Pickup.Community.Area : int 28 8 6 76 8 NA 7 56 NA 11 ...
$ Dropoff.Community.Area : int 32 28 4 6 8 NA 8 24 NA 76 ...
$ Fare : Factor w/ 892 levels "","$0.00","$0.01",..: 734 642 734 329 554 860 857 320 175 162 ...
$ Tips : Factor w/ 1081 levels "","$0.00","$0.01",..: 2 2 2 2 2 2 50 839 2 756 ...
$ Tolls : Factor w/ 25 levels "","$0.00","$0.60",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Extras : Factor w/ 183 levels "","$0.00","$0.01",..: 19 2 2 99 2 12 2 55 12 21 ...
$ Trip.Total : Factor w/ 2702 levels "","$0.00","$0.01",..: 2472 2030 2280 1105 1704 28 17 1265 548 1055 ...
$ Payment.Type : Factor w/ 7 levels "Cash","Credit Card",..: 1 1 1 1 1 1 2 2 1 2 ...
$ Company : Factor w/ 98 levels "","0118 - 42111 Godfrey S.Awir",..: 1 97 97 1 1 1 97 1 1 1 ...
$ Pickup.Centroid.Latitude : num 41.9 41.9 41.9 42 41.9 ...
$ Pickup.Centroid.Longitude : num -87.6 -87.6 -87.7 -87.9 -87.6 ...
$ Pickup.Centroid.Location : Factor w/ 309 levels "","POINT (-87.540936 41.663671)",..: 80 66 119 308 44 1 98 292 1 301 ...
$ Dropoff.Centroid.Latitude : num 41.9 41.9 42 42 41.9 ...
$ Dropoff.Centroid.Longitude: num -87.6 -87.6 -87.7 -87.7 -87.6 ...
$ Dropoff.Centroid..Location: Factor w/ 357 levels "","POINT (-87.534903 41.707311)",..: 65 88 256 178 51 1 67 227 1 357 ...
$ Community.Areas : int 29 37 57 75 37 NA 68 53 NA 11 ...
Define a relational business logic for the column field ‘Trip Seconds’.
Using https://erdplus.com/#/standalone draw a star schema using the following three tables:
“My ERD daigram is in the worksheet 1 tab because there was no images folder in lab02, so I used the images folder in lab01 and put it in there.