Today the lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Make sure to download the folder titled ‘bsad_lab02’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
scoring = read.csv(file = "data/scoring.csv")
#Name the extracted variable
seniority = scoring$Seniority
time = scoring$Time
age = scoring$Age
expenses = scoring$Expenses
income = scoring$Income
assets = scoring$Assets
debt = scoring$Debt
#Calculate the averages below. Refer to Worksheet 1 for the correct command.
Seniority_mean = mean(seniority)
Time_mean = mean(time)
Age_mean = mean(age)
Expenses_mean = mean(expenses)
Income_mean = mean(income)
Assets_mean = mean(assets)
Debt_mean = mean(Debt)
Seniority_mean
[1] 7.991453
Time_mean
[1] 46.45344
Age_mean
[1] 37.08412
Expenses_mean
[1] 55.60144
Income_mean
[1] 140.6298
Assets_mean
[1] 5354.949
Debt_mean
[1] 342.2571
#Calculate standard deviations below. Refer to Worksheet 1 for the correct command.
Seniority_sd = sd(seniority)
Time_sd = sd(time)
Age_sd = sd(age)
Expenses_sd = sd(expenses)
Income_sd = sd(income)
Assets_sd = sd(assets)
Debt_sd = sd(debt)
Seniority_sd
[1] 8.17637
Time_sd
[1] 14.64798
Age_sd
[1] 10.98637
Expenses_sd
[1] 19.52084
Income_sd
[1] 80.1779
Assets_sd
[1] 11534.33
Debt_sd
[1] 1244.695
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
Seniority_max = max(seniority)
Time_max = max(time)
Age_max = max(age)
Expenses_max = max(expenses)
Income_max = max(income)
Assets_max = max(assets)
Debt_max = max(debt)
Seniority_max
[1] 48
Time_max
[1] 72
Age_max
[1] 68
Expenses_max
[1] 180
Income_max
[1] 959
Assets_max
[1] 300000
Debt_max
[1] 30000
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
Seniority__min = min(seniority)
Time_min = min(time)
Age_min = min(age)
Expenses_min = min(expenses)
Income_min = min(income)
Assets_min = min(assets)
Debt_min = min(debt)
Seniority__min
[1] 0
Time_min
[1] 6
Age_min
[1] 18
Expenses_min
[1] 35
Income_min
[1] 1
Assets_min
[1] 0
Debt_min
[1] 0
Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.
#Upper Threshold
Seniority_upper = Seniority_mean + (3) * Seniority_sd
Time_upper = Time_mean + (3) * Time_sd
Age_upper = Age_mean + (3) * Age_sd
Expenses_upper = Expenses_mean +(3) * Expenses_sd
Income_upper = Income_mean + (3) * Income_sd
Assets_upper = Assets_mean + (3) * Assets_sd
Debt_upper = Debt_mean + (3) * Debt_sd
Seniority_lower = Seniority_mean - (3) * Seniority_sd
Time_lower = Time_mean - (3) * Time_sd
Age_lower = Age_mean - (3) * Age_sd
Expenses_lower = Expenses_mean - (3) * Expenses_sd
Income_lower = Income_mean - (3) * Income_sd
Assets_lower = Assets_mean - (3) * Assets_sd
Debt_lower = Debt_mean - (3) * Debt_sd
Seniority_upper
[1] 32.52056
Time_upper
[1] 90.39738
Age_upper
[1] 70.04322
Expenses_upper
[1] 114.164
Income_upper
[1] 381.1635
Assets_upper
[1] 39957.93
Debt_upper
[1] 4076.341
Seniority_lower
[1] -16.53766
Time_lower
[1] 2.509503
Age_lower
[1] 4.125023
Expenses_lower
[1] -2.961078
Income_lower
[1] -99.90391
Assets_lower
[1] -29248.04
Debt_lower
[1] -3391.827
#Find the quantile data for seniority
quantile(seniority)
0% 25% 50% 75% 100%
0 2 5 12 48
lowerq = quantile(seniority)[2]
upperq = quantile(seniority)[4]
iqr = upperq - lowerq
iqr
75%
10
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
75%
27
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25%
-13
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
150 Outliers
seniority[seniority>upperthreshold]
[1] 29 33 32 30 29 30 38 40 32 35 37 40 31 35 30 29 33 30 30 30 30 40 40 30 31 29 32 37 35 35 30 30 40 30 30 31 30 29 35 43 28 35 30 38 28 40 28 36 28 40 35 41 30 37 28 30 30 30 37 40 28 28 30 30 30 35 35 28 30 30 28
[72] 40 40 30 29 32 30 30 30 35 29 30 30 38 32 28 32 43 30 31 30 32 36 30 30 29 30 31 32 30 28 35 28 30 29 29 40 38 30 35 35 30 31 45 30 42 31 48 29 47 31 38 28 32 36 30 30 33 31 30 39 30 31 40 30 28 45 33 30 30 45 30
[143] 32 33 30 36 30 40 40 34
seniority[seniority<lowerthreshold]
integer(0)
boxplot(seniority,horizontal = TRUE)
Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.
creditriskorig = read.csv(file = "data/creditriskorg.csv")
head(creditriskorig)
We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.
creditriskorg = read.csv("data/creditriskorg.csv",skip = 1)
head(creditriskorg)
To calculate the mean for Checking in R, follow Worksheet 1. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?
savings = creditriskorg$Savings
savings
[1] $739.00 $1,230.00 $389.00 $347.00 $4,754.00 $- $229.00 $533.00 $493.00 $- $989.00 $3,305.00 $578.00 $821.00 $228.00 $129.00
[17] $732.00 $683.00 $- $12,348.00 $17,545.00 $3,871.00 $- $485.00 $10,723.00 $245.00 $- $- $236.00 $485.00 $1,708.00 $-
[33] $407.00 $895.00 $150.00 $490.00 $162.00 $715.00 $323.00 $128.00 $- $109.00 $189.00 $537.00 $6,520.00 $138.00 $- $660.00
[49] $724.00 $897.00 $947.00 $- $917.00 $595.00 $789.00 $- $746.00 $140.00 $659.00 $717.00 $667.00 $- $763.00 $1,366.00
[65] $552.00 $14,643.00 $2,665.00 $- $442.00 $8,357.00 $- $863.00 $322.00 $800.00 $656.00 $922.00 $885.00 $2,886.00 $626.00 $-
[81] $904.00 $784.00 $806.00 $3,281.00 $759.00 $680.00 $104.00 $899.00 $1,732.00 $706.00 $- $576.00 $904.00 $194.00 $710.00 $5,564.00
[97] $192.00 $637.00 $405.00 $318.00 $698.00 $369.00 $- $492.00 $308.00 $127.00 $565.00 $12,632.00 $116.00 $178.00 $901.00 $650.00
[113] $241.00 $609.00 $131.00 $544.00 $10,853.00 $- $409.00 $391.00 $322.00 $583.00 $12,242.00 $479.00 $496.00 $466.00 $1,583.00 $1,533.00
[129] $- $4,873.00 $- $717.00 $7,876.00 $4,449.00 $- $104.00 $897.00 $4,033.00 $945.00 $836.00 $325.00 $19,568.00 $803.00 $10,980.00
[145] $265.00 $609.00 $1,851.00 $199.00 $500.00 $509.00 $270.00 $457.00 $260.00 $3,036.00 $643.00 $- $6,345.00 $922.00 $909.00 $775.00
[161] $979.00 $948.00 $2,790.00 $309.00 $762.00 $970.00 $320.00 $861.00 $262.00 $692.00 $540.00 $470.00 $192.00 $12,230.00 $772.00 $14,186.00
[177] $6,330.00 $18,716.00 $886.00 $750.00 $3,870.00 $3,273.00 $406.00 $461.00 $340.00 $6,490.00 $348.00 $506.00 $14,717.00 $- $1,571.00 $-
[193] $411.00 $544.00 $- $835.00 $823.00 $5,180.00 $408.00 $821.00 $385.00 $2,223.00 $- $605.00 $7,525.00 $3,529.00 $1,435.00 $887.00
[209] $243.00 $4,553.00 $418.00 $771.00 $463.00 $110.00 $10,099.00 $13,428.00 $208.00 $552.00 $3,105.00 $415.00 $1,238.00 $238.00 $127.00 $-
[225] $785.00 $718.00 $493.00 $757.00 $9,125.00 $364.00 $598.00 $374.00 $- $- $508.00 $956.00 $636.00 $- $1,519.00 $922.00
[241] $180.00 $701.00 $296.00 $519.00 $800.00 $736.00 $11,838.00 $364.00 $212.00 $888.00 $999.00 $- $578.00 $136.00 $734.00 $861.00
[257] $855.00 $4,486.00 $2,017.00 $- $500.00 $859.00 $3,305.00 $1,218.00 $9,016.00 $11,587.00 $8,944.00 $807.00 $867.00 $16,804.00 $347.00 $836.00
[273] $142.00 $169.00 $3,613.00 $403.00 $836.00 $- $11,481.00 $3,285.00 $164.00 $891.00 $- $841.00 $607.00 $486.00 $108.00 $-
[289] $113.00 $- $208.00 $603.00 $343.00 $463.00 $- $299.00 $490.00 $6,628.00 $859.00 $750.00 $954.00 $591.00 $13,970.00 $857.00
[305] $5,857.00 $3,326.00 $726.00 $214.00 $207.00 $713.00 $2,141.00 $483.00 $127.00 $367.00 $813.00 $- $102.00 $759.00 $- $503.00
[321] $823.00 $693.00 $973.00 $648.00 $523.00 $7,090.00 $596.00 $904.00 $541.00 $154.00 $- $- $337.00 $716.00 $837.00 $-
[337] $7,710.00 $531.00 $129.00 $941.00 $596.00 $987.00 $- $460.00 $- $798.00 $- $959.00 $1,482.00 $883.00 $12,721.00 $-
[353] $- $302.00 $538.00 $- $702.00 $2,688.00 $- $425.00 $214.00 $- $272.00 $124.00 $17,124.00 $612.00 $862.00 $146.00
[369] $14,190.00 $396.00 $519.00 $- $344.00 $204.00 $148.00 $435.00 $914.00 $- $- $412.00 $19,811.00 $822.00 $- $3,369.00
[385] $- $707.00 $818.00 $177.00 $4,071.00 $466.00 $460.00 $991.00 $17,653.00 $497.00 $4,014.00 $921.00 $- $- $607.00 $15,800.00
[401] $369.00 $4,973.00 $- $761.00 $471.00 $674.00 $547.00 $524.00 $815.00 $- $989.00 $10,406.00 $957.00 $770.00 $950.00 $160.00
[417] $276.00 $137.00 $579.00 $2,684.00 $- $- $- $712.00 $912.00
322 Levels: $- $1,218.00 $1,230.00 $1,238.00 $1,366.00 $1,435.00 $1,482.00 $1,519.00 $1,533.00 $1,571.00 $1,583.00 $1,708.00 $1,732.00 $1,851.00 $10,099.00 $10,406.00 ... $999.00
To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.
The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.
As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.
clean = savings
#substitute comma with blank in all of checking
clean = sub(",","",clean)
#substitute dollar sign with blank in all of checking
clean = sub("\\$","",clean)
#numeric convert
clean = as.numeric(clean)
NAs introduced by coercion
#mean with NA removed
clean
[1] 739 1230 389 347 4754 NA 229 533 493 NA 989 3305 578 821 228 129 732 683 NA 12348 17545 3871 NA 485 10723 245 NA NA 236 485 1708 NA 407 895 150
[36] 490 162 715 323 128 NA 109 189 537 6520 138 NA 660 724 897 947 NA 917 595 789 NA 746 140 659 717 667 NA 763 1366 552 14643 2665 NA 442 8357
[71] NA 863 322 800 656 922 885 2886 626 NA 904 784 806 3281 759 680 104 899 1732 706 NA 576 904 194 710 5564 192 637 405 318 698 369 NA 492 308
[106] 127 565 12632 116 178 901 650 241 609 131 544 10853 NA 409 391 322 583 12242 479 496 466 1583 1533 NA 4873 NA 717 7876 4449 NA 104 897 4033 945 836
[141] 325 19568 803 10980 265 609 1851 199 500 509 270 457 260 3036 643 NA 6345 922 909 775 979 948 2790 309 762 970 320 861 262 692 540 470 192 12230 772
[176] 14186 6330 18716 886 750 3870 3273 406 461 340 6490 348 506 14717 NA 1571 NA 411 544 NA 835 823 5180 408 821 385 2223 NA 605 7525 3529 1435 887 243 4553
[211] 418 771 463 110 10099 13428 208 552 3105 415 1238 238 127 NA 785 718 493 757 9125 364 598 374 NA NA 508 956 636 NA 1519 922 180 701 296 519 800
[246] 736 11838 364 212 888 999 NA 578 136 734 861 855 4486 2017 NA 500 859 3305 1218 9016 11587 8944 807 867 16804 347 836 142 169 3613 403 836 NA 11481 3285
[281] 164 891 NA 841 607 486 108 NA 113 NA 208 603 343 463 NA 299 490 6628 859 750 954 591 13970 857 5857 3326 726 214 207 713 2141 483 127 367 813
[316] NA 102 759 NA 503 823 693 973 648 523 7090 596 904 541 154 NA NA 337 716 837 NA 7710 531 129 941 596 987 NA 460 NA 798 NA 959 1482 883
[351] 12721 NA NA 302 538 NA 702 2688 NA 425 214 NA 272 124 17124 612 862 146 14190 396 519 NA 344 204 148 435 914 NA NA 412 19811 822 NA 3369 NA
[386] 707 818 177 4071 466 460 991 17653 497 4014 921 NA NA 607 15800 369 4973 NA 761 471 674 547 524 815 NA 989 10406 957 770 950 160 276 137 579 2684
[421] NA NA NA 712 912
What are some other ways to clean this data in R? How about in Excel? You can represent each unknown symbol as 0 if that is there value. In excel, you’d have to type in specific values.
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.
taxidata = read.csv(file = "data/taxi_trips_sample.csv")
Define a relational business logic for the column field ‘Trip Seconds’.
Star Schema: