For each of the two datasets below
1979 salaries (in hundreds of Swiss francs) of 7 different professions in 6 cities
Datafile: salaries in the LearnEDAfunctions package
library(LearnEDAfunctions)
head(salaries)
## Salary City Profession
## 1 341 Amsterdam Teacher
## 2 110 Athens Teacher
## 3 31 Bangkok Teacher
## 4 116 Hong_Kong Teacher
## 5 326 Los_Angeles Teacher
## 6 89 Singapore Teacher
Five number summaries:
View(salaries)
fivenum(subset(salaries$Salary, salaries$City == 'Amsterdam'))
## [1] 266 310 341 424 593
fivenum(subset(salaries$Salary, salaries$City == 'Athens'))
## [1] 106.0 117.5 161.0 192.0 320.0
fivenum(subset(salaries$Salary, salaries$City == 'Bangkok'))
## [1] 31.0 34.5 37.0 101.5 148.0
fivenum(subset(salaries$Salary, salaries$City == 'Hong_Kong'))
## [1] 59.0 96.0 116.0 159.5 203.0
fivenum(subset(salaries$Salary, salaries$City == 'Los_Angeles'))
## [1] 179.0 308.5 326.0 412.0 593.0
fivenum(subset(salaries$Salary, salaries$City == 'Singapore'))
## [1] 43.0 67.5 89.0 97.5 250.0
Amsterdam: LO = 266, FL = 310, M = 341, FU = 424, HI = 593
STEP = 1.5 * (424 - 310) = 171
inner fences: fencelower = FL - STEP = 310 - 171 = 139; fenceupper = FU + STEP = 424 + 171 = 595
outer fences: FENCElower = FL - 2 * STEP = 310 - 2 * 171 = -32; FENCEupper = FU + 2 * STEP = 424 + 2 * 171 = 766
outside values: none

Athens: LO = 106, FL = 117.5, M = 161, FU = 192, HI = 320
STEP = 1.5 * (192 - 117.5) = 111.75
inner fences: fencelower = FL - STEP = 117.5 - 111.75 = 5.75; fenceupper = FU + STEP = 192 + 111.75 = 303.75
outer fences: FENCElower = FL - 2 * STEP = 117.5 - 2 * 111.75 = -106; FENCEupper = FU + 2 * STEP = 192 + 2 * 111.75 = 415.5
outside values: 320 (manager) is outside the inner upper fence but still within the outer upper fence.

Bangkok: LO = 31, FL = 34.5, M = 37, FU = 101.5, HI = 148
STEP = 1.5 * (101.5 - 34.5) = 100.5
inner fences: fencelower = FL - STEP = 34.5 - 100.5 = -66; fenceupper = FU + STEP = 101.5 + 100.5 = 202
outer fences: FENCElower = FL - 2 * STEP = 34.5 - 2 * 100.5 = -166.5; FENCEupper = FU + 2 * STEP = 101.5 + 2 * 100.5 = 302.5
outside values: none

Hong_Kong: LO = 59, FL = 96, M = 116, FU = 159.5, HI = 203
STEP = 1.5 * (159.5 - 96) = 95.25
inner fences: fencelower = FL - STEP = 96 - 95.25 = 0.75; fenceupper = FU + STEP = 159.5 + 95.25 = 254.75
outer fences: FENCElower = FL - 2 * STEP = 96 - 2 * 95.25 = -94.5; FENCEupper = FU + 2 * STEP = 159.5 + 2 * 95.25 = 350
outside values: none

Los_Angeles: LO = 179, FL = 308.5, M = 326, FU = 412, HI = 593
STEP = 1.5 * (412 - 308.5) = 155.25
inner fences: fencelower = FL - STEP = 308.5 - 155.25 = 153.25; fenceupper = FU + STEP = 412 + 155.25 = 567.25
outer fences: FENCElower = FL - 2 * STEP = 308.5 - 2 * 155.25 = -2; FENCEupper = FU + 2 * STEP = 412 + 2 * 155.25 = 722.5
outside values: 593 (manager) is outside the inner upper fence but still within the outer upper fence.

Singapore: LO = 43, FL = 67.5, M = 89, FU = 97.5, HI = 250
STEP = 1.5 * (97.5 - 67.5) = 45
inner fences: fencelower = FL - STEP = 67.5 - 45 = 22.5; fenceupper = FU + STEP = 97.5 + 45 = 142.5
outer fences: FENCElower = FL - 2 * STEP = 67.5 - 2 * 45 = -22.5; FENCEupper = FU + 2 * STEP = 97.5 + 2 * 45 = 187.5
outside values: 250 (manager) is outside both the inner upper fence and the outer upper fence.
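The fence arithmetic above can be reproduced with a small helper. The following is a minimal sketch in base R, assuming the five-number summary is supplied in the order returned by fivenum() (LO, FL, M, FU, HI). The name fences() is a hypothetical helper introduced here for illustration and is not part of LearnEDAfunctions.

```r
# Hypothetical helper (not part of LearnEDAfunctions): compute the step and
# the inner/outer fences from a five-number summary (LO, FL, M, FU, HI).
fences <- function(five) {
  fl <- five[2]
  fu <- five[4]
  step <- 1.5 * (fu - fl)            # STEP = 1.5 * fourth-spread
  c(step        = step,
    inner_lower = fl - step,
    inner_upper = fu + step,
    outer_lower = fl - 2 * step,
    outer_upper = fu + 2 * step)
}

# Five-number summaries copied from the fivenum() output above
amsterdam <- c(266, 310, 341, 424, 593)
singapore <- c(43, 67.5, 89, 97.5, 250)

f_ams <- fences(amsterdam)
f_sin <- fences(singapore)
print(f_ams)
print(f_sin)
```

Comparing HI and LO with the returned fences gives the "outside values" line for each city.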
Parallel Boxplots:
ggplot(salaries, aes(x = factor(City), y = Salary)) +
geom_boxplot() + coord_flip() +
xlab("City") + ylab("Salary")
spread-vs-level plot
spread_level_plot(salaries, Salary, City)
## # A tibble: 6 × 5
## City M df log.M log.df
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Amsterdam 341 114 2.53 2.06
## 2 Athens 161 74.5 2.21 1.87
## 3 Bangkok 37 67 1.57 1.83
## 4 Hong_Kong 116 63.5 2.06 1.80
## 5 Los_Angeles 326 104. 2.51 2.01
## 6 Singapore 89 30 1.95 1.48
Power of transformation: with b the slope of the line fitted to the spread-vs-level plot, p = 1 - b = 1 - 0.85 = 0.15, which we round to 0.2. So this method suggests that we should reexpress the salary data by taking the 0.2 power. Now, let us try this method.
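As a quick check of the reexpression, here is a minimal base R sketch. It takes the slope b = 0.85 as given from the text rather than re-estimating it, and it applies the rounded power 0.2 to three Amsterdam salaries taken from the five-number summary above. The variable names are illustrative.

```r
# b is the slope quoted in the text; it is taken as given here,
# not re-estimated from the data.
b <- 0.85
p <- 1 - b          # suggested power
p_used <- 0.2       # rounded power actually used in the analysis

# LO, M and HI for Amsterdam (actual observations, so they map exactly;
# with 7 values per city the hinges are averages of two observations
# and therefore do not map exactly under a power).
salary <- c(LO = 266, M = 341, HI = 593)
reexpressed <- salary^p_used
print(p)
print(round(reexpressed, 6))
```

The three reexpressed values should agree with the LO, M and HI entries of the reexpressed Amsterdam five-number summary.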
salaries %>%
mutate(Reexpressed = Salary ^ (0.2)) ->
salaries
ggplot(salaries, aes(City, Reexpressed)) +
geom_boxplot() + coord_flip()
Reexpressed five number summaries:
View(salaries)
fivenum(subset(salaries$Reexpressed, salaries$City == 'Amsterdam'))
## [1] 3.054755 3.149345 3.210339 3.349086 3.586005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Athens'))
## [1] 2.541331 2.593378 2.762900 2.861882 3.169786
fivenum(subset(salaries$Reexpressed, salaries$City == 'Bangkok'))
## [1] 1.987341 2.030283 2.058924 2.508322 2.716767
fivenum(subset(salaries$Reexpressed, salaries$City == 'Hong_Kong'))
## [1] 2.260322 2.483523 2.587567 2.756374 2.894005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Los_Angeles'))
## [1] 2.822088 3.146112 3.181585 3.330311 3.586005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Singapore'))
## [1] 2.121747 2.311973 2.454019 2.498562 3.017088
Amsterdam: LO = 3.054755, FL = 3.149345, M = 3.210339, FU = 3.349086, HI = 3.586005
STEP = 1.5 * (FU - FL) = 0.2996115
inner fences: fencelower = FL - STEP = 2.849734; fenceupper = FU + STEP = 3.648697
outer fences: FENCElower = FL - 2 * STEP = 2.550122; FENCEupper = FU + 2 * STEP = 3.948309
outliers: none

Athens: LO = 2.541331, FL = 2.593378, M = 2.762900, FU = 2.861882, HI = 3.169786
STEP = 1.5 * (FU - FL) = 0.402756
inner fences: fencelower = FL - STEP = 2.190622; fenceupper = FU + STEP = 3.264638
outer fences: FENCElower = FL - 2 * STEP = 1.787866; FENCEupper = FU + 2 * STEP = 3.667394
outliers: none

Bangkok: LO = 1.987341, FL = 2.030283, M = 2.058924, FU = 2.508322, HI = 2.716767
STEP = 1.5 * (FU - FL) = 0.7170585
inner fences: fencelower = FL - STEP = 1.313224; fenceupper = FU + STEP = 3.225381
outer fences: FENCElower = FL - 2 * STEP = 0.596166; FENCEupper = FU + 2 * STEP = 3.942439
outliers: none

Hong_Kong: LO = 2.260322, FL = 2.483523, M = 2.587567, FU = 2.756374, HI = 2.894005
STEP = 1.5 * (FU - FL) = 0.4092765
inner fences: fencelower = FL - STEP = 2.074246; fenceupper = FU + STEP = 3.165651
outer fences: FENCElower = FL - 2 * STEP = 1.66497; FENCEupper = FU + 2 * STEP = 3.574927
outliers: none

Los_Angeles: LO = 2.822088, FL = 3.146112, M = 3.181585, FU = 3.330311, HI = 3.586005
STEP = 1.5 * (FU - FL) = 0.2762985
inner fences: fencelower = FL - STEP = 2.869814; fenceupper = FU + STEP = 3.60661
outer fences: FENCElower = FL - 2 * STEP = 2.593515; FENCEupper = FU + 2 * STEP = 3.882908
outliers: 2.822088 (cashier) is outside the inner lower fence but still within the outer lower fence.

Singapore: LO = 2.121747, FL = 2.311973, M = 2.454019, FU = 2.498562, HI = 3.017088
STEP = 1.5 * (FU - FL) = 0.2798835
inner fences: fencelower = FL - STEP = 2.032089; fenceupper = FU + STEP = 2.778446
outer fences: FENCElower = FL - 2 * STEP = 1.752206; FENCEupper = FU + 2 * STEP = 3.058329
outliers: 3.017088 (manager) is outside the inner upper fence (2.778446) but still within the outer upper fence (3.058329).
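Deciding whether a value lies inside the inner fences, between the inner and outer fences, or beyond the outer fences is easy to get wrong by eye when the numbers are this close. The sketch below, in base R, encodes the comparison. The function classify_value() is a hypothetical helper, and the hinges and extremes are copied from the reexpressed summaries above.

```r
# Hypothetical helper: label a value relative to the inner and outer fences
# built from the lower hinge fl and the upper hinge fu.
classify_value <- function(x, fl, fu) {
  step <- 1.5 * (fu - fl)
  if (x < fl - 2 * step || x > fu + 2 * step) {
    "beyond outer fence"
  } else if (x < fl - step || x > fu + step) {
    "between inner and outer fences"
  } else {
    "inside inner fences"
  }
}

# Reexpressed hinges and extremes taken from the summaries above
la_low  <- classify_value(2.822088, fl = 3.146112, fu = 3.330311)  # Los_Angeles LO
sing_hi <- classify_value(3.017088, fl = 2.311973, fu = 2.498562)  # Singapore HI
ams_hi  <- classify_value(3.586005, fl = 3.149345, fu = 3.349086)  # Amsterdam HI
print(c(Los_Angeles = la_low, Singapore = sing_hi, Amsterdam = ams_hi))
```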
spread_level_plot(salaries, Reexpressed, City)
## # A tibble: 6 × 5
## City M df log.M log.df
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Amsterdam 3.21 0.200 0.507 -0.700
## 2 Athens 2.76 0.269 0.441 -0.571
## 3 Bangkok 2.06 0.478 0.314 -0.321
## 4 Hong_Kong 2.59 0.273 0.413 -0.564
## 5 Los_Angeles 3.18 0.184 0.503 -0.735
## 6 Singapore 2.45 0.187 0.390 -0.729
After reexpressing the data, this display looks perhaps a little better than our original picture. There are still outside values, the spreads of the batches are still not equal, and there still seems to be a dependence between level and spread. Since the 0.2 power did not fully remove this dependence, we could try other reexpressions from the ladder of powers, such as the log transformation (p = 0) or a different power.
Moreover, comparing the spreads (dFs) side by side, we see that the largest spread in the original dataset is 114/30 = 3.8 times the smallest. For the reexpressed data, the spreads range from 0.1841987 to 0.4780390, so the ratio is 0.4780390/0.1841987 = 2.595. The spreads are therefore more alike than before, but there still seems to be a dependence between level and spread. We can check this point with a spread-vs-level plot for the reexpressed data, which shows that there is now a negative dependence between level and spread. The most obvious difference between the cities is the spread. In the original dataset, we can clearly see that cities with a high median salary also have a larger spread. For example, Amsterdam has the highest median and also the largest spread of all cities. On the other hand, Bangkok, which has the smallest median, also has a comparatively small spread. Two unusual values in the reexpressed dataset are 2.822088 (cashier) in Los Angeles and 3.017088 (manager) in Singapore.
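The ratio of the largest to the smallest fourth-spread can be computed directly from the df columns of the two spread-vs-level tables. This is a minimal base R sketch; spread_ratio() is a hypothetical helper. The Los_Angeles raw spread is entered as 412 - 308.5 = 103.5, since the table prints it rounded as 104.

```r
# Fourth-spreads (df column) copied from the two spread-vs-level tables above
df_raw <- c(Amsterdam = 114, Athens = 74.5, Bangkok = 67,
            Hong_Kong = 63.5, Los_Angeles = 103.5, Singapore = 30)
df_reexpressed <- c(Amsterdam = 0.200, Athens = 0.269, Bangkok = 0.478,
                    Hong_Kong = 0.273, Los_Angeles = 0.184, Singapore = 0.187)

# Hypothetical helper: ratio of the largest to the smallest spread
spread_ratio <- function(d) max(d) / min(d)

ratio_raw <- spread_ratio(df_raw)
ratio_reexpressed <- spread_ratio(df_reexpressed)
print(c(raw = ratio_raw, reexpressed = ratio_reexpressed))
```

A ratio closer to 1 means the batches have more nearly equal spreads. The reexpressed ratio here is computed from the rounded table entries, so it differs slightly in the last digits from a ratio based on unrounded spreads.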
Areas of Important Islands by Continent
Datafile: island.areas in the LearnEDAfunctions package
head(island.areas)
## Ocean Name Area
## 1 Arctic Axel_Heilberg 16671
## 2 Arctic Baffin 195928
## 3 Arctic Banks 27038
## 4 Arctic Bathurst 6194
## 5 Arctic Devon 21331
## 6 Arctic Ellesmere 75767
View(island.areas)
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Arctic'))
## [1] 2800 11221 16671 31019 195928
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Caribbean'))
## [1] 59.0 124.0 290.0 2689.5 44218.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Indian'))
## [1] 171.0 510.0 844.5 13916.0 226658.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Mediterranean'))
## [1] 86.0 385.5 1936.0 3470.5 9822.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'East_Indies'))
## [1] 2113.0 3707.0 21429.5 69000.0 280100.0
Arctic: LO = 2800, FL = 11221, M = 16671, FU = 31019, HI = 195928
STEP = 1.5 * (31019 - 11221) = 29697
inner fences: fencelower = FL - STEP = 11221 - 29697 = -18476; fenceupper = FU + STEP = 31019 + 29697 = 60716
outer fences: FENCElower = FL - 2 * STEP = 11221 - 2 * 29697 = -48173; FENCEupper = FU + 2 * STEP = 31019 + 2 * 29697 = 90413
outside values: 195928 (Baffin) is outside the upper inner fence as well as the upper outer fence. 83896 (Victoria) and 75767 (Ellesmere) are outside the upper inner fence but still within the upper outer fence.

Caribbean: LO = 59, FL = 124, M = 290, FU = 2689.5, HI = 44218
STEP = 1.5 * (2689.5 - 124) = 3848.25
inner fences: fencelower = FL - STEP = 124 - 3848.25 = -3724.25; fenceupper = FU + STEP = 2689.5 + 3848.25 = 6537.75
outer fences: FENCElower = FL - 2 * STEP = 124 - 2 * 3848.25 = -7572.5; FENCEupper = FU + 2 * STEP = 2689.5 + 2 * 3848.25 = 10386
outside values: 44218 (Cuba) and 29530 (Hispaniola,_Haiti,_and_Dominican_Republic) are outside the upper inner fence as well as the upper outer fence.

Indian: LO = 171, FL = 510, M = 844.5, FU = 13916, HI = 226658
STEP = 1.5 * (13916 - 510) = 20109
inner fences: fencelower = FL - STEP = 510 - 20109 = -19599; fenceupper = FU + STEP = 13916 + 20109 = 34025
outer fences: FENCElower = FL - 2 * STEP = 510 - 2 * 20109 = -39708; FENCEupper = FU + 2 * STEP = 13916 + 2 * 20109 = 54134
outside values: 226658 (Madagascar) is outside the upper inner fence as well as the upper outer fence.

Mediterranean: LO = 86, FL = 385.5, M = 1936, FU = 3470.5, HI = 9822
STEP = 1.5 * (3470.5 - 385.5) = 4627.5
inner fences: fencelower = FL - STEP = 385.5 - 4627.5 = -4242; fenceupper = FU + STEP = 3470.5 + 4627.5 = 8098
outer fences: FENCElower = FL - 2 * STEP = 385.5 - 2 * 4627.5 = -8869.5; FENCEupper = FU + 2 * STEP = 3470.5 + 2 * 4627.5 = 12725.5
outside values: 9262 (Sardinia) and 9822 (Sicily) are outside the upper inner fence but still within the upper outer fence.

East_Indies: LO = 2113, FL = 3707, M = 21429.5, FU = 69000, HI = 280100
STEP = 1.5 * (69000 - 3707) = 97939.5
inner fences: fencelower = FL - STEP = 3707 - 97939.5 = -94232.5; fenceupper = FU + STEP = 69000 + 97939.5 = 166939.5
outer fences: FENCElower = FL - 2 * STEP = 3707 - 2 * 97939.5 = -192172; FENCEupper = FU + 2 * STEP = 69000 + 2 * 97939.5 = 264879
outside values: 280100 (Borneo) is outside the upper inner fence as well as the upper outer fence. 165000 (Sumatra) lies just below the upper inner fence of 166939.5, so by these fences it is not an outside value.
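With five oceans, the same fence arithmetic is less error-prone when done for all batches at once. The following base R sketch builds a fence table from the hinges reported by fivenum() above. The object names hinges and fence_table are illustrative assumptions.

```r
# Hinges (FL, FU) copied from the fivenum() output above
hinges <- data.frame(
  Ocean = c("Arctic", "Caribbean", "Indian", "Mediterranean", "East_Indies"),
  FL = c(11221, 124, 510, 385.5, 3707),
  FU = c(31019, 2689.5, 13916, 3470.5, 69000)
)

# Vectorized fence arithmetic: one row of fences per ocean
step <- 1.5 * (hinges$FU - hinges$FL)
fence_table <- data.frame(
  Ocean       = hinges$Ocean,
  STEP        = step,
  inner_lower = hinges$FL - step,
  inner_upper = hinges$FU + step,
  outer_lower = hinges$FL - 2 * step,
  outer_upper = hinges$FU + 2 * step
)
print(fence_table)
```

Comparing a value with its ocean's row shows on which side of each fence it falls. For instance, 165000 is below the East_Indies upper inner fence of 166939.5.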
Parallel Boxplots:
ggplot(island.areas, aes(x = factor(Ocean), y = Area)) +
geom_boxplot() + coord_flip() +
xlab("Ocean") + ylab("Area")
spread-vs-level plot
spread_level_plot(island.areas, Area, Ocean)
## # A tibble: 5 × 5
## Ocean M df log.M log.df
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Arctic 16671 19798 4.22 4.30
## 2 Caribbean 290 2566. 2.46 3.41
## 3 East_Indies 21430. 58302. 4.33 4.77
## 4 Indian 844. 7633 2.93 3.88
## 5 Mediterranean 1936 3085 3.29 3.49
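Power of transformation: with b the slope of the line fitted to the spread-vs-level plot, p = 1 - b = 1 - 1.13 = -0.13, which is approximately -0.1. So this method suggests that we should reexpress the area data by taking the -0.1 power. Because the power is negative, the reexpression reverses the order of the values: the largest islands get the smallest reexpressed values. Now, let us try this method.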
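The order reversal caused by a negative power can be confirmed on the two extreme Arctic areas from the five-number summary above. This is a minimal base R sketch. The order-preserving variant -(x^p) is a common convention shown only for comparison; it is an assumption here and is not what the analysis in this document uses.

```r
p <- -0.1   # rounded power used in the text

# Largest and smallest Arctic areas (HI and LO of the raw five-number summary)
area <- c(largest = 195928, smallest = 2800)

# Plain power, as used in the text: the ordering is reversed
reexpressed <- area^p
print(reexpressed)

# Assumed alternative convention: negate so that larger areas stay larger
order_preserving <- -(area^p)
print(order_preserving)
```

This is why the largest raw area appears as LO, and the smallest as HI, in the reexpressed Arctic summary.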
island.areas %>%
mutate(Reexpressed = Area ^ (-0.1)) ->
island.areas
ggplot(island.areas, aes(Ocean, Reexpressed)) +
geom_boxplot() + coord_flip()
spread_level_plot(island.areas, Reexpressed, Ocean)
## # A tibble: 5 × 5
## Ocean M df log.M log.df
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Arctic 0.378 0.0382 -0.422 -1.42
## 2 Caribbean 0.567 0.161 -0.246 -0.793
## 3 East_Indies 0.371 0.0968 -0.430 -1.01
## 4 Indian 0.510 0.0974 -0.292 -1.01
## 5 Mediterranean 0.469 0.114 -0.329 -0.942
Reexpressed island five number summaries:
View(island.areas)
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Arctic'))
## [1] 0.2956585 0.3558230 0.3782717 0.3940214 0.4521517
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Caribbean'))
## [1] 0.3431153 0.4564570 0.5672313 0.6176714 0.6651427
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Indian'))
## [1] 0.2913821 0.4100382 0.5103464 0.5380831 0.5979989
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Mediterranean'))
## [1] 0.3988228 0.4425692 0.4691476 0.5568152 0.6405458
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'East_Indies'))
## [1] 0.2852783 0.3281823 0.3714344 0.4396405 0.4650611
Arctic: LO = 0.2956585, FL = 0.3558230, M = 0.3782717, FU = 0.3940214, HI = 0.4521517
STEP = 1.5 * (FU - FL) = 0.0572976
inner fences: fencelower = FL - STEP = 0.2985254; fenceupper = FU + STEP = 0.451319
outer fences: FENCElower = FL - 2 * STEP = 0.2412278; FENCEupper = FU + 2 * STEP = 0.5086166
outside values: 0.2956585 (Baffin) is outside the inner lower fence but still within the outer lower fence. 0.4521517 (Wrangle) is outside the inner upper fence but still within the outer upper fence.

Caribbean: LO = 0.3431153, FL = 0.4564570, M = 0.5672313, FU = 0.6176714, HI = 0.6651427
STEP = 1.5 * (FU - FL) = 0.2418216
inner fences: fencelower = FL - STEP = 0.2146354; fenceupper = FU + STEP = 0.859493
outer fences: FENCElower = FL - 2 * STEP = -0.0271862; FENCEupper = FU + 2 * STEP = 1.101315
outside values: none

Indian: LO = 0.2913821, FL = 0.4100382, M = 0.5103464, FU = 0.5380831, HI = 0.5979989
STEP = 1.5 * (FU - FL) = 0.1920674
inner fences: fencelower = FL - STEP = 0.2179708; fenceupper = FU + STEP = 0.7301505
outer fences: FENCElower = FL - 2 * STEP = 0.0259035; FENCEupper = FU + 2 * STEP = 0.9222178
outside values: none

Mediterranean: LO = 0.3988228, FL = 0.4425692, M = 0.4691476, FU = 0.5568152, HI = 0.6405458
STEP = 1.5 * (FU - FL) = 0.171369
inner fences: fencelower = FL - STEP = 0.2712002; fenceupper = FU + STEP = 0.7281842
outer fences: FENCElower = FL - 2 * STEP = 0.0998312; FENCEupper = FU + 2 * STEP = 0.8995532
outside values: none

East_Indies: LO = 0.2852783, FL = 0.3281823, M = 0.3714344, FU = 0.4396405, HI = 0.4650611
STEP = 1.5 * (FU - FL) = 0.1671873
inner fences: fencelower = FL - STEP = 0.160995; fenceupper = FU + STEP = 0.6068278
outer fences: FENCElower = FL - 2 * STEP = -0.0061923; FENCEupper = FU + 2 * STEP = 0.7740151
outside values: none
After reexpressing the data, at first glance, this display looks perhaps a little better than our original picture. Looking at the boxplots, we can see there are fewer outside values in the reexpressed dataset. Although there is an improvement, the spreads of the batches are still not equal and there still seems to be a dependence between level and spread. Since the -0.1 power did not fully remove this dependence, we could try other reexpressions from the ladder of powers, such as the log transformation (p = 0) or a different power.
The most obvious difference between the oceans is the spread. Looking at the reexpressed boxplots, we can see that oceans with a lower median reexpressed area, like the Arctic, also have a smaller spread. The opposite is also true: oceans with a higher median reexpressed area, like the Caribbean, have a larger spread. This implies that there is still a dependence between level and spread, and the spread-vs-level plot for the reexpressed data confirms it. Two unusual values in the reexpressed dataset are 0.2956585 (Baffin) and 0.4521517 (Wrangle), both in the Arctic. Moreover, comparing the spreads (dFs) side by side, we see that the largest spread in the original dataset is 58302.25/2565.50 = 22.73 times the smallest. For the reexpressed data, the spreads range from 0.03819838 to 0.16121441, so the ratio is 0.16121441/0.03819838 = 4.220. This is a clear improvement, but it is still not ideal.
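The spread comparison for the islands can be checked from the df columns of the two spread-vs-level tables, together with which ocean has the largest and smallest spread in each case. This is a minimal base R sketch with illustrative variable names. The raw Caribbean and East_Indies spreads are entered as the unrounded values quoted in the text, and the reexpressed spreads as the rounded table entries.

```r
# Fourth-spreads (df column) from the spread-vs-level tables above
df_raw <- c(Arctic = 19798, Caribbean = 2565.5, East_Indies = 58302.25,
            Indian = 7633, Mediterranean = 3085)
df_reexpressed <- c(Arctic = 0.0382, Caribbean = 0.161, East_Indies = 0.0968,
                    Indian = 0.0974, Mediterranean = 0.114)

# Ratio of largest to smallest spread, before and after reexpression
ratios <- c(raw = max(df_raw) / min(df_raw),
            reexpressed = max(df_reexpressed) / min(df_reexpressed))
print(round(ratios, 2))

# Which ocean has the extreme spreads in each case
extremes <- c(raw_max = names(which.max(df_raw)),
              raw_min = names(which.min(df_raw)),
              reexpressed_max = names(which.max(df_reexpressed)),
              reexpressed_min = names(which.min(df_reexpressed)))
print(extremes)
```

The Caribbean has the smallest spread on the raw scale and the largest on the reexpressed scale, which is consistent with the remaining dependence between level and spread.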