For each of the two datasets below

Dataset 1

1979 salaries (in hundreds of Swiss francs) of 7 different professions in 6 cities

Datafile: salaries in LearnEDA package

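library(LearnEDAfunctions)  # salaries, island.areas, spread_level_plot()
library(dplyr)              # mutate() and %>% used below
library(ggplot2)            # ggplot() and geom_boxplot() used below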
head(salaries)
##   Salary        City Profession
## 1    341   Amsterdam    Teacher
## 2    110      Athens    Teacher
## 3     31     Bangkok    Teacher
## 4    116   Hong_Kong    Teacher
## 5    326 Los_Angeles    Teacher
## 6     89   Singapore    Teacher

Five number summaries:

View(salaries)

fivenum(subset(salaries$Salary, salaries$City == 'Amsterdam'))
## [1] 266 310 341 424 593
fivenum(subset(salaries$Salary, salaries$City == 'Athens'))
## [1] 106.0 117.5 161.0 192.0 320.0
fivenum(subset(salaries$Salary, salaries$City == 'Bangkok'))
## [1]  31.0  34.5  37.0 101.5 148.0
fivenum(subset(salaries$Salary, salaries$City == 'Hong_Kong'))
## [1]  59.0  96.0 116.0 159.5 203.0
fivenum(subset(salaries$Salary, salaries$City == 'Los_Angeles'))
## [1] 179.0 308.5 326.0 412.0 593.0
fivenum(subset(salaries$Salary, salaries$City == 'Singapore'))
## [1]  43.0  67.5  89.0  97.5 250.0

Amsterdam: LO = 266, FL = 310, M = 341, FU = 424, HI = 593, STEP = 1.5 * (424 - 310) = 171 inner fences: fencelower = FL - STEP = 310 - 171 = 139 fenceupper = FU + STEP = 424 + 171 = 595 outer fences: FENCElower = FL - 2 * STEP = 310 - 2 * 171 = -32 FENCEupper = FU + 2 * STEP = 424 + 2 * 171 = 766 outside values: none

Athens: LO = 106, FL = 117.5, M = 161, FU = 192, HI = 320, STEP = 1.5 * (192 - 117.5) = 111.75 inner fences: fencelower = FL - STEP = 117.5 - 111.75 = 5.75 fenceupper = FU + STEP = 192 + 111.75 = 303.75 outer fences: FENCElower = FL - 2 * STEP = 117.5 - 2 * 111.75 = -106 FENCEupper = FU + 2 * STEP = 192 + 2 * 111.75 = 415.5 outside values: One value that is outside of the inner upper fence but still within the outer upper fence is 320 (manager).

Bangkok: LO = 31, FL = 34.5, M = 37, FU = 101.5, HI = 148, STEP = 1.5 * (101.5 - 34.5) = 100.5 inner fences: fencelower = FL - STEP = 34.5 - 100.5 = -66 fenceupper = FU + STEP = 101.5 + 100.5 = 202 outer fences: FENCElower = FL - 2 * STEP = 34.5 - 2 * 100.5 = -166.5 FENCEupper = FU + 2 * STEP = 101.5 + 2 * 100.5 = 302.5 outside values: none

Hong_Kong: LO = 59, FL = 96, M = 116, FU = 159.5, HI = 203, STEP = 1.5 * (159.5 - 96) = 95.25 inner fences: fencelower = FL - STEP = 96 - 95.25 = 0.75 fenceupper = FU + STEP = 159.5 + 95.25 = 254.75 outer fences: FENCElower = FL - 2 * STEP = 96 - 2 * 95.25 = -94.5 FENCEupper = FU + 2 * STEP = 159.5 + 2 * 95.25 = 350 outside values: none

Los_Angeles: LO = 179, FL = 308.5, M = 326, FU = 412, HI = 593, STEP = 1.5 * (412 - 308.5) = 155.25 inner fences: fencelower = FL - STEP = 308.5 - 155.25 = 153.25 fenceupper = FU + STEP = 412 + 155.25 = 567.25 outer fences: FENCElower = FL - 2 * STEP = 308.5 - 2 * 155.25 = -2 FENCEupper = FU + 2 * STEP = 412 + 2 * 155.25 = 722.5 outside values: One value that is outside of the inner upper fence but still within the outer upper fence is 593 (manager).

Singapore: LO = 43, FL = 67.5, M = 89, FU = 97.5, HI = 250, STEP = 1.5 * (97.5 - 67.5) = 45 inner fences: fencelower = FL - STEP = 67.5 - 45 = 22.5 fenceupper = FU + STEP = 97.5 + 45 = 142.5 outer fences: FENCElower = FL - 2 * STEP = 67.5 - 2 * 45 = -22.5 FENCEupper = FU + 2 * STEP = 97.5 + 2 * 45 = 187.5 outside values: One value that is outside of the inner upper fence and also outside the outer upper fence is 250 (manager).
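The fence arithmetic above follows the same recipe for every city, so it can be wrapped in a small helper. The sketch below is only an illustration: `fences_from_fivenum()` is a hypothetical name, not a function from LearnEDAfunctions, and it assumes its input is a five-number summary in the order printed by `fivenum()`.

```r
# Hypothetical helper (not part of LearnEDAfunctions): compute the step and
# the inner/outer fences from a five-number summary c(LO, FL, M, FU, HI).
fences_from_fivenum <- function(five) {
  stopifnot(is.numeric(five), length(five) == 5)
  five <- unname(five)
  FL <- five[2]
  FU <- five[4]
  step <- 1.5 * (FU - FL)  # STEP = 1.5 * fourth-spread
  c(step        = step,
    inner_lower = FL - step,
    inner_upper = FU + step,
    outer_lower = FL - 2 * step,
    outer_upper = FU + 2 * step)
}

# Singapore, using the fivenum() output shown above
singapore <- fences_from_fivenum(c(43, 67.5, 89, 97.5, 250))
print(singapore)
```

With the real data, the same helper could be called as `fences_from_fivenum(fivenum(x))` for any batch `x`.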

Parallel Boxplots:

ggplot(salaries, aes(x = factor(City), y = Salary)) +
  geom_boxplot() + coord_flip() +
  xlab("City") + ylab("Salary")

spread-vs-level plot

spread_level_plot(salaries, Salary, City)

## # A tibble: 6 × 5
##   City            M    df log.M log.df
##   <fct>       <int> <dbl> <dbl>  <dbl>
## 1 Amsterdam     341 114    2.53   2.06
## 2 Athens        161  74.5  2.21   1.87
## 3 Bangkok        37  67    1.57   1.83
## 4 Hong_Kong     116  63.5  2.06   1.80
## 5 Los_Angeles   326 104.   2.51   2.01
## 6 Singapore      89  30    1.95   1.48

Power of transformation: p = 1 − b = 1 − 0.85 = 0.15, which is approximately 0.2. So this method suggests that we should reexpress the salary data by taking the 0.2 power. Now, let us try this method.

salaries %>%
  mutate(Reexpressed = Salary ^ (0.2)) ->
  salaries
ggplot(salaries, aes(City, Reexpressed)) +
  geom_boxplot() + coord_flip()
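To make the reexpression step concrete without the full data frame, here is a minimal sketch of a power-family helper. The name `reexpress()` is hypothetical, and reading p = 0 as the base-10 log is an assumption (the usual ladder-of-powers convention), not something the code above does. The values are the six teacher salaries from the `head(salaries)` output.

```r
# Hypothetical helper: power-family reexpression, with p = 0 read as log10
# (assumed ladder-of-powers convention). Requires positive data.
reexpress <- function(x, p) {
  stopifnot(is.numeric(x), all(x > 0), length(p) == 1)
  if (p == 0) log10(x) else x^p
}

# Teacher salaries from the head(salaries) output above
teacher <- c(Amsterdam = 341, Athens = 110, Bangkok = 31,
             Hong_Kong = 116, Los_Angeles = 326, Singapore = 89)

print(round(reexpress(teacher, 0.2), 3))  # the power used in the text
print(round(reexpress(teacher, 0), 3))    # one rung further down the ladder: log
```

Trying a neighbouring power such as the log is a cheap way to see whether the spreads become more alike than they do under the 0.2 power.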

reexpressed Five number summaries:

View(salaries)

fivenum(subset(salaries$Reexpressed, salaries$City == 'Amsterdam'))
## [1] 3.054755 3.149345 3.210339 3.349086 3.586005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Athens'))
## [1] 2.541331 2.593378 2.762900 2.861882 3.169786
fivenum(subset(salaries$Reexpressed, salaries$City == 'Bangkok'))
## [1] 1.987341 2.030283 2.058924 2.508322 2.716767
fivenum(subset(salaries$Reexpressed, salaries$City == 'Hong_Kong'))
## [1] 2.260322 2.483523 2.587567 2.756374 2.894005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Los_Angeles'))
## [1] 2.822088 3.146112 3.181585 3.330311 3.586005
fivenum(subset(salaries$Reexpressed, salaries$City == 'Singapore'))
## [1] 2.121747 2.311973 2.454019 2.498562 3.017088

Amsterdam: LO = 3.054755, FL = 3.149345 , M = 3.210339 , FU = 3.349086 , HI = 3.586005, STEP = 1.5 * (FU - FL) = 0.2996115 inner fences: fencelower = FL - STEP = 2.849734 fenceupper = FU + STEP = 3.648697 outer fences: FENCElower = FL - 2 * STEP = 2.550122 FENCEupper = FU + 2 * STEP = 3.948309 outliers: none

Athens: LO = 2.541331, FL = 2.593378 , M = 2.762900 , FU = 2.861882 , HI = 3.169786, STEP = 1.5 * (FU - FL) = 0.402756 inner fences: fencelower = FL - STEP = 2.190622 fenceupper = FU + STEP = 3.264638 outer fences: FENCElower = FL - 2 * STEP = 1.787866 FENCEupper = FU + 2 * STEP = 3.667394 outliers: none

Bangkok: LO = 1.987341 , FL = 2.030283 , M = 2.058924 , FU = 2.508322 , HI = 2.716767, STEP = 1.5 * (FU - FL) = 0.7170585 inner fences: fencelower = FL - STEP = 1.313224 fenceupper = FU + STEP = 3.225381 outer fences: FENCElower = FL - 2 * STEP = 0.596166 FENCEupper = FU + 2 * STEP = 3.942439 outliers: none

Hong_Kong: LO = 2.260322 , FL = 2.483523 , M = 2.587567 , FU = 2.756374 , HI = 2.894005, STEP = 1.5 * (FU - FL) = 0.4092765 inner fences: fencelower = FL - STEP = 2.074246 fenceupper = FU + STEP = 3.165651 outer fences: FENCElower = FL - 2 * STEP = 1.66497 FENCEupper = FU + 2 * STEP = 3.574927 outliers: none

Los_Angeles: LO = 2.822088 , FL = 3.146112 , M = 3.181585 , FU = 3.330311 , HI = 3.586005, STEP = 1.5 * (FU - FL) = 0.2762985 inner fences: fencelower = FL - STEP = 2.869814 fenceupper = FU + STEP = 3.60661 outer fences: FENCElower = FL - 2 * STEP = 2.593515 FENCEupper = FU + 2 * STEP = 3.882908 outliers: 2.822088 (cashier) is outside the inner lower fence but still within the outer lower fence

Singapore: LO = 2.121747 , FL = 2.311973 , M = 2.454019 , FU = 2.498562 , HI = 3.017088, STEP = 1.5 * (FU - FL) = 0.2798835 inner fences: fencelower = FL - STEP = 2.032089 fenceupper = FU + STEP = 2.778446 outer fences: FENCElower = FL - 2 * STEP = 1.752206 FENCEupper = FU + 2 * STEP = 3.058329
outliers: 3.017088 (manager) is outside the inner upper fence (2.778446) but still within the outer upper fence (3.058329)
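Comparing each extreme against four fences by eye is error-prone, so a small check can help. The sketch below uses a hypothetical helper, `classify_value()`, which labels a value as "inside", "outside" (beyond an inner fence only), or "far out" (beyond an outer fence). The fourths are copied from the `fivenum()` output above.

```r
# Hypothetical helper: classify values against the inner and outer fences
# implied by the fourths FL and FU.
classify_value <- function(x, FL, FU) {
  step <- 1.5 * (FU - FL)
  beyond_inner <- x < FL - step | x > FU + step
  beyond_outer <- x < FL - 2 * step | x > FU + 2 * step
  ifelse(beyond_outer, "far out", ifelse(beyond_inner, "outside", "inside"))
}

# Reexpressed extremes, with the fourths from the fivenum() output above
la_low  <- classify_value(2.822088, FL = 3.146112, FU = 3.330311)  # Los Angeles LO
sg_high <- classify_value(3.017088, FL = 2.311973, FU = 2.498562)  # Singapore HI
print(c(Los_Angeles = la_low, Singapore = sg_high))

# For comparison: Singapore's manager salary on the original scale
sg_raw <- classify_value(250, FL = 67.5, FU = 97.5)
print(sg_raw)
```

On the reexpressed scale both values are beyond an inner fence only, whereas Singapore's 250 on the original scale is beyond the outer upper fence.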

spread_level_plot(salaries, Reexpressed, City)

## # A tibble: 6 × 5
##   City            M    df log.M log.df
##   <fct>       <dbl> <dbl> <dbl>  <dbl>
## 1 Amsterdam    3.21 0.200 0.507 -0.700
## 2 Athens       2.76 0.269 0.441 -0.571
## 3 Bangkok      2.06 0.478 0.314 -0.321
## 4 Hong_Kong    2.59 0.273 0.413 -0.564
## 5 Los_Angeles  3.18 0.184 0.503 -0.735
## 6 Singapore    2.45 0.187 0.390 -0.729

After reexpressing the data, the display looks perhaps a little better than our original picture. There are still outliers, the spreads of the batches are still not equal, and there still seems to be a dependence between level and spread. Since this reexpression did not fully remove the dependence, we could try other members of the power family (moving along the ladder of powers), such as the log transformation.

Moreover, comparing the spreads (dFs) side by side, we see that the largest spread in the original dataset is 114/30 = 3.8 times the smallest. In the reexpressed data, the spreads range from 0.1841987 to 0.4780390, so the ratio is 0.4780390/0.1841987 = 2.595. There also still seems to be a dependence between level and spread. We can check this with the spread-vs-level plot for the reexpressed data, which shows that there is now a negative dependence between level and spread. The most obvious difference between the cities is the spread. In the original dataset, we can clearly see that cities with a high median salary also have a larger spread. For example, Amsterdam has the highest median and also the largest spread of all the cities. On the other hand, Bangkok, which has the smallest median, also has a comparatively small spread. Some unusual values in the reexpressed dataset are 2.822088 (cashier) in Los Angeles and 3.017088 (manager) in Singapore.
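The ratio of the largest to the smallest spread can be computed directly from the fourths printed by `fivenum()`. The following sketch hard-codes those fourths so it runs without the package; `spread_ratio()` is a hypothetical helper name.

```r
# Fourths (FL, FU) per city, copied from the fivenum() outputs above.
# Order: Amsterdam, Athens, Bangkok, Hong_Kong, Los_Angeles, Singapore
raw_FL <- c(310, 117.5, 34.5, 96, 308.5, 67.5)
raw_FU <- c(424, 192, 101.5, 159.5, 412, 97.5)
re_FL  <- c(3.149345, 2.593378, 2.030283, 2.483523, 3.146112, 2.311973)
re_FU  <- c(3.349086, 2.861882, 2.508322, 2.756374, 3.330311, 2.498562)

# Hypothetical helper: ratio of the largest to the smallest fourth-spread
spread_ratio <- function(FL, FU) {
  dF <- FU - FL
  max(dF) / min(dF)
}

raw_ratio <- spread_ratio(raw_FL, raw_FU)
re_ratio  <- spread_ratio(re_FL, re_FU)
print(round(c(raw = raw_ratio, reexpressed = re_ratio), 3))
```

A ratio closer to 1 means the batches have more nearly equal spreads, so the reexpressed ratio being smaller than the raw one indicates some improvement.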

Dataset 2

Areas of Important Islands by Continent

Datafile: island.areas in LearnEDA package

head(island.areas)
##    Ocean          Name   Area
## 1 Arctic Axel_Heilberg  16671
## 2 Arctic        Baffin 195928
## 3 Arctic         Banks  27038
## 4 Arctic      Bathurst   6194
## 5 Arctic         Devon  21331
## 6 Arctic     Ellesmere  75767
View(island.areas)

fivenum(subset(island.areas$Area, island.areas$Ocean == 'Arctic'))
## [1]   2800  11221  16671  31019 195928
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Caribbean'))
## [1]    59.0   124.0   290.0  2689.5 44218.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Indian'))
## [1]    171.0    510.0    844.5  13916.0 226658.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'Mediterranean'))
## [1]   86.0  385.5 1936.0 3470.5 9822.0
fivenum(subset(island.areas$Area, island.areas$Ocean == 'East_Indies'))
## [1]   2113.0   3707.0  21429.5  69000.0 280100.0

Arctic: LO = 2800 , FL = 11221, M = 16671 , FU = 31019, HI = 195928, STEP = 1.5 * (31019 - 11221) = 29697 inner fences: fencelower = FL - STEP = 11221 - 29697 = -18476 fenceupper = FU + STEP = 31019 + 29697 = 60716 outer fences: FENCElower = FL - 2 * STEP = 11221 - 2 * 29697 = -48173 FENCEupper = FU + 2 * STEP = 31019 + 2 * 29697 = 90413 outside values: 195928 (Baffin) is outside of the upper inner fence as well as the upper outer fence. 83896 (Victoria) and 75767 (Ellesmere) are outside of the inner upper fence but still within the outer upper fence.

Caribbean: LO = 59 , FL = 124, M = 290 , FU = 2689.5, HI = 44218, STEP = 1.5 * (2689.5 - 124) = 3848.25 inner fences: fencelower = FL - STEP = 124 - 3848.25 = -3724.25 fenceupper = FU + STEP = 2689.5 + 3848.25 = 6537.75 outer fences: FENCElower = FL - 2 * STEP = 124 - 2 * 3848.25 = -7572.5 FENCEupper = FU + 2 * STEP = 2689.5 + 2 * 3848.25 = 10386 outside values: 44218 (Cuba) and 29530 (Hispaniola,_Haiti,_and_Dominican_Republic) are outside of the upper inner fence as well as the upper outer fence.

Indian: LO = 171 , FL = 510, M = 844.5 , FU = 13916, HI = 226658, STEP = 1.5 * (13916 - 510) = 20109 inner fences: fencelower = FL - STEP = 510 - 20109 = -19599 fenceupper = FU + STEP = 13916 + 20109 = 34025 outer fences: FENCElower = FL - 2 * STEP = 510 - 2 * 20109 = -39708 FENCEupper = FU + 2 * STEP = 13916 + 2 * 20109 = 54134 outside values: 226658 (Madagascar) is outside of the upper inner fence as well as the upper outer fence.

Mediterranean: LO = 86 , FL = 385.5, M = 1936 , FU = 3470.5, HI = 9822, STEP = 1.5 * (3470.5 - 385.5) = 4627.5 inner fences: fencelower = FL - STEP = 385.5 - 4627.5 = -4242 fenceupper = FU + STEP = 3470.5 + 4627.5 = 8098 outer fences: FENCElower = FL - 2 * STEP = 385.5 - 2 * 4627.5 = -8869.5 FENCEupper = FU + 2 * STEP = 3470.5 + 2 * 4627.5 = 12725.5 outside values: 9262 (Sardinia) and 9822 (Sicily) are outside of the upper inner fence but still within the upper outer fence

East_Indies: LO = 2113 , FL = 3707, M = 21429.5 , FU = 69000, HI = 280100, STEP = 1.5 * (69000 - 3707) = 97939.5 inner fences: fencelower = FL - STEP = 3707 - 97939.5 = -94232.5 fenceupper = FU + STEP = 69000 + 97939.5 = 166939.5 outer fences: FENCElower = FL - 2 * STEP = 3707 - 2 * 97939.5 = -192172 FENCEupper = FU + 2 * STEP = 69000 + 2 * 97939.5 = 264879 outside values: 280100 (Borneo) is outside of the upper inner fence as well as the upper outer fence. 165000 (Sumatra) lies just inside the upper inner fence (166939.5), so by these fences it is not an outside value.

Parallel Boxplots:

ggplot(island.areas, aes(x = factor(Ocean), y = Area)) +
  geom_boxplot() + coord_flip() +
  xlab("Ocean") + ylab("Area")

spread-vs-level plot

spread_level_plot(island.areas, Area, Ocean)

## # A tibble: 5 × 5
##   Ocean              M     df log.M log.df
##   <fct>          <dbl>  <dbl> <dbl>  <dbl>
## 1 Arctic        16671  19798   4.22   4.30
## 2 Caribbean       290   2566.  2.46   3.41
## 3 East_Indies   21430. 58302.  4.33   4.77
## 4 Indian          844.  7633   2.93   3.88
## 5 Mediterranean  1936   3085   3.29   3.49

Power of transformation: p = 1 − b = 1 − 1.13 = -0.13, which is approximately -0.1. So this method suggests that we should reexpress the area data by taking the -0.1 power. Now, let us try this method.

island.areas %>%
  mutate(Reexpressed = Area ^ (-0.1)) ->
  island.areas
ggplot(island.areas, aes(Ocean, Reexpressed)) +
  geom_boxplot() + coord_flip()
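One consequence of a negative power is worth checking before reading the reexpressed boxplots: it reverses the ordering of the data, so the largest island becomes the smallest reexpressed value. The sketch below illustrates this with the six Arctic islands from the `head(island.areas)` output. The sign-flipped variant at the end is a common convention shown here as an assumption; it is not what the code above does.

```r
# Arctic island areas from the head(island.areas) output above
area <- c(Axel_Heilberg = 16671, Baffin = 195928, Banks = 27038,
          Bathurst = 6194, Devon = 21331, Ellesmere = 75767)

re <- area^(-0.1)  # the reexpression used in the text

largest_raw <- names(which.max(area))  # largest island on the raw scale
smallest_re <- names(which.min(re))    # smallest value on the reexpressed scale
print(c(largest_raw = largest_raw, smallest_re = smallest_re))

# Optional convention (an assumption, not used in the text): negate the
# result so that larger areas still map to larger reexpressed values.
re_ordered <- -(area^(-0.1))
print(identical(order(area), order(re_ordered)))
```

This is why an island that sits far above the upper fence on the raw scale can show up near or below the lower fence after the -0.1 power is applied.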

spread_level_plot(island.areas, Reexpressed, Ocean)

## # A tibble: 5 × 5
##   Ocean             M     df  log.M log.df
##   <fct>         <dbl>  <dbl>  <dbl>  <dbl>
## 1 Arctic        0.378 0.0382 -0.422 -1.42 
## 2 Caribbean     0.567 0.161  -0.246 -0.793
## 3 East_Indies   0.371 0.0968 -0.430 -1.01 
## 4 Indian        0.510 0.0974 -0.292 -1.01 
## 5 Mediterranean 0.469 0.114  -0.329 -0.942

reexpressed island Five number summaries:

View(island.areas)

fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Arctic'))
## [1] 0.2956585 0.3558230 0.3782717 0.3940214 0.4521517
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Caribbean'))
## [1] 0.3431153 0.4564570 0.5672313 0.6176714 0.6651427
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Indian'))
## [1] 0.2913821 0.4100382 0.5103464 0.5380831 0.5979989
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'Mediterranean'))
## [1] 0.3988228 0.4425692 0.4691476 0.5568152 0.6405458
fivenum(subset(island.areas$Reexpressed, island.areas$Ocean == 'East_Indies'))
## [1] 0.2852783 0.3281823 0.3714344 0.4396405 0.4650611

Arctic: LO = 0.2956585 , FL = 0.3558230 , M = 0.3782717 , FU = 0.3940214 , HI = 0.4521517, STEP = 1.5 * (FU - FL) = 0.0572976 inner fences: fencelower = FL - STEP = 0.2985254 fenceupper = FU + STEP = 0.451319 outer fences: FENCElower = FL - 2 * STEP = 0.2412278 FENCEupper = FU + 2 * STEP = 0.5086166 outside values: 0.2956585 (Baffin) is outside the inner lower fence but still within the outer lower fence. 0.4521517 (Wrangle) is outside the inner upper fence but still within the outer upper fence.

Caribbean: LO = 0.3431153 , FL = 0.4564570 , M = 0.5672313 , FU = 0.6176714 , HI = 0.6651427, STEP = 1.5 * (FU - FL) = 0.2418216 inner fences: fencelower = FL - STEP = 0.2146354 fenceupper = FU + STEP = 0.859493 outer fences: FENCElower = FL - 2 * STEP = -0.0271862 FENCEupper = FU + 2 * STEP = 1.101315 outside values: none

Indian: LO = 0.2913821 , FL = 0.4100382 , M = 0.5103464 , FU = 0.5380831 , HI = 0.5979989, STEP = 1.5 * (FU - FL) = 0.1920674 inner fences: fencelower = FL - STEP = 0.2179708 fenceupper = FU + STEP = 0.7301505 outer fences: FENCElower = FL - 2 * STEP = 0.0259035 FENCEupper = FU + 2 * STEP = 0.9222178 outside values: none

Mediterranean: LO = 0.3988228 , FL = 0.4425692 , M = 0.4691476 , FU = 0.5568152 , HI = 0.6405458, STEP = 1.5 * (FU - FL) = 0.171369 inner fences: fencelower = FL - STEP = 0.2712002 fenceupper = FU + STEP = 0.7281842 outer fences: FENCElower = FL - 2 * STEP = 0.0998312 FENCEupper = FU + 2 * STEP = 0.8995532 outside values: none

East_Indies: LO = 0.2852783 , FL = 0.3281823 , M = 0.3714344 , FU = 0.4396405 , HI = 0.4650611, STEP = 1.5 * (FU - FL) = 0.1671873 inner fences: fencelower = FL - STEP = 0.160995 fenceupper = FU + STEP = 0.6068278 outer fences: FENCElower = FL - 2 * STEP = -0.0061923 FENCEupper = FU + 2 * STEP = 0.7740151 outside values: none

After reexpressing the data, the display looks, at first glance, perhaps a little better than our original picture. Looking at the boxplots, we can see that there are fewer outliers in the reexpressed dataset. Although there is an improvement, the spreads of the batches are still not equal and there still seems to be a dependence between level and spread. Since this reexpression did not fully remove the dependence, we could try other members of the power family (moving along the ladder of powers), such as the log transformation.

The most obvious difference between the oceans is the spread. Looking at the reexpressed boxplots, we can see that oceans with a lower median reexpressed area, like the Arctic, also have a smaller spread. The opposite also holds: oceans with a higher median reexpressed area, like the Caribbean, also have a larger spread. This suggests that there is still a dependence between level and spread, which the spread-vs-level plot for the reexpressed data confirms. Some unusual values in the reexpressed dataset are 0.2956585 (Baffin) and 0.4521517 (Wrangle), both in the Arctic. Moreover, comparing the spreads (dFs) side by side, we see that the largest spread in the original dataset is 58302.25/2565.50 = 22.73 times the smallest. In the reexpressed data, the spreads range from 0.03819838 to 0.16121441, so the ratio is 0.16121441/0.03819838 = 4.220, which is an improvement but still not ideal.