Assignment 3: Comparing Batches

For each of the two datasets below

Find 5-number summaries, fences, and outside values for each group (CITY).
Construct parallel boxplots.
Using a spread-vs-level plot, determine the power of a transformation that you believe will stabilize spread.
Using the transformation, reanalyze the data by computing new 5-number summaries (and fence and outside values) and parallel boxplots. Was the transformation successful in stabilizing spread? If not, which alternative transformation might you try next?
Discuss the differences you found between the groups and unusual data values (BE SPECIFIC AND MAKE COMPARISONS).

Dataset 1

1979 salaries (in hundreds of Swiss francs) of 7 different professions in 6 cities

Datafile: salaries in LearnEDA package

salaries <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/salaries.txt")
head(salaries)

##   Salary        City Profession
## 1    341   Amsterdam    Teacher
## 2    110      Athens    Teacher
## 3     31     Bangkok    Teacher
## 4    116   Hong_Kong    Teacher
## 5    326 Los_Angeles    Teacher
## 6     89   Singapore    Teacher

fivenum(salaries$Salary)

## [1]  31.0  89.0 154.5 320.0 593.0

Ams<-subset.data.frame(salaries,City=="Amsterdam")
fivenum(Ams$Salary)

## [1] 266 310 341 424 593

lval(Ams$Salary)

##   depth  lo  hi  mids spreads
## M   4.0 341 341 341.0       0
## H   2.5 310 424 367.0     114
## E   1.0 266 593 429.5     327

lval_plus(Ams,Ams$Salary)

##    Salary      City Profession Fence_LO Fence_HI   OUT
## 1     341 Amsterdam    Teacher      139      595 FALSE
## 7     298 Amsterdam   Chauffer      139      595 FALSE
## 13    266 Amsterdam   Mechanic      139      595 FALSE
## 19    371 Amsterdam       Cook      139      595 FALSE
## 25    593 Amsterdam    Manager      139      595 FALSE
## 31    477 Amsterdam   Engineer      139      595 FALSE
## 37    322 Amsterdam    Cashier      139      595 FALSE

City = Amsterdam

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 341 \|
F 2.5\| 310 \| 424
STEP = 171
FENCES = 139, 595
OUTLIERS: NA

Ath<-subset.data.frame(salaries,City=="Athens")
fivenum(Ath$Salary)

## [1] 106.0 117.5 161.0 192.0 320.0

lval(Ath$Salary)

##   depth    lo  hi   mids spreads
## M   4.0 161.0 161 161.00     0.0
## H   2.5 117.5 192 154.75    74.5
## E   1.0 106.0 320 213.00   214.0

lval_plus(Ath,Ath$Salary)

##    Salary   City Profession Fence_LO Fence_HI   OUT
## 2     110 Athens    Teacher     5.75   303.75 FALSE
## 8     106 Athens   Chauffer     5.75   303.75 FALSE
## 14    125 Athens   Mechanic     5.75   303.75 FALSE
## 20    195 Athens       Cook     5.75   303.75 FALSE
## 26    320 Athens    Manager     5.75   303.75  TRUE
## 32    189 Athens   Engineer     5.75   303.75 FALSE
## 38    161 Athens    Cashier     5.75   303.75 FALSE

City = Athens

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 161 \|
F 2.5\| 117.5 \| 192
STEP = 111.75
FENCES = 5.75, 303.75
OUTLIERS: 320

Ban<-subset.data.frame(salaries,City=="Bangkok")
fivenum(Ban$Salary)

## [1]  31.0  34.5  37.0 101.5 148.0

lval(Ban$Salary)

##   depth   lo    hi mids spreads
## M   4.0 37.0  37.0 37.0       0
## H   2.5 34.5 101.5 68.0      67
## E   1.0 31.0 148.0 89.5     117

lval_plus(Ban,Ban$Salary)

##    Salary    City Profession Fence_LO Fence_HI   OUT
## 3      31 Bangkok    Teacher      -66      202 FALSE
## 9      34 Bangkok   Chauffer      -66      202 FALSE
## 15     35 Bangkok   Mechanic      -66      202 FALSE
## 21    125 Bangkok       Cook      -66      202 FALSE
## 27    148 Bangkok    Manager      -66      202 FALSE
## 33     78 Bangkok   Engineer      -66      202 FALSE
## 39     37 Bangkok    Cashier      -66      202 FALSE

City = Bangkok

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 37 \|
F 2.5\| 34.5 \| 101.5
STEP = 100.5
FENCES = -66, 202
OUTLIERS: NA

Hon<-subset.data.frame(salaries,City=="Hong_Kong")
fivenum(Hon$Salary)

## [1]  59.0  96.0 116.0 159.5 203.0

lval(Hon$Salary)

##   depth  lo    hi   mids spreads
## M   4.0 116 116.0 116.00     0.0
## H   2.5  96 159.5 127.75    63.5
## E   1.0  59 203.0 131.00   144.0

lval_plus(Hon,Hon$Salary)

##    Salary      City Profession Fence_LO Fence_HI   OUT
## 4     116 Hong_Kong    Teacher     0.75   254.75 FALSE
## 10     77 Hong_Kong   Chauffer     0.75   254.75 FALSE
## 16     59 Hong_Kong   Mechanic     0.75   254.75 FALSE
## 22    147 Hong_Kong       Cook     0.75   254.75 FALSE
## 28    203 Hong_Kong    Manager     0.75   254.75 FALSE
## 34    172 Hong_Kong   Engineer     0.75   254.75 FALSE
## 40    115 Hong_Kong    Cashier     0.75   254.75 FALSE

City = Hongkong

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 116 \|
F 2.5\| 96 \| 159.5
STEP = 95.25
FENCES = 0.75, 254.75
OUTLIERS: NA

Los<-subset.data.frame(salaries,City=="Los_Angeles")
fivenum(Los$Salary)

## [1] 179.0 308.5 326.0 412.0 593.0

lval(Los$Salary)

##   depth    lo  hi   mids spreads
## M   4.0 326.0 326 326.00     0.0
## H   2.5 308.5 412 360.25   103.5
## E   1.0 179.0 593 386.00   414.0

lval_plus(Los,Los$Salary)

##    Salary        City Profession Fence_LO Fence_HI   OUT
## 5     326 Los_Angeles    Teacher   153.25   567.25 FALSE
## 11    294 Los_Angeles   Chauffer   153.25   567.25 FALSE
## 17    363 Los_Angeles   Mechanic   153.25   567.25 FALSE
## 23    323 Los_Angeles       Cook   153.25   567.25 FALSE
## 29    593 Los_Angeles    Manager   153.25   567.25  TRUE
## 35    461 Los_Angeles   Engineer   153.25   567.25 FALSE
## 41    179 Los_Angeles    Cashier   153.25   567.25 FALSE

City = Los_Angeles

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 326 \|
F 2.5\| 308.5 \| 412
STEP = 155.25
FENCES = 153.25, 567.25
OUTLIERS: 593

Sin<-subset.data.frame(salaries,City=="Singapore")
fivenum(Sin$Salary)

## [1]  43.0  67.5  89.0  97.5 250.0

lval(Sin$Salary)

##   depth   lo    hi  mids spreads
## M   4.0 89.0  89.0  89.0       0
## H   2.5 67.5  97.5  82.5      30
## E   1.0 43.0 250.0 146.5     207

lval_plus(Sin,Sin$Salary)

##    Salary      City Profession Fence_LO Fence_HI   OUT
## 6      89 Singapore    Teacher     22.5    142.5 FALSE
## 12     43 Singapore   Chauffer     22.5    142.5 FALSE
## 18     52 Singapore   Mechanic     22.5    142.5 FALSE
## 24    103 Singapore       Cook     22.5    142.5 FALSE
## 30    250 Singapore    Manager     22.5    142.5  TRUE
## 36     83 Singapore   Engineer     22.5    142.5 FALSE
## 42     92 Singapore    Cashier     22.5    142.5 FALSE

City = Singapore

Depth \| Lower \| Upper
N= 7 \|
M 4 \| 89 \|
F 2.5\| 67.5 \| 97.5
STEP = 45
FENCES = 22.5, 142.5
OUTLIERS: 250

Sal<-salaries[-43,-3]
unstacked.data <- unstack(Sal)
stacked<-stack(unstacked.data)
boxplot(values ~ ind, data = stacked,
        horizontal = TRUE,
        ylim = c(0, 600),
        xlab = "Salary", ylab = "City")

Here are parallel boxplots of the salaries on the six cities. It is clear to see most of the cities are strongly skewed in distribution. We see differences in the average salaries and 3 outliers in this graph (listed above). Amsterdam and Los Angeles tend to have obviously higher salaries than other four cities. In addition, we also see differences in the spreads of the batches. We see a tendency for the batches of higher salaries to have larger spreads and so we try our spread vs. level plot to suggest a possible reexpression of the data.

spreadLevelPlot(Sal$Salary, Sal$City)

##             LowerHinge Median UpperHinge Hinge-Spread
## Bangkok           34.5     37      101.5         67.0
## Singapore         67.5     89       97.5         30.0
## Hong_Kong         96.0    116      159.5         63.5
## Athens           117.5    161      192.0         74.5
## Los_Angeles      308.5    326      412.0        103.5
## Amsterdam        310.0    341      424.0        114.0
## 
## Suggested power transformation:  0.7161117

Clearly there is a positive association in the graph, indicating that batches with small medians tend also to have small dfs (spreads) except for Bangkok. Also the suggested power transformation is 0.7161117. We can correct the dependence between spread and level by reexpressing the data to a different scale.

reexpressed.data <- data.frame(unstacked.data[,c(1:6)] ^ (0.7161117))
boxplot(reexpressed.data,
        horizontal = TRUE, ylim = c(0, 100),
        xlab = "TRANSFORMED Salary", ylab = "City")

stack.re<-stack(reexpressed.data)
spreadLevelPlot(stack.re$values, stack.re$ind)

##             LowerHinge   Median UpperHinge Hinge-Spread
## Bangkok       12.62528 13.27418   27.19249    14.567211
## Singapore     20.30584 24.88748   26.55883     6.252989
## Hong_Kong     26.16845 30.08725   37.77036    11.601904
## Athens        30.35277 38.04817   43.16065    12.807882
## Los_Angeles   60.60203 63.05863   74.46144    13.859413
## Amsterdam     60.81729 65.12310   75.99665    15.179359
## 
## Suggested power transformation:  0.9329465

After transformation, actually this display looks much better than our original picture. However, there are still right skewness in four batches and we can’t help but notice high outliers in Athens disappears.

In addition, there is one improvement – the spreads of the right skewness of each batch are roughly equal and there isn’t much difference in the two higher batches in spreads than before, which means we may remove the dependence between level and spread.

We checked out this point by performing a spread vs. level plot for the reexpressed data, which shows much more stable between spread and level. Also, the suggested power transformation is 0.9329465(close to 1), which is a considerable balance between level and spread. If a more accurate result is needed, we can choose rescale the reexpressed data with a suggested power 0.9329465 in the next step.

I see a slight improvement using this reexpression. The spreads of the raw data range from 30 to 114 – the largest spread is 114/30 = 3.8 times the smallest. Looking at the reexpressed data, the spreads range from 6.252989 to 15.179359 – the ratio is around 2.43. Actually, this is not a big improvement – it probably doesn’t make much sense in this case to reexpress the times.

City = Bangkok after transformation

Depth \| Lower \| Upper
M 5.5 \| 13.27418 \|
F 3.0 \| 12.62528 \| 27.19249
STEP = 21.8508
FENCES = -9.22552, 49.04329
OUTLIER: NA

City = Singapore after transformation

Depth \| Lower \| Upper
M 5.5 \| 24.88748 \|
F 3.0 \| 20.30584 \| 26.55883
STEP = 9.37935
FENCES = 10.92649, 35.93818
OUTLIERS: 52.14263

City = Hong_Kong after transformation

Depth \| Lower \| Upper
M 5.5 \| 30.08725 \|
F 3.0 \| 26.16845 \| 37.77036
STEP = 17.40285
FENCES = 8.7656, 55.17321
OUTLIERS: NA

City = Athens after transformation

Depth \| Lower \| Upper
M 5.5 \| 38.04817 \|
F 3.0 \| 30.35277 \| 43.16065
STEP = 19.2117
FENCES = 11.14107, 62.37235
OUTLIERS: NA

City = Los_Angeles after transformation

Depth \| Lower \| Upper
M 5.5 \| 63.05863 \|
F 3.0 \| 60.60203 \| 74.46144
STEP = 20.7891
FENCES = 39.81293, 95.25054
OUTLIERS: 96.78697

City = Amsterdam after transformation

Depth \| Lower \| Upper
M 5.5 \| 65.12310 \|
F 3.0 \| 60.81729 \| 75.99665
STEP = 22.76895
FENCES = 38.04834, 98.7656
OUTLIERS: NA
## Dataset 2

Areas of Important Islands by Continent

Datafile: island.areas in LearnEDA package

Compare the island areas of the Artic Ocean, Caribbean Sea, Indian Ocean, Mediterranean Sea and East Indies

island.areas <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/island.areas.txt")
head(island.areas)

##    Ocean          Name   Area
## 1 Arctic Axel_Heilberg  16671
## 2 Arctic        Baffin 195928
## 3 Arctic         Banks  27038
## 4 Arctic      Bathurst   6194
## 5 Arctic         Devon  21331
## 6 Arctic     Ellesmere  75767

summary(island.areas)

##     Ocean               Name                Area       
##  Length:59          Length:59          Min.   :    59  
##  Class :character   Class :character   1st Qu.:   591  
##  Mode  :character   Mode  :character   Median :  3572  
##                                        Mean   : 26324  
##                                        3rd Qu.: 22636  
##                                        Max.   :280100

fivenum(island.areas$Area)

## [1]     59.0    591.0   3572.0  22635.5 280100.0

island<-island.areas[,c(3,1)]
Arc<-subset.data.frame(island,Ocean=="Arctic")
fivenum(Arc$Area)

## [1]   2800  11221  16671  31019 195928

lval(Arc$Area)

##   depth    lo       hi     mids  spreads
## M   8.0 16671  16671.0 16671.00      0.0
## H   4.5 11221  31019.0 21120.00  19798.0
## E   2.5  7097  79831.5 43464.25  72734.5
## D   1.0  2800 195928.0 99364.00 193128.0

lval_plus(Arc,Arc$Area)

##      Area  Ocean Fence_LO Fence_HI   OUT
## 1   16671 Arctic   -18476    60716 FALSE
## 2  195928 Arctic   -18476    60716  TRUE
## 3   27038 Arctic   -18476    60716 FALSE
## 4    6194 Arctic   -18476    60716 FALSE
## 5   21331 Arctic   -18476    60716 FALSE
## 6   75767 Arctic   -18476    60716  TRUE
## 7   16274 Arctic   -18476    60716 FALSE
## 8   12872 Arctic   -18476    60716 FALSE
## 9    9570 Arctic   -18476    60716 FALSE
## 10  15913 Arctic   -18476    60716 FALSE
## 11  83896 Arctic   -18476    60716  TRUE
## 12   8000 Arctic   -18476    60716 FALSE
## 13  35000 Arctic   -18476    60716 FALSE
## 14   2800 Arctic   -18476    60716 FALSE
## 15  23940 Arctic   -18476    60716 FALSE

Ocean = Arctic

Depth \| Lower \| Upper
N= 15 \|
M 8 \| 16671 \|
F 4.5\| 11221 \| 31019
STEP = 29697
FENCES = -18476, 60716
OUTLIERS: 195928,75767,83896

Car<-subset.data.frame(island,Ocean=="Caribbean")
fivenum(Car$Area)

## [1]    59.0   124.0   290.0  2689.5 44218.0

lval(Car$Area)

##   depth    lo      hi     mids spreads
## M   8.0 290.0   290.0   290.00     0.0
## H   4.5 124.0  2689.5  1406.75  2565.5
## E   2.5  91.5 16887.0  8489.25 16795.5
## D   1.0  59.0 44218.0 22138.50 44159.0

lval_plus(Car,Car$Area)

##     Area     Ocean Fence_LO Fence_HI   OUT
## 16   108 Caribbean -3724.25  6537.75 FALSE
## 17    75 Caribbean -3724.25  6537.75 FALSE
## 18   166 Caribbean -3724.25  6537.75 FALSE
## 19 44218 Caribbean -3724.25  6537.75  TRUE
## 20   171 Caribbean -3724.25  6537.75 FALSE
## 21   290 Caribbean -3724.25  6537.75 FALSE
## 22   687 Caribbean -3724.25  6537.75 FALSE
## 23 29530 Caribbean -3724.25  6537.75  TRUE
## 24  4244 Caribbean -3724.25  6537.75 FALSE
## 25   425 Caribbean -3724.25  6537.75 FALSE
## 26  3515 Caribbean -3724.25  6537.75 FALSE
## 27   116 Caribbean -3724.25  6537.75 FALSE
## 28  1864 Caribbean -3724.25  6537.75 FALSE
## 29    59 Caribbean -3724.25  6537.75 FALSE
## 30   132 Caribbean -3724.25  6537.75 FALSE

Ocean = Caribbean

Depth \| Lower \| Upper
N= 15 \|
M 8 \| 290 \|
F 4.5\| 124 \| 2689.5
STEP = 3848.25
FENCES = -3724.25, 6537.75
OUTLIERS: 44218,29530

Ind<-subset.data.frame(island,Ocean=="Indian")
fivenum(Ind$Area)

## [1]    171.0    510.0    844.5  13916.0 226658.0

lval(Ind$Area)

##   depth    lo       hi     mids spreads
## M   4.5 844.5    844.5    844.5       0
## H   2.5 510.0  13916.0   7213.0   13406
## E   1.0 171.0 226658.0 113414.5  226487

lval_plus(Ind,Ind$Area)

##      Area  Ocean Fence_LO Fence_HI   OUT
## 31   2500 Indian -10874.5  19657.5 FALSE
## 32 226658 Indian -10874.5  19657.5  TRUE
## 33    720 Indian -10874.5  19657.5 FALSE
## 34    380 Indian -10874.5  19657.5 FALSE
## 35    969 Indian -10874.5  19657.5 FALSE
## 36    171 Indian -10874.5  19657.5 FALSE
## 37  25332 Indian -10874.5  19657.5  TRUE
## 38    640 Indian -10874.5  19657.5 FALSE

Ocean = Indian

Depth \| Lower \| Upper
N= 8 \|
M 4.5\| 844.5 \|
F 2.5\| 510.0 \| 13916.0
STEP = 20109
FENCES = -19599, 34025
OUTLIERS: 226658

Med<-subset.data.frame(island,Ocean=="Mediterranean")
fivenum(Med$Area)

## [1]   86.0  385.5 1936.0 3470.5 9822.0

lval(Med$Area)

##   depth     lo     hi mids spreads
## M   6.0 1936.0 1936.0 1936       0
## H   3.5  385.5 3470.5 1928    3085
## E   2.0  122.0 9262.0 4692    9140
## D   1.0   86.0 9822.0 4954    9736

lval_plus(Med,Med$Area)

##    Area         Ocean Fence_LO Fence_HI   OUT
## 39 1936 Mediterranean    -4242     8098 FALSE
## 40  229 Mediterranean    -4242     8098 FALSE
## 41 3369 Mediterranean    -4242     8098 FALSE
## 42 3186 Mediterranean    -4242     8098 FALSE
## 43 3572 Mediterranean    -4242     8098 FALSE
## 44   86 Mediterranean    -4242     8098 FALSE
## 45 1409 Mediterranean    -4242     8098 FALSE
## 46  122 Mediterranean    -4242     8098 FALSE
## 47  542 Mediterranean    -4242     8098 FALSE
## 48 9262 Mediterranean    -4242     8098  TRUE
## 49 9822 Mediterranean    -4242     8098  TRUE

Ocean = Mediterranean

Depth \| Lower \| Upper
N= 11 \|
M 6 \| 1936 \|
F 3.5\| 385.5\| 3470.5
STEP = 4627.5
FENCES = -4242, 8098
OUTLIERS: 9262,9822

East<-subset.data.frame(island,Ocean=="East_Indies")
fivenum(East$Area)

## [1]   2113.0   3707.0  21429.5  69000.0 280100.0

lval(East$Area)

##   depth      lo       hi     mids spreads
## M   5.5 21429.5  21429.5  21429.5       0
## H   3.0  3707.0  69000.0  36353.5   65293
## E   2.0  2147.0 165000.0  83573.5  162853
## D   1.0  2113.0 280100.0 141106.5  277987

lval_plus(East,East$Area)

##      Area       Ocean  Fence_LO Fence_HI   OUT
## 50   2147 East_Indies -81780.62 151428.4 FALSE
## 51 280100 East_Indies -81780.62 151428.4  TRUE
## 52  69000 East_Indies -81780.62 151428.4 FALSE
## 53  48900 East_Indies -81780.62 151428.4 FALSE
## 54   2113 East_Indies -81780.62 151428.4 FALSE
## 55  28766 East_Indies -81780.62 151428.4 FALSE
## 56  14093 East_Indies -81780.62 151428.4 FALSE
## 57   3707 East_Indies -81780.62 151428.4 FALSE
## 58 165000 East_Indies -81780.62 151428.4  TRUE
## 59  11570 East_Indies -81780.62 151428.4 FALSE

Ocean = East_Indies

Depth \| Lower \| Upper
N= 15 \|
M 5.5 \| 21429.5 \|
F 3.0 \| 3707.0 \| 69000.0
STEP = 97939.5
FENCES = -94232, 166939.5
OUTLIERS: 280100

boxplot(Area ~ Ocean, data = island,
        horizontal = TRUE,
        xlab = "Area", ylab = "Ocean")

spreadLevelPlot(island$Area, island$Ocean)

##               LowerHinge  Median UpperHinge Hinge-Spread
## Caribbean          124.0   290.0     2689.5       2565.5
## Indian             510.0   844.5    13916.0      13406.0
## Mediterranean      385.5  1936.0     3470.5       3085.0
## Arctic           11221.0 16671.0    31019.0      19798.0
## East_Indies       3707.0 21429.5    69000.0      65293.0
## 
## Suggested power transformation:  0.4155263

Here are parallel boxplots of the areas on the five oceans.

Each batch is strongly right-skewed and there are 9 outliers at the high end.
The batches have different spreads. If we look at the length of the boxes (the fourth spread), we see that the East-Indies data is more spread out than the Arctic and Indian data which are more spread out than the Mediterranean and Caribbean data.
Looking further, we see differences in the average areas and a tendency for the batches of larger areas to have larger spreads. In other words, we can see a dependence between spread and level in the boxplot display above. It is difficult to compare these batches since they have unequal spreads. Let’s do transformation next.

reArea<-island$Area ^ (0.4155263 )
ocean<-island$Ocean
redata<-cbind.data.frame(reArea,ocean)
boxplot(reArea ~ ocean,data = redata,
        horizontal = TRUE, 
        xlab = "TRANSFORMED Area", ylab = "Ocean")

spreadLevelPlot(redata$reArea, redata$ocean)

##               LowerHinge   Median UpperHinge Hinge-Spread
## Caribbean       7.407163 10.54851   26.29973     18.89257
## Indian         13.229608 16.40298   46.70106     33.47145
## Mediterranean  11.620684 23.21644   29.58563     17.96495
## Arctic         48.055475 56.79840   73.36941     25.31393
## East_Indies    30.410443 62.10913  102.48707     72.07662
## 
## Suggested power transformation:  0.5582082

After the first transformation, although this display doesn’t look much better than our original one, we notice that 5 high outliers disappears(2 in Mediterranean, 2 in Arctic and 1 in East_Indies).It does helps the spreads of the skewness of each batch more equally than before. Actually, we still see a strongly positive relationship between level and spread, which indicates that a further reexpression may be needed to remove or, at least, decrease the relationship between level and spread. The suggested power transformation is 0.5582082.

re_Area<-redata$reArea^(0.5582082)
re_data<-cbind.data.frame(re_Area,ocean)
boxplot(re_Area ~ ocean,data = re_data,
        horizontal = TRUE, 
        xlab = "TRANSFORMED Area", ylab = "Ocean")

spreadLevelPlot(re_data$re_Area, re_data$ocean)

##               LowerHinge   Median UpperHinge Hinge-Spread
## Caribbean       3.057794 3.725235   6.190182     3.132388
## Indian          4.221151 4.764049   8.322824     4.101674
## Mediterranean   3.916718 5.786265   6.624628     2.707910
## Arctic          8.680801 9.534212  10.994916     2.314115
## East_Indies     6.727217 9.995057  13.254763     6.527545
## 
## Suggested power transformation:  0.906075

After the second transformation, actually this display looks really better than our original picture. There are still three right skewness in five batches and we can’t help but notice there are only 3 high outliers (original 9 outliers), one more disappears at this transformation.

In addition, there is another improvement –the spreads of the batches are more similar. The spreads of the skewness of each batch are roughly equal this time which means we have removed the dependence between level and spread.

I see an exciting improvement using these reexpressions. The spreads of the raw data range from 2565.5 to 65293 – the largest spread is 65293/2565.5 = 25.45 times the smallest. Looking at the last reexpressed data, the spreads range from 2.314115 to 6.527545 – the ratio is around 2.8. Actually, this is a really big improvement – it does make much sense in this case to reexpress the times.

Ocean = East_Indies after transformation

Depth \| Lower \| Upper
M 5.5 \| 9.995057 \|
F 3.0 \| 6.727217 \| 13.254763
STEP = 9.791317
FENCES = -3.0641, 23.04608
OUTLIERS: NA

Ocean = Arctic after transformation

Depth \| Lower \| Upper
M 5.5 \| 9.534212 \|
F 3.0 \| 8.680801 \| 10.994916
STEP = 3.471173
FENCES = 5.209628, 14.46609
OUTLIERS: 16.885067

Ocean = Indian after transformation

Depth \| Lower \| Upper
M 5.5 \| 4.764049 \|
F 3.0 \| 4.221151 \| 8.322824
STEP = 6.152511
FENCES = -1.93136, 14.47534
OUTLIERS: 17.465432

Ocean = Caribbean after transformation

Depth \| Lower \| Upper
M 5.5 \| 3.725235 \|
F 3.0 \| 3.057794 \| 6.190182
STEP = 4.698582
FENCES = -1.640788, 10.88876
OUTLIERS: 11.954945

Ocean = Mediterranean after transformation

Depth \| Lower \| Upper
M 5.5 \| 5.786265 \|
F 3.0 \| 3.916718 \| 6.624628
STEP = 4.061865
FENCES = -0.145147, 10.68649
OUTLIERS: NA