For each of the two datasets below
- Find 5-number summaries, fences, and outside values for each group (CITY).
- Construct parallel boxplots.
- Using a spread-vs-level plot, determine the power of a transformation that you believe will stabilize spread.
- Using the transformation, reanalyze the data by computing new 5-number summaries (and fence and outside values) and parallel boxplots. Was the transformation successful in stabilizing spread? If not, which alternative transformation might you try next?
- Discuss the differences you found between the groups and unusual data values (BE SPECIFIC AND MAKE COMPARISONS).
Dataset 1
1979 salaries (in hundreds of Swiss francs) of 7 different professions in 6 cities
Datafile: salaries in LearnEDA package
salaries <- read.delim("C:/Users/ylu_local/Desktop/5470/LearnEDAfunctions-master/LearnEDAfunctions-master/data/salaries.txt")
head(salaries)
## Salary City Profession
## 1 341 Amsterdam Teacher
## 2 110 Athens Teacher
## 3 31 Bangkok Teacher
## 4 116 Hong_Kong Teacher
## 5 326 Los_Angeles Teacher
## 6 89 Singapore Teacher
fivenum(salaries$Salary)
## [1] 31.0 89.0 154.5 320.0 593.0
Ams<-subset.data.frame(salaries,City=="Amsterdam")
fivenum(Ams$Salary)
## [1] 266 310 341 424 593
lval(Ams$Salary)
## depth lo hi mids spreads
## M 4.0 341 341 341.0 0
## H 2.5 310 424 367.0 114
## E 1.0 266 593 429.5 327
lval_plus(Ams,Ams$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 1 341 Amsterdam Teacher 139 595 FALSE
## 7 298 Amsterdam Chauffer 139 595 FALSE
## 13 266 Amsterdam Mechanic 139 595 FALSE
## 19 371 Amsterdam Cook 139 595 FALSE
## 25 593 Amsterdam Manager 139 595 FALSE
## 31 477 Amsterdam Engineer 139 595 FALSE
## 37 322 Amsterdam Cashier 139 595 FALSE
City = Amsterdam
| N= 7 | |
| M 4 | 341 | |
| F 2.5| 310 | 424 |
| STEP = 171 |
| FENCES = 139, 595 |
| OUTLIERS: NA |
Ath<-subset.data.frame(salaries,City=="Athens")
fivenum(Ath$Salary)
## [1] 106.0 117.5 161.0 192.0 320.0
lval(Ath$Salary)
## depth lo hi mids spreads
## M 4.0 161.0 161 161.00 0.0
## H 2.5 117.5 192 154.75 74.5
## E 1.0 106.0 320 213.00 214.0
lval_plus(Ath,Ath$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 2 110 Athens Teacher 5.75 303.75 FALSE
## 8 106 Athens Chauffer 5.75 303.75 FALSE
## 14 125 Athens Mechanic 5.75 303.75 FALSE
## 20 195 Athens Cook 5.75 303.75 FALSE
## 26 320 Athens Manager 5.75 303.75 TRUE
## 32 189 Athens Engineer 5.75 303.75 FALSE
## 38 161 Athens Cashier 5.75 303.75 FALSE
City = Athens
| N= 7 | |
| M 4 | 161 | |
| F 2.5| 117.5 | 192 |
| STEP = 111.75 |
| FENCES = 5.75, 303.75 |
| OUTLIERS: 320 |
Ban<-subset.data.frame(salaries,City=="Bangkok")
fivenum(Ban$Salary)
## [1] 31.0 34.5 37.0 101.5 148.0
lval(Ban$Salary)
## depth lo hi mids spreads
## M 4.0 37.0 37.0 37.0 0
## H 2.5 34.5 101.5 68.0 67
## E 1.0 31.0 148.0 89.5 117
lval_plus(Ban,Ban$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 3 31 Bangkok Teacher -66 202 FALSE
## 9 34 Bangkok Chauffer -66 202 FALSE
## 15 35 Bangkok Mechanic -66 202 FALSE
## 21 125 Bangkok Cook -66 202 FALSE
## 27 148 Bangkok Manager -66 202 FALSE
## 33 78 Bangkok Engineer -66 202 FALSE
## 39 37 Bangkok Cashier -66 202 FALSE
City = Bangkok
| N= 7 | |
| M 4 | 37 | |
| F 2.5| 34.5 | 101.5 |
| STEP = 100.5 |
| FENCES = -66, 202 |
| OUTLIERS: NA |
Hon<-subset.data.frame(salaries,City=="Hong_Kong")
fivenum(Hon$Salary)
## [1] 59.0 96.0 116.0 159.5 203.0
lval(Hon$Salary)
## depth lo hi mids spreads
## M 4.0 116 116.0 116.00 0.0
## H 2.5 96 159.5 127.75 63.5
## E 1.0 59 203.0 131.00 144.0
lval_plus(Hon,Hon$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 4 116 Hong_Kong Teacher 0.75 254.75 FALSE
## 10 77 Hong_Kong Chauffer 0.75 254.75 FALSE
## 16 59 Hong_Kong Mechanic 0.75 254.75 FALSE
## 22 147 Hong_Kong Cook 0.75 254.75 FALSE
## 28 203 Hong_Kong Manager 0.75 254.75 FALSE
## 34 172 Hong_Kong Engineer 0.75 254.75 FALSE
## 40 115 Hong_Kong Cashier 0.75 254.75 FALSE
City = Hongkong
| N= 7 | |
| M 4 | 116 | |
| F 2.5| 96 | 159.5 |
| STEP = 95.25 |
| FENCES = 0.75, 254.75 |
| OUTLIERS: NA |
Los<-subset.data.frame(salaries,City=="Los_Angeles")
fivenum(Los$Salary)
## [1] 179.0 308.5 326.0 412.0 593.0
lval(Los$Salary)
## depth lo hi mids spreads
## M 4.0 326.0 326 326.00 0.0
## H 2.5 308.5 412 360.25 103.5
## E 1.0 179.0 593 386.00 414.0
lval_plus(Los,Los$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 5 326 Los_Angeles Teacher 153.25 567.25 FALSE
## 11 294 Los_Angeles Chauffer 153.25 567.25 FALSE
## 17 363 Los_Angeles Mechanic 153.25 567.25 FALSE
## 23 323 Los_Angeles Cook 153.25 567.25 FALSE
## 29 593 Los_Angeles Manager 153.25 567.25 TRUE
## 35 461 Los_Angeles Engineer 153.25 567.25 FALSE
## 41 179 Los_Angeles Cashier 153.25 567.25 FALSE
City = Los_Angeles
| N= 7 | |
| M 4 | 326 | |
| F 2.5| 308.5 | 412 |
| STEP = 155.25 |
| FENCES = 153.25, 567.25 |
| OUTLIERS: 593 |
Sin<-subset.data.frame(salaries,City=="Singapore")
fivenum(Sin$Salary)
## [1] 43.0 67.5 89.0 97.5 250.0
lval(Sin$Salary)
## depth lo hi mids spreads
## M 4.0 89.0 89.0 89.0 0
## H 2.5 67.5 97.5 82.5 30
## E 1.0 43.0 250.0 146.5 207
lval_plus(Sin,Sin$Salary)
## Salary City Profession Fence_LO Fence_HI OUT
## 6 89 Singapore Teacher 22.5 142.5 FALSE
## 12 43 Singapore Chauffer 22.5 142.5 FALSE
## 18 52 Singapore Mechanic 22.5 142.5 FALSE
## 24 103 Singapore Cook 22.5 142.5 FALSE
## 30 250 Singapore Manager 22.5 142.5 TRUE
## 36 83 Singapore Engineer 22.5 142.5 FALSE
## 42 92 Singapore Cashier 22.5 142.5 FALSE
City = Singapore
| N= 7 | |
| M 4 | 89 | |
| F 2.5| 67.5 | 97.5 |
| STEP = 45 |
| FENCES = 22.5, 142.5 |
| OUTLIERS: 250 |
Sal<-salaries[-43,-3]
unstacked.data <- unstack(Sal)
stacked<-stack(unstacked.data)
boxplot(values ~ ind, data = stacked,
horizontal = TRUE,
ylim = c(0, 600),
xlab = "Salary", ylab = "City")

Here are parallel boxplots of the salaries on the six cities. It is clear to see most of the cities are strongly skewed in distribution. We see differences in the average salaries and 3 outliers in this graph (listed above). Amsterdam and Los Angeles tend to have obviously higher salaries than other four cities. In addition, we also see differences in the spreads of the batches. We see a tendency for the batches of higher salaries to have larger spreads and so we try our spread vs. level plot to suggest a possible reexpression of the data.
spreadLevelPlot(Sal$Salary, Sal$City)

## LowerHinge Median UpperHinge Hinge-Spread
## Bangkok 34.5 37 101.5 67.0
## Singapore 67.5 89 97.5 30.0
## Hong_Kong 96.0 116 159.5 63.5
## Athens 117.5 161 192.0 74.5
## Los_Angeles 308.5 326 412.0 103.5
## Amsterdam 310.0 341 424.0 114.0
##
## Suggested power transformation: 0.7161117
Clearly there is a positive association in the graph, indicating that batches with small medians tend also to have small dfs (spreads) except for Bangkok. Also the suggested power transformation is 0.7161117. We can correct the dependence between spread and level by reexpressing the data to a different scale.
reexpressed.data <- data.frame(unstacked.data[,c(1:6)] ^ (0.7161117))
boxplot(reexpressed.data,
horizontal = TRUE, ylim = c(0, 100),
xlab = "TRANSFORMED Salary", ylab = "City")

stack.re<-stack(reexpressed.data)
spreadLevelPlot(stack.re$values, stack.re$ind)

## LowerHinge Median UpperHinge Hinge-Spread
## Bangkok 12.62528 13.27418 27.19249 14.567211
## Singapore 20.30584 24.88748 26.55883 6.252989
## Hong_Kong 26.16845 30.08725 37.77036 11.601904
## Athens 30.35277 38.04817 43.16065 12.807882
## Los_Angeles 60.60203 63.05863 74.46144 13.859413
## Amsterdam 60.81729 65.12310 75.99665 15.179359
##
## Suggested power transformation: 0.9329465
After transformation, actually this display looks much better than our original picture. However, there are still right skewness in four batches and we can’t help but notice high outliers in Athens disappears.
In addition, there is one improvement – the spreads of the right skewness of each batch are roughly equal and there isn’t much difference in the two higher batches in spreads than before, which means we may remove the dependence between level and spread.
We checked out this point by performing a spread vs. level plot for the reexpressed data, which shows much more stable between spread and level. Also, the suggested power transformation is 0.9329465(close to 1), which is a considerable balance between level and spread. If a more accurate result is needed, we can choose rescale the reexpressed data with a suggested power 0.9329465 in the next step.
I see a slight improvement using this reexpression. The spreads of the raw data range from 30 to 114 – the largest spread is 114/30 = 3.8 times the smallest. Looking at the reexpressed data, the spreads range from 6.252989 to 15.179359 – the ratio is around 2.43. Actually, this is not a big improvement – it probably doesn’t make much sense in this case to reexpress the times.
Ocean = Arctic
| N= 15 | |
| M 8 | 16671 | |
| F 4.5| 11221 | 31019 |
| STEP = 29697 |
| FENCES = -18476, 60716 |
| OUTLIERS: 195928,75767,83896 |
Car<-subset.data.frame(island,Ocean=="Caribbean")
fivenum(Car$Area)
## [1] 59.0 124.0 290.0 2689.5 44218.0
lval(Car$Area)
## depth lo hi mids spreads
## M 8.0 290.0 290.0 290.00 0.0
## H 4.5 124.0 2689.5 1406.75 2565.5
## E 2.5 91.5 16887.0 8489.25 16795.5
## D 1.0 59.0 44218.0 22138.50 44159.0
lval_plus(Car,Car$Area)
## Area Ocean Fence_LO Fence_HI OUT
## 16 108 Caribbean -3724.25 6537.75 FALSE
## 17 75 Caribbean -3724.25 6537.75 FALSE
## 18 166 Caribbean -3724.25 6537.75 FALSE
## 19 44218 Caribbean -3724.25 6537.75 TRUE
## 20 171 Caribbean -3724.25 6537.75 FALSE
## 21 290 Caribbean -3724.25 6537.75 FALSE
## 22 687 Caribbean -3724.25 6537.75 FALSE
## 23 29530 Caribbean -3724.25 6537.75 TRUE
## 24 4244 Caribbean -3724.25 6537.75 FALSE
## 25 425 Caribbean -3724.25 6537.75 FALSE
## 26 3515 Caribbean -3724.25 6537.75 FALSE
## 27 116 Caribbean -3724.25 6537.75 FALSE
## 28 1864 Caribbean -3724.25 6537.75 FALSE
## 29 59 Caribbean -3724.25 6537.75 FALSE
## 30 132 Caribbean -3724.25 6537.75 FALSE
Ocean = Caribbean
| N= 15 | |
| M 8 | 290 | |
| F 4.5| 124 | 2689.5 |
| STEP = 3848.25 |
| FENCES = -3724.25, 6537.75 |
| OUTLIERS: 44218,29530 |
Ind<-subset.data.frame(island,Ocean=="Indian")
fivenum(Ind$Area)
## [1] 171.0 510.0 844.5 13916.0 226658.0
lval(Ind$Area)
## depth lo hi mids spreads
## M 4.5 844.5 844.5 844.5 0
## H 2.5 510.0 13916.0 7213.0 13406
## E 1.0 171.0 226658.0 113414.5 226487
lval_plus(Ind,Ind$Area)
## Area Ocean Fence_LO Fence_HI OUT
## 31 2500 Indian -10874.5 19657.5 FALSE
## 32 226658 Indian -10874.5 19657.5 TRUE
## 33 720 Indian -10874.5 19657.5 FALSE
## 34 380 Indian -10874.5 19657.5 FALSE
## 35 969 Indian -10874.5 19657.5 FALSE
## 36 171 Indian -10874.5 19657.5 FALSE
## 37 25332 Indian -10874.5 19657.5 TRUE
## 38 640 Indian -10874.5 19657.5 FALSE
Ocean = Indian
| N= 8 | |
| M 4.5| 844.5 | |
| F 2.5| 510.0 | 13916.0 |
| STEP = 20109 |
| FENCES = -19599, 34025 |
| OUTLIERS: 226658 |
Med<-subset.data.frame(island,Ocean=="Mediterranean")
fivenum(Med$Area)
## [1] 86.0 385.5 1936.0 3470.5 9822.0
lval(Med$Area)
## depth lo hi mids spreads
## M 6.0 1936.0 1936.0 1936 0
## H 3.5 385.5 3470.5 1928 3085
## E 2.0 122.0 9262.0 4692 9140
## D 1.0 86.0 9822.0 4954 9736
lval_plus(Med,Med$Area)
## Area Ocean Fence_LO Fence_HI OUT
## 39 1936 Mediterranean -4242 8098 FALSE
## 40 229 Mediterranean -4242 8098 FALSE
## 41 3369 Mediterranean -4242 8098 FALSE
## 42 3186 Mediterranean -4242 8098 FALSE
## 43 3572 Mediterranean -4242 8098 FALSE
## 44 86 Mediterranean -4242 8098 FALSE
## 45 1409 Mediterranean -4242 8098 FALSE
## 46 122 Mediterranean -4242 8098 FALSE
## 47 542 Mediterranean -4242 8098 FALSE
## 48 9262 Mediterranean -4242 8098 TRUE
## 49 9822 Mediterranean -4242 8098 TRUE
Ocean = Mediterranean
| N= 11 | |
| M 6 | 1936 | |
| F 3.5| 385.5| 3470.5 |
| STEP = 4627.5 |
| FENCES = -4242, 8098 |
| OUTLIERS: 9262,9822 |
East<-subset.data.frame(island,Ocean=="East_Indies")
fivenum(East$Area)
## [1] 2113.0 3707.0 21429.5 69000.0 280100.0
lval(East$Area)
## depth lo hi mids spreads
## M 5.5 21429.5 21429.5 21429.5 0
## H 3.0 3707.0 69000.0 36353.5 65293
## E 2.0 2147.0 165000.0 83573.5 162853
## D 1.0 2113.0 280100.0 141106.5 277987
lval_plus(East,East$Area)
## Area Ocean Fence_LO Fence_HI OUT
## 50 2147 East_Indies -81780.62 151428.4 FALSE
## 51 280100 East_Indies -81780.62 151428.4 TRUE
## 52 69000 East_Indies -81780.62 151428.4 FALSE
## 53 48900 East_Indies -81780.62 151428.4 FALSE
## 54 2113 East_Indies -81780.62 151428.4 FALSE
## 55 28766 East_Indies -81780.62 151428.4 FALSE
## 56 14093 East_Indies -81780.62 151428.4 FALSE
## 57 3707 East_Indies -81780.62 151428.4 FALSE
## 58 165000 East_Indies -81780.62 151428.4 TRUE
## 59 11570 East_Indies -81780.62 151428.4 FALSE
Ocean = East_Indies
| N= 15 | |
| M 5.5 | 21429.5 | |
| F 3.0 | 3707.0 | 69000.0 |
| STEP = 97939.5 |
| FENCES = -94232, 166939.5 |
| OUTLIERS: 280100 |
boxplot(Area ~ Ocean, data = island,
horizontal = TRUE,
xlab = "Area", ylab = "Ocean")

spreadLevelPlot(island$Area, island$Ocean)

## LowerHinge Median UpperHinge Hinge-Spread
## Caribbean 124.0 290.0 2689.5 2565.5
## Indian 510.0 844.5 13916.0 13406.0
## Mediterranean 385.5 1936.0 3470.5 3085.0
## Arctic 11221.0 16671.0 31019.0 19798.0
## East_Indies 3707.0 21429.5 69000.0 65293.0
##
## Suggested power transformation: 0.4155263
Here are parallel boxplots of the areas on the five oceans.
Each batch is strongly right-skewed and there are 9 outliers at the high end.
The batches have different spreads. If we look at the length of the boxes (the fourth spread), we see that the East-Indies data is more spread out than the Arctic and Indian data which are more spread out than the Mediterranean and Caribbean data.
Looking further, we see differences in the average areas and a tendency for the batches of larger areas to have larger spreads. In other words, we can see a dependence between spread and level in the boxplot display above. It is difficult to compare these batches since they have unequal spreads. Let’s do transformation next.
reArea<-island$Area ^ (0.4155263 )
ocean<-island$Ocean
redata<-cbind.data.frame(reArea,ocean)
boxplot(reArea ~ ocean,data = redata,
horizontal = TRUE,
xlab = "TRANSFORMED Area", ylab = "Ocean")

spreadLevelPlot(redata$reArea, redata$ocean)

## LowerHinge Median UpperHinge Hinge-Spread
## Caribbean 7.407163 10.54851 26.29973 18.89257
## Indian 13.229608 16.40298 46.70106 33.47145
## Mediterranean 11.620684 23.21644 29.58563 17.96495
## Arctic 48.055475 56.79840 73.36941 25.31393
## East_Indies 30.410443 62.10913 102.48707 72.07662
##
## Suggested power transformation: 0.5582082
After the first transformation, although this display doesn’t look much better than our original one, we notice that 5 high outliers disappears(2 in Mediterranean, 2 in Arctic and 1 in East_Indies).It does helps the spreads of the skewness of each batch more equally than before. Actually, we still see a strongly positive relationship between level and spread, which indicates that a further reexpression may be needed to remove or, at least, decrease the relationship between level and spread. The suggested power transformation is 0.5582082.
re_Area<-redata$reArea^(0.5582082)
re_data<-cbind.data.frame(re_Area,ocean)
boxplot(re_Area ~ ocean,data = re_data,
horizontal = TRUE,
xlab = "TRANSFORMED Area", ylab = "Ocean")

spreadLevelPlot(re_data$re_Area, re_data$ocean)

## LowerHinge Median UpperHinge Hinge-Spread
## Caribbean 3.057794 3.725235 6.190182 3.132388
## Indian 4.221151 4.764049 8.322824 4.101674
## Mediterranean 3.916718 5.786265 6.624628 2.707910
## Arctic 8.680801 9.534212 10.994916 2.314115
## East_Indies 6.727217 9.995057 13.254763 6.527545
##
## Suggested power transformation: 0.906075
After the second transformation, actually this display looks really better than our original picture. There are still three right skewness in five batches and we can’t help but notice there are only 3 high outliers (original 9 outliers), one more disappears at this transformation.
In addition, there is another improvement –the spreads of the batches are more similar. The spreads of the skewness of each batch are roughly equal this time which means we have removed the dependence between level and spread.
I see an exciting improvement using these reexpressions. The spreads of the raw data range from 2565.5 to 65293 – the largest spread is 65293/2565.5 = 25.45 times the smallest. Looking at the last reexpressed data, the spreads range from 2.314115 to 6.527545 – the ratio is around 2.8. Actually, this is a really big improvement – it does make much sense in this case to reexpress the times.