library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(aplpack)

Lake data

  1. The dataset lake in the LearnEDA package (taken from the Minitab dataset collection) contains measurements of lakes in the Vilas and Oneida counties of northern Wisconsin. The variables are

AREA = area of lake in acres
DEPTH = maximum depth of lake in feet
PH = pH (acidity) measurement
WSHED = watershed area in square miles
HIONS = concentration of hydrogen ions

For two of these variables (that are not symmetric), find a transformation that makes the dataset symmetric. You can use any tool you want, including H quick method, inspection of mids, symmetry plots, etc. Demonstrate that your transformation has achieved approximate symmetry.

Variable 1: Lake Area

The displays below show that the lake areas are heavily right-skewed.

head(lake)
##   Area Depth  PH Wshed Hions
## 1   55    19 7.1   0.8 1e-07
## 2   26    14 6.1   0.3 8e-07
## 3 1065    36 7.6   6.3 0e+00
## 4  213    71 7.6   4.0 0e+00
## 5 1463    35 8.2  33.0 0e+00
## 6  180    24 7.1   5.0 1e-07
stem.leaf(lake$Area)
## 1 | 2: represents 120
##  leaf unit: 10
##             n: 71
##    12    0* | 222223333344
##    26    0. | 55555778899999
##   (11)   1* | 01122223344
##    34    1. | 6777788889
##    24    2* | 0112233
##    17    2. | 8
##    16    3* | 0
##    15    3. | 5579
##    11    4* | 33
##          4. | 
##     9    5* | 3
## HI: 599 610 716 1065 1285 1352 1463 3585

I will assess the symmetry of each transformation by inspecting its midsummaries. The midsummaries of the raw data are shown below.

area.lettervals <- lval(lake$Area)
select(area.lettervals, mids)
##      mids
## M  148.00
## H  157.00
## E  259.75
## D  547.00
## C  689.00
## B  743.00
## A 1803.50
area.lettervals %>% mutate(LV=1:7) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Area Raw Data")

The positive trend in the midsummary plot confirms the right skewness of the raw data. To find a transformation for symmetry, I will begin by taking roots of the raw data and inspecting the midsummaries of the roots.
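
To make the reasoning behind this check concrete: a midsummary is the average of the lower and upper letter values at a given depth, so for symmetric data every midsummary sits at the median, while a long right tail pulls the later midsummaries upward. The base R sketch below illustrates this on made-up numbers. Note that `midsummaries()` is a hypothetical helper written for this illustration (it is not the `lval()` function used above), and it assumes the usual depth rule in which each new depth is (floor of the previous depth + 1) / 2.

```r
# Hypothetical helper (not part of LearnEDAfunctions): midsummaries by letter-value depth.
midsummaries <- function(x) {
  x <- sort(x)
  n <- length(x)
  depth <- (n + 1) / 2            # depth of the median
  mids <- numeric(0)
  while (depth > 1) {             # stop before reaching the extremes
    idx <- c(floor(depth), ceiling(depth))
    lower <- mean(x[idx])         # letter value counted in from the low end
    upper <- mean(x[n + 1 - idx]) # letter value counted in from the high end
    mids <- c(mids, (lower + upper) / 2)
    depth <- (floor(depth) + 1) / 2
  }
  mids
}

sym  <- 1:9        # made-up symmetric batch: every midsummary equals the median, 5
skew <- exp(1:9)   # made-up right-skewed batch: midsummaries increase

midsummaries(sym)
midsummaries(skew)
```

A steadily increasing sequence, as in the second call, is the same positive trend seen in the plot of the raw areas.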

aroots<-sqrt(lake$Area)
stem.leaf(aroots)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 71
##    10     f | 4455555555
##    17     s | 6677777
##    26    0. | 889999999
##    35    1* | 000111111
##   (12)    t | 222333333333
##    24     f | 4444455
##    17     s | 67
##    15    1. | 8899
##    11    2* | 00
##     9     t | 3
##     8     f | 44
##     6     s | 6
## HI: 32.6343377441614 35.8468966578698 36.7695526217005 38.2491829978106 59.8748695196908
aroots.lv<-lval(aroots)
aroots.lv%>%mutate(LV=1:7) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Area Root Data")

The plot of midsummaries still shows a positive trend. Thus, we take another step down the ladder of powers and transform the data using logs.
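
The step from roots to logs can be viewed as moving the power p from 0.5 to 0, with the log standing in for the power 0. A minimal sketch of that ladder, assuming positive data whenever p is 0 or negative, is given here. The function name `ladder()` and the test values are made up for illustration.

```r
# Hypothetical helper: re-express x with power p, treating p = 0 as the log.
ladder <- function(x, p) {
  if (p <= 0 && any(x <= 0)) stop("logs and negative powers need positive data")
  if (p == 0) {
    log(x)
  } else if (p > 0) {
    x^p
  } else {
    -(x^p)  # sign flip keeps the original ordering for negative powers
  }
}

x <- c(1, 4, 9, 100, 10000)  # made-up right-skewed values, not the lake data
for (p in c(1, 0.5, 0)) {
  y <- ladder(x, p)
  # the gap between mean and median is a rough indicator of skewness
  cat("p =", p, " mean - median =", mean(y) - median(y), "\n")
}
```

Each step down the ladder pulls in the long right tail more strongly, which is why logs are the next thing to try when roots are not enough.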

alogs <- log(lake$Area)
stem.leaf(alogs)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 71
##     8    3* | 01233444
##    14    3. | 558899
##    21    4* | 0003344
##   (15)   4. | 555556778888889
##    35    5* | 001111122223334444
##    17    5. | 678899
##    11    6* | 00234
##     6    6. | 59
##     4    7* | 122
##          7. | 
##     1    8* | 1
alogs.lv <- lval(alogs)
alogs.lv %>% mutate(LV = 1:7) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Area Log Data")

By using this log transformation, the plot of midsummaries no longer has a positive trend, suggesting we have achieved approximate symmetry. The raw data and log data are compared below.

hist(lake$Area, main="RAW")

hist(alogs, main="LOGS")
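
As a numeric cross-check on displays like these, one quick skewness measure divides (mean - median) by a measure of spread: it is positive for right-skewed data and near zero for symmetric data. This is my reading of the "quick method" listed in the exercise, so the exact form below is an assumption. `quick_skew()` is a hypothetical helper, `IQR()` stands in for the fourth-spread, and the data are simulated rather than the lake areas.

```r
# Hypothetical helper: (mean - median) / spread, with IQR() as the spread.
quick_skew <- function(x) (mean(x) - median(x)) / IQR(x)

# Simulated right-skewed batch (lognormal quantiles), not the lake data.
raw <- exp(qnorm(ppoints(200), mean = 5, sd = 1))

quick_skew(raw)        # clearly positive: right skew
quick_skew(sqrt(raw))  # smaller, but still positive
quick_skew(log(raw))   # essentially zero: symmetric on the log scale
```

The transformation whose value is closest to zero is the candidate for symmetry, which should agree with what the midsummary plots suggest.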

Variable 2: Lake Depth

The lake depth dataset is shown in the stemplot below. This dataset is also skewed right.

stem.leaf(lake$Depth)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 71
##    6    0. | 777999
##   15    1* | 011334444
##   26    1. | 55677777899
##   32    2* | 133444
##   (6)   2. | 556678
##   33    3* | 012233334
##   24    3. | 5667899
##   17    4* | 0233
##   13    4. | 55
##   11    5* | 00
##    9    5. | 568
##    6    6* | 0
##    5    6. | 5
##    4    7* | 011
## HI: 89

The midsummaries are shown in the plot below.

depth.lettervals <- lval(lake$Depth)
select(depth.lettervals, mids)
##    mids
## M 26.00
## H 27.75
## E 32.25
## D 37.00
## C 39.00
## B 39.00
## A 48.00
depth.lettervals%>% mutate(LV = 1:7) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Depth Raw Data")

The positive trend here is not nearly as strong as it was for the raw lake areas, so these data are less heavily right-skewed. We will attempt to achieve symmetry by first taking roots of the raw data.

droots <- sqrt(lake$Depth)
stem.leaf(droots)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 71
##    3     2. | 666
##    9     3* | 000133
##   17     3. | 66777788
##   26     4* | 011111233
##   32     4. | 577888
##   (7)    5* | 0000124
##   32     5. | 566777789
##   23     6* | 00012234
##   15     6. | 5577
##   11     7* | 0044
##    7     7. | 67
##    5     8* | 0344
##          8. | 
##    1     9* | 4
droots.lv<-lval(droots)
droots.lv %>% mutate(LV=1:7) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Depth Root Data")

This plot of midsummaries suggests we have achieved approximate symmetry. The raw data and root data are shown in the histograms below.

hist(lake$Depth, main="RAW")

hist(droots, main="ROOTS")
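
A symmetry plot is another way to make the same comparison. It pairs the i-th smallest and i-th largest observations and plots their distances from the median against each other; points close to the line y = x indicate approximate symmetry. The sketch below is base R only, `symmetry_coords()` is a hypothetical helper, and the depths are made-up values rather than the lake data.

```r
# Hypothetical helper: distances of paired lower/upper order statistics from the median.
symmetry_coords <- function(x) {
  x <- sort(x)
  n <- length(x)
  k <- seq_len(floor(n / 2))
  m <- median(x)
  data.frame(lower = m - x[k], upper = x[n + 1 - k] - m)
}

depths <- c(7, 9, 11, 14, 17, 21, 25, 30, 36, 45, 60, 89)  # made-up values
raw_sc  <- symmetry_coords(depths)
root_sc <- symmetry_coords(sqrt(depths))

plot(root_sc$lower, root_sc$upper,
     xlab = "median - lower value", ylab = "upper value - median",
     main = "Symmetry plot (roots)")
abline(0, 1)  # points near this line indicate approximate symmetry
```

For right-skewed data the points sit above the line, because the upper distances exceed the lower ones.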

My data

  1. Find two datasets (at least 30 observations in each) that you are interested in that are not symmetric. Find a transformation that makes each dataset symmetric and demonstrate that your transformation is effective in achieving approximate symmetry.

Dataset 1: 2025 AUSL Hits per Player

The data below are the number of hits per player in the AUSL professional softball league in the 2025 season.

hits<-read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSsj7vQXUjOtPpfIAqSlQpBdpWd0IPUCggxT5g7hv3TAW5DX3mZFgeRNVLC3WAd27nwsiCX6NTjpSqE/pub?gid=1085482376&single=true&output=csv")
stem.leaf(hits$Hits)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 56
##   11    0* | 00000123344
##   22    0. | 55556777888
##   (6)   1* | 011233
##   (9)   1. | 667777899
##   19    2* | 00001114
##   11    2. | 56677889
##    3    3* | 1
##    2    3. | 7
##    1    4* | 2

The midsummaries of this dataset are plotted below. There is only a slight right skew in this dataset, so a power transformation with \(p=0.5\) (square roots) should be sufficient to achieve symmetry.

hits.lv<-lval(hits$Hits)
select(hits.lv, mids)
##    mids
## M 14.50
## H 13.00
## E 14.75
## D 14.50
## C 17.00
## B 21.00
hits.lv %>% mutate(LV=1:6) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Hits Raw Data")

hroots<-sqrt(hits$Hits)
stem.leaf(hroots)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 56
##     5    0* | 00000
##          0. | 
##     7    1* | 04
##     9    1. | 77
##    16    2* | 0022224
##    22    2. | 666888
##    26    3* | 1334
##    (2)   3. | 66
##   (13)   4* | 0011112334444
##    15    4. | 5558
##    11    5* | 00011223
##     3    5. | 5
##     2    6* | 04
hroots.lv<-lval(hroots)
hroots.lv %>% mutate(LV=1:6) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Hits Root Data")

As seen in the midsummary plot for the roots of the raw data, the positive trend has been eliminated.

hist(hits$Hits, main="RAW")

hist(hroots, main="ROOTS")
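
One practical point about count data like these: the stemplot shows several players with zero hits, and zeros survive a root transformation but not a log. The short sketch below shows the difference on made-up counts (not the AUSL data).

```r
hits_demo <- c(0, 0, 1, 4, 9, 16, 25)  # made-up counts for illustration

sqrt(hits_demo)  # zeros stay at 0
log(hits_demo)   # zeros become -Inf

all(is.finite(sqrt(hits_demo)))  # TRUE
all(is.finite(log(hits_demo)))   # FALSE
```

So if a stronger transformation than roots were ever needed here, the zeros would have to be handled first.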

Dataset 2: 2025 AUSL Total Bases per Player

From the same database, my second dataset is total bases per player in the AUSL. The raw data and midsummaries are shown below.

tbases<-read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSsj7vQXUjOtPpfIAqSlQpBdpWd0IPUCggxT5g7hv3TAW5DX3mZFgeRNVLC3WAd27nwsiCX6NTjpSqE/pub?gid=939459700&single=true&output=csv")
stem.leaf(tbases$TB)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 56
##    9    0* | 000001344
##   16    0. | 5566778
##   22    1* | 111333
##   (6)   1. | 566888
##   (1)   2* | 3
##   27    2. | 5678999
##   20    3* | 11223444
##   12    3. | 899
##    9    4* | 044
##    6    4. | 69
##    4    5* | 0
##    3    5. | 56
## HI: 78
tbases.lv<-lval(tbases$TB)
select(tbases.lv, mids)
##    mids
## M 20.50
## H 20.50
## E 23.75
## D 25.00
## C 27.75
## B 39.00
tbases.lv %>% mutate(LV=1:6) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Total Bases Raw Data")

As with my first dataset, a square-root transformation is sufficient to achieve approximate symmetry.

tbroots<-sqrt(tbases$TB)
stem.leaf(tbroots)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 56
##    5    0 | 00000
##    7    1 | 07
##   16    2 | 002244668
##   23    3 | 3336668
##   (6)   4 | 002227
##   27    5 | 001233355667888
##   12    6 | 1223667
##    5    7 | 0044
##    1    8 | 8
tbroots.lv<-lval(tbroots)
tbroots.lv %>% mutate(LV=1:6) %>%
  ggplot(aes(LV, mids)) +
  geom_point() + ggtitle("Total Bases Root Data")

hist(tbases$TB, main="RAW")

hist(tbroots, main="ROOTS")