1. The dataset lake in the LearnEDA package (taken from the Minitab dataset collection) contains measurements of lakes in the Vilas and Oneida counties of northern Wisconsin. The variables are

For two of these variables (that are not symmetric), find a transformation that makes the data symmetric. You can use any tool you want, including Hinkley's quick method, inspection of mids, symmetry plots, etc. Demonstrate that your transformation has been successful in achieving symmetry.

library(LearnEDA)  # provides lval(), symplot(), hinkley() and mtrans() used below
lake <- read.delim("~/data/lake.txt")
View(lake)
aplpack::stem.leaf(lake$Depth)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 71
##    6    0. | 777999
##   15    1* | 011334444
##   26    1. | 55677777899
##   32    2* | 133444
##   (6)   2. | 556678
##   33    3* | 012233334
##   24    3. | 5667899
##   17    4* | 0233
##   13    4. | 55
##   11    5* | 00
##    9    5. | 568
##    6    6* | 0
##    5    6. | 5
##    4    7* | 011
## HI: 89
hist(lake$Depth)

lval(lake$Depth)
##   depth   lo   hi  mids spreads
## M  36.0 26.0 26.0 26.00     0.0
## H  18.5 16.5 39.0 27.75    22.5
## E   9.5 12.0 52.5 32.25    40.5
## D   5.0  9.0 65.0 37.00    56.0
## C   3.0  7.0 71.0 39.00    64.0
## B   2.0  7.0 71.0 39.00    64.0
## A   1.0  7.0 89.0 48.00    82.0
symplot(lake$Depth)

From the stemplot and histogram, we clearly see right-skewness in the depth data. Most of the depth values fall between 10 and 40 (the mean is about 30.3), and there are several large values. The median (26) is smaller than the mid-fourth (27.75), which indicates some right-skewness in the middle half of the data. The midsummaries keep increasing from the mid-fourth (27.75) to the mid-extreme (48), which tells us that the outer half of the data is right-skewed as well. In addition, the symmetry plot shows almost all of the points falling above the line, which again indicates that the batch is right-skewed.
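
The reading of the mids above can be checked with a few lines of base R. This is only a sketch: the `lo` and `hi` vectors are copied by hand from the lval display, and the names `lo`, `hi`, `mids` and `drift` are introduced here for illustration.

```r
# Lower and upper letter values, copied from the lval(lake$Depth) display.
lo <- c(M = 26.0, H = 16.5, E = 12.0, D = 9.0, C = 7.0, B = 7.0, A = 7.0)
hi <- c(M = 26.0, H = 39.0, E = 52.5, D = 65.0, C = 71.0, B = 71.0, A = 89.0)

# A mid is the average of a matching pair of letter values.
mids <- (lo + hi) / 2

# Step-to-step change in the mids; steps that are all >= 0 mean the mids
# never drift downward, which is the pattern expected under right skew.
drift <- diff(mids)

print(mids)
print(drift)
```

For a roughly symmetric batch the mids would stay close to the median, so the steps in `drift` would hover around zero rather than being consistently positive.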

To remove the right-skewness, we go down the ladder of powers and try power transformations with p smaller than 1.
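
As a rough illustration of moving down the ladder, the sketch below applies a few powers to a small made-up batch and looks at the sign of (mid-fourth minus median). The helpers `ladder` and `mid_gap` and the vector `x` are hypothetical and are not part of LearnEDA; `quantile()` is used as a stand-in for Tukey's fourths, so the values may differ slightly from `lval()`.

```r
# Hypothetical helper: one rung of the ladder of powers.
# p = 0 is taken to be the log; negative powers are negated to keep the order.
ladder <- function(x, p) {
  stopifnot(all(x > 0))
  if (p == 0) log(x) else if (p > 0) x^p else -(x^p)
}

# Hypothetical helper: mid-fourth minus median.
# A positive value suggests right skew in the middle half of the batch.
mid_gap <- function(y) {
  q <- quantile(y, c(0.25, 0.5, 0.75), names = FALSE)
  (q[1] + q[3]) / 2 - q[2]
}

# Made-up right-skewed values (not the lake data).
x <- c(7, 9, 12, 16, 20, 26, 33, 40, 52, 65, 89)

# Compare the sign of the gap for p = 1, 0.5 and 0.
gaps <- sapply(c(raw = 1, root = 0.5, log = 0),
               function(p) mid_gap(ladder(x, p)))
print(gaps)
```

Only the sign of each gap is comparable across powers, because each transformation puts the data on a different scale.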

roots <- sqrt(lake$Depth)
symplot(roots)

hist(roots)

logs <- log(lake$Depth)
hist(logs)

symplot(logs)

boxplot(data.frame(roots,logs),horizontal = TRUE)

After trying p = 0.5 and p = 0, we find that the log transformation performs better. In the symmetry plot of the logs, most of the points lie close to the line, and the boxplot of the logs looks roughly symmetric from left to right.
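
To make the "points close to the line" criterion concrete, here is a minimal base-R sketch of the quantities behind a symmetry plot. It assumes the usual construction (distance of each upper value above the median against the distance of the matching lower value below it); the function `sym_points` and the vector `x` are illustrative and may not match what `symplot()` does internally.

```r
# Hypothetical helper: coordinates for a symmetry plot.
# lower = median minus the i-th smallest value
# upper = the i-th largest value minus the median
sym_points <- function(x) {
  x <- sort(x)
  k <- floor(length(x) / 2)
  m <- median(x)
  data.frame(lower = m - x[seq_len(k)],
             upper = rev(x)[seq_len(k)] - m)
}

# Made-up right-skewed values (not the lake data).
x <- c(7, 9, 12, 16, 20, 26, 33, 40, 52, 65, 89)

pts <- sym_points(x)

# Share of points above the line upper = lower; a share near 1 points to
# right skew, a share near 0 to left skew.
share_above <- mean(pts$upper > pts$lower)
print(pts)
print(share_above)
```

Calling `plot(pts$lower, pts$upper)` followed by `abline(0, 1)` draws the plot itself; a symmetric batch would scatter on both sides of that line.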

hist(lake$Hions)

boxplot(lake$Hions,horizontal = TRUE)

lval(lake$Hions)
##   depth    lo      hi  mids spreads
## M  36.0 1e-07 1.0e-07 1e-07 0.0e+00
## H  18.5 1e-07 3.0e-07 2e-07 2.0e-07
## E   9.5 0e+00 8.0e-07 4e-07 8.0e-07
## D   5.0 0e+00 1.0e-06 5e-07 1.0e-06
## C   3.0 0e+00 1.6e-06 8e-07 1.6e-06
## B   2.0 0e+00 1.6e-06 8e-07 1.6e-06
## A   1.0 0e+00 2.0e-06 1e-06 2.0e-06
symplot(lake$Hions)

raw<-lake$Hions
matched.roots<-mtrans(raw,0.5)
matched.logs<-mtrans(raw,0)
boxplot(data.frame(raw,matched.roots,matched.logs),horizontal = TRUE)

A similar problem appears in lake$Hions: both the middle half and the outer half of the data are strongly right-skewed. The root and log transformations both make the outer part of the data more symmetric. In this case, the root transformation (p = 0.5) performs better.
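
The matched transformations put the raw, root and log batches on a common scale so that they can share one boxplot. The sketch below shows the standard matched form, which agrees with the raw data in value and slope at a reference point `x0` (here the median). It is an assumption that `mtrans()` computes something equivalent; the function `matched` is written here for illustration only.

```r
# Hypothetical helper: matched power transformation about x0.
# Built so that matched(x0) = x0 and the slope at x0 equals 1.
matched <- function(x, p, x0 = median(x)) {
  stopifnot(all(x > 0), x0 > 0)
  if (p == 0) {
    x0 + x0 * (log(x) - log(x0))
  } else {
    x0 + (x^p - x0^p) / (p * x0^(p - 1))
  }
}

# Made-up positive values (not the lake data); the median is 4.
x <- c(1, 2, 4, 8, 16)

# Side-by-side comparison on the raw scale.
comparison <- data.frame(raw  = x,
                         root = matched(x, 0.5),
                         log  = matched(x, 0))
print(comparison)
```

Because every column agrees with the raw data near the median, differences between the boxplots reflect shape rather than units.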

2. Find two datasets (at least 30 observations in each) that you are interested in that are not symmetric. Find a transformation that makes each dataset symmetric and demonstrate that your transformation is effective in achieving approximate symmetry.

Dataset 1: Latest Covid-19 India State Data

data <- read.csv("C:/Users/ylu_local/Desktop/5470/Covid19India.csv")
head(data)
##             State.UTs Total.Cases Active Discharged Deaths Active.Ratio....
## 1 Andaman and Nicobar        7566      6       7431    129             0.08
## 2      Andhra Pradesh     2014116  14693    1985566  13857             0.73
## 3   Arunachal Pradesh       53031    863      51908    260             1.63
## 4               Assam      589426   6901     576865   5660             1.17
## 5               Bihar      725708    100     715955   9653             0.01
## 6          Chandigarh       65105     40      64252    813             0.06
##   Discharge.Ratio.... Death.Ratio....
## 1               98.22            1.70
## 2               98.58            0.69
## 3               97.88            0.49
## 4               97.87            0.96
## 5               98.66            1.33
## 6               98.69            1.25
summary(data)
##   State.UTs          Total.Cases          Active         Discharged     
##  Length:36          Min.   :   7566   Min.   :     4   Min.   :   7431  
##  Class :character   1st Qu.:  73153   1st Qu.:   145   1st Qu.:  70212  
##  Mode  :character   Median : 468646   Median :   839   Median : 459735  
##                     Mean   : 911412   Mean   : 10505   Mean   : 888712  
##                     3rd Qu.:1005276   3rd Qu.:  6034   3rd Qu.: 991172  
##                     Max.   :6464876   Max.   :219441   Max.   :6272800  
##      Deaths         Active.Ratio....  Discharge.Ratio.... Death.Ratio....
##  Min.   :     4.0   Min.   : 0.0100   Min.   :84.60       Min.   :0.040  
##  1st Qu.:   809.8   1st Qu.: 0.0475   1st Qu.:97.63       1st Qu.:0.955  
##  Median :  5396.0   Median : 0.5350   Median :98.22       Median :1.300  
##  Mean   : 12195.0   Mean   : 1.2553   Mean   :97.48       Mean   :1.266  
##  3rd Qu.: 13630.5   3rd Qu.: 0.9450   3rd Qu.:98.65       3rd Qu.:1.590  
##  Max.   :137313.0   Max.   :15.0300   Max.   :99.92       Max.   :2.740
raw<- data$Total.Cases
hist(raw)

boxplot(raw,horizontal = TRUE)

symplot(raw)

In the India Covid-19 dataset, the number of total cases differs greatly across states. From the histogram, boxplot, and symmetry plot, we can clearly see right-skewness in both the middle and the outer part of the data.

roots3<-sqrt(raw)
logs3<-log(raw)
hinkley(roots3)
##         h 
## 0.1091865
hinkley(logs3)
##          h 
## -0.1683523
tran<-raw^0.1
hist(tran)

matched.roots<-mtrans(raw,0.5)
matched.logs<-mtrans(raw,0)
matched.tran<-mtrans(raw,0.1)
boxplot(data.frame(raw,matched.roots,matched.logs,matched.tran),horizontal = TRUE)

After trying the root and log transformations, we find that the distribution changes from right-skewed to left-skewed, as shown in the hinkley output. The roots (p = 0.5) give a positive value of Hinkley's statistic (0.109) and the logs (p = 0) give a negative value (-0.168), which suggests choosing a power p between 0 and 0.5. After trying p = 0.1 and comparing it with the other transformations, we obtain a roughly symmetric boxplot.
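
One rough way to narrow down the power is to interpolate linearly between the two reported values of the statistic. The sketch below does this and also shows a stand-alone version of the statistic, (mean - median) / scale. Using the interquartile range as the scale is an assumption made here; `hinkley()` in LearnEDA may use a different scale, and `hinkley_d` is a hypothetical name.

```r
# Hypothetical helper: Hinkley-style skewness statistic.
# Assumption: the scale is the interquartile range.
hinkley_d <- function(x) {
  unname((mean(x) - median(x)) / IQR(x))
}

# Powers tried and the statistic reported by hinkley() in the output above.
p <- c(0, 0.5)
d <- c(-0.1683523, 0.1091865)

# Power at which the straight line through the two points crosses zero.
p_zero <- p[1] - d[1] * (p[2] - p[1]) / (d[2] - d[1])
print(p_zero)
```

Under these assumptions the interpolated power is close to 0.3. The statistic need not be linear in p, so this is only a starting point, and powers in that neighbourhood may be worth comparing alongside p = 0.1 in the matched boxplot.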

Dataset 2: Hospital data

Hospital <- read.csv("C:/Users/ylu_local/Desktop/Hospital.csv")
head(Hospital)
##   ID  Stay  Age InfctRsk Culture  Xray Beds MedSchool Region Census Nurses
## 1  1  7.13 55.7      4.1     9.0  39.6  279         2      4    207    241
## 2  2  8.82 58.2      1.6     3.8  51.7   80         2      2     51     52
## 3  3  8.34 56.9      2.7     8.1  74.0  107         2      3     82     54
## 4  4  8.95 53.7      5.6    18.9 122.8  147         2      4     53    148
## 5  5 11.20 56.5      5.7    34.5  88.9  180         2      1    134    151
## 6  6  9.76 50.9      5.1    21.9  97.0  150         2      2    147    106
##   Facilities
## 1         60
## 2         40
## 3         20
## 4         40
## 5         40
## 6         40
summary(Hospital)
##        ID              Stay             Age           InfctRsk    
##  Min.   :  1.00   Min.   : 6.700   Min.   :38.80   Min.   :1.300  
##  1st Qu.: 28.75   1st Qu.: 8.330   1st Qu.:50.85   1st Qu.:3.650  
##  Median : 56.50   Median : 9.415   Median :53.10   Median :4.400  
##  Mean   : 56.58   Mean   : 9.610   Mean   :53.12   Mean   :4.335  
##  3rd Qu.: 84.25   3rd Qu.:10.432   3rd Qu.:56.12   3rd Qu.:5.200  
##  Max.   :113.00   Max.   :19.560   Max.   :64.10   Max.   :7.800  
##     Culture            Xray             Beds         MedSchool    
##  Min.   : 1.600   Min.   : 39.60   Min.   : 29.0   Min.   :1.000  
##  1st Qu.: 8.375   1st Qu.: 69.40   1st Qu.:104.5   1st Qu.:2.000  
##  Median :14.050   Median : 82.15   Median :185.0   Median :2.000  
##  Mean   :15.795   Mean   : 81.17   Mean   :251.2   Mean   :1.848  
##  3rd Qu.:20.350   3rd Qu.: 93.28   3rd Qu.:307.5   3rd Qu.:2.000  
##  Max.   :60.500   Max.   :122.80   Max.   :835.0   Max.   :2.000  
##      Region          Census           Nurses        Facilities   
##  Min.   :1.000   Min.   : 20.00   Min.   : 14.0   Min.   : 5.70  
##  1st Qu.:2.000   1st Qu.: 67.75   1st Qu.: 65.5   1st Qu.:31.40  
##  Median :2.000   Median :142.00   Median :130.5   Median :42.90  
##  Mean   :2.375   Mean   :190.33   Mean   :173.2   Mean   :42.98  
##  3rd Qu.:3.000   3rd Qu.:249.00   3rd Qu.:218.5   3rd Qu.:54.30  
##  Max.   :4.000   Max.   :791.00   Max.   :656.0   Max.   :80.00
hist(Hospital$Beds)

boxplot(Hospital$Beds,horizontal = TRUE)

symplot(Hospital$Beds)

roots4<-sqrt(Hospital$Beds)
logs4<-log(Hospital$Beds)
boxplot(data.frame(roots4,logs4),horizontal = TRUE)

matched.roots<-mtrans(Hospital$Beds,0.5)
matched.logs<-mtrans(Hospital$Beds,0)
matched.tran<-mtrans(Hospital$Beds,0.1)
boxplot(data.frame(matched.roots,matched.logs,matched.tran),horizontal = TRUE)

Similarly, Hospital$Beds is strongly right-skewed, as the histogram, boxplot, and symmetry plot show. The root transformation still leaves some right-skewness, and the log transformation shows slight left-skewness. Therefore, we try p = 0.1 and compare it with the other transformations; it gives a relatively symmetric boxplot.