For two of these variables (that are not symmetric), find a transformation that makes the data approximately symmetric. You can use any tool you want, including Hinkley's quick method, inspection of mids, symmetry plots, etc. Demonstrate that your transformation has achieved symmetry.
library(LearnEDAfunctions)  # assumed source of lval(), symplot(), mtrans(), hinkley()
lake <- read.delim("~/data/lake.txt")
View(lake)
aplpack::stem.leaf(lake$Depth)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 71
##     6     0. | 777999
##    15     1* | 011334444
##    26     1. | 55677777899
##    32     2* | 133444
##    (6)    2. | 556678
##    33     3* | 012233334
##    24     3. | 5667899
##    17     4* | 0233
##    13     4. | 55
##    11     5* | 00
##     9     5. | 568
##     6     6* | 0
##     5     6. | 5
##     4     7* | 011
## HI: 89
hist(lake$Depth)
lval(lake$Depth)
##   depth   lo   hi  mids spreads
## M  36.0 26.0 26.0 26.00     0.0
## H  18.5 16.5 39.0 27.75    22.5
## E   9.5 12.0 52.5 32.25    40.5
## D   5.0  9.0 65.0 37.00    56.0
## C   3.0  7.0 71.0 39.00    64.0
## B   2.0  7.0 71.0 39.00    64.0
## A   1.0  7.0 89.0 48.00    82.0
symplot(lake$Depth)
The stemplot and histogram clearly show right-skewness in this data. The depth values are concentrated between about 10 and 40 (the mean is 30.27), with several large values in the upper tail. The median (26) is smaller than the mid-fourth (27.75), which indicates some right-skewness in the middle half of the data; the midsummaries keep increasing from the mid-fourth to the mid-extremes, which tells us that the outside half of the data is right-skewed as well. In addition, the symmetry plot shows almost all of the points falling above the line, which again indicates that the batch is right-skewed.
To remove the right-skewness, we go down the ladder of powers and try power transformations with p < 1.
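As a rough illustration of the mids comparison used above, here is a minimal base-R sketch. It assumes that the hinges returned by `fivenum()` can stand in for the fourths, and it uses a small made-up batch `x`, not the lake data; the variable names are ours.

```r
# Illustrative right-skewed batch (made-up values, not the lake data)
x <- c(2, 3, 3, 4, 5, 6, 8, 11, 15, 24, 40)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)

med         <- five[3]
mid_fourth  <- (five[2] + five[4]) / 2  # average of the two fourths
mid_extreme <- (five[1] + five[5]) / 2  # average of the two extremes

# For a right-skewed batch the mids drift upward away from the median
c(median = med, mid.fourth = mid_fourth, mid.extreme = mid_extreme)
```

If the three values are roughly equal, the batch is roughly symmetric; a steady increase, as in this toy batch, points to right-skewness.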
roots <- sqrt(lake$Depth)
symplot(roots)
hist(roots)
logs <- log(lake$Depth)
hist(logs)
symplot(logs)
boxplot(data.frame(roots,logs),horizontal = TRUE)
After trying p = 0.5 and p = 0, we find that the log transformation performs better. In the symmetry plot of the logs, most of the points lie close to the line, and the boxplot of the logs is nearly symmetric about the median.
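The idea behind the symmetry plot can be sketched in base R as follows. This is our own simplified version, not the `symplot()` implementation: the helper names `power_transform` and `symmetry_pairs` are hypothetical, and the batch `x` is made up so that it is exactly symmetric on the log scale.

```r
# Power re-expression on the ladder of powers; p = 0 is taken to mean log
power_transform <- function(x, p) {
  if (p == 0) log(x) else x^p
}

# Symmetry-plot coordinates: for the i-th smallest and i-th largest values,
# the distance below the median (lower) and above the median (upper)
symmetry_pairs <- function(x) {
  x <- sort(x)
  k <- floor(length(x) / 2)
  m <- median(x)
  data.frame(lower = m - x[1:k], upper = rev(x)[1:k] - m)
}

x <- 2^(0:6)  # made-up batch: 1, 2, 4, ..., 64

raw_pairs  <- symmetry_pairs(x)
root_pairs <- symmetry_pairs(power_transform(x, 0.5))
log_pairs  <- symmetry_pairs(power_transform(x, 0))

# upper > lower means the point lies above the line y = x (right skew)
raw_pairs
root_pairs
log_pairs
```

For this toy batch the raw values and the roots have every upper distance larger than the matching lower distance, while on the log scale the two distances agree, which is what points on the line of a symmetry plot mean.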
hist(lake$Hions)
boxplot(lake$Hions,horizontal = TRUE)
lval(lake$Hions)
##   depth    lo      hi  mids spreads
## M  36.0 1e-07 1.0e-07 1e-07 0.0e+00
## H  18.5 1e-07 3.0e-07 2e-07 2.0e-07
## E   9.5 0e+00 8.0e-07 4e-07 8.0e-07
## D   5.0 0e+00 1.0e-06 5e-07 1.0e-06
## C   3.0 0e+00 1.6e-06 8e-07 1.6e-06
## B   2.0 0e+00 1.6e-06 8e-07 1.6e-06
## A   1.0 0e+00 2.0e-06 1e-06 2.0e-06
symplot(lake$Hions)
raw<-lake$Hions
matched.roots<-mtrans(raw,0.5)
matched.logs<-mtrans(raw,0)
boxplot(data.frame(raw,matched.roots,matched.logs),horizontal = TRUE)
lake$Hions shows a similar problem: both the middle half and the outside half of the data are strongly right-skewed. The root and log transformations both make the outer part of the data more symmetric. In this case, the root transformation (p = 0.5) performs better.
2. Find two datasets (at least 30 observations in each) that you are interested in that are not symmetric. Find a transformation that makes each dataset symmetric and demonstrate that your transformation is effective in achieving approximate symmetry.
Dataset 1: Latest Covid-19 India State Data
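Matched transformations put the raw values, roots, and logs on a common scale so that their boxplots can be compared side by side. The sketch below shows one common way to do this, matching each re-expression to the raw data in value and slope at the median. It is an assumption on our part that this mirrors what `mtrans()` does; the function name `matched_power` and the batch `x` are hypothetical.

```r
# Matched power transformation about x0 (default: the median).
# The result equals x0 at x = x0 and has slope 1 there, so different
# powers are comparable on the scale of the raw data.
matched_power <- function(x, p, x0 = median(x)) {
  if (p == 0) {
    x0 + x0 * (log(x) - log(x0))            # log case: d/dx log(x) = 1/x0 at x0
  } else {
    x0 + (x^p - x0^p) / (p * x0^(p - 1))    # power case: slope of x^p at x0
  }
}

x <- 2^(0:6)  # made-up right-skewed batch, median 8

matched <- data.frame(
  raw   = x,
  roots = matched_power(x, 0.5),
  logs  = matched_power(x, 0)
)
matched
# boxplot(matched, horizontal = TRUE)  # uncomment to compare the shapes
```

All three columns share the same median, so any remaining difference between the boxplots reflects shape rather than scale.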
data <- read.csv("C:/Users/ylu_local/Desktop/5470/Covid19India.csv")
head(data)
##             State.UTs Total.Cases Active Discharged Deaths Active.Ratio....
## 1 Andaman and Nicobar        7566      6       7431    129             0.08
## 2      Andhra Pradesh     2014116  14693    1985566  13857             0.73
## 3   Arunachal Pradesh       53031    863      51908    260             1.63
## 4               Assam      589426   6901     576865   5660             1.17
## 5               Bihar      725708    100     715955   9653             0.01
## 6          Chandigarh       65105     40      64252    813             0.06
##   Discharge.Ratio.... Death.Ratio....
## 1               98.22            1.70
## 2               98.58            0.69
## 3               97.88            0.49
## 4               97.87            0.96
## 5               98.66            1.33
## 6               98.69            1.25
summary(data)
##  State.UTs          Total.Cases       Active           Discharged
##  Length:36          Min.   :   7566   Min.   :     4   Min.   :   7431
##  Class :character   1st Qu.:  73153   1st Qu.:   145   1st Qu.:  70212
##  Mode  :character   Median : 468646   Median :   839   Median : 459735
##                     Mean   : 911412   Mean   : 10505   Mean   : 888712
##                     3rd Qu.:1005276   3rd Qu.:  6034   3rd Qu.: 991172
##                     Max.   :6464876   Max.   :219441   Max.   :6272800
##  Deaths             Active.Ratio....   Discharge.Ratio....   Death.Ratio....
##  Min.   :     4.0   Min.   : 0.0100    Min.   :84.60         Min.   :0.040
##  1st Qu.:   809.8   1st Qu.: 0.0475    1st Qu.:97.63         1st Qu.:0.955
##  Median :  5396.0   Median : 0.5350    Median :98.22         Median :1.300
##  Mean   : 12195.0   Mean   : 1.2553    Mean   :97.48         Mean   :1.266
##  3rd Qu.: 13630.5   3rd Qu.: 0.9450    3rd Qu.:98.65         3rd Qu.:1.590
##  Max.   :137313.0   Max.   :15.0300    Max.   :99.92         Max.   :2.740
raw<- data$Total.Cases
hist(raw)
boxplot(raw,horizontal = TRUE)
symplot(raw)
In the India Covid-19 dataset, the number of total cases differs greatly across states. The histogram, boxplot, and symmetry plot all show clear right-skewness in both the middle and the outer parts of the data.
roots3<-sqrt(raw)
logs3<-log(raw)
hinkley(roots3)
## h
## 0.1091865
hinkley(logs3)
## h
## -0.1683523
tran<-raw^0.1
hist(tran)
matched.roots<-mtrans(raw,0.5)
matched.logs<-mtrans(raw,0)
matched.tran<-mtrans(raw,0.1)
boxplot(data.frame(raw,matched.roots,matched.logs,matched.tran),horizontal = TRUE)
After the root and log transformations, the skewness changes direction from right to left, as the hinkley() output shows. The roots (p = 0.5) give a positive value of d (0.109, labelled h in the output) and the logs (p = 0) give a negative value (-0.168), which suggests choosing a power re-expression p between 0 and 0.5. After trying p = 0.1 and comparing it with the other transformations in matched boxplots, we obtain a nearly symmetric boxplot.
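The search over powers can be sketched in base R as below. We assume here that Hinkley's quick statistic is (mean - median) divided by a measure of spread, and we use `IQR()` as that spread; the `hinkley()` function used above may scale differently, so the numbers are not expected to match its output. The names `hinkley_d` and `power_transform` are ours, and `x` is a made-up batch.

```r
# Power re-expression; p = 0 is taken to mean log
power_transform <- function(x, p) {
  if (p == 0) log(x) else x^p
}

# Quick skewness measure: positive for right skew, negative for left skew,
# near zero for an approximately symmetric batch (IQR scaling is an assumption)
hinkley_d <- function(x) {
  (mean(x) - median(x)) / IQR(x)
}

x <- 2^(0:6)                     # made-up right-skewed batch
powers <- c(1, 0.5, 0.25, 0)     # steps down the ladder of powers

d <- sapply(powers, function(p) hinkley_d(power_transform(x, p)))
data.frame(p = powers, d = d)
```

The power whose d is closest to zero is the natural candidate; when d changes sign between two powers, as it did between p = 0.5 and p = 0 for the Covid-19 totals, a value in between is worth trying.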
Dataset 2: Hospital data
Hospital <- read.csv("C:/Users/ylu_local/Desktop/Hospital.csv")
head(Hospital)
##   ID  Stay  Age InfctRsk Culture  Xray Beds MedSchool Region Census Nurses
## 1  1  7.13 55.7      4.1     9.0  39.6  279         2      4    207    241
## 2  2  8.82 58.2      1.6     3.8  51.7   80         2      2     51     52
## 3  3  8.34 56.9      2.7     8.1  74.0  107         2      3     82     54
## 4  4  8.95 53.7      5.6    18.9 122.8  147         2      4     53    148
## 5  5 11.20 56.5      5.7    34.5  88.9  180         2      1    134    151
## 6  6  9.76 50.9      5.1    21.9  97.0  150         2      2    147    106
##   Facilities
## 1         60
## 2         40
## 3         20
## 4         40
## 5         40
## 6         40
summary(Hospital)
##  ID               Stay             Age             InfctRsk
##  Min.   :  1.00   Min.   : 6.700   Min.   :38.80   Min.   :1.300
##  1st Qu.: 28.75   1st Qu.: 8.330   1st Qu.:50.85   1st Qu.:3.650
##  Median : 56.50   Median : 9.415   Median :53.10   Median :4.400
##  Mean   : 56.58   Mean   : 9.610   Mean   :53.12   Mean   :4.335
##  3rd Qu.: 84.25   3rd Qu.:10.432   3rd Qu.:56.12   3rd Qu.:5.200
##  Max.   :113.00   Max.   :19.560   Max.   :64.10   Max.   :7.800
##  Culture          Xray             Beds            MedSchool
##  Min.   : 1.600   Min.   : 39.60   Min.   : 29.0   Min.   :1.000
##  1st Qu.: 8.375   1st Qu.: 69.40   1st Qu.:104.5   1st Qu.:2.000
##  Median :14.050   Median : 82.15   Median :185.0   Median :2.000
##  Mean   :15.795   Mean   : 81.17   Mean   :251.2   Mean   :1.848
##  3rd Qu.:20.350   3rd Qu.: 93.28   3rd Qu.:307.5   3rd Qu.:2.000
##  Max.   :60.500   Max.   :122.80   Max.   :835.0   Max.   :2.000
##  Region          Census           Nurses          Facilities
##  Min.   :1.000   Min.   : 20.00   Min.   : 14.0   Min.   : 5.70
##  1st Qu.:2.000   1st Qu.: 67.75   1st Qu.: 65.5   1st Qu.:31.40
##  Median :2.000   Median :142.00   Median :130.5   Median :42.90
##  Mean   :2.375   Mean   :190.33   Mean   :173.2   Mean   :42.98
##  3rd Qu.:3.000   3rd Qu.:249.00   3rd Qu.:218.5   3rd Qu.:54.30
##  Max.   :4.000   Max.   :791.00   Max.   :656.0   Max.   :80.00
hist(Hospital$Beds)
boxplot(Hospital$Beds,horizontal = TRUE)
symplot(Hospital$Beds)
roots4<-sqrt(Hospital$Beds)
logs4<-log(Hospital$Beds)
boxplot(data.frame(roots4,logs4),horizontal = TRUE)
matched.roots<-mtrans(Hospital$Beds,0.5)
matched.logs<-mtrans(Hospital$Beds,0)
matched.tran<-mtrans(Hospital$Beds,0.1)
boxplot(data.frame(matched.roots,matched.logs,matched.tran),horizontal = TRUE)
Hospital$Beds is likewise strongly right-skewed in the histogram, boxplot, and symmetry plot. The root transformation still shows right-skewness, while the log transformation shows slight left-skewness. We therefore try p = 0.1 and compare it with the other transformations; it gives a reasonably symmetric boxplot.