Github Link: Presentation Files
For each of the following, state whether you expect the distribution to be symmetric, right skewed or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.
1. Very few households will have many pets.
2. Most people who have pets will have one or two.
3. Many households may not have pets at all.
Based on the assumption above I will be doing a simulated Sampling to generate data with a sample size of 10000. We can then observe the results and be able to suggest what the distribution looks like and whether the mean or median would best describe a typical observation.
SampleSize<-10000
x <- data.frame("SerialNumber" = 1:SampleSize,
"PetsPerHousehold" = c(
rep(0,.37*SampleSize),
rep(1,.34*SampleSize),
rep(2,.16*SampleSize),
rep(3,.1*SampleSize),
rep(4,.015*SampleSize),
rep(5,0.01*SampleSize),
rep(8,0.005*SampleSize)
),
stringsAsFactors = FALSE)
barplot(table(x$PetsPerHousehold))
We can see from the generated bar chart that the data is right skewed or positive skewed. We can see a long tail towards the right.
boxplot(x$PetsPerHousehold)
summary<-summary(x$PetsPerHousehold)
summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 1.11 2.00 8.00
We can also see from the summary table that the
Mean is 1.11
Median is 1
The IQR is 2
In this situation since a typical value is 1 and the distribution seems to be right skewed it is better that the median be used to represent the typical value and to describe the variability in the data we can use the IQR .
Answer:
1. Very Few People will live Far away from work.
2. Most people will live close to their work not more than 5 miles.
Again using the sample size of 10,000 I will generate a simulated sample data set using R and we can see that using the assumption above how our data looks like.
dfDistanceToWork<-data.frame("SN"=1:SampleSize,
DistanceToWork=c(
rep(20.6,.08*SampleSize), rep(2.5,.20*SampleSize),
rep(3.5,.15*SampleSize),
rep(4.0,.15*SampleSize),
rep(5.0,.1*SampleSize),
rep(1,.26*SampleSize),
rep(40,.02*SampleSize),
rep(14.7,.04*SampleSize)
)
)
barplot(table(dfDistanceToWork$DistanceToWork))
summary(dfDistanceToWork$DistanceToWork)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.500 5.421 4.000 40.000
boxplot(dfDistanceToWork$DistanceToWork)
sd(dfDistanceToWork$DistanceToWork)
## [1] 7.311498
var(dfDistanceToWork$DistanceToWork)
## [1] 53.458
Also we can see above from the summary and box plot that the typical value can be best explained by using the median and the variability in the data can be best represented by using the IQR.
1. The Male heights in Feet are generally very close to eachother.
2. There can be male in various hieghts and difference is genrally not more than a few inches.
Say for example we take a sample size of 10,000 using the assumption above.
dfHeightMales<-data.frame("SN"=1:SampleSize,Height=c(
rep(4.8,.08*SampleSize),
rep(5.5,.16*SampleSize),
rep(4.9,.035*SampleSize),
rep(5.3,.06*SampleSize),
rep(5.2,.185*SampleSize),
rep(5.8,.20*SampleSize),
rep(5.10,.120*SampleSize),
rep(6,.15*SampleSize),
rep(6.5,.01*SampleSize)
)
)
head(dfHeightMales)
## SN Height
## 1 1 4.8
## 2 2 4.8
## 3 3 4.8
## 4 4 4.8
## 5 5 4.8
## 6 6 4.8
barplot(table(dfHeightMales$Height))
summary(dfHeightMales$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.800 5.200 5.500 5.452 5.800 6.500
boxplot(dfHeightMales$Height)
As can be seen from the results of the bar plot the distribution seems to be symmetrical in nature as there is not left or right tail and the data seems to be evenly distributed on both sides of the mean and median. Therefore the distribution is Symmetrical and the mean would be the best way to describe a typical observation and standard deviation can be used to explain the variability in the data.
sd(dfHeightMales$Height)
## [1] 0.391547
var(dfHeightMales$Height)
## [1] 0.1533091