GitHub link: Presentation Files

Problem 2.15:

For each of the following, state whether you expect the distribution to be symmetric, right skewed or left skewed. Also specify whether the mean or median would best represent a typical observation in the data, and whether the variability of observations would be best represented using the standard deviation or IQR. Explain your reasoning.

  1. Number of pets per household.
  2. Distance to work, i.e., number of miles between work and home.
  3. Heights of adult males.

(a) Number of Pets Per Household

Assumptions:

    1. Very few households will have many pets.
    2. Most people who have pets will have one or two.
    3. Many households may not have pets at all.

Simulated Sampling:

Based on the assumptions above, I build a simulated data set of 10,000 households in which each pet count appears in a fixed proportion. We can then look at the results to judge the shape of the distribution and decide whether the mean or the median better describes a typical observation.

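# Number of simulated households
SampleSize<-10000
# Each pet count is repeated for a fixed share of the sample.
# round() guards against a product such as .37*SampleSize landing just
# below a whole number, which rep() would truncate.
x <- data.frame("SerialNumber" = 1:SampleSize, 
                "PetsPerHousehold" = c(
                                      rep(0,round(.37*SampleSize)),
                                      rep(1,round(.34*SampleSize)),
                                      rep(2,round(.16*SampleSize)),
                                      rep(3,round(.1*SampleSize)),
                                      rep(4,round(.015*SampleSize)),
                                      rep(5,round(0.01*SampleSize)),
                                      rep(8,round(0.005*SampleSize))
                                      ),
                stringsAsFactors = FALSE)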

barplot(table(x$PetsPerHousehold))

The bar chart shows that the data are right skewed (positively skewed), with a long tail extending to the right.
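The visual impression of right skew can also be checked numerically. The sketch below rebuilds the same pet counts and computes a moment-based skewness coefficient. The `skewness()` helper is a hypothetical function defined here for illustration; it is not part of base R.

```r
# Rebuild the pets data from the proportions used above.
sample_size <- 10000
pet_values  <- c(0, 1, 2, 3, 4, 5, 8)
pet_props   <- c(0.37, 0.34, 0.16, 0.10, 0.015, 0.01, 0.005)
pets <- rep(pet_values, times = round(pet_props * sample_size))

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^(3 / 2)
}

pet_skew <- skewness(pets)
pet_skew                    # a positive value points to a longer right tail
mean(pets) > median(pets)   # a mean above the median is also typical of right skew
```

A positive coefficient together with a mean above the median agrees with what the bar chart suggests.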

boxplot(x$PetsPerHousehold)

petSummary<-summary(x$PetsPerHousehold)

petSummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    1.11    2.00    8.00

The summary table also shows that:

The mean is 1.11

The median is 1

The IQR is 2 (3rd quartile of 2 minus 1st quartile of 0)

Because the distribution is right skewed, the few households with many pets pull the mean above the typical value of 1. The median is therefore the better measure of a typical observation, and the IQR is the better measure of the variability in the data.
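To illustrate why the median and IQR are preferred for skewed data, the sketch below makes the tail more extreme and compares the statistics before and after. Replacing the households that have 8 pets with households that have 40 pets is a made-up change used only for illustration.

```r
# Pet counts with the same frequencies as the simulated data above.
pets <- rep(c(0, 1, 2, 3, 4, 5, 8),
            times = c(3700, 3400, 1600, 1000, 150, 100, 50))

# Hypothetical change: the 50 households with 8 pets now have 40 pets.
pets_extreme <- pets
pets_extreme[pets_extreme == 8] <- 40

# Collect the four statistics for one vector.
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

compare <- rbind(original = describe(pets), extreme = describe(pets_extreme))
round(compare, 2)
```

The mean and standard deviation move with the tail, while the median and IQR stay where they were, which is the sense in which they are robust.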

(b) Distance to work, i.e., number of miles between work and home.

Answer:

Assumptions:

1. Very few people will live far away from work.
2. Most people will live close to work, no more than 5 miles away.

Simulated Sampling

Again using a sample size of 10,000, I generate a simulated data set in R to see what the data look like under the assumptions above.

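# Each distance (in miles) is repeated for a fixed share of the sample.
# round() guards against a product landing just below a whole number,
# which rep() would truncate.
dfDistanceToWork<-data.frame("SN"=1:SampleSize,
                              DistanceToWork=c(
                                                rep(20.6,round(.08*SampleSize)),
                                                rep(2.5,round(.20*SampleSize)),
                                                rep(3.5,round(.15*SampleSize)),
                                                rep(4.0,round(.15*SampleSize)),
                                                rep(5.0,round(.1*SampleSize)),
                                                rep(1,round(.26*SampleSize)),
                                                rep(40,round(.02*SampleSize)),
                                                rep(14.7,round(.04*SampleSize))
                                                )
                             )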
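The code above repeats each distance a fixed number of times, so the data set is the same on every run. As an alternative, a random draw with the same proportions could be sketched with `sample()`. The seed and the variable names below are arbitrary choices made for this example.

```r
set.seed(2015)  # arbitrary seed so the draw is reproducible

sample_size <- 10000
dist_values <- c(1, 2.5, 3.5, 4.0, 5.0, 14.7, 20.6, 40)
dist_probs  <- c(0.26, 0.20, 0.15, 0.15, 0.10, 0.04, 0.08, 0.02)

# Draw distances with replacement, using the assumed proportions as probabilities.
random_distance <- sample(dist_values, size = sample_size,
                          replace = TRUE, prob = dist_probs)

# Observed proportions should be close to, but not exactly, the assumed ones.
round(prop.table(table(random_distance)), 3)
summary(random_distance)
```

The observed proportions vary slightly from draw to draw, but the overall shape should match the fixed-proportion version.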



barplot(table(dfDistanceToWork$DistanceToWork))

summary(dfDistanceToWork$DistanceToWork)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   3.500   5.421   4.000  40.000
boxplot(dfDistanceToWork$DistanceToWork)

sd(dfDistanceToWork$DistanceToWork)
## [1] 7.311498
var(dfDistanceToWork$DistanceToWork)
## [1] 53.458

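The bar plot and box plot show a long right tail, and the mean (5.421) is well above the median (3.5), so the distribution is right skewed. The median therefore best represents a typical value, and the IQR (4.0 - 1.0 = 3.0) best represents the variability in the data; the standard deviation (7.31) is inflated by the few long commutes.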
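The points drawn beyond the whiskers of the box plot can be reproduced by hand. By default `boxplot()` extends the whiskers to at most 1.5 times the IQR beyond the box, so the sketch below computes that upper fence and lists the distances above it. The variable names are chosen for this example only.

```r
# Distances with the same frequencies as the simulated data above.
dist_values <- c(20.6, 2.5, 3.5, 4.0, 5.0, 1, 40, 14.7)
dist_counts <- c(800, 2000, 1500, 1500, 1000, 2600, 200, 400)
distance <- rep(dist_values, times = dist_counts)

q   <- quantile(distance, c(0.25, 0.75))
iqr <- IQR(distance)

# Upper fence: third quartile plus 1.5 times the IQR.
upper_fence <- unname(q["75%"]) + 1.5 * iqr

flagged       <- sort(unique(distance[distance > upper_fence]))
share_flagged <- mean(distance > upper_fence)

upper_fence    # 8.5
flagged        # 14.7 20.6 40.0
share_flagged  # 0.14
```

About 14% of the simulated commutes lie above the fence, all on the right side, which is another sign of right skew.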

(c) Heights of adult males.

Assumptions:

1. Adult male heights, in feet, are generally very close to each other.
2. Heights vary, but the difference is generally not more than a few inches.

Simulated Sampling

As an example, we again take a sample size of 10,000 under the assumptions above.

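# Each height (in feet) is repeated for a fixed share of the sample.
# round() guards against a product landing just below a whole number,
# which rep() would truncate.
dfHeightMales<-data.frame("SN"=1:SampleSize,Height=c(
                             rep(4.8,round(.08*SampleSize)),
                             rep(5.5,round(.16*SampleSize)),
                             rep(4.9,round(.035*SampleSize)),
                             rep(5.3,round(.06*SampleSize)),
                             rep(5.2,round(.185*SampleSize)),
                             rep(5.8,round(.20*SampleSize)),
                             rep(5.10,round(.120*SampleSize)),  # R reads 5.10 as 5.1
                             rep(6,round(.15*SampleSize)),
                             rep(6.5,round(.01*SampleSize))
                             )
                             )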
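One caveat about the values above: R reads the literal `5.10` as the number 5.1, so the heights are treated as decimal feet. If a value such as 5.10 was meant as 5 feet 10 inches (an assumption; the write-up does not say), a conversion along the following lines would be needed before plotting. The helper `feet_inches_to_feet()` is hypothetical and defined only for this sketch.

```r
# R stores 5.10 and 5.1 as the same number.
5.10 == 5.1   # TRUE

# Hypothetical helper: convert feet and inches to decimal feet.
feet_inches_to_feet <- function(feet, inches) {
  feet + inches / 12
}

ten_inches <- feet_inches_to_feet(5, 10)  # 5 ft 10 in
one_inch   <- feet_inches_to_feet(5, 1)   # 5 ft 1 in
round(c(ten_inches, one_inch), 3)
```

Under that reading, 5 feet 10 inches is about 5.83 decimal feet, well above 5.1.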


head(dfHeightMales)
##   SN Height
## 1  1    4.8
## 2  2    4.8
## 3  3    4.8
## 4  4    4.8
## 5  5    4.8
## 6  6    4.8
barplot(table(dfHeightMales$Height))

summary(dfHeightMales$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.800   5.200   5.500   5.452   5.800   6.500
boxplot(dfHeightMales$Height)

The bar plot shows a roughly symmetric distribution: there is no pronounced left or right tail, and the data are spread about evenly on both sides of the mean (5.452) and the median (5.5), which are close together. The mean is therefore the best way to describe a typical observation, and the standard deviation can be used to describe the variability in the data.

sd(dfHeightMales$Height)
## [1] 0.391547
var(dfHeightMales$Height)
## [1] 0.1533091
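As a further check on the symmetry argument, heights can be drawn from a normal distribution instead of fixed proportions. The sketch below does this with `rnorm()`. The mean of 5.75 feet, the standard deviation of 0.25 feet, and the seed are assumptions chosen for illustration, not values taken from the data above.

```r
set.seed(42)  # arbitrary seed so the draw is reproducible

# Assumed parameters (illustrative only): mean 5.75 ft, sd 0.25 ft.
heights <- rnorm(10000, mean = 5.75, sd = 0.25)

# For a symmetric distribution the mean and median nearly coincide.
height_mean   <- mean(heights)
height_median <- median(heights)
c(mean = height_mean, median = height_median)

# Share of observations within one standard deviation of the mean;
# for a normal distribution this is roughly 68%.
within_1sd <- mean(abs(heights - height_mean) <= sd(heights))
within_1sd
```

When the mean and median agree this closely, the standard deviation is a meaningful summary of spread because no tail dominates it.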