Estimate the proportion & mean height of male/female students

First, we pick the sample for this study and briefly describe what we are working with. The sample size should be around 10% of the population. For the ‘survey’ data set we round this up to 30 rows, from which rows with NA values are dropped.
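The sampling step itself is not shown in this report. A minimal sketch of how such a sample could be drawn is given below, assuming the data come from `MASS::survey` and that only the `Sex` and `Height` columns are needed. The seed is arbitrary, so this will not reproduce the exact sample used in the rest of the report.

```r
library(MASS)  # assumed source of the `survey` data set

set.seed(42)   # arbitrary seed, only so this sketch is repeatable

idx  <- sample(nrow(survey), 30)           # 30 row indices, without replacement
data <- survey[idx, c("Sex", "Height")]    # keep only the columns used below
data <- na.omit(data)                      # drop rows with a missing Sex or Height

nrow(data)                                 # at most 30 rows remain
```

Dropping NA rows after sampling means the final sample can be smaller than 30, which is what happens in this report.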

Next come summary statistics of the sample, a histogram and a density plot of height by sex, and a normal Q-Q plot.

summary(data[data$Sex == 'Male',]$Height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   165.0   172.7   176.0   177.1   180.0   196.0

summary(data[data$Sex == 'Female',]$Height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   157.5   160.0   165.0   165.2   168.9   175.0

summary(data$Height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   157.5   165.0   170.0   170.6   175.0   196.0

library(ggplot2)

ggplot(data, aes(x = Height, colour = Sex)) + geom_histogram(binwidth = .5, alpha = .5, position = "identity")

ggplot(data, aes(x = Height, colour = Sex)) + geom_density()

qqnorm(data$Height)
qqline(data$Height, col = "steelblue", lwd = 2)

The normal Q-Q plot suggests that the sample heights are approximately normally distributed.

Point Estimate of Population Mean for both males and females

datamean = mean(data$Height)
datamean

## [1] 170.5607

mM <- mean(data[data$Sex == 'Male',]$Height)
mF <- mean(data[data$Sex == 'Female',]$Height)
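The two group means can also be obtained in a single call. The sketch below uses `tapply` on the full `MASS::survey` data (an assumption made so the example is self-contained), so its values will differ from `mM` and `mF`, which come from the 30-row sample.

```r
library(MASS)  # assumed source of the `survey` data set

# Keep only rows where both Sex and Height are recorded
complete <- na.omit(survey[, c("Sex", "Height")])

# Mean height per level of Sex, returned as a named vector
group_means <- tapply(complete$Height, complete$Sex, mean)
group_means

# Difference between the male and female means
unname(group_means["Male"] - group_means["Female"])
```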

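With the point estimates in hand, we check whether the sample is large enough to be trusted. A power calculation gives the group size needed to detect the difference between the male and female means with 95% power.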

power.t.test(delta = mM - mF, sd = sd(survey$Height, na.rm = TRUE), power = 0.95)

## 
##      Two-sample t test power calculation 
## 
##               n = 18.91559
##           delta = 11.86644
##              sd = 9.847528
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

The result n = 18.9 means that each group (male and female) needs about 19 people once rounded up, which gives a total sample of about 38. Our sample of 30 is somewhat smaller, so it falls a little short of 95% power, but in my opinion it is close enough to be treated as representative.
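The rounding and the effect of a smaller group can be checked directly. In this sketch `delta` and `sd` are copied from the printed output above; the group size of 14 is an assumption (roughly half of the sample), used only to illustrate how power drops below the 95% target.

```r
# Required size per group for 95% power (delta and sd copied from the output above)
res <- power.t.test(delta = 11.86644, sd = 9.847528, power = 0.95)
n_per_group <- ceiling(res$n)   # always round up: 18.9 people means 19
n_per_group

# Power actually achieved with a hypothetical 14 people per group
achieved <- power.t.test(n = 14, delta = 11.86644, sd = 9.847528)$power
achieved
```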

Interval Estimate of Population Proportion

We start by computing the standard error of the sample proportion, which the margin of error is built from.

gender = data$Sex
k = sum(gender == 'Female')
n = length(gender)
phat = k/n
SE = sqrt(phat*(1-phat)/n); SE

## [1] 0.09234953

The standard error is about 0.09. For a 95% confidence level the upper critical value is the 97.5th percentile of the standard normal distribution, so the margin of error is SE multiplied by qnorm(.975).

error = qnorm(.975)*SE; error

## [1] 0.1810017

Adding and subtracting the margin of error from the sample proportion gives the confidence interval for the proportion of female students.

res = phat + c(-error, error);res

## [1] 0.3707224 0.7327259

The result is as follows: at the 95% confidence level, with a margin of error of about ±18 percentage points, between 37% and 73% of university students are female. The sample size picked for this study is the bare minimum that tells us anything; the bigger the sample, the narrower the interval and the closer the estimate will be to the real proportion.
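As a worked check of the arithmetic, the interval can be written out by hand. The value of the sample proportion is not printed in the output; it is taken here as the midpoint of the printed interval, about 0.5517, and the critical value is rounded to 1.96.

```latex
\[
\hat{p} \pm z_{0.975}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
= 0.5517 \pm 1.96 \times 0.0923
= 0.5517 \pm 0.1810
\;\Rightarrow\; (0.3707,\ 0.7327)
\]
```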

The same estimate computed on the whole surveyed population

gender2 = na.omit(survey$Sex)
k2 = sum(gender2 == 'Female')
n2 = length(gender2)
prop.test(k2, n2, p = NULL, conf.level = 0.95)

## 
##  1-sample proportions test without continuity correction
## 
## data:  k2 out of n2, null probability 0.5
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4367215 0.5632785
## sample estimates:
##   p 
## 0.5

The full-data estimate of 50%, with an interval of roughly 44% to 56%, lies well inside the sample interval of 37% to 73%, so the sample result is consistent with the whole data set.
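One detail worth noting: the interval computed by hand earlier is the simple normal-approximation (Wald) interval, while `prop.test` reports a Wilson score interval, so the two methods would not give identical limits on the same data. The sketch below, assuming the data come from `MASS::survey`, computes the Wilson interval by hand and compares it with `prop.test`. The variable names are illustrative only.

```r
library(MASS)  # assumed source of the `survey` data set

sex_all <- na.omit(survey$Sex)
k_all   <- sum(sex_all == "Female")   # number of female students
n_all   <- length(sex_all)            # students with a recorded sex
p_all   <- k_all / n_all
z       <- qnorm(0.975)               # critical value for 95% confidence

# Wilson score interval: shrunken centre plus/minus an adjusted half-width
centre <- (p_all + z^2 / (2 * n_all)) / (1 + z^2 / n_all)
half   <- z * sqrt(p_all * (1 - p_all) / n_all + z^2 / (4 * n_all^2)) /
          (1 + z^2 / n_all)
wilson <- centre + c(-half, half)

# Built-in interval without continuity correction, for comparison
builtin <- as.numeric(prop.test(k_all, n_all, correct = FALSE)$conf.int)
all.equal(wilson, builtin)
```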

Interval estimate of Population Mean

Using the t distribution we can quickly compute a 95% confidence interval for the population mean.
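The quantity computed in the code that follows is the standard two-sided t interval. In the notation assumed here, \bar{x} is the sample mean, s the sample standard deviation, and n the number of non-missing heights.

```latex
\[
\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\]
```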

library(distributions3)

## Warning: package 'distributions3' was built under R version 4.0.5

## 
## Attaching package: 'distributions3'

## The following objects are masked from 'package:stats':
## 
##     Gamma, quantile

## The following object is masked from 'package:grDevices':
## 
##     pdf

n3 <- length(data$Height)
T_n <- StudentsT(df = n3 - 1)
lower = mean(data$Height) + quantile(T_n, 0.05 / 2) * sd(data$Height) / sqrt(n3); lower

## [1] 167.1345

upper = mean(data$Height) + quantile(T_n, 1 - 0.05 / 2) * sd(data$Height) / sqrt(n3); upper

## [1] 173.9868

The result: with 95% confidence we can say that the mean height lies between 167.13 cm and 173.99 cm. To check this, let’s use the built-in t-test.

t.test(data$Height, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  data$Height
## t = 101.97, df = 28, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  167.1345 173.9868
## sample estimates:
## mean of x 
##  170.5607

The interval is exactly the same, so the manual calculation is confirmed. The mean for the whole population is 171, which lies inside the interval, so the estimate based on our sample agrees with it.
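The same interval can be computed without the distributions3 package, using the base R quantile function `qt`. The sketch below is one possible way to do it, run on all recorded heights in `MASS::survey` (an assumption made so the example is self-contained). `t_interval` is a hypothetical helper name, not part of any package.

```r
library(MASS)  # assumed source of the `survey` data set

heights <- survey$Height[!is.na(survey$Height)]   # all recorded heights

# Hypothetical helper: two-sided t confidence interval for a mean
t_interval <- function(x, conf.level = 0.95) {
  n     <- length(x)
  alpha <- 1 - conf.level
  # critical t value times the standard error of the mean
  half_width <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
  mean(x) + c(-half_width, half_width)
}

manual  <- t_interval(heights)
builtin <- as.numeric(t.test(heights, conf.level = 0.95)$conf.int)
all.equal(manual, builtin)
```

Using `df = n - 1` inside the function, with `n` taken from the vector itself, avoids depending on a variable defined elsewhere in the session.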

Point & Interval Estimation

Krystian

14 11 2021
