First, let us pick the sample for our research and give a short description of what we are working with. The sample size should be around 10% of the population. For the ‘survey’ data set, let’s round that up to 30 observations, from which rows with NA values will be dropped.
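The sampling step itself is not shown in this report, so the following is only a minimal sketch of how such a sample could be drawn. It assumes the data come from the `survey` data set in the MASS package and that only `Sex` and `Height` are needed; the seed is hypothetical and is there just to make the draw repeatable.

```r
library(MASS)  # provides the 'survey' data set

set.seed(42)  # hypothetical seed, only to make the draw reproducible

# Draw 30 rows at random, keep the two columns used later, then drop NAs
idx  <- sample(nrow(survey), 30)
data <- na.omit(survey[idx, c("Sex", "Height")])

nrow(data)  # at most 30; fewer if any sampled row had a missing value
```

A different seed gives a different sample, so the numbers reported in the rest of this document would not be reproduced exactly by this sketch.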
Now the sample data description: summary statistics, a histogram and density plot of the sample, and a Q-Q plot.
summary(data[data$Sex == 'Male',]$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   165.0   172.7   176.0   177.1   180.0   196.0
summary(data[data$Sex == 'Female',]$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   157.5   160.0   165.0   165.2   168.9   175.0
summary(data$Height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   157.5   165.0   170.0   170.6   175.0   196.0
ggplot(data, aes(x=Height, colour=Sex)) + geom_histogram(binwidth=.5, alpha=.5, position="identity")
ggplot(data, aes(x=Height, colour=Sex)) + geom_density()
qqnorm(data$Height)
qqline(data$Height, col = "steelblue", lwd = 2)
The normal Q-Q plot shows us that the data is mostly normally distributed.
Point Estimate of Population Mean for both males and females
datamean = mean(data$Height)
datamean
## [1] 170.5607
mM <- mean(data[data$Sex == 'Male',]$Height)
mF <- mean(data[data$Sex == 'Female',]$Height)
Once the point estimates are found, the sample should be checked to see whether it is large enough to be representative.
power.t.test(delta = mM - mF, sd = sd(survey$Height, na.rm = TRUE), power = 0.95)
##
## Two-sample t test power calculation
##
## n = 18.91559
## delta = 11.86644
## sd = 9.847528
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
##
## NOTE: n is number in *each* group
The result n = 18.9 tells us that each group (male and female) should contain 19 people after rounding up, which gives a total sample of 38. Our sample of 30 is somewhat smaller than that, but in my opinion it is close enough to be treated as representative.
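To see how far a sample of 30 falls short of the 0.95 target, the calculation can be turned around: fix the group size and let `power.t.test` solve for the power. This is only a rough sketch. It reuses the `delta` and `sd` values printed in the output above and assumes, for simplicity, an even split of 15 people per group, which is not exactly how the real sample divides.

```r
# Values copied from the power.t.test output above
delta <- 11.86644   # difference between male and female mean height
sigma <- 9.847528   # standard deviation of Height

# Assumption: 30 observations split evenly, i.e. 15 per group
achieved <- power.t.test(n = 15, delta = delta, sd = sigma, sig.level = 0.05)
achieved$power  # lower than the 0.95 target
```

Leaving `power` unset tells `power.t.test` to compute it from the other arguments.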
Now computing the margin of error
gender = data$Sex
k = sum(gender == 'Female')
n = length(gender)
phat = k/n
SE = sqrt(phat*(1-phat)/n); SE
## [1] 0.09234953
The standard error is about 0.09. For a 95% confidence level we need the 97.5th percentile of the standard normal distribution, which cuts off the upper 2.5% tail. To compute the margin of error, SE must be multiplied by qnorm(.975)
error = qnorm(.975)*SE; error
## [1] 0.1810017
Combining it with the sample proportion gives the confidence interval for the proportion of university students who are female
res = phat + c(-error, error); res
## [1] 0.3707224 0.7327259
The result is as follows: at the 95% confidence level, with a margin of error of ±18 percentage points, between 37% and 73% of university students are female. Obviously the sample size picked for this research is the bare minimum that tells us anything; the bigger the sample, the closer the result will be to the real proportion.
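The steps above can be wrapped in one small helper. This is a sketch of the normal-approximation (Wald) interval, not a library function; the name `wald_ci` is made up here, and the counts 16 out of 29 are an assumption inferred from the printed standard error and interval rather than values shown in the report.

```r
# Wald (normal-approximation) confidence interval for a proportion
wald_ci <- function(k, n, conf = 0.95) {
  phat <- k / n                          # sample proportion
  se   <- sqrt(phat * (1 - phat) / n)    # standard error of the proportion
  z    <- qnorm(1 - (1 - conf) / 2)      # e.g. qnorm(.975) for 95%
  c(lower = phat - z * se, upper = phat + z * se)
}

# Assumed counts: 16 females out of 29 observations
ci <- wald_ci(16, 29)
round(ci, 4)
```

Changing `conf` shows how the interval widens for higher confidence levels, for example `wald_ci(16, 29, conf = 0.99)`.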
Test done on whole population that was surveyed
gender2 = na.omit(survey$Sex)
k2 = sum(gender2 == 'Female')
n2 = length(gender2)
prop.test(k2, n2, p = NULL, conf.level = 0.95)
##
## 1-sample proportions test without continuity correction
##
## data: k2 out of n2, null probability 0.5
## X-squared = 0, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4367215 0.5632785
## sample estimates:
## p
## 0.5
The interval for the whole surveyed population, roughly 44% to 56%, lies entirely inside the sample interval of 37% to 73%, so the two results agree within the margin of error.
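One detail worth knowing when comparing the two intervals: `prop.test` reports a Wilson score interval, not the simple normal-approximation interval computed by hand earlier. The sketch below rebuilds the Wilson interval and compares it with `prop.test`. The helper name `wilson_ci` is made up here, and the counts 118 out of 236 are an assumption that is consistent with the printed output (p = 0.5), not values shown in the report.

```r
# Wilson score interval for a proportion (no continuity correction)
wilson_ci <- function(k, n, conf = 0.95) {
  z      <- qnorm(1 - (1 - conf) / 2)
  phat   <- k / n
  centre <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(centre - half, centre + half)
}

# Assumed counts: 118 females out of 236 non-missing answers
by_hand  <- wilson_ci(118, 236)
built_in <- as.numeric(prop.test(118, 236, correct = FALSE)$conf.int)
all.equal(by_hand, built_in)
```

For a sample this large and a proportion near 0.5 the Wilson and Wald intervals are very close, so the comparison in the text is not affected.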
With a t-test we can quickly obtain an interval estimate of the population mean at the 95% confidence level
library(distributions3)
## Warning: package 'distributions3' was built under R version 4.0.5
##
## Attaching package: 'distributions3'
## The following objects are masked from 'package:stats':
##
## Gamma, quantile
## The following object is masked from 'package:grDevices':
##
## pdf
n3 <- length(data$Height)
T_n <- StudentsT(df = n3 - 1)
lower = mean(data$Height) + quantile(T_n, 0.05 / 2) * sd(data$Height) / sqrt(n3); lower
## [1] 167.1345
upper = mean(data$Height) + quantile(T_n, 1 - 0.05 / 2) * sd(data$Height) / sqrt(n3); upper
## [1] 173.9868
The answer: with 95% confidence we can say that the mean height lies somewhere between 167.135 cm and 173.987 cm. To check it, let’s use the built-in t-test
t.test(data$Height, conf.level = 0.95)
##
## One Sample t-test
##
## data: data$Height
## t = 101.97, df = 28, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 167.1345 173.9868
## sample estimates:
## mean of x
## 170.5607
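The same interval can also be computed without `distributions3`, using `qt()` from base R. The sketch below is an illustration only: `t_ci` is a made-up helper name, and the heights are hypothetical values, not the survey sample, so the printed numbers will differ from the ones in this report.

```r
# t-based confidence interval for a mean, using only base R
t_ci <- function(x, conf = 0.95) {
  x  <- x[!is.na(x)]                         # drop missing values
  n  <- length(x)
  se <- sd(x) / sqrt(n)                      # standard error of the mean
  tq <- qt(1 - (1 - conf) / 2, df = n - 1)   # upper t quantile
  c(mean(x) - tq * se, mean(x) + tq * se)
}

# Hypothetical heights in cm, not taken from the survey sample
heights  <- c(165, 172.7, 176, 180, 157.5, 160, 168.9, 175, 170, 196)
by_hand  <- t_ci(heights)
built_in <- as.numeric(t.test(heights, conf.level = 0.95)$conf.int)
all.equal(by_hand, built_in)
```

Because the t distribution is symmetric, a single upper quantile is enough here; the lower bound simply subtracts the same margin.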
And the answer is exactly the same, so the hand-computed interval is correct. The mean for the whole surveyed population is 171 cm, which falls inside the interval, so the result based on our sample agrees with it.