Clear all data from your space

rm(list=ls())

Set your working directory to your own folder

setwd("H:/CSSS321A/hw3")

load dataset without outlier

load("H:/CSSS321A/hw3/county.Rdata")

Q1.Sampling distribution

average_wage<-county$per_capita_income[complete.cases(county$per_capita_income)]
n=length(average_wage)
n
## [1] 3012
summary(average_wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10770   18920   21580   21820   24340   33390
sd(average_wage)
## [1] 4095.516

Q1.a Drawing samples of 100 observations from the population and its distribution

set.seed(10000)
sample100.1<-sample(average_wage, 100, replace=FALSE)
sum(complete.cases(sample100.1))
## [1] 100
summary(sample100.1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13690   19200   21320   21750   24160   33290
sd(sample100.1)
## [1] 3992.207

The numerical summary of this sample is slightly different from that of the full complement of cases. First, the sample mean is smaller than the population mean to a very small extent. Second, the standard deviation is relatively small in the sample. Third, the range of this sample is also smaller than the population due to a relatively larger minimum and smaller maximum.

Draw the second sample

set.seed(9000)
sample100.2<-sample(average_wage, 100, replace=FALSE)

Draw the third sample

set.seed(8000)
sample100.3<-sample(average_wage, 100, replace=FALSE)

plot sample means on the histogram

hist(average_wage)
abline(v=mean(average_wage), lwd=2, col="black")
abline(v=mean(average_wage)+sd(average_wage),lwd=8,col="black")
abline(v=mean(average_wage)-sd(average_wage),lwd=8,col="black")
abline(v=mean(sample100.1),lwd=2,col="blue")
abline(v=mean(sample100.2),lwd=2,col="blue")
abline(v=mean(sample100.3),lwd=2,col="blue")
average_mean100<-mean(c(mean(sample100.1),mean(sample100.2),mean(sample100.3)))
abline(v=average_mean100,lwd=2,col="red")

According to the above histogram, means of samples are very close to the population mean (within one standard deviation). The average of sample means is more close to the true mean than two of the sample means because of the central limit theorem. The variance of the sample means will be the variance of the population divided by the sample size.Thus, sampling distribution will get a relatively small variance. Also, the mean of sampling distribution is the true mean. Thus, the average of the means of my random samples is closer to true mean.

Q1.b Larger samples

set.seed(10000)
sample500.1<-sample(average_wage, 500, replace=FALSE)
set.seed(90000)
sample500.2<-sample(average_wage, 500, replace=FALSE)
set.seed(80000)
sample500.3<-sample(average_wage, 500, replace=FALSE)
hist(average_wage)
abline(v=mean(average_wage), lwd=2, col="black")
abline(v=mean(average_wage)+sd(average_wage),lwd=8,col="black")
abline(v=mean(average_wage)-sd(average_wage),lwd=8,col="black")
abline(v=mean(sample500.1),lwd=2,col="orange")
abline(v=mean(sample500.2),lwd=2,col="orange")
abline(v=mean(sample500.3),lwd=2,col="orange")
average_mean500<-mean(c(mean(sample500.1),mean(sample500.2),mean(sample500.3)))
abline(v=average_mean500,lwd=2,col="blue")

As the above histogram, means of samples are very close to the population mean (within one standard deviation). Also, they locate closer to the true mean than those of samples with 100 observations.If I took smaller samples, the results would be opposite.The average of sample means and the true mean is somewhat overlapping. The average of sample means is closer to the ture mean due to the central limit theorem. The variance of the sample means will be the variance of the population divided by the sample size.Thus, sampling distribution will get a relatively small variance. Also, the mean of sampling distribution is the true mean. Thus, the average of the means of my random samples is closer to true mean.The variation in the estimates of the mean results from variation among observations.If some observations are drawn repeatly into multiple samples, the variation in estimates of the mean would decrease due to the stable of the composition of sample. It doesn’t matter if there are some ouliers in the population as long as we randomly draw samples.

Q2 Standard errors and confidence intervals

mean<-mean(sample500.1, na.rm=T)
mean
## [1] 21633.61
n<-sum(complete.cases(sample500.1))
n
## [1] 500
std.error<-sd(sample500.1,na.rm=TRUE)/sqrt(n)
std.error
## [1] 178.7684

There are 500 cases in my sample. The mean is 21,633.61 with a standard error of 178.7684. The standard error of the mean is standard deviation of sampling distribution. It means how sample mean deviate from the true mean.

margin.error<-1.96*std.error
upper<-mean+margin.error
lower<-mean-margin.error
c(lower,upper)
## [1] 21283.22 21983.99

The confidence interval around my point estimate is from 21,283.22 to 21,983.99.Thus, I am 95% confident that the true mean is between 21,283.22 and 21,983.99 dollars per year.There are three conditions that we should meet to get valid confidence interval: 1.The sample observations are independent.Since the sample is randomly selected and consists of fewer than 10% of the population, this requiremnet has been met.2. The sample size is larger than 30. 3. According to previous analysis, the population distribution would be approximately normal without outliers.These conditions are crucial for the precision of standard error and confidence interval since they are assure of the normal distribution of sample means and the sufficient sample size to estimate the SE.

margin.error<-2.58*std.error
upper<-mean+margin.error
lower<-mean-margin.error
c(lower,upper)
## [1] 21172.38 22094.83

The 99% confident interval is from 21,172.38 to 22,094.83 which is wider than the 95% confident interval. Since it has a larger range, I can be more certain that I capture the true mean. In other words, I am 99% confident that the true mean lies between 21,172.38 and 22,094.83.

Q3 Create a binary variable

summary(county$bachelors)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3.70   12.90   16.40   17.72   21.30   36.80     156
hist(county$bachelors)

n<-sum(complete.cases(county$bachelors))
n
## [1] 2987

Since my original independent variable is quantitative and it is positively skewed, I’d like to create a binary variable with a cutoof point of 16.40 %.

cutoff<-16.4
cutoff
## [1] 16.4
county$bachelors_binary<-NA
county$bachelors_binary[which(county$bachelors<cutoff)]=0
county$bachelors_binary[which(county$bachelors>=cutoff)]=1
mean(county$bachelors_binary,na.rm=TRUE)
## [1] 0.5068631
barplot(table(county$bachelors_binary,useNA ="no"))

n_bachelors_binary0<-sum(complete.cases(county$bachelors_binary[which(county$bachelors<cutoff)]))
n_bachelors_binary0
## [1] 1473
n_bachelors_binary1<-2987-n_bachelors_binary0
n_bachelors_binary1
## [1] 1514

The mean of my binarary variable is 0.5068631.There are 1473 cases in the category with a percentage of bachelors that is lower than 16.4%. Another category has 1514 cases with a percentage of bachelors that is equal to or higher than 16.4%.

Q4 Statistically assessing the difference in means: using a t-test I think, the group with a percentage of bachelors that is equal to or higher than 16.4% will have a higher mean of my dependent variable(average wage).First, workers with higher education experience have stronger bargain power and are likely to ask for higher wage. Second, highly educated workers tend to work for highly skilled positions which offer more salary. Third, the high payment received by highly educated workers leverage the average wage. Thus, the higher percentage of bachelor, the higher average wages. In this case, my null hypothesis is that the mean of average wage between these two groups remains the same. My alternative hypothesis is that the mean of average wage of the group with higher percentage of bachelors is larger than that of another group. Thus, this is an one-tailed test.

hist(county$per_capita_income[which(county$bachelors_binary==0)], col=rgb(1,0,0,0.3))
hist(county$per_capita_income[which(county$bachelors_binary==1)],add=TRUE,col=rgb(0,0,1,0.3))

boxplot(county$per_capita_income~county$bachelors_binary,names = c("lower group","higher group"))

According to the histogram, eventhough there are some overlapping areas, the histogram of the group of counties with lower percentage of bachelors locates at the left side of the x-axis to a large extent that indicates a relatively low average income. Oppositely, the histogram of counties with higher percentage of bachelors locates at the left side of the x-axis that shows a higher average income. As the boxplot, the median of group with lower percent of bachelors is approximately 5,000 less than that of another group.The median of both groups are in the middle of the box that indicates these two groups are normally distributed. Also, the samples are randomly selected that shows the indenpendence of observation. Thus, my variables meet the necessary conditions to run t-test.

mean0<-mean(county$per_capita_income[which(county$bachelors_binary==0)],na.rm=TRUE)
mean1<-mean(county$per_capita_income[which(county$bachelors_binary==1)],na.rm=TRUE)
mean.diff<-mean0-mean1
mean0
## [1] 19322.61
mean1
## [1] 23993.77
mean.diff
## [1] -4671.156

The difference in means of the two group is -4,671.156, which shows a larger mean in group with more people with bachelors’ degree. It is the same direction as I expected.

sd0<-sd(county$per_capita_income[which(county$bachelors_binary==0)],na.rm=TRUE)
sd1<-sd(county$per_capita_income[which(county$bachelors_binary==1)],na.rm=TRUE)
n0<-sum(complete.cases(county$per_capita_income[which(county$bachelors_binary==0)]))
n0
## [1] 1470
n1<-sum(complete.cases(county$per_capita_income[which(county$bachelors_binary==1)]))
n1
## [1] 1466
se<-sqrt(sd0^2/(n0-1)+sd1^2/(n1-1))
se
## [1] 118.0222

The standard error of the difference in sample means shows the standard deviation for observations(mean difference) derived from the point estimate, zero in this case.

margin.error.t<-qt(0.95,df=min(c(n0-1,n1-1)))*se
upper<-mean.diff+margin.error.t
lower<-mean.diff-margin.error.t
c(lower,upper)
## [1] -4865.408 -4476.904

The confidence interval around the mean difference is from -4,865.408 to -4,476.904. I use the value of .95 to construct 95% confidence interval since it is a one-tailed test and 95% of the mean difference locate at the left side of the cutoff point.I have 95% confidence that the mean difference of average wages between these two samples should fill in the range from -4,865.408 to -4,476.904.

t<-(mean.diff-0)/se
t
## [1] -39.57862
p_value<-pt(-abs(t),df=min(c(n0-1,n1-1)))
p_value
## [1] 6.704416e-234
df<-n0+n1-2
df
## [1] 2934

The t-value is -39.57862 with the degree of freedom of 2934. The associated p-value shows the probability that the difference between two groups in terms of dependent variable is due to chance.Since the obtained t is extremely large and the p-value is relatively small, I reject the null hypothesis.

t.test(county$per_capita_income~county$bachelors_binary,alternative="less")
## 
##  Welch Two Sample t-test
## 
## data:  county$per_capita_income by county$bachelors_binary
## t = -39.592, df = 2853.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -4477.03
## sample estimates:
## mean in group 0 mean in group 1 
##        19322.61        23993.77

The result of runing R function is the same as the above calculation. Thus, the statistical analysis help me become more certain about my hypothesis, which is that counties with higher percentage of bachelors tend to have higher level of average wage. The results may change once we change the cutoff point to create binary variable.The whole process is struggling,especially the binary variable part, but I learned a lot from it. Now, I am considering where I should go next. If I should chose another cutoff point to re-run the test or something else.