Goals

In completing this assignment, you will:

Instructions

See the instructions in Assignments 1 and 2, and the grading_instructions.Rmd.

Question 1

Define the following terms in a sentence (or short paragraph) and state a formula if appropriate.

  1. Hypothesis test
  2. Point estimate
  3. Interval estimate
  4. Sampling distribution
  5. Level of confidence

Answer

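i. A hypothesis test is a statistical procedure that uses sample data to decide whether to reject a null hypothesis \(H_0\) about a population parameter in favour of an alternative \(H_1\).

ii. A point estimate is a single value, calculated from sample data, that serves as a “best estimate” of an unknown parameter; for example, the sample mean \(\bar x\) is a point estimate of \(\mu\).

iii. An interval estimate uses sample data to calculate an interval of plausible values for a parameter of interest, for example \(\bar x \pm z_{\alpha/2}\,\sigma/\sqrt{n}\).

iv. A sampling distribution is the probability distribution of a sample statistic when the statistic is viewed as a random variable across repeated samples.

v. The level of confidence, \(1-\alpha\), is the proportion of confidence intervals that, when constructed under repeated random sampling, contain the true parameter.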
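
The repeated-sampling idea behind the level of confidence can be illustrated with a small simulation. The sketch below assumes a normal population with hypothetical values (mean 50, standard deviation 10) and counts how often a 95% interval covers the true mean; the seed, names, and settings are illustrative and not part of the assignment.

```r
set.seed(123)              # hypothetical seed, only for reproducibility
mu    <- 50                # assumed true mean (illustrative)
sigma <- 10                # assumed true standard deviation (illustrative)
n     <- 30                # sample size per replication
reps  <- 2000              # number of repeated samples

covers <- replicate(reps, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  me <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)   # margin of error
  (mean(x) - me <= mu) && (mu <= mean(x) + me)    # TRUE if the interval covers mu
})

coverage <- mean(covers)   # share of intervals that contain mu
coverage
```

The share of covering intervals should land close to 0.95, although the exact value depends on the seed and the number of replications.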

Question 2

If the income in a community is normally distributed, with a mean of 38,000 and a standard deviation of 6,000, what maximum income does a member of the community have to earn in order to be in the bottom 5%? What is the maximum income one can have and still be in the middle 60%?

Use the normal tables in the back of the book, or your McMaster-approved calculator. Explain the steps you took in getting the answers, and state your answers.

Answer

i. Find the z-value for the bottom 5%.

5% = 0.05, so we need the income \(x\) such that

\(P(X \le x) = 0.05\)

Look up 0.05 in the body of the normal table.

The z-score is -1.645: 0.05 does not appear exactly in the table, but it lies between the entries for -1.64 and -1.65.

Plug the z-score into the standardization equation

\(z = (X - \mu)/\sigma\)

and solve for the income at the 5th percentile:

\((x - 38,000)/6000 = -1.645\)

\(x - 38,000 = -1.645 \times 6000\)

\(x = 38,000 - 9870\)

\(x = 28,130\)

28,130 is the maximum income for someone to be considered in the bottom 5% of earners.

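ii. Let X2 be the maximum income one can have and still be in the middle 60% of the population, and let X3 be the minimum income one can have and be in the middle 60%. The middle 60% leaves 20% in each tail, so X3 is the 20th percentile and X2 is the 80th percentile.

\(P(X3 \le X \le X2) = 0.6\)

\(P(Z3 \le Z \le Z2) = 0.6\)

In the normal table, 0.80 lies between the entries for 0.84 and 0.85, so

\(Z3 = -0.845\) and \(Z2 = 0.845\)

\(Z2 = (X2 - \mu)/\sigma\)

\((X2 - 38000)/6000 = 0.845\)

\(X2 = 38000 + 0.845 \times 6000 = 43,070\)

The maximum income one can have and still be within the middle 60% is 43,070.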

Question 3

Now, answer the previous question again using R, using the dnorm, pnorm, and qnorm functions (whichever is appropriate).

Make sure to compare the two answers. Did you find the same answer twice?

Do not forget to add text before and after every code chunk!

Answer

To answer Question 2 using R, substitute the cumulative probability, the mean, and the standard deviation into the function qnorm. For part i the cumulative probability is 0.05. For part ii the upper end of the middle 60% is the 80th percentile, so the cumulative probability is 0.80.

(i<-qnorm(0.05,38000, 6000))
## [1] 28130.88
(ii<-qnorm(0.80, 38000, 6000))
## [1] 43049.73

i) R gives $28,130.88 as the maximum income to be within the bottom 5 percent, which coincides with the table answer of 28,130 in Question 2. ii) R gives $43,049.73 as the maximum income to be within the middle 60%, which is close to the table answer of 43,070. The small difference arises because the table value 0.845 only approximates the exact z-value of about 0.8416.
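
One way to check the middle-60% bounds is to compute both ends with qnorm and then confirm with pnorm that the probability between them is 0.60. This is a minimal sketch; the variable names are assumptions chosen for illustration.

```r
mu    <- 38000
sigma <- 6000

# The middle 60% leaves 20% in each tail
bounds <- qnorm(c(0.20, 0.80), mean = mu, sd = sigma)
bounds

# Check: the probability mass between the two bounds should be 0.60
mass <- pnorm(bounds[2], mu, sigma) - pnorm(bounds[1], mu, sigma)
mass
```

Because the normal distribution is symmetric, the two bounds sit at equal distances below and above the mean of 38,000.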

Question 4

Suppose that the number of hours per week of lost work due to illness in a certain automobile assembly plant is approximately normally distributed, with a mean of 40 hours and a standard deviation of 15 hours. For a given week, selected at random, what is the probability that:

  1. The number of lost work hours will exceed 70 hours?
  2. The number of lost work hours will be between 30 and 45 hours?
  3. The number of lost work hours will be exactly 50 hours?

Use R code to answer each question.

Answer

i. pnorm(70,40,15) gives the probability that lost work is at most 70 hours; subtracting it from 1 gives the probability that lost work will exceed 70 hours.

1-pnorm(70,40,15)
## [1] 0.02275013

The probability that lost work hours will exceed 70 hours is 0.0228, or about 2.28%.

ii. To find the probability of between 30 and 45 lost hours, subtract the cumulative probability at 30 hours from the cumulative probability at 45 hours, using pnorm(45 hours)-pnorm(30 hours)

pnorm(45,40,15)-pnorm(30,40,15)
## [1] 0.3780661

iii. The probability is 0. Because X is continuous, any single value has zero probability; the density at 50, dnorm(50,40,15), is the height of the curve and not a probability.
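
The difference between a density and a probability at a single point can be sketched in R. The example below evaluates the density at 50 hours and then shows the probability of ever narrower intervals around 50; the interval half-widths in eps are arbitrary illustrative choices.

```r
mu    <- 40
sigma <- 15

# Density at exactly 50 hours: a height of the curve, not a probability
dens <- dnorm(50, mean = mu, sd = sigma)

# Probability of landing within +/- eps of 50 shrinks toward 0 as eps shrinks
eps   <- c(1, 0.1, 0.01, 0.001)
probs <- pnorm(50 + eps, mu, sigma) - pnorm(50 - eps, mu, sigma)
data.frame(eps = eps, prob = probs)
```

As the interval closes in on the single value 50, the probability approaches zero even though the density stays positive.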

Question 5

A senator claims that 58% of her constituents favour her voting policies over the past year. In a random sample of 50 of these people, the sample proportion of those who favoured her voting policies was only 0.4. Is this enough evidence to make the senator’s claims strongly suspect? (Hint: Use a normal approximation to the binomial distribution, then construct a confidence interval).

It is up to you whether you want to compute this with R or with other methods. In the latter case, please explain the steps you took to arrive at your answer, and which methods you used to obtain the numerical answers.

Answer

Let \(p\) denote the proportion of constituents favouring her voting policies. The appropriate hypotheses as per the claim are

\(H_0: p = 0.58\)

\(H_1: p \neq 0.58\)

Sample proportion: \(\hat P = 0.4\)

Sample size: \(n = 50\)

The confidence interval for the true proportion is

\(\hat P \pm z \sqrt{\hat P(1-\hat P)/n}\)

Critical value: \(z_{0.025} = 1.96\)

The 95% confidence interval is

\(0.4 \pm 1.96 \sqrt{0.4(1-0.4)/50}\)

\(= 0.4 \pm 0.136\)

\(= (0.264, 0.536)\)

0.58 does not fall within the 95% confidence interval, so the sample gives enough evidence to make the senator’s claim strongly suspect; we reject \(H_0\) at the 5% significance level.
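
The interval can be reproduced in R as a check. The sketch below also computes a z statistic using the standard error under the claimed proportion, which is an additional step not taken in the answer above; all variable names are assumptions.

```r
p0    <- 0.58   # claimed proportion
p_hat <- 0.40   # sample proportion
n     <- 50     # sample size

# 95% confidence interval based on the sample proportion
z_crit <- qnorm(0.975)
me     <- z_crit * sqrt(p_hat * (1 - p_hat) / n)
ci     <- c(lower = p_hat - me, upper = p_hat + me)
round(ci, 3)

# Test statistic using the standard error under the claimed proportion
z_stat  <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z_stat))   # two-sided p-value
c(z = z_stat, p_value = p_value)
```

Both routes point the same way: the claimed 0.58 lies above the interval, and the two-sided p-value is below 0.05.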

Question 6

I wish to estimate the proportion of defectives in a large production lot to within plus or minus \(D=0.02\) of the true proportion, with a 99% level of confidence. From past experience it is believed that the true proportion of defectives is \(\pi=0.02\). How large a sample must be used? (Hint: Use a normal approximation for the sample proportion \(\hat P\).)

Use R as your calculator, and walk us through the steps in your calculations.

Answer

The sample size \(n\) is calculated according to the formula \(n = z^2 p(1-p)/e^2\)

where z = 2.576 is the critical value for a 99% level of confidence, p is the believed proportion (expressed as a decimal), and e is the margin of error

z=2.576, p= 0.02, e= 0.02

2.576^2*0.02*(1-0.02)/0.02^2
## [1] 325.153

Because the sample size must be a whole number, we round up: the sample should contain at least 326 items.
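
The same calculation can be sketched with the critical value taken from qnorm instead of the rounded table value, and with ceiling applied to round up. The variable names are assumptions for illustration.

```r
conf_level <- 0.99
p <- 0.02   # believed proportion of defectives
e <- 0.02   # desired margin of error (D in the question)

z <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value, about 2.576
n_raw <- z^2 * p * (1 - p) / e^2
n_required <- ceiling(n_raw)           # round up to a whole unit
c(z = z, n_raw = n_raw, n_required = n_required)
```

Rounding up rather than to the nearest integer ensures the margin of error is no larger than the target.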

Question 7

A cereal company checks the weight of its breakfast cereal by randomly checking 62 of the boxes. This particular brand is packed in 20-ounce boxes. Suppose that a particular random sample of 62 boxes results in a mean weight of 20.02 ounces. How often will the sample mean be this high, or higher, if \(\mu=20\) and \(\sigma=0.10\)?

It is up to you whether you want to compute this with R or with other methods. In the latter case, please explain the steps you took to arrive at your answer, and which methods you used to obtain the numerical answers.

Answer

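We need to find \(P(\bar x \ge 20.02)\). With \(n = 62\) the sampling distribution of \(\bar x\) is approximately normal with mean \(\mu\) and standard error \(\sigma/\sqrt{n}\), so we convert \(\bar x\) to z

\(= P(z \ge (20.02-20)/(0.10/\sqrt{62}))\)

\(= P(z \ge 1.57)\)

\(= 1 - P(z < 1.57)\)

\(= 1 - 0.9418\)

\(= 0.0582\)

The sample mean will be this high, or higher, in about 5.8% of samples.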
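
The same probability can be computed in R without rounding z to two decimals. This is a minimal sketch with assumed variable names; it uses the standard error of the sample mean as the standard deviation in pnorm.

```r
mu    <- 20
sigma <- 0.10
n     <- 62
x_bar <- 20.02

se <- sigma / sqrt(n)   # standard error of the sample mean
z  <- (x_bar - mu) / se
p_upper <- pnorm(x_bar, mean = mu, sd = se, lower.tail = FALSE)
c(z = z, p_upper = p_upper)
```

The result is slightly below the table-based 0.0582 because the table lookup rounds z down to 1.57.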

Question 8

In this question, and the next two, you will work with data from the Current Population Survey (CPS), see more information at this Wikipedia page.

You will use a slice of the CPS data provided by the AER package. Make sure you run the code in the following code chunk once, so that the AER package is available on your system or in your Rstudio.cloud project.

install.packages("AER")

The following code chunk loads the AER package and the data set CPSSW9204 that we are going to use.

library(AER)
data("CPSSW9204")

For more information about the data set, you can, for example, ask ?CPSSW9204. Or you can run the code in the following code chunk (don’t remove eval=FALSE: it would print all the data to the PDF when you knit!).

CPSSW9204

The sample size is 15588. The following frequency table for year reveals that roughly 7600 observations were collected in 1992, and 7986 were collected in 2004:

table(CPSSW9204$year)
## 
## 1992 2004 
## 7602 7986

We take the 1992 observations and store them in a new data set CPS92:

CPS92 <- subset(CPSSW9204, year == 1992)

For the 1992 data, we can use a frequency table to look at the distribution of education of the employees in the sample:

table(CPS92$degree)
## 
## highschool   bachelor 
##       4640       2962

Make a similar table for the observations in 2004. Compare your findings to the 1992 frequency table. Interpret the results, focusing on the difference in tables for 1992 versus 2004.

Answer

In the first R chunk I create a subset, CPS04, which contains the observations from 2004

CPS04 <- subset(CPSSW9204, year == 2004)

In the second R chunk I use the function table(CPS04$degree) to create a frequency table of the number of employees with a high school diploma and with a bachelor's degree.

table(CPS04$degree)
## 
## highschool   bachelor 
##       4346       3640

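In 1992 a larger share of the sample had a high school diploma as their highest level of education: 61% (4640 of 7602) compared to 54.4% (4346 of 7986) in 2004. In 2004 a larger share had a bachelor's degree as their highest level of education: 45.6% compared to 39% in 1992. The employees in the sample were therefore more educated in 2004 than in 1992.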
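
The percentages can be derived from the two frequency tables with prop.table. The sketch below types the counts in by hand from the printed tables rather than recomputing them from the data; the object names are assumptions.

```r
# Counts copied from the two frequency tables above
counts <- rbind(
  "1992" = c(highschool = 4640, bachelor = 2962),
  "2004" = c(highschool = 4346, bachelor = 3640)
)

# Row-wise shares: each year's counts divided by that year's total
shares <- prop.table(counts, margin = 1)
round(100 * shares, 1)
```

Using margin = 1 makes each row sum to 100%, so the two years can be compared despite their different sample sizes.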

Question 9

Continue with the data above. For 1992, compute the sample mean and standard deviation of earnings. Then make a 95% confidence interval for income in 1992. Interpret your findings.

Repeat all steps for the observations from 2004.

Comment on the difference in your findings for 1992 versus 2004.

Answer

In 1992:

To find the mean and standard deviation of earnings in 1992, I use the functions mean and sd:

(m92<-mean(CPS92$earnings))
## [1] 11.62818
(s92<-sd(CPS92$earnings))
## [1] 5.558322

The mean: 11.62818

The standard deviation: 5.558322

Then I find the standard error of the mean for 1992, the standard deviation divided by the square root of the sample size, with (err92<-s92/sqrt(7602))

(err92<-s92/sqrt(7602))
## [1] 0.06374994

the Standard Error of mean = 0.0637

Then I find the margin of error for 1992. A 95% confidence interval leaves 2.5% in each tail, so the critical value is qt(0.975, df= 7602 - 1):

(er92<- qt(0.975, df= 7602 - 1)* err92)
## [1] 0.1249675

The margin of error is 0.1249675

Lastly I find the lower and upper bounds using m92-er92 and m92+er92

m92-er92
## [1] 11.50321
m92+er92
## [1] 11.75315

The 95% confidence interval is (11.50321, 11.75315): we are 95% confident that mean earnings in 1992 lie between 11.50 and 11.75.

ii.

In 2004:

To find the mean and standard deviation of earnings in 2004, I use the functions mean and sd:

(m04<-mean(CPS04$earnings))
## [1] 16.77115
(s04<-sd(CPS04$earnings))
## [1] 8.758696

The mean: 16.77115

The standard deviation: 8.758696

Then I find the standard error of the mean for 2004 with (err04<-s04/sqrt(7986))

(err04<-s04/sqrt(7986))
## [1] 0.09801099

The standard error of the mean = 0.098

Then I find the margin of error for 2004 with (er04<- qt(0.975, df= 7986 - 1)* err04)

(er04<- qt(0.975, df= 7986 - 1)* err04)
## [1] 0.1921271

The margin of error is 0.1921271

Lastly I find the lower and upper bounds using m04-er04 and m04+er04

m04-er04
## [1] 16.57902
m04+er04
## [1] 16.96328

The 95% confidence interval is (16.57902, 16.96328): we are 95% confident that mean earnings in 2004 lie between 16.58 and 16.96.

Mean earnings were higher in 2004 than in 1992 (16.77 compared to 11.63), and the two confidence intervals do not overlap. The confidence interval in 2004 is also wider than in 1992 (a width of about 0.384 compared to about 0.250), because earnings were more dispersed in 2004 (standard deviation 8.76 compared to 5.56 in 1992).
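
As a cross-check on the 95% intervals, the calculation can be wrapped in a small function that works from the summary statistics alone. The helper ci_mean is hypothetical (it is not part of AER or base R), and the means, standard deviations, and sample sizes are typed in from the output above.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
ci_mean <- function(m, s, n, level = 0.95) {
  alpha <- 1 - level
  me <- qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)   # margin of error
  c(lower = m - me, upper = m + me)
}

# Summary statistics reported above
ci92 <- ci_mean(m = 11.62818, s = 5.558322, n = 7602)
ci04 <- ci_mean(m = 16.77115, s = 8.758696, n = 7986)
rbind("1992" = ci92, "2004" = ci04)
```

With the raw data available, t.test(CPS92$earnings)$conf.int should give the same kind of interval directly.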

Question 10

Continue with the data from the previous exercise.

In lecture A7, we used the following code to make one graph with boxplots for a quantitative variable for each value of a qualitative variable:

ggplot(data = wage1, aes(x=wage, colour = factor(female))) + geom_boxplot()

Have another look at those slides to check your understanding of the code. Then, make four boxplots:

  1. one for 1992, one for 2004
  2. for each year, one that colours by the variable degree, and one that colours by the variable gender.

After each boxplot, write two sentences interpreting the boxplots, focusing on the comparison between colours and the changes over time.

Answer

i. I used attach(CPS92) to make the variables of the CPS92 data set available by name, then Degree<-as.numeric(degree) to convert the level of education into the codes 1 and 2, with 1 representing high school diploma holders and 2 representing bachelor's degree holders. Finally I used hist(Degree, col= "blue") to make a blue histogram of degrees in 1992

attach(CPS92)
Degree<-as.numeric(degree)
hist(Degree, col= "blue")

The histogram represents the distribution of degree holders in 1992

ii. I used attach(CPS92) to make the variables of the CPS92 data set available by name, then Gender<-as.numeric(gender) to convert the genders into the codes 1 and 2, with 1 representing men and 2 representing women. Finally I used hist(Gender, col= "green") to make a green histogram of genders in the 1992 data set

attach(CPS92)
## The following objects are masked from CPS92 (pos = 3):
## 
##     age, degree, earnings, gender, year
Gender<-as.numeric(gender)
hist(Gender, col= "green")

The histogram represents the distribution of genders in the 1992 data set

iii. For 2004:

I used attach(CPS04) to make the variables of the CPS04 data set available by name, then gender<-as.numeric(gender) to convert the genders into the codes 1 and 2, with 1 representing men and 2 representing women. Finally I used hist(gender, col= "yellow") to make a yellow histogram of genders in the 2004 data set

attach(CPS04)
## The following objects are masked from CPS92 (pos = 3):
## 
##     age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 4):
## 
##     age, degree, earnings, gender, year
gender<-as.numeric(gender)

hist(gender, col= "yellow")

The histogram represents the distribution of genders in the 2004 data set

iv. I used attach(CPS04) to make the variables of the CPS04 data set available by name, then degree<-as.numeric(degree) to convert the level of education into the codes 1 and 2, with 1 representing high school diploma holders and 2 representing bachelor's degree holders. Finally I used hist(degree, col= "red") to make a red histogram of degrees in 2004

attach(CPS04)
## The following object is masked _by_ .GlobalEnv:
## 
##     gender
## The following objects are masked from CPS04 (pos = 3):
## 
##     age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 4):
## 
##     age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 5):
## 
##     age, degree, earnings, gender, year
degree<-as.numeric(degree)

hist(degree, col= "red")

The histogram represents the distribution of degree holders in 2004
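
The question asks for boxplots of a quantitative variable split by a qualitative one, following the lecture code. A minimal sketch of the four boxplots is given below. It assumes that earnings is the quantitative variable of interest and that the ggplot2 package is installed; the plot object names are illustrative.

```r
library(AER)      # provides the CPSSW9204 data
library(ggplot2)  # assumed to be installed, as used in lecture A7

data("CPSSW9204")
CPS92 <- subset(CPSSW9204, year == 1992)
CPS04 <- subset(CPSSW9204, year == 2004)

# 1992: earnings coloured by degree and by gender
p92_degree <- ggplot(data = CPS92, aes(x = earnings, colour = degree)) +
  geom_boxplot()
p92_gender <- ggplot(data = CPS92, aes(x = earnings, colour = gender)) +
  geom_boxplot()

# 2004: the same two plots
p04_degree <- ggplot(data = CPS04, aes(x = earnings, colour = degree)) +
  geom_boxplot()
p04_gender <- ggplot(data = CPS04, aes(x = earnings, colour = gender)) +
  geom_boxplot()

p92_degree   # print each object in its own chunk to draw it
```

Because degree and gender are already factors, they can be passed to colour directly, and referring to each data set through the data argument avoids the masking messages that repeated attach calls produce.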