In completing this assignment, you will:
practise your R and Rmd skills;
use R to compute confidence intervals for means and proportions.
See the instructions in Assignments 1 and 2, and the grading_instructions.Rmd.
Define the following terms in a sentence (or short paragraph) and state a formula if appropriate.
i. Hypothesis testing is a statistical analysis in which an assumption about a population parameter is put to the test using sample data.
ii. A point estimate uses sample data to calculate a single value that serves as a “best estimate” of an unknown parameter.
iii. An interval estimate uses sample data to calculate an interval of plausible values for a parameter of interest.
iv. A sampling distribution is the probability distribution of a sample statistic when the statistic is viewed as a random variable.
v. The level of confidence is the proportion of confidence intervals, constructed under repeated random sampling, that contain the true value of the parameter.
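The repeated-sampling idea behind the level of confidence can be illustrated with a small simulation. This is only a sketch: the population values (`mu`, `sigma`), the sample size, and the number of repetitions are illustrative assumptions, not part of the assignment.

```r
# Simulate repeated random sampling from a normal population.
# All numeric settings below are illustrative assumptions.
set.seed(1)
mu <- 10; sigma <- 2; n <- 30; reps <- 2000; level <- 0.95

covers <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  # TRUE when this sample's interval contains the true mean
  (mean(x) - margin <= mu) && (mu <= mean(x) + margin)
})

coverage <- mean(covers)  # proportion of intervals that contain mu
coverage
```

The proportion of intervals that contain `mu` should be close to the chosen level of 0.95.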
If the income in a community is normally distributed, with a mean of 38,000 and a standard deviation of 6,000, what maximum income does a member of the community have to earn in order to be in the bottom 5%? What is the maximum income one can have and still be in the middle 60%?
Use the normal tables in the back of the book, or your McMaster-approved calculator. Explain the steps you took in getting the answers, and state your answers.
i. Find the z-value for the bottom 5%:
5% = 0.05
\(P[X \le x] = 0.05\)
Look up 0.05 in the body of the normal table.
The z-score is -1.645: the table has no exact entry for 0.05, which lies between the entries for -1.64 and -1.65.
Substitute the z-score into the standardization formula
\(z = (x - \mu)/\sigma\)
and solve for the income x that cuts off the bottom 5%:
\((x - 38000)/6000 = -1.645\)
\(x - 38000 = -1.645 \times 6000\)
\(x = 38000 - 9870\)
\(x = 28130\)
An income of 28,130 is the maximum for someone to be in the bottom 5% of earners.
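ii. Let \(X_2\) be the maximum income one can have and still be in the middle 60% of the population, and let \(X_3\) be the minimum income one can have and still be in the middle 60%.
\(P(X_3 \le X \le X_2) = 0.6\)
\(P(Z_3 \le Z \le Z_2) = 0.6\)
The middle 60% leaves 20% in each tail, so \(Z_2\) satisfies \(P(Z \le Z_2) = 0.80\). The closest table entry is 0.7995 at z = 0.84, so
\(Z_3 = -0.84\) and \(Z_2 = 0.84\)
\(Z_2 = (X_2 - \mu)/\sigma\)
\((X_2 - 38000)/6000 = 0.84\)
\(X_2 = 38000 + 0.84 \times 6000 = 43040\)
The maximum income to be within the middle 60% is about 43,040.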
Now, answer the previous question again using R, using
the dnorm, pnorm, and qnorm
functions (whichever is appropriate).
Make sure to compare the two answers. Did you find the same answer twice?
Do not forget to add text before and after every code chunk!
To answer question 2 using R, substitute the cumulative probability,
the mean, and the standard deviation into the function qnorm
(i<-qnorm(0.05,38000, 6000))
## [1] 28130.88
(ii<-qnorm(0.80, 38000, 6000))
## [1] 43049.73
The answer for i) comes out as $28,130.88, the maximum income to be within the bottom 5%, which coincides with the hand calculation in the previous question. For ii), the middle 60% ends at the 80th percentile, so the probability passed to qnorm is 0.80. The result is $43,049.73, the maximum income to be within the middle 60%, which agrees with the table-based answer up to the rounding of z in the normal table.
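As a cross-check, both ends of the middle 60% can be computed in one call and verified with pnorm. This is a minimal sketch; the variable names `bounds` and `middle` are illustrative.

```r
mu <- 38000; sigma <- 6000

# The middle 60% leaves 20% in each tail, so the cut-offs are
# the 20th and 80th percentiles of the income distribution.
bounds <- qnorm(c(0.20, 0.80), mean = mu, sd = sigma)
bounds

# Check: the probability between the two cut-offs should be 0.60
middle <- pnorm(bounds[2], mu, sigma) - pnorm(bounds[1], mu, sigma)
middle
```

Because the normal distribution is symmetric, the two cut-offs lie equally far below and above the mean of 38,000.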
Suppose that the number of hours per week of lost work due to illness in a certain automobile assembly plant is approximately normally distributed, with a mean of 40 hours and a standard deviation of 15 hours. For a given week, selected at random, what is the probability that:
Use R code to answer each question.
i. pnorm(70,40,15) gives the probability that lost work is at most 70
hours. Subtracting this figure from 1 gives the
probability that lost work will exceed 70 hours.
1-pnorm(70,40,15)
## [1] 0.02275013
The probability that lost work hours will exceed 70 hours is about 2.28%.
ii. The probability of losing between 30 and 45 hours is the probability
of at most 45 hours minus the probability of at most 30 hours, that is,
pnorm(at 45 hours) - pnorm(at 30 hours)
pnorm(45,40,15)-pnorm(30,40,15)
## [1] 0.3780661
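iii. Because X is continuous, the probability that it takes any single exact value is zero. The function dnorm returns the density at a point, which is the height of the curve and not a probability.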
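The difference between a density and a probability can be sketched as follows. The value `a <- 40` is a hypothetical choice for illustration, since any single point behaves the same way.

```r
mu <- 40; sigma <- 15
a <- 40  # hypothetical value, chosen only for illustration

density_at_a <- dnorm(a, mu, sigma)                       # height of the curve at a
prob_exact   <- pnorm(a, mu, sigma) - pnorm(a, mu, sigma) # P(a <= X <= a)

# Probability of a narrow window around a: roughly density * width
width <- 0.01
prob_window <- pnorm(a + width / 2, mu, sigma) - pnorm(a - width / 2, mu, sigma)

c(density = density_at_a, exact = prob_exact, window = prob_window)
```

The density is positive, the probability of the exact value is zero, and the probability of the narrow window shrinks in proportion to its width.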
A senator claims that 58% of her constituents favour her voting policies over the past year. In a random sample of 50 of these people, the sample proportion of those favoured her voting policies was only 0.4. Is this enough evidence to make the senator’s claims strongly suspect? (Hint: Use a normal approximation to the binomial distribution, then construct a confidence interval).
It is up to you whether you want to compute this with R
or with other methods. In the latter case, please explain the steps you
took to arrive at your answer, and which methods you used to obtain the
numerical answers.
Let p denote the proportion of constituents favouring her voting policies. The hypotheses corresponding to the claim are
\(H_0: p = 0.58\)
\(H_1: p \neq 0.58\)
Sample proportion: \(\hat P = 0.4\)
Sample size: \(n = 50\)
The confidence interval for the true proportion is
\(\hat P \pm z \sqrt{\hat P (1 - \hat P)/n}\)
Critical value: \(z_{0.025} = 1.96\)
The 95% confidence interval is
\(0.4 \pm 1.96 \sqrt{0.4(1 - 0.4)/50}\)
\(= 0.4 \pm 0.136\)
\(= (0.264, 0.536)\)
The claimed 58% does not fall within the 95% confidence interval, so the sample gives strong evidence against the senator’s claim.
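The same interval can be reproduced in R. This is a minimal sketch using the normal approximation from the hint; the variable names are illustrative.

```r
p_hat <- 0.4; n <- 50; claim <- 0.58

z  <- qnorm(0.975)                    # critical value for a 95% interval
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
ci

# Does the senator's claimed proportion lie inside the interval?
claim_inside <- (ci[["lower"]] <= claim) && (claim <= ci[["upper"]])
claim_inside
```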
I wish to estimate the proportion of defectives in a large production lot to within plus or minus \(D=0.02\) of the true proportion, with a 99% level of confidence. From past experience it is believed that the true proportion of defectives is \(\pi=0.02\). How large a sample must be used? (Hint: Use a normal approximation for the sample proportion \(\hat P\).)
Use R as your calculator, and walk us through the steps
in your calculations.
The sample size n is calculated according to the formula
n = z^2 * p(1-p) / e^2
where z = 2.576 is the critical value for a 99% confidence level, p is the proportion (expressed as a decimal), and e is the margin of error:
z = 2.576, p = 0.02, e = 0.02
2.576^2*0.02*(1-0.02)/0.02^2
## [1] 325.153
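The formula gives 325.153. A sample size must be a whole number, and rounding down would leave the margin of error slightly too large, so the sample size should be 326.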
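A sketch of the same calculation that takes the critical value from qnorm instead of the table and rounds up to a whole number of units (variable names are illustrative):

```r
level <- 0.99; p <- 0.02; e <- 0.02

z     <- qnorm(1 - (1 - level) / 2)   # two-sided critical value, about 2.576
n_raw <- z^2 * p * (1 - p) / e^2      # required sample size before rounding
n     <- ceiling(n_raw)               # round up so the margin is not exceeded
n
```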
A cereal company checks the weight of its breakfast cereal by randomly checking 62 of the boxes. This particular brand is packed in 20-ounce boxes. Suppose that a particular random sample of 62 boxes results in a mean weight of 20.02 ounces. How often will the sample mean be this high, or higher, if \(\mu=20\) and \(\sigma=0.10\)?
It is up to you whether you want to compute this with R
or with other methods. In the latter case, please explain the steps you
took to arrive at your answer, and which methods you used to obtain the
numerical answers.
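We need to find \(P(\bar x \ge 20.02)\). The sample mean is approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), so we convert \(\bar x\) to z:
\(= P(Z \ge (20.02 - 20)/(0.10/\sqrt{62}))\)
\(= P(Z \ge 1.57)\)
\(= 1 - P(Z < 1.57)\)
\(= 1 - 0.9418\)
\(= 0.0582\)
A sample mean this high or higher will occur about 5.8% of the time.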
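The table-based answer can be checked in R. This is a minimal sketch: it passes the standard error of the mean as the `sd` argument of pnorm and asks for the upper tail directly.

```r
mu <- 20; sigma <- 0.10; n <- 62; x_bar <- 20.02

se <- sigma / sqrt(n)       # standard deviation of the sample mean
z  <- (x_bar - mu) / se     # standardized value, about 1.57

# Upper-tail probability P(sample mean >= 20.02)
p_upper <- pnorm(x_bar, mean = mu, sd = se, lower.tail = FALSE)
c(z = z, p = p_upper)
```

The result differs slightly from the table-based value because the table rounds z to two decimals.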
In this question, and the next two, you will work with data from the Current Population Survey (CPS), see more information at this Wikipedia page.
You will use a slice of the CPS data provided by the AER
package. Make sure you run the code in the following code chunk once, so
that the AER package is available on your system or in your
Rstudio.cloud project.
install.packages("AER")
The following code chunk loads the AER package and the
data set CPSSW9204 that we are going to use.
library(AER)
data("CPSSW9204")
For more information about the data set, you can, for example, ask
?CPSSW9204. Or you can run the code in the following code
chunk (don’t remove eval=FALSE: it would print all the data
to the PDF when you knit!).
CPSSW9204
The sample size is 15588. The following frequency table for
year reveals that roughly 7600 observations were collected
in 1992, and 7986 were collected in 2004:
table(CPSSW9204$year)
##
## 1992 2004
## 7602 7986
We take the 1992 observations and store them in a new data set
CPS92:
CPS92 <- subset(CPSSW9204, year == 1992)
For the 1992 data, we can use a frequency table to look at the distribution of education of the employees in the sample:
table(CPS92$degree)
##
## highschool bachelor
## 4640 2962
Make a similar table for the observations in 2004. Compare your findings to the 1992 frequency table. Interpret the results, focusing on the difference in tables for 1992 versus 2004.
In the first R chunk I create a subset CPS04, which
contains the observations from 2004
CPS04 <- subset(CPSSW9204, year == 2004)
In the second R chunk I use the function
table(CPS04$degree) to create a table containing the number
of high school and bachelor degree holders.
table(CPS04$degree)
##
## highschool bachelor
## 4346 3640
In 1992, 61.0% of the sample (4640 of 7602) had a high-school diploma as their highest level of education, compared with 54.4% (4346 of 7986) in 2004. Correspondingly, the share with a bachelor's degree as their highest level of education rose from 39.0% in 1992 to 45.6% in 2004.
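The percentages can be obtained from the two frequency tables with prop.table. The sketch below re-enters the counts printed above by hand, so it does not depend on the AER data being loaded; the object names are illustrative.

```r
# Counts copied from the two frequency tables above
counts <- rbind(
  "1992" = c(highschool = 4640, bachelor = 2962),
  "2004" = c(highschool = 4346, bachelor = 3640)
)

# Row percentages: share of each degree within each year
shares <- round(100 * prop.table(counts, margin = 1), 1)
shares
```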
Continue with the data above. For 1992, compute the sample mean and
standard deviation of earnings. Then make a 95% confidence
interval for income in 1992. Interpret your findings.
Repeat all steps for the observations from 2004.
Comment on the difference in your findings for 1992 versus 2004.
i.
In 1992:
To find the mean and standard deviation of earnings in 1992, I use the code
(m92<-mean(CPS92$earnings))
## [1] 11.62818
(s92<-sd(CPS92$earnings))
## [1] 5.558322
The mean: 11.62818
The standard deviation: 5.558322
Then I find the Standard Error of mean 1992 with function
(err92<-s92/sqrt(7602))
(err92<-s92/sqrt(7602))
## [1] 0.06374994
The standard error of the mean is 0.0637.
Then I find the margin of error for 1992 with the code
(er92<- qt(0.95, df= 7602)* err92)
(er92<- qt(0.95, df= 7602)* err92)
## [1] 0.1048721
The margin of error is 0.1048721.
Lastly I find the lower and upper bounds using the expressions
m92-er92 and
m92+er92
m92-er92
## [1] 11.52331
m92+er92
## [1] 11.73305
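The lower and upper bounds are (11.52331, 11.73305). Because qt(0.95, ...) leaves 5% in each tail, this is a 90% confidence interval. For a 95% interval the critical value is qt(0.975, df = 7601), about 1.96, which gives a margin of error of about 0.125 and an interval of about (11.50, 11.75) for mean earnings in 1992.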
ii.
In 2004:
To find the mean and standard deviation of earnings in 2004, I use the code
(m04<-mean(CPS04$earnings))
## [1] 16.77115
(s04<-sd(CPS04$earnings))
## [1] 8.758696
The mean: 16.77115
The standard deviation: 8.758696
Then I find the Standard Error of mean 2004 with function
(err04<-s04/sqrt(7986))
(err04<-s04/sqrt(7986))
## [1] 0.09801099
The standard error of the mean is 0.098.
Then I find the margin of error for 2004 with the code
(er04<- qt(0.95, df= 7986)* err04)
(er04<- qt(0.95, df= 7986)* err04)
## [1] 0.1612324
The margin of error is 0.1612324.
Lastly I find the lower and upper bounds using the expressions
m04-er04 and
m04+er04
m04-er04
## [1] 16.60992
m04+er04
## [1] 16.93238
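The lower and upper bounds are (16.60992, 16.93238). As in the 1992 calculation, qt(0.95, ...) leaves 5% in each tail, so this is a 90% confidence interval. With qt(0.975, df = 7985) the margin of error is about 0.192 and the 95% interval is about (16.58, 16.96) for mean earnings in 2004.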
The confidence interval for 2004 is wider than the one for 1992 (a width of 0.3225 compared with 0.2097), because earnings are more spread out in 2004 (standard deviation 8.76 compared with 5.56). The interval also lies at higher values: 16.60992 to 16.93238 in 2004 against 11.52331 to 11.73305 in 1992. The two intervals do not overlap, so mean earnings were clearly higher in 2004 than in 1992.
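A two-sided interval at a chosen level can be wrapped in a small helper that works from the summary statistics reported above. This is a sketch under stated assumptions: the function name `t_ci` and its arguments are hypothetical, and it uses the usual t critical value with n - 1 degrees of freedom.

```r
# Confidence interval for a mean from summary statistics.
# t_ci is a hypothetical helper name used for illustration.
t_ci <- function(x_bar, s, n, level = 0.95) {
  crit   <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  margin <- crit * s / sqrt(n)                   # margin of error
  c(lower = x_bar - margin, upper = x_bar + margin)
}

# Sample mean, standard deviation, and size as printed above
ci92 <- t_ci(11.62818, 5.558322, 7602)
ci04 <- t_ci(16.77115, 8.758696, 7986)
rbind("1992" = ci92, "2004" = ci04)
```

Calling `t_ci(..., level = 0.90)` reproduces the bounds obtained with qt(0.95, ...), up to the small difference in degrees of freedom.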
Continue with the data from the previous exercise.
In lecture A7, we used the following code to make one
graph with boxplots for a quantitative variable for each value of a
qualitative variable:
ggplot(data = wage1, aes(x=wage, colour = factor(female))) + geom_boxplot()
Have another look at those slides to check your understanding of the code. Then, make four boxplots: for each of 1992 and 2004, one that colours by the variable degree,
and one that colours by the variable gender. After each boxplot, write two sentences interpreting the boxplots, focusing on the comparison between colours and the changes over time.
i. I used attach(CPS92) to make the columns of the CPS92 data set available by name,
then Degree<-as.numeric(degree) to
code the level of education as 1 and 2, where 1 represents high school
diploma holders and 2 represents bachelor degree holders. Finally, I
used hist(Degree, col= "blue") to make a blue
histogram of degrees in 1992
attach(CPS92)
Degree<-as.numeric(degree)
hist(Degree, col= "blue")
The histogram shows the distribution of degree holders in 1992
ii. I used attach(CPS92) to make the columns of the CPS92 data set available by name,
then Gender<-as.numeric(gender) to
code the genders as 1 and 2, where 1 represents men and 2 represents
women. Finally, I used
hist(Gender, col= "green") to make a green histogram of
genders in the 1992 data set
attach(CPS92)
## The following objects are masked from CPS92 (pos = 3):
##
## age, degree, earnings, gender, year
Gender<-as.numeric(gender)
hist(Gender, col= "green")
The histogram shows the distribution of genders in the 1992 data set
iii. For 2004:
I used attach(CPS04) to make the columns of the CPS04 data set available by name, then
gender<-as.numeric(gender) to code
the genders as 1 and 2, where 1 represents men and 2 represents women.
Finally, I used hist(gender, col= "yellow")
to make a yellow histogram of genders in the 2004 data set
attach(CPS04)
## The following objects are masked from CPS92 (pos = 3):
##
## age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 4):
##
## age, degree, earnings, gender, year
gender<-as.numeric(gender)
hist(gender, col= "yellow")
The histogram shows the distribution of genders in the 2004 data set
iv. I used attach(CPS04) to make the columns of the CPS04 data set available by name,
then degree<-as.numeric(degree) to
code the level of education as 1 and 2, where 1 represents high school
diploma holders and 2 represents bachelor degree holders. Finally, I
used hist(degree, col= "red") to make a red
histogram of degrees in 2004
attach(CPS04)
## The following object is masked _by_ .GlobalEnv:
##
## gender
## The following objects are masked from CPS04 (pos = 3):
##
## age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 4):
##
## age, degree, earnings, gender, year
## The following objects are masked from CPS92 (pos = 5):
##
## age, degree, earnings, gender, year
degree<-as.numeric(degree)
hist(degree, col= "red")
The histogram shows the distribution of degree holders in 2004
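The question asks for boxplots rather than histograms. A minimal sketch of the four plots in the style of the lecture code follows. It assumes the AER and ggplot2 packages are installed and that earnings is the quantitative variable of interest. The helper name `earnings_boxplot` is hypothetical.

```r
library(AER)       # provides the CPSSW9204 data
library(ggplot2)

data("CPSSW9204")
CPS92 <- subset(CPSSW9204, year == 1992)
CPS04 <- subset(CPSSW9204, year == 2004)

# Boxplots of earnings, one box per level of the grouping variable.
# earnings_boxplot is a hypothetical helper, not part of either package.
earnings_boxplot <- function(data, group, title) {
  ggplot(data, aes(x = earnings, colour = .data[[group]])) +
    geom_boxplot() +
    labs(title = title, colour = group)
}

p_degree_92 <- earnings_boxplot(CPS92, "degree", "Earnings by degree, 1992")
p_gender_92 <- earnings_boxplot(CPS92, "gender", "Earnings by gender, 1992")
p_degree_04 <- earnings_boxplot(CPS04, "degree", "Earnings by degree, 2004")
p_gender_04 <- earnings_boxplot(CPS04, "gender", "Earnings by gender, 2004")

# Print each plot in its own chunk, followed by the interpretation
p_degree_92
```

Working on the subsets directly avoids attach() and the masking messages it produces.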