---
title: "Psyc210a Practice 04, due by 10:30pm, Wednesday 9/28"
author: "Caley M Mikesell"
output: html_document
---
```{r options, echo=FALSE}
knitr::opts_chunk$set(echo=TRUE)
options(warn=-1, width=96, digits=8, show.signif.stars=FALSE,
        str=strOptions(strict.width="cut"), scipen=999)
```
If you use this .Rmd template file for this practice, submit two documents: 1) Psc210aPR04yourInitials.Rmd; 2) Psc210aPR04yourInitials.html (generate this .html file from your .Rmd file). Before you submit your work, replace ‘yourInitials’ in the file names with your own initials.
If you use any other statistical program (e.g., SPSS), submit your SPSS syntax together with your answers to the questions.
```{r, error=TRUE}
setwd('~/Desktop/Psy210a_F2022')
```
First read the data into R as a data frame. After setting up your working directory, you may use the following R code to read the data in.
```{r, error=TRUE}
working<-read.csv('pr4_Q2data.csv')
```
a. [2 points] Focus on the variable 'Recall' in the data and check whether it has any missing values. If it does, remove the observation(s) with a missing value in 'Recall'.
```{r, error=TRUE}
working<-read.csv('pr4_Q2data.csv') #file name matches the earlier read.csv() call
pr4Q2data<-read.csv('pr4_Q2data.csv') #also naming it this
head(pr4Q2data, 3) #to read first 3 rows of data
head(working, 3)
sum(is.na(pr4Q2data$Recall)) ##check for missing value
```
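Part a also asks for any rows with a missing 'Recall' to be removed. A minimal sketch of that step, using a made-up toy data frame (`toy`, `n.missing`, `toy.clean`, and the values are hypothetical, not the assignment data):

```r
# Hypothetical toy data standing in for the real file; values are made up.
toy <- data.frame(Recall = c(9, NA, 12, 15, NA),
                  group  = c(1, 1, 2, 2, 3))

n.missing <- sum(is.na(toy$Recall))      # count missing Recall values
toy.clean <- toy[!is.na(toy$Recall), ]   # keep only rows with an observed Recall

n.missing          # number of rows that had a missing Recall
nrow(toy.clean)    # number of rows left after removal
```

The same indexing pattern would apply to the real data frame; when no values are missing it leaves the data unchanged.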
b. [4 points] Generate side-by-side boxplots of 'Recall' by ‘group’, with the main title ‘Boxplot of recall by group’ and the Y axis labeled ‘Recall of words’. Describe briefly what the boxplots reveal about the relationship between 'Recall' and the learning conditions.
```{r, error=TRUE}
## your R codes for 2b
boxplot(pr4Q2data$Recall ~ pr4Q2data$group, main="Boxplot of recall by group",
        ylab='Recall of words', xlab='Group', frame.plot=FALSE)
#The boxplots reveal that participants in groups 1 and 2 recalled fewer words overall
#and on average, as demonstrated by the lower medians and shorter whiskers.
#Participants in group 3 recalled more words on average and overall than groups 1 and 2,
#and group 3 was also the group with the most outliers.
#Group 4 recalled the most words on average and overall.
#Group 5 recalled slightly fewer words on average than group 4 and had a larger overall
#range of words recalled than any other group.
```
c. [2 points] Generate a histogram of 'Recall' with the main title ‘Histogram of Recall of Words’; label the X axis ‘Recall of words’ and the Y axis ‘Density’ (make sure that the scale of Y is density, not frequency). Is this distribution a sample distribution or a sampling distribution? (You may put the answer to this question within your R code chunk as comments.)
d. [4 points] Add a normal density curve and an empirical density curve to the histogram in 2c. Describe briefly the distribution of the variable 'Recall'.
```{r, error=TRUE}
hist(pr4Q2data$Recall, prob=TRUE, ## argument 'prob=TRUE' specifies 'density' as the scale of Y-axis
     ## without 'prob=TRUE', the default of Y-axis is frequency
     main='Histogram of Recall of Words', xlab='Recall of words', ylab='Density', density=15)
curve(dnorm(x, mean=11.61, sd=5.191), from=3.0, to=23, col='green', add=TRUE) ## normal density curve
lines(density(pr4Q2data$Recall), col='purple') ## empirical density curve
recall.mean<-mean(pr4Q2data$Recall) ## renamed so that mean(), sd() and summary() are not masked
recall.sd<-sd(pr4Q2data$Recall)
recall.summary<-summary(pr4Q2data$Recall)
```
e. [4 points] Suppose the data on 'Recall' in the Eysenck study were a random sample (with sample size 100) drawn from the population of 'Recall' data with population mean 12.5 (i.e., $\mu$=12.5) and population standard deviation 6 (i.e., $\sigma$=6). Based upon this information, present the sampling distribution of sample means (where each sample has size 100 and is drawn from this same population) [Hint: consider what the mean and SE of the sampling distribution of sample means are]. Make sure that your graph has an appropriate title and appropriate labels for the x-axis and y-axis.
f. [2 points] Mark the mean of 'Recall' from Eysenck's study on the sampling distribution you presented in 2e. Given this sampling distribution, what is the probability that a randomly drawn sample from this population (with sample size 100) will have a sample mean less than or equal to the 'Recall' mean of Eysenck's study?
```{r, error=TRUE}
n<-100 #sample size 100
pop.mean<-12.5
pop.sd<-6
se<-pop.sd/sqrt(n) #SE of the mean = sigma/sqrt(n) = 0.6
mean.samplingdist<-pop.mean
curve(dnorm(x, mean=pop.mean, sd=se), from=pop.mean-5*se, to=pop.mean+5*se,
      main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line:sample mean",
      xlab="recall scores", ylab="Density", col="purple")
abline(v=mean(working$Recall), lty='dotted', col='red') #mark the sample mean, as 2f asks
pnorm(mean(working$Recall), mean=pop.mean, sd=se) #P(sample mean <= observed mean); uses the SE, not sigma
```
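The probability asked for in 2f can be read as a z-score calculation on the sampling distribution. A minimal sketch, assuming the sample mean is the 11.61 used for the normal curve on the histogram (the names `xbar`, `z`, `p.below`, and `p.direct` are illustrative, not part of the assignment):

```r
mu    <- 12.5            # population mean given in 2e
sigma <- 6               # population standard deviation given in 2e
n     <- 100             # sample size
xbar  <- 11.61           # assumed sample mean of Recall (value used for the normal curve)

se <- sigma / sqrt(n)    # standard error of the mean: 6 / 10 = 0.6
z  <- (xbar - mu) / se   # number of SEs between the sample mean and mu

p.below  <- pnorm(z)                          # P(sample mean <= xbar) on the standard normal
p.direct <- pnorm(xbar, mean = mu, sd = se)   # same probability on the sampling distribution

c(se = se, z = z, p.below = p.below, p.direct = p.direct)
```

Both routes give the same probability. The key design point is that the spread is the standard error, not the population standard deviation.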
g. [4 points] Compare the distribution of sample means in 2e with the distribution of 'Recall' based upon Eysenck’s data (you presented this distribution in 2c). Comment on the differences and similarities between these two distributions, citing appropriate statistics where necessary to support your comments.
Your answers for 2g here:
```{r, error=TRUE}
n<-100 #sample size 100
pop.mean<-12.5
pop.sd<-6
standarderror.mean<-pop.sd/sqrt(n) #SE uses the population sd, not the population mean
mean.samplingdist<-pop.mean
curve(dnorm(x, mean=pop.mean, sd=standarderror.mean),
      from=pop.mean-12*standarderror.mean, to=pop.mean+12*standarderror.mean,
      main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line:sample mean \npurple curve: theoretical sampling distribution \ngreen curve: sample distribution of sample 2c\n",
      xlab="recall scores", ylab="Density", col="purple")
curve(dnorm(x, mean=11.61, sd=5.191), col='green', add=TRUE)
abline(v=mean(pr4Q2data$Recall), lty='dotted', col='red') #sample mean, matching the title
cat("\nSample Stats and Population Parameters\n")
SampleDistribution.Stats <- c(mean=mean(pr4Q2data$Recall), sd=sd(pr4Q2data$Recall))
Population.Parameters <- c(mean=pop.mean, sd=pop.sd)
rbind(SampleDistribution.Stats, Population.Parameters)
```
a. [2 points] Generate a graph to show the empirical cumulative distribution function (eCDF) of 'Recall' with an appropriate title and labels for the X and Y axes.
```{r, error=TRUE}
prob<-ecdf(pr4Q2data$Recall)
plot(prob, main='The empirical cumulative distribution function (eCDF) of recall scores',
     xlab='Recall score', ylab='Pr(x<=given recall score)', col='blue')
```
b. [2 points] Based upon the eCDF, compute the proportion of observations with a 'Recall' value greater than 15.
c. [2 points] Based upon the eCDF, if you randomly select one observation from the data, what is the probability that the 'Recall' value of this observation will be between 11 and 15?
```{r, error=TRUE}
summary(pr4Q2data$Recall) #run summary of data to find out max to help calculate probability of values between 15 and 23
prob<-ecdf(pr4Q2data$Recall) #create ecd function, object prob is function
prob(23) ##cumulative probability of recall score <=23
prob(15) ##cumulative probability of recall score <=15
prob(23)-prob(15) ##part b: proportion with recall > 15; author's recorded output: [1] 0.25
1-prob(15) ##same proportion, without needing the maximum
prob(15)
prob(11)
prob(15)-prob(11) ##part c: proportion with 11 < recall <= 15; author's recorded output: [1] 0.16
```
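Because an eCDF returns Pr(X <= x), a difference of two eCDF values excludes the lower endpoint. A small sketch on made-up scores (the vector `x` and the names below are hypothetical, not the assignment data) shows the distinction:

```r
x  <- c(5, 9, 11, 12, 15, 15, 18, 21)   # hypothetical recall scores
Fx <- ecdf(x)                           # Fx(q) returns the proportion of values <= q

above.15  <- 1 - Fx(15)                 # proportion strictly greater than 15
half.open <- Fx(15) - Fx(11)            # proportion with 11 <  x <= 15
closed    <- mean(x >= 11 & x <= 15)    # proportion with 11 <= x <= 15

c(above.15 = above.15, half.open = half.open, closed = closed)
```

Whether "between 11 and 15" should include the endpoints is a judgment call, so it helps to state which version an answer reports.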
d. [4 points] Suppose you randomly select a person from the data (let’s say his name is John) and John’s 'Recall' value is 15. Is John more likely to be from the old age group (variable Age=1) or from the young age group (Age=2)? Support your answer with appropriate statistics or graphs (but do NOT do any hypothesis testing here) [hint: think about the probability of 'Recall' >=15 for each age group].
```{r, error=TRUE}
older<-subset(pr4Q2data, Age==1, select=c('Recall', 'group')) #subset older adults
younger<-subset(pr4Q2data, Age==2, select=c('Recall', 'group')) #subset younger adults
older.stats <- with(older, c(n=length(Recall), mean=mean(Recall), sd=sd(Recall), min=min(Recall), max=max(Recall)))
younger.stats <- with(younger, c(n=length(Recall), mean=mean(Recall), sd=sd(Recall), min=min(Recall), max=max(Recall)))
cat('The sample statistics by age\n') #add a spiffy title to my summary stats table
rbind(older.stats, younger.stats)
older.ecdf<-ecdf(older$Recall) #create ecdf by age group function
younger.ecdf<-ecdf(younger$Recall)
1-older.ecdf(15) #proportion of older adults with Recall strictly greater than 15
1-younger.ecdf(15) #proportion of younger adults with Recall strictly greater than 15
```
a. [2 points] Based upon the assumed population distribution, compute the proportion of observations in the population with a 'Recall' value greater than 15.
```{r, error=TRUE}
pnorm(15, mean=12.5, sd=6, lower.tail=FALSE) ##P(Recall > 15); author's recorded output: [1] 0.3384611
```
b. [2 points] Based upon the assumed population distribution, if you randomly select one observation from the population, what is the probability that the 'Recall' value of this observation will be between 11 and 15?
```{r, error=TRUE}
p15<-pnorm(15, mean=12.5, sd=6, lower.tail=TRUE)
p11<-pnorm(11, mean=12.5, sd=6, lower.tail=TRUE)
p15-p11
```
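Under the assumed normal population ($\mu$=12.5, $\sigma$=6), the subtraction above corresponds to standardizing both endpoints. Here $\Phi$ denotes the standard normal CDF, which is what `pnorm` returns with its default mean of 0 and sd of 1:

$$
P(11 < X < 15) = \Phi\!\left(\frac{15 - 12.5}{6}\right) - \Phi\!\left(\frac{11 - 12.5}{6}\right) \approx \Phi(0.417) - \Phi(-0.25)
$$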
c. [2 points] If you randomly select a sample of size 400 from this population, what is the expected difference between the sample mean and the true population mean? Show how you estimate this difference.
```{r, error=TRUE}
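mu<-12.5
sigma<-6
n<-400
se.mean<-sigma/sqrt(n) #standard error of the mean: 6/sqrt(400) = 0.3
se.mean #print the typical (standard) difference between a sample mean and mu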
```
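A quick simulation can serve as a sanity check on the standard error. This is a sketch under the stated normal-population assumption; the seed, the number of replications `reps`, and the variable names are arbitrary choices, not part of the assignment:

```r
set.seed(210)                       # arbitrary seed, for reproducibility only
mu <- 12.5; sigma <- 6; n <- 400    # population parameters and sample size from the question
reps <- 2000                        # number of simulated samples (arbitrary)

# Draw 'reps' samples of size n and record each sample mean
sample.means <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

se.theory <- sigma / sqrt(n)        # theoretical SE: 6 / 20 = 0.3
se.sim    <- sd(sample.means)       # empirical SD of the simulated sample means

c(theory = se.theory, simulated = se.sim)
```

The simulated value should land close to the theoretical one, with the gap shrinking as `reps` grows.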