Psyc210a Practice 04, due by 10:30pm, Wednesday 9/28

If you use this .Rmd template file to do this practice, make sure to submit two documents: 1) Psc210a_PR04_yourInitials.Rmd; 2) Psc210a_PR04_yourInitials.html (this .html file should be generated from your .Rmd file). Before you submit your work, replace ‘yourInitials’ by your own initials in the file name.

If you use any other statistical program (e.g. SPSS), make sure to submit your SPSS syntax codes together with your answers to the questions.

Write R code to set your working directory to ‘Psy210a_F2022’. Make sure to include the appropriate path.

##set my working directory  
setwd('~/Desktop/Psy210a_F2022')

##clear my work space
#rm(list=ls()) #Note: I did clear my data, but I commented this out so it would stop clearing my data

[20 points total] Eysenck (1974) conducted an experiment to examine whether certain learning strategies help participants remember words from a word list, and whether there is an age difference. One hundred participants were randomly assigned to one of five groups (each group was instructed to use a different strategy while learning the word list). In pr4_Q2data.sav (SPSS format) or pr4_Q2data.csv (.csv format), the variable ‘Recall’ contains the number of words recalled correctly in the test session. For this practice, assume that ‘Recall’ is on a continuous scale. The other two variables in the data are ‘Age’ (1=old adult, 2=young adult) and ‘group’ (a categorical variable with five categories).

First read the data into R as a data frame. After setting up your working directory, you may use the following R code to read the data in.

##read the .csv data into R
working<-read.csv('pr4_Q2data.csv')  

##OR read the .sav (spss) data into R
##if you use the following R codes, make sure to install 'foreign' package first 
## install.packages('foreign')
#working<-foreign::read.spss('pr4_Q2data.sav', to.data.frame = TRUE)

[2 points] Focus on the variable ‘recall’ in the data, check whether there is any missing value for variable ‘recall’. If there is any missing value for variable ‘recall’, remove the observation(s) with missing value in ‘recall’.

## your R codes for 2a
working<-read.csv('pr4_Q2data.csv')  
pr4_Q2data<-read.csv('pr4_Q2data.csv')  #also naming it this

head(pr4_Q2data, 3) #to read first 3 rows of data

##   Age Recall group
## 1   1      9     1
## 2   1      8     1
## 3   1      6     1

head(working, 3)

##   Age Recall group
## 1   1      9     1
## 2   1      8     1
## 3   1      6     1

sum(is.na(pr4_Q2data$Recall)) ##check for missing value

## [1] 0

#[1]0 means no missing values 
#pr4_Q2data <- pr4_Q2data[!is.na(pr4_Q2data$Recall), ] #the code I would use to remove observations with a missing value in 'Recall' if there were any
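
Because the real file has no missing values, the removal step never runs above. The sketch below tries it on a small made-up data frame; `toy` and its values are hypothetical and are not the Eysenck data.

```r
# Hypothetical toy data with one missing Recall value (not the Eysenck data)
toy <- data.frame(Age    = c(1, 1, 2, 2),
                  Recall = c(9, NA, 14, 11),
                  group  = c(1, 2, 3, 4))

# Count the missing values in Recall
n_missing <- sum(is.na(toy$Recall))

# Keep only rows where Recall is observed; the trailing comma keeps all columns
toy_clean <- toy[!is.na(toy$Recall), ]

n_missing
nrow(toy_clean)
```

The result has to be assigned back to an object. Subsetting on its own only prints the filtered rows and leaves the original data frame unchanged.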

[4 points] Generate side-by-side boxplots of ‘Recall’ by ‘group’, with a main title ‘Boxplot of recall by group’ and label the Y axis as ‘Recall of words’. Describe briefly what the boxplots reveal in terms of the relationship between ‘Recall’ and the learning conditions.

## your R codes for 2b
boxplot(pr4_Q2data$Recall ~ pr4_Q2data$group,
main="Boxplot of recall by group",
ylab = 'Recall of words',
xlab='Group',
frame.plot=FALSE)

#The boxplots reveal that participants in groups 1 and 2 recalled fewer words overall and on a typical score, as demonstrated by the lower medians and shorter whiskers (the center line of a boxplot is the median, not the mean). Participants in group 3 typically recalled more words than groups 1 and 2, and this was also the group with the most outliers. Group 4 had the highest median recall. Group 5 had a slightly lower median than group 4 and a larger overall range of words recalled than any other group.

[2 points] Generate a histogram of ‘Recall’ with a main title ‘Histogram of Recall of Words’, label the X axis as ‘Recall of words’, and label the Y axis as ‘Density’ (make sure that the scale of Y is density, not frequency). Is this distribution a sample distribution or a sampling distribution? (You may put the answer to this question within your R code chunk as comments.)

[4 points] Add a normal density curve and an empirical density curve to the histogram in 2c. Describe briefly the distribution of the variable ‘Recall’.

## your R codes for 2c & 2d
hist(pr4_Q2data$Recall,
     prob=TRUE, ## argument 'prob=TRUE' specifies 'density' as the scale of Y-axis
     ## without 'prob=TRUE', the default of Y-axis is frequency
     main='Histogram of Recall of Words',
     xlab='Recall of words',
     ylab='Density',
     density=15) ## 'density' here sets the shading lines of the bars, not the Y-axis scale

#2c This is a sample distribution because it shows the individual scores from one sample, not the scores of the entire population and not a statistic computed across many samples.


##2d add normal curve to my histogram
curve(dnorm(x, mean=mean(pr4_Q2data$Recall), sd=sd(pr4_Q2data$Recall)), #sample mean=11.61, sample sd=5.191
      from=3.0, 
      to=23, 
      col='green', add=TRUE) 

##Add kernel density curve/empirical density curve
lines(density(pr4_Q2data$Recall), col='purple')

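#Summary statistics behind the normal curve (names chosen so they do not mask the functions mean(), sd(), summary())
recall.mean<-mean(pr4_Q2data$Recall)
recall.sd<-sd(pr4_Q2data$Recall)
recall.summary<-summary(pr4_Q2data$Recall)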

#2d The distribution of recall of words is slightly positively skewed, as shown by the curves that are 'bunched up' on the left side of the x axis
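
The skew description can also be backed up with numbers. The sketch below uses a made-up vector (`x` is hypothetical, not the Eysenck data) and the simple moment-based skewness formula. Other skewness estimators exist and give slightly different values.

```r
# Hypothetical right-skewed toy scores (not the Eysenck data)
x <- c(3, 4, 5, 5, 6, 6, 7, 8, 10, 12, 15, 21)

m <- mean(x)

# Moment-based skewness: third central moment over (second central moment)^1.5
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5

mean(x) > median(x)  # the long right tail pulls the mean above the median
skew > 0             # a positive value indicates positive (right) skew
```

For the recall data, the summary output shows the same pattern in a milder form: the mean (11.61) sits above the median (11).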

[4 points] Suppose the data on ’Recall” in Eysenck study was a random sample (with sample size 100) drawn from the population of ’Recall” data with population mean 12.5 (i.e., \(\mu\)=12.5) and population standard deviation 6 (i.e., \(\sigma\)=6). Based upon this information, present the sampling distribution of sample means (where each sample size is 100, drawn from this same population) [Hint: consider what are the mean and SE of the sampling distribution of sample means]. Make sure that your graph has appropriate title and label x-axis and y-axis appropriately.

[2 points] Place/mark the mean of ‘Recall’ from Eysenck’s study on the sampling distribution you presented in 2e. Given this sampling distribution, what is the probability that a randomly drawn sample from this population (with sample size 100) will have sample mean less than or equal to the “Recall” mean of Eysenck’s study?

## your R codes for 2e & 2f

##2E
#Create a sampling distribution assuming normality bc of central limit theorem
n<-100 #sample size 100

#Define population mean/mu and sd/sigma   
pop.mean<-12.5
pop.sd<-6

#create standard error variable to put into curve 

se<-pop.sd/sqrt(n) #6/sqrt(100) = 0.6
#Mean of sampling distribution of sample means
mean.samplingdist<-pop.mean


#Create the theoretical sampling distribution of sample means
curve(dnorm(x, mean=pop.mean, sd=se), 
      from = pop.mean-5*se, to=pop.mean+5*se,
      main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line:sample mean",
      xlab="recall scores", ylab="Density", col="purple")
abline(v=mean(working$Recall), lty='dotted', col='red') #mark the 'Recall' mean from Eysenck's study (11.61), not the population mean

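##2F There's about a 6.9% chance that a randomly drawn sample from this population (with sample size 100) will have a sample mean less than or equal to the "Recall" mean of 11.61
#the probability of observing a sample mean at 11.61 or smaller
#the sd of the sampling distribution is the standard error (6/sqrt(100) = 0.6), not the population sd of 6
#z = (11.61-12.5)/0.6 = -1.48, which gives a probability of approximately 0.069

pnorm(mean(working$Recall), mean=12.5, sd=6/sqrt(100))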
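
As a rough check on the theoretical sampling distribution, one can simulate it. This is only a sketch: the seed and the number of replications (`reps`) are arbitrary assumptions, and the population is assumed to be normal with the stated mean and sd.

```r
set.seed(210)        # arbitrary seed, only for reproducibility
pop.mean <- 12.5
pop.sd   <- 6
n        <- 100
reps     <- 5000     # number of simulated samples (an arbitrary choice)

# Draw `reps` samples of size n and store each sample mean
sim.means <- replicate(reps, mean(rnorm(n, mean = pop.mean, sd = pop.sd)))

# The sd of the simulated means should be close to sigma/sqrt(n) = 0.6
theoretical.se <- pop.sd / sqrt(n)
c(sim.mean = mean(sim.means), sim.se = sd(sim.means), theory.se = theoretical.se)

# Probability of a sample mean at or below 11.61: theory vs. simulation
p.theory <- pnorm(11.61, mean = pop.mean, sd = theoretical.se)
p.sim    <- mean(sim.means <= 11.61)
c(p.theory = p.theory, p.sim = p.sim)
```

The simulated means cluster around 12.5 with a spread near 0.6. That spread is why the standard error, rather than the population sd, belongs in the `sd` argument of `pnorm()` when the question is about a sample mean.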

[4 points] Compare the distribution of sample means in 2e with the distribution of ‘Recall’ based upon Eysenck’s data (you presented this distribution in 2c). Comment on the differences and similarities between these two distributions, and cite appropriate statistics, if necessary, to support your comments.

Your answers for 2g here:

## possible R codes, if any,  for 2g

#Make histogram with curve of Theoretical sampling distribution of sample means of recall scores and sample distribution
n<-100 #sample size 100
#Define population mean/mu and sd/sigma   
pop.mean<-12.5
pop.sd<-6

#standard error of sampling distribution of the sample means 
standarderror.mean<-pop.sd/sqrt(n) #SE uses the population sd: 6/sqrt(100) = 0.6

#Mean of sampling distribution of sample means
mean.samplingdist<-pop.mean

#Create the theoretical sampling distribution of sample means
curve(dnorm(x, mean=pop.mean, sd=standarderror.mean), 
      from = pop.mean-12*standarderror.mean, to=pop.mean+12*standarderror.mean,
      main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line:sample mean \npurple curve: theoretical sampling distribution \ngreen curve: sample distribution of sample 2c\n",
      xlab="recall scores", ylab="Density", col="purple")
#sample curve
curve(dnorm(x, mean=11.61, sd=5.191), 
   col='green', add=TRUE) 
abline(v=mean(pr4_Q2data$Recall), lty='dotted', col='red') #dotted red line at the sample mean (11.61), as stated in the title

cat("\nSample Stats and Population Parameters\n")

## 
## Sample Stats and Population Parameters

SampleDisribution.Stats <- c(mean=mean(pr4_Q2data$Recall), sd=sd(pr4_Q2data$Recall))
Population.Parameters <- c(mean=12.5, sd=6)
rbind(SampleDisribution.Stats,Population.Parameters)

##                          mean       sd
## SampleDisribution.Stats 11.61 5.191086
## Population.Parameters   12.50 6.000000

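#As shown by the curves in the graph and the table of sample statistics and population parameters, the two distributions have similar centers (sampling distribution mean = population mean = 12.5; sample mean = 11.61), and both are roughly bell-shaped, although the sample distribution is slightly positively skewed. They differ sharply in spread: the sampling distribution of sample means has a standard error of 6/sqrt(100) = 0.6, whereas the individual recall scores in the sample have sd = 5.19 (close to the population sd of 6). This is why the purple curve (sampling distribution) is much taller and narrower than the green curve (sample distribution): means of samples of 100 vary far less than individual scores do.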

[totally 10 points] Using the ’Recall” data in Eysenck’s study from question 2 (you showed its distribution in 2c):

[2 points] Generate a graph to show the empirical cumulative distribution function (eCDF) of ‘Recall’ with an appropriate title and labels for the X and Y axes.

## your R codes for 3a

#3a
prob<-ecdf(pr4_Q2data$Recall)
#prob

plot(prob, main='The empirical cumulative distribution function (eCDF) of recall scores',
    xlab='Recall score',
     ylab='Pr(x<=given recall score)',
     col='blue')

[2 points] Based upon the eCDF, compute the proportion of observations with ’Recall” value greater than 15.

[2 points] Based upon the eCDF, if randomly selecting one observation from the data, what is the probability that the ’Recall” value of this observation will be between 11 and 15.

## your R codes for 3b & 3c

##3b: 25% of the observations in recall will be greater than 15 
summary(pr4_Q2data$Recall) #run summary of data to find out max to help calculate probability of values between 15 and 23

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    7.00   11.00   11.61   15.25   23.00

prob<-ecdf(pr4_Q2data$Recall) #create ecd function, object prob is function
prob(23)##cumulative probability of recall score <=23

## [1] 1

prob(15)##cumulative probability of recall score <=15

## [1] 0.75

prob(23)-prob(15)

## [1] 0.25

paste('[1] 0.25')

## [1] "[1] 0.25"

#easier way to calculate this:
1-prob(15)

## [1] 0.25

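##3c Based on the eCDF, the probability that a randomly selected observation will be between 11 and 15 is 16%. Note that prob(15)-prob(11) gives P(11 < Recall <= 15), so scores of exactly 11 are excluded and scores of exactly 15 are included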
prob(15)

## [1] 0.75

prob(11)

## [1] 0.59

prob(15)-prob(11)

## [1] 0.16

paste('[1] 0.16')

## [1] "[1] 0.16"
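
Whether "between 11 and 15" includes the endpoints changes the answer when scores are integers. The sketch below shows the variants on a made-up vector (`x` is hypothetical, not the Eysenck data). It relies on the fact that an eCDF evaluated at a value gives the proportion of observations at or below that value.

```r
# Hypothetical integer scores (not the Eysenck data)
x  <- c(5, 8, 11, 11, 12, 13, 15, 15, 16, 20)
Fx <- ecdf(x)

# The eCDF at 15 equals the proportion of scores <= 15
all.equal(Fx(15), mean(x <= 15))

p.open.left <- Fx(15) - Fx(11)  # P(11 <  X <= 15): excludes scores of exactly 11
p.closed    <- Fx(15) - Fx(10)  # P(11 <= X <= 15) when scores are integers
p.strict    <- Fx(14) - Fx(11)  # P(11 <  X <  15) when scores are integers

c(open.left = p.open.left, closed = p.closed, strict = p.strict)
```

Stepping the lower bound down to 10 (or the upper bound down to 14) only works because the scores are whole numbers. With continuous data the endpoints would not matter.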

[4 points] Suppose a person is randomly selected from the data (let’s say his name is John) and John’s ‘Recall’ value is 15. Is John more likely to be from the old age group (variable Age=1) or from the young age group (Age=2)? Support your answer with appropriate statistics or graphs (but do NOT do any hypothesis testing here) [hint: think about the probability of ‘Recall’ >=15 for each age group].

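#The proportion of recall scores above 15 is 8% in the older group but 42% in the younger group, so a score as high as John's (15) is far more typical of the younger group; it's more likely that John belongs to the younger age group. Note that 1-ecdf(15) gives P(Recall > 15); with integer scores, P(Recall >= 15) would be 1-ecdf(14).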

older<-subset(pr4_Q2data, Age==1,select = c('Recall', 'group')) #subset older adults
younger<-subset(pr4_Q2data, Age==2,select = c('Recall', 'group')) #subset younger adults

##show summary stats by age
older.stats <- with(older, c(n=length(Recall),mean=mean(Recall), sd=sd(Recall),min=min(Recall), max=max(Recall)))
younger.stats <- with(younger, c(n=length(Recall),mean=mean(Recall), sd=sd(Recall),min=min(Recall), max=max(Recall)))

cat('The sample statistics by age\n') #add a spiffy title to my summary stats table

## The sample statistics by age

rbind(older.stats,younger.stats)

##                n  mean        sd min max
## older.stats   50 10.06 4.0071874   3  23
## younger.stats 50 13.16 5.7865432   4  22

older.ecdf<-ecdf(older$Recall) #create ecdf by age group function
younger.ecdf<-ecdf(younger$Recall)

#Proportion of recall scores greater than 15 in the older and younger age groups (1-ecdf(15) excludes scores of exactly 15):
1-(older.ecdf(15))

## [1] 0.08

1- younger.ecdf(15)

## [1] 0.42
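
The hint asks about scores of 15 or more, which includes 15 itself. A minimal sketch of that calculation by age group follows, on made-up data (`toy` is hypothetical, not the Eysenck data).

```r
# Hypothetical toy data (not the Eysenck data): Age 1 = old, 2 = young
toy <- data.frame(Age    = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  Recall = c(6, 9, 10, 15, 12, 15, 18, 11, 16, 20))

# Proportion of scores at or above 15 within each age group
p.ge15 <- tapply(toy$Recall >= 15, toy$Age, mean)
p.ge15

# Same quantity via the eCDF for integer scores: 1 - F(14), not 1 - F(15)
old.ecdf   <- ecdf(toy$Recall[toy$Age == 1])
p.old.ecdf <- 1 - old.ecdf(14)
p.old.ecdf
```

Applied to the real data, the same two lines with `pr4_Q2data` in place of `toy` would give the at-or-above-15 proportions for each age group.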

[totally 6 points] Suppose the sample data ’Recall” in Eysenck’s study was drawn from a population of ’Recall” which is normally distributed, with population mean 12.5 (i.e., \(\mu\)=12.5) and population standard deviation 6 (i.e., \(\sigma\)=6):

[2 points] Based upon the assumed population distribution, compute, in the population, the proportion of observations with ’Recall” value greater than 15.

## your R codes for 4a

#4a. Assuming a normal distribution, about 33.85% of observations are greater than 15. 
pnorm(15, mean=12.5, sd=6, lower.tail=FALSE)

## [1] 0.33846112

paste('[1] 0.3384611')

## [1] "[1] 0.3384611"

[2 points] Based upon the assumed population distribution, if you randomly select one observation from the population, what is the probability that the ’Recall” value of this observation will be between 11 and 15.

## your R codes for 4b
##4b There's about a 26.02% chance of a randomly selected observation being between 11 and 15 
p15<-pnorm(15, mean=12.5, sd=6, lower.tail=TRUE)

p11<-pnorm(11, mean=12.5, sd=6, lower.tail=TRUE)

p15-p11

## [1] 0.26024521
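
The same probability can be reached by converting the two scores to z-scores first. The sketch below uses only the population values given in the question, and shows that `pnorm()` with `mean` and `sd` arguments standardizes in the same way.

```r
mu    <- 12.5
sigma <- 6

# Standardize the two cut points
z11 <- (11 - mu) / sigma   # -0.25
z15 <- (15 - mu) / sigma   # about 0.4167

# Area between the cut points, via the standard normal and directly
p.z   <- pnorm(z15) - pnorm(z11)
p.raw <- pnorm(15, mean = mu, sd = sigma) - pnorm(11, mean = mu, sd = sigma)

all.equal(p.z, p.raw)
p.raw
```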

[2 points] If randomly selecting a sample with sample size 400 from this population, what is the expected difference between this sample mean and the true population mean? Show how you estimate this difference.

## your R codes for 4c
#The standard error is 0.3 (6/sqrt(400)). This is the standard deviation of the sampling distribution of sample means, so the mean of a randomly selected sample of size 400 is expected to differ from the population mean by about 0.3 because of sampling error. By the Central Limit Theorem the sample means are approximately normally distributed around 12.5, so about 68% of such samples have means within 0.3 of the population mean and about 95% within roughly 0.6. A sample mean much farther away than that would be unusual for a random sample from this population.

#population mean 12.5 (i.e., $\mu$=12.5) and population standard deviation 6 (i.e., $\sigma$=6):   
#calculate standard error; se= tells you how close sample mean is to pop mean 
mu<-12.5
sigma<-6

n<-400

##standard error of sampling distribution of sample means based upon population sd
se.mean<-sigma/sqrt(n)
se.mean #6/sqrt(400) = 0.3
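
To see how the expected distance shrinks with sample size, the standard error can be computed for several values of n. In this sketch the sample sizes other than 100 and 400 are arbitrary illustrations. The last line assumes that "expected difference" is read as the expected absolute deviation of a normally distributed sample mean, which is SE * sqrt(2/pi).

```r
sigma <- 6
n     <- c(25, 100, 400, 1600)   # 25 and 1600 are illustrative additions

# Standard error of the mean for each sample size
se <- sigma / sqrt(n)
data.frame(n = n, se = se)

# Expected absolute deviation of the sample mean for n = 400,
# under the normality assumption stated above
mean.abs.dev.400 <- (sigma / sqrt(400)) * sqrt(2 / pi)
mean.abs.dev.400
```

Quadrupling the sample size halves the standard error, which is why the n=400 value (0.3) is half of the n=100 value (0.6) used in question 2.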


Caley M Mikesell