If you use this .Rmd template file to do this practice, make sure to submit two documents: 1) Psc210a_PR04_yourInitials.Rmd; 2) Psc210a_PR04_yourInitials.html (this .html file should be generated from your .Rmd file). Before you submit your work, replace ‘yourInitials’ by your own initials in the file name.
If you use any other statistical program (e.g. SPSS), make sure to submit your SPSS syntax codes together with your answers to the questions.
##set my working directory
setwd('~/Desktop/Psy210a_F2022')
##clear my work space
#rm(list=ls()) #Note: I did clear my data, but I commented this out so it would stop clearing my data
First read the data into R as a data frame. After setting up your working directory, you may use the following R code to read the data in.
##read the .csv data into R
working<-read.csv('pr4_Q2data.csv')
##OR read the .sav (spss) data into R
##if you use the following R codes, make sure to install 'foreign' package first
## install.packages('foreign')
#working<-foreign::read.spss('pr4_Q2data.sav', to.data.frame = TRUE)
## your R codes for 2a
working<-read.csv('pr4_Q2data.csv')
pr4_Q2data<-read.csv('pr4_Q2data.csv') #also naming it this
head(pr4_Q2data, 3) #to read first 3 rows of data
## Age Recall group
## 1 1 9 1
## 2 1 8 1
## 3 1 6 1
head(working, 3)
## Age Recall group
## 1 1 9 1
## 2 1 8 1
## 3 1 6 1
sum(is.na(pr4_Q2data$Recall)) ##check for missing value
## [1] 0
#[1]0 means no missing values
#pr4_Q2data[!is.na(pr4_Q2data$Recall), ] #the code I would use to remove observations with a missing value in variable 'Recall' if there were any
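A minimal sketch of the removal step mentioned in the comment above. The data frame `toy` is made up for illustration and is not the assignment data; it only shows that row subsetting needs the trailing comma inside the brackets.

```r
# Hypothetical toy data with one missing Recall value (not the assignment data)
toy <- data.frame(Age    = c(1, 1, 2, 2),
                  Recall = c(9, NA, 14, 11),
                  group  = c(1, 2, 3, 4))

# Count missing values in Recall
n.missing <- sum(is.na(toy$Recall))

# Keep only rows where Recall is observed;
# the trailing comma means "all columns"
toy.complete <- toy[!is.na(toy$Recall), ]

n.missing
nrow(toy.complete)
```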
## your R codes for 2b
boxplot(pr4_Q2data$Recall ~ pr4_Q2data$group,
main="Boxplot of recall by group",
ylab = 'Recall of words',
xlab='Group',
frame.plot=FALSE)
#The boxplots show that participants in groups 1 and 2 recalled fewer words, as indicated by their lower medians (the center line of a boxplot is the median, not the mean) and shorter whiskers. Participants in group 3 recalled more words than groups 1 and 2, and group 3 also had the most outliers. Group 4 had the highest typical recall. Group 5 recalled slightly fewer words than group 4 and had a wider range of words recalled than any other group.
## your R codes for 2c & 2d
hist(pr4_Q2data$Recall,
prob=TRUE, ## argument 'prob=TRUE' specifies 'density' as the scale of Y-axis
## without 'prob=TRUE', the default of Y-axis is frequency
main='Histogram of Recall of Words',
xlab='Recall of words',
ylab = 'Density',
density = 15)
#2c This is a sample distribution because it is data from one sample, not data from the entire population or statistics from multiple samples.
##2d add normal curve to my histogram
curve(dnorm(x, mean=11.61, sd=5.191),
from=3.0,
to=23,
col='green', add=TRUE)
##Add kernel density curve/empirical density curve
lines(density(pr4_Q2data$Recall), col='purple')
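#Find the mean, sd, and summary used for the normal curve (names chosen so they do not mask the base functions mean(), sd(), and summary())
recall.mean<-mean(pr4_Q2data$Recall)
recall.sd<-sd(pr4_Q2data$Recall)
recall.summary<-summary(pr4_Q2data$Recall)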
#2d The distribution of recall of words is slightly positively skewed, as shown by the curves that are 'bunched up' on the left side of the x axis
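One way to avoid hard-coding the mean and sd in the normal curve is to compute them first and pass the stored values to `dnorm()`. The sketch below assumes simulated scores (`recall` is a hypothetical stand-in for `pr4_Q2data$Recall`), so the plot will not match the assignment figure.

```r
# Hypothetical, right-skewed scores standing in for pr4_Q2data$Recall
set.seed(210)
recall <- round(rgamma(100, shape = 5, rate = 0.43))

# Compute the statistics before drawing anything
recall.mean <- mean(recall)
recall.sd   <- sd(recall)

# Histogram on the density scale
hist(recall, prob = TRUE,
     main = 'Histogram of simulated recall scores',
     xlab = 'Recall of words', ylab = 'Density')

# Normal curve built from the computed statistics
curve(dnorm(x, mean = recall.mean, sd = recall.sd),
      col = 'green', add = TRUE)

# Kernel (empirical) density curve for comparison
lines(density(recall), col = 'purple')
```

If the data file changes, the curve then updates automatically instead of relying on values copied from earlier output.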
## your R codes for 2e & 2f
##2E
#Create a sampling distribution assuming normality bc of central limit theorem
n<-100 #sample size 100
#Define population mean/mu and sd/sigma
pop.mean<-12.5
pop.sd<-6
#create standard error variable to put into curve
se<-pop.sd/sqrt(n)
#Mean of sampling distribution of sample means
mean.samplingdist<-pop.mean
#Create the theoretical sampling distribution of sample means
curve(dnorm(x, mean=pop.mean, sd=se),
from = pop.mean-5*se, to=pop.mean+5*se,
main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line: population mean",
xlab="recall scores", ylab="Density", col="purple")
abline(v=12.5, lty='dotted', col='red')
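##2F There's about a 6.9% chance that a randomly drawn sample from this population (with sample size 100) will have a sample mean less than or equal to the observed "Recall" mean (11.61)
#the probability of observing a sample mean at or below the observed sample mean; the sd of the sampling distribution is the standard error (se), not the population sd
round(pnorm(mean(working$Recall), mean=12.5, sd=se), 3)
## [1] 0.069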
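Because 2f asks about a sample mean, the relevant spread is that of the sampling distribution, i.e. the standard error, rather than the population sd. The sketch below works through that calculation with the values given in this document (population mean 12.5, population sd 6, n = 100, observed sample mean 11.61); the object names are illustrative.

```r
pop.mean    <- 12.5
pop.sd      <- 6
n           <- 100
sample.mean <- 11.61  # sample mean of Recall reported in this document

# Standard error: sd of the sampling distribution of the mean
se <- pop.sd / sqrt(n)

# z-score of the observed sample mean within the sampling distribution
z <- (sample.mean - pop.mean) / se

# P(sample mean <= observed sample mean)
p.lower <- pnorm(sample.mean, mean = pop.mean, sd = se)

round(c(se = se, z = z, p.lower = p.lower), 3)
```

With these inputs the lower-tail probability comes out near 0.069, much smaller than the value obtained when the population sd of 6 is used as the spread.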
Your answers for 2g here:
## possible R codes, if any, for 2g
#Make histogram with curve of Theoretical sampling distribution of sample means of recall scores and sample distribution
n<-100 #sample size 100
#Define population mean/mu and sd/sigma
pop.mean<-12.5
pop.sd<-6
#standard error of sampling distribution of the sample means
standarderror.mean<-pop.sd/sqrt(n)
#Mean of sampling distribution of sample means
mean.samplingdist<-pop.mean
#Create the theoretical sampling distribution of sample means
curve(dnorm(x, mean=pop.mean, sd=standarderror.mean),
from = pop.mean-12*standarderror.mean, to=pop.mean+12*standarderror.mean,
main="Theoretical sampling distribution of sample means of recall scores (n=100) \ndotted red line: population mean \npurple curve: theoretical sampling distribution \ngreen curve: sample distribution of sample 2c\n",
xlab="recall scores", ylab="Density", col="purple")
#sample curve
curve(dnorm(x, mean=11.61, sd=5.191),
col='green', add=TRUE)
abline(v=12.5, lty='dotted', col='red')
cat("\nSample Stats and Population Parameters\n")
##
## Sample Stats and Population Parameters
SampleDistribution.Stats <- c(mean=mean(pr4_Q2data$Recall), sd=sd(pr4_Q2data$Recall))
Population.Parameters <- c(mean=12.5, sd=6)
rbind(SampleDistribution.Stats,Population.Parameters)
## mean sd
## SampleDistribution.Stats 11.61 5.191086
## Population.Parameters 12.50 6.000000
#As shown by the curves in the graph and the table of sample statistics and population parameters, the sample and the population have similar means (population mean=12.5, sample mean=11.61) and sd (population sd=6, sample sd=5.19). The purple curve, however, is the sampling distribution of the sample mean, not the population distribution: its spread is the standard error (population sd/sqrt(n) = 6/sqrt(100) = 0.6), which is much smaller than the sd of individual scores. That is why the purple curve is taller and narrower than the shorter, more spread out green curve (sample).
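A short simulation can illustrate why the sampling distribution of the mean is so much narrower than the distribution of individual scores. This is a sketch only: it assumes a normal population with mean 12.5 and sd 6, and the number of replications (5000) and the seed are arbitrary choices.

```r
set.seed(2022)

pop.mean <- 12.5
pop.sd   <- 6
n        <- 100

# Draw 5000 samples of size n from the assumed normal population
# and keep the mean of each sample
sample.means <- replicate(5000, mean(rnorm(n, mean = pop.mean, sd = pop.sd)))

# The sd of the simulated means should be close to pop.sd / sqrt(n)
c(simulated.se   = sd(sample.means),
  theoretical.se = pop.sd / sqrt(n))
```

The simulated means cluster tightly around 12.5, with a spread close to the theoretical standard error rather than to the population sd.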
## your R codes for 3a
#3a
prob<-ecdf(pr4_Q2data$Recall)
#prob
plot(prob, main='The empirical cumulative distribution function (eCDF) of recall scores',
xlab='Recall score',
ylab='Pr(x<=given recall score)',
col='blue')
## your R codes for 3b & 3c
##3b: 25% of the observations in recall will be greater than 15
summary(pr4_Q2data$Recall) #run summary of data to find out max to help calculate probability of values between 15 and 23
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 7.00 11.00 11.61 15.25 23.00
prob<-ecdf(pr4_Q2data$Recall) #create ecd function, object prob is function
prob(23)##cumulative probability of recall score <=23
## [1] 1
prob(15)##cumulative probability of recall score <=15
## [1] 0.75
prob(23)-prob(15)
## [1] 0.25
paste('[1] 0.25')
## [1] "[1] 0.25"
#easier way to calculate this:
1-prob(15)
## [1] 0.25
##3c Based on the eCDF, the probability that a randomly selected observation will be between 11 and 15 (greater than 11 and up to 15) is 16%
prob(15)
## [1] 0.75
prob(11)
## [1] 0.59
prob(15)-prob(11)
## [1] 0.16
paste('[1] 0.16')
## [1] "[1] 0.16"
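With discrete scores, a difference of two eCDF values counts the upper endpoint but not the lower one. The sketch below uses a small made-up vector `x` (not the assignment data) to show the difference between the half-open and the closed interval.

```r
# Hypothetical scores (not the assignment data)
x <- c(3, 7, 11, 11, 15, 15, 20, 23)

F.x <- ecdf(x)

# F(15) - F(11) is P(11 < X <= 15): scores equal to 11 are excluded
p.half.open <- F.x(15) - F.x(11)

# Proportion with 11 <= X <= 15: scores equal to 11 are included
p.closed <- mean(x >= 11 & x <= 15)

c(half.open = p.half.open, closed = p.closed)
```

Which version is appropriate depends on whether "between 11 and 15" is meant to include a score of exactly 11.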
#Among older adults, only 8% of scores are above 15, whereas among younger adults 42% of scores are above 15. A score as high as John's (15) is therefore more typical of the younger group, so it's more likely that John belongs to the younger age group.
older<-subset(pr4_Q2data, Age==1,select = c('Recall', 'group')) #subset older adults
younger<-subset(pr4_Q2data, Age==2,select = c('Recall', 'group')) #subset younger adults
##show summary stats by age
older.stats <- with(older, c(n=length(Recall),mean=mean(Recall), sd=sd(Recall),min=min(Recall), max=max(Recall)))
younger.stats <- with(younger, c(n=length(Recall),mean=mean(Recall), sd=sd(Recall),min=min(Recall), max=max(Recall)))
cat('The sample statistics by age\n') #add a spiffy title to my summary stats table
## The sample statistics by age
rbind(older.stats,younger.stats)
## n mean sd min max
## older.stats 50 10.06 4.0071874 3 23
## younger.stats 50 13.16 5.7865432 4 22
older.ecdf<-ecdf(older$Recall) #create ecdf by age group function
younger.ecdf<-ecdf(younger$Recall)
#Probability of recall score being greater than 15 in older and younger age group:
1-(older.ecdf(15))
## [1] 0.08
1- younger.ecdf(15)
## [1] 0.42
## your R codes for 4a
#4a. Assuming a normal distribution, about 33.85% of observations are greater than 15.
pnorm(15, mean=12.5, sd=6, lower.tail=FALSE)
## [1] 0.33846112
paste('[1] 0.3384611')
## [1] "[1] 0.3384611"
## your R codes for 4b
##4b There's about a 26.02% chance of a randomly selected observation being between 11 and 15
p15<-pnorm(15, mean=12.5, sd=6, lower.tail=TRUE)
p11<-pnorm(11, mean=12.5, sd=6, lower.tail=TRUE)
p15-p11
## [1] 0.26024521
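The same probability can be checked by standardizing the two cut points first. This sketch uses the population values from this document (mean 12.5, sd 6); the object names are illustrative.

```r
mu    <- 12.5
sigma <- 6

# Convert the raw cut points to z-scores
z11 <- (11 - mu) / sigma
z15 <- (15 - mu) / sigma

# Area between the two z-scores under the standard normal curve
p.z <- pnorm(z15) - pnorm(z11)

# Same area computed on the raw-score scale
p.raw <- pnorm(15, mean = mu, sd = sigma) - pnorm(11, mean = mu, sd = sigma)

all.equal(p.z, p.raw)
```

Both routes give the same area, which is a quick way to confirm that the mean and sd arguments were passed to `pnorm()` correctly.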
## your R codes for 4c
#The standard error is .3 (sigma/sqrt(n) = 6/sqrt(400)). It is the standard deviation of the sampling distribution of the sample mean, so it describes how far the mean of a random sample of 400 typically falls from the population mean because of sampling error. Under the normal sampling distribution (Central Limit Theorem), about 68% of sample means fall within one standard error of 12.5 and about 95% fall within about two standard errors (roughly 11.9 to 13.1). A sample mean inside that range is consistent with the sample coming from this population; a sample mean far outside it would be unlikely for a random sample from this population.
#population mean 12.5 (i.e., $\mu$=12.5) and population standard deviation 6 (i.e., $\sigma$=6):
#calculate standard error; se= tells you how close sample mean is to pop mean
mu<-12.5
sigma<-6
n<-400
##standard error of sampling distribution of sample means based upon population sd
se.mean<-sigma/sqrt(n)
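To make the standard error easier to interpret, it can be turned into a range of plausible sample means. The sketch below assumes a normal sampling distribution and uses the values above (mu = 12.5, sigma = 6, n = 400); `within.1se` and `middle95` are illustrative names.

```r
mu    <- 12.5
sigma <- 6
n     <- 400

# Standard error of the mean
se.mean <- sigma / sqrt(n)

# Proportion of sample means expected within one standard error of mu
within.1se <- pnorm(mu + se.mean, mean = mu, sd = se.mean) -
  pnorm(mu - se.mean, mean = mu, sd = se.mean)

# Range containing the middle 95% of sample means
middle95 <- qnorm(c(0.025, 0.975), mean = mu, sd = se.mean)

se.mean
round(within.1se, 3)
round(middle95, 2)
```

A sample mean outside the middle 95% range would be unusual, though not impossible, for a random sample of 400 from this population.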