If you use this .Rmd template file to do this practice, make sure to submit two documents: 1) Psc210a_PR03_yourInitials.Rmd; 2) Psc210a_PR03_yourInitials.html (this .html file should be generated from your .Rmd file). Before you submit your work, replace ‘yourInitials’ with your own initials in the file name.
If you use any other statistical program (e.g. SPSS), make sure to submit your SPSS syntax codes together with your answers to the questions.
##set your working directory
#setwd('~/Desktop/Psy210a_F2022')
# I use Rcloud, so setwd doesn't work because I have to import the data, but I'm including it as a comment because it would be needed for the desktop version
##clear your work space
rm(list=ls())
(2.5 points) Generate a new variable ‘z_rating’ to hold the
z-scores for variable “rating” among all observations (i.e., to
standardize the variable ‘rating’ to a distribution with mean zero and
standard deviation one; this is a z-score transformation);
(2.5 points) Transform variable “rating” into a new variable “rating_rescale” with mean 20 and standard deviation 5 (this is a standardization with given mean and sd);
(5 points) Generate and compare the distributions of variables “rating” & “rating_rescale”. Present the distributions and briefly comment on their differences and similarities (you may include your answer here as comments in R code chunk);
(5 points) Generate and compare sample statistics (e.g. mean, standard deviation, range) for variables ‘rating’ and ‘rating_rescale’. Which variable has larger standard deviation? Does the variable with larger standard deviation vary more (i.e. has larger variation) than the other variable? Explain why or why not (you may include your answer here as comments in R code chunk);
(5 points) Generate a side by side box-plot for “rating” by “composite” variable. Present the side by side box-plot (with appropriate title/caption) and briefly comment on what it tells;
(5 points) Generate mean, median, SD, and variance for “rating” by “composite”. Present the results in a table format (with appropriate title);
(5 points). Present QQ-plot for ‘rating’. Comment on whether the distribution of ‘rating’ approximates normal based upon the QQ-plot (you may include your answer here as comments in R code chunk).
##your R codes for question 1
##you may include your answers in question 1 in this R code chunk as comments
pr3_q1 <- read.csv('pr3_q1data.csv') #read my imported data and name it pr3_q1
head(pr3_q1, 3) #to read first 3 rows of data
## ID Composite Rating
## 1 1 1 1.20
## 2 2 1 1.82
## 3 3 1 1.93
sum(is.na(pr3_q1$Rating)) ##check for missing values
## [1] 0
cat('\nRaw ratings descriptive statistics:\n')##Name des stats table
##
## Raw ratings descriptive statistics:
'ratings.des' <- c(n=length(pr3_q1$Rating), mean=mean(pr3_q1$Rating),
sd=sd(pr3_q1$Rating), min=min(pr3_q1$Rating), max=max(pr3_q1$Rating), cv=sd(pr3_q1$Rating)/mean(pr3_q1$Rating))
round(ratings.des, 2)
## n mean sd min max cv
## 40.00 2.95 0.56 1.20 4.02 0.19
##1A: step 1: generate z-score and store as z_rating
z_rating <- (pr3_q1$Rating - mean(pr3_q1$Rating))/sd(pr3_q1$Rating)
#Verify the mean and sd of z_rating and compare with that of raw ratings
cat('\nCompare mean and SD of z_rating and raw ratings:\n')
##
## Compare mean and SD of z_rating and raw ratings:
'comparisons'<-rbind(z_rating=c(mean=mean(z_rating), sd=sd(z_rating)),
Rating=c(mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)))
round(comparisons, 2)
## mean sd
## z_rating 0.00 1.00
## Rating 2.95 0.56
paste('z_rating mean is 0 and SD is 1')
## [1] "z_rating mean is 0 and SD is 1"
##1B Transform variable “rating” into a new variable “rating_rescale” with mean 20 and standard deviation 5
## transform the z-score into distribution with mean=20 sd=5
rating_rescale<-z_rating*5+20
cat('\nRating_rescale descriptive statistics:\n') ##Name des stats table
##
## Rating_rescale descriptive statistics:
'rescale.des' <- c(n=length(rating_rescale), mean=mean(rating_rescale),
sd=sd(rating_rescale), min=min(rating_rescale), max=max(rating_rescale), cv=sd(rating_rescale)/mean(rating_rescale))
round(rescale.des, 2)
## n mean sd min max cv
## 40.00 20.00 5.00 4.23 29.60 0.25
##Verify and compare the mean and sd of z_rating, rating_rescale, and raw ratings
cat('\nCompare mean and SD of z_rating, rating_rescale and raw ratings:\n')
##
## Compare mean and SD of z_rating, rating_rescale and raw ratings:
'newcomparisons'<-rbind(z_rating=c(mean=mean(z_rating), sd=sd(z_rating)),
rating_rescale=c(mean=mean(rating_rescale), sd=sd(rating_rescale)),
Rating=c(mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)))
round(newcomparisons, 2)
## mean sd
## z_rating 0.00 1.00
## rating_rescale 20.00 5.00
## Rating 2.95 0.56
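The two transformations above can also be sketched with base R's `scale()`. This is a minimal sketch on a made-up vector `x` (hypothetical values, not the assignment data), and `rescale_to()` is an illustrative helper name rather than part of the assignment.

```r
# Hypothetical ratings, used only for illustration (not the assignment data)
x <- c(1.2, 1.8, 2.5, 3.1, 3.3, 4.0)

# scale() centers on the sample mean and divides by the sample SD;
# as.numeric() drops the matrix attributes that scale() returns
z <- as.numeric(scale(x))

# Illustrative helper: map a variable onto a target mean and SD via z-scores
rescale_to <- function(v, target_mean, target_sd) {
  z_v <- (v - mean(v)) / sd(v)
  z_v * target_sd + target_mean
}

x_rescaled <- rescale_to(x, target_mean = 20, target_sd = 5)

# Check that both versions hit their targets
round(c(mean_z = mean(z), sd_z = sd(z),
        mean_rescaled = mean(x_rescaled), sd_rescaled = sd(x_rescaled)), 2)
```

Writing the rescaling as a function keeps the target mean and SD visible as arguments instead of hard-coded constants.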
#c) (5 points) Generate and compare the distributions of variables “rating” & “rating_rescale”. Present the distributions and briefly comment on their differences and similarities (you may include your answer here as comments in R code chunk);
hist(pr3_q1$Rating, prob=TRUE,
main='Distribution of original ratings',
xlab='Ratings',
ylab='Density')
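curve(dnorm(x, mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)), #normal curve with the sample mean and sd
from = min(pr3_q1$Rating), to=max(pr3_q1$Rating),
col='yellow', add=TRUE)
labels=paste('sample mean=',round(mean(pr3_q1$Rating),2), '\nsample sd= ',
round(sd(pr3_q1$Rating),2))
legend('topleft', legend=labels, bty='n', cex=.8) #show the sample mean and sd on the plot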
#Kernel dist curve
lines(density(pr3_q1$Rating), col='red')
hist(rating_rescale, prob=TRUE,
main='Distribution of rescaled ratings',
xlab='Ratings',
ylab='Density')
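curve(dnorm(x, mean=mean(rating_rescale), sd=sd(rating_rescale)), #normal curve with the sample mean and sd
from = min(rating_rescale), to=max(rating_rescale),
col='purple', add=TRUE)
labels=paste('sample mean=',round(mean(rating_rescale),2), '\nsample sd= ',
round(sd(rating_rescale),2))
legend('topleft', legend=labels, bty='n', cex=.8) #show the sample mean and sd on the plot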
#Kernel dist curve
lines(density(rating_rescale), col='blue')
##normal QQ plot to compare distribution of raw data ratings
paste("1. Q-Q Plot of raw data ratings:")
## [1] "1. Q-Q Plot of raw data ratings:"
qqnorm(pr3_q1$Rating)
qqline(pr3_q1$Rating, col='yellow')
##normal QQ plot to compare distribution of rescaled ratings
paste("2. Q-Q Plot of rescaled ratings:")
## [1] "2. Q-Q Plot of rescaled ratings:"
qqnorm(rating_rescale)
qqline(rating_rescale, col='blue')
##As shown by the histograms with kernel density lines and by the Q-Q plots, the raw ratings and rating_rescale (a linear transformation of the z-scores) have the same distribution shape. Only the scale differs: the x axes of the histograms and the y axes of the Q-Q plots cover different ranges, and the density values on the histograms' y axes are smaller for rating_rescale because the same area is spread over a wider range.
##Descriptive statistics: raw data mean=2.95, SD=.56, min=1.2, max=4.02, cv=.19; rating_rescale mean=20, SD=5, min=4.23, max=29.6, cv=.25. rating_rescale was built from the z-scores of the raw data, so its larger values reflect the new scale rather than a change in the data.
##1D Generate and compare sample statistics (e.g. mean, standard deviation, range) for variables ‘rating’ and ‘rating_rescale’. Which variable has larger standard deviation? Does the variable with larger standard deviation vary more (i.e. has larger variation) than the other variable? Explain why or why not (you may include your answer here as comments in R code chunk);
cat('\nRaw ratings descriptive statistics:\n')##Name des stats table
##
## Raw ratings descriptive statistics:
'ratings.des' <- c(n=length(pr3_q1$Rating), mean=mean(pr3_q1$Rating),
sd=sd(pr3_q1$Rating), min=min(pr3_q1$Rating), max=max(pr3_q1$Rating), cv=sd(pr3_q1$Rating)/mean(pr3_q1$Rating))
round(ratings.des, 2)
## n mean sd min max cv
## 40.00 2.95 0.56 1.20 4.02 0.19
cat('\nRating_rescale descriptive statistics:\n') ##Name des stats table
##
## Rating_rescale descriptive statistics:
'rescale.des' <- c(n=length(rating_rescale), mean=mean(rating_rescale),
sd=sd(rating_rescale), min=min(rating_rescale), max=max(rating_rescale), cv=sd(rating_rescale)/mean(rating_rescale))
round(rescale.des, 2)
## n mean sd min max cv
## 40.00 20.00 5.00 4.23 29.60 0.25
##Descriptive statistics show that raw data mean=2.95, SD=.56, min=1.2, max=4.02, cv=.19; rating_rescale mean=20, SD=5, min=4.23, max=29.6, cv=.25. The mean is the average rating, the SD measures how dispersed the ratings are around the mean, the min and max are the lowest and highest ratings, and the cv is the size of the SD relative to the mean.
#rating_rescale has the larger SD (5 vs. .56), and its CV, min, and max are also larger in absolute terms. However, this does not mean that rating_rescale varies more than the raw data: it is the same data expressed on a different scale, so every observation has the same z-score in both variables. The CV differs (.25 vs. .19) only because the transformation shifts the mean as well as the spread.
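The point that a larger SD on a rescaled variable does not mean more variation can be checked directly. The sketch below uses made-up scores (`x` is hypothetical, not the assignment data) and applies the same mean-20, SD-5 rescaling.

```r
# Hypothetical scores, used only for illustration
x <- c(2, 4, 4, 5, 7, 8)
y <- 20 + 5 * (x - mean(x)) / sd(x)   # same data on a new scale

z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

all.equal(z_x, z_y)   # relative positions of the observations are identical
cor(x, y)             # perfect linear relationship

# The CV is not preserved, because the transformation shifts the mean
# as well as stretching the spread
c(cv_x = sd(x) / mean(x), cv_y = sd(y) / mean(y))
```

Comparing z-scores rather than raw SDs is one way to show that the two variables carry the same information.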
##1e) (5 points) Generate a side by side box-plot for “rating” by “composite” variable. Present the side by side box-plot (with appropriate title/caption) and briefly comment on what it tells;
##side-by-side boxplots of rating by composite
boxplot(pr3_q1$Rating ~ pr3_q1$Composite,
horizontal = FALSE,
main = "Raw data of rating score by composite",
ylab = 'Rating',
xlab = 'Composite',
frame.plot = FALSE,
col=c('blue', 'green'))
head(pr3_q1$Composite,100)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
##1f) (5 points) Generate mean, median, SD, and variance for “rating” by “composite”. Present the results in a table format (with appropriate title);
cat('\nMean, median, min, max, and 1st/3rd quartiles by composite:\n')##Name table
##
## Mean, median, min, max, and 1st/3rd quartiles by composite:
a<-tapply(pr3_q1$Rating, pr3_q1$Composite, summary)
cbind(Composite1=a[[1]], Composite5=a[[2]]) ##the two groups are coded 1 and 5
## Composite1 Composite5
## Min. 1.2000 3.1300
## 1st Qu. 2.3225 3.2000
## Median 2.5950 3.2650
## Mean 2.6445 3.2615
## 3rd Qu. 2.9825 3.3100
## Max. 4.0200 3.3800
cat('\nMean, SD, and CV for rating by composite:\n')
##
## Mean, SD, and CV for rating by composite:
Composite1<-subset(pr3_q1, Composite =="1")
Composite5<-subset(pr3_q1, Composite =="5")
cvcomp1=sd(Composite1$Rating)/mean(Composite1$Rating)
cv5=sd(Composite5$Rating)/mean(Composite5$Rating)
descriptivesComp1 <- c(mean=mean(Composite1$Rating), SD=sd(Composite1$Rating), CV=cvcomp1)
descriptivesComp5 <- c(mean=mean(Composite5$Rating), SD=sd(Composite5$Rating), CV=cv5)
rbind(descriptivesComp1,descriptivesComp5)
## mean SD CV
## descriptivesComp1 2.6445 0.655201656 0.247760127
## descriptivesComp5 3.2615 0.068922153 0.021132041
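Question 1f asks for the mean, median, SD, and variance in one table. One possible way to build such a table in base R is sketched below on a small made-up data frame (`dat` and its values are hypothetical; only the column names mirror the assignment data).

```r
# Hypothetical data frame with the same column names as the assignment data
dat <- data.frame(
  Composite = rep(c(1, 5), each = 4),
  Rating    = c(1.2, 2.3, 2.9, 4.0, 3.1, 3.2, 3.3, 3.4)
)

# One row per Composite group, one column per statistic
by_group <- do.call(rbind, lapply(split(dat$Rating, dat$Composite), function(r) {
  c(mean = mean(r), median = median(r), SD = sd(r), variance = var(r))
}))

cat('\nMean, median, SD, and variance of rating by composite:\n')
round(by_group, 3)
```

`split()` keeps the group codes as row names, so the table labels itself and does not depend on the groups being numbered consecutively.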
##1g) (5 points). Present a QQ-plot for ‘rating’. Comment on whether the distribution of ‘rating’ approximates normal based upon the QQ-plot (you may include your answer here as comments in R code chunk).
##normal QQ plot to compare distribution of raw data ratings
qqnorm(pr3_q1$Rating)
qqline(pr3_q1$Rating, col='yellow')
#Because the observations do not closely follow the Q-Q line and there are outliers, the distribution of rating does not appear to be normal.
#This matches what is seen in the histogram (the highest density of ratings is between 3 and 3.5, with a slightly negative skew and a cluster of scores between 1 and 3).
##your R codes for question 2
color_pr3 <- read.csv('color_pr3.csv') #read my imported data and name it color_pr3
head(color_pr3, 3) #to read first 3 rows of data
## Region Eyes Hair Count
## 1 1 blue fair 23
## 2 1 blue fair 23
## 3 1 blue fair 23
##Name table
cat('\nFrequency of Hair Color:\n')
##
## Frequency of Hair Color:
##Make function table with frequency of hair by color
(frequency<-table(color_pr3$Hair))
##
## black dark fair medium red
## 22 182 228 217 113
##Name table
cat('\n\nRelative Frequency of Hair Color:\n')
##
##
## Relative Frequency of Hair Color:
##Make function table with relative frequency of hair color in sample
(relative.frequency<-frequency/nrow(color_pr3))
##
## black dark fair medium red
## 0.028871391 0.238845144 0.299212598 0.284776903 0.148293963
##Name table
cat('\n\nDistribution of Hair Color:\n')
##
##
## Distribution of Hair Color:
##combine frequency and relative frequency into one object
(haircolor.frequency<-cbind(frequency, relative.frequency))
## frequency relative.frequency
## black 22 0.028871391
## dark 182 0.238845144
## fair 228 0.299212598
## medium 217 0.284776903
## red 113 0.148293963
print('Black hair represents about 2.89% of the hair color of the sample, while dark represents 23.88%, fair 29.92%, medium 28.48%, and red 14.83%. There are 22 people with black hair, 182 with dark, 228 with fair, 217 with medium, and 113 with red.')
## [1] "Black hair represents about 2.89% of the hair color of the sample, while dark represents 23.88%, fair 29.92%, medium 28.48%, and red 14.83%. There are 22 people with black hair, 182 with dark, 228 with fair, 217 with medium, and 113 with red."
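As an alternative to dividing by `nrow()`, base R's `prop.table()` turns a frequency table into relative frequencies directly. The sketch below uses a short made-up vector (`hair` is hypothetical, not the assignment data).

```r
# Hypothetical hair colors, used only for illustration
hair <- c('fair', 'dark', 'fair', 'red', 'medium', 'dark', 'fair', 'black')

freq <- table(hair)        # counts per category
rel  <- prop.table(freq)   # counts divided by the total

# Combine counts and percentages in one table
cbind(Frequency = freq, Percent = round(100 * rel, 2))
```

Because `prop.table()` divides by the table's own total, the proportions always sum to one even if some rows were dropped before tabulating.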
##add labels for columns
colnames(haircolor.frequency)<-c('Frequency', 'RelativeFrequency')
##Barplot
barplot(frequency,
main='Distribution of Hair Color',
xlab="Hair color",
ylab= "Number of people",
names.arg=c('Black', 'Dark','Fair', 'Medium', 'Red'),
col=c('black','gray','yellow','brown','red'),
ylim=c(0, 300))
##pie chart
pie(frequency,
main='Piechart of hair color distribution')
[1 point] What are the possible values of Y?
[2 points] If purely by guessing, what is the probability that one would guess the answers of 25 questions correctly?
[2 points] If purely by guessing, what is the probability that one would guess the answers of at least 25 questions correctly?
[3 points] Jane wants to apply for a private school. From historical data, Jane thinks that (1) on average, the test-takers got 40 questions correct; and (2) the private school only accepted students whose score in this standard reading test is in the top 10% of the distribution. Given Jane’s understanding, what is the minimum number of questions Jane would aim to answer correctly in order to be in the top 10%?
[4 points] This standard test was designed with the idea that, on average, a test-taker will answer 70% of the questions correctly. Given this design (i.e., assuming that, on average, a test-taker answers 70% of the questions correctly), what will be the average test score for 200 test-takers? Simulate and show the distribution of the test scores from 200 test-takers.
##your R codes for question 4
##you may include your answers in question 4 in this R code chunk as comments
#a. Y=0:60 The possible values of Y are between 0 and 60. A student could get between 0 answers correct and 60 answers correct.
#b. The probability that one would guess the answers of 25 questions correctly is .195%
n<-60 ##60 questions
q.correct<-1/4 ##probability of guessing each answer correctly
('4b'<-dbinom(25, n, q.correct)) #(#questions correct, n, probability of getting each question correct)
## [1] 0.0019540738
paste("[1] 0.001954074")
## [1] "[1] 0.001954074"
#c. The probability that one would guess the answers of at least 25 questions correctly is about .342%
#"At least 25" means P(Y >= 25) = 1 - P(Y <= 24), so the cutoff passed to pbinom is 24, not 25
n<-60 ##60 questions
q.correct<-1/4 ##probability of guessing each answer correctly
round(1-pbinom(24, n, q.correct), 6) ##calculate cumulative probability of guessing at least 25 correct
## [1] 0.003423
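A common slip with "at least" questions is the cutoff passed to `pbinom()`, which returns P(Y <= q). The sketch below, under the question's own setup (60 items, a 1/4 chance per guess), shows three equivalent ways to get P(Y >= 25); the object names are illustrative.

```r
n <- 60      # number of questions
p <- 1/4     # chance of guessing one question correctly
k <- 25      # "at least k correct"

# Three equivalent ways to get P(Y >= k)
by_sum        <- sum(dbinom(k:n, n, p))
by_complement <- 1 - pbinom(k - 1, n, p)
by_upper_tail <- pbinom(k - 1, n, p, lower.tail = FALSE)

c(by_sum, by_complement, by_upper_tail)

# P(Y > k) differs from P(Y >= k) by exactly P(Y = k)
by_upper_tail - pbinom(k, n, p, lower.tail = FALSE)   # same as dbinom(k, n, p)
```

Using `lower.tail = FALSE` avoids the subtraction from one, which can lose precision when the tail probability is very small.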
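#d. In order to get into the top 10% of test takers, Jane would need to get at least 45 questions correct.
#An average of 40 correct out of 60 implies a per-question success probability of 40/60, assuming the scores follow a binomial distribution.
qbinom(.90, size=60, prob=40/60)
## [1] 45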
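A quantile cutoff like the one in part d can be sanity-checked by looking at the cumulative probabilities on either side of it. This sketch assumes, as an assumption not stated in the question, that scores are binomial with 60 items and success probability 40/60; the object names are illustrative.

```r
n_items <- 60
p_item  <- 40 / 60   # assumption: mean of 40 correct out of 60 items

# Smallest score whose cumulative probability reaches 0.90
cutoff <- qbinom(0.90, size = n_items, prob = p_item)

# Cumulative probabilities just below and at the cutoff
c(below_cutoff = pbinom(cutoff - 1, n_items, p_item),
  at_cutoff    = pbinom(cutoff,     n_items, p_item))

# Share of test-takers scoring at or above the cutoff
pbinom(cutoff - 1, n_items, p_item, lower.tail = FALSE)
```

Because the scores are discrete, the share at or above the cutoff need not equal exactly 10%, so it is worth reporting that share alongside the cutoff.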
#E. The average test taker will get 70% of 60 questions correct = 42 questions, so the expected average score for 200 test-takers is 42. The histogram and Q-Q plot below show the simulated distribution. With 60 questions and p=.70 the scores should be roughly bell-shaped around 42 with, at most, a slight negative skew, and the points on the Q-Q plot fall in steps because the scores are whole numbers.
#70% of 60 questions = 42
set.seed(2022) #arbitrary seed so the simulated sample is reproducible
rando.data <- rbinom(n=200, size= 60, prob=70/100) #generate random sample of test takers
#Histogram
hist(rando.data, prob=TRUE,
main='Q3:The distribution of Standard Reading Test test takers',
xlab='Score', ylab='Density',
ylim=c(0,.15))
lines(density(rando.data), col='red')
##normal QQ plot to see if data is normally distributed
qqnorm(rando.data)
qqline(rando.data, col='yellow')
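The simulated scores can be compared against what the binomial model predicts. The sketch below is a minimal check under the question's design (60 items, 70% expected correct, 200 test-takers); the seed and object names are arbitrary choices for illustration.

```r
set.seed(210)   # arbitrary seed, chosen only for reproducibility

n_items   <- 60     # questions on the test
p_correct <- 0.70   # design assumption: 70% answered correctly on average
n_takers  <- 200    # simulated test-takers

scores <- rbinom(n = n_takers, size = n_items, prob = p_correct)

# Theoretical mean and SD of a binomial score next to the simulated values
round(c(expected_mean  = n_items * p_correct,
        simulated_mean = mean(scores),
        expected_sd    = sqrt(n_items * p_correct * (1 - p_correct)),
        simulated_sd   = sd(scores)), 2)
```

With 200 draws the simulated mean should land close to 42, and rerunning with a different seed gives a sense of how much it fluctuates.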