If you use this .Rmd template file to do this practice, make sure to submit two documents: 1) Psc210a_PR03_yourInitials.Rmd; 2) Psc210a_PR03_yourInitials.html (the .html file should be generated from your .Rmd file). Before you submit your work, replace ‘yourInitials’ with your own initials in the file names.

If you use any other statistical program (e.g. SPSS), make sure to submit your SPSS syntax code together with your answers to the questions.

  1. Write R code to set your working directory to ‘Psy210a_F2022’. Make sure to include the appropriate path.
##set your working directory  
#setwd('~/Desktop/Psy210a_F2022')
# I use RStudio Cloud, so setwd() is not run here (the data file is imported/uploaded directly); the line above is kept as a comment because it would be needed in a desktop installation.
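## A slightly more portable sketch (assuming the folder name below matches the desktop set-up):
## only change the working directory if the folder actually exists, so the same script also runs
## unchanged on RStudio Cloud, where the folder is absent.
if (dir.exists('~/Desktop/Psy210a_F2022')) setwd('~/Desktop/Psy210a_F2022')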

##clear your workspace
rm(list=ls())

 

  1. [total 30 points] Read the dataset “pr3_q1data.csv” (.csv format) into R as a data frame, store it in object ‘pr3_q1’, and then do the following:
  1. (2.5 points) Generate a new variable ‘z_rating’ to hold the z-scores of variable “rating” across all observations (i.e., standardize the variable ‘rating’ to a distribution with mean zero and standard deviation one; this is a z-score transformation);

  2. (2.5 points) Transform variable “rating” into a new variable “rating_rescale” with mean 20 and standard deviation 5 (this is a standardization with given mean and sd);

  3. (5 points) Generate and compare the distributions of variables “rating” & “rating_rescale”. Present the distributions and briefly comment on their differences and similarities (you may include your answer here as comments in R code chunk);

  4. (5 points) Generate and compare sample statistics (e.g. mean, standard deviation, range) for variables ‘rating’ and ‘rating_rescale’. Which variable has the larger standard deviation? Does the variable with the larger standard deviation vary more (i.e. have larger variation) than the other variable? Explain why or why not (you may include your answer here as comments in R code chunk);

  5. (5 points) Generate a side-by-side box-plot of “rating” by the “composite” variable. Present the side-by-side box-plot (with an appropriate title/caption) and briefly comment on what it shows;

  6. (5 points) Generate mean, median, SD, and variance for “rating” by “composite”. Present the results in a table format (with appropriate title);

  7. (5 points) Present a QQ-plot for ‘rating’. Comment on whether the distribution of ‘rating’ approximates a normal distribution based upon the QQ-plot (you may include your answer here as comments in R code chunk).

##your R codes for question 1
##you may include your answers in question 1 in this R code chunk as comments

pr3_q1 <- read.csv('pr3_q1data.csv') ##read the imported data into a data frame named pr3_q1
head(pr3_q1, 3) ##show the first 3 rows of the data
##   ID Composite Rating
## 1  1         1   1.20
## 2  2         1   1.82
## 3  3         1   1.93
sum(is.na(pr3_q1$Rating)) ##check for missing values
## [1] 0
cat('\nRaw ratings descriptive statistics:\n')##Name des stats table
## 
## Raw ratings descriptive statistics:
ratings.des <- c(n=length(pr3_q1$Rating), mean=mean(pr3_q1$Rating),
           sd=sd(pr3_q1$Rating), min=min(pr3_q1$Rating), max=max(pr3_q1$Rating), cv=sd(pr3_q1$Rating)/mean(pr3_q1$Rating))
round(ratings.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00  2.95  0.56  1.20  4.02  0.19
##1A: step 1: generate z-score and store as z_rating
z_rating <- (pr3_q1$Rating - mean(pr3_q1$Rating))/sd(pr3_q1$Rating)
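##An equivalent base-R shortcut (optional alternative, not required by the question): scale()
##centers and scales in one call and should give the same z-scores.
z_rating_check <- as.numeric(scale(pr3_q1$Rating))
all.equal(z_rating, z_rating_check) ##should be TRUE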

#Verify the mean and sd of z_rating and compare with that of raw ratings
 
cat('\nCompare mean and SD of z_rating and raw ratings:\n')
## 
## Compare mean and SD of z_rating and raw ratings:
comparisons <- rbind(z_rating=c(mean=mean(z_rating), sd=sd(z_rating)),
      Rating=c(mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)))
round(comparisons, 2)
##          mean   sd
## z_rating 0.00 1.00
## Rating   2.95 0.56
paste('z_rating mean is 0 and SD is 1')
## [1] "z_rating mean is 0 and SD is 1"
##1B Transform variable “rating” into a new variable “rating_rescale” with mean 20 and standard deviation 5

## transform the z-score into distribution with mean=20 sd=5
rating_rescale<-z_rating*5+20
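##The same rescaling can be done in a single step (an equivalent alternative shown only as a
##cross-check): apply the linear transformation 20 + 5*z directly to the raw ratings.
rating_rescale_check <- 20 + 5*(pr3_q1$Rating - mean(pr3_q1$Rating))/sd(pr3_q1$Rating)
all.equal(rating_rescale, rating_rescale_check) ##should be TRUE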

 cat('\nRating_rescale descriptive statistics:\n') ##Name des stats table
## 
## Rating_rescale descriptive statistics:
rescale.des <- c(n=length(rating_rescale), mean=mean(rating_rescale),
           sd=sd(rating_rescale), min=min(rating_rescale), max=max(rating_rescale), cv=sd(rating_rescale)/mean(rating_rescale))
round(rescale.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00 20.00  5.00  4.23 29.60  0.25
##Verify and compare the mean and sd of z_rating, rating_rescale, and raw ratings
cat('\nCompare mean and SD of z_rating, rating_rescale, and raw ratings:\n')
## 
## Compare mean and SD of z_rating, rating_rescale, and raw ratings:
newcomparisons <- rbind(z_rating=c(mean=mean(z_rating), sd=sd(z_rating)),
            rating_rescale=c(mean=mean(rating_rescale), sd=sd(rating_rescale)), 
             Rating=c(mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)))
round(newcomparisons, 2)
##                 mean   sd
## z_rating        0.00 1.00
## rating_rescale 20.00 5.00
## Rating          2.95 0.56
##1C) (5 points) Generate and compare the distributions of variables “rating” & “rating_rescale”. Present the distributions and briefly comment on their differences and similarities (you may include your answer here as comments in R code chunk);

hist(pr3_q1$Rating, prob=TRUE,
    main='Distribution of original ratings',
    xlab='Ratings',
    ylab='Density')
##overlay a normal curve using the sample mean and SD
curve(dnorm(x, mean=mean(pr3_q1$Rating), sd=sd(pr3_q1$Rating)),
      from=min(pr3_q1$Rating), to=max(pr3_q1$Rating),
      col='yellow', add=TRUE)
##annotate the plot with the sample mean and SD
labels <- c(paste('sample mean =', round(mean(pr3_q1$Rating), 2)),
            paste('sample sd =', round(sd(pr3_q1$Rating), 2)))
legend('topleft', legend=labels, bty='n')
#Kernel dist curve
lines(density(pr3_q1$Rating), col='red')

hist(rating_rescale, prob=TRUE,
    main='Distribution of rescaled ratings',
    xlab='Ratings',
    ylab='Density')
##overlay a normal curve using the sample mean and SD
curve(dnorm(x, mean=mean(rating_rescale), sd=sd(rating_rescale)),
      from=min(rating_rescale), to=max(rating_rescale),
      col='purple', add=TRUE)
##annotate the plot with the sample mean and SD
labels <- c(paste('sample mean =', round(mean(rating_rescale), 2)),
            paste('sample sd =', round(sd(rating_rescale), 2)))
legend('topleft', legend=labels, bty='n')
#Kernel dist curve
lines(density(rating_rescale), col='blue')

##normal QQ plot to compare distribution of raw data ratings
paste("1. Q-Q Plot of raw data ratings:")
## [1] "1. Q-Q Plot of raw data ratings:"
qqnorm(pr3_q1$Rating)
qqline(pr3_q1$Rating, col='yellow') 

##normal QQ plot to compare distribution of rescaled ratings
paste("2. Q-Q Plot of rescaled ratings:")
## [1] "2. Q-Q Plot of rescaled ratings:"
qqnorm(rating_rescale)
qqline(rating_rescale, col='blue') 

cat('\nRaw ratings descriptive statistics:\n')##Name des stats table
## 
## Raw ratings descriptive statistics:
ratings.des <- c(n=length(pr3_q1$Rating), mean=mean(pr3_q1$Rating),
           sd=sd(pr3_q1$Rating), min=min(pr3_q1$Rating), max=max(pr3_q1$Rating), cv=sd(pr3_q1$Rating)/mean(pr3_q1$Rating))
round(ratings.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00  2.95  0.56  1.20  4.02  0.19
 cat('\nRating_rescale descriptive statistics:\n') ##Name des stats table
## 
## Rating_rescale descriptive statistics:
rescale.des <- c(n=length(rating_rescale), mean=mean(rating_rescale),
           sd=sd(rating_rescale), min=min(rating_rescale), max=max(rating_rescale), cv=sd(rating_rescale)/mean(rating_rescale))
round(rescale.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00 20.00  5.00  4.23 29.60  0.25
##As the histograms (with kernel density curves) and the Q-Q plots show, 'rating' and 'rating_rescale' have exactly the same distributional shape; they differ only in scale (note the different x-axis ranges and bar heights on the histograms, and the different sample-quantile axes on the Q-Q plots).
##The descriptive statistics are: rating mean = 2.95, SD = 0.56, min = 1.20, max = 4.02, cv = 0.19; rating_rescale mean = 20, SD = 5, min = 4.23, max = 29.60, cv = 0.25. These differences are expected because rating_rescale is a linear transformation of rating (z-score the ratings, multiply by 5, add 20), which shifts and stretches the scale but does not change the shape of the distribution. Here the mean is the average rating, the SD measures spread around the mean, min and max give the observed range, and cv is the SD expressed relative to the mean.


##1D Generate and compare sample statistics (e.g. mean, standard deviation, range) for variables ‘rating’ and ‘rating_rescale’. Which variable has larger standard deviation? Does the variable with larger standard deviation vary more (i.e. has larger variation) than the other variable? Explain why or why not (you may include your answer here as comments in R code chunk);

cat('\nRaw ratings descriptive statistics:\n')##Name des stats table
## 
## Raw ratings descriptive statistics:
ratings.des <- c(n=length(pr3_q1$Rating), mean=mean(pr3_q1$Rating),
           sd=sd(pr3_q1$Rating), min=min(pr3_q1$Rating), max=max(pr3_q1$Rating), cv=sd(pr3_q1$Rating)/mean(pr3_q1$Rating))
round(ratings.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00  2.95  0.56  1.20  4.02  0.19
 cat('\nRating_rescale descriptive statistics:\n') ##Name des stats table
## 
## Rating_rescale descriptive statistics:
rescale.des <- c(n=length(rating_rescale), mean=mean(rating_rescale),
           sd=sd(rating_rescale), min=min(rating_rescale), max=max(rating_rescale), cv=sd(rating_rescale)/mean(rating_rescale))
round(rescale.des, 2)
##     n  mean    sd   min   max    cv 
## 40.00 20.00  5.00  4.23 29.60  0.25
##As the descriptive statistics above show, rating_rescale has the larger standard deviation (SD = 5 vs. SD = 0.56 for rating), and its range (4.23 to 29.60) is also wider than that of rating (1.20 to 4.02) in absolute terms.
##However, the larger SD does not mean that rating_rescale genuinely varies more: rating_rescale is simply rating re-expressed on a different scale via a linear transformation, so the two variables contain exactly the same information and the same relative spread. The SD is larger only because the measurement scale was stretched (by a factor of 5/0.56, roughly 9); a variable's SD depends on its units, so SDs of variables on different scales are not directly comparable measures of how much the data "really" vary.
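##Quick numeric illustration of why the SDs differ by exactly a factor of 5: for constants a and b,
##sd(a + b*X) = |b|*sd(X), and rating_rescale was built as 20 + 5*z_rating above.
c(sd_rescale = sd(rating_rescale), five_times_sd_z = 5*sd(z_rating)) ##both should equal 5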

##1e)   (5 points) Generate a side by side box-plot for “rating” by “composite” variable. Present the side by side box-plot (with appropriate title/caption) and briefly comment on what it tells;    

##side-by-side boxplots of rating by composite
boxplot(pr3_q1$Rating ~ pr3_q1$Composite,
        horizontal = FALSE,
        main = "Raw data of rating score by composite",
        ylab = 'Rating',
        xlab = 'Composite',
        frame.plot = FALSE,
        col=c('blue', 'green'))

head(pr3_q1$Composite,100) ##check which composite group values appear in the data (they are 1 and 5)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
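##Brief comment on the box-plot (based on the group summaries computed for part 1f below):
##ratings in composite group 1 are far more spread out (roughly 1.2 to 4.0, median about 2.6)
##than in composite group 5, which clusters tightly between about 3.1 and 3.4 (median about 3.27);
##group 5 also has the higher median rating.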
##1f)   (5 points) Generate mean, median, SD, and variance for “rating” by “composite”. Present the results in a table format (with appropriate title);   

cat('\nMean, median, min, max, and 1st/3rd quartiles by composite:\n')##Name table
## 
## Mean, median, min, max, and 1st/3rd quartiles by composite:
a<-tapply(pr3_q1$Rating, pr3_q1$Composite, summary) ##summary of rating within each composite group
cbind(Composite1=a[[1]], Composite5=a[[2]])
##         Composite1 Composite5
## Min.        1.2000     3.1300
## 1st Qu.     2.3225     3.2000
## Median      2.5950     3.2650
## Mean        2.6445     3.2615
## 3rd Qu.     2.9825     3.3100
## Max.        4.0200     3.3800
cat('\nMean, SD, and coefficient of variation (CV) for rating by composite:\n')
## 
## Mean, SD, and coefficient of variation (CV) for rating by composite:
Composite1<-subset(pr3_q1, Composite =="1")
Composite5<-subset(pr3_q1, Composite =="5")
cvcomp1<-sd(Composite1$Rating)/mean(Composite1$Rating)
cv5<-sd(Composite5$Rating)/mean(Composite5$Rating)

descriptivesComp1 <- c(mean=mean(Composite1$Rating), SD=sd(Composite1$Rating), CV=cvcomp1)
descriptivesComp5 <- c(mean=mean(Composite5$Rating), SD=sd(Composite5$Rating), CV=cv5)
rbind(descriptivesComp1,descriptivesComp5)
##                     mean          SD          CV
## descriptivesComp1 2.6445 0.655201656 0.247760127
## descriptivesComp5 3.2615 0.068922153 0.021132041
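##A minimal sketch (using only base R, as in the rest of this answer) of the full table that
##question 1f asks for (mean, median, SD, and variance of rating by composite):
stats.by.composite <- sapply(split(pr3_q1$Rating, pr3_q1$Composite), function(x)
  c(mean = mean(x), median = median(x), SD = sd(x), variance = var(x)))
round(t(stats.by.composite), 4) ##rows = composite groups (1 and 5); columns = requested statistics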
##1g)   (5 points). Present a QQ-plot for ‘rating’. Comment on whether the distribution of ‘rating’ approximates normal based upon the QQ-plot (you may include your answer here as comments in R code chunk).  

##normal QQ plot to compare distribution of raw data ratings
qqnorm(pr3_q1$Rating)
qqline(pr3_q1$Rating, col='yellow') 

#Based on the Q-Q plot, the distribution of 'rating' does not closely approximate a normal distribution: the points depart from the Q-Q line, especially in the tails, and there are a few outlying observations.
#This matches what is seen in the histogram: the highest density of ratings falls between 3 and 3.5, with a separate cluster of lower scores between 1 and 3 and a slight negative skew.
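##An optional numeric check of normality, beyond what the question requires: a Shapiro-Wilk test
##from base R (small p-values indicate departure from normality).
shapiro.test(pr3_q1$Rating)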

 

  1. [6 points] The dataset color_pr3.csv contains information from a sample of participants from Northern Europe. The variable ‘Hair’ records the hair color of each participant. Read in the data and generate a table showing the frequency and relative frequency of each hair color. In addition, generate a pie chart or bar chart to show the distribution of hair color.
##your R codes for question 2
color_pr3 <- read.csv('color_pr3.csv') ##read the imported data into a data frame named color_pr3
head(color_pr3, 3) ##show the first 3 rows of the data
##   Region Eyes Hair Count
## 1      1 blue fair    23
## 2      1 blue fair    23
## 3      1 blue fair    23
##Name table
 cat('\nFrequency of Hair Color:\n')
## 
## Frequency of Hair Color:
##Make function table with frequency of hair by color
(frequency<-table(color_pr3$Hair))
## 
##  black   dark   fair medium    red 
##     22    182    228    217    113
 ##Name table
 cat('\n\nRelative Frequency of Hair Color:\n')
## 
## 
## Relative Frequency of Hair Color:
##Make function table with relative frequency of hair color in sample
(relative.frequency<-frequency/nrow(color_pr3))
## 
##       black        dark        fair      medium         red 
## 0.028871391 0.238845144 0.299212598 0.284776903 0.148293963
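##Equivalent one-liner for the relative frequencies using base R's prop.table(); this should match
##frequency/nrow(color_pr3) above, assuming each row of color_pr3 is one participant and Hair has
##no missing values (as the analysis above already assumes).
prop.table(frequency)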
##Name table
 cat('\n\nDistribution of Hair Color:\n')
## 
## 
## Distribution of Hair Color:
##combine frequency and relative frequency into one object
(haircolor.frequency<-cbind(frequency, relative.frequency))
##        frequency relative.frequency
## black         22        0.028871391
## dark         182        0.238845144
## fair         228        0.299212598
## medium       217        0.284776903
## red          113        0.148293963
print('Black hair represents about 2.89% of the hair color of the sample, while dark represents 23.88%, fair 29.92%, medium 28.48%, and red 14.83%. There are 22 people with black hair, 182 with dark, 228 with fair, 217 with medium, and 113 with red.') 
## [1] "Black hair represents about 2.89% of the hair color of the sample, while dark represents 23.88%, fair 29.92%, medium 28.48%, and red 14.83%. There are 22 people with black hair, 182 with dark, 228 with fair, 217 with medium, and 113 with red."
##add clearer column labels (note: this runs after the table was printed above, so the printed table shows the default names)
colnames(haircolor.frequency)<-c('Frequency', 'RelativeFrequency')

##Barplot
barplot(frequency, 
        main='Distribution of Hair Color', 
        xlab="Hair color", 
        ylab= "Number of people",
        names.arg=c('Black', 'Dark','Fair', 'Medium', 'Red'), 
        col=c('black','gray','yellow','brown','red'),
        ylim=c(0, 300))

##pie chart
pie(frequency, 
    main='Piechart of hair color distribution')
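##Optional variant (a presentation choice, not required by the question): label each slice with its
##hair color and percentage so the pie chart is easier to read.
pie(frequency,
    labels=paste0(names(frequency), ' (', round(100*relative.frequency, 1), '%)'),
    main='Piechart of hair color distribution with percentage labels')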

 

  1. [total 12 points] In a standard reading test, there are a total of 60 multiple-choice questions. Each question has four possible choices with only one correct answer. Each correct answer earns one point. Let Y be the number of correct answers on the test. For the following questions, assume that the questions are independent (i.e. the answer to one question has nothing to do with the answers to the other questions) and that all questions have the same difficulty level.
  1. [1 point] What are the possible values of Y?

  2. [2 points] If purely by guessing, what is the probability that one would guess the answers to exactly 25 questions correctly?

  3. [2 points] If purely by guessing, what is the probability that one would guess the answers to at least 25 questions correctly?

  4. [3 points] Jane wants to apply for a private school. From historical data, Jane thinks that (1) on average, the test-takers got 40 questions correct; and (2) the private school only accepted students whose score in this standard reading test is in the top 10% of the distribution. Given Jane’s understanding, what is the minimum number of questions Jane would aim to answer correctly in order to be in the top 10%?

  5. [4 points] This standard test was designed with the idea that, on average, the test-taker will get 70% of the questions correct. Given this design (i.e. assuming that, on average, a test-taker gets 70% of the questions correct), what will be the average test score for 200 test-takers? Simulate and show the distribution of the test scores from 200 test-takers.

##your R codes for question 3
##you may include your answers to question 3 in this R code chunk as comments

#a. Y can take any integer value from 0 to 60 (Y = 0, 1, ..., 60): a test-taker could get anywhere from none to all 60 questions correct.

#b. If answering purely by guessing (probability 1/4 per question), the probability of getting exactly 25 questions correct is about 0.195% (0.00195).
n<-60  ##60 questions
q.correct<-1/4  ##probability of guessing each answer correctly
prob.25 <- dbinom(25, n, q.correct) ##arguments: (number of correct answers, number of questions, probability of a correct guess)
prob.25
## [1] 0.0019540738
#c. If answering purely by guessing, the probability of getting at least 25 questions correct is P(Y >= 25) = 1 - P(Y <= 24), which is about 0.34%
n<-60  ##60 questions
q.correct<-1/4  ##probability of guessing each answer correctly 
1-pbinom(24, n, q.correct) ##cumulative probability of guessing at least 25 correct (24 is used so that Y = 25 itself is included)
## [1] 0.003423422
#d. To be in the top 10% of test-takers, Jane would need to answer at least 45 questions correctly.
qbinom(.90, size=60, prob=40/60) ##90th percentile of Binomial(60, 40/60); prob = 40/60 reflects the historical average of 40 correct out of 60
## [1] 45
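##Sanity check of the cutoff (assuming the intended model is Y ~ Binomial(60, 40/60), i.e. a
##historical average of 40 correct out of 60): 45 should be the smallest score whose cumulative
##probability reaches 0.90.
pbinom(43:45, size=60, prob=40/60) ##P(Y <= 43), P(Y <= 44), P(Y <= 45)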
#E. On average, a test-taker is expected to get 70% of 60 = 42 questions correct, so the average score across 200 test-takers should be close to 42. The histogram and QQ plot below show the distribution of the 200 simulated scores, which is centered near 42 with a slight negative skew.

#70% of 60 questions = 42 
rando.data <- rbinom(n=200, size=60, prob=0.70) #simulate 200 test-takers' scores, each Binomial(60, 0.7)

#Histogram of the simulated scores
hist(rando.data, prob=TRUE,
     main='Q3e: Distribution of simulated reading test scores (200 test-takers)',
     xlab='Score', ylab='Density',
     ylim=c(0,.15))
lines(density(rando.data), col='red') ##kernel density curve

##normal QQ plot to see whether the simulated scores are approximately normally distributed
qqnorm(rando.data)
qqline(rando.data, col='yellow') 
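##For reproducibility, set.seed() (with any arbitrary seed) could be called before the rbinom()
##simulation above. The simulated average for the 200 test-takers can also be reported directly;
##it should be close to the design value of 60*0.7 = 42.
mean(rando.data)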

 

  1. Now click the ‘Knit’ button and select ‘Knit to HTML’; an html file will be generated in your working folder. You should submit both the .Rmd file and the .html file for this practice.