Introduction

This homework will give you more experience using R Markdown and running statistical tests in R. You will work through some sample data, perform exploratory data analysis (EDA) and tests, and then interpret the results to draw inferences.

How to submit

You should complete all of this R Markdown file (fill in the missing code, answer the questions, etc.) and then knit this file to an HTML or PDF document. You will upload the knitted file using the submission link on Canvas or the course website. Please update the “author” section of the header at the top of this document to include your name instead of mine.

Basic commands

Before we continue, let’s cover a few basic statistical commands in R. Below I will generate a random data set. It will be stored in the variable “sample”. Run the code block using the little green play button. You should see the values show up below the code block.

As you may already know, it’s often useful to calculate mean, median, and/or standard deviation when you’re examining a data set.

Q1:

Using the mean, median, and standard deviation commands, please calculate the mean, median, and standard deviation of the sample in R. You can look up how to use these commands on the “sample” variable using the ? command (e.g., ?mean). If you can’t find the commands using the ? function, a quick Google search will turn them up. You might need to look up the standard deviation command in R (hint: it’s 2 letters!). Run the code and then write the mean, median, and standard deviation below the code block (using text just like this).

a. (2 pts.)

mean(sample)
## [1] 460.15
median(sample)
## [1] 439
sd(sample)
## [1] 288.3848
length(sample)
## [1] 20

Mean: 460.15 Median: 439 Standard Deviation: 288.3848

b. (2 pts.)

What is the sample size? How many degrees of freedom does this sample have? (If you don’t want to count, you can use the length() command).

Sample Size: 20 Degrees of Freedom: 19
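As a quick check of where the n - 1 comes from, here is a minimal sketch. The vector `x` is made-up toy data (an assumption, not the homework’s “sample” variable); it only illustrates that sd() divides by the degrees of freedom, length(x) - 1.

```r
# Hypothetical toy data, not the homework's `sample` vector
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n  <- length(x)  # sample size
df <- n - 1      # degrees of freedom for a single sample

# Sample standard deviation computed by hand, dividing by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / df)

# sd() uses the same n - 1 denominator, so the two values agree
isTRUE(all.equal(manual_sd, sd(x)))
```

The same reasoning gives 20 - 1 = 19 degrees of freedom for the sample in this question.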

Q2:

a. (2 pts.)

Let’s import some data and look at a distribution. First, download the “distribution.csv” file and run this code to import the information.

#update code below with YOUR path to the csv file
dist.values <- read.csv("distribution.csv")

To see how the data are distributed, we’ll just use the hist() command to save a bit of time, but you might want to use ggplot2’s geom_histogram() option in the future to make your plots look nicer. Run the code below and take a look at the chart.

hist(dist.values$obs, breaks=20)
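For comparison, a ggplot2 version of the same kind of plot might look like the sketch below. The data frame `demo.values` is a simulated stand-in (an assumption, since distribution.csv is not reproduced here); only the column name `obs` is taken from the code above.

```r
library(ggplot2)

# Hypothetical stand-in for dist.values; the real data come from distribution.csv
set.seed(42)
demo.values <- data.frame(obs = sample(1:8, 200, replace = TRUE))

# bins = 20 plays roughly the same role as breaks = 20 in hist()
p <- ggplot(demo.values, aes(x = obs)) +
  geom_histogram(bins = 20)
p
```

To plot the homework data instead, replace `demo.values` with `dist.values`.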

Describe what you see. How are the data on the plot grouped? How does the plot look? You can describe this generally or in terms of skew (see https://en.wikipedia.org/wiki/Skewness#:~:text=Skewness%20is%20a%20descriptive%20statistic,deviation%20from%20the%20normal%20distribution).

The histogram represents the frequency of the values (1-8) within the data set. The most frequent values are centered around 1-4. The plot has its peak in the middle.

b. (2 pts.)

Let’s suppose these data represent the average number of times someone gets a spam call in a given month. Based on this information (and the plot if it’s helpful), what kind of distribution do you think best represents this variable? Hint: see the types of probability distributions at the beginning of the lecture notes.

Type of Probability Distribution: Normal Distribution

Q3:

Next, let’s learn how to use box plots! First, you’ll need to import the .csv file I’ve created for this assignment. Download the file, and then import it below using the read.csv command.

This csv provides information on average test scores on an exam for fifty right-handed and fifty left-handed students. Once you’ve imported the csv, you will see “scores.table” show up in the “environment” section in the top right of your RStudio window. Clicking this will open the csv so you can look at it and see the variable names. Alternatively, you can type “View(scores.table)” in the console below.

a. (2 pts.)

# update code below with YOUR path to the csv file
scores.table <- read.csv("test_scores.csv")

What are the two variable names shown explicitly in the table (hint: column names)? What’s the independent/predictor variable? What’s the outcome/dependent variable? (Hint: the outcome variable is not explicitly labeled in the table but you can explain it based on the description above, and the predictors are shown as variable names)

Two Variable Names: right_handed and left_handed Independent Variable: Dominant Hand (Right or Left) Dependent Variable: Average Test Scores

b. (4 pts.)

Let’s do the first bit of exploratory data analysis: generating histograms. We showed how to generate a histogram in the question above. Use similar code to generate a histogram of the scores for each group (right_handed and left_handed). Remember, you can select a column using table_name$column_name syntax. Set breaks to equal 20.

hist(scores.table$right_handed, breaks = 20)

hist(scores.table$left_handed, breaks = 20)

What are your impressions? Do the outcome variables look approximately normally distributed? (hint: it doesn’t have to be a perfect bell curve!)

Looking at both graphs, they are different as to where the value with the most frequency lies, hinting at the possibility that those who are left-handed have higher test scores. Additionally, both graphs do seem to be normally distributed.

c. (2 pts.)

Now let’s actually generate the box plot. I am providing the skeleton of the code you’ll need; you just need to fill in the variables! Make sure you run the code block with “melt” before completing your code portion. Once you’ve updated the scores table using melt, take another look at it. In the ggplot() call, change the placeholder table name to the correct name, and fill in the x= and y= values with the correct column names.

x = handed, y = score

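# melt() comes from reshape2 and ggplot() from ggplot2 (assumed installed); load them if you have not already
library(reshape2)
library(ggplot2)
# this first command restructures the data in a way that makes it possible to use ggplot. You can take a look at what happened here if you like!
melted.scores.table <- melt(scores.table, variable.name = "handed", value.name = "score")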
## No id variables; using all as measure variables
# Update code below with the correct table and x and y values
# the table name is a little different now! (see the code block above)
# if you're confused about aes(x,y) look at the ggplot documentation and at the way the new table is structured
ggplot(melted.scores.table, aes(x= handed, y= score)) + 
  geom_boxplot()
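To see what melt() does to the table, here is a minimal sketch on a made-up miniature of scores.table. The data frame `wide` and its scores are hypothetical; only the column names and the melt() call mirror the code above, and the sketch assumes melt() is the reshape2 version.

```r
library(reshape2)

# Hypothetical miniature of scores.table (made-up scores)
wide <- data.frame(right_handed = c(90, 85, 92),
                   left_handed  = c(91, 88, 95))

# Same call shape as in the homework: every column becomes a measure variable
long <- melt(wide, variable.name = "handed", value.name = "score")

names(long)  # the two new columns: handed and score
nrow(long)   # one row per original cell (3 rows x 2 columns)
long
```

Each row of the long table holds one score plus a label saying which group it came from, which is the layout aes(x = handed, y = score) expects.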

d. (3 pts.)

Now that you have the box plots generated, what are your impressions? Do the median values look substantially different? Do you think one of the groups (right or left handed) performed better on the exam?

The median value for the left-handed plot is higher than the one for right-handed. Additionally, while both have similar upper quartiles, right-handed’s lower quartile is significantly less than left-handed’s. Therefore, I would conclude that the left-handed group performed better than the right-handed one.

Q4:

We’ve done a bit of EDA now, so let’s perform a statistical test and interpret the results! Specifically, we’ll be using a t-test.

a. (2 pts.)

First, based on the description of the data above, which kind of t-test do you think we should use, and why? (one-sample, 2-sample, or paired?) Hint: think about how many unique populations of individuals are in the data set (is it one or two?)

The type of t-test that should be used is 2-sample as there are two distinct populations of left-handed and then right-handed in the data set.

b. (3 pts.)

What are the assumptions of this test? (You can assume they are true in subsequent steps)

  1. Continuous Data
  2. Normally Distributed
  3. Samples are independent
  4. Variances of two samples are the same
  5. Random Sampling from both populations
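Two of these assumptions (normality and equal variances) can be checked informally in R. The sketch below uses simulated stand-in vectors `right` and `left` (assumptions, not the homework’s data); with the real data you would pass scores.table$right_handed and scores.table$left_handed instead.

```r
# Hypothetical stand-in data; in the homework these would be
# scores.table$right_handed and scores.table$left_handed
set.seed(1)
right <- rnorm(50, mean = 90, sd = 6)
left  <- rnorm(50, mean = 91, sd = 4.5)

# Shapiro-Wilk test: a small p-value suggests a departure from normality
shapiro.right <- shapiro.test(right)
shapiro.left  <- shapiro.test(left)

# F test comparing the two variances: a small p-value suggests unequal variances
variance.check <- var.test(right, left)

shapiro.right$p.value
shapiro.left$p.value
variance.check$p.value
```

These checks are only a supplement to the histograms and box plots; they do not replace looking at the data.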

c. (5 pts.)

Let’s run the test and see some results–finally! Update the paired=[TRUE/FALSE] to only TRUE or FALSE depending on whether or not you think it’s a paired test. Pay attention to the syntax here as it’s useful if you need to run a t-test in the future!

set.seed(123)
# update the "true/false" to be only TRUE (if you think it's paired) or FALSE (if you don't think it's paired)
t.test(scores.table$right_handed, scores.table$left_handed, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  scores.table$right_handed and scores.table$left_handed
## t = -1.1153, df = 90.687, p-value = 0.2676
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.2398701  0.9098933
## sample estimates:
## mean of x mean of y 
##  90.29496  91.45995

Now that you’ve gotten the results, what is the p-value and what is the confidence interval? Does the p-value show a statistically significant difference in results between the two groups (right or left handed)? How about the confidence interval? For both, why or why not? Assume a significance level of \(\alpha\) = 0.05 (5%). Remember, you compare the p-value to the significance level, and look for zero in the confidence interval.

p-value: 0.2676 confidence interval: (-3.2398701, 0.9098933)

Neither the p-value nor the confidence interval shows a statistically significant difference: the p-value exceeds the significance level of 0.05 and the confidence interval contains 0, so we are unable to reject the null hypothesis.
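The numbers in the t.test() output can be reproduced by hand from summary statistics. The sketch below uses the two group means printed in the output above, the standard deviations of the two columns (5.917809 and 4.419197, from sd()), and n = 50 per group. The variable names are my own, and the formulas are the standard Welch ones rather than anything specific to this assignment.

```r
# Summary statistics reported in this homework's R output
mean.right <- 90.29496; sd.right <- 5.917809
mean.left  <- 91.45995; sd.left  <- 4.419197
n <- 50  # students per group

# Standard error of the difference in means (Welch form)
v.right <- sd.right^2 / n
v.left  <- sd.left^2 / n
se <- sqrt(v.right + v.left)

t.stat <- (mean.right - mean.left) / se

# Welch-Satterthwaite approximation for the degrees of freedom
welch.df <- (v.right + v.left)^2 /
  (v.right^2 / (n - 1) + v.left^2 / (n - 1))

# 95% confidence interval for the difference in means
ci <- (mean.right - mean.left) + c(-1, 1) * qt(0.975, welch.df) * se

# Two-sided p-value
p.value <- 2 * pt(-abs(t.stat), welch.df)

round(c(t = t.stat, df = welch.df, p = p.value), 4)
round(ci, 4)
```

The results should agree with the t.test() output above up to rounding. Because the interval is the difference in means plus or minus a margin, it contains 0 exactly when the two-sided p-value is above 0.05.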

d. (5 pts.)

I’ll show you a common way to report statistical findings when you’ve run a t-test. First, you might need some summary statistics: the mean, the standard deviation, and the degrees of freedom for the samples. Recall: n (sample size for both) is 50. You can run the same commands to get these as you did in question 1.

# optionally, run summary commands here
mean(scores.table$right_handed)
## [1] 90.29496
mean(scores.table$left_handed)
## [1] 91.45995
sd(scores.table$right_handed)
## [1] 5.917809
sd(scores.table$left_handed)
## [1] 4.419197

Degrees of Freedom: 49 for each sample (50 - 1); 98 for the two samples combined (50 + 50 - 2)

Use the format: t(degrees of freedom) = <t statistic>, <p < .05 or p > .05>. Fill in the missing details and write the sentence below.

t(98) = -1.1153, p > .05.
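Note that t(98) uses the pooled degrees of freedom (n1 + n2 - 2), while the t.test() output in part c is a Welch test with df = 90.687. A minimal sketch of how the two relate, on simulated stand-in data (the vectors `right` and `left` are assumptions, not the homework’s scores):

```r
# Hypothetical stand-in data with 50 observations per group
set.seed(2)
right <- rnorm(50, mean = 90, sd = 6)
left  <- rnorm(50, mean = 91, sd = 4.5)

# Default two-sample test in R: Welch, which does not assume equal variances
welch <- t.test(right, left, paired = FALSE)

# Classic pooled-variance test: assumes equal variances
pooled <- t.test(right, left, paired = FALSE, var.equal = TRUE)

# With equal group sizes the two t statistics coincide...
c(welch = unname(welch$statistic), pooled = unname(pooled$statistic))

# ...but the degrees of freedom differ: pooled uses n1 + n2 - 2 = 98
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```

If you report the Welch result instead, the degrees of freedom in the sentence would be the value printed by t.test() rather than 98.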

Then rewrite the following with the correct answer:

New Sentence: There was not a statistically significant difference in test performance between left- and right-handed students.

If and only if your findings were statistically significant, also fill in the following sentence. If not, don’t include the sentence. Either way, make sure you review the sentence so you get the general idea of how to report t-test results.

The 50 participants who were right or left-handed (M = mean, SD = std. dev.) compared to the 50 participants in the right or left-handed group (M = mean, SD = std. dev.) demonstrated statistically significantly higher test scores, t(deg freedom) = t statistic, p < or p > value.

Congrats! You’ve worked through the process of exploring data and running a statistical test!