Psych Lab #1

Tam, Yas,

September 30, 2017

1. Plot two frequency histograms, one for each set of scores. The scores on this test can range from 0 to 20. You don’t need to divide the scores into categories, just plot the frequency of each possible score.

Answer: This problem set uses two samples. The large sample (“largeSample”) contains the scores of 100 people, and the small sample (“Small”) contains the scores of 20 people. Both vectors record the depression level of the people who were tested.

### Sketching the histograms of Depression

### storing the datasets
largeSample <- c(0,0,5,7,8,6,7,5,0,0,6,11,6,13,6,1,0,1,14,9,9,14,7,2,1,3,7,15,7,1,1,3,16,9,9,15,7,3,2,4,7,9,10,14,7,4,2,2,5,14,10,17,7,5,2,2,8,17,10,10,18,8,5,3,3,5,8,11,11,18,6,3,8,17,11,17,8,6,4,4,6,14,12,12,8,6,4,4,6,8,13,12,8,7,5,5,7,20,13,7)

Small <- c(0,2,2,2,3,4,4,5,5,5,6,7,7,7,8,8,9,12,13,17)
# show the two histograms side by side: 1 row, 2 columns (1)
par(mfrow=c(1,2))
hist(largeSample, xlab = "Depression level", main = "Histogram of large sample")
hist(Small, xlab = "Depression level", main = "Histogram of small sample")
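
By default, hist() groups the scores into bins, while the question asks for the frequency of each possible score from 0 to 20. One possible way to get that is to count the scores with table() and draw the counts with barplot(). The sketch below is only an illustration for the small sample; the name smallCounts is an assumption and does not appear in the original analysis.

```r
# Assumption: Small is redefined here so the sketch runs on its own.
Small <- c(0,2,2,2,3,4,4,5,5,5,6,7,7,7,8,8,9,12,13,17)

# factor(..., levels = 0:20) keeps scores that never occur, with a count of 0.
smallCounts <- table(factor(Small, levels = 0:20))

# One bar per possible score, from 0 to 20.
barplot(smallCounts, xlab = "Depression level", ylab = "Frequency",
        main = "Small sample: frequency of each score")
```

The same two lines can be repeated with largeSample to get the matching plot for the large sample.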

2. Calculate the mean and median of each set of scores:

  1. Statistical Models:

The means and medians of the small and large datasets are:

### mean of these datasets
mean(Small)
## [1] 6.3
mean(largeSample)
## [1] 7.53
### median of these datasets
median(Small)
## [1] 5.5
median(largeSample)
## [1] 7

3. Calculate the range (you can do this by sight), and the standard deviation of each set of scores. Answer:

### range of the small sample
max(Small) - min(Small)
## [1] 17
### range of the large sample
max(largeSample) - min(largeSample)
## [1] 20
### standard deviation of the large sample
sd(largeSample)
## [1] 4.83559
### standard deviation of the small sample
sd(Small)
## [1] 4.156162

There is a faster and handier way to compare these two vectors: the function “describe” from the psych package reports all of these statistics at once. (2)

library(psych)
describe(largeSample)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 100 7.53 4.84      7    7.22 4.45   0  20    20 0.52    -0.44 0.48
describe(Small)
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 20  6.3 4.16    5.5    5.88 3.71   0  17    17 0.83     0.19 0.93

These results show that the range of the small sample (17) is smaller than that of the large sample (20). The standard deviation of the large sample (4.84) is also slightly higher than that of the small sample (4.16). Both measures suggest that the large sample has more variability than the small sample.
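
To make clear what sd() measures, the standard deviation can also be computed step by step. The sketch below uses the usual sample formula with n - 1 in the denominator, which is what sd() uses; the names n, deviations and sdByHand are illustrative assumptions.

```r
# Assumption: Small is redefined here so the sketch runs on its own.
Small <- c(0,2,2,2,3,4,4,5,5,5,6,7,7,7,8,8,9,12,13,17)

n <- length(Small)                  # number of scores
deviations <- Small - mean(Small)   # distance of each score from the mean

# Sample standard deviation: square root of the sum of squared
# deviations divided by n - 1.
sdByHand <- sqrt(sum(deviations^2) / (n - 1))

sdByHand    # should agree with the value printed by sd(Small)
sd(Small)
```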

4. What happens to the standard deviations if you add another score of 20 to each set of scores? Is the impact on the standard deviation of the same relative size for each set? Is this what you would expect?

To answer the fourth question, we add a score of 20 to each dataset, store the results under new names, and calculate the standard deviation of each new dataset.

# append a score of 20 to each of the original vectors
smallNew <- c(Small, 20)

largeSampleNew <- c(largeSample, 20)

We calculate the new standard deviations for these new datasets:

sd(smallNew)
## [1] 5.034642
sd(largeSampleNew)
## [1] 4.968774
# or use the describe function
describe(smallNew)
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 21 6.95 5.03      6    6.29 2.97   0  20    20 1.03     0.39 1.1
describe(largeSampleNew)
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 101 7.65 4.97      7    7.32 4.45   0  20    20 0.56     -0.4 0.49

When we add a score of 20 to each dataset, the variability of the small dataset is affected much more than that of the large dataset. The SD of the small dataset rises from 4.16 to 5.03 (about 21%), while the SD of the large dataset only shifts from 4.84 to 4.97 (about 3%). The impact is therefore not of the same relative size.
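
The relative size of each change can be checked with a short calculation. The sketch below uses the standard deviations printed above; the helper relChange is a hypothetical name introduced only for this illustration.

```r
# Hypothetical helper: change from old to new as a fraction of old.
relChange <- function(old, new) (new - old) / old

# Standard deviations before and after adding 20, copied from the output above.
relSmall <- relChange(4.156162, 5.034642)
relLarge <- relChange(4.83559, 4.968774)

round(100 * relSmall, 1)   # percent change in the small sample
round(100 * relLarge, 1)   # percent change in the large sample
```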

As expected, the new standard deviations are larger than those of the original datasets. This is because 20 lies far above the means of the small and the large datasets (6.3 and 7.53), so it behaves like an outlier and adds a large squared deviation. In the small sample, adding 20 also widens the range from 17 to 20; the large sample already contains a score of 20, so its range does not change. The effect is stronger in the small sample because a single extreme score carries more weight among 21 scores than among 101.
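
One rough way to see how extreme a score of 20 is in each sample is to express its distance from the mean in standard deviations (a z-score). The sketch below uses the means and standard deviations reported earlier; zSmall and zLarge are illustrative names, and no particular cutoff for calling a score an outlier is assumed.

```r
# Distance of the score 20 from each original mean, in SD units.
# Means and standard deviations are copied from the output above.
zSmall <- (20 - 6.3)  / 4.156162
zLarge <- (20 - 7.53) / 4.83559

round(zSmall, 2)   # small sample
round(zLarge, 2)   # large sample
```

A larger value means that 20 sits further out in the tail of that sample.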

5. Even though the same test was used to obtain each set of scores, and even though both samples were representative of the general population, the descriptive statistics are somewhat different for each sample. Why do you think this happened? Do the differences in the numbers tell you that there is something different going on with the two groups of people, or not?

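Answer: Both samples are right-skewed, but the small sample is more strongly skewed (0.83 versus 0.52) and has less variability (SD 4.16 versus 4.84) than the large sample. These differences are most likely due to the large difference in sample size: 100 people versus 20 people. Statistics from a small sample vary more from one sample to the next, so they can differ from those of a large sample even when both are drawn from the same population. The differences would shrink if both samples were large enough, or if their sizes were closer to each other. The numbers alone therefore do not show that something different is going on in the two groups.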
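
The effect of sample size can be illustrated with a small simulation. The sketch below treats the large sample as a stand-in for the population, which is an assumption made only for this illustration, and repeatedly draws 20 scores from it without replacement. The seed, the 1000 repetitions and the name subMeans are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, used only so the draws can be repeated

largeSample <- c(0,0,5,7,8,6,7,5,0,0,6,11,6,13,6,1,0,1,14,9,9,14,7,2,1,3,7,15,7,1,1,3,16,9,9,15,7,3,2,4,7,9,10,14,7,4,2,2,5,14,10,17,7,5,2,2,8,17,10,10,18,8,5,3,3,5,8,11,11,18,6,3,8,17,11,17,8,6,4,4,6,14,12,12,8,6,4,4,6,8,13,12,8,7,5,5,7,20,13,7)

# Draw 1000 subsamples of 20 scores each and record the mean of every one.
subMeans <- replicate(1000, mean(sample(largeSample, 20)))

range(subMeans)   # spread of the means across subsamples
sd(subMeans)      # how much the mean varies from one subsample to the next
```

If the small sample's mean of 6.3 falls inside the range of these simulated means, a mean that low is compatible with chance alone.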

References:

(1): Showing two histograms together: http://www.shizukalab.com/toolkits/overlapping-histograms

(2): Source for the psych package and the describe function: https://www.youtube.com/watch?v=vsaCS6l9XaY