SOC574 Problem Set #4

Problem 1

As part of a study concerning the effect of physical activity on mental health, a psychologist rated 231 subjects on a scale of 1 to 5 (1 being highly inactive and 5 being highly active). The sample mean was 3.94 and the sample standard deviation was 0.75.

a) Construct a 90% confidence interval for the mean

# Inputting the related statistics
n <- 231
xbar <- 3.94
s <- 0.75

# Calculating the margin of error (5% in each tail for a 90% CI)
margin <- qt(0.950,df=n-1)*s/sqrt(n)

# Calculating the lower and upper boundaries of the 90% CI
lowerinterval90 <- xbar - margin
lowerinterval90

## [1] 3.858504

upperinterval90 <- xbar + margin
upperinterval90

## [1] 4.021496

So the 90% confidence interval for the mean is: (3.858504, 4.021496)

b) Construct a 95% confidence interval for the mean

We can use the same logic to calculate the 95% confidence interval:

# Calculating the margin of error (2.5% in each tail for a 95% CI)
margin <- qt(0.975,df=n-1)*s/sqrt(n)

# Calculating the lower and upper boundaries of the 95% CI
lowerinterval95 <- xbar - margin
lowerinterval95

## [1] 3.842771

upperinterval95 <- xbar + margin
upperinterval95

## [1] 4.037229

Therefore, the 95% confidence interval for the mean is: (3.842771, 4.037229)
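Both parts repeat the same three steps (critical value, margin of error, bounds), so they can be wrapped in one small helper. The sketch below is only an illustration: the function name t_ci and its argument names are hypothetical and are not part of the assignment.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_ci <- function(xbar, s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # two-sided critical value
  margin <- t_crit * s / sqrt(n)           # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Same inputs as Problem 1
ci90 <- t_ci(3.94, 0.75, 231, conf_level = 0.90)
ci95 <- t_ci(3.94, 0.75, 231, conf_level = 0.95)
ci90
ci95
```

Changing conf_level is the only thing that differs between the two parts, which makes it harder to mix up the 0.950 and 0.975 quantiles.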

Problem 2

For this question, you will need to analyze the NLSY data which we have already used on several occasions. Please show all relevant Stata output to support your answers. There has been some debate over limiting the number of hours that high school students are allowed to work while still in school. One basic argument which proponents of a ‘limit’ often cite has to do with the potentially harmful effects of intensive work on school-based outcomes, such as grades.

Let’s first load the dataset:

# Load the haven package to read the .dta dataset
library(haven)

# Read in the data
nlsy <- read_dta('nlsy.dta')
head(nlsy)

## # A tibble: 6 × 22
##      id  male white mless income teenmom  memp confid nabsnt grades8  piat
##   <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>  <dbl>  <dbl>   <dbl> <dbl>
## 1     3     0     0     0 63000        0     1      1   1       6      7  
## 2     7     1     0     0 46603.       0     1      1  17       4     63  
## 3    10     1     1     0 46603.       0     1      1   3       8     50  
## 4    15     0     0     0 46603.       0     0      0   1       7     52.2
## 5    23     0     0     1 46603.       0     1      1   3.74    4     73  
## 6    24     1     0     1 46603.       1     0      1  35       5.69  25  
## # … with 11 more variables: schpos <dbl>, rept <dbl>, thomewk <dbl>,
## #   antisoc <dbl>, fgang <dbl>, susp <dbl>, int1wksc <dbl>, vcrime14 <dbl>,
## #   vcrime17 <dbl>, femp <dbl>, remedial <dbl>

a) Using the NLSY data, create a 95% confidence interval for the average grade score, grades8, for those individuals who do engage in intensive work during high school. Interpret this interval. (Recall that the variable int1wksc = 1 if youth was employed at least one week during the school year at age 16 and worked more than 20 hours per week on average.)

We need to pull out the entries with int1wksc = 1. To do this, we load the tidyverse collection of data manipulation packages, keep only the rows with int1wksc = 1, and store them in a new data frame, mydata.

# Load relevant packages
# install.packages('tidyverse', repos = "http://cran.us.r-project.org")
library(tidyverse)

# Subset data
mydata <- nlsy %>% filter(int1wksc == 1 )

Now we can calculate the 95% confidence interval for the average grade score. We first do it the “hard way”: pull out the sample statistics, then do the math by hand:

# Pulling out sample statistics
n <- nrow(mydata)
xbar <- mean(mydata$grades8)
s <- sd(mydata$grades8)

# Calculate the margin of error
margin <- qt(0.975,df=n-1)*s/sqrt(n)

# Calculate 95% CI
lowerinterval <- xbar - margin
lowerinterval

## [1] 5.264914

upperinterval <- xbar + margin
upperinterval

## [1] 5.687991

We can also let R do all the work for us, no package needed:

t.test(mydata$grades8)

## 
##  One Sample t-test
## 
## data:  mydata$grades8
## t = 50.991, df = 247, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  5.264914 5.687991
## sample estimates:
## mean of x 
##  5.476452

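Both approaches give the same 95% confidence interval: (5.264914, 5.687991). This means that, for students who engage in intensive work during high school, we are 95% confident that the population mean grades8 score lies between 5.264914 and 5.687991.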
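The agreement between the by-hand calculation and t.test() can also be checked programmatically instead of comparing printed digits. The sketch below uses a small made-up vector x (not the NLSY data) and pulls the interval out of the t.test() result through its conf.int component.

```r
# Toy data, made up for illustration (not the NLSY values)
x <- c(4, 6, 5, 7, 3, 6, 8, 5)

# Manual t-based 95% CI
n <- length(x)
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- c(mean(x) - margin, mean(x) + margin)

# The same interval, extracted from the t.test() result object
tt_ci <- as.numeric(t.test(x)$conf.int)

# TRUE when the two intervals match up to numerical tolerance
all.equal(manual_ci, tt_ci)
```

The same pattern would apply to mydata$grades8, assuming that column has no missing values.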

b) Create a 95% CI for those individuals who do not engage in intensive work during high school. Interpret this interval.

We start again by subsetting rows according to the given criterion. This time we want all the rows where int1wksc = 0:

# Subset data
mydata1 <- nlsy %>% filter(int1wksc == 0 )

Since I showed how to calculate a 95% confidence interval by hand in the previous question, I will let R's t.test() function handle the job this time:

t.test(mydata1$grades8)

## 
##  One Sample t-test
## 
## data:  mydata1$grades8
## t = 100.8, df = 882, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  5.642234 5.866321
## sample estimates:
## mean of x 
##  5.754278

So the 95% confidence interval is (5.642234, 5.866321). This means that, for students who do not engage in intensive work during high school, we are 95% confident that the population mean grades8 score lies between 5.642234 and 5.866321.

c) Based on a comparison of these two intervals, what can you infer regarding the grades of those who do and do not work? What does this imply for proponents of a ‘limit’ law on hours of work at least regarding effects on schooling?

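Recall that the 95% confidence interval for individuals who do work is (5.264914, 5.687991), and for those who do not work it is (5.642234, 5.866321). In this sample, students who do not work have a higher average grade than students who do work. However, the two confidence intervals overlap (the upper bound for working students, 5.687991, is above the lower bound for non-working students, 5.642234), so the interval comparison alone does not cleanly separate the two groups. Note that overlap between two 95% intervals does not by itself establish that the difference is statistically insignificant; a formal two-sample test of the difference in means is needed to settle that.

Therefore, the comparison of intervals by itself does not provide conclusive evidence for limiting hours of work for high school students.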
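Comparing two separate intervals is only an informal check on the difference between groups. A direct test of the difference in means can be sketched from the summary statistics printed in the t.test() output above. This is a rough illustration, not part of the original answer: the standard errors are backed out as mean / t (valid because each one-sample test is against 0), so they carry the rounding of the printed values, and all variable names are hypothetical.

```r
# Summary statistics read off the two t.test() outputs above
mean_work   <- 5.476452; t_work   <- 50.991; n_work   <- 248  # df = 247
mean_nowork <- 5.754278; t_nowork <- 100.8;  n_nowork <- 883  # df = 882

# Back out each standard error: one-sample t against 0 is mean / se
se_work   <- mean_work / t_work
se_nowork <- mean_nowork / t_nowork

# Welch (unequal-variance) two-sample test on the difference in means
diff     <- mean_nowork - mean_work
se_diff  <- sqrt(se_work^2 + se_nowork^2)
t_stat   <- diff / se_diff
df_welch <- se_diff^4 /
  (se_work^4 / (n_work - 1) + se_nowork^4 / (n_nowork - 1))
p_value  <- 2 * pt(-abs(t_stat), df = df_welch)

# 95% CI for the difference in means
ci_diff <- diff + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

c(t = t_stat, df = df_welch, p = p_value)
ci_diff
```

With these back-calculated inputs the t statistic comes out near 2.28 and the interval for the difference lies above zero, which suggests the gap between the groups may be statistically significant at the 5% level even though the separate intervals overlap. With the raw data available, t.test(mydata1$grades8, mydata$grades8) would be the more direct route.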

d) Now repeat parts a and b this time by creating a 98% confidence interval. How do your results change?

The 98% confidence intervals for both groups can be calculated with:

# Students who do work
t.test(mydata$grades8, conf.level = .98)

## 
##  One Sample t-test
## 
## data:  mydata$grades8
## t = 50.991, df = 247, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 98 percent confidence interval:
##  5.224968 5.727936
## sample estimates:
## mean of x 
##  5.476452

# Students who do not work
t.test(mydata1$grades8, conf.level = .98)

## 
##  One Sample t-test
## 
## data:  mydata1$grades8
## t = 100.8, df = 882, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 98 percent confidence interval:
##  5.621230 5.887325
## sample estimates:
## mean of x 
##  5.754278

So, for students who work, the 98% CI is: (5.224968, 5.727936); for students who do not work, the 98% CI is: (5.621230, 5.887325).

Both confidence intervals get wider, and they still overlap. The widening is expected: a higher confidence level requires a larger critical value, so we give up some precision in exchange for more confidence.
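The widening comes entirely from the critical value, since the sample mean and standard error do not change. A quick sketch for the working-student sample (df = 247, taken from the output above; the variable names are just illustrative):

```r
# Critical t values for the working-student sample (df = 247)
t_95 <- qt(0.975, df = 247)  # 95% confidence: 2.5% in each tail
t_98 <- qt(0.990, df = 247)  # 98% confidence: 1% in each tail

# Ratio of interval widths at the two confidence levels
width_ratio <- t_98 / t_95
c(t_95 = t_95, t_98 = t_98, width_ratio = width_ratio)
```

The ratio should match the ratio of the printed interval widths, (5.727936 - 5.224968) / (5.687991 - 5.264914), up to rounding.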

Problem 3

A University task force on health and safety wishes to crack down on campus-wide drinking. In order to market their message, the staff wishes to find out the actual percentage of college students who engage in binge drinking at least once a week and be accurate within 2 percent either way. (Binge drinking is defined by consumption of 5 or more alcoholic beverages in a night.) The task force wants to be 95% certain of their estimate, and they appoint you to find this out. How many students should we then sample? State any assumptions you need to make.

For a 95% confidence level, the critical value is z = 1.96, and the desired margin of error is E = 2%, or 0.02. The margin of error for a proportion is:

\[ E = z \cdot \sqrt{\frac{p \cdot (1 - p)}{n}} \]

Now that we know E and z, we need a value for p, the proportion of binge drinkers, to determine the sample size.

Suppose the proportion is 0.5. Solving the formula for n gives n = 0.25 * 1.96^2 / 0.02^2 = 2401, which means that if we assume 50% of students binge drink, we need to sample at least 2401 students.

Taking a closer look, we know:

\[ n = \frac{p \cdot (1 - p) \cdot z^2}{E^2} \]

The product p(1 - p) reaches its maximum at p = 0.5, so as p moves away from 0.5 the numerator becomes smaller while z and E stay unchanged. Our estimate of n = 2401 is therefore conservative: no larger sample is needed at this level of confidence and margin of error, whatever the true proportion turns out to be. We also assume that students are drawn by simple random sampling and that the sample is large enough for the normal approximation to the sample proportion to hold.
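To see how the required sample size depends on the assumed proportion, the formula can be evaluated over a few values of p. This is a minimal sketch: the function name sample_size_prop and the grid of p values are assumptions made for illustration.

```r
# Hypothetical helper: smallest n whose margin of error for a proportion is at most E
sample_size_prop <- function(p, E = 0.02, z = 1.96) {
  n_raw <- p * (1 - p) * z^2 / E^2
  # Round away floating-point noise before taking the ceiling
  ceiling(round(n_raw, 6))
}

# Required n for several assumed proportions
p_grid   <- c(0.1, 0.3, 0.5, 0.7, 0.9)
n_needed <- sapply(p_grid, sample_size_prop)
data.frame(p = p_grid, n = n_needed)
```

The rounding step is a design choice: the raw result at p = 0.5 is mathematically an integer, and floating-point noise around it could otherwise push ceiling() up by one.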

Problem 4

An index that measures strategic behaviors in organizational settings, Measure of Ingratiatory Behaviors in Organizational Settings (MIBOS), which is measured on a 5-point scale (with higher scores indicating more strategic thinkers), was given to employees at a certain manufacturing firm. The test was given to 288 managers, who had a mean score of 2.41 and a standard deviation of .74. The test was also given to 110 clerical personnel who had an average score of 1.90 with a standard deviation of .59. Construct a 95% confidence interval for the difference in the mean MIBOS score of managers and clerical staff.

The sample statistics are:

# Sample statistics for managers
n_man <- 288
xbar_man <- 2.41
sd_man <- .74

# Sample statistics for clerical personnel
n_cle <- 110
xbar_cle <- 1.90
sd_cle <- .59

The confidence interval for the difference between two means is given by:

\[ (\bar{x_1} - \bar{x_2}) \pm z \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Since z = 1.96 for a 95% confidence interval, a little arithmetic gives the interval:

mean_diff <- xbar_man - xbar_cle
se <- sqrt(sd_man^2/n_man + sd_cle^2/n_cle)
z <- 1.96

lower_end <- mean_diff - z * se
lower_end

## [1] 0.3704963

upper_end <- mean_diff + z * se
upper_end

## [1] 0.6495037

Therefore, we know that a 95% confidence interval for the difference in mean score of managers and clerical staff is: (0.3704963, 0.6495037).

Let’s do something fun to verify our results. We now use the sample statistics we have to generate a set of simulated pseudo-random MIBOS scores for clerical personnel and managers (assuming that the population follows a normal distribution).

To do this, we need the MASS package:

# install.packages('MASS')
library(MASS)

Now, let’s generate pseudo-random data for both managers and clerical personnel:

# Generate Simulated data given sample statistics
set.seed(1568) # Set the seed for the pseudo-random data so the result is reproducible
manager <- mvrnorm(n=288, mu=2.41, Sigma=.74^2)
clerical <- mvrnorm(n=110, mu=1.90, Sigma=.59^2)

Finally, we run a two-sample t-test over our simulated data:

t.test(manager,clerical, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  manager and clerical
## t = 6.9977, df = 396, p-value = 1.117e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3671344 0.6540235
## sample estimates:
## mean of x mean of y 
##  2.410857  1.900278

The 95% CI from our randomly generated data is (0.3671344, 0.6540235), which is close to the interval from the formula. Two caveats apply: the simulated result will change with a different seed (because the data are pseudo-random), and t.test() with var.equal = TRUE uses a pooled standard error and the t distribution, whereas the formula above uses the unpooled standard error with z = 1.96.
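The pooled and unpooled versions of the interval can also be computed directly from the reported summary statistics, without any simulation. The sketch below is an illustration under the same inputs as above; the short variable names are introduced here only for the example.

```r
# Summary statistics from the problem statement
n1 <- 288; m1 <- 2.41; s1 <- 0.74   # managers
n2 <- 110; m2 <- 1.90; s2 <- 0.59   # clerical personnel
d  <- m1 - m2

# Unpooled standard error, as in the z-based formula above
se_unpooled <- sqrt(s1^2 / n1 + s2^2 / n2)

# Pooled standard error, as used by t.test(..., var.equal = TRUE)
sp2       <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))

# 95% intervals under each approach
ci_unpooled <- d + c(-1, 1) * 1.96 * se_unpooled
ci_pooled   <- d + c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * se_pooled

ci_unpooled
ci_pooled
```

The unpooled interval reproduces the hand calculation. The pooled interval is somewhat wider here because the smaller group has the smaller standard deviation.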

Problem 5

Refer to the prior question about MIBOS. State the null and alternative hypothesis you would use to test for a difference in mean scores for managers and clerical personnel.

Null Hypothesis: There is no difference between mean MIBOS scores for managers and clerical personnel.

Alternative Hypothesis: There is a difference between mean MIBOS scores for managers and clerical personnel.
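In symbols, writing \mu_1 for the population mean MIBOS score of managers and \mu_2 for that of clerical personnel (notation introduced here for convenience), the two-sided hypotheses are:

\[ H_0: \mu_1 - \mu_2 = 0 \qquad H_a: \mu_1 - \mu_2 \neq 0 \]

Using the same standard error as in the confidence interval from Problem 4, one possible test statistic is:

\[ z = \frac{\bar{x_1} - \bar{x_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

Because the 95% confidence interval for the difference, (0.3704963, 0.6495037), does not contain 0, this test would reject the null hypothesis at the 5% level.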