R Homework 1

Larisa Omanović

Research question: Is there a significant difference in the average real earnings among Black individuals who underwent job training and those who did not in the year 1978?

Null Hypothesis (H0): There is no difference in the average real earnings among Black individuals who underwent job training and those who did not in the year 1978.

Alternative Hypothesis (H1): There is a difference in the average real earnings among Black individuals who underwent job training and those who did not in the year 1978.

#install.packages("twang")
library(twang)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "ndiMatrix" of class "replValueSp"; definition not updated
## To reproduce results from prior versions of the twang package, please see the version="legacy" option described in the documentation.
data("lalonde")
mydata <- lalonde
head(mydata, 5) #Table with first 5 rows.
##   treat age educ black hispan married nodegree re74 re75       re78
## 1     1  37   11     1      0       1        1    0    0  9930.0460
## 2     1  22    9     0      1       0        1    0    0  3595.8940
## 3     1  30   12     1      0       0        0    0    0 24909.4500
## 4     1  27   11     1      0       0        1    0    0  7506.1460
## 5     1  33    8     1      0       0        1    0    0   289.7899

Description of data - treat: binary variable; if the individual had job training program - 1 if treated, 0 if not treated.

  • age: age of the unit in years.

  • educ: years of education.

  • black:binary variable; if unit if Black - 1 if Black, 0 if not Black.

  • hispan:binary variable; if unit is Hispanic - 1 if Hispanic, 0 if not Hispanic.

  • married:binary variable; if the individual is married - 1 if married, 0 if not married.

  • nodegree:binary variable; if the individual has no degree. 1 if no degree, 0 if has a degree.

  • re74: real earnings in 1974 in dollars.

  • re75: real earnings in 1975 in dollars.

  • re78: real earnings in 1978 in dollars.

This dataset is the LaLonde Experimental Data. This dataset originates from a study conducted in the 1970s, evaluating the impact of the National Supported Work Demonstration (NSW) program on participants’ employment outcomes.

We have a sample size of 614 units. The unit of observation is a random person that participated in NSW program.There are 10 variables (described above).

any(is.na(mydata)) #I looked if there are some missing data in dataset.
## [1] FALSE

There is no missing data.

Data manipulation and descriptive analysis

Now I will remove some data, change decimals, name etc., factor variables and display the table for practice.

mydata <- mydata[, -8]  #I removed the eigth column (re74)
mydata <- mydata[, -6]  #I removed the sixth column (married)
mydata <- mydata[, -3]  #I removed the third column (educ), because years of education and degree/no degree are similar.
mydata <- subset(mydata, re75 != 0) #I removed all the units that had 0 real earnings in 1975 (re75 = 0) 
mydata <- subset(mydata, re78 != 0) #I removed all the units that also had 0 real earnings but in 1978 (re78 = 0). I removed this and previous rows, because my research question is based on comparison of job earnings and people, whose earnings are 0 in a year were probably unemployed.
mydata$re75 <- round(mydata$re75, 2)
mydata$re78 <- round(mydata$re78, 2) #I rounded the both real earnings to 2 decimals.
names(mydata)[names(mydata) == "treat"] <- "train" #For training purposes, I changed column name to train so it is clearer when reading the table
mydata$train <- factor(mydata$train, 
                       levels = c(1, 0), 
                       labels = c("trained", "not trained"))

mydata$black <- factor(mydata$black, 
                       levels = c(1, 0), 
                       labels = c("Black", "not Black"))

mydata$hispan <- factor(mydata$hispan, 
                       levels = c(1, 0), 
                       labels = c("Hispanic", "not Hispanic"))

mydata$nodegree <- factor(mydata$nodegree, 
                       levels = c(1, 0), 
                       labels = c("no degree", "degree")) #I did the factoring of certain variables.
black_data <- subset(mydata, black == "Black") #Data filtering to include only Black individuals
head(black_data, 3) #First 3 rows
##       train age black       hispan  nodegree   re75     re78
## 114 trained  18 Black not Hispanic no degree 214.56   929.88
## 117 trained  27 Black not Hispanic    degree 357.95 22163.25
## 118 trained  20 Black not Hispanic    degree 377.57  1652.64
summary(black_data[, c("train", "re78")]) #I only selected columns relevant for research question.
##          train         re78         
##  trained    :47   Min.   :   33.99  
##  not trained:33   1st Qu.: 1745.19  
##                   Median : 6224.56  
##                   Mean   : 8447.64  
##                   3rd Qu.:11053.23  
##                   Max.   :60307.93

We can see that among Black individuals 47 were part of job training program and 33 were not. Race is a categorical variable so frequency is displayed.

We can see that in year 1978 25% of Black individuals earned 1745.19 dollars or less. Half of the individuals had earnings of 6224.56 dollars or less. In 1978 minimal earning was 33.99 dollars and maximum was 60307.93 dollars. Average earning in 1978 was 8447.64 dollars.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
analysis <- black_data %>%
  group_by(train) %>%
  summarize(across(c(re78), list(mean = mean, median = median, sd = sd))) #Create separate summaries for trained and not trained individuals

print(analysis)
## # A tibble: 2 × 4
##   train       re78_mean re78_median re78_sd
##   <fct>           <dbl>       <dbl>   <dbl>
## 1 trained         9003.       6168.  10500.
## 2 not trained     7657.       7216.   6750.

This table summarizes the real earnings of Black individuals in the year 1978 but divided by training (yes/no). For individuals who underwent job training, by 1978 the average real earnings were 9003.113 dollars, with a median (50th percentile) of 6167.68 dollars and a standard deviation of 10499.847 dollars. For individuals who did not undergo job training in 1978 the average real earnings were 7656.504 dollars, with a median of 7215.74 dollars and a standard deviation of 6750.021 dollars (standard deviation indicates a range of earnings, reflecting the diversity of individual experiences).

library(psych)
describeBy(black_data$re78, group = black_data$train)
## 
##  Descriptive statistics by group 
## group: trained
##    vars  n    mean       sd  median trimmed     mad    min      max   range
## X1    1 47 9003.11 10499.85 6167.68    7042 6220.38 671.33 60307.93 59636.6
##    skew kurtosis      se
## X1 2.92    10.47 1531.56
## ------------------------------------------------------------ 
## group: not trained
##    vars  n   mean      sd  median trimmed     mad   min      max    range skew
## X1    1 33 7656.5 6750.02 7215.74 7165.55 8809.52 33.99 20243.38 20209.39 0.54
##    kurtosis      se
## X1    -1.14 1175.03
boxplot(black_data$re78 ~ black_data$train, 
        col = c("blue", "red"),
        main = "Boxplot of Real Earnings in 1978 by Training Status",
        xlab = "Training Status",
        ylab = "Real Earnings in 1978")

From these boxplots it seems like in my data sample I have outliers within the trained group. I browsed the internet to find a common way to look for outliers and found a way - checking z-score. Z-score is measured in terms of standard deviations from the mean. I took that a y-score of 2.0 indicates a value that is an outlier in my dataset.

black_data$z_score <- scale(black_data$re78)
outliers <- black_data[abs(black_data$z_score) > 2, ]
black_data <- black_data[abs(black_data$z_score) <= 2, ]
boxplot(black_data$re78 ~ black_data$train, 
        col = c("blue", "red"),
        main = "Real earnings in 1978 by training",
        xlab = "Training Status",
        ylab = "Real Earnings in 1978") #Boxplot without calculated outliers

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(data = black_data, aes(x = re78, fill = train)) +
  geom_histogram(binwidth = 1000, position = "identity", alpha = 0.7) +
  facet_wrap(~train, scales = "free_y", ncol = 2) +
  labs(
    title = "Histogram of Real Earnings in 1978 for Black Individuals",
    x = "Real Earnings in 1978",
    y = "Frequency",
    fill = "Training Status"
  ) +
  scale_fill_manual(values = c("trained" = "blue", "not trained" = "red")) +
  theme_minimal()

From graph the normality assumption seems to be violated, that is why, I will test normality with Shapiro-Wilk test first.

Hypothesis testing

#install.packages("rstatix")
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
black_data %>%
  group_by(train) %>%
  shapiro_test(re78) #Shapiro-Wilk test for real earnings in 1978
## # A tibble: 2 × 4
##   train       variable statistic        p
##   <fct>       <chr>        <dbl>    <dbl>
## 1 trained     re78         0.888 0.000411
## 2 not trained re78         0.887 0.00253

Null Hypothesis (H0): The data follows a normal distribution. Alternative Hypothesis (H1): The data does not follow a normal distribution.

Since the p-value is small (p<0.001) I reject the null hypothesis. This suggests that the assumption of normality is violated, that is why I need to perform a non-parametric test.

wilcox.test(black_data$re78 ~ black_data$train,
            paired = FALSE,
            correct = FALSE,
            exact= FALSE,
            alternative = "two.sided") #Non-parametric Wilcoxon rank sum test
## 
##  Wilcoxon rank sum test
## 
## data:  black_data$re78 by black_data$train
## W = 763, p-value = 0.8358
## alternative hypothesis: true location shift is not equal to 0

Null Hypothesis (H0): There is no difference in the average real earnings among Black individuals who underwent job training and those who did not in the year 1978.

Alternative Hypothesis (H1): There is a difference in the average real earnings among Black individuals who underwent job training and those who did not in the year 1978.

Based on sampled data I reject the null hypothesis (p-value=0.8358) that the distribution of real earnings in 1978 is the same between Black individuals who underwent job training and those who did not.

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
effectsize(wilcox.test(black_data$re78 ~ black_data$train,
                       paired = FALSE,
                       correct = FALSE,
                       exact= FALSE,
                       alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.03              | [-0.23, 0.28]
interpret_rank_biserial(0.03)
## [1] "tiny"
## (Rules: funder2019)

The effect size is tiny (d=0.03).

t.test(black_data$re78 ~ black_data$train, 
       paired = FALSE,
       var.equal = FALSE,
       alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  black_data$re78 by black_data$train
## t = -0.27769, df = 63.492, p-value = 0.7822
## alternative hypothesis: true difference in means between group trained and group not trained is not equal to 0
## 95 percent confidence interval:
##  -3342.058  2526.446
## sample estimates:
##     mean in group trained mean in group not trained 
##                  7248.698                  7656.504

I still did a Welch t-test, but due to violation of normality assumption, this is not that reliable. But I still can reject the null hypothesis (p-value = 0.7822).

install.packages("effectsize")
## Warning: package 'effectsize' is in use and will not be installed
library(effectsize)
effectsize::cohens_d(black_data$re78 ~ black_data$train, 
                      pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.06     | [-0.52, 0.39]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.06, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)

The effect size is tiny (d=-0.06).