The Assignment

For this assignment, I will conduct regression analyses using several variables from the General Social Survey 2012 Cross-Section and Panel Combined. Specifically, I will be exploring whether there is a relationship between the number of hours a respondent worked in the last week and sex, race, and number of wage earners in the respondent’s household.

1. Preparation

First, I load the packages that I will need. This includes “foreign,” which allows me to read non-R files like the SPSS one used for this assignment.

require(foreign)
## Loading required package: foreign
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr

Next, I read the SPSS .SAV file in to R and then read the newly-read data from GSS.SAV into a data frame in R

GSSdata <- read.spss("GSS.SAV",  
                     max.value.labels=TRUE, to.data.frame=FALSE,
                     trim.factor.names=FALSE, 
                     reencode=NA, use.missings=to.data.frame)

GSSdata <- data.frame(GSSdata)

I then convert the data frame to a table frame by using the dplyr function, tbl_df

GSSdata <- tbl_df(GSSdata)

To be sure I have the data read correctly, I will run some descriptive statistics (mean, standard deviation, and count of cases) for the dependent variable I will use, HRS1:

GSSdata %>%
  summarise(mean_hours=mean(HRS1), sd_hours=sd(HRS1), n=n())
## Source: local data frame [1 x 3]
## 
##   mean_hours sd_hours     n
##        (dbl)    (dbl) (int)
## 1   24.07573 24.19891  4820
summary(GSSdata$HRS1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -1.00   -1.00   25.00   24.08   40.00   99.00

1. Visualizations

1a. Before I create the two histograms needed for #1, I need to get rid of the “non-numbers” (-1) in the HRS1 set and change the “99” (no answer) responses to NA.

GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == 99, NA, GSSdata$HRS1)
GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == 98, NA, GSSdata$HRS1)
GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == -1, NA, GSSdata$HRS1)

GSSdata <- tbl_df(GSSdata)

summary(GSSdata$HRS1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   35.00   40.00   40.26   50.00   89.00    1966
GSSdata %>%
  summarise(mean_hours=mean(HRS1, na.rm=TRUE), sd_hours=sd(HRS1, na.rm=TRUE), n=n())
## Source: local data frame [1 x 3]
## 
##   mean_hours sd_hours     n
##        (dbl)    (dbl) (int)
## 1   40.26384 15.47928  4820

Now I create my histogram of HRS1

GSSdata %>% ggvis(~GSSdata$HRS1) %>% layer_histograms() %>%
  add_axis("x", title = "Hours Worked per Week", title_offset="50") %>%
  add_axis("y", title = "Number of Respondents", title_offset="50") %>%
  add_axis("x", orient = "top", ticks = 0, title = "Histogram of Hours Worked per Week",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))
## Guessing width = 5 # range / 18

1b. Create another histogram of logarithm of HRS1

Now I am ready to create the histogram of logarithm of HRS1

GSSdata %>% ggvis(~log(GSSdata$HRS1)) %>% layer_histograms() %>%
  add_axis("x", title = "Logarithm of Hours Worked per Week", title_offset="50") %>%
  add_axis("y", title = "Number of Respondents", title_offset="50") %>%
  add_axis("x", orient = "top", ticks = 0, title = "Histogram of the Logarithm of Hours Worked per Week",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))
## Guessing width = 0.2 # range / 23

2. Analyses

First, I will run frequency distributions to get a better look at the data

hoursworked <- table(GSSdata$HRS1)
hoursworked
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   9   2   1  10  10   6   5  19  95  20   2  20   5   5  32  24   3   8 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   3 104   7   9   2  30  50   9   9  13   2 105   6  69   8  10  84  50 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##  36  43   9 831  15  39  17  35 149  29  11  66   3 219   3  25   6   1 
##  55  56  57  58  59  60  62  63  64  65  66  67  68  70  72  74  75  77 
##  84  18   9  11   2 166   8   3   4  30   1   3   9  51   9   1   6   2 
##  80  82  85  86  89 
##  27   1   3   2  21
earners <- table(GSSdata$EARNRS)
earners
## 
##    0    1    2    3    4    5    6    7    8    9 
##  913 1807 1555  324  147   33   13    5    4   19
sex <- table(GSSdata$SEX)
sex
## 
##    1    2 
## 2132 2688
race <- table(GSSdata$RACE)
race
## 
##    1    2    3 
## 3700  722  398

All of my variables first need to be transformed.

I start by creating a new variable for RACE that consists of only “White” (1) or “Non-White” (0) which will combine Black (2) and Other (3) into a single category. (This will also enable me to assign meaningful numerical values to this categorical variable.)

GSSdata$RACEnew <- 0 
GSSdata$RACEnew <- ifelse(GSSdata$RACE == 1, 1, GSSdata$RACEnew)
racetest <- table(GSSdata$RACEnew)
racetest
## 
##    0    1 
## 1120 3700

For EARNRS, I want to create two new variables from the EARNRS data so that in my least squares regression, I can see 1) if there is any relationship between homes that have 1 or more wage earners and hours worked in the past week and 2) if there is any relationship between homes that have 2 or more wage earners and hours worked in the past week.

The data for EARNRS also reveals that the 9s (no answer) were not changed into NAs. I will have to do that myself for both of the new variables so R will recognize those as “NA”

earners <- table(GSSdata$EARNRS)
earners
## 
##    0    1    2    3    4    5    6    7    8    9 
##  913 1807 1555  324  147   33   13    5    4   19
GSSdata$EARNRS1 <- 0 
GSSdata$EARNRS1 <- ifelse(GSSdata$EARNRS >= 1 & GSSdata$EARNRS < 9, 1, GSSdata$EARNRS1)
GSSdata$EARNRS1 <- ifelse(GSSdata$EARNRS == 9, NA, GSSdata$EARNRS1)

GSSdata$EARNRS2 <- 0
GSSdata$EARNRS2 <- ifelse(GSSdata$EARNRS >= 2 & GSSdata$EARNRS < 9, 1, GSSdata$EARNRS2)
GSSdata$EARNRS2 <- ifelse(GSSdata$EARNRS == 9, NA, GSSdata$EARNRS2)

earntest1 <- table(GSSdata$EARNRS1)
earntest1
## 
##    0    1 
##  913 3888
earntest2 <- table(GSSdata$EARNRS2)
earntest2
## 
##    0    1 
## 2720 2081

SEX is a binary categorical variable, coded as either 1 (Male) or 2 (Female). To assign meaningful numerical values to these level of categorical variables, I need to create a new dummy variable for SEX that will recode Male (currently 1) to 0 and Female (currently 2) to 1.

GSSdata$SEXnew <- 0 
GSSdata$SEXnew <- ifelse(GSSdata$SEX == 2, 1, GSSdata$SEXnew)

sex <- table(GSSdata$SEX)
sex
## 
##    1    2 
## 2132 2688
sextest <- table(GSSdata$SEXnew)
sextest
## 
##    0    1 
## 2132 2688

2a. Conduct an ordinary least squares regression analysis using HRS1 as the dependent variable and my four independent variables: SEX, RACEnew, EARNRS1, and EARNRS2.

leastsquares1 <- lm(HRS1 ~ SEXnew + RACEnew + EARNRS1 + EARNRS2,data=GSSdata)
summary(leastsquares1)
## 
## Call:
## lm(formula = HRS1 ~ SEXnew + RACEnew + EARNRS1 + EARNRS2, data = GSSdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.371  -6.242   1.256   6.521  52.131 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.3671     2.3695  15.348  < 2e-16 ***
## SEXnew       -6.8924     0.5653 -12.192  < 2e-16 ***
## RACEnew       0.6269     0.6740   0.930  0.35239    
## EARNRS1       7.3769     2.3407   3.152  0.00164 ** 
## EARNRS2      -0.6099     0.5774  -1.056  0.29099    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.08 on 2844 degrees of freedom
##   (1971 observations deleted due to missingness)
## Multiple R-squared:  0.05327,    Adjusted R-squared:  0.05194 
## F-statistic: 40.01 on 4 and 2844 DF,  p-value: < 2.2e-16
confint(leastsquares1)
##                  2.5 %     97.5 %
## (Intercept) 31.7210911 41.0131768
## SEXnew      -8.0008978 -5.7839403
## RACEnew     -0.6947109  1.9485622
## EARNRS1      2.7871952 11.9665397
## EARNRS2     -1.7421130  0.5223776

2a. Hypothesis tests

Is the entire set of independent variables related to the dependent variable?

I read from the findings that:

## Multiple R-squared: 0.05327, Adjusted R-squared: 0.05194

## F-statistic: 40.01 on 4 and 2844 DF, p-value: < 2.2e-16

If α = .05, then the p-value, < 2.2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.

Since I reject the null hypothesis, I will now explore whether each of the independent variables is related to the dependent variable.

First I examine the regression coefficient for one independent variable, SEXnew.

## Estimate Std. Error t value Pr(>|t|)

## SEXnew -6.8924 0.5653 -12.192 < 2e-16

Again, if α = .05, then the p-value, < 2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and SEXnew.

Point estimate for SEXnew

The SEXnew variable was coded as either 0 (Male) or 1 (Female). The estimate of the population coefficient is -6.89 (rounded). This means that there is a negative relationship between HRS1 and SEXnew. On average, women work 6.89% hours per week less than men.

Interval estimate for SEXnew

How accurate is the estimate of the relationship between hours worked per week and sex? Here is the 95% confidence interval for SEXnew:

## 2.5 % 97.5 %

## SEXnew -8.0008978 -5.7839403

Our best estimate is that, on average, women work 7.49% hours per week less than men. However, we are 95% confident that women work between 5.78% and 8.00% hours per week less than men.

Next I examine the regression coefficient for another independent variable, RACEnew.

## Estimate Std. Error t value Pr(>|t|)

## RACEnew 0.6269 0.6740 0.930 0.35239

If α = .05, then the p-value, 0.35, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and RACEnew.

I then examine the regression coefficient for another independent variable, EARNRS1.

## Estimate Std. Error t value Pr(>|t|)

## EARNRS1 7.3769 2.3407 3.152 0.00164

EARNRS1 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (No wage earners in the respondent’s family in the household) or 1 (1 or more wage earners in the respondent’s family in the household).

If α = .05, then the p-value, < .002, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and EARNRS1.

Point estimate for EARNRS1

The estimate of the population coefficient is 7.38 (rounded). This means that there is a positive relationship between HRS1 and EARNRS1. On average, households with 1 or more wage earners report working 7.38% hours more per week than households with no wage earners.

Interval estimate for EARNRS1

How accurate is the estimate of the relationship between hours worked per week and the number of a household’s family wage earners? Here is the 95% confidence interval for EARNRS1:

## 2.5 % 97.5 %

## EARNRS1 2.7871952 11.9665397

Our best estimate is that, on average, households with 1 or more wage earners report working 7.38% hours more per week than households with no wage earners. However, we are 95% confident that households with 1 or more wage earners report working between 2.79% and 11.97% hours more per week.

I then examine the regression coefficient for another independent variable, EARNRS2.

## Estimate Std. Error t value Pr(>|t|)

## EARNRS2 -0.6099 0.5774 -1.056 0.29099

EARNRS2 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (0 to 1 wage earners in the respondent’s family in the household) or 1 (2 or more wage earners in the respondent’s family in the household).

If α = .05, then the p-value, .29, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and EARNRS2.

2b. Conduct an ordinary least squares regression analysis using ln(HRS1) as the dependent variable and my four independent variables: SEX_new, RACEnew, EARNRS1, and EARNRS2.

leastsquares2 <- lm(log(HRS1) ~ SEXnew + RACEnew + EARNRS1 + EARNRS2,data=GSSdata)
summary(leastsquares2)
## 
## Call:
## lm(formula = log(HRS1) ~ SEXnew + RACEnew + EARNRS1 + EARNRS2, 
##     data = GSSdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6951 -0.0575  0.1381  0.2562  1.0091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.482908   0.085484  40.743   <2e-16 ***
## SEXnew      -0.207448   0.020395 -10.171   <2e-16 ***
## RACEnew     -0.003406   0.024317  -0.140   0.8886    
## EARNRS1      0.215589   0.084447   2.553   0.0107 *  
## EARNRS2     -0.004572   0.020833  -0.219   0.8263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5439 on 2844 degrees of freedom
##   (1971 observations deleted due to missingness)
## Multiple R-squared:  0.03708,    Adjusted R-squared:  0.03573 
## F-statistic: 27.38 on 4 and 2844 DF,  p-value: < 2.2e-16
confint(leastsquares2)
##                   2.5 %      97.5 %
## (Intercept)  3.31529104  3.65052540
## SEXnew      -0.24743871 -0.16745663
## RACEnew     -0.05108686  0.04427558
## EARNRS1      0.05000585  0.38117279
## EARNRS2     -0.04542040  0.03627654

2b. Hypothesis tests

Is the entire set of independent variables related to the dependent variable?

I read from the findings that:

## Multiple R-squared: 0.03708, Adjusted R-squared: 0.03573

## F-statistic: 27.38 on 4 and 2844 DF, p-value: < 2.2e-16

If α = .05, then the p-value, < 2.2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.

Since I reject the null hypothesis, I will now explore whether each of the independent variables is related to the dependent variable.

First I examine the regression coefficient for one independent variable, SEXnew.

## Estimate Std. Error t value Pr(>|t|)

## SEXnew -0.207448 0.020395 -10.171 <2e-16

Again, if α = .05, then the p-value, < 2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between ln(HRS1) and SEXnew.

Point estimate for SEXnew

The SEXnew variable was coded as either 0 (Male) or 1 (Female). The estimate of the population coefficient is -0.21 (rounded). This means that there is a negative relationship between ln(HRS1) and SEXnew. On average, women work .21% hours per week less than men.

Interval estimate for SEXnew

How accurate is the estimate of the relationship between hours worked per week and sex? Here is the 95% confidence interval for SEXnew:

## 2.5 % 97.5 %

## SEXnew -0.24743871 -0.16745663

Our best estimate is that, on average, women work .21% hours per week less than men. However, we are 95% confident that women work between .17% and .25% hours per week less than men.

Next I examine the regression coefficient for another independent variable, RACEnew.

## Estimate Std. Error t value Pr(>|t|)

## RACEnew -0.003406 0.024317 -0.140 0.8886

If α = .05, then the p-value, 0.89, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and RACEnew.

I then examine the regression coefficient for another independent variable, EARNRS1.

## Estimate Std. Error t value Pr(>|t|)

## EARNRS1 0.215589 0.084447 2.553 0.0107

EARNRS1 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (No wage earners in the respondent’s family in the household) or 1 (1 or more wage earners in the respondent’s family in the household).

If α = .05, then the p-value, < .01, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and EARNRS1.

Point estimate for EARNRS1

The estimate of the population coefficient is .22 (rounded). This means that there is a positive relationship between HRS1 and EARNRS1. On average, households with 1 or more wage earners report working .22% hours more per week than households with no wage earners.

Interval estimate for EARNRS1

How accurate is the estimate of the relationship between hours worked per week and the number of a household’s family wage earners? Here is the 95% confidence interval for EARNRS1:

## 2.5 % 97.5 %

## EARNRS1 0.05000585 0.38117279

Our best estimate is that, on average, households with 1 or more wage earners report working .22% hours more per week than households with no wage earners. However, we are 95% confident that households with 1 or more wage earners report working between .05% and .38% hours more per week.

I then examine the regression coefficient for another independent variable, EARNRS2.

## Estimate Std. Error t value Pr(>|t|)

## EARNRS2 -0.004572 0.020833 -0.219 0.8263

EARNRS2 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS2 variable was coded as either 0 (0 to 1 wage earners in the respondent’s family in the household) or 1 (2 or more wage earners in the respondent’s family in the household).

If α = .05, then the p-value, < .83, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and EARNRS2.

3. Reports

#First, I need means and standard deviations of all variables in each equation.

GSSdata %>%
  summarise(mean_hours=mean(HRS1, na.rm=TRUE), mean_sex=mean(SEXnew), mean_race=mean(RACEnew), mean_earners1 = mean(EARNRS1,na.rm=TRUE), mean_earners2 = mean(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
## 
##   mean_hours  mean_sex mean_race mean_earners1 mean_earners2     n
##        (dbl)     (dbl)     (dbl)         (dbl)         (dbl) (int)
## 1   40.26384 0.5576763 0.7676349     0.8098313     0.4334514  4820
GSSdata %>%
  summarise(sd_hours=sd(HRS1,na.rm=TRUE), sd_sex=sd(SEXnew), sd_race=sd(RACEnew), sd_earners1 = sd(EARNRS1,na.rm=TRUE), sd_earners2 = sd(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
## 
##   sd_hours    sd_sex   sd_race sd_earners1 sd_earners2     n
##      (dbl)     (dbl)     (dbl)       (dbl)       (dbl) (int)
## 1 15.47928 0.4967138 0.4223844    0.392475   0.4956031  4820
GSSdata %>%
  summarise(mean_hours=mean(log(GSSdata$HRS1),na.rm=TRUE), mean_sex=mean(SEXnew), mean_race=mean(RACEnew), mean_earners1 = mean(EARNRS1,na.rm=TRUE), mean_earners2 = mean(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
## 
##   mean_hours  mean_sex mean_race mean_earners1 mean_earners2     n
##        (dbl)     (dbl)     (dbl)         (dbl)         (dbl) (int)
## 1   3.584611 0.5576763 0.7676349     0.8098313     0.4334514  4820
GSSdata %>%
  summarise(sd_hours=sd(log(GSSdata$HRS1),na.rm=TRUE), sd_sex=sd(SEXnew), sd_race=sd(RACEnew), sd_earners1 = sd(EARNRS1,na.rm=TRUE), sd_earners2 = sd(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
## 
##    sd_hours    sd_sex   sd_race sd_earners1 sd_earners2     n
##       (dbl)     (dbl)     (dbl)       (dbl)       (dbl) (int)
## 1 0.5535699 0.4967138 0.4223844    0.392475   0.4956031  4820