Data for this assignment came from The General Social Surveys (GSS) Cross-Section and Panel Combined and can be accessed here.

Refining, Analyzing, Visualizing & Reporting Data from the General Social Surveys 2012 Cross-Section & Panel Combined

1.) As directed, I created a new project in R and downloaded the GSS data files into my project folder.

## Source: local data frame [1 x 3]
## 
##   mean_hours sd_hours     n
##        (dbl)    (dbl) (int)
## 1   24.07573 24.19891  4820

Visualizations

As instructed, I created a histogram of the variable HRS1, the dependent variable for this assignment, to better understand the data and its distribution.

Histogram of HRS1

GSSdata %>% ggvis(~HRS1) %>% layer_histograms() %>% add_axis("x", title = "Hours Worked") %>% add_axis("y", title = "Response Count")
## Guessing width = 5 # range / 21

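I noted the large number of responses less than or equal to zero, so as instructed, I created another histogram of the natural log of HRS1. First, though, I applied the ‘log1p’ function, which computes log(1 + x), so that zero responses stay defined, because the natural log of zero is undefined. Note that the plotting call then applies log() to that result a second time, which produces the NaN warnings below for the negative codes: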

Histogram of the natural log of HRS1

lHRS1 <- log1p(GSSdata$HRS1)
GSSdata %>% ggvis(~log(lHRS1)) %>% layer_histograms() %>% add_axis("x", title = "Hours Worked") %>% add_axis("y", title = "Response Count")
## Warning in log(lHRS1): NaNs produced
## Warning in log(lHRS1): NaNs produced
## Guessing width = 0.05 # range / 39
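
The NaN warnings above can be traced with a few values. The sketch below is illustrative only: the vector `hrs` is made up (it is not GSS data) and simply holds a negative missing-value code, a zero, and a typical value, to show how `log1p` and the nested `log(log1p(x))` from the plotting call behave.

```r
# Hypothetical values: -1 stands in for a negative missing-value code.
hrs <- c(-1, 0, 40)

# log1p(x) computes log(1 + x), so zero maps to zero instead of -Inf.
l1 <- log1p(hrs)            # -Inf, 0, log(41)

# Taking log() of that result again is what triggers the NaN warning:
# log(-Inf) is NaN, and log(0) is -Inf.
l2 <- suppressWarnings(log(l1))

print(data.frame(hrs = hrs, log1p = l1, log_of_log1p = l2))
```

Under these assumptions, only the negative code ends up as NaN; the zero response becomes -Inf after the second log, so it would still drop out of a histogram.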

I saw that HRS1 had some responses that needed to be recoded as NA, which I did below:

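GSSdata$HRS1[GSSdata$HRS1 < 0] = NA  # negative values are missing-data codes
GSSdata$HRS1[GSSdata$HRS1 == 99] = NA  # 99 is a missing-data code, not hours worked
GSSdata$HRS1[GSSdata$HRS1 == 98] = NA  # 98 is a missing-data code, not hours worked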
summary(GSSdata$HRS1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   35.00   40.00   40.26   50.00   89.00    1966

My data now look much more plausible, as the median (40.00) and mean (40.26) both sit right around 40 hours per week.
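
The recoding above can also be written as a single step. This is a sketch on a made-up vector, assuming (as in the recoding above) that negative values, 98, and 99 are missing-data codes; `recode_missing` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: set negative values and the listed codes to NA.
recode_missing <- function(x, codes = c(98, 99)) {
  x[!is.na(x) & (x < 0 | x %in% codes)] <- NA
  x
}

# Made-up example values, not GSS data.
hrs <- c(-1, 40, 98, 35, 99, NA, 60)
hrs_clean <- recode_missing(hrs)

print(hrs_clean)
print(summary(hrs_clean))
```

Keeping the codes in one argument makes it easier to reuse the same helper for variables whose missing-data codes differ.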

Analyses

As instructed, I conducted two ordinary least squares regression analyses. Before beginning my analysis, I trimmed the data, removing non-applicable responses:

Re-code Health

GSSdata$HEALTH[GSSdata$HEALTH == 0] = NA  # Non-applicable
GSSdata$HEALTH[GSSdata$HEALTH == 9] = NA  # Don't know
GSSdata$HEALTH[GSSdata$HEALTH == 8] = NA  # Invalid missing
summary(GSSdata$HEALTH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   2.086   3.000   4.000    1672

Re-code HAPPY

GSSdata$HAPPY[GSSdata$HAPPY == 8] = NA  # missing-data code
GSSdata$HAPPY[GSSdata$HAPPY == 9] = NA  # missing-data code
summary(GSSdata$HAPPY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.848   2.000   3.000      14

Checking AGE

GSSdata$AGE[GSSdata$AGE == 99] = NA  # missing-data code
summary(GSSdata$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   35.00   49.00   49.59   62.00   89.00      51
GSSdata %>% ggvis(~AGE) %>% layer_histograms()
## Guessing width = 2 # range / 36

I saw that 75% of respondents were between the ages of 18 and 62, but my maximum reported age was 89. I trimmed the extreme values (ages above 78) from the data so as to have a more accurate mean for this variable:

GSSdata$AGE[GSSdata$AGE > 78] = NA  # trim ages above 78
summary(GSSdata$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   34.00   48.00   47.53   60.00   78.00     322
GSSdata %>% ggvis(~AGE) %>% layer_histograms()
## Guessing width = 2 # range / 30

With HRS1 (hours worked at all jobs last week) as the dependent variable, I created a multiple regression containing three independent variables: HEALTH, HAPPY and AGE (variables defined here: http://thearda.com/Archive/Files/Codebooks/GSS12PAN_CB.asp):

rg1 <- lm(HRS1 ~ HEALTH + HAPPY + AGE, data = GSSdata)
summary(rg1)
## 
## Call:
## lm(formula = HRS1 ~ HEALTH + HAPPY + AGE, data = GSSdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.728  -5.908  -0.044   7.728  50.542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 46.97657    1.73394  27.092  < 2e-16 ***
## HEALTH      -0.78560    0.49650  -1.582  0.11376    
## HAPPY       -1.93689    0.60590  -3.197  0.00141 ** 
## AGE         -0.03814    0.02746  -1.389  0.16495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.27 on 1824 degrees of freedom
##   (2992 observations deleted due to missingness)
## Multiple R-squared:  0.01036,    Adjusted R-squared:  0.008732 
## F-statistic: 6.365 on 3 and 1824 DF,  p-value: 0.0002734

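I also took the natural log of HRS1 and regressed it on the same three independent variables, HEALTH, HAPPY and AGE. Note that the formula applies log() to lHRS1, which is already log1p(HRS1), so the dependent variable in rg2 is log(log1p(HRS1)):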

lHRS1 <- log1p(GSSdata$HRS1)
rg2 <- lm(log(lHRS1) ~ HEALTH + HAPPY + AGE, data = GSSdata)
summary(rg2)
## 
## Call:
## lm(formula = log(lHRS1) ~ HEALTH + HAPPY + AGE, data = GSSdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65702 -0.00234  0.04073  0.08719  0.25191 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.3329854  0.0214276  62.209   <2e-16 ***
## HEALTH      -0.0081181  0.0061356  -1.323   0.1860    
## HAPPY       -0.0099180  0.0074876  -1.325   0.1855    
## AGE         -0.0006111  0.0003393  -1.801   0.0719 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1888 on 1824 degrees of freedom
##   (2992 observations deleted due to missingness)
## Multiple R-squared:  0.004575,   Adjusted R-squared:  0.002938 
## F-statistic: 2.794 on 3 and 1824 DF,  p-value: 0.03902

Hypothesis testing:

Reporting Results

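In rg2 (the model of the natural log of HRS1), none of the variables has a statistically significant relationship with the dependent variable with alpha set to .05, so I fail to reject the null hypothesis that there is no relationship between HEALTH, HAPPY, AGE and the natural log of HRS1. In rg1, by contrast, the HAPPY coefficient is significant (p = 0.00141), while HEALTH and AGE are not.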

              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.3329854  0.0214276  62.209   <2e-16 ***
HEALTH      -0.0081181  0.0061356  -1.323   0.1860
HAPPY       -0.0099180  0.0074876  -1.325   0.1855
AGE         -0.0006111  0.0003393  -1.801   0.0719 .
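
The p-values in the regression summaries follow from the t statistics and the residual degrees of freedom. As a rough check, the sketch below recomputes two-sided p-values for HAPPY from the rounded t values and the 1824 degrees of freedom reported above. Because the inputs are rounded, the results agree with the printed p-values only to a few digits; `p_value` is a helper defined here for illustration.

```r
# t statistics for HAPPY copied from the two summaries above.
t_happy_rg1 <- -3.197   # model rg1
t_happy_rg2 <- -1.325   # model rg2
df_resid    <- 1824     # residual degrees of freedom in both models
alpha       <- 0.05

# Two-sided p-value: twice the lower-tail probability of -|t|.
p_value <- function(t, df) 2 * pt(-abs(t), df)

p_rg1 <- p_value(t_happy_rg1, df_resid)
p_rg2 <- p_value(t_happy_rg2, df_resid)

print(c(rg1 = p_rg1, rg2 = p_rg2))
print(c(rg1 = p_rg1 < alpha, rg2 = p_rg2 < alpha))
```

The comparison with alpha is made separately for each model, since the same predictor can fall on different sides of the threshold depending on how the dependent variable is transformed.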

Written Interpretation of Two Regression Coefficients

Once again looking to the results of rg2 (the analysis involving the natural log of HRS1), below is an interpretation of the HAPPY and AGE variable regression results:

HAPPY

The p-value for the HAPPY coefficient (0.1855) is greater than alpha, so I fail to reject the null hypothesis that the population coefficient for HAPPY in the model of the natural log of HRS1 is equal to zero.

Population Coefficient:

HAPPY -0.0099180 0.0074876 -1.325 0.1855

Interval Estimate:

             2.5 %       97.5 %
HAPPY -0.024603155 4.767202e-03

Note that the confidence interval includes ‘0’, which supports my decision to fail to reject the null hypothesis.

AGE

The p-value for the AGE coefficient (0.0719) is greater than alpha, so I fail to reject the null hypothesis that the population coefficient for AGE in the model of the natural log of HRS1 is equal to zero.

Population Coefficient:

AGE -0.0006111 0.0003393 -1.801 0.0719

Interval Estimate:

           2.5 %       97.5 %
AGE -0.001276534 5.436623e-05

Note that the confidence interval includes ‘0’, which supports my decision to fail to reject the null hypothesis.
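
The interval estimates above can be reproduced from the coefficient table. As a sketch, the code below rebuilds the 95% interval for AGE in rg2 from its printed estimate and standard error, using the t quantile for 1824 residual degrees of freedom. Since the printed inputs are rounded, the bounds match `confint(rg2)` only approximately.

```r
# Values copied from the rg2 coefficient table.
est_age  <- -0.0006111
se_age   <-  0.0003393
df_resid <- 1824

# 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci_age <- est_age + c(-1, 1) * t_crit * se_age
names(ci_age) <- c("2.5 %", "97.5 %")
print(ci_age)

# The interval contains zero exactly when the two-sided p-value exceeds 0.05.
contains_zero <- ci_age[[1]] < 0 && ci_age[[2]] > 0
print(contains_zero)
```

This is why the interval check and the p-value check always lead to the same decision at the same alpha: both are built from the same estimate, standard error, and t distribution.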

3.) Before reporting the regression results in an APA-compliant table, I had to calculate the means, standard deviations and confidence intervals of the variables:

Means:

GS <- tbl_df(GSSdata)
GS %>%
  summarise(mean_HRS1=mean(HRS1), mean_HEALTH=mean(HEALTH), mean_HAPPY=mean(HAPPY),
            mean_AGE = mean(AGE)) 
## Source: local data frame [1 x 4]
## 
##   mean_HRS1 mean_HEALTH mean_HAPPY mean_AGE
##       (dbl)       (dbl)      (dbl)    (dbl)
## 1        NA          NA         NA       NA

and the natural log of HRS1:

GS <- tbl_df(GSSdata)
GS %>%
  summarise(mean_lHRS1=mean(lHRS1, na.rm=TRUE), mean_HEALTH=mean(HEALTH), mean_HAPPY=mean(HAPPY),
            mean_AGE = mean(AGE), n=n()) 
## Source: local data frame [1 x 5]
## 
##   mean_lHRS1 mean_HEALTH mean_HAPPY mean_AGE     n
##        (dbl)       (dbl)      (dbl)    (dbl) (int)
## 1   3.618764          NA         NA       NA  4820

Standard Deviations:

GS %>%
  summarise(sd_HRS1=sd(HRS1, na.rm=TRUE), sd_HEALTH=sd(HEALTH), sd_HAPPY=sd(HAPPY),
            sd_AGE = sd(AGE), n=n())
## Source: local data frame [1 x 5]
## 
##    sd_HRS1 sd_HEALTH sd_HAPPY sd_AGE     n
##      (dbl)     (dbl)    (dbl)  (dbl) (int)
## 1 15.47928       NaN      NaN    NaN  4820

and the natural log of HRS1:

GS %>%
  summarise(sd_lHRS1=sd(lHRS1, na.rm=TRUE), sd_HEALTH=sd(HEALTH), sd_HAPPY=sd(HAPPY),
            sd_AGE = sd(AGE), n=n())
## Source: local data frame [1 x 5]
## 
##    sd_lHRS1 sd_HEALTH sd_HAPPY sd_AGE     n
##       (dbl)     (dbl)    (dbl)  (dbl) (int)
## 1 0.5170823       NaN      NaN    NaN  4820
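
The NA and NaN entries above arise because the recoded columns contain missing values and only some of the `mean()` and `sd()` calls pass `na.rm = TRUE`. The base-R sketch below shows the difference on a small made-up data frame (`toy` is hypothetical, not the GSS data), and also counts non-missing observations per variable rather than all rows.

```r
# Made-up data with missing values, standing in for the recoded variables.
toy <- data.frame(
  HEALTH = c(1, 2, NA, 3),
  HAPPY  = c(2, NA, 1, 3),
  AGE    = c(30, 45, 60, NA)
)

# Without na.rm, any NA in a column makes the result NA.
print(mean(toy$HEALTH))

# With na.rm = TRUE, missing values are dropped before computing.
means <- sapply(toy, mean, na.rm = TRUE)
sds   <- sapply(toy, sd,   na.rm = TRUE)
n_obs <- colSums(!is.na(toy))   # non-missing count per variable

print(rbind(mean = means, sd = sds, n = n_obs))
```

In the dplyr pipeline, the equivalent change would be adding `na.rm = TRUE` to each `mean()` and `sd()` call inside `summarise()`.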

Confidence Intervals:

confint(rg1)
##                   2.5 %      97.5 %
## (Intercept) 43.57586154 50.37728218
## HEALTH      -1.75936730  0.18817138
## HAPPY       -3.12522384 -0.74855302
## AGE         -0.09199036  0.01570709

and the natural log of HRS1:

confint(rg2)
##                    2.5 %       97.5 %
## (Intercept)  1.290960188 1.375011e+00
## HEALTH      -0.020151735 3.915504e-03
## HAPPY       -0.024603155 4.767202e-03
## AGE         -0.001276534 5.436623e-05

The Tables:

Here is the table reporting the results for rg1:

Here is the table reporting the results for rg2: