Data for this assignment came from The General Social Surveys (GSS) Cross-Section and Panel Combined and can be accessed here.
As instructed, I created a histogram of HRS1, the dependent variable for this assignment, to better understand the data and its distribution.
GSSdata %>% ggvis(~HRS1) %>% layer_histograms() %>% add_axis("x", title = "Hours Worked") %>% add_axis("y", title = "Response Count")
## Guessing width = 5 # range / 21
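I noted the large number of responses less than or equal to zero, so, as instructed, I created another histogram of the natural log of HRS1. Because the natural log of zero is undefined, I first applied the ‘log1p’ function, which computes log(1 + x) and is therefore defined at zero; it does not remove the zero responses. The plotting call then takes the log of the transformed values, and the negative codes produce the NaN warnings shown below: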
lHRS1 <- log1p(GSSdata$HRS1)
GSSdata %>% ggvis(~log(lHRS1)) %>% layer_histograms() %>% add_axis("x", title = "Hours Worked") %>% add_axis("y", title = "Response Count")
## Warning in log(lHRS1): NaNs produced
## Warning in log(lHRS1): NaNs produced
## Guessing width = 0.05 # range / 39
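To make the effect of each transformation concrete, here is a minimal sketch comparing log(), log1p(), and the nested log(log1p()) used in the plotting call above. The `hours` vector is made up for illustration and is not GSS data.

```r
# Hypothetical hours values, including a negative missing-data code and a zero
hours <- c(-1, 0, 1, 40, 89)

plain_log  <- suppressWarnings(log(hours))    # NaN for -1, -Inf for 0
shifted    <- log1p(hours)                    # log(1 + x): -Inf for -1, 0 for 0
nested_log <- suppressWarnings(log(shifted))  # NaN for -1, -Inf for 0, finite for hours > 0

print(data.frame(hours, plain_log, shifted, nested_log))
```

In this sketch only the positive values survive the nested transformation as finite numbers, which is consistent with the NaN warnings in the output above.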
I saw that HRS1 had some responses that needed to be recoded as “NAs”, which I do below:
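GSSdata$HRS1[GSSdata$HRS1 < 0] <- NA    # negative codes are not valid hours
GSSdata$HRS1[GSSdata$HRS1 == 99] <- NA  # 99 is a missing-data code, not 99 hours
GSSdata$HRS1[GSSdata$HRS1 == 98] <- NA  # 98 is a missing-data code, not 98 hours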
summary(GSSdata$HRS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 35.00 40.00 40.26 50.00 89.00 1966
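The same recoding pattern recurs for each variable below, so it could be wrapped in a small helper. This is only a sketch: `recode_missing`, its arguments, and the sample values are hypothetical names and data, not part of the original analysis.

```r
# Replace sentinel codes (and anything below min_valid) with NA.
# `recode_missing`, `codes` and `min_valid` are illustrative names.
recode_missing <- function(x, codes = numeric(0), min_valid = -Inf) {
  x[x %in% codes | (!is.na(x) & x < min_valid)] <- NA
  x
}

# Made-up responses: -1, 98 and 99 stand in for missing-data codes
hrs <- c(40, -1, 98, 35, 99, 60, NA)
hrs_clean <- recode_missing(hrs, codes = c(98, 99), min_valid = 0)

print(hrs_clean)
print(summary(hrs_clean))
```

Using `%in%` rather than `==` keeps the logical index free of NA values, so the helper can be applied again to a column that already contains missing values.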
The distribution now looks much more plausible: the median (40.00) and the mean (40.26) both sit at about 40 hours per week.
As instructed, I conducted two ordinary least squares regression analyses. Before beginning the analysis, I trimmed the data, removing non-applicable responses:
Re-code Health
GSSdata$HEALTH[GSSdata$HEALTH == 0] = NA
GSSdata$HEALTH[GSSdata$HEALTH == 9] = NA # Don't know
GSSdata$HEALTH[GSSdata$HEALTH == 8] = NA # Invalid missing
summary(GSSdata$HEALTH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 2.086 3.000 4.000 1672
Re-code HAPPY
GSSdata$HAPPY[GSSdata$HAPPY == 8] = NA #
GSSdata$HAPPY[GSSdata$HAPPY == 9] = NA #
summary(GSSdata$HAPPY)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.848 2.000 3.000 14
Checking AGE
GSSdata$AGE[GSSdata$AGE == 99] = NA #
summary(GSSdata$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 35.00 49.00 49.59 62.00 89.00 51
GSSdata %>% ggvis(~AGE) %>% layer_histograms()
## Guessing width = 2 # range / 36
I saw that 75% of respondents were between the ages of 18 and 62, but the maximum reported age was 89. I trimmed the extreme values (ages above 78) so that they would not pull the mean of this variable upward:
GSSdata$AGE[GSSdata$AGE > 78] = NA #
summary(GSSdata$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 34.00 48.00 47.53 60.00 78.00 322
GSSdata %>% ggvis(~AGE) %>% layer_histograms()
## Guessing width = 2 # range / 30
With HRS1 (hours worked at all jobs last week) as the dependent variable, I fit a multiple regression with three independent variables: HEALTH, HAPPY and AGE (variables defined here: http://thearda.com/Archive/Files/Codebooks/GSS12PAN_CB.asp):
rg1 <- lm(HRS1 ~ HEALTH + HAPPY + AGE, data = GSSdata)
summary(rg1)
##
## Call:
## lm(formula = HRS1 ~ HEALTH + HAPPY + AGE, data = GSSdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.728 -5.908 -0.044 7.728 50.542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.97657 1.73394 27.092 < 2e-16 ***
## HEALTH -0.78560 0.49650 -1.582 0.11376
## HAPPY -1.93689 0.60590 -3.197 0.00141 **
## AGE -0.03814 0.02746 -1.389 0.16495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.27 on 1824 degrees of freedom
## (2992 observations deleted due to missingness)
## Multiple R-squared: 0.01036, Adjusted R-squared: 0.008732
## F-statistic: 6.365 on 3 and 1824 DF, p-value: 0.0002734
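The line "2992 observations deleted due to missingness" reflects the default behaviour of lm(), which drops every row with an NA in any model variable; the 1824 residual degrees of freedom plus 4 estimated coefficients give 1828 rows used, and 1828 + 2992 = 4820. The sketch below reproduces that bookkeeping on simulated data. The `sim` data frame and its columns are invented stand-ins, not GSS data.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the survey variables (not GSS data)
sim <- data.frame(
  health = sample(1:4, n, replace = TRUE),
  happy  = sample(1:3, n, replace = TRUE),
  age    = sample(18:78, n, replace = TRUE)
)
sim$hours <- 45 - 2 * sim$happy + rnorm(n, sd = 15)

# Blank out some values to mimic recoded missing responses
sim$hours[1:30]   <- NA
sim$health[21:50] <- NA

fit <- lm(hours ~ health + happy + age, data = sim)

n_complete <- sum(complete.cases(sim))  # rows with no NA in any column
n_used     <- nobs(fit)                 # rows lm() actually used
n_dropped  <- nrow(sim) - n_used        # rows deleted due to missingness

cat("rows:", nrow(sim), "used:", n_used, "dropped:", n_dropped, "\n")
```

Because rows are dropped listwise, the sample behind the regression can be much smaller than the sample behind any single variable's summary.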
I also regressed a log transformation of HRS1 (the log of log1p(HRS1), as in the second histogram above) on the same three independent variables, HEALTH, HAPPY and AGE:
lHRS1 <- log1p(GSSdata$HRS1)
rg2 <- lm(log(lHRS1) ~ HEALTH + HAPPY + AGE, data = GSSdata)
summary(rg2)
##
## Call:
## lm(formula = log(lHRS1) ~ HEALTH + HAPPY + AGE, data = GSSdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65702 -0.00234 0.04073 0.08719 0.25191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3329854 0.0214276 62.209 <2e-16 ***
## HEALTH -0.0081181 0.0061356 -1.323 0.1860
## HAPPY -0.0099180 0.0074876 -1.325 0.1855
## AGE -0.0006111 0.0003393 -1.801 0.0719 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1888 on 1824 degrees of freedom
## (2992 observations deleted due to missingness)
## Multiple R-squared: 0.004575, Adjusted R-squared: 0.002938
## F-statistic: 2.794 on 3 and 1824 DF, p-value: 0.03902
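In rg2, none of the individual coefficients is statistically significant with alpha set to .05, so for each of HEALTH, HAPPY and AGE I fail to reject the null hypothesis that its coefficient is zero, even though the overall F-test is significant (p = 0.03902). This differs from rg1, where HAPPY is significant (p = 0.00141). The rg2 coefficients are repeated here: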
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)   1.3329854  0.0214276   62.209   <2e-16 ***
HEALTH       -0.0081181  0.0061356   -1.323   0.1860
HAPPY        -0.0099180  0.0074876   -1.325   0.1855
AGE          -0.0006111  0.0003393   -1.801   0.0719 .
Once again looking to the results of rg2 (the analysis involving the natural log of HRS1), below is an interpretation of the HAPPY and AGE variable regression results:
The p-value for the HAPPY coefficient (0.1855) is greater than alpha, so I fail to reject the null hypothesis that the HAPPY coefficient in the model for the log of HRS1 is equal to zero.
Population Coefficient:
HAPPY -0.0099180 0.0074876 -1.325 0.1855
Interval Estimate:
2.5 % 97.5 %
HAPPY -0.024603155 4.767202e-03
Note that the confidence interval includes ‘0’, which supports my decision to fail to reject the null hypothesis.
The p-value for the AGE coefficient (0.0719) is greater than alpha, so I fail to reject the null hypothesis that the AGE coefficient in the model for the log of HRS1 is equal to zero.
Population Coefficient:
AGE -0.0006111 0.0003393 -1.801 0.0719
Interval Estimate:
2.5 % 97.5 %
AGE -0.001276534 5.436623e-05
Note that the confidence interval includes ‘0’, which supports my decision to fail to reject the null hypothesis.
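The link between the p-value and the confidence interval can be checked by hand. The sketch below recomputes both for AGE from the estimate, standard error and residual degrees of freedom reported in summary(rg2); the variable names are my own, and the numbers are typed in from the output above rather than read from the model object.

```r
# Values copied from the AGE row of summary(rg2)
estimate  <- -0.0006111
std_error <- 0.0003393
df_resid  <- 1824
alpha     <- 0.05

# Two-sided t-test of H0: coefficient = 0
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit   <- qt(1 - alpha / 2, df = df_resid)
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

cat("t =", round(t_value, 3), " p =", round(p_value, 4), "\n")
cat("95% CI: [", ci_lower, ",", ci_upper, "]\n")

# The two decision rules agree: p > alpha exactly when the interval contains 0
contains_zero <- ci_lower < 0 && ci_upper > 0
cat("p > alpha:", p_value > alpha, " CI contains 0:", contains_zero, "\n")
```

Both rules use the same t distribution, which is why an interval that includes zero and a p-value above alpha always point to the same decision.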
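3.) Before reporting the regression results in an APA-compliant table, I had to calculate the means, standard deviations and confidence intervals of the variables:
Means (the calls without na.rm = TRUE return NA because the variables contain missing values; the summary() output above gives means of 40.26 for HRS1, 2.086 for HEALTH, 1.848 for HAPPY and 47.53 for AGE):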
GS <- tbl_df(GSSdata)
GS %>%
summarise(mean_HRS1=mean(HRS1), mean_HEALTH=mean(HEALTH), mean_HAPPY=mean(HAPPY),
mean_AGE = mean(AGE))
## Source: local data frame [1 x 4]
##
## mean_HRS1 mean_HEALTH mean_HAPPY mean_AGE
## (dbl) (dbl) (dbl) (dbl)
## 1 NA NA NA NA
and for the log-transformed HRS1 (lHRS1, that is, log1p(HRS1)):
GS <- tbl_df(GSSdata)
GS %>%
summarise(mean_lHRS1=mean(lHRS1, na.rm=TRUE), mean_HEALTH=mean(HEALTH), mean_HAPPY=mean(HAPPY),
mean_AGE = mean(AGE), n=n())
## Source: local data frame [1 x 5]
##
## mean_lHRS1 mean_HEALTH mean_HAPPY mean_AGE n
## (dbl) (dbl) (dbl) (dbl) (int)
## 1 3.618764 NA NA NA 4820
Standard Deviations (only HRS1 is computed with na.rm = TRUE; the other columns return NaN because of missing values):
GS %>%
summarise(sd_HRS1=sd(HRS1, na.rm=TRUE), sd_HEALTH=sd(HEALTH), sd_HAPPY=sd(HAPPY),
sd_AGE = sd(AGE), n=n())
## Source: local data frame [1 x 5]
##
## sd_HRS1 sd_HEALTH sd_HAPPY sd_AGE n
## (dbl) (dbl) (dbl) (dbl) (int)
## 1 15.47928 NaN NaN NaN 4820
and for the log-transformed HRS1 (lHRS1, that is, log1p(HRS1)):
GS %>%
summarise(sd_lHRS1=sd(lHRS1, na.rm=TRUE), sd_HEALTH=sd(HEALTH), sd_HAPPY=sd(HAPPY),
sd_AGE = sd(AGE), n=n())
## Source: local data frame [1 x 5]
##
## sd_lHRS1 sd_HEALTH sd_HAPPY sd_AGE n
## (dbl) (dbl) (dbl) (dbl) (int)
## 1 0.5170823 NaN NaN NaN 4820
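The NA and NaN entries above come from calling mean() and sd() on columns that contain missing values. A minimal base R sketch of the same descriptives with na.rm = TRUE applied to every column follows; the `toy` data frame is made up for illustration and is not GSS data.

```r
# Made-up data frame with missing values (not GSS data)
toy <- data.frame(
  HRS1   = c(40, 35, NA, 50, 45),
  HEALTH = c(1, 2, 2, NA, 3),
  HAPPY  = c(2, 1, 2, 3, NA),
  AGE    = c(25, 40, 55, NA, 70)
)

# Without na.rm = TRUE a single NA makes the result NA
print(mean(toy$HRS1))

# With na.rm = TRUE, missing values are dropped column by column
descriptives <- data.frame(
  mean = sapply(toy, mean, na.rm = TRUE),
  sd   = sapply(toy, sd, na.rm = TRUE),
  n    = sapply(toy, function(x) sum(!is.na(x)))
)
print(descriptives)
```

Note that this drops missing values separately for each column, so each statistic can rest on a different n than the regression, which uses only rows complete on all model variables.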
Confidence Intervals:
confint(rg1)
## 2.5 % 97.5 %
## (Intercept) 43.57586154 50.37728218
## HEALTH -1.75936730 0.18817138
## HAPPY -3.12522384 -0.74855302
## AGE -0.09199036 0.01570709
and for the log model (rg2):
confint(rg2)
## 2.5 % 97.5 %
## (Intercept) 1.290960188 1.375011e+00
## HEALTH -0.020151735 3.915504e-03
## HAPPY -0.024603155 4.767202e-03
## AGE -0.001276534 5.436623e-05
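The pieces needed for a results table (estimate, standard error, t, p and the confidence interval) can be collected into one matrix. The sketch below does this on simulated data; the `sim` data frame and the column labels are illustrative assumptions, and the same two calls would apply to rg1 or rg2.

```r
set.seed(42)
n <- 150

# Simulated stand-ins for the survey variables (not GSS data)
sim <- data.frame(
  happy = sample(1:3, n, replace = TRUE),
  age   = sample(18:78, n, replace = TRUE)
)
sim$hours <- 45 - 2 * sim$happy + rnorm(n, sd = 10)

fit <- lm(hours ~ happy + age, data = sim)

# Coefficient matrix (estimate, SE, t, p) joined column-wise with the 95% CI
coef_table <- cbind(coef(summary(fit)), confint(fit))
colnames(coef_table) <- c("B", "SE", "t", "p", "CI_lower", "CI_upper")

print(round(coef_table, 3))
```

Keeping the estimates and intervals in one object avoids retyping numbers by hand when the model is refit.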
The Tables:
Here is the table reporting the results for rg1:
Here is the table reporting the results for rg2: