First, I downloaded the General Social Survey 2012 Cross-Section and Panel Combined data file (.SAV) and renamed it GSS.SAV. I also loaded the necessary packages (dplyr and ggvis are used below).

Next, I read the data from GSS.SAV into a data frame in R. I then converted the data frame to a tbl_df (dplyr's table data frame) and took a glimpse.

GSSdata <- data.frame(GSSdata)
GSSdata <- tbl_df(GSSdata)
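
The conversion above assumes GSSdata already holds the contents of GSS.SAV; the read step itself is not shown. A minimal sketch of one way to do it, assuming the `foreign` package (the analysis does not say which reader was used) and a hypothetical helper name `read_gss`:

```r
# Hypothetical helper: read an SPSS .SAV file into a data frame.
# Assumes the 'foreign' package; the original analysis may have used another reader.
read_gss <- function(path = "GSS.SAV") {
  if (!file.exists(path) || !requireNamespace("foreign", quietly = TRUE)) {
    return(NULL)  # file not found, or reader not installed
  }
  # use.value.labels = FALSE keeps the numeric codes (e.g. 9, 99) that the recodes rely on
  foreign::read.spss(path, to.data.frame = TRUE, use.value.labels = FALSE)
}

gss_raw <- read_gss("GSS.SAV")
if (!is.null(gss_raw)) print(dim(gss_raw))  # rows and columns, if the file was found
```

Keeping the numeric codes rather than value labels matters here because the later steps filter and recode on values such as -1, 9, and 99.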

I accessed the variable HRS1 (hours worked last week) and filtered the data to remove the -1 (inapplicable) and 99 (no answer) codes. I ran a table to confirm the values had been filtered out. Then I created a histogram of the HRS1 data.

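# Keep rows with valid hours: drops -1 (inapplicable) and 99 (no answer)
GSSdatanew <- GSSdata %>% filter(HRS1 > 0, HRS1 < 99)
tabGSSdatanew <- table(GSSdatanew$HRS1)
# Histogram of HRS1: hours on the x axis, respondent counts on the y axis
GSSdatanew %>% ggvis(~HRS1) %>% layer_histograms() %>% add_axis("x", title = "Hours") %>% add_axis("y", title = "Respondents")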
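
To make the filtering logic concrete, here is a small base-R sketch on made-up values (not GSS records). It applies the same condition as the filter() call and checks the result with a frequency table and histogram bin counts; the vector names are my own.

```r
# Made-up HRS1 values, including the -1 (inapplicable) and 99 (no answer) codes
hrs1 <- c(-1, 40, 35, 99, 60, -1, 20, 45, 99, 40)

# Same condition as the filter() call: keep values strictly between 0 and 99
hrs1_valid <- hrs1[hrs1 > 0 & hrs1 < 99]

# The frequency table should no longer contain -1 or 99
print(table(hrs1_valid))

# Histogram counts without drawing a plot: bins of hours, counts of respondents
h <- hist(hrs1_valid, breaks = seq(0, 100, by = 20), plot = FALSE)
print(h$counts)
```

Note that a histogram puts the measured variable (hours) on the x axis and the number of respondents in each bin on the y axis.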

Next, I took the natural log of the filtered HRS1 values and created a histogram.

nlHRS1 <- log(GSSdatanew$HRS1)
# Histogram of log hours: log hours on the x axis, respondent counts on the y axis
GSSdatanew %>% ggvis(~nlHRS1) %>% layer_histograms() %>% add_axis("x", title = "Log hours", title_offset = 50) %>% add_axis("y", title = "Respondents", title_offset = 50)
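
A brief sketch of what the log transform does, using illustrative hours rather than GSS records. It also shows why the HRS1 > 0 filter has to come first: the log of zero is not finite.

```r
hours <- c(5, 20, 40, 60, 89)   # illustrative weekly hours, not GSS records
log_hours <- log(hours)          # log() is the natural log by default
print(round(log_hours, 3))

# Gaps between large values shrink relative to gaps between small values
print(diff(hours))
print(round(diff(log_hours), 3))

# exp() undoes the transform
back <- exp(log_hours)

# Non-positive values cannot be logged meaningfully
print(log(0))
```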

I selected three independent variables from the GSS 2012 Cross-Section and Panel Combined data set: AGE, SEX, and BORN. BORN records the answer to the question “Were you born in this country?” (1 = Yes, 2 = No, 9 = No answer). I recoded category 9 to NA and ran a summary before and after the recode.

summary(GSSdata$BORN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.121   1.000   9.000
GSSdata$BORN<- ifelse (GSSdata$BORN == 9, NA, GSSdata$BORN)
summary (GSSdata$BORN)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.115   1.000   2.000       4

For AGE, I recoded 99 to NA and ran a summary.

GSSdata$AGE <- ifelse (GSSdata$AGE == 99, NA, GSSdata$AGE)
summary (GSSdata$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   35.00   49.00   49.59   62.00   89.00      51

For SEX, I also recoded 99 to NA and ran a summary. No respondents had a “no answer” code.

GSSdata$SEX <- ifelse (GSSdata$SEX == 99, NA, GSSdata$SEX)
summary (GSSdata$SEX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.558   2.000   2.000
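
The three recodes above follow the same pattern, so they could be expressed with one small function. This is a sketch on made-up vectors; `recode_na` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: set the listed missing-value codes to NA
recode_na <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

born <- c(1, 2, 1, 9, 1)        # made-up values; 9 = no answer
age  <- c(25, 99, 40, 63, NA)   # made-up values; 99 = no answer

born_clean <- recode_na(born, 9)
age_clean  <- recode_na(age, 99)

print(summary(born_clean))      # the maximum is now 2, with one NA
print(sum(is.na(age_clean)))    # existing NAs are left untouched
```

Using `%in%` rather than `==` means values that are already NA are simply left as NA, which is the same end result as the ifelse() calls above.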

Next, I conducted an ordinary least squares regression analysis with HRS1 as the dependent variable and AGE, SEX, and BORN as independent variables.

## 
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Coefficients:
## (Intercept)          AGE          SEX         BORN  
##    52.89746     -0.04054     -6.92199     -0.29241
## 
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.264  -5.984   1.506   6.716  62.450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.89746    1.55972  33.915   <2e-16 ***
## AGE         -0.04054    0.01974  -2.054   0.0401 *  
## SEX         -6.92199    0.56756 -12.196   <2e-16 ***
## BORN        -0.29241    0.77167  -0.379   0.7048    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.16 on 2852 degrees of freedom
## Multiple R-squared:  0.05082,    Adjusted R-squared:  0.04983 
## F-statistic:  50.9 on 3 and 2852 DF,  p-value: < 2.2e-16
##                   2.5 %       97.5 %
## (Intercept) 49.83916552 55.955745610
## AGE         -0.07924505 -0.001830301
## SEX         -8.03485939 -5.809122363
## BORN        -1.80550346  1.220684395
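
The output above has the shape produced by printing an lm() fit, then its summary(), then confint(). A minimal sketch of those calls on simulated data follows; the data frame `sim` and its values are invented for illustration, so the numbers will not match the GSS results.

```r
set.seed(1)
n <- 200

# Simulated stand-in for GSSdatanew; all values are invented
sim <- data.frame(
  AGE  = sample(18:89, n, replace = TRUE),
  SEX  = sample(1:2, n, replace = TRUE),
  BORN = sample(1:2, n, replace = TRUE, prob = c(0.87, 0.13))
)
sim$HRS1 <- 53 - 0.04 * sim$AGE - 7 * sim$SEX + rnorm(n, sd = 15)

fit <- lm(HRS1 ~ AGE + SEX + BORN, data = sim)

print(fit)                        # call and coefficients
print(summary(fit))               # standard errors, t values, p-values, R-squared
ci <- confint(fit, level = 0.95)  # 95% confidence intervals
print(ci)
```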

Hypothesis tests:

Interpretation

Point estimate

The estimate of the population coefficient for AGE is -0.04. This means that there is a negative relationship between AGE and HRS1: each additional year of age is associated with about 0.04 fewer hours worked per week, holding SEX and BORN constant.

Interval estimate

How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:

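##           2.5 %       97.5 %
## AGE -0.07924505 -0.001830301

I am 95% confident that the change in weekly hours worked per additional year of age is between -0.079 and -0.002. Because the interval excludes zero, the negative relationship is statistically significant at the 5% level, consistent with the p-value of 0.0401 in the regression output.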
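
The interval that confint() reports for AGE can be reproduced by hand from the AGE row of the regression summary: estimate plus or minus the t critical value times the standard error. A small sketch, using the rounded values printed in the summary output (so the last digits may differ slightly); the variable names are my own.

```r
est <- -0.04054   # AGE estimate from the summary output
se  <- 0.01974    # AGE standard error
df  <- 2852       # residual degrees of freedom

t_stat <- est / se                      # test statistic for H0: coefficient = 0
t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci_age <- est + c(-1, 1) * t_crit * se  # lower and upper bounds
p_val  <- 2 * pt(-abs(t_stat), df)      # two-sided p-value

print(round(t_stat, 3))
print(round(ci_age, 4))
print(round(p_val, 4))
```

The bounds agree with the AGE row of the confint() output in the regression results to about four decimal places.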

Next, I conducted an ordinary least squares regression analysis with the natural log of HRS1 as the dependent variable and AGE, SEX, and BORN as independent variables.

## 
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Coefficients:
## (Intercept)          AGE          SEX         BORN  
##    3.956806    -0.001822    -0.207103     0.020000
## 
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7059 -0.0532  0.1434  0.2563  1.1336 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.9568063  0.0559631  70.704   <2e-16 ***
## AGE         -0.0018223  0.0007083  -2.573   0.0101 *  
## SEX         -0.2071033  0.0203642 -10.170   <2e-16 ***
## BORN         0.0200002  0.0276879   0.722   0.4701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5438 on 2852 degrees of freedom
## Multiple R-squared:  0.03742,    Adjusted R-squared:  0.03641 
## F-statistic: 36.96 on 3 and 2852 DF,  p-value: < 2.2e-16
##                    2.5 %        97.5 %
## (Intercept)  3.847073941  4.0665385612
## AGE         -0.003211084 -0.0004334212
## SEX         -0.247033298 -0.1671732255
## BORN        -0.034289994  0.0742904806

Interpretation

Point estimate

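The estimate of the population coefficient for AGE is -0.002. This means that there is a negative relationship between AGE and nlHRS1: as age increases, log hours worked decreases. Because the dependent variable is logged, each additional year of age is associated with roughly 0.2% fewer hours worked, holding SEX and BORN constant.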
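
Because the dependent variable is on the natural-log scale, a coefficient b can be read as a proportional change: a one-unit increase in the predictor multiplies predicted hours by exp(b). A quick check using the estimates printed in the regression output; the variable names are my own.

```r
b_age <- -0.0018223   # AGE estimate from the log-hours model
b_sex <- -0.2071033   # SEX estimate from the log-hours model

# Percent change in predicted hours for a one-unit increase in the predictor
pct_age <- 100 * (exp(b_age) - 1)
pct_sex <- 100 * (exp(b_sex) - 1)

print(round(pct_age, 3))
print(round(pct_sex, 2))
```

For small coefficients such as AGE, 100 × b is a close approximation; for larger ones such as SEX, the exp() form is noticeably different from 100 × b.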

Interval estimate

How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:

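##            2.5 %        97.5 %
## AGE -0.003211084 -0.0004334212

I am 95% confident that the change in log hours worked per additional year of age is between -0.0032 and -0.0004. Because the interval excludes zero, the negative relationship is statistically significant at the 5% level, consistent with the p-value of 0.0101 in the regression output.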

Tables

Before making the tables, I needed the means and standard deviations of all variables in the equations.

tab <- tbl_df(GSSdatanew)
tab %>%
  summarise(mean_HRS1=mean(HRS1), mean_SEX=mean(SEX), mean_BORN=mean(BORN, na.rm=TRUE),mean_AGE = mean(AGE, na.rm=TRUE))
## Source: local data frame [1 x 4]
## 
##   mean_HRS1 mean_SEX mean_BORN mean_AGE
## 1  40.30427 1.509104  1.133403 44.79202
tab %>%
  summarise(sd_HRS1=sd(HRS1), sd_SEX=sd(SEX), sd_BORN=sd(BORN, na.rm=TRUE),
            sd_AGE = sd(AGE, na.rm=TRUE))
## Source: local data frame [1 x 4]
## 
##    sd_HRS1    sd_SEX sd_BORN   sd_AGE
## 1 15.54908 0.5000047 0.36778 14.37105
tab %>%
  summarise(mean_nlHRS1=mean(nlHRS1), mean_SEX=mean(SEX), mean_BORN=mean(BORN, na.rm=TRUE),mean_AGE = mean(AGE, na.rm=TRUE))
## Source: local data frame [1 x 4]
## 
##   mean_nlHRS1 mean_SEX mean_BORN mean_AGE
## 1    3.585312 1.509104  1.133403 44.79202
tab %>%
  summarise(sd_nlHRS1=sd(nlHRS1), sd_SEX=sd(SEX), sd_BORN=sd(BORN, na.rm=TRUE),
            sd_AGE = sd(AGE, na.rm=TRUE))
## Source: local data frame [1 x 4]
## 
##   sd_nlHRS1    sd_SEX sd_BORN   sd_AGE
## 1 0.5540086 0.5000047 0.36778 14.37105
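
The four summarise() calls above could also be collapsed into one table of means and standard deviations. A base-R sketch on a tiny made-up data frame (`vars` and its values are invented for illustration):

```r
# Tiny made-up data frame standing in for the regression variables
vars <- data.frame(
  HRS1 = c(40, 35, 60, 20, 45),
  AGE  = c(25, NA, 40, 63, 51),
  SEX  = c(1, 2, 2, 1, 2)
)
vars$nlHRS1 <- log(vars$HRS1)

# One row per variable; na.rm = TRUE skips the recoded missing values
desc <- data.frame(
  mean = sapply(vars, mean, na.rm = TRUE),
  sd   = sapply(vars, sd, na.rm = TRUE)
)
print(round(desc, 3))
```

Laying the statistics out with one row per variable is close to the shape a descriptive-statistics table usually takes in a write-up.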

Then I created the following tables:

[Two table images, not recoverable]