First, I downloaded the General Social Survey 2012 Cross-Section and Panel Combined file in SPSS (.SAV) format and renamed it GSS.SAV. I also loaded the necessary packages.
Next, I read the data from GSS.SAV into a data frame in R. I then converted the data frame to a dplyr table (tbl_df) and took a glimpse.
GSSdata <- data.frame(GSSdata)
GSSdata <- tbl_df(GSSdata)
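The step that reads GSS.SAV is not shown above. A minimal sketch of one way to do it follows, assuming the `foreign` package (shipped with standard R distributions) rather than whichever package the analysis actually used. `read_gss` is a hypothetical helper name, and `use.missings = FALSE` is an assumption chosen so that codes such as -1 and 99 stay visible for the recoding steps later on.

```r
# Hypothetical helper: read an SPSS .SAV file into a plain data frame.
# Assumes the 'foreign' package is installed.
read_gss <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  foreign::read.spss(
    path,
    to.data.frame    = TRUE,   # return a data frame instead of a list
    use.value.labels = FALSE,  # keep numeric codes (1, 2, 9, ...) rather than factor labels
    use.missings     = FALSE   # keep user-defined missing codes so they can be recoded by hand
  )
}

# Usage, assuming GSS.SAV sits in the working directory:
# GSSdata <- read_gss("GSS.SAV")
```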
I accessed the variable HRS1 and filtered the data to remove the -1’s (inapplicable) and 99’s (no answer). I ran a table to confirm the values had been filtered out. Then I created a histogram of the HRS1 data.
GSSdatanew <- GSSdata %>% filter(HRS1 > 0, HRS1 < 99)
tabGSSdatanew <- table(GSSdatanew$HRS1)
GSSdatanew %>% ggvis(~HRS1) %>% layer_histograms() %>% add_axis("x", title = "Hours") %>% add_axis("y", title = "Respondents")
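To see what the filter does, here is a small sketch on made-up values (not GSS data). It is written in base R so that it runs without extra packages; the data frame `toy` is hypothetical.

```r
# Toy stand-in for the HRS1 column: -1 = inapplicable, 99 = no answer
toy <- data.frame(HRS1 = c(-1, 40, 99, 20, -1, 45, 40))

# Keep only rows with a valid number of hours (same condition as the filter above)
toy_new <- toy[toy$HRS1 > 0 & toy$HRS1 < 99, , drop = FALSE]

# The frequency table should no longer contain -1 or 99
table(toy_new$HRS1)
```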
Next, I took the natural log of the filtered HRS1 values and created a histogram.
nlHRS1 <- log(GSSdatanew$HRS1)
GSSdatanew %>% ggvis(~nlHRS1) %>% layer_histograms() %>% add_axis("x", title = "Natural log of hours", title_offset = 50) %>% add_axis("y", title = "Respondents", title_offset = 50)
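Filtering before taking logs matters, because the logarithm is only finite for positive values. A short sketch with made-up hours (not GSS data) illustrates this:

```r
raw_hours <- c(-1, 0, 20, 40, 60)

# suppressWarnings() hides the "NaNs produced" warning triggered by log(-1)
log_raw <- suppressWarnings(log(raw_hours))
is.finite(log_raw)   # FALSE for -1 (NaN) and 0 (-Inf), TRUE otherwise

# After keeping only positive values, every log is finite
valid_hours <- raw_hours[raw_hours > 0]
log_valid <- log(valid_hours)

# exp() undoes the transformation, recovering the original hours
all.equal(exp(log_valid), valid_hours)
```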
I selected three independent variables from the GSS 2012 Cross-Section and Panel Combined data set: AGE, SEX, and BORN. BORN records whether the respondent was born in this country (“Were you born in this country?”), coded 1 = yes, 2 = no, and 9 = no answer. I ran a summary, recoded category 9 to NA, and ran the summary again.
summary(GSSdata$BORN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.121 1.000 9.000
GSSdata$BORN <- ifelse(GSSdata$BORN == 9, NA, GSSdata$BORN)
summary(GSSdata$BORN)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 1.115 1.000 2.000 4
For AGE, I recoded 99 to NA and ran a summary.
GSSdata$AGE <- ifelse(GSSdata$AGE == 99, NA, GSSdata$AGE)
summary(GSSdata$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.00 35.00 49.00 49.59 62.00 89.00 51
For SEX, I also recoded 99 to NA and ran a summary. There were no “no answer” responses.
GSSdata$SEX <- ifelse(GSSdata$SEX == 99, NA, GSSdata$SEX)
summary(GSSdata$SEX)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.558 2.000 2.000
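The three recodes above follow the same pattern, so they could be wrapped in a small helper. The sketch below is one possible way to do that, not the method used in this analysis: `recode_na` is a hypothetical function name, and the values are made up rather than taken from the GSS.

```r
# Hypothetical helper: replace the given codes with NA, leaving other values untouched
recode_na <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

# Made-up BORN values: 1 = yes, 2 = no, 9 = no answer
born <- c(1, 1, 2, 9, 1, 2, 9)
born_clean <- recode_na(born, 9)

summary(born_clean)       # the summary now reports the number of NA's
sum(is.na(born_clean))    # number of values that were recoded
```

Because `codes` can be a vector, several missing-value codes can be handled in one call, for example `recode_na(x, c(98, 99))`.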
Next, I conducted an ordinary least squares regression analysis with HRS1 as the dependent variable and AGE, SEX, and BORN as independent variables.
##
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
##
## Coefficients:
## (Intercept) AGE SEX BORN
## 52.89746 -0.04054 -6.92199 -0.29241
##
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.264 -5.984 1.506 6.716 62.450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.89746 1.55972 33.915 <2e-16 ***
## AGE -0.04054 0.01974 -2.054 0.0401 *
## SEX -6.92199 0.56756 -12.196 <2e-16 ***
## BORN -0.29241 0.77167 -0.379 0.7048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.16 on 2852 degrees of freedom
## Multiple R-squared: 0.05082, Adjusted R-squared: 0.04983
## F-statistic: 50.9 on 3 and 2852 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 49.83916552 55.955745610
## AGE -0.07924505 -0.001830301
## SEX -8.03485939 -5.809122363
## BORN -1.80550346 1.220684395
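The calls that produced this output are not shown, but the `Call:` lines indicate a model of the form `lm(HRS1 ~ AGE + SEX + BORN, ...)`. The sketch below fits the same kind of model on simulated stand-in data (not GSS data, so the numbers will differ) and shows how a 95% confidence interval follows from an estimate, its standard error, and the residual degrees of freedom. The data frame `sim` and its generating values are assumptions made purely for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in data (not GSS data); column names mirror the model above
n <- 200
sim <- data.frame(
  AGE  = sample(18:89, n, replace = TRUE),
  SEX  = sample(1:2, n, replace = TRUE),
  BORN = sample(1:2, n, replace = TRUE)
)
sim$HRS1 <- 50 - 0.05 * sim$AGE - 7 * sim$SEX + rnorm(n, sd = 15)

fit <- lm(HRS1 ~ AGE + SEX + BORN, data = sim)
fit_summary <- summary(fit)   # coefficient table, R-squared, F-statistic
ci <- confint(fit)            # 95% confidence intervals by default

# Reproduce the AGE interval by hand: estimate +/- t quantile * standard error
est <- coef(fit_summary)["AGE", "Estimate"]
se  <- coef(fit_summary)["AGE", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
manual_ci <- c(est - t_crit * se, est + t_crit * se)

all.equal(unname(ci["AGE", ]), manual_ci)
```

Plugging the AGE row of the coefficient table above (estimate -0.04054, standard error 0.01974, 2852 degrees of freedom) into the same formula gives roughly -0.0792 to -0.0018, which agrees with the AGE row of the interval table.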
The estimate of the population coefficient for AGE is -0.04. This means that there is a negative relationship between AGE and HRS1: holding SEX and BORN constant, each additional year of age is associated with about 0.04 fewer hours worked.
How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:
## 2.5 % 97.5 %
## AGE -0.08253953 0.001146172
I am 95% confident that the population coefficient for AGE (the change in hours worked per additional year of age) lies between -0.083 and 0.001.
Next, I conducted an ordinary least squares regression analysis with the natural log of HRS1 (nlHRS1) as the dependent variable and AGE, SEX, and BORN as independent variables.
##
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
##
## Coefficients:
## (Intercept) AGE SEX BORN
## 3.956806 -0.001822 -0.207103 0.020000
##
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7059 -0.0532 0.1434 0.2563 1.1336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9568063 0.0559631 70.704 <2e-16 ***
## AGE -0.0018223 0.0007083 -2.573 0.0101 *
## SEX -0.2071033 0.0203642 -10.170 <2e-16 ***
## BORN 0.0200002 0.0276879 0.722 0.4701
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5438 on 2852 degrees of freedom
## Multiple R-squared: 0.03742, Adjusted R-squared: 0.03641
## F-statistic: 36.96 on 3 and 2852 DF, p-value: < 2.2e-16
## 2.5 % 97.5 %
## (Intercept) 3.847073941 4.0665385612
## AGE -0.003211084 -0.0004334212
## SEX -0.247033298 -0.1671732255
## BORN -0.034289994 0.0742904806
The estimate of the population coefficient for AGE is -0.002. This means that there is a negative relationship between AGE and nlHRS1: as age increases, the natural log of hours worked, and therefore hours worked, decreases.
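Because the dependent variable is on the log scale, a coefficient b corresponds to a multiplicative change of exp(b) in hours worked. A small worked calculation with the AGE estimate from the coefficient table above, holding SEX and BORN fixed:

```r
b_age <- -0.0018223   # AGE estimate from the log-hours model above

# Percentage change in hours worked per additional year of age
pct_per_year <- 100 * (exp(b_age) - 1)

# Percentage change in hours worked over ten additional years
pct_per_decade <- 100 * (exp(10 * b_age) - 1)

round(c(per_year = pct_per_year, per_decade = pct_per_decade), 2)
```

Under this model, one more year of age is associated with roughly 0.18% fewer hours worked, and ten more years with about 1.8% fewer.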
How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:
## 2.5 % 97.5 %
## AGE -0.0033577875 0.0005721174
I am 95% confident that the population coefficient for AGE in the log model lies between -0.0034 and 0.0006.
Before making the tables, I needed the means and standard deviations of all variables in the equations.
tab <- tbl_df(GSSdatanew)
tab %>%
  summarise(mean_HRS1 = mean(HRS1), mean_SEX = mean(SEX), mean_BORN = mean(BORN, na.rm = TRUE), mean_AGE = mean(AGE, na.rm = TRUE))
## Source: local data frame [1 x 4]
##
## mean_HRS1 mean_SEX mean_BORN mean_AGE
## 1 40.30427 1.509104 1.133403 44.79202
tab %>%
  summarise(sd_HRS1 = sd(HRS1), sd_SEX = sd(SEX), sd_BORN = sd(BORN, na.rm = TRUE),
            sd_AGE = sd(AGE, na.rm = TRUE))
## Source: local data frame [1 x 4]
##
## sd_HRS1 sd_SEX sd_BORN sd_AGE
## 1 15.54908 0.5000047 0.36778 14.37105
tab %>%
  summarise(mean_nlHRS1 = mean(nlHRS1), mean_SEX = mean(SEX), mean_BORN = mean(BORN, na.rm = TRUE), mean_AGE = mean(AGE, na.rm = TRUE))
## Source: local data frame [1 x 4]
##
## mean_nlHRS1 mean_SEX mean_BORN mean_AGE
## 1 3.585312 1.509104 1.133403 44.79202
tab %>%
  summarise(sd_nlHRS1 = sd(nlHRS1), sd_SEX = sd(SEX), sd_BORN = sd(BORN, na.rm = TRUE),
            sd_AGE = sd(AGE, na.rm = TRUE))
## Source: local data frame [1 x 4]
##
## sd_nlHRS1 sd_SEX sd_BORN sd_AGE
## 1 0.5540086 0.5000047 0.36778 14.37105
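The four `summarise()` calls above could also be collapsed into a single table of descriptive statistics. The sketch below is one way to do that in base R, shown on a made-up data frame (`toy`, not GSS data) with one missing AGE value to illustrate `na.rm = TRUE`.

```r
# Made-up stand-in for the analysis data (not GSS data), with one missing AGE
toy <- data.frame(
  HRS1 = c(40, 20, 45, 60, 38),
  AGE  = c(25, 41, NA, 58, 33),
  SEX  = c(1, 2, 2, 1, 2)
)

# One row per variable, one column per statistic; na.rm = TRUE skips missing values
desc <- data.frame(
  mean = sapply(toy, mean, na.rm = TRUE),
  sd   = sapply(toy, sd, na.rm = TRUE)
)
desc
```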
Then I created the following tables: