Assignment 3

First, I downloaded the General Social Survey 2012 Cross-Section and Panel Combined data file (.SAV) and renamed it GSS.SAV. I also loaded the necessary packages.

Next, I read the data from GSS.SAV into a data frame in R. Then I converted the data frame to a dplyr table data frame (tbl_df) and took a glimpse.

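# Coerce the imported object to a plain data frame, then wrap it as a dplyr tbl_df
GSSdata <- data.frame(GSSdata)
GSSdata <- tbl_df(GSSdata)
# Take a quick look at the variables and their types
glimpse(GSSdata)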

I accessed the variable HRS1 and filtered the data to remove the -1’s (inapplicable) and 99’s (no answer). I ran a table to confirm the values had been filtered out. Then I created a histogram of the HRS1 data.

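# Keep only valid hours: drops -1 (inapplicable) and 99 (no answer)
GSSdatanew <- GSSdata %>% filter(HRS1 > 0, HRS1 < 99)
# Tabulate the remaining values to confirm the codes are gone
tabGSSdatanew <- table(GSSdatanew$HRS1)
# Histogram: hours on the x-axis, respondent counts on the y-axis
GSSdatanew %>% ggvis(~HRS1) %>% layer_histograms() %>% add_axis("x", title = "Hours") %>% add_axis("y", title = "Respondents")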
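The filtering step can be illustrated on a small made-up vector. This is only a base-R sketch with hypothetical values, not the GSS data. It assumes the same codes as above (-1 for inapplicable, 99 for no answer) and adds one NA to show that a missing value has to be dropped explicitly in base R, whereas dplyr's filter() drops rows where the condition is NA.

```r
# Hypothetical HRS1 values; -1 = inapplicable, 99 = no answer, NA = missing
hrs1 <- c(40, -1, 35, 99, 60, NA, 40, -1, 45, 99)

# which() returns only the positions where the condition is TRUE,
# so NA comparisons are dropped rather than carried along
keep <- which(hrs1 > 0 & hrs1 < 99)
hrs1_clean <- hrs1[keep]

# Tabulate the remaining values to confirm -1 and 99 are gone
print(table(hrs1_clean))
```

The table should list only valid hour values, which is the same check the report performs on GSSdatanew$HRS1.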

Next I took the natural log of the newly filtered data and created a histogram.

nlHRS1 <- log(GSSdatanew$HRS1)
GSSdatanew %>% ggvis(~nlHRS1) %>% layer_histograms() %>% add_axis("x", title = "Natural log of hours", title_offset = 50) %>% add_axis("y", title = "Respondents", title_offset = 50)
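A short base-R sketch, using made-up hours rather than the GSS data, shows why the filtering had to come before the log and what the transformation does. The log of zero or of a negative code is not a usable number, and on the log scale equal ratios of hours become equal differences.

```r
# Hypothetical hours worked (all positive, as after filtering)
hours <- c(5, 20, 40, 40, 45, 60, 80)
log_hours <- log(hours)

# log() of a non-positive value is unusable, hence the earlier filter
log_zero <- log(0)                      # -Inf
log_neg  <- suppressWarnings(log(-1))   # NaN

# Doubling hours always adds log(2) on the log scale
diff_high <- log(80) - log(40)
diff_low  <- log(40) - log(20)

# exp() reverses the transformation
back <- exp(log_hours)
print(round(log_hours, 3))
```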

I selected three independent variables from the GSS 2012 Cross-Section and Panel Combined data set: AGE, SEX, and BORN. BORN records whether the respondent was born in this country ("Were you born in this country?"), coded 1 = Yes, 2 = No, and 9 = No answer. I recoded category 9 to NA and ran a summary before and after the recode.

summary(GSSdata$BORN)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.121   1.000   9.000

GSSdata$BORN<- ifelse (GSSdata$BORN == 9, NA, GSSdata$BORN)
summary (GSSdata$BORN)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.115   1.000   2.000       4

For AGE, I recoded 99 to NA and ran a summary.

GSSdata$AGE <- ifelse (GSSdata$AGE == 99, NA, GSSdata$AGE)
summary (GSSdata$AGE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   35.00   49.00   49.59   62.00   89.00      51

For SEX, I also recoded 99 to NA and ran a summary. There were no "no answer" responses, so the summary shows no NA's.

GSSdata$SEX <- ifelse (GSSdata$SEX == 99, NA, GSSdata$SEX)
summary (GSSdata$SEX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.558   2.000   2.000
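The three recodes above follow the same pattern, so they could be wrapped in a small helper. The sketch below is only an illustration: recode_missing is a hypothetical function name, and the vectors are made-up values, not the GSS columns.

```r
# Hypothetical helper: replace the given missing-value codes with NA
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA   # %in% is FALSE for existing NAs, so they stay NA
  x
}

# Made-up values using the same codes as the report
born <- c(1, 2, 1, 9, 1)        # 9 = no answer
age  <- c(25, 99, 40, 61, 89)   # 99 = no answer

born_clean <- recode_missing(born, 9)
age_clean  <- recode_missing(age, 99)

# summary() reports the NA count once values have been recoded
print(summary(born_clean))
print(summary(age_clean))
```

After the recode, the maximum of the toy BORN vector drops from 9 to 2, the same pattern as in the two BORN summaries in the report.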

Next, I conducted an ordinary least squares regression analysis with HRS1 as the dependent variable and AGE, SEX, and BORN as independent variables.

## 
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Coefficients:
## (Intercept)          AGE          SEX         BORN  
##    52.89746     -0.04054     -6.92199     -0.29241

## 
## Call:
## lm(formula = HRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.264  -5.984   1.506   6.716  62.450 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.89746    1.55972  33.915   <2e-16 ***
## AGE         -0.04054    0.01974  -2.054   0.0401 *  
## SEX         -6.92199    0.56756 -12.196   <2e-16 ***
## BORN        -0.29241    0.77167  -0.379   0.7048    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.16 on 2852 degrees of freedom
## Multiple R-squared:  0.05082,    Adjusted R-squared:  0.04983 
## F-statistic:  50.9 on 3 and 2852 DF,  p-value: < 2.2e-16

##                   2.5 %       97.5 %
## (Intercept) 49.83916552 55.955745610
## AGE         -0.07924505 -0.001830301
## SEX         -8.03485939 -5.809122363
## BORN        -1.80550346  1.220684395
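The code that produced this output is not shown in the report. Judging from the Call lines, it was presumably a call to lm() followed by summary() and confint(). The sketch below runs that sequence on simulated data so it is self-contained. The data frame toy and the simulation parameters are invented for illustration, so its numbers will not match the GSS results.

```r
set.seed(1)
n <- 200

# Simulated stand-in for GSSdatanew (hypothetical values)
toy <- data.frame(
  AGE  = sample(18:89, n, replace = TRUE),
  SEX  = sample(1:2, n, replace = TRUE),
  BORN = sample(1:2, n, replace = TRUE, prob = c(0.85, 0.15))
)
toy$HRS1 <- 50 - 0.05 * toy$AGE - 7 * toy$SEX + rnorm(n, sd = 15)

# Same formula as in the Call lines of the output
fit <- lm(HRS1 ~ AGE + SEX + BORN, data = toy)

print(fit)                          # coefficients only
print(summary(fit))                 # standard errors, t tests, R-squared, F test
print(confint(fit, level = 0.95))   # 95% confidence intervals
```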

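Hypothesis tests:

Is the entire set of independent variables related to the dependent variable? The F-statistic is 50.9 on 3 and 2852 degrees of freedom (p-value < 2.2e-16), so I reject the null hypothesis that all three coefficients are zero.

Is each independent variable related to the dependent variable? AGE (p = 0.0401) and SEX (p < 2e-16) are statistically significant at the 0.05 level. BORN (p = 0.7048) is not.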
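The test statistics in the summary output for the HRS1 model can be checked by hand. The sketch below uses only numbers printed in that output (R-squared, degrees of freedom, and the AGE estimate and standard error). Small differences from the printed values are due to rounding of the inputs.

```r
# Values copied from the summary output of the HRS1 model
r2       <- 0.05082   # Multiple R-squared
k        <- 3         # number of independent variables
df_resid <- 2852      # residual degrees of freedom

# Overall F test: are all slope coefficients zero?
f_stat  <- (r2 / k) / ((1 - r2) / df_resid)
f_p     <- pf(f_stat, k, df_resid, lower.tail = FALSE)

# t test for a single coefficient (AGE)
t_age   <- -0.04054 / 0.01974
p_age   <- 2 * pt(-abs(t_age), df_resid)

print(round(c(F = f_stat, t_AGE = t_age, p_AGE = p_age), 4))
```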

Interpretation

Point estimate

The estimated coefficient for AGE is -0.04. This means that there is a negative relationship between AGE and HRS1: holding SEX and BORN constant, each additional year of age is associated with about 0.04 fewer hours worked.
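To make the size of the AGE coefficient concrete, the fitted equation can be evaluated for two respondents who differ only in age. The coefficients below are copied from the regression output. predict_hours is a hypothetical helper written for this illustration, and the chosen profile (SEX = 1, BORN = 1) is arbitrary.

```r
# Coefficients from the HRS1 regression output
b <- c(intercept = 52.89746, AGE = -0.04054, SEX = -6.92199, BORN = -0.29241)

# Hypothetical helper: predicted hours from the fitted equation
predict_hours <- function(age, sex, born) {
  unname(b["intercept"] + b["AGE"] * age + b["SEX"] * sex + b["BORN"] * born)
}

p40 <- predict_hours(age = 40, sex = 1, born = 1)
p50 <- predict_hours(age = 50, sex = 1, born = 1)

# Ten extra years of age changes the prediction by 10 * (AGE coefficient)
print(c(age_40 = p40, age_50 = p50, difference = p50 - p40))
```

A ten-year age difference corresponds to a predicted difference of about 0.4 hours, which is small next to the residual standard error of 15.16.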

Interval estimate

How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:

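##                   2.5 %       97.5 %
## AGE         -0.07924505 -0.001830301

I am 95% confident that the population coefficient for AGE lies between -0.079 and -0.002. In other words, each additional year of age is associated with a reduction of between 0.002 and 0.079 hours worked.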
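The 95% interval for AGE can be reconstructed from the coefficient table of the HRS1 model as the estimate plus or minus the critical t value times the standard error. The sketch below uses the printed estimate (-0.04054), standard error (0.01974), and residual degrees of freedom (2852). It should agree with the AGE row of the full confidence-interval table up to rounding of those inputs.

```r
# AGE row of the coefficient table for the HRS1 model
est      <- -0.04054
se       <- 0.01974
df_resid <- 2852

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df_resid)
ci     <- est + c(-1, 1) * t_crit * se

print(round(c(lower = ci[1], upper = ci[2]), 5))
```

Both endpoints are negative, which is consistent with the p-value of 0.0401 for AGE being below 0.05.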

Next, I conducted an ordinary least squares regression analysis with the natural log of HRS1 (nlHRS1) as the dependent variable and AGE, SEX, and BORN as independent variables.

## 
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Coefficients:
## (Intercept)          AGE          SEX         BORN  
##    3.956806    -0.001822    -0.207103     0.020000

## 
## Call:
## lm(formula = nlHRS1 ~ AGE + SEX + BORN, data = GSSdatanew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7059 -0.0532  0.1434  0.2563  1.1336 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.9568063  0.0559631  70.704   <2e-16 ***
## AGE         -0.0018223  0.0007083  -2.573   0.0101 *  
## SEX         -0.2071033  0.0203642 -10.170   <2e-16 ***
## BORN         0.0200002  0.0276879   0.722   0.4701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5438 on 2852 degrees of freedom
## Multiple R-squared:  0.03742,    Adjusted R-squared:  0.03641 
## F-statistic: 36.96 on 3 and 2852 DF,  p-value: < 2.2e-16

##                    2.5 %        97.5 %
## (Intercept)  3.847073941  4.0665385612
## AGE         -0.003211084 -0.0004334212
## SEX         -0.247033298 -0.1671732255
## BORN        -0.034289994  0.0742904806

Interpretation

Point estimate

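The estimated coefficient for AGE is -0.002. This means that there is a negative relationship between AGE and nlHRS1: as age increases, hours worked decreases. Because the dependent variable is logged, each additional year of age is associated with a decrease of roughly 0.2% in hours worked, holding SEX and BORN constant.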

Interval estimate

How accurate is the estimate of the relationship between age and hours worked? I examined the 95% confidence interval for AGE:

##                    2.5 %        97.5 %
## AGE         -0.003211084 -0.0004334212

I am 95% confident that the population coefficient for AGE lies between -0.0032 and -0.0004 on the log scale.
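Coefficients from a model with a logged dependent variable are easier to read as percent changes, using 100 * (exp(b) - 1). The sketch below applies that conversion to the AGE estimate and to the AGE row of the confidence-interval table printed with the nlHRS1 model output. No other inputs are assumed.

```r
# AGE estimate and 95% interval from the nlHRS1 model output
b_age  <- -0.0018223
ci_age <- c(-0.003211084, -0.0004334212)

# Percent change in hours per additional year of age
pct    <- 100 * (exp(b_age) - 1)
pct_ci <- 100 * (exp(ci_age) - 1)

print(round(c(estimate = pct, lower = pct_ci[1], upper = pct_ci[2]), 3))
```

For coefficients this close to zero, the percent change is almost identical to 100 times the coefficient itself.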

Tables

Before making the tables, I needed the means and standard deviations of all variables in the equations.

tab <- tbl_df(GSSdatanew)
tab %>%
  summarise(mean_HRS1=mean(HRS1), mean_SEX=mean(SEX), mean_BORN=mean(BORN, na.rm=TRUE),mean_AGE = mean(AGE, na.rm=TRUE))

## Source: local data frame [1 x 4]
## 
##   mean_HRS1 mean_SEX mean_BORN mean_AGE
## 1  40.30427 1.509104  1.133403 44.79202

tab %>%
  summarise(sd_HRS1=sd(HRS1), sd_SEX=sd(SEX), sd_BORN=sd(BORN, na.rm=TRUE),
            sd_AGE = sd(AGE, na.rm=TRUE))

## Source: local data frame [1 x 4]
## 
##    sd_HRS1    sd_SEX sd_BORN   sd_AGE
## 1 15.54908 0.5000047 0.36778 14.37105

tab %>%
  summarise(mean_nlHRS1=mean(nlHRS1), mean_SEX=mean(SEX), mean_BORN=mean(BORN, na.rm=TRUE),mean_AGE = mean(AGE, na.rm=TRUE))

## Source: local data frame [1 x 4]
## 
##   mean_nlHRS1 mean_SEX mean_BORN mean_AGE
## 1    3.585312 1.509104  1.133403 44.79202

tab %>%
  summarise(sd_nlHRS1=sd(nlHRS1), sd_SEX=sd(SEX), sd_BORN=sd(BORN, na.rm=TRUE),
            sd_AGE = sd(AGE, na.rm=TRUE))

## Source: local data frame [1 x 4]
## 
##   sd_nlHRS1    sd_SEX sd_BORN   sd_AGE
## 1 0.5540086 0.5000047 0.36778 14.37105
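The four summarise() calls above could also be collapsed into a single descriptive table. The following is a base-R sketch on made-up data. The data frame toy and the name desc are illustrative and not part of the report, and na.rm = TRUE is applied to every column for simplicity.

```r
# Made-up stand-in for the analysis data, with a few missing values
toy <- data.frame(
  HRS1 = c(40, 35, 60, 20, 45),
  AGE  = c(25, NA, 40, 61, 33),
  SEX  = c(1, 2, 2, 1, 2),
  BORN = c(1, 1, 2, NA, 1)
)
toy$nlHRS1 <- log(toy$HRS1)

# One row per variable, one column per statistic
desc <- data.frame(
  mean = sapply(toy, mean, na.rm = TRUE),
  sd   = sapply(toy, sd, na.rm = TRUE)
)

print(round(desc, 3))
```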

Then I created the following tables:

[Image: tables]


Tina Thomas

November 23, 2015

