For this assignment, I will conduct regression analyses using several variables from the General Social Survey 2012 Cross-Section and Panel Combined. Specifically, I will be exploring whether there is a relationship between the number of hours a respondent worked in the last week and sex, race, and number of wage earners in the respondent’s household.
First, I load the packages that I will need. This includes “foreign,” which allows me to read non-R files like the SPSS one used for this assignment.
require(foreign)
## Loading required package: foreign
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr
Next, I read the SPSS .SAV file in to R and then read the newly-read data from GSS.SAV into a data frame in R
GSSdata <- read.spss("GSS.SAV",
max.value.labels=TRUE, to.data.frame=FALSE,
trim.factor.names=FALSE,
reencode=NA, use.missings=to.data.frame)
GSSdata <- data.frame(GSSdata)
I then convert the data frame to a table frame by using the dplyr function, tbl_df
GSSdata <- tbl_df(GSSdata)
To be sure I have the data read correctly, I will run some descriptive statistics (mean, standard deviation, and count of cases) for the dependent variable I will use, HRS1:
GSSdata %>%
summarise(mean_hours=mean(HRS1), sd_hours=sd(HRS1), n=n())
## Source: local data frame [1 x 3]
##
## mean_hours sd_hours n
## (dbl) (dbl) (int)
## 1 24.07573 24.19891 4820
summary(GSSdata$HRS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.00 -1.00 25.00 24.08 40.00 99.00
GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == 99, NA, GSSdata$HRS1)
GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == 98, NA, GSSdata$HRS1)
GSSdata$HRS1 <- ifelse(GSSdata$HRS1 == -1, NA, GSSdata$HRS1)
GSSdata <- tbl_df(GSSdata)
summary(GSSdata$HRS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 35.00 40.00 40.26 50.00 89.00 1966
GSSdata %>%
summarise(mean_hours=mean(HRS1, na.rm=TRUE), sd_hours=sd(HRS1, na.rm=TRUE), n=n())
## Source: local data frame [1 x 3]
##
## mean_hours sd_hours n
## (dbl) (dbl) (int)
## 1 40.26384 15.47928 4820
Now I create my histogram of HRS1
GSSdata %>% ggvis(~GSSdata$HRS1) %>% layer_histograms() %>%
add_axis("x", title = "Hours Worked per Week", title_offset="50") %>%
add_axis("y", title = "Number of Respondents", title_offset="50") %>%
add_axis("x", orient = "top", ticks = 0, title = "Histogram of Hours Worked per Week",
properties = axis_props(
axis = list(stroke = "white"),
labels = list(fontSize = 0)))
## Guessing width = 5 # range / 18
Now I am ready to create the histogram of logarithm of HRS1
GSSdata %>% ggvis(~log(GSSdata$HRS1)) %>% layer_histograms() %>%
add_axis("x", title = "Logarithm of Hours Worked per Week", title_offset="50") %>%
add_axis("y", title = "Number of Respondents", title_offset="50") %>%
add_axis("x", orient = "top", ticks = 0, title = "Histogram of the Logarithm of Hours Worked per Week",
properties = axis_props(
axis = list(stroke = "white"),
labels = list(fontSize = 0)))
## Guessing width = 0.2 # range / 23
First, I will run frequency distributions to get a better look at the data
hoursworked <- table(GSSdata$HRS1)
hoursworked
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 9 2 1 10 10 6 5 19 95 20 2 20 5 5 32 24 3 8
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 3 104 7 9 2 30 50 9 9 13 2 105 6 69 8 10 84 50
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 36 43 9 831 15 39 17 35 149 29 11 66 3 219 3 25 6 1
## 55 56 57 58 59 60 62 63 64 65 66 67 68 70 72 74 75 77
## 84 18 9 11 2 166 8 3 4 30 1 3 9 51 9 1 6 2
## 80 82 85 86 89
## 27 1 3 2 21
earners <- table(GSSdata$EARNRS)
earners
##
## 0 1 2 3 4 5 6 7 8 9
## 913 1807 1555 324 147 33 13 5 4 19
sex <- table(GSSdata$SEX)
sex
##
## 1 2
## 2132 2688
race <- table(GSSdata$RACE)
race
##
## 1 2 3
## 3700 722 398
All of my variables first need to be transformed.
I start by creating a new variable for RACE that consists of only “White” (1) or “Non-White” (0) which will combine Black (2) and Other (3) into a single category. (This will also enable me to assign meaningful numerical values to this categorical variable.)
GSSdata$RACEnew <- 0
GSSdata$RACEnew <- ifelse(GSSdata$RACE == 1, 1, GSSdata$RACEnew)
racetest <- table(GSSdata$RACEnew)
racetest
##
## 0 1
## 1120 3700
For EARNRS, I want to create two new variables from the EARNRS data so that in my least squares regression, I can see 1) if there is any relationship between homes that have 1 or more wage earners and hours worked in the past week and 2) if there is any relationship between homes that have 2 or more wage earners and hours worked in the past week.
The data for EARNRS also reveals that the 9s (no answer) were not changed into NAs. I will have to do that myself for both of the new variables so R will recognize those as “NA”
earners <- table(GSSdata$EARNRS)
earners
##
## 0 1 2 3 4 5 6 7 8 9
## 913 1807 1555 324 147 33 13 5 4 19
GSSdata$EARNRS1 <- 0
GSSdata$EARNRS1 <- ifelse(GSSdata$EARNRS >= 1 & GSSdata$EARNRS < 9, 1, GSSdata$EARNRS1)
GSSdata$EARNRS1 <- ifelse(GSSdata$EARNRS == 9, NA, GSSdata$EARNRS1)
GSSdata$EARNRS2 <- 0
GSSdata$EARNRS2 <- ifelse(GSSdata$EARNRS >= 2 & GSSdata$EARNRS < 9, 1, GSSdata$EARNRS2)
GSSdata$EARNRS2 <- ifelse(GSSdata$EARNRS == 9, NA, GSSdata$EARNRS2)
earntest1 <- table(GSSdata$EARNRS1)
earntest1
##
## 0 1
## 913 3888
earntest2 <- table(GSSdata$EARNRS2)
earntest2
##
## 0 1
## 2720 2081
SEX is a binary categorical variable, coded as either 1 (Male) or 2 (Female). To assign meaningful numerical values to these level of categorical variables, I need to create a new dummy variable for SEX that will recode Male (currently 1) to 0 and Female (currently 2) to 1.
GSSdata$SEXnew <- 0
GSSdata$SEXnew <- ifelse(GSSdata$SEX == 2, 1, GSSdata$SEXnew)
sex <- table(GSSdata$SEX)
sex
##
## 1 2
## 2132 2688
sextest <- table(GSSdata$SEXnew)
sextest
##
## 0 1
## 2132 2688
leastsquares1 <- lm(HRS1 ~ SEXnew + RACEnew + EARNRS1 + EARNRS2,data=GSSdata)
summary(leastsquares1)
##
## Call:
## lm(formula = HRS1 ~ SEXnew + RACEnew + EARNRS1 + EARNRS2, data = GSSdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.371 -6.242 1.256 6.521 52.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.3671 2.3695 15.348 < 2e-16 ***
## SEXnew -6.8924 0.5653 -12.192 < 2e-16 ***
## RACEnew 0.6269 0.6740 0.930 0.35239
## EARNRS1 7.3769 2.3407 3.152 0.00164 **
## EARNRS2 -0.6099 0.5774 -1.056 0.29099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.08 on 2844 degrees of freedom
## (1971 observations deleted due to missingness)
## Multiple R-squared: 0.05327, Adjusted R-squared: 0.05194
## F-statistic: 40.01 on 4 and 2844 DF, p-value: < 2.2e-16
confint(leastsquares1)
## 2.5 % 97.5 %
## (Intercept) 31.7210911 41.0131768
## SEXnew -8.0008978 -5.7839403
## RACEnew -0.6947109 1.9485622
## EARNRS1 2.7871952 11.9665397
## EARNRS2 -1.7421130 0.5223776
Is the entire set of independent variables related to the dependent variable?
I read from the findings that:
## Multiple R-squared: 0.05327, Adjusted R-squared: 0.05194
## F-statistic: 40.01 on 4 and 2844 DF, p-value: < 2.2e-16
If α = .05, then the p-value, < 2.2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
Since I reject the null hypothesis, I will now explore whether each of the independent variables is related to the dependent variable.
## Estimate Std. Error t value Pr(>|t|)
## SEXnew -6.8924 0.5653 -12.192 < 2e-16
Again, if α = .05, then the p-value, < 2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and SEXnew.
The SEXnew variable was coded as either 0 (Male) or 1 (Female). The estimate of the population coefficient is -6.89 (rounded). This means that there is a negative relationship between HRS1 and SEXnew. On average, women work 6.89% hours per week less than men.
How accurate is the estimate of the relationship between hours worked per week and sex? Here is the 95% confidence interval for SEXnew:
## 2.5 % 97.5 %
## SEXnew -8.0008978 -5.7839403
Our best estimate is that, on average, women work 7.49% hours per week less than men. However, we are 95% confident that women work between 5.78% and 8.00% hours per week less than men.
## Estimate Std. Error t value Pr(>|t|)
## RACEnew 0.6269 0.6740 0.930 0.35239
If α = .05, then the p-value, 0.35, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and RACEnew.
## Estimate Std. Error t value Pr(>|t|)
## EARNRS1 7.3769 2.3407 3.152 0.00164
EARNRS1 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (No wage earners in the respondent’s family in the household) or 1 (1 or more wage earners in the respondent’s family in the household).
If α = .05, then the p-value, < .002, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and EARNRS1.
The estimate of the population coefficient is 7.38 (rounded). This means that there is a positive relationship between HRS1 and EARNRS1. On average, households with 1 or more wage earners report working 7.38% hours more per week than households with no wage earners.
How accurate is the estimate of the relationship between hours worked per week and the number of a household’s family wage earners? Here is the 95% confidence interval for EARNRS1:
## 2.5 % 97.5 %
## EARNRS1 2.7871952 11.9665397
Our best estimate is that, on average, households with 1 or more wage earners report working 7.38% hours more per week than households with no wage earners. However, we are 95% confident that households with 1 or more wage earners report working between 2.79% and 11.97% hours more per week.
## Estimate Std. Error t value Pr(>|t|)
## EARNRS2 -0.6099 0.5774 -1.056 0.29099
EARNRS2 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (0 to 1 wage earners in the respondent’s family in the household) or 1 (2 or more wage earners in the respondent’s family in the household).
If α = .05, then the p-value, .29, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and EARNRS2.
leastsquares2 <- lm(log(HRS1) ~ SEXnew + RACEnew + EARNRS1 + EARNRS2,data=GSSdata)
summary(leastsquares2)
##
## Call:
## lm(formula = log(HRS1) ~ SEXnew + RACEnew + EARNRS1 + EARNRS2,
## data = GSSdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6951 -0.0575 0.1381 0.2562 1.0091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.482908 0.085484 40.743 <2e-16 ***
## SEXnew -0.207448 0.020395 -10.171 <2e-16 ***
## RACEnew -0.003406 0.024317 -0.140 0.8886
## EARNRS1 0.215589 0.084447 2.553 0.0107 *
## EARNRS2 -0.004572 0.020833 -0.219 0.8263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5439 on 2844 degrees of freedom
## (1971 observations deleted due to missingness)
## Multiple R-squared: 0.03708, Adjusted R-squared: 0.03573
## F-statistic: 27.38 on 4 and 2844 DF, p-value: < 2.2e-16
confint(leastsquares2)
## 2.5 % 97.5 %
## (Intercept) 3.31529104 3.65052540
## SEXnew -0.24743871 -0.16745663
## RACEnew -0.05108686 0.04427558
## EARNRS1 0.05000585 0.38117279
## EARNRS2 -0.04542040 0.03627654
Is the entire set of independent variables related to the dependent variable?
I read from the findings that:
## Multiple R-squared: 0.03708, Adjusted R-squared: 0.03573
## F-statistic: 27.38 on 4 and 2844 DF, p-value: < 2.2e-16
If α = .05, then the p-value, < 2.2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
Since I reject the null hypothesis, I will now explore whether each of the independent variables is related to the dependent variable.
## Estimate Std. Error t value Pr(>|t|)
## SEXnew -0.207448 0.020395 -10.171 <2e-16
Again, if α = .05, then the p-value, < 2e-16, is less than α. Therefore, I reject the null hypothesis that there is no relationship between ln(HRS1) and SEXnew.
The SEXnew variable was coded as either 0 (Male) or 1 (Female). The estimate of the population coefficient is -0.21 (rounded). This means that there is a negative relationship between ln(HRS1) and SEXnew. On average, women work .21% hours per week less than men.
How accurate is the estimate of the relationship between hours worked per week and sex? Here is the 95% confidence interval for SEXnew:
## 2.5 % 97.5 %
## SEXnew -0.24743871 -0.16745663
Our best estimate is that, on average, women work .21% hours per week less than men. However, we are 95% confident that women work between .17% and .25% hours per week less than men.
## Estimate Std. Error t value Pr(>|t|)
## RACEnew -0.003406 0.024317 -0.140 0.8886
If α = .05, then the p-value, 0.89, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and RACEnew.
## Estimate Std. Error t value Pr(>|t|)
## EARNRS1 0.215589 0.084447 2.553 0.0107
EARNRS1 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS1 variable was coded as either 0 (No wage earners in the respondent’s family in the household) or 1 (1 or more wage earners in the respondent’s family in the household).
If α = .05, then the p-value, < .01, is less than α. Therefore, I reject the null hypothesis that there is no relationship between HRS1 and EARNRS1.
The estimate of the population coefficient is .22 (rounded). This means that there is a positive relationship between HRS1 and EARNRS1. On average, households with 1 or more wage earners report working .22% hours more per week than households with no wage earners.
How accurate is the estimate of the relationship between hours worked per week and the number of a household’s family wage earners? Here is the 95% confidence interval for EARNRS1:
## 2.5 % 97.5 %
## EARNRS1 0.05000585 0.38117279
Our best estimate is that, on average, households with 1 or more wage earners report working .22% hours more per week than households with no wage earners. However, we are 95% confident that households with 1 or more wage earners report working between .05% and .38% hours more per week.
## Estimate Std. Error t value Pr(>|t|)
## EARNRS2 -0.004572 0.020833 -0.219 0.8263
EARNRS2 describes how many family members in the household earned any money in the past year from any job or employment. The EARNRS2 variable was coded as either 0 (0 to 1 wage earners in the respondent’s family in the household) or 1 (2 or more wage earners in the respondent’s family in the household).
If α = .05, then the p-value, < .83, is more than α. Therefore, I fail to reject the null hypothesis that there is no relationship between HRS1 and EARNRS2.
#First, I need means and standard deviations of all variables in each equation.
GSSdata %>%
summarise(mean_hours=mean(HRS1, na.rm=TRUE), mean_sex=mean(SEXnew), mean_race=mean(RACEnew), mean_earners1 = mean(EARNRS1,na.rm=TRUE), mean_earners2 = mean(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
##
## mean_hours mean_sex mean_race mean_earners1 mean_earners2 n
## (dbl) (dbl) (dbl) (dbl) (dbl) (int)
## 1 40.26384 0.5576763 0.7676349 0.8098313 0.4334514 4820
GSSdata %>%
summarise(sd_hours=sd(HRS1,na.rm=TRUE), sd_sex=sd(SEXnew), sd_race=sd(RACEnew), sd_earners1 = sd(EARNRS1,na.rm=TRUE), sd_earners2 = sd(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
##
## sd_hours sd_sex sd_race sd_earners1 sd_earners2 n
## (dbl) (dbl) (dbl) (dbl) (dbl) (int)
## 1 15.47928 0.4967138 0.4223844 0.392475 0.4956031 4820
GSSdata %>%
summarise(mean_hours=mean(log(GSSdata$HRS1),na.rm=TRUE), mean_sex=mean(SEXnew), mean_race=mean(RACEnew), mean_earners1 = mean(EARNRS1,na.rm=TRUE), mean_earners2 = mean(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
##
## mean_hours mean_sex mean_race mean_earners1 mean_earners2 n
## (dbl) (dbl) (dbl) (dbl) (dbl) (int)
## 1 3.584611 0.5576763 0.7676349 0.8098313 0.4334514 4820
GSSdata %>%
summarise(sd_hours=sd(log(GSSdata$HRS1),na.rm=TRUE), sd_sex=sd(SEXnew), sd_race=sd(RACEnew), sd_earners1 = sd(EARNRS1,na.rm=TRUE), sd_earners2 = sd(EARNRS2,na.rm=TRUE), n=n())
## Source: local data frame [1 x 6]
##
## sd_hours sd_sex sd_race sd_earners1 sd_earners2 n
## (dbl) (dbl) (dbl) (dbl) (dbl) (int)
## 1 0.5535699 0.4967138 0.4223844 0.392475 0.4956031 4820