Getting started

I first loaded the libraries I needed to complete the assignment.

require(foreign)
## Loading required package: foreign
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr

Downloading the data

I then downloaded the General Social Survey (GSS) data I needed from the National Opinion Research Center and moved the data into my Assignment 3 R Projects folder.

Using the following commands, I opened the data and created a new table data frame. To save space, I have not printed a glimpse or summary of the data. However, I did summarize one variable, HRS1, to ensure that the data downloaded correctly.

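# to.data.frame=FALSE returns a list, which is converted to a data frame below.
# use.missings=FALSE keeps user-defined missing codes as values instead of NA.
GSSdata <- read.spss("GSS.SAV",
                     max.value.labels=TRUE, to.data.frame=FALSE,
                     trim.factor.names=FALSE,
                     reencode=NA, use.missings=FALSE)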

GSSdata <- data.frame(GSSdata)
GSSdata <- tbl_df(GSSdata)

GSSdata %>%
  summarise(mean_hours=mean(HRS1), sd_hours=sd(HRS1), n=n())
## Source: local data frame [1 x 3]
## 
##   mean_hours sd_hours     n
##        (dbl)    (dbl) (int)
## 1   24.07573 24.19891  4820

Clean and reorganize data frame

Our assignment ultimately asks us to use four variables, one dependent and three independent, from the GSS data to conduct two regression analyses. After reviewing the associated Codebook for the data, I chose to use the following variables:

  1. Dependent variable: HRS1 (required), If working, full- or part-time: How many hours did you work last week, at all jobs? Range: -1 to 99

  2. Independent variable: SEX, Respondent’s sex. Range: 1 to 2; 1=Male, 2=Female

  3. Independent variable: CLASS, if you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class or the upper class? Range: 1 to 9; 1=Lower, 2=Working, 3=Middle, 4=Upper, 8=Don’t know, 9=No answer

  4. Independent variable: DEGREE, Do you have any college degree? (If yes: what degree or degrees?) Code highest degree earned. Range: 0 to 4; 0=Left high school, 1=High school, 2=Junior college, 3=Bachelor, 4=Graduate

To view the data more easily, I created a new data frame containing only these variables rather than the 1,061 in the original data set. I did this by running the following code.

newgss <- cbind(Sex=GSSdata$SEX, Class=GSSdata$CLASS, Degree=GSSdata$DEGREE, HRS=GSSdata$HRS1)
newgss <- data.frame(newgss)

I then viewed the data frame to ensure I had done this correctly.

glimpse(newgss)
## Observations: 4,820
## Variables: 4
## $ Sex    (dbl) 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1...
## $ Class  (dbl) 4, 3, 3, 4, 1, 3, 2, 2, 3, 2, 1, 2, 1, 4, 2, 3, 2, 3, 2...
## $ Degree (dbl) 3, 1, 1, 1, 3, 3, 2, 0, 0, 3, 0, 3, 1, 1, 0, 1, 1, 1, 1...
## $ HRS    (dbl) 15, 30, 60, -1, -1, -1, -1, -1, -1, 40, -1, 20, -1, -1,...
summary(newgss)
##       Sex            Class           Degree          HRS       
##  Min.   :1.000   Min.   :1.000   Min.   :0.00   Min.   :-1.00  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:-1.00  
##  Median :2.000   Median :2.000   Median :1.00   Median :25.00  
##  Mean   :1.558   Mean   :2.437   Mean   :1.66   Mean   :24.08  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.:40.00  
##  Max.   :2.000   Max.   :9.000   Max.   :4.00   Max.   :99.00

I then further cleaned my data by removing special codes from my newly created variables, HRS and Class. I cleaned “HRS” because I knew the assignment would require me to take the natural logarithm of that variable, which is defined only for positive values, and because the -1 value (Inapplicable) indicated that the person was not working full- or part-time. I also removed the values of 98 (Don’t know) and 99 (No answer) because they are codes rather than hours worked.

I also cleaned “Class” because I didn’t want to include any “Don’t know” (8) or “No answer” (9) responses in my analysis.

To do this, I filtered the data using the following commands and checked my data against the original. Note: the summary also shows that the minimum of HRS is 1, so there are no 0 values that I need to account for before taking the natural logarithm.

cleangss <- newgss %>% filter(HRS>=0, HRS<98, Class<=4)
glimpse(cleangss)
## Observations: 2,840
## Variables: 4
## $ Sex    (dbl) 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2...
## $ Class  (dbl) 4, 3, 3, 2, 2, 2, 3, 2, 3, 2, 4, 3, 2, 3, 3, 2, 1, 2, 2...
## $ Degree (dbl) 3, 1, 1, 3, 3, 0, 1, 1, 1, 1, 0, 2, 2, 3, 1, 1, 2, 3, 1...
## $ HRS    (dbl) 15, 30, 60, 40, 20, 32, 53, 60, 40, 40, 12, 40, 40, 75,...
summary(cleangss)
##       Sex           Class           Degree           HRS       
##  Min.   :1.00   Min.   :1.000   Min.   :0.000   Min.   : 1.00  
##  1st Qu.:1.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:35.00  
##  Median :2.00   Median :2.000   Median :1.000   Median :40.00  
##  Mean   :1.51   Mean   :2.413   Mean   :1.834   Mean   :40.25  
##  3rd Qu.:2.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:50.00  
##  Max.   :2.00   Max.   :4.000   Max.   :4.000   Max.   :89.00
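The filtering rules above can be sketched on a small example. The data frame below is made up for illustration (it is not taken from the GSS file), and the names `toy`, `keep`, and `dropped` are hypothetical; the sketch uses base R subsetting rather than `filter()` so that it runs without any packages.

```r
# Toy data frame with made-up values; not taken from the GSS file.
toy <- data.frame(
  HRS   = c(40, -1, 98, 99, 25, 1, 60),
  Class = c(2, 3, 1, 2, 8, 4, 9)
)

# Keep rows with a real hours value (0 to 97) and a substantive class code (1 to 4).
keep <- toy$HRS >= 0 & toy$HRS < 98 & toy$Class <= 4
toy_clean <- toy[keep, ]

# Count how many rows carry each kind of special code.
dropped <- c(
  inapplicable_hrs = sum(toy$HRS == -1),
  dk_or_na_hrs     = sum(toy$HRS %in% c(98, 99)),
  dk_or_na_class   = sum(toy$Class %in% c(8, 9))
)

print(toy_clean)
print(dropped)
```

Counting the special codes before dropping them is one way to confirm that the change in the number of observations is explained by the filter.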

After this data preparation, I’m now ready to answer the questions posed by the assignment.


1. Visualizations. Using the ggvis package, create two plots…

a. One histogram of HRS1.

I used my original data frame and ran the following code to produce a histogram.

GSSdata %>% ggvis(~HRS1) %>% layer_histograms() %>%
  add_axis("x", title = "Number of hours worked in previous week", title_offset = 50) %>%
  add_axis("y", title = "Number of observations", title_offset = 50)
## Guessing width = 5 # range / 21

I have also provided a histogram of HRS below which uses my cleaned data so that it is directly comparable in terms of number of observations to later parts of this assignment.

cleangss %>% ggvis(~HRS) %>% layer_histograms() %>%
  add_axis("x", title = "Number of hours worked in previous week", title_offset = 50) %>%
  add_axis("y", title = "Number of observations", title_offset = 50)
## Guessing width = 5 # range / 18


b. Another histogram of the natural logarithm (ln) of HRS1, or ln(HRS1).

Since I renamed my data in a new data frame, I will actually produce ln(HRS) from the cleangss data frame by running the following code. This code also creates a new data frame that I can use later in the assignment.

ln_HRS <- cleangss %>% mutate(HRS = log(HRS))
glimpse(ln_HRS)
## Observations: 2,840
## Variables: 4
## $ Sex    (dbl) 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2...
## $ Class  (dbl) 4, 3, 3, 2, 2, 2, 3, 2, 3, 2, 4, 3, 2, 3, 3, 2, 1, 2, 2...
## $ Degree (dbl) 3, 1, 1, 3, 3, 0, 1, 1, 1, 1, 0, 2, 2, 3, 1, 1, 2, 3, 1...
## $ HRS    (dbl) 2.708050, 3.401197, 4.094345, 3.688879, 2.995732, 3.465...
summary(ln_HRS)
##       Sex           Class           Degree           HRS       
##  Min.   :1.00   Min.   :1.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:3.555  
##  Median :2.00   Median :2.000   Median :1.000   Median :3.689  
##  Mean   :1.51   Mean   :2.413   Mean   :1.834   Mean   :3.584  
##  3rd Qu.:2.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.912  
##  Max.   :2.00   Max.   :4.000   Max.   :4.000   Max.   :4.489
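As a small illustration of the transformation, the sketch below applies `log()` to a few hypothetical hours values and stores the result in a separate column, which is an alternative to overwriting HRS. The values and the column name `ln_hours` are assumptions made for the example.

```r
# Hypothetical hours values, all strictly positive as in the cleaned data.
toy <- data.frame(HRS = c(1, 15, 40, 89))

# Add the log as a new column so the original scale stays available.
toy <- transform(toy, ln_hours = log(HRS))
print(toy)

# log() is only finite for positive inputs.
print(log(0))                     # negative infinity
print(suppressWarnings(log(-1)))  # not a number
```

This is why a minimum of 1 hour in the cleaned data becomes a minimum of 0 after the transformation, and why the -1 codes had to be removed first.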

I then created a histogram of ln(HRS) using this new data frame.

ln_HRS %>% ggvis(~HRS) %>% layer_histograms() %>% 
  add_axis("x", title = "ln(HRS)", title_offset = 50) %>%
  add_axis("y", title = "Number of observations", title_offset = 50)
## Guessing width = 0.2 # range / 23


2. Analyses. Conduct two ordinary least squares regression analyses…

a. Use HRS1 as the dependent variable and at least three other variables as independent variables.

Once again, my new data set cleangss has renamed the dependent variable to HRS, so I can run the following code to conduct my first analysis.

gss_factors <- lm(HRS ~ Sex + Class + Degree, data = cleangss)
gss_factors
## 
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = cleangss)
## 
## Coefficients:
## (Intercept)          Sex        Class       Degree  
##     48.4535      -6.9515       0.2393       0.9374
summary(gss_factors)
## 
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = cleangss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.970  -5.970   1.033   6.919  53.273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.4535     1.4789  32.763  < 2e-16 ***
## Sex          -6.9515     0.5681 -12.236  < 2e-16 ***
## Class         0.2393     0.5079   0.471 0.637540    
## Degree        0.9374     0.2539   3.692 0.000227 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.06 on 2836 degrees of freedom
## Multiple R-squared:  0.05575,    Adjusted R-squared:  0.05475 
## F-statistic: 55.81 on 3 and 2836 DF,  p-value: < 2.2e-16
confint(gss_factors)
##                  2.5 %    97.5 %
## (Intercept) 45.5536535 51.353365
## Sex         -8.0654319 -5.837554
## Class       -0.7566335  1.235308
## Degree       0.4395308  1.435338

Analysis of the Null Hypothesis

We read from the findings that:

Multiple R-squared: 0.05575, Adjusted R-squared: 0.05475

F-statistic: 55.81 on 3 and 2836 DF, p-value: < 2.2e-16

If \(\alpha\) = .05, then the p-value, < 2.2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.

Since we have rejected this null hypothesis, we can test each independent variable as it relates to the dependent variable.
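The overall test can be checked by hand from the reported output. The sketch below recomputes the F statistic from the multiple R-squared and the degrees of freedom and then looks up its upper-tail probability; the variable names are mine, and the inputs are the rounded values printed by `summary()`, so small rounding differences are expected.

```r
# Values copied from the summary() output of the first regression.
r2       <- 0.05575  # Multiple R-squared
k        <- 3        # number of independent variables
df_resid <- 2836     # residual degrees of freedom

# Overall F statistic implied by R-squared.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 0.05)
```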

Independence of variable Sex

The regression coefficient for the independent variable Sex is:

##     Estimate Std. Error t value Pr(>|t|)
## Sex  -6.9515     0.5681 -12.236  < 2e-16

Again, if \(\alpha\) = .05, then the p-value, < 2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between HRS and Sex.

Interpretation

Point estimate

The estimate of the population coefficient is -6.95 (rounded). This means that there is a negative relationship between HRS and Sex. Because Sex is coded 1=Male and 2=Female, women worked less than men: holding Class and Degree constant, women worked an average of 6.95 hours less per week than men.

Interval estimate

##          2.5 %    97.5 %
## Sex -8.0654319 -5.837554

Our best estimate is that during a week, women work 6.95 hours less than men. We are 95% confident that this difference in number of hours worked is between 5.84 and 8.07 hours (rounded).
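The t value and the interval for Sex can be reproduced from the estimate and its standard error. This is a sketch using the rounded figures printed above, so the last digits may differ slightly from `confint()`; the variable names are my own.

```r
# Estimate and standard error for Sex, copied from the coefficient table.
est      <- -6.9515
se       <- 0.5681
df_resid <- 2836

# t statistic and two-sided p-value.
t_value <- est / se
p_value <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate plus or minus the critical t times the SE.
crit <- qt(0.975, df_resid)
ci <- c(lower = est - crit * se, upper = est + crit * se)

print(round(t_value, 3))
print(round(ci, 2))
```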

Independence of variable Class

The regression coefficient for the independent variable Class is:

##       Estimate Std. Error t value Pr(>|t|)
## Class   0.2393     0.5079   0.471 0.637540

In this instance, if \(\alpha\) = .05, then the p-value, 0.637540, is greater than \(\alpha\). Therefore, we fail to reject the null hypothesis that there is no relationship between HRS and Class. Said another way, our regression analysis hasn’t provided us evidence that Class helps predict HRS once Sex and Degree are in the model.

Independence of variable Degree

The regression coefficient for the independent variable Degree is:

##        Estimate Std. Error t value Pr(>|t|)
## Degree   0.9374     0.2539   3.692 0.000227

Again, if \(\alpha\) = .05, then the p-value, 0.000227, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between HRS and Degree.

Interpretation

Point estimate

The estimate of the population coefficient is 0.94 (rounded). This means that there is a positive relationship between HRS and Degree: the higher the degree earned, the more hours worked per week. Another way of saying this is that each additional degree level is associated with 0.94 more hours worked per week, holding Sex and Class constant.

Interval estimate

##            2.5 %   97.5 %
## Degree 0.4395308 1.435338

Our best estimate is that for every degree level earned, the number of hours worked each week increases by 0.94 hours. However, we are 95% confident that this increase in number of hours is between 0.44 and 1.44 hours (rounded).
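To see how the coefficients combine, the sketch below computes a fitted value for one illustrative respondent. The profile (a man in the middle class with a bachelor's degree) is a hypothetical choice of mine, and the Class coefficient is included even though it was not statistically significant, because it is still part of the fitted equation.

```r
# Coefficients copied from the first regression output.
coefs <- c(intercept = 48.4535, Sex = -6.9515, Class = 0.2393, Degree = 0.9374)

# Hypothetical profile: Sex = 1 (Male), Class = 3 (Middle), Degree = 3 (Bachelor).
profile_m <- c(intercept = 1, Sex = 1, Class = 3, Degree = 3)

# Same profile for a woman (Sex = 2).
profile_f <- replace(profile_m, "Sex", 2)

# Fitted hours are the sum of each coefficient times its predictor value.
pred_m <- sum(coefs * profile_m)
pred_f <- sum(coefs * profile_f)

print(round(c(male = pred_m, female = pred_f), 2))
print(round(pred_f - pred_m, 2))  # equals the Sex coefficient
```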


b. Use ln(HRS) as the dependent variable and the same set of independent variables.

This is easily done using my mutated data set, ln_HRS, with the same commands.

gss_factors_lnHRS <- lm(HRS ~ Sex + Class + Degree, data = ln_HRS)
gss_factors_lnHRS
## 
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = ln_HRS)
## 
## Coefficients:
## (Intercept)          Sex        Class       Degree  
##   3.8558696   -0.2085724    0.0003349    0.0230698
summary(gss_factors_lnHRS)
## 
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = ln_HRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7406 -0.0517  0.1485  0.2641  1.0265 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.8558696  0.0534380  72.156   <2e-16 ***
## Sex         -0.2085724  0.0205275 -10.161   <2e-16 ***
## Class        0.0003349  0.0183536   0.018    0.985    
## Degree       0.0230698  0.0091753   2.514    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5443 on 2836 degrees of freedom
## Multiple R-squared:  0.03739,    Adjusted R-squared:  0.03637 
## F-statistic: 36.71 on 3 and 2836 DF,  p-value: < 2.2e-16
confint(gss_factors_lnHRS)
##                    2.5 %      97.5 %
## (Intercept)  3.751088217  3.96065094
## Sex         -0.248822723 -0.16832214
## Class       -0.035652798  0.03632263
## Degree       0.005078946  0.04106073

Analysis of the Null Hypothesis

We read from the findings that:

Multiple R-squared: 0.03739, Adjusted R-squared: 0.03637

F-statistic: 36.71 on 3 and 2836 DF, p-value: < 2.2e-16

If \(\alpha\) = .05, then the p-value, < 2.2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.

Since we have rejected the null hypothesis, we can test each independent variable as it relates to the dependent variable.

Independence of variable Sex

The regression coefficient for the independent variable Sex is:

##       Estimate Std. Error t value Pr(>|t|)
## Sex -0.2085724  0.0205275 -10.161   <2e-16

Again, if \(\alpha\) = .05, then the p-value, < 2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between ln_HRS and Sex.

Interpretation

Point estimate

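The estimate of the population coefficient is -0.21 (rounded). This means that there is a negative relationship between ln_HRS and Sex: women worked roughly 21% fewer hours than men. This reading treats 100 times the coefficient as a percent change, which is an approximation; the exact conversion, 100 × (exp(-0.2086) - 1), gives about 19% fewer hours.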

Interval estimate

##            2.5 %      97.5 %
## Sex -0.248822723 -0.16832214

Our best estimate is that women work roughly 21 percent fewer hours than men. We are 95% confident that the reduction in hours worked is between about 17 and 25 percent (rounded, reading 100 times each bound as a percent).
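Because the dependent variable is a natural logarithm, a coefficient b corresponds to an exact percent change of 100 × (exp(b) - 1); reading 100 × b directly is a convenient approximation that drifts as b moves away from zero. The sketch below applies the exact conversion to the Sex estimate and its interval. The helper name `pct_change` is hypothetical.

```r
# Convert a log-scale coefficient to an exact percent change.
pct_change <- function(b) 100 * (exp(b) - 1)

# Estimate and 95% interval for Sex, copied from the output above.
b_sex  <- -0.2085724
ci_sex <- c(lower = -0.248822723, upper = -0.16832214)

exact  <- pct_change(b_sex)
approx <- 100 * b_sex

print(round(c(approximate = approx, exact = exact), 1))
print(round(pct_change(ci_sex), 1))
```

The two readings lead to the same conclusion here, but the exact conversion gives a somewhat smaller gap than the approximate one.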

Independence of variable Class

The regression coefficient for the independent variable Class is:

##        Estimate Std. Error t value Pr(>|t|)
## Class 0.0003349  0.0183536   0.018    0.985

In this instance, if \(\alpha\) = .05, then the p-value, 0.985, is greater than \(\alpha\). Therefore, we fail to reject the null hypothesis that there is no relationship between ln_HRS and Class. Said another way, our regression analysis hasn’t provided us evidence that Class helps predict ln_HRS once Sex and Degree are in the model.

Independence of variable Degree

The regression coefficient for the independent variable Degree is:

##         Estimate Std. Error t value Pr(>|t|)
## Degree 0.0230698  0.0091753   2.514    0.012

Again, if \(\alpha\) = .05, then the p-value, 0.012, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between ln_HRS and Degree.

Interpretation

Point estimate

The estimate of the population coefficient is 0.02 (rounded). This means that there is a positive relationship between ln_HRS and Degree: each additional degree level is associated with roughly 2% more hours worked per week.

Interval estimate

##              2.5 %     97.5 %
## Degree 0.005078946 0.04106073

Our best estimate is that for every degree level earned, the number of hours worked each week will increase by 2%. However, we are 95% confident that this increase is between 0.5 and 4 percent (rounded).
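In a log model the per-level effect compounds rather than adds, so the implied difference across several degree levels is not simply a multiple of 2%. The sketch below works this out for one to four levels using the Degree estimate; it assumes, as the model does, that Degree enters linearly, and the variable names are my own.

```r
# Degree coefficient from the ln(HRS) regression.
b_degree <- 0.0230698

# Number of degree levels moved (Degree is coded 0 to 4).
steps <- 1:4

# Exact percent change in hours implied by moving up that many levels.
pct <- 100 * (exp(b_degree * steps) - 1)

print(data.frame(levels = steps, percent_change = round(pct, 2)))
```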


3. Reports. Build two APA-correct tables…

a. One table reporting the regression of HRS that I conducted in 2a.

Before I make the table, I need means and standard deviations of all variables in the equation.

cleangss %>% 
  summarise(mean_HRS=mean(HRS), mean_Sex=mean(Sex), 
            mean_Class=mean(Class), mean_Degree=mean(Degree))
##   mean_HRS mean_Sex mean_Class mean_Degree
## 1 40.25458 1.509859   2.412676    1.834155
cleangss %>%
  summarise(sd_HRS=sd(HRS), sd_Sex=sd(Sex),
            sd_Class=sd(Class), sd_Degree=sd(Degree))
##    sd_HRS    sd_Sex  sd_Class sd_Degree
## 1 15.4924 0.4999908 0.6163053  1.230539
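The means and standard deviations can also be collected into a single table with one row per variable, which is closer to the layout of a descriptive-statistics table. The sketch below uses base R on a made-up stand-in for cleangss; the values and the names `toy` and `desc` are hypothetical.

```r
# Hypothetical stand-in for cleangss; the values are made up for illustration.
toy <- data.frame(
  HRS    = c(40, 20, 60, 35),
  Sex    = c(1, 2, 1, 2),
  Class  = c(2, 3, 3, 1),
  Degree = c(1, 3, 4, 0)
)

# One row per variable, one column per statistic.
desc <- data.frame(
  M  = sapply(toy, mean),
  SD = sapply(toy, sd)
)

print(round(desc, 2))
```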

The table:

b. Another table reporting the regression of ln(HRS1) that you conducted in 2b.

Before I make the table, I need means and standard deviations of all variables in the equation.

ln_HRS %>% 
  summarise(mean_HRS=mean(HRS), mean_Sex=mean(Sex), 
            mean_Class=mean(Class), mean_Degree=mean(Degree))
##   mean_HRS mean_Sex mean_Class mean_Degree
## 1 3.584076 1.509859   2.412676    1.834155
ln_HRS %>%
  summarise(sd_HRS=sd(HRS), sd_Sex=sd(Sex),
            sd_Class=sd(Class), sd_Degree=sd(Degree))
##      sd_HRS    sd_Sex  sd_Class sd_Degree
## 1 0.5544267 0.4999908 0.6163053  1.230539

The table: