I first loaded the libraries I needed to complete the assignment.
require(foreign)
## Loading required package: foreign
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr
I then downloaded the General Social Survey (GSS) data I needed from the National Opinion Research Center and moved the data into my Assignment 3 R Projects folder.
Using the following commands, I opened the data and created a new table data frame. To save space I have not printed a glimpse or summary of the data. However, I did summarize one variable, HRS1, to ensure that the data downloaded correctly.
GSSdata <- read.spss("GSS.SAV",
max.value.labels=TRUE, to.data.frame=FALSE,
trim.factor.names=FALSE,
# use.missings=FALSE keeps codes such as -1, 98 and 99 as numbers; they are filtered out later
reencode=NA, use.missings=FALSE)
GSSdata <- data.frame(GSSdata)
GSSdata <- tbl_df(GSSdata)
GSSdata %>%
summarise(mean_hours=mean(HRS1), sd_hours=sd(HRS1), n=n())
## Source: local data frame [1 x 3]
##
## mean_hours sd_hours n
## (dbl) (dbl) (int)
## 1 24.07573 24.19891 4820
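Note that this mean of about 24 hours still includes the -1 (inapplicable) codes, so it only confirms that the file loaded and is not a meaningful average. A minimal sketch with invented values (not GSS data) shows how much a sentinel code can pull a mean down:

```r
# Invented hours vector; -1 stands for "inapplicable" (not working)
hrs <- c(15, 30, 60, -1, -1, 40, -1, 20)

mean_raw   <- mean(hrs)            # treats -1 as a real number of hours
mean_valid <- mean(hrs[hrs >= 0])  # drops the sentinel code first

mean_raw
mean_valid
```

This is why the cleaned data later in the assignment gives a much higher mean for hours worked.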
Our assignment ultimately asks us to use four variables, one dependent and three independent, from the GSS data to conduct two regression analyses. After reviewing the associated Codebook for the data, I chose to use the following variables:
Dependent variable: HRS1 (required), If working, full- or part-time: How many hours did you work last week, at all jobs? Range: -1 to 99
Independent variable: SEX, Respondent’s sex. Range: 1 to 2; 1=Male, 2=Female
Independent variable: CLASS, if you were asked to use one of four names for your social class, which would you say you belong in: the lower class, the working class, the middle class or the upper class? Range: 1 to 9; 1=Lower, 2=Working, 3=Middle, 4=Upper, 8=Don’t know, 9=No answer
Independent variable: DEGREE, Do you have any college degree? (If yes: what degree or degrees?) Code highest degree earned. Range: 0 to 4; 0=Left high school, 1=High school, 2=Junior college, 3=Bachelor, 4=Graduate
To more easily view the data, I decided to create a new data frame only containing these variables rather than the 1,061 in the original data set. I did this by running the following code.
newgss <- cbind(Sex=GSSdata$SEX, Class=GSSdata$CLASS, Degree=GSSdata$DEGREE, HRS=GSSdata$HRS1)
newgss <- data.frame(newgss)
I then viewed the data frame to ensure I had done this correctly.
glimpse(newgss)
## Observations: 4,820
## Variables: 4
## $ Sex (dbl) 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1...
## $ Class (dbl) 4, 3, 3, 4, 1, 3, 2, 2, 3, 2, 1, 2, 1, 4, 2, 3, 2, 3, 2...
## $ Degree (dbl) 3, 1, 1, 1, 3, 3, 2, 0, 0, 3, 0, 3, 1, 1, 0, 1, 1, 1, 1...
## $ HRS (dbl) 15, 30, 60, -1, -1, -1, -1, -1, -1, 40, -1, 20, -1, -1,...
summary(newgss)
## Sex Class Degree HRS
## Min. :1.000 Min. :1.000 Min. :0.00 Min. :-1.00
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:-1.00
## Median :2.000 Median :2.000 Median :1.00 Median :25.00
## Mean :1.558 Mean :2.437 Mean :1.66 Mean :24.08
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.:40.00
## Max. :2.000 Max. :9.000 Max. :4.00 Max. :99.00
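As an aside on the subsetting step above, the same result could probably be reached in one step with dplyr's select(), which picks and renames columns together and avoids passing through cbind(). The sketch below uses a small invented data frame named toy as a stand-in for GSSdata; the AGE column is only there to show that unselected columns are dropped.

```r
library(dplyr)

# Toy stand-in for GSSdata; all values are invented
toy <- data.frame(SEX = c(1, 2, 2), CLASS = c(4, 3, 9),
                  DEGREE = c(3, 1, 0), HRS1 = c(15, -1, 40),
                  AGE = c(30, 41, 52))

# select() keeps only the named columns and renames them (new = old)
toy_small <- toy %>%
  select(Sex = SEX, Class = CLASS, Degree = DEGREE, HRS = HRS1)

names(toy_small)
```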
I then further cleaned my data by removing the codes that do not represent real answers from my newly created variables, HRS and Class. I cleaned “HRS” because I knew the assignment would require me to take the natural logarithm of that variable, which is undefined for negative values, and because the -1 value (inapplicable) indicates that the person was not working full- or part-time. I also removed the values of 98 (Don’t know) and 99 (No answer) because they do not measure hours worked.
I also cleaned “Class” because I didn’t want to include any “Don’t know” (8) or “No answer” (9) responses in my analysis.
To do this I filtered the data using the following commands, and checked my data against the original. Note: the summary also shows that the minimum of HRS is 1, so the variable doesn’t contain any 0 values I would need to account for before taking the natural logarithm.
cleangss <- newgss %>% filter(HRS>=0, HRS<98, Class<=4)
glimpse(cleangss)
## Observations: 2,840
## Variables: 4
## $ Sex (dbl) 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2...
## $ Class (dbl) 4, 3, 3, 2, 2, 2, 3, 2, 3, 2, 4, 3, 2, 3, 3, 2, 1, 2, 2...
## $ Degree (dbl) 3, 1, 1, 3, 3, 0, 1, 1, 1, 1, 0, 2, 2, 3, 1, 1, 2, 3, 1...
## $ HRS (dbl) 15, 30, 60, 40, 20, 32, 53, 60, 40, 40, 12, 40, 40, 75,...
summary(cleangss)
## Sex Class Degree HRS
## Min. :1.00 Min. :1.000 Min. :0.000 Min. : 1.00
## 1st Qu.:1.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:35.00
## Median :2.00 Median :2.000 Median :1.000 Median :40.00
## Mean :1.51 Mean :2.413 Mean :1.834 Mean :40.25
## 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:50.00
## Max. :2.00 Max. :4.000 Max. :4.000 Max. :89.00
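The filter reduced the data from 4,820 to 2,840 observations. One way to see where the dropped rows come from is to count each excluded code before filtering. The sketch below does this on a small invented data frame (toy is a hypothetical stand-in for newgss) using base R only:

```r
# Toy stand-in for newgss; all values are invented
toy <- data.frame(HRS   = c(40, -1, 98, 99, 35, 20),
                  Class = c(2, 3, 2, 1, 8, 9))

n_inapplicable <- sum(toy$HRS == -1)  # not working last week
n_dk_na_hours  <- sum(toy$HRS >= 98)  # 98 = Don't know, 99 = No answer
n_dk_na_class  <- sum(toy$Class > 4)  # 8 = Don't know, 9 = No answer

# Same conditions as the filter() call above
kept <- subset(toy, HRS >= 0 & HRS < 98 & Class <= 4)
nrow(kept)
```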
After this data preparation, I’m now ready to answer the questions posed by the assignment.
I used my original data frame and ran the following code to produce a histogram.
GSSdata %>% ggvis(~HRS1) %>% layer_histograms()%>%
add_axis("x", title = "Number of hours worked in previous week", title_offset = 50) %>%
add_axis("y", title = "Number of observations", title_offset = 50)
## Guessing width = 5 # range / 21
I have also provided a histogram of HRS below which uses my cleaned data so that it is directly comparable in terms of number of observations to later parts of this assignment.
cleangss %>% ggvis(~HRS) %>% layer_histograms() %>%
add_axis("x", title = "Number of hours worked in previous week", title_offset = 50) %>%
add_axis("y", title = "Number of observations", title_offset = 50)
## Guessing width = 5 # range / 18
Since I renamed my data in a new data frame, I will actually produce ln(HRS) from the cleangss data frame by running the following code. This code also creates a new data frame that I can use later in the assignment.
ln_HRS <- cleangss %>% mutate(HRS = log(HRS))
glimpse(ln_HRS)
## Observations: 2,840
## Variables: 4
## $ Sex (dbl) 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2...
## $ Class (dbl) 4, 3, 3, 2, 2, 2, 3, 2, 3, 2, 4, 3, 2, 3, 3, 2, 1, 2, 2...
## $ Degree (dbl) 3, 1, 1, 3, 3, 0, 1, 1, 1, 1, 0, 2, 2, 3, 1, 1, 2, 3, 1...
## $ HRS (dbl) 2.708050, 3.401197, 4.094345, 3.688879, 2.995732, 3.465...
summary(ln_HRS)
## Sex Class Degree HRS
## Min. :1.00 Min. :1.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.555
## Median :2.00 Median :2.000 Median :1.000 Median :3.689
## Mean :1.51 Mean :2.413 Mean :1.834 Mean :3.584
## 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.912
## Max. :2.00 Max. :4.000 Max. :4.000 Max. :4.489
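The minimum of 0 for ln(HRS) corresponds to respondents who reported working 1 hour, since log(1) is 0. Had the cleaned data contained any 0 values, the transformation would have produced -Inf, which lm() cannot use. A short illustration with invented values:

```r
hrs <- c(0, 1, 40, 89)   # invented values; 0 is included on purpose
log_hrs <- log(hrs)      # first element is -Inf, second is 0

log_hrs
is.finite(log_hrs)       # flags the value that would break a regression
```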
I then created a histogram of ln(HRS) using this new data frame.
ln_HRS %>% ggvis(~HRS) %>% layer_histograms() %>%
add_axis("x", title = "ln(HRS)", title_offset = 50) %>%
add_axis("y", title = "number of observations", title_offset = 50)
## Guessing width = 0.2 # range / 23
Once again, my new data set cleangss has renamed the dependent variable to HRS, so I can run the following code to conduct my first analysis.
gss_factors <- lm(HRS ~ Sex + Class + Degree, data = cleangss)
gss_factors
##
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = cleangss)
##
## Coefficients:
## (Intercept) Sex Class Degree
## 48.4535 -6.9515 0.2393 0.9374
summary(gss_factors)
##
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = cleangss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.970 -5.970 1.033 6.919 53.273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.4535 1.4789 32.763 < 2e-16 ***
## Sex -6.9515 0.5681 -12.236 < 2e-16 ***
## Class 0.2393 0.5079 0.471 0.637540
## Degree 0.9374 0.2539 3.692 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.06 on 2836 degrees of freedom
## Multiple R-squared: 0.05575, Adjusted R-squared: 0.05475
## F-statistic: 55.81 on 3 and 2836 DF, p-value: < 2.2e-16
confint(gss_factors)
## 2.5 % 97.5 %
## (Intercept) 45.5536535 51.353365
## Sex -8.0654319 -5.837554
## Class -0.7566335 1.235308
## Degree 0.4395308 1.435338
We read from the findings that:
Multiple R-squared: 0.05575, Adjusted R-squared: 0.05475
F-statistic: 55.81 on 3 and 2836 DF, p-value: < 2.2e-16
If \(\alpha\) = .05, then the p-value, < 2.2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
Since we have rejected the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables, we can test each independent variable as it relates to the dependent variable.
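The overall p-value can be reproduced from the F-statistic and its degrees of freedom reported above. This is only a check on the printed output, using the upper tail of the F distribution:

```r
# Values copied from the summary() output above
f_stat <- 55.81
df1    <- 3      # number of independent variables
df2    <- 2836   # residual degrees of freedom

p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

alpha <- 0.05
reject_null <- p_value < alpha
reject_null
```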
The regression coefficient for the independent variable Sex is:
##Estimate Std. Error t value Pr(>|t|)
##-6.9515 0.5681 -12.236 < 2e-16
Again, if \(\alpha\) = .05, then the p-value, < 2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between HRS and Sex.
Point estimate
The estimate of the population coefficient is -6.95 (rounded). This means that there is a negative relationship between HRS and Sex: females worked less than males. Another way of saying this is that, holding Class and Degree constant, women worked an average of 6.95 hours less per week than men.
Interval estimate
## 2.5 % 97.5 %
##Sex -8.0654319 -5.837554
Our best estimate is that during a week, women work 6.95 hours less than men. However, we are 95% confident that this difference in number of hours worked is between 5.84 and 8.07 hours (rounded).
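The interval reported by confint() can be rebuilt by hand as the estimate plus or minus a critical t value times the standard error. The sketch below uses the rounded numbers printed in the coefficient table, so it agrees with confint() only to about three decimal places:

```r
# Values copied from the coefficient table for Sex
est <- -6.9515
se  <- 0.5681
df  <- 2836                    # residual degrees of freedom

t_crit <- qt(0.975, df)        # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se
ci
```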
The regression coefficient for the independent variable Class is:
##Estimate Std. Error t value Pr(>|t|)
## 0.2393 0.5079 0.471 0.637540
In this instance, if \(\alpha\) = .05, then the p-value, 0.637540, is greater than \(\alpha\). Therefore, we fail to reject the null hypothesis that there is no relationship between HRS and Class. Said another way, our regression analysis hasn’t provided us evidence that HRS and Class are related, i.e. that Class helps predict HRS once Sex and Degree are in the model.
The regression coefficient for the independent variable Degree is:
##Estimate Std. Error t value Pr(>|t|)
## 0.9374 0.2539 3.692 0.000227
Again, if \(\alpha\) = .05, then the p-value, 0.000227, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between HRS and Degree.
Point estimate
The estimate of the population coefficient is 0.94 (rounded). This means that there is a positive relationship between HRS and Degree: the higher the degree that is earned, the more hours per week that are worked. Another way of saying this is that each additional degree level is associated with about 0.94 more hours worked per week, holding Sex and Class constant.
Interval estimate
## 2.5 % 97.5 %
##Degree 0.4395308 1.435338
Our best estimate is that for every degree level earned, the number of hours worked each week increases by 0.94 hours. However, we are 95% confident that this increase in number of hours is between 0.44 and 1.44 hours (rounded).
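To make these coefficients concrete, the fitted equation can be used to compute predicted hours for particular respondents. The sketch below plugs in the rounded coefficients from the output above; predict_hours is a hypothetical helper written for this illustration, and the chosen profile (middle class, bachelor's degree) is arbitrary.

```r
# Rounded coefficients copied from the lm() output for HRS
b <- c(intercept = 48.4535, Sex = -6.9515, Class = 0.2393, Degree = 0.9374)

# Hypothetical helper: predicted weekly hours for given codes
predict_hours <- function(sex, class, degree) {
  unname(b["intercept"] + b["Sex"] * sex +
         b["Class"] * class + b["Degree"] * degree)
}

male   <- predict_hours(sex = 1, class = 3, degree = 3)
female <- predict_hours(sex = 2, class = 3, degree = 3)

male - female   # equals the size of the Sex coefficient
```

Because Sex is coded 1 and 2, the gap between the two predictions is exactly the Sex coefficient, whatever values are used for Class and Degree.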
This is easily done using my mutated data set, ln_HRS, with the same commands.
gss_factors_lnHRS <- lm(HRS ~ Sex + Class + Degree, data = ln_HRS)
gss_factors_lnHRS
##
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = ln_HRS)
##
## Coefficients:
## (Intercept) Sex Class Degree
## 3.8558696 -0.2085724 0.0003349 0.0230698
summary(gss_factors_lnHRS)
##
## Call:
## lm(formula = HRS ~ Sex + Class + Degree, data = ln_HRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7406 -0.0517 0.1485 0.2641 1.0265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8558696 0.0534380 72.156 <2e-16 ***
## Sex -0.2085724 0.0205275 -10.161 <2e-16 ***
## Class 0.0003349 0.0183536 0.018 0.985
## Degree 0.0230698 0.0091753 2.514 0.012 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5443 on 2836 degrees of freedom
## Multiple R-squared: 0.03739, Adjusted R-squared: 0.03637
## F-statistic: 36.71 on 3 and 2836 DF, p-value: < 2.2e-16
confint(gss_factors_lnHRS)
## 2.5 % 97.5 %
## (Intercept) 3.751088217 3.96065094
## Sex -0.248822723 -0.16832214
## Class -0.035652798 0.03632263
## Degree 0.005078946 0.04106073
We read from the findings that:
Multiple R-squared: 0.03739, Adjusted R-squared: 0.03637
F-statistic: 36.71 on 3 and 2836 DF, p-value: < 2.2e-16
If \(\alpha\) = .05, then the p-value, < 2.2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between the dependent variable and the entire set of independent variables.
Since we have rejected the null hypothesis, we can test each independent variable as it relates to the dependent variable.
The regression coefficient for the independent variable Sex is:
## Estimate Std. Error t value Pr(>|t|)
##-0.2085724 0.0205275 -10.161 <2e-16
Again, if \(\alpha\) = .05, then the p-value, < 2e-16, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between ln_HRS and Sex.
Point estimate
The estimate of the population coefficient is -0.21 (rounded). This means that there is a negative relationship between ln_HRS and Sex: reading the coefficient as an approximate proportional change, females worked roughly 21% fewer hours than men.
Interval estimate
## 2.5 % 97.5 %
##Sex -0.248822723 -0.16832214
Our best estimate is that women work roughly 21 percent fewer hours than men. However, we are 95% confident that the reduction in the number of hours worked is between 17 and 25 percent (rounded).
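Reading a coefficient from a log-transformed model as a percentage is an approximation that works best for small coefficients. The exact percentage change is 100 * (exp(b) - 1), and for a coefficient of this size the two differ by a couple of percentage points. The sketch below compares them using the values from the output above:

```r
# Values copied from the ln(HRS) model output for Sex
b_sex  <- -0.2085724
ci_sex <- c(-0.248822723, -0.16832214)

pct_approx <- 100 * b_sex               # the shortcut reading used above
pct_exact  <- 100 * (exp(b_sex) - 1)    # exact percentage change
pct_ci     <- 100 * (exp(ci_sex) - 1)   # exact interval endpoints

round(c(approx = pct_approx, exact = pct_exact), 1)
round(pct_ci, 1)
```

On the exact scale the difference is closer to 19 percent fewer hours, with an interval of roughly 15 to 22 percent.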
The regression coefficient for the independent variable Class is:
## Estimate Std. Error t value Pr(>|t|)
## 0.0003349 0.0183536 0.018 0.985
In this instance, if \(\alpha\) = .05, then the p-value, 0.985, is greater than \(\alpha\). Therefore, we fail to reject the null hypothesis that there is no relationship between ln_HRS and Class. Said another way, our regression analysis hasn’t provided us evidence that ln_HRS and Class are related, i.e. that Class helps predict ln_HRS once Sex and Degree are in the model.
The regression coefficient for the independent variable Degree is:
## Estimate Std. Error t value Pr(>|t|)
## 0.0230698 0.0091753 2.514 0.012
Again, if \(\alpha\) = .05, then the p-value, 0.012, is less than \(\alpha\). Therefore, we reject the null hypothesis that there is no relationship between ln_HRS and Degree.
Point estimate
The estimate of the population coefficient is 0.02 (rounded). This means that there is a positive relationship between ln_HRS and Degree: each additional degree level is associated with an increase of about 2% in the number of hours worked each week.
Interval estimate
## 2.5 % 97.5 %
##Degree 0.005078946 0.04106073
Our best estimate is that for every degree level earned, the number of hours worked each week will increase by 2%. However, we are 95% confident that this increase is between 0.5 and 4 percent (rounded).
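The coefficient rows and intervals quoted above were copied by hand from the printed output. They can also be pulled out of a fitted model programmatically, which reduces the risk of transcription slips. The sketch below fits a model to simulated data (toy, x1, x2 and y are invented names, not GSS variables) just to show the shape of the result:

```r
set.seed(1)

# Simulated data; not related to the GSS
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 2 + 3 * toy$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = toy)

coef_table <- coef(summary(fit))  # Estimate, Std. Error, t value, Pr(>|t|)
ci         <- confint(fit)        # 2.5 % and 97.5 % bounds

results <- cbind(coef_table, ci)  # one row per term, six columns
results
```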
Before I make the table for the HRS model, I need means and standard deviations of all variables in the equation.
cleangss %>%
summarise(mean_HRS=mean(HRS), mean_Sex=mean(Sex),
mean_Class=mean(Class), mean_Degree=mean(Degree))
## mean_HRS mean_Sex mean_Class mean_Degree
## 1 40.25458 1.509859 2.412676 1.834155
cleangss %>%
summarise(sd_HRS=sd(HRS), sd_Sex=sd(Sex),
sd_Class=sd(Class), sd_Degree=sd(Degree))
## sd_HRS sd_Sex sd_Class sd_Degree
## 1 15.4924 0.4999908 0.6163053 1.230539
The table:
Before I make the table for the ln(HRS) model, I need means and standard deviations of all variables in the equation.
ln_HRS %>%
summarise(mean_HRS=mean(HRS), mean_Sex=mean(Sex),
mean_Class=mean(Class), mean_Degree=mean(Degree))
## mean_HRS mean_Sex mean_Class mean_Degree
## 1 3.584076 1.509859 2.412676 1.834155
ln_HRS %>%
summarise(sd_HRS=sd(HRS), sd_Sex=sd(Sex),
sd_Class=sd(Class), sd_Degree=sd(Degree))
## sd_HRS sd_Sex sd_Class sd_Degree
## 1 0.5544267 0.4999908 0.6163053 1.230539
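The means and standard deviations above were computed by naming each variable in turn. A more compact alternative, sketched here in base R on a small invented data frame (toy is a hypothetical stand-in for cleangss or ln_HRS), applies each function to every column and returns one row per variable, which is already close to the layout of a descriptive table:

```r
# Toy stand-in; all values are invented
toy <- data.frame(HRS = c(40, 35, 50), Sex = c(1, 2, 2),
                  Class = c(2, 3, 2), Degree = c(1, 3, 4))

# One row per variable, one column per statistic
desc <- data.frame(mean = sapply(toy, mean),
                   sd   = sapply(toy, sd))
desc
```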
The table: