> library(readr)
> data_acs <- read_csv("~/data-acs.csv")
Rows: 10000 Columns: 18
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr  (4): sex, race, hispanic, empstat
dbl (14): age, educ, hrs_wk, income, deg_bachelors, deg_masters, deg…

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
> View(data_acs)
> # Load packages using 'pacman'
> library(pacman)
> p_load(tidyverse, scales, patchwork, fixest, here)
> # Load data
> acs_df = here('data-acs.csv') |> read_csv()
Error: 'C:/Users/lucas/OneDrive/Documents/Problem Set 0/data-acs.csv' does not exist.
Warning message:
File monitoring failed for project at "~/Problem Set 0"
Error 2 (The system cannot find the file specified)
Features disabled: R source file indexing, Connect, Diagnostics
> # Load data
> acs_df = here('data-acs.csv') |> read_csv()
Rows: 10000 Columns: 18
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr  (4): sex, race, hispanic, empstat
dbl (14): age, educ, hrs_wk, income, deg_bachelors, deg_masters, deg…

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
> # Check dimensions
> acs_df |> nrow() |> comma()
[1] "10,000"
> glimpse(acs_df)
Rows: 10,000
Columns: 18
$ sex
Constant                   -1.865 (1.264)
educ                    1.867*** (0.0914)
_______________         _________________
S.E. type                             IID
Observations                        9,819
R2                                0.04074
Adj. R2                           0.04064
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> #[15] The regression gives an intercept of -1.865, which is the predicted weekly work hours for someone with zero years of education. The coefficient on educ implies that each additional year of schooling is associated with about 1.867 more hours of work per week, all else held constant.
> #[16] Using the regression, the predicted hours for someone with 13 years of education are -1.865 + 1.867 * 13 = 22.406 hours of work per week.
> acs_df_nonzero <- acs_df %>% filter(hrs_wk > 0)
> model_nonzero <- lm(hrs_wk ~ educ, data = acs_df_nonzero)
> summary(model_nonzero)

Call:
lm(formula = hrs_wk ~ educ, data = acs_df_nonzero)

Residuals:
    Min      1Q  Median      3Q     Max
-37.853  -3.853   1.147   3.346  61.797

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.0567     1.0396   28.91  < 2e-16 ***
educ          0.5497     0.0735    7.48 8.49e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.62 on 6138 degrees of freedom
  (85 observations deleted due to missingness)
Multiple R-squared:  0.009033, Adjusted R-squared:  0.008871
F-statistic: 55.95 on 1 and 6138 DF,  p-value: 8.488e-14
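
As a quick check on the two fits, the reported coefficients can be plugged in by hand to compare predicted weekly hours at 13 years of education. This is only a minimal sketch: the variable names are hypothetical, and the coefficients are copied from the printed output (rounded as shown) rather than extracted from the model objects.

```r
# Coefficients copied from the printed regression output (rounded as shown).
b_all     <- c(intercept = -1.865,  educ = 1.867)   # full sample
b_nonzero <- c(intercept = 30.0567, educ = 0.5497)  # hrs_wk > 0 only

educ_years <- 13

# Predicted hours = intercept + slope * years of education.
pred_all     <- unname(b_all["intercept"] + b_all["educ"] * educ_years)
pred_nonzero <- unname(b_nonzero["intercept"] + b_nonzero["educ"] * educ_years)

print(c(all = pred_all, nonzero = pred_nonzero))
```

With the fitted model still in memory, `predict(model_nonzero, newdata = data.frame(educ = 13))` should agree with the second number up to rounding.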
> summarise(mean_hrs_wk = mean(hrs_wk))
Error: object 'hrs_wk' not found
> acs_df |>
+   filter(educ == 13) |>
+   summarise(mean_hrs_wk = mean(hrs_wk)) |>
+   pull() |>
+   hours()
Error in validObject(.Object) :
  invalid class "Period" object: periods must have integer values
> acs_df_nonzero |>
+   filter(educ == 13) |>
+   summarise(mean_hrs_wk = mean(hrs_wk)) |>
+   pull() |>
+   hours()
Error in validObject(.Object) :
  invalid class "Period" object: periods must have integer values
> acs_df_nonzero |>
+   filter(educ == 13) |>
+   summarise(mean_hrs_wk = mean(hrs_wk)
+
+ mean(acs_df_nonzero$hrs_wk, na.rm = TRUE)
Error: unexpected symbol in: " mean"
> mean(acs_df_nonzero$hrs_wk, na.rm = TRUE)
[1] 37.75245
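
The `Period` errors above seem to come from the trailing `hours()` call, which (assuming it resolves to lubridate's `hours()`) builds a time period and requires integer input, so it is not needed to report a mean. Below is a minimal base-R sketch of the intended calculation. `toy_df` and its values are hypothetical stand-ins, not the ACS data.

```r
# Hypothetical toy data standing in for acs_df (not the real ACS values).
toy_df <- data.frame(
  educ   = c(12, 13, 13, 13, 16, NA),
  hrs_wk = c(40,  0, 40, 35, 50, 20)
)

# Keep workers (hrs_wk > 0) with exactly 13 years of education.
# subset() drops rows where the condition is NA, so missing educ is excluded.
workers_13 <- subset(toy_df, educ == 13 & hrs_wk > 0)

# Plain numeric mean; no conversion to a time period is involved.
mean_hrs_13 <- mean(workers_13$hrs_wk, na.rm = TRUE)
print(mean_hrs_13)
```

In the original pipeline the same result should follow from ending at `pull()` and dropping the `hours()` step.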
> #[18] Restricting the regression to people who work more than 0 hours a week gives a much higher average: mean weekly hours among those who work are 37.75. This is likely because most workers are in the 40-hours-a-week category. The higher mean makes sense, because the people who do not work pulled the average down and gave a less accurate picture of how many hours people actually work.
> #[19] As briefly explained above, the regression results differ because dropping the people who were not working at all reduced the variability in hours and left only those who work consistently. The intercept rose from -1.865 to 30.06 and the coefficient on educ fell from 1.867 to 0.550 compared with the earlier regression, which included everyone who was not working.
> library(ggplot2)
> # Scatterplot
> ggplot(acs_df, aes(x = educ, y = hrs_wk)) +
+   geom_point(alpha = 0.5, color = "blue") +
+   geom_smooth(method = "lm", se = TRUE, color = "red") +
+   labs(
+     title = "Scatterplot of Hours Worked vs. Education",
+     x = "Years of Education",
+     y = "Hours Worked per Week"
+   ) +
+   theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning messages:
1: Removed 181 rows containing non-finite outside the scale range (`stat_smooth()`).
2: Removed 181 rows containing missing values or values outside the scale range (`geom_point()`).
> library(dplyr)acs_df <- acs_df %>%
Error: unexpected symbol in "library(dplyr)acs_df"
> acs_df <- acs_df %>%
+   mutate(i_zero_hrs = ifelse(hrs_wk == 0, 1, 0))
> model1 <- lm(i_zero_hrs ~ i_female, data = acs_df)
> summary(model1)

Call:
lm(formula = i_zero_hrs ~ i_female, data = acs_df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4205 -0.4205 -0.3315  0.5795  0.6685

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.331471   0.006944  47.733   <2e-16 ***
i_female    0.089082   0.009661   9.221   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4828 on 9998 degrees of freedom
Multiple R-squared:  0.008433, Adjusted R-squared:  0.008334
F-statistic: 85.03 on 1 and 9998 DF,  p-value: < 2.2e-16
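
Because the outcome is a 0/1 indicator and the only regressor is a dummy, the coefficients in the output above translate directly into predicted shares for each group. The sketch below works through that arithmetic. The object names are hypothetical, and the numbers are copied from the printed table.

```r
# Coefficients copied from the printed output of model1.
b0 <- 0.331471  # intercept: predicted share of men (i_female = 0) with zero hours
b1 <- 0.089082  # i_female: female-minus-male difference in that share

share_male   <- b0
share_female <- b0 + b1

# Express the predicted shares as percentages, rounded to two decimals.
shares_pct <- round(100 * c(male = share_male, female = share_female), 2)
print(shares_pct)
```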
> #[22] In this regression, the intercept is the probability that someone who is not female (male, since i_female = 0) works zero hours. The coefficient on i_female is the difference in that probability between females and males.
> #[23] Apologies, I answered incorrectly above and couldn't go back and fix it. With an intercept of 0.3315, 33.15% of males in the data are predicted to work zero hours. Adding the 0.0891 coefficient to the intercept gives 33.15% + 8.91% = 42.06%, so 42.06% of females are predicted to work zero hours.
> model2 <- lm(i_zero_hrs ~ i_female + educ + i_female:educ, data = acs_df)
> summary(model2)

Call:
lm(formula = i_zero_hrs ~ i_female + educ + i_female:educ, data = acs_df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9316 -0.3764 -0.2548  0.5437  0.7756

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.741055   0.041608  17.810  < 2e-16 ***
i_female       0.401797   0.058557   6.862 7.22e-12 ***
educ          -0.030391   0.003034 -10.018  < 2e-16 ***
i_female:educ -0.022420   0.004237  -5.292 1.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4722 on 9815 degrees of freedom
  (181 observations deleted due to missingness)
Multiple R-squared:  0.04868, Adjusted R-squared:  0.04838
F-statistic: 167.4 on 3 and 9815 DF,  p-value: < 2.2e-16
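
With the interaction term, the coefficients in the table above combine into a separate education slope for each gender. The sketch below reproduces that arithmetic by hand. `predict_zero_hrs()` is a hypothetical helper, and the coefficients are copied from the printed output, so results match `predict()` only up to rounding.

```r
# Coefficients copied from the printed output of model2.
b <- c(intercept = 0.741055, i_female = 0.401797,
       educ = -0.030391, interaction = -0.022420)

# Change in P(zero hours) per extra year of education, by gender.
slope_male   <- unname(b["educ"])
slope_female <- unname(b["educ"] + b["interaction"])

# Hypothetical helper: predicted probability of working zero hours.
predict_zero_hrs <- function(i_female, educ) {
  unname(b["intercept"] + b["i_female"] * i_female +
           b["educ"] * educ + b["interaction"] * i_female * educ)
}

p_female_13 <- predict_zero_hrs(i_female = 1, educ = 13)
p_female_0  <- predict_zero_hrs(i_female = 1, educ = 0)

print(c(slope_male = slope_male, slope_female = slope_female))
print(c(female_educ_13 = p_female_13, female_educ_0 = p_female_0))
```

The prediction for a female with zero years of education comes out above 1, a reminder that a linear probability model can produce fitted values outside the 0 to 1 range when extrapolating.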
> #[25] Adding education to the regression changes the picture in a few ways. The intercept of 0.7411 is the baseline: the predicted probability of working zero hours for a male with zero years of education. The coefficient on i_female of 0.4018 means that, at zero years of education, females are predicted to be 40.18 percentage points more likely than males to work zero hours. The coefficient on educ of -0.0304 means that each additional year of education lowers a male's probability of working zero hours by 3.04 percentage points. The interaction coefficient of -0.0224 is the additional change for females, so the two have to be added to get the full picture: for females, each year of education lowers the probability of working zero hours by 5.28 percentage points (3.04 + 2.24). Overall, education has a stronger negative effect on the probability of working zero hours for females than for males.
> new_data <- data.frame(i_female = 1, educ = 13)
> predict(model2, newdata = new_data)
        1
0.4563125
> summary(model2)$r.squared
[1] 0.04867559
> #[28] Yes. For the regression in [22] we restricted the ages in the sample to get a clearer picture of the data, but age itself is left out of the model, which can cause omitted-variable bias. Age affects working hours, so if it is also correlated with the regressors, omitting it biases the estimates. The model could then show an incorrect relationship between working hours and variables such as gender.
> #[29] Building on the previous answer, for there to be no bias the following would need to be true: the model is linear, meaning the relationship between the dependent and independent variables is linear; there is no perfect collinearity, meaning no independent variable is an exact linear combination of the others; the sample is random; and the error term has an expected value of zero conditional on the regressors, meaning it is uncorrelated with the variables in the regression. All of these must hold for the estimates to be unbiased.
> #[30] The main takeaway from the data is that there is a positive correlation between education and working hours: more years of education were associated with more hours worked. There are also clear gender differences in the data, with females more likely than males to work zero hours. Finally, there was a clear pattern by age: young and old people were less likely to work, while hours worked peaked in middle age and declined again toward old age. All in all, what I took from the data is that going to school is important if you want to work, and that there are many patterns within the workforce that are important to understand as I get closer to joining it myself.
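
The omitted-variable argument in [28] and [29] can be illustrated with a small simulation. This is a sketch on made-up data, not the ACS sample: the variable names, coefficients, and seed are all assumptions chosen for illustration. A regressor `x` is correlated with a variable `z` that also affects the outcome, and the regression is run with and without `z`.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 10000

z <- rnorm(n)                       # variable that will be omitted
x <- 0.8 * z + rnorm(n)             # regressor correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)   # true coefficient on x is 2

# Short regression omits z; long regression includes it.
b_short <- unname(coef(lm(y ~ x))["x"])
b_long  <- unname(coef(lm(y ~ x + z))["x"])

print(c(omitting_z = b_short, including_z = b_long))
```

Under these assumptions the short regression should overstate the coefficient on `x`, since `x` picks up part of the effect of `z`, while the long regression should land close to the true value of 2.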