Problem Set 0 (Complete)

> library(readr)
> data_acs <- read_csv("~/data-acs.csv")
Rows: 10000 Columns: 18
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr  (4): sex, race, hispanic, empstat
dbl (14): age, educ, hrs_wk, income, deg_bachelors, deg_masters, deg…

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
> View(data_acs)
> # Load packages using 'pacman'
> library(pacman)
> p_load(tidyverse, scales, patchwork, fixest, here)
> # Load data
> acs_df = here('data-acs.csv') |> read_csv()
Error: 'C:/Users/lucas/OneDrive/Documents/Problem Set 0/data-acs.csv' does not exist.

Warning message:
File monitoring failed for project at "~/Problem Set 0"
Error 2 (The system cannot find the file specified)
Features disabled: R source file indexing, Connect, Diagnostics
> # Load data
> acs_df = here('data-acs.csv') |> read_csv()
Error: 'C:/Users/lucas/OneDrive/Documents/Problem Set 0/data-acs.csv' does not exist.
> # Load data
> acs_df = here('data-acs.csv') |> read_csv()
Rows: 10000 Columns: 18
── Column specification ────────────────────────────────────────────────
Delimiter: ","
chr  (4): sex, race, hispanic, empstat
dbl (14): age, educ, hrs_wk, income, deg_bachelors, deg_masters, deg…

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
> # Check dimensions
> acs_df |> nrow() |> comma()
[1] "10,000"
> glimpse(acs_df)
Rows: 10,000
Columns: 18
$ sex            <chr> "Male", "Male", "Male", "Male", "Male", "Female…
$ age            <dbl> 46, 50, 66, 48, 59, 21, 31, 36, 27, 77, 49, 58,…
$ race           <chr> "Other", "White", "White", "Other", "White", "O…
$ hispanic       <chr> "Hispanic", "Non-Hispanic", "Non-Hispanic", "Hi…
$ educ           <dbl> 12, 17, 12, 13, 12, 13, 12, 12, 16, 14, 17, 12,…
$ empstat        <chr> "Employed", "Employed", "Employed", "Employed",…
$ hrs_wk         <dbl> 40, 45, 40, 40, 30, 0, 40, 40, 40, 0, 50, 0, 0,…
$ income         <dbl> 17000, 595000, 123100, 110000, 83000, 50, 98000…
$ deg_bachelors  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,…
$ deg_masters    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ deg_profession <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ deg_phd        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ i_female       <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,…
$ i_black        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ i_white        <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,…
$ i_hispanic     <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
$ i_workforce    <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,…
$ i_employed     <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1,…
> sum(is.na(acs_df$hrs_wk))
[1] 0
> View(acs_df)
> acs_df$educ |> is.na() |> sum()
[1] 181
> acs_df |>
+   select(hrs_wk, educ, i_female, age, income) |>
+   summarise(
+     across(everything(),
+            list(mean = ~mean(.x, na.rm = TRUE),
+                 median = ~median(.x, na.rm = TRUE)))
+   )
# A tibble: 1 × 10
  hrs_wk_mean hrs_wk_median educ_mean educ_median i_female_mean
        <dbl>         <dbl>     <dbl>       <dbl>         <dbl>
1        23.5            30      13.6          13         0.517
# ℹ 5 more variables: i_female_median <dbl>, age_mean <dbl>,
#   age_median <dbl>, income_mean <dbl>, income_median <dbl>
> acs_df |>
+   select(hrs_wk, i_female, income, educ, age) |>
+   summarise_all(list(mean = mean, median = median), na.rm = TRUE)
# A tibble: 1 × 10
  hrs_wk_mean i_female_mean income_mean educ_mean age_mean hrs_wk_median
        <dbl>         <dbl>       <dbl>     <dbl>    <dbl>         <dbl>
1        23.5         0.517      52930.      13.6     50.9            30
# ℹ 4 more variables: i_female_median <dbl>, income_median <dbl>,
#   educ_median <dbl>, age_median <dbl>
> #[8] Binary indicators such as i_female are coded 0 (not female) or 1 (female). Their mean is the share of the sample in that category: the mean of i_female is 0.517, so about 51.7% of the sample is female. Indicators also make it easy to split or compare the data by group, depending on how you want to view the data.
> library(ggplot2)
> ggplot(acs_df, aes(x = hrs_wk)) +
+   geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
+   labs(
+     title = "Distribution of Hours Worked Per Week",
+     x = "Hours Worked per Week",
+     y = "Number of Observations"
+   ) +
+   theme_minimal()
> library(ggplot2)
> ggplot(acs_df, aes(x = age)) +
+   geom_histogram(binwidth = 5, fill = "blue", color = "black") +
+   labs(
+     title = "Age Distribution",
+     x = "Age (years)",
+     y = "Number of Observations"
+   ) +
+   theme_minimal()
> #[11] The age distribution matters when discussing hours worked because the two are correlated: how much someone works depends on their age. Many younger people have not yet entered the workforce, and many older people are at or near retirement. Including both groups skews the hours distribution toward zero, because many of them are not participating in the workforce, so it is useful to see how the data compare with those two groups excluded.
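The point in [11] can be checked numerically by comparing the share of zero-hour observations inside and outside the 25 to 64 age band. The sketch below is only an illustration in base R: `toy` and its values are made up, and `prime_age` is a hypothetical helper column, though the column names `age` and `hrs_wk` mirror `acs_df`.

```r
# Hypothetical toy data; column names mirror acs_df but the values are made up.
toy <- data.frame(
  age    = c(17, 22, 30, 41, 55, 63, 70, 82),
  hrs_wk = c(0, 20, 40, 45, 40, 38, 0, 0)
)

# Flag prime working ages (25 to 64), then compute the share reporting zero hours
# separately for people outside ("FALSE") and inside ("TRUE") that band.
toy$prime_age <- toy$age >= 25 & toy$age <= 64
share_zero <- tapply(toy$hrs_wk == 0, toy$prime_age, mean)
print(share_zero)
```

Run on the real data, the same two-line comparison would show how much of the spike at zero hours comes from the youngest and oldest respondents.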
> ggplot(acs_df %>% filter(age >= 25 & age <= 64), aes(x = hrs_wk)) +
+   geom_histogram(binwidth = 5, fill = "green", color = "black") +
+   labs(
+     title = "Hours Worked per Week (Ages 25–64)",
+     x = "Hours Worked per Week",
+     y = "Number of Observations"
+   ) +
+   theme_minimal()
> #[13] Yes, there is a noticeable difference between the two histograms. As briefly noted in [11], restricting the sample to ages 25–64 removes age groups that were likely not working, so most of the remaining people are probably working full-time jobs. As the results show, the number of people not working at all drops sharply relative to the number working around 40 hours per week.
> # Estimate the simple linear regression
> est14 = feols(hrs_wk ~ educ, data = acs_df) |> etable()
NOTE: 181 observations removed because of NA values (RHS: 181).
> est14
                feols(hrs_wk ~ ..
Dependent Var.:            hrs_wk

Constant           -1.865 (1.264)
educ            1.867*** (0.0914)
_______________ _________________
S.E. type                     IID
Observations                9,819
R2                        0.04074
Adj. R2                   0.04064
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> #[15] The intercept implies that someone with zero years of education is predicted to work -1.865 hours per week, which is not meaningful on its own because hours cannot be negative. The slope implies that one additional year of schooling is associated with about 1.867 more hours of work per week, holding all else constant.
> #[16] Using the regression, the predicted hours for someone with 13 years of education are -1.865 + 1.867 * 13 = 22.406 hours of work per week.
> acs_df_nonzero <- acs_df %>% filter(hrs_wk > 0)
> model_nonzero <- lm(hrs_wk ~ educ, data = acs_df_nonzero)
> summary(model_nonzero)

Call:
lm(formula = hrs_wk ~ educ, data = acs_df_nonzero)

Residuals:
    Min      1Q  Median      3Q     Max
-37.853  -3.853   1.147   3.346  61.797

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  30.0567     1.0396   28.91  < 2e-16 ***
educ          0.5497     0.0735    7.48 8.49e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.62 on 6138 degrees of freedom
  (85 observations deleted due to missingness)
Multiple R-squared:  0.009033,	Adjusted R-squared:  0.008871
F-statistic: 55.95 on 1 and 6138 DF,  p-value: 8.488e-14
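To see how much the two fitted lines differ, the reported coefficients can be evaluated at the same education level. This is only a worked illustration using the rounded estimates printed in the two regression outputs above, evaluated at 13 years of education; the subscripts are labels chosen here, not names from the code.

```latex
\begin{align*}
\widehat{\text{hrs\_wk}}_{\text{all observations}} &= -1.865 + 1.867 \times 13 = 22.406 \\
\widehat{\text{hrs\_wk}}_{\text{hrs\_wk} > 0} &= 30.0567 + 0.5497 \times 13 = 37.2028
\end{align*}
```

At the same education level, dropping the zero-hour observations raises the predicted value by about 14.8 hours per week.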

> summarise(mean_hrs_wk = mean(hrs_wk))
Error: object 'hrs_wk' not found
> acs_df |>
+   filter(educ == 13) |>
+   summarise(mean_hrs_wk = mean(hrs_wk)) |>
+   pull() |>
+   hours()
Error in validObject(.Object) :
  invalid class "Period" object: periods must have integer values
> acs_df_nonzero |>
+   filter(educ == 13) |>
+   summarise(mean_hrs_wk = mean(hrs_wk)) |>
+   pull() |>
+   hours()
Error in validObject(.Object) :
  invalid class "Period" object: periods must have integer values
> acs_df_nonzero |>
+   filter(educ == 13) |>
+
+   summarise(mean_hrs_wk = mean(hrs_wk)
+
+ mean(acs_df_nonzero$hrs_wk, na.rm = TRUE)
Error: unexpected symbol in:
"
mean"
> mean(acs_df_nonzero$hrs_wk, na.rm = TRUE)
[1] 37.75245
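The "Period" errors above arise because a non-integer mean is piped into `hours()`, which builds a time-period object (it appears to be lubridate's `hours()`, loaded with the tidyverse) instead of formatting a number. A minimal sketch of the intended calculation, the mean hours among workers with exactly 13 years of education, is given here in base R on made-up data; `toy_nonzero`, `rows_13`, and `mean_hrs_13` are hypothetical names.

```r
# Hypothetical toy data standing in for acs_df_nonzero (values are made up).
toy_nonzero <- data.frame(
  educ   = c(12, 13, 13, 13, 16, NA),
  hrs_wk = c(40, 30, 40, 50, 45, 38)
)

# Keep rows with exactly 13 years of education; which() drops the row where educ is NA.
rows_13 <- which(toy_nonzero$educ == 13)

# Average hours for that group. The result is a plain number, so no hours() call is needed.
mean_hrs_13 <- mean(toy_nonzero$hrs_wk[rows_13], na.rm = TRUE)
print(mean_hrs_13)
```

In the pipe style used in the transcript, the same result should come from ending the chain at `pull()` and dropping the final `hours()` step.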

> #[18] Restricting the sample to people who work more than 0 hours a week gives a much higher mean, 37.75 hours per week. This is likely because most workers are in the 40-hours-a-week category. The result makes sense: the people who don't work pulled the average down and gave less of an idea of how many hours working people really work.
> #[19] As briefly explained above, the regression results differ because dropping the people who were not working at all reduces the variability in hours and leaves only those who work consistently. The intercept rises from -1.865 to 30.06 and the slope on educ falls from 1.867 to 0.55, compared with the earlier regression that included everyone who wasn't working.
> library(ggplot2)

> # Scatterplot
> ggplot(acs_df, aes(x = educ, y = hrs_wk)) +
+   geom_point(alpha = 0.5, color = "blue") +
+   geom_smooth(method = "lm", se = TRUE, color = "red") +
+   labs(
+     title = "Scatterplot of Hours Worked vs. Education",
+     x = "Years of Education",
+     y = "Hours Worked per Week"
+   ) +
+   theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning messages:
1: Removed 181 rows containing non-finite outside the scale range (`stat_smooth()`).
2: Removed 181 rows containing missing values or values outside the scale range (`geom_point()`).
> library(dplyr)acs_df <- acs_df %>%
Error: unexpected symbol in "library(dplyr)acs_df"
> acs_df <- acs_df %>%
+   mutate(i_zero_hrs = ifelse(hrs_wk == 0, 1, 0))
> model1 <- lm(i_zero_hrs ~ i_female, data = acs_df)
> summary(model1)

Call:
lm(formula = i_zero_hrs ~ i_female, data = acs_df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.4205 -0.4205 -0.3315  0.5795  0.6685

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.331471   0.006944  47.733   <2e-16 ***
i_female    0.089082   0.009661   9.221   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4828 on 9998 degrees of freedom
Multiple R-squared:  0.008433,	Adjusted R-squared:  0.008334
F-statistic: 85.03 on 1 and 9998 DF,  p-value: < 2.2e-16
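When a 0/1 outcome is regressed on a single 0/1 indicator, the intercept equals the outcome share in the group coded 0, and the intercept plus the slope equals the share in the group coded 1. The sketch below illustrates this on made-up data; `toy`, `fit`, `share_male`, and `share_female` are hypothetical names, while `i_female` and `i_zero_hrs` mirror the variables in the regression above.

```r
# Hypothetical toy data; variable names mirror the source, values are made up.
toy <- data.frame(
  i_female   = c(0, 0, 0, 0, 1, 1, 1, 1),
  i_zero_hrs = c(1, 0, 0, 0, 1, 1, 0, 0)
)

# Same specification as model1, fit on the toy data.
fit <- lm(i_zero_hrs ~ i_female, data = toy)
b <- coef(fit)

# Group shares computed directly from the data.
share_male   <- mean(toy$i_zero_hrs[toy$i_female == 0])  # 0.25
share_female <- mean(toy$i_zero_hrs[toy$i_female == 1])  # 0.50

# The intercept matches the male share; intercept + slope matches the female share.
print(c(intercept = unname(b[1]), intercept_plus_slope = unname(b[1] + b[2])))
print(c(share_male = share_male, share_female = share_female))
```

This is why the coefficients in the output above can be read directly as shares of each group working zero hours.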

> #[22] The intercept is the predicted probability that someone who is not female (i_female = 0, i.e., male) works zero hours. The coefficient on i_female is the difference between females and males in the probability of working zero hours.
> #[23] Apologies, I answered incorrectly and couldn't go back and fix it. With an intercept of 0.3315, 33.15% of males in the data are predicted to work zero hours. Adding the 0.0891 coefficient to the intercept gives 33.15% + 8.91% = 42.06%, so 42.06% of females are predicted to work zero hours.
> model2 <- lm(i_zero_hrs ~ i_female + educ + i_female:educ, data = acs_df)
> summary(model2)

Call:
lm(formula = i_zero_hrs ~ i_female + educ + i_female:educ, data = acs_df)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9316 -0.3764 -0.2548  0.5437  0.7756

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.741055   0.041608  17.810  < 2e-16 ***
i_female       0.401797   0.058557   6.862 7.22e-12 ***
educ          -0.030391   0.003034 -10.018  < 2e-16 ***
i_female:educ -0.022420   0.004237  -5.292 1.24e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4722 on 9815 degrees of freedom
  (181 observations deleted due to missingness)
Multiple R-squared:  0.04868,	Adjusted R-squared:  0.04838
F-statistic: 167.4 on 3 and 9815 DF,  p-value: < 2.2e-16
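With an interaction term, the effect of education differs by gender, and a prediction combines all four coefficients. As a worked illustration using only the estimates printed above, where $\hat{P}$ is notation introduced here for the predicted probability of working zero hours:

```latex
\begin{align*}
\text{slope on educ, } \text{i\_female} = 0 &: \; -0.030391 \\
\text{slope on educ, } \text{i\_female} = 1 &: \; -0.030391 + (-0.022420) = -0.052811 \\
\hat{P}(\text{i\_female} = 1,\ \text{educ} = 13) &= 0.741055 + 0.401797 + (-0.052811)(13) \\
&= 1.142852 - 0.686543 \approx 0.4563
\end{align*}
```

So a female with 13 years of education has a predicted probability of about 45.6% of working zero hours, and each additional year of education lowers that probability by more for females than for males.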

> #[25] This regression adds education and its interaction with gender, so the coefficients read differently from the previous regression. The intercept, 0.7411, is the baseline of the model: the predicted probability of working zero hours for a male with zero years of education. The i_female coefficient, 0.4018, means that at zero years of education females are predicted to be 40.18 percentage points more likely than males to work zero hours. The educ coefficient, -0.0304, means that for males each year of education reduces the probability of working zero hours by 3.04 percentage points. The interaction coefficient, -0.0224, is the additional reduction per year of education for females, so the two must be added to get the full picture: for females, each year of education reduces the probability of working zero hours by 5.28 percentage points (3.04 + 2.24). Overall, education has a stronger negative effect on the probability of working zero hours for females than for males.
> new_data <- data.frame(i_female = 1, educ = 13)
> predict(model2, newdata = new_data)
        1
0.4563125
> summary(model2)$r.squared
[1] 0.04867559
> #[28] Yes. The regression for [22] does not include age, even though the histograms showed that age matters for working hours. Leaving out a variable that affects the outcome can cause omitted variable bias when that variable is also correlated with a regressor in the model. The model could then show an incorrect relationship between working hours and things like gender.
> #[29] Building on the previous answer, for there to be no bias the following would need to be true: the model is linear, meaning the relationship between the dependent and independent variables is linear; there is no perfect collinearity, meaning the independent variables are not perfectly correlated with one another; the sample is random; and the error term has mean zero conditional on the regressors, meaning it is not correlated with the variables in the regression. All of this must hold for the estimates to be unbiased.
> #[30] The main takeaway from the data is that there is a positive correlation between education and working hours: more years of education were associated with more hours worked. There are also clear gender differences in the data: females had a higher likelihood of working 0 hours than males. Finally, there was a clear age pattern: young and old people were less likely to be working, while people in the middle of the age range worked the most hours, with hours declining again at older ages. All in all, what I got from the data is that going to school is important if you want to work, and that there are many patterns within the workforce that are important to understand as I get closer to being there myself.



Garrick Lammers

2025-04-18
