In this lab assignment you’ll analyze a dataset called Union Membership, Coverage, and Earnings from the CPS, compiled by Barry Hirsch (Georgia State University), David Macpherson (Trinity University), and William Even (Miami University). The years in the dataset run from 1983 to 2022. The dataset’s unit of analysis is a company in a particular year.
The variables in unions.csv are:
Employment: Total number of employees, in thousands, at the company in all the company’s years in the survey.
PercentUnionMembers: Percent of employed workers who are union members.
Year: Year of the survey.
State: State name.
Wage: Mean hourly earnings in nominal dollars.
As usual, you should download the .Rmd template for the lab to the same folder as the dataset, which is called "unions.csv". Open RStudio and load the dataset using the command:
Unions <- read.csv("unions.csv")
This will load the dataset. After loading the dataset, you should attach it and have R print the names of the variables in the dataset. If you look at the dataset, you’ll notice that the first rows contain a lot of NAs. Don’t be concerned: if you keep scrolling through the rows, you’ll see rows without NAs.
Unions <- read.csv("unions.csv")
attach(Unions)
names(Unions)
## [1] "Employment" "PercentUnionMembers" "Wage"
## [4] "Year" "State"
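Since the first rows are mostly NA, counting missing values per column can be quicker than scrolling. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in for Unions, not the lab data) to show `colSums(is.na())` and `complete.cases()`.

```r
# Toy stand-in for Unions; the values here are made up for illustration.
toy <- data.frame(
  Employment          = c(NA, NA, 1200, 3400),
  PercentUnionMembers = c(NA, NA, 0.12, 0.30),
  Wage                = c(3.96, 3.14, 15.20, 18.75)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows with no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

The same two calls should work on Unions once it is loaded, assuming you replace `toy` with `Unions`.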
Now we’ll create a new variable that is Employment divided by 1,000, which rescales it to thousands of its original units. Call this new variable
Employment10000 and create it based on the original
variable Employment. For some companies Employment is
extremely large, since it counts all employees who worked at the
company at any point in the 30-year time span of the survey.
Hint: this new variable can be created simply by dividing the
original variable by 1000.
Unions$Employment10000 <- Unions$Employment / 1000
head(Unions)
## Employment PercentUnionMembers Wage Year State Employment10000
## 1 NA NA 3.963343 1973 <NA> NA
## 2 NA NA 3.142819 1973 <NA> NA
## 3 NA NA 3.853046 1973 <NA> NA
## 4 NA NA 5.124393 1973 <NA> NA
## 5 NA NA 3.963343 1973 <NA> NA
## 6 NA NA 3.963343 1973 <NA> NA
Make a histogram of the Employment10000 variable and use
the summary() function to calculate some basic descriptive
statistics for this variable, and briefly comment on what you
learned.
# Create a histogram of Employment10000
hist(Unions$Employment10000,
     main = "Histogram of Employment10000",
     xlab = "Employment / 1,000",  # the variable is Employment divided by 1,000
     ylab = "Frequency",
     col = "lightblue",
     border = "black")
# Calculate descriptive statistics for Employment10000
summary(Unions$Employment10000)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.151 121.602 345.032 1025.179 1153.099 16504.474 207
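The histogram shows that the data are heavily skewed to the right: the mean (1,025.2) is roughly three times the median (345.0), and the maximum (16,504.5) is far above the third quartile (1,153.1). This suggests that most companies have relatively low employment numbers and only a few have very large ones.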
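When a variable is this right-skewed, most of a raw-scale histogram collapses into the first bin or two. One common option, sketched here on simulated values rather than the lab data (`emp` and `log_emp` are hypothetical names), is to plot the histogram on a log scale.

```r
# Simulated right-skewed values; these are NOT the lab data.
set.seed(1)
emp <- rlnorm(1000, meanlog = 6, sdlog = 1.5)

# In right-skewed data the mean sits above the median
print(mean(emp) > median(emp))

# Histogram on the log10 scale spreads out the crowded low end
log_emp <- log10(emp)
hist(log_emp,
     main = "Histogram of log10(simulated employment)",
     xlab = "log10(employment)",
     ylab = "Frequency",
     col = "lightblue",
     border = "black")
```

A log transform only applies to strictly positive values, which holds for Employment10000 here given its reported minimum of 2.151.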
Make a histogram of the PercentUnionMembers variable and
use the summary() function to calculate some basic
descriptive statistics for this variable, and briefly comment on what
you learned.
# Create a histogram of PercentUnionMembers
hist(Unions$PercentUnionMembers,
     main = "Histogram of PercentUnionMembers",
     xlab = "Percent of Union Members",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black")
# Calculate descriptive statistics for PercentUnionMembers
summary(Unions$PercentUnionMembers)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.07183 0.12824 0.16707 0.21380 0.74120 207
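This histogram is also heavily skewed to the right: the median is 0.128 and the mean is 0.167, while the maximum is 0.741. Note that the variable is recorded as a proportion (0 to 1) rather than as a percent (0 to 100). This suggests that in most companies only a small share of workers are part of a union.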
Make a scatterplot of the PercentUnionMembers variable
(x-axis) against Employment (y-axis) and comment on
what you learn from this.
# Create a scatterplot of PercentUnionMembers vs. Employment
plot(Unions$PercentUnionMembers, Unions$Employment,
     main = "Scatterplot of PercentUnionMembers vs. Employment",
     xlab = "Percent of Union Members",
     ylab = "Employment",
     col = "blue",
     pch = 19)
# Optionally, add a regression line to observe the trend
abline(lm(Unions$Employment ~ Unions$PercentUnionMembers), col = "red")
Most points in this scatter plot cluster at low values of PercentUnionMembers, so companies with low union membership shares make up the majority of the dataset. The companies with the highest employment also tend to have the lowest union membership shares, and the fitted line slopes downward.
Estimate a linear regression predicting a company’s employment with
PercentUnionMembers. Print a summary of the results and
comment on what you learn from the estimates, including
any relevant statistical significance.
Hint: Keep in mind what PercentUnionMembers is – for
example, what does an increase of 1 unit in the
PercentUnionMembers variable mean?
# Fit the linear regression model
model <- lm(Employment ~ PercentUnionMembers, data = Unions)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = Employment ~ PercentUnionMembers, data = Unions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1322060 -907684 -597370 163828 15451311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1324292 5450 242.99 <2e-16 ***
## PercentUnionMembers -1790312 25332 -70.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1768000 on 265198 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.01849, Adjusted R-squared: 0.01848
## F-statistic: 4995 on 1 and 265198 DF, p-value: < 2.2e-16
The regression shows a statistically significant negative association between PercentUnionMembers and Employment. Because PercentUnionMembers is recorded as a proportion (its values run from 0 to about 0.74), a one-unit increase means going from 0% to 100% union membership, and the coefficient of -1,790,312 applies to that full change. A one-percentage-point increase (0.01 units) is therefore associated with a decrease of about 17,903 in Employment. Both the intercept and the PercentUnionMembers coefficient are highly statistically significant (p < 0.001), but the model explains only about 1.85% of the variability in employment, so the relationship is weak in practical terms and other factors likely play a much larger role in determining employment levels. The regression describes an association, not a causal effect, and further investigation is needed to identify additional influencing variables.
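The hint about what a 1-unit increase means can be checked numerically. The sketch below uses simulated data (`share`, `pct`, and `y` are made-up names and values, not the lab data) to show that rescaling a proportion to a 0-100 percent scale divides the slope by 100, and then applies the same conversion to the coefficient from the output above.

```r
# Simulated data; the values are made up for illustration.
set.seed(42)
share <- runif(200, min = 0, max = 0.75)           # proportion, like PercentUnionMembers
y     <- 1000 - 500 * share + rnorm(200, sd = 50)  # made-up outcome

# Slope per one-unit change in the proportion (0 to 1, i.e. 0% to 100%)
fit_prop <- lm(y ~ share)

# Slope per one percentage point: rescale the predictor to 0-100
pct     <- share * 100
fit_pct <- lm(y ~ pct)

slope_prop <- unname(coef(fit_prop)["share"])
slope_pct  <- unname(coef(fit_pct)["pct"])

# The two slopes differ by a factor of 100 (up to floating-point error)
print(all.equal(slope_prop / 100, slope_pct))

# Same conversion applied to the coefficient reported in the lab output
per_point <- -1790312 * 0.01
print(per_point)
```

Rescaling the predictor changes only the units of the slope; the fit, t value, and R-squared stay the same.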
Make a scatterplot of the PercentUnionMembers variable
(x-axis) against Wage (y-axis) and comment on what you
learn from this.
# Create a scatterplot of PercentUnionMembers vs. Wage
plot(Unions$PercentUnionMembers, Unions$Wage,
     main = "Scatterplot of PercentUnionMembers vs. Wage",
     xlab = "Percent of Union Members",
     ylab = "Wage",
     col = "green",
     pch = 19)
# Optionally, add a regression line to observe the trend
abline(lm(Unions$Wage ~ Unions$PercentUnionMembers), col = "red")
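Most points in this scatter plot again cluster at low values of PercentUnionMembers. The regression line slopes slightly downward, showing that companies with lower union membership shares tend to have slightly higher mean wages than companies with higher shares. Because each observation is a company in a particular year, this is a pattern across companies rather than a comparison of individual union and non-union workers.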
Estimate a linear regression predicting a company’s wage with
PercentUnionMembers. Print a summary of the results and
comment on what you learn from the estimates, including
any relevant statistical significance.
Hint: Keep in mind what PercentUnionMembers is – for
example, what does an increase of 1 unit in the
PercentUnionMembers variable mean?
# Fit the linear regression model
wage_model <- lm(Wage ~ PercentUnionMembers, data = Unions)
# Print the summary of the model
summary(wage_model)
##
## Call:
## lm(formula = Wage ~ PercentUnionMembers, data = Unions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.520 -5.931 -1.298 4.516 34.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.40201 0.02318 836.98 <2e-16 ***
## PercentUnionMembers -9.32268 0.10775 -86.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.521 on 265198 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.02745, Adjusted R-squared: 0.02745
## F-statistic: 7486 on 1 and 265198 DF, p-value: < 2.2e-16
The regression shows a statistically significant negative association between PercentUnionMembers and Wage, with the coefficient for PercentUnionMembers estimated at -9.32. Because PercentUnionMembers is recorded as a proportion, this is the predicted change in mean hourly wage for a move from 0% to 100% union membership; a one-percentage-point increase (0.01 units) is associated with a decrease of only about $0.09. Both the intercept and the PercentUnionMembers coefficient are highly statistically significant (p < 0.001). However, the model explains only about 2.75% of the variability in wages, so the association is weak in practical terms. Other factors likely influence wages, and the relationship observed may not fully capture the complexity of wage determination in these companies.
Print the row of the dataset for the one particular company which is
on row 256401: Unions[256401,] This will show you the wage
and percent union workers for this particular company in row 256401.
(Hint: you will simply need to plug the values you get from the row
into y = mx + b.)
Based on the regression results above, what would you predict the
value of wage would be for a company that has the same
percent union workers as the company on row 256401?
Briefly comment on whether the company seems to be typical of companies with its percent union workers in terms of wage, based on these calculations.
Unions[256401,]
## Employment PercentUnionMembers Wage Year State Employment10000
## 256401 47878.67 0.1302603 45.0488 2021 Maine 47.87867
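For the company in row 256401, PercentUnionMembers is 0.1302603 (about 13.03% union members) and the actual wage is 45.0488 dollars. Plugging the proportion into the regression equation gives a predicted wage of 19.40201 + (-9.32268)(0.1302603) ≈ 18.19 dollars. The actual wage is about 26.86 dollars higher than predicted, which is roughly 3.6 times the residual standard error of 7.521. This suggests that the company is atypical for its percentage of union workers, likely due to other influencing factors not accounted for in the model, such as industry standards or company performance. Another possible factor is time: wages are in nominal dollars and this observation is from 2021, near the end of the sample.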
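The y = mx + b calculation from the hint can be sketched directly in R. The coefficients and row values below are copied from the output above; the variable names are made up for illustration. Note that PercentUnionMembers must be entered as the proportion stored in the data (0.1302603), not as 13.03.

```r
# Coefficients copied from the summary(wage_model) output
intercept <- 19.40201
slope     <- -9.32268

# Values from row 256401
x_row       <- 0.1302603   # PercentUnionMembers, stored as a proportion
actual_wage <- 45.0488

# y = mx + b
predicted_wage <- intercept + slope * x_row
residual       <- actual_wage - predicted_wage

# Residual in units of the residual standard error (7.521 in the output)
residual_in_se <- residual / 7.521

print(round(predicted_wage, 2))   # 18.19
print(round(residual, 2))         # 26.86
print(round(residual_in_se, 1))   # 3.6
```

With the fitted model still in memory, `predict(wage_model, newdata = Unions[256401, ])` should return the same prediction without copying the coefficients by hand.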