Lab 5 Template: Linear Regression

In this lab assignment you’ll analyze a dataset called Union Membership, Coverage, and Earnings from the CPS, compiled by Barry Hirsch (Georgia State University), David Macpherson (Trinity University), and William Even (Miami University). The years in the dataset run from 1983 to 2022. The dataset’s unit of analysis is a company in a particular year.

The variables in unions.csv are:

Employment: Total number of employees (in thousands) at the company across all of the company’s years in the survey.
PercentUnionMembers: Percent of employed workers who are union members.
Year: Year of the survey.
State: State name.
Wage: Mean hourly earnings in nominal dollars.

As usual, you should download the .Rmd template for the lab to the same folder as the dataset, which is called "unions.csv".

Question 1: Loading and Exploring the Dataset

Open RStudio and load the dataset using the command: Unions <- read.csv("unions.csv")

This will load the dataset. After loading it, attach the dataset and have R print the names of its variables. If you choose to look at the dataset, its first rows contain a lot of NAs. Don’t be concerned: if you keep scrolling through the rows you’ll see rows without NAs.

Unions <- read.csv("unions.csv")
attach(Unions)
names(Unions)

## [1] "Employment"          "PercentUnionMembers" "Wage"               
## [4] "Year"                "State"

Question 2: Creating a new variable

Now we’ll create a new variable that is Employment divided by 1,000. Call this new variable Employment10000 and create it from the original variable Employment. Note that, despite its name, Employment10000 is Employment divided by 1,000 (not 10,000). For some companies Employment is extremely large, since it counts all employees who worked at the company at any point in the 30-year time span of the survey.

Hint: this new variable can be created simply by dividing the original variable by 1000.

Unions$Employment10000 <- Unions$Employment / 1000

head(Unions)

##   Employment PercentUnionMembers     Wage Year State Employment10000
## 1         NA                  NA 3.963343 1973  <NA>              NA
## 2         NA                  NA 3.142819 1973  <NA>              NA
## 3         NA                  NA 3.853046 1973  <NA>              NA
## 4         NA                  NA 5.124393 1973  <NA>              NA
## 5         NA                  NA 3.963343 1973  <NA>              NA
## 6         NA                  NA 3.963343 1973  <NA>              NA
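The NA rows printed above can be counted rather than scrolled past. The following is a minimal sketch, assuming a small data frame (called toy here, a hypothetical name) rebuilt from three of the rows printed by head(Unions) plus the complete row 256401 printed in Question 9. On the real data, the same calls would be applied to Unions instead.

```r
# Hypothetical mini data frame rebuilt from rows printed in this lab
# (rows 1-3 from head(Unions) and row 256401); it is not the full dataset.
toy <- data.frame(
  Employment          = c(NA, NA, NA, 47878.67),
  PercentUnionMembers = c(NA, NA, NA, 0.1302603),
  Wage                = c(3.963343, 3.142819, 3.853046, 45.0488),
  Year                = c(1973, 1973, 1973, 2021),
  State               = c(NA, NA, NA, "Maine")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running colSums(is.na(Unions)) on the full dataset should report how many rows are missing each variable, which can be compared with the NA counts that summary() prints later in the lab.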

Question 3: Visualizing employment

Make a histogram of the Employment10000 variable, use the summary() function to calculate some basic descriptive statistics for this variable, and briefly comment on what you learned.

# Create a histogram of Employment10000
hist(Unions$Employment10000, 
     main = "Histogram of Employment10000", 
     xlab = "Employment / 1,000", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black")

# Calculate descriptive statistics for Employment10000
summary(Unions$Employment10000)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##     2.151   121.602   345.032  1025.179  1153.099 16504.474       207

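The histogram shows that the data are heavily skewed to the right: the mean (1,025.18) is about three times the median (345.03), and the maximum (16,504.47) is far above the third quartile (1,153.10). This suggests that most companies have relatively low employment numbers and only a few have very large ones.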

Question 4: Visualizing PercentUnionMembers

Make a histogram of the PercentUnionMembers variable, use the summary() function to calculate some basic descriptive statistics for this variable, and briefly comment on what you learned.

# Create a histogram of PercentUnionMembers
hist(Unions$PercentUnionMembers, 
     main = "Histogram of PercentUnionMembers", 
     xlab = "Percent of Union Members", 
     ylab = "Frequency", 
     col = "lightgreen", 
     border = "black")

# Calculate descriptive statistics for PercentUnionMembers
summary(Unions$PercentUnionMembers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.07183 0.12824 0.16707 0.21380 0.74120     207

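This histogram is also heavily skewed to the right. The variable is recorded as a proportion (it ranges from 0 to 0.74), and the median is about 0.128, so at the typical company roughly 13% of workers are union members. Three quarters of the observations are below about 21%. This suggests that most workers are not a part of a union.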

Question 5: Associations between variables

Make a scatterplot with PercentUnionMembers on the x-axis and Employment on the y-axis, and comment on what you learn from this.

# Create a scatterplot of PercentUnionMembers vs. Employment
plot(Unions$PercentUnionMembers, Unions$Employment, 
     main = "Scatterplot of PercentUnionMembers vs. Employment",
     xlab = "Percent of Union Members", 
     ylab = "Employment", 
     col = "blue", 
     pch = 19)

# Optionally, add a regression line to observe the trend
abline(lm(Unions$Employment ~ Unions$PercentUnionMembers), col = "red")

In this scatterplot the points are concentrated at low values of PercentUnionMembers, which suggests that companies with lower percentages of union membership make up the majority of the dataset. Also, the companies with the highest employment tend to have the lowest percent of union members.

Question 6: Estimating a linear regression

Estimate a linear regression predicting a company’s employment with PercentUnionMembers. Print a summary of the results and comment on what you learn from the estimates, including any relevant statistical significance.

Hint: Keep in mind what PercentUnionMembers is – for example, what does an increase of 1 unit in the PercentUnionMembers variable mean?

# Fit the linear regression model
model <- lm(Employment ~ PercentUnionMembers, data = Unions)

# Print the summary of the model
summary(model)

## 
## Call:
## lm(formula = Employment ~ PercentUnionMembers, data = Unions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1322060  -907684  -597370   163828 15451311 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1324292       5450  242.99   <2e-16 ***
## PercentUnionMembers -1790312      25332  -70.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1768000 on 265198 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.01849,    Adjusted R-squared:  0.01848 
## F-statistic:  4995 on 1 and 265198 DF,  p-value: < 2.2e-16

The regression analysis reveals a statistically significant but weak negative relationship between PercentUnionMembers and employment. Because PercentUnionMembers is recorded as a proportion (it ranges from 0 to 0.74), a 1-unit increase corresponds to going from 0% to 100% union membership. The coefficient of -1,790,312 therefore implies that each one-percentage-point (0.01) increase in union membership is associated with Employment that is lower by approximately 17,903. Both the intercept and the PercentUnionMembers coefficient are highly statistically significant (p < 0.001), which is not surprising with more than 265,000 observations. However, the model explains only about 1.85% of the variability in employment, indicating that other factors play a much larger role in determining employment levels. Thus, while higher union membership is associated with lower employment, the regression does not show that union membership causes lower employment, and the low R-squared value suggests further investigation is needed to identify additional influencing variables.
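The hint about what a 1-unit change means can be checked with a little arithmetic. The sketch below assumes PercentUnionMembers is stored as a proportion (its summary() output runs from 0 to 0.74), so one percentage point is 0.01 units. The variable names are illustrative, and the numbers are copied from the summary(model) output above.

```r
# Values copied from the summary(model) output above
slope_per_unit <- -1790312  # change in Employment per 1-unit change in PercentUnionMembers
r_squared      <- 0.01849   # Multiple R-squared

# One percentage point is 0.01 units when the predictor is a proportion
slope_per_point <- slope_per_unit * 0.01

# In a one-predictor regression, R-squared is the squared correlation;
# the correlation takes the sign of the slope
implied_r <- sign(slope_per_unit) * sqrt(r_squared)

print(slope_per_point)
print(round(implied_r, 3))
```

A correlation of this size is weak, which is consistent with a coefficient that is statistically significant yet explains under 2% of the variation in Employment.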

Question 7: Associations between variables

Make a scatterplot with PercentUnionMembers on the x-axis and Wage on the y-axis, and comment on what you learn from this.

# Create a scatterplot of PercentUnionMembers vs. Wage
plot(Unions$PercentUnionMembers, Unions$Wage, 
     main = "Scatterplot of PercentUnionMembers vs. Wage",
     xlab = "Percent of Union Members", 
     ylab = "Wage", 
     col = "green", 
     pch = 19)

# Optionally, add a regression line to observe the trend
abline(lm(Unions$Wage ~ Unions$PercentUnionMembers), col = "red")

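In this scatterplot the points are again concentrated at low values of PercentUnionMembers. The regression line slopes downward, showing that companies with a lower percent of union members tend to have somewhat higher mean wages than companies with a higher percent. Because the data are at the company level, this does not by itself show that individual non-union workers earn more than union workers.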

Question 8: Estimating a linear regression, Again!

Estimate a linear regression predicting a company’s wage with PercentUnionMembers. Print a summary of the results and comment on what you learn from the estimates, including any relevant statistical significance.

Hint: Keep in mind what PercentUnionMembers is – for example, what does an increase of 1 unit in the PercentUnionMembers variable mean?

# Fit the linear regression model
wage_model <- lm(Wage ~ PercentUnionMembers, data = Unions)

# Print the summary of the model
summary(wage_model)

## 
## Call:
## lm(formula = Wage ~ PercentUnionMembers, data = Unions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.520  -5.931  -1.298   4.516  34.910 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         19.40201    0.02318  836.98   <2e-16 ***
## PercentUnionMembers -9.32268    0.10775  -86.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.521 on 265198 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.02745,    Adjusted R-squared:  0.02745 
## F-statistic:  7486 on 1 and 265198 DF,  p-value: < 2.2e-16

The regression analysis indicates a statistically significant but weak negative relationship between PercentUnionMembers and wage, with the coefficient for PercentUnionMembers estimated at -9.32. Because PercentUnionMembers is recorded as a proportion, a 1-unit increase means going from 0% to 100% union membership, so each one-percentage-point (0.01) increase in union membership is associated with a mean hourly wage that is lower by approximately $0.09. Both the intercept and the PercentUnionMembers coefficient are highly statistically significant (p < 0.001), providing strong evidence that the association is not zero, although this does not establish that union membership causes lower wages. The model explains only about 2.75% of the variability in wages, suggesting that other factors influence wages and that the relationship observed does not capture the complexity of wage determination in these companies.
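The t value and the per-percentage-point effect can both be reproduced from the coefficient table. This is a small sketch using numbers copied from the summary(wage_model) output above; the variable names are illustrative, and the rescaling assumes PercentUnionMembers is a proportion.

```r
# Values copied from the PercentUnionMembers row of summary(wage_model)
estimate  <- -9.32268
std_error <- 0.10775

# The t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Change in mean hourly wage per one-percentage-point (0.01) change
wage_per_point <- estimate * 0.01

print(round(t_value, 2))
print(round(wage_per_point, 4))
```

An estimate that is many standard errors from zero yields a tiny p-value, which says the slope is estimated precisely, not that it is large.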

Question 9: Prediction with linear regression

Print the row of the dataset for the one particular company on row 256401: Unions[256401,] This will show you the wage and percent union workers for this particular company. (Hint: you will simply need to plug the values you get from the row into y = mx + b.)

Based on the regression results above, what would you predict the value of wage to be for a company that has the same percent union workers as the company on row 256401?

Briefly comment on whether the company seems to be typical, in terms of wage, of companies with its percent union workers, based on these calculations.

Unions[256401,]

##        Employment PercentUnionMembers    Wage Year State Employment10000
## 256401   47878.67           0.1302603 45.0488 2021 Maine        47.87867

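For the company in row 256401, which has 13.03% union members (PercentUnionMembers = 0.1302603) and an actual wage of 45.0488 dollars, the predicted wage based on the regression model is approximately 19.40 + (-9.32)(0.1303) = 18.19 dollars. The proportion, not 13.03, must be plugged into the equation; using 13.03 produces an impossible negative wage. The actual wage is about 26.86 dollars above the prediction, more than three times the residual standard error of 7.521, so the company is atypical for its percentage of union workers. This is likely due to factors not accounted for in the model, such as the survey year (wages are in nominal dollars and this observation is from 2021), industry standards, or company performance.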
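The y = mx + b calculation from the hint can be written out directly. This is a minimal sketch using the coefficients and residual standard error printed by summary(wage_model) and the values printed for row 256401. The variable names are illustrative, and PercentUnionMembers is entered as the proportion 0.1302603, exactly as it is stored in the dataset.

```r
# Coefficients and residual standard error copied from summary(wage_model)
intercept <- 19.40201
slope     <- -9.32268
resid_se  <- 7.521

# Values printed for row 256401
pct_union   <- 0.1302603  # stored as a proportion, not 13.03
actual_wage <- 45.0488

# y = mx + b
predicted_wage <- intercept + slope * pct_union

# How far the actual wage sits from the prediction,
# in dollars and in residual standard errors
residual    <- actual_wage - predicted_wage
resid_in_se <- residual / resid_se

print(round(predicted_wage, 2))
print(round(residual, 2))
print(round(resid_in_se, 2))
```

With the fitted model object available, predict(wage_model, newdata = data.frame(PercentUnionMembers = 0.1302603)) should give the same prediction up to rounding of the printed coefficients.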