project3

Panel Data Model

Briefly discuss your data and the question you are trying to answer with your model.

rm(list=ls())
# essential libraries
library(AER)

## Warning: package 'AER' was built under R version 4.2.3

## Loading required package: car

## Warning: package 'car' was built under R version 4.2.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.3

## Loading required package: lmtest

## Warning: package 'lmtest' was built under R version 4.2.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.2.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.2.3

## Loading required package: survival

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library(plm)

## Warning: package 'plm' was built under R version 4.2.3

library(coefplot)

## Warning: package 'coefplot' was built under R version 4.2.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.2.3

library(gplots) # for plot means

## Warning: package 'gplots' was built under R version 4.2.3

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(readxl)

## Warning: package 'readxl' was built under R version 4.2.3

library(ggplot2)
library(heatmaply)

## Warning: package 'heatmaply' was built under R version 4.2.3

## Loading required package: plotly

## Warning: package 'plotly' was built under R version 4.2.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: viridis

## Warning: package 'viridis' was built under R version 4.2.3

## Loading required package: viridisLite

## Warning: package 'viridisLite' was built under R version 4.2.3

## 
## ======================
## Welcome to heatmaply version 1.4.2
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================

library(knitr)

## Warning: package 'knitr' was built under R version 4.2.3

library(broom)

## Warning: package 'broom' was built under R version 4.2.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plm':
## 
##     between, lag, lead

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.2.3

# retrieve dataset "PSID7682" from library(AER)
data("PSID7682")

# create a subset with the first 15 individuals for our model
earnings0 <- subset(PSID7682, id == "1"| id == "2" | id == "3" | id == "4" | id == "5" | id == "6"| id == "7" | id == "8" | id == "9" | id == "10" | id == "11" | id == "12" | id == "13" | id == "14" | id == "15")

# transform to panel dataset
earnings_pd <- pdata.frame(earnings0, index=c("id", "year"))
# check class
class(earnings_pd)

## [1] "pdata.frame" "data.frame"

# beginning of dataset
head(earnings_pd)

# dimensions: how many units are you doing with and how many time periods are you looking at?
dim(earnings_pd)

## [1] 105  14

# is the dataset balanced? i.e. does each indiviudal have the same time periods?
is.pbalanced(earnings_pd)

## [1] TRUE

The data set we selected is called “PSID7682,” from the library AER. This data set comprises the earnings of 595 individuals for the years 1976–1982, originating from the Panel Study of Income Dynamics. Using this data set, we will create the best-fit model that demonstrates the relationship between the wage of individuals across the periods 1976-1982 and the following variables: experience (years of work), weeks (weeks worked), occupation (white- or blue- collar), industry (individual does/does not work in a manufacturing industry), south (individual does/does not reside in the South), smsa (individual does/does not reside in a standard metropolitan statistical area), married (individual is/is not married), gender, union (individual’s wage is/is not set by a union contract), education (years), and ethnicity (individual is African-American or another ethnicity).

We will model the earnings of the first 15 individuals recorded in this data set. The dimensions of the panel data frame are 105 rows by 14 columns. We used the is.pbalanced function to determine that the dataset is balanced.

Provide a descriptive analysis of your variables. This should include histograms and fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five- number summary). All figures must include comments.

names(earnings_pd)

##  [1] "experience" "weeks"      "occupation" "industry"   "south"     
##  [6] "smsa"       "married"    "gender"     "union"      "education" 
## [11] "ethnicity"  "wage"       "year"       "id"

# abbreviate variables
exper <- earnings_pd$experience
wks <- earnings_pd$weeks
occ <- earnings_pd$occupation
ind <- earnings_pd$industry
educ <- earnings_pd$education
ethn <- earnings_pd$ethnicity

# pooled model with first 15 individuals
pooledreg <- plm(wage~exper + weeks + occ + ind + south + smsa + married + gender + union + educ + ethn, model="pooling", data=earnings_pd, index=c("id", "year"))
summary(pooledreg)

## Pooling Model
## 
## Call:
## plm(formula = wage ~ exper + weeks + occ + ind + south + smsa + 
##     married + gender + union + educ + ethn, data = earnings_pd, 
##     model = "pooling", index = c("id", "year"))
## 
## Balanced Panel: n = 15, T = 7, N = 105
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -385.134 -132.619  -39.281  108.591  585.587 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)    
## (Intercept)  -558.7887   259.3884 -2.1543 0.033805 *  
## exper          22.8804     3.7166  6.1563 1.87e-08 ***
## weeks          13.7461     4.0492  3.3948 0.001011 ** 
## occblue      -149.0303   101.7773 -1.4643 0.146490    
## indyes         25.5615    58.4400  0.4374 0.662837    
## southyes      -93.7503    64.8207 -1.4463 0.151455    
## smsayes       157.3728    61.5495  2.5568 0.012182 *  
## marriedyes    -50.6621   102.1541 -0.4959 0.621108    
## genderfemale -453.5244   144.3212 -3.1425 0.002248 ** 
## unionyes     -137.4925    67.6039 -2.0338 0.044821 *  
## educ           42.2579    13.6334  3.0996 0.002564 ** 
## ethnafam     -252.8175   138.5603 -1.8246 0.071272 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    11378000
## Residual Sum of Squares: 3910700
## R-Squared:      0.6563
## Adj. R-Squared: 0.61565
## F-statistic: 16.144 on 11 and 93 DF, p-value: < 2.22e-16

In this step, we simplified our variables and constructed a pooled model with just the first 15 individuals of our dataset.

Histogram and Fitted Distribution of Variables

variables <- all.vars(formula(pooledreg))
for (variable in variables) {
  # Get the values of the variable from the data and convert to numeric
  values <- as.numeric(earnings_pd[[variable]])
  
  # Check if there are at least 2 data points
  if (length(values) >= 2) {
    # Estimate the density of the values
    density <- density(values)
    
    # Create a new plot for each variable
    hist(values, freq = FALSE, col = "lightblue", main = paste("Fitted Histogram -", variable))
    lines(density, col = "red", lwd = 2)
  } else {
    # Skip this variable if there are fewer than 2 data points
    cat("Skipping variable:", variable, "- not enough data points for density estimation\n")
  }
}

## Skipping variable: exper - not enough data points for density estimation

## Skipping variable: occ - not enough data points for density estimation
## Skipping variable: ind - not enough data points for density estimation

## Skipping variable: educ - not enough data points for density estimation
## Skipping variable: ethn - not enough data points for density estimation

Wage: The distribution is slightly right skewed with a high density of the data points being concentrated between wages of 250 to 1000. The extremeties of the histogram (very low wage or very high wage) have very low densities.

Weeks: The data is very clearly left-skewed with some clear outliers in between 0 to 20 weeks worked. It is evident that most individuals in our dataset worked a majority of the year.

South: Looking at the histogram, it seems that the distribution between South and non-South residents is rather equal, with a slight majority belonging to people not residing in the South.

SMSA: Looking at the histogram, there seems to be a slight majority of the data belonging to the group of people that do reside in a SMSA.

Married: Looking at the histogram, it is clear that a majority of the people in this dataset are married and that single people make up the minority of the data points.

Gender: Looking at the histogram, the individuals of this data set are mostly male with only a few female individuals.

Union: Looking at the histogram, about a third of the data points are contracted with a union while the other two-thirds are not.

Other variables: Due to a lack of data points for density estimation, other variables can not have a histogram generated for them. We will however conduct further analysis using other metrics as seen below.

Correlation Matrix of Non-Binary Variables

# Extract the variables used in pooledreg1
variables <- c("experience", "weeks", "education")

# Subset the data frame 'earnings_pd' to include only the selected variables
data_subset <- earnings_pd[, variables]

# Compute the correlation matrix
cor_matrix <- cor(data_subset)

# Print the correlation matrix
print(cor_matrix)

##            experience      weeks  education
## experience  1.0000000 -0.1306444 -0.4680834
## weeks      -0.1306444  1.0000000  0.1556170
## education  -0.4680834  0.1556170  1.0000000

Based on the above correlation matrix, we observe that experience and education seem to have a weak negative correlation with each other (-0.47). Aside from that, there seems to be no correlation in regards to the pairs of experience/weeks and education/weeks. The correlation matrix only include the non-binary independent variables as it is not reasonable to do so with binary variables such as occupation.

Scatterplot of Variables

library(ggplot2)

# pooled model with first 15 individuals
pooledreg <- plm(wage ~ exper + weeks + occ + ind + south + smsa + married + gender + union + educ + ethn, model = "pooling", data = earnings_pd, index = c("id", "year"))

# Get the independent variables used in the model
independent_vars <- names(pooledreg$model)[-1]

# Loop over the variables and create scatterplots
for (var in independent_vars) {
  plot_data <- data.frame(x = pooledreg$model[[var]], y = pooledreg$model$wage)
  plot_title <- paste("Scatterplot of", var, "versus wage")
  
  plot <- ggplot(data = plot_data, aes(x = x, y = y)) +
    geom_point() +
    labs(title = plot_title, x = var, y = "Wage") +
    theme_minimal()
  
  print(plot)
}

Experience vs Wage: The scatterplot seems to follow an upwards sloping trend that resets back to a lower value and repeats the positive slope every 15 years. The data points are distributed evenly across the graph. This graph defies the expected economic intuition that wage steadily increases as years of experience increases, instead suggesting that there are certain breakpoints in which having an additional year of work experience could potentially be associated with a drop in wages.

Weeks vs Wage: The data points in this scatterplot are left-skewed, meaning that most of them are concentrated on the right side of the plot. A majority of the individuals in the data set have worked between 45 to 52 weeks out of the year, with the highest wage recorded being located at 45 weeks.

Occupation vs Wage: This binary scatterplot is interesting in that despite blue-collared jobs being mostly concentrated between a wage of 500 and 1000, the variance of blue-collared jobs extends to a higher peak than white-collared jobs (around 1750). Additionally, contrary to expectations, white collared jobs are relatively evenly distributed among their variance and do not seem to be particularly better than blue-collared jobs on average.

Industry vs Wage: In this binary scatterplot, we observe that the variance for workers that do not work in a manufacturing industry is much wider than the variance for workers that do. For non-industry workers, roughly half of them are distributed above the 1000 wage level and the other half under the 1000 wage level. In contrast, most of the industry workers have a wage level under 1000. There are also significantly more non-industry workers than industry workers. Once again, there is interestingly one outlier in the industry workers that has a wage level higher than anything in the non-industry workers’ wage.

South vs Wage: This scatterplot seems to have relatively equal values based on simply looking at the two sides alone. However, there does seem to be a slightly higher wage on average for people not residing in the South when compared to people that do reside in the South.

SMSA (Standard Metropolitan Statistical Area) vs Wage: This scatterplot clearly shows a higher wage on average for individuals that reside in an SMSA as opposed to not living in an SMSA.

Married vs Wage: This scatterplot demonstrates a much wider variance for individuals that are married, whereas all individuals in the non-married category have a wage of 1000 or less. It also seems that the majority of individuals in this data set are married, making single individuals the minority.

Gender vs Wage: This scatterplot shows a significant disparity between male and female workers, with males having a much wider variance in wages and reaching peaks of around 1500, while all females in this dataset have a wage level of roughly 750 or less. However, it is important to consider that there are a large number of males in this dataset compared to females and it could be that this data was not able to collect enough data on females.

Union vs Wage: This scatterplot shows that non-union-contracted workers have a wider variance of wages, while union workers have a more concentrated area of wages. Also, union workers have a higher wage floor than non-union workers.

Education vs Wage: This scatterplot demonstrates a relatively normal distribution of data, with a small majority of data being present around the 12 year mark and a decreasing amount of data points as we radiate outward from that point. Interestingly, a higher average wage does not seem to manifest itself on this data until after 15 years of education (the exception being the high variance around the 12 year mark, in which there are some high outliers).

Ethnicity vs Wage: This scatterplot highlights how African Americans in this data set have a wage limit of 750, while the combined other category has a massive variance with an even distribution from roughly 0 to 1500. It is important to consider that there are only seven observations of African Americans in this data set.

Boxplot of Variables

# Extract the variables used in pooledreg1
variables <- c("wage", "experience", "weeks", "occupation", "industry", "south", "smsa", "married", "gender", "union", "education", "ethnicity")

# Create a boxplot for each variable
boxplot(earnings_pd[, "wage"], main = "Boxplot of Wage", ylab = "Value")

boxplot(earnings_pd[, "experience"], main = "Boxplot of Experience", ylab = "Value")

boxplot(earnings_pd[, "weeks"], main = "Boxplot of Weeks", ylab = "Value")

boxplot(earnings_pd[, "occupation"], main = "Boxplot of Occupation", ylab = "Value")

boxplot(earnings_pd[, "industry"], main = "Boxplot of Industry", ylab = "Value")

boxplot(earnings_pd[, "south"], main = "Boxplot of South", ylab = "Value")

boxplot(earnings_pd[, "smsa"], main = "Boxplot of SMSA", ylab = "Value")

boxplot(earnings_pd[, "married"], main = "Boxplot of Married", ylab = "Value")

boxplot(earnings_pd[, "gender"], main = "Boxplot of Gender", ylab = "Value")

boxplot(earnings_pd[, "union"], main = "Boxplot of Union", ylab = "Value")

boxplot(earnings_pd[, "education"], main = "Boxplot of Education", ylab = "Value")

boxplot(earnings_pd[, "ethnicity"], main = "Boxplot of Ethnicity", ylab = "Value")

Wage: As shown in the boxplot, the median of the data seems to be around a wage of 750. The majority of the data seems to be located between roughly 600 to 1000, with some outliers being present above 1500 wages.

Experience: The median of the data seems to be around 20 years of experience, with the majority of the data being located between 10 to 27 years of experience. Interestingly, there is an upper bound of around 37 years of experience, which seems like an unusually high amount of schooling.

Weeks: The median of the data seems to be around 48 weeks worked, with the majority of the data being located between around 43 years to 47 weeks worked. There are a notable amount of outliers being present below 40 weeks worked during the year.

Occupation: The black line indicates that more individuals in this data set have blue-collared jobs as opposed to white-collared jobs.

Industry: The black line indicates that more individuals in this data set are not part of the manufacturing industry as opposed to being in the manufacturing industry.

South: The black line indicates that more individuals in this data set do not reside in the South as opposed to residing in the South.

SMSA: The black line indicates that more individuals reside in an SMSA as opposed to not residing in an SMSA.

Married: Interestingly, this boxplot shows that an overwhelming amount of people are married such that the few people who are not married are considered outliers.

Gender: Similar to the married variable, this boxplot shows that an overwhelming amount of people are male such that the few females in the data set are considered outliers.

Union: The black line indicates that more individuals are not contracted with a union as opposed to being contracted with a union.

Education: The median of the data seems to be around 12 years of education, with the majority of the data being located between roughly 11 years to 16 years of education. This indicates that the majority of the data is in between a high-school to undergraduate college level of education.

Ethnicity: Once again, this boxplot shows that an overwhelming amount of people identify as the “other” ethnic category such that the few people who identify as African American are considered outliers.

Statistical Summary / Five Number Summary of Variables

variables <- c("wage", "experience", "weeks", "occupation", "industry", "south", "smsa", "married", "gender", "union", "education", "ethnicity")
summary_data <- sapply(earnings_pd[, variables], summary)
print(summary_data)

## $wage
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   260.0   600.0   757.0   833.8  1000.0  1845.0 
## 
## $experience
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   10.00   20.00   19.33   28.00   37.00 
## 
## $weeks
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   44.00   47.00   45.09   48.00   52.00 
## 
## $occupation
## white  blue 
##    39    66 
## 
## $industry
##  no yes 
##  76  29 
## 
## $south
##  no yes 
##  59  46 
## 
## $smsa
##  no yes 
##  45  60 
## 
## $married
##  no yes 
##  19  86 
## 
## $gender
##   male female 
##     91     14 
## 
## $union
##  no yes 
##  66  39 
## 
## $education
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   11.00   12.00   12.73   16.00   17.00 
## 
## $ethnicity
## other  afam 
##    98     7

Wage: The median or center of the data is at around 757 dollars for an individual’s wage, and the mean is slightly higher at 833.8 dollars. This data ranges from around 260 to 1845 dollars and most of the data is located between 600 and 1000 dollars.

Experience: The median or center of the data is at around 20 years, and the mean is slightly lower at 19.33 years. This data ranges from around 3 years of experience to 37 years of experience and most of the data is located between 10 and 28 years.

Weeks: The median or center of the data is at around 47 weeks worked, and the mean is slightly lower at 45 weeks worked. This data ranged from around 17 weeks to 52 weeks worked and most of the data is located between 44 and 48 weeks worked. The maximum value for this variable of 52 makes sense because this data is looking at one year which is 52 weeks. The data is also clearly left-skewed because there are outliers for individuals who work significantly less than 44 weeks.

Occupation: Roughly 37% of individuals have white collared occupations while the other 63% have blue collared occupations. This is in line with expectations as one would expect that white collared opportunities are less available due to requiring more prerequisites such as education or experience.

Industry: Roughly 72% of individuals do not work in the manufacturing industry while the other 28% do. This is significant as more than a fourth of the data is comprised of manufacturing industry workers.

South: Roughly 56% of individuals do not reside in the south while the other 44% do. This is a relatively even split and indicates that the data is not overwhelmingly biased towards one geographical region.

SMSA: Roughly 43% of individuals do not reside in a SMSA while the other 57% do. This is a relatively even distribution, albeit there are slightly more individuals that reside in a SMSA.

Married: Roughly 18% of individuals are not married, while a whopping 82% are.

Gender: Roughly 87% of individuals are male while only 13% of the data are female. Tying back to the previous variable married, this means that a majority of the data are likely husbands.

Union: Roughly 63% of individuals are not part of a union while 37% are. This is a surprising statistic for this data as typically the percentage of workers part of a union seems to be quite low (around 10%).

Education: The median or center of the data is at around 12 years, and the mean is slightly higher at 12.73 years. This data ranges from around 9 years of education to 17 years of education and most of the data is located between 11 and 16 years. This means most of the individuals in this data set are in between high school and undergraduate levels of education.

Ethnicity: Roughly 93% of individuals identify as “other” for ethnicity while only 7% identify as African American. Keep in mind that the “other” category contains all other ethnic groups; therefore this indicator variable is mainly focused on the distinction between African Americans and any other race.

Percentages of Binary Variables

white_occupation <- 39 / (39+66)
black_occupation <- 66 / (39+66)
no_industry <- 76 / (76+29)
yes_industry <- 29 / (76+29)
no_south <- 59 / (59+46)
yes_south <- 46 / (59+46)
no_smsa <- 45 / (45+60)
yes_smsa <- 60 / (45+60)
no_married <- 19 / (19+86)
yes_married <- 86 / (19+86)
male_gender <- 91 / (91+14)
female_gender <- 14 / (91+14)
no_union <- 66 / (66+39)
yes_union <- 39 / (66+39)
other_ethnicity <- 98 / (98+7)
afam_ethnicity <- 7 / (98+7)

white_occupation

## [1] 0.3714286

black_occupation

## [1] 0.6285714

no_industry

## [1] 0.7238095

yes_industry

## [1] 0.2761905

no_south

## [1] 0.5619048

yes_south

## [1] 0.4380952

no_smsa

## [1] 0.4285714

yes_smsa

## [1] 0.5714286

no_married

## [1] 0.1809524

yes_married

## [1] 0.8190476

male_gender

## [1] 0.8666667

female_gender

## [1] 0.1333333

no_union

## [1] 0.6285714

yes_union

## [1] 0.3714286

other_ethnicity

## [1] 0.9333333

afam_ethnicity

## [1] 0.06666667

[1] 0.3714286 [1] 0.6285714 [1] 0.7238095 [1] 0.2761905 [1] 0.5619048 [1] 0.4380952 [1] 0.4285714 [1] 0.5714286 [1] 0.1809524 [1] 0.8190476 [1] 0.8666667 [1] 0.1333333 [1] 0.6285714 [1] 0.3714286 [1] 0.9333333 [1] 0.06666667 These percentages were involved in the five-number summary analysis above

Fit the three models below, and identify which model is your preferred one and why. Make sure to include effects plots, statistical diagnostics, etc., to support your conclusion, and to comment on your findings.

Pooled Model
Fixed Effects
Random Effects

** REFERENCE: https://bruinlearn.ucla.edu/courses/166841/files/13129093/download?download_frd=1

Relevant Plots

# observing hetersokedasticity
# differences in wage across individuals
plotmeans(wage ~ id, data = earnings_pd)

# differences in wage across years
plotmeans(wage ~ year, data = earnings_pd)

Using the plotmeans function, we first visualize the means for wages across individuals and see that there is a lot of variation. This is expected, as individuals were selected randomly for this data set. In the second plot, we visualize the means across time and see there is an increasing trend in the means for wages. This is also expected, as wage typically increases as an individual remains in his/her occupation.

Pooled Model

summary(pooledreg)

## Pooling Model
## 
## Call:
## plm(formula = wage ~ exper + weeks + occ + ind + south + smsa + 
##     married + gender + union + educ + ethn, data = earnings_pd, 
##     model = "pooling", index = c("id", "year"))
## 
## Balanced Panel: n = 15, T = 7, N = 105
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -385.134 -132.619  -39.281  108.591  585.587 
## 
## Coefficients:
##               Estimate Std. Error t-value Pr(>|t|)    
## (Intercept)  -558.7887   259.3884 -2.1543 0.033805 *  
## exper          22.8804     3.7166  6.1563 1.87e-08 ***
## weeks          13.7461     4.0492  3.3948 0.001011 ** 
## occblue      -149.0303   101.7773 -1.4643 0.146490    
## indyes         25.5615    58.4400  0.4374 0.662837    
## southyes      -93.7503    64.8207 -1.4463 0.151455    
## smsayes       157.3728    61.5495  2.5568 0.012182 *  
## marriedyes    -50.6621   102.1541 -0.4959 0.621108    
## genderfemale -453.5244   144.3212 -3.1425 0.002248 ** 
## unionyes     -137.4925    67.6039 -2.0338 0.044821 *  
## educ           42.2579    13.6334  3.0996 0.002564 ** 
## ethnafam     -252.8175   138.5603 -1.8246 0.071272 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    11378000
## Residual Sum of Squares: 3910700
## R-Squared:      0.6563
## Adj. R-Squared: 0.61565
## F-statistic: 16.144 on 11 and 93 DF, p-value: < 2.22e-16

Based on the above summary, the variables occupation, industry, south, smsa, union, married, and ethnicity are insignificant at 1%. We hypothesize these variables are correlated with one another and thus, we will remove these variables to simplify our model and remove such repeated effects. After running a few regressions including and disincluding some variables, we learned that union might be a significant variable, as shown in the regression below, and will use it for our model.

pooledreg <- plm(wage~exper + weeks + gender + union + educ, 
                 model="pooling", data=earnings_pd, index=c("id", "year"))
summary(pooledreg)

## Pooling Model
## 
## Call:
## plm(formula = wage ~ exper + weeks + gender + union + educ, data = earnings_pd, 
##     model = "pooling", index = c("id", "year"))
## 
## Balanced Panel: n = 15, T = 7, N = 105
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -471.520 -145.018  -23.495  116.002  680.168 
## 
## Coefficients:
##               Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)  -886.4654   211.6112 -4.1891 6.087e-05 ***
## exper          22.4330     2.3439  9.5709 9.446e-16 ***
## weeks          13.7027     3.7864  3.6190 0.0004681 ***
## genderfemale -485.8542    68.3255 -7.1109 1.821e-10 ***
## unionyes     -188.4079    50.2458 -3.7497 0.0002981 ***
## educ           63.1002    10.3981  6.0684 2.382e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    11378000
## Residual Sum of Squares: 4416500
## R-Squared:      0.61184
## Adj. R-Squared: 0.59224
## F-statistic: 31.2105 on 5 and 99 DF, p-value: < 2.22e-16

Tests

# BP test for heteroskedasticity
kable(tidy(bptest(pooledreg)), caption="BP Test for Pooled Model")

BP Test for Pooled Model
statistic	p.value	parameter	method
17.98901	0.0029602	5	studentized Breusch-Pagan test

# observing serial correlation
pbgtest(pooledreg)

## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  wage ~ exper + weeks + gender + union + educ
## chisq = 37.841, df = 7, p-value = 3.249e-06
## alternative hypothesis: serial correlation in idiosyncratic errors

Based on the BP Test, we observe that the pooled model has a p-value of about 0.003, which means we reject the null hypothesis of homoskedasticity. Thus, there might still be heteroskedasticity in the model, which we will solve using cluster robust standard errors. Based on the BG test, we observe the p-value is 3e-06 and we reject the null hypothesis that there is no serial correlation. Thus, there is serial correlation in this model, likely due to unobserved heterogeneity in the combined error each year.

# adjust pooled model with cluster robust standard errors
# aggregate variance by group
pooled.cluster <- coeftest(pooledreg, vcov=vcovHC(pooledreg,
type="HC0", cluster="group"))

stargazer(pooledreg, pooled.cluster, dep.var.labels = c("wage", "wage"), column.labels = c("\\textit{OLS SE}", "\\textit{Clustered SE for Pooled Model}"),
        model.names = FALSE, type = "text")

## 
## =================================================================
##                              Dependent variable:                 
##              ----------------------------------------------------
##                       wage                      wage             
##                      OLS SE         Clustered SE for Pooled Model
##                       (1)                        (2)             
## -----------------------------------------------------------------
## exper              22.433***                  22.433***          
##                     (2.344)                    (3.515)           
##                                                                  
## weeks              13.703***                   13.703            
##                     (3.786)                    (9.159)           
##                                                                  
## genderfemale      -485.854***                -485.854***         
##                     (68.325)                  (80.659)           
##                                                                  
## unionyes          -188.408***                -188.408**          
##                     (50.246)                  (87.491)           
##                                                                  
## educ               63.100***                  63.100***          
##                     (10.398)                  (12.245)           
##                                                                  
## Constant          -886.465***                -886.465**          
##                    (211.611)                  (357.485)          
##                                                                  
## -----------------------------------------------------------------
## Observations          105                                        
## R2                   0.612                                       
## Adjusted R2          0.592                                       
## F Statistic  31.211*** (df = 5; 99)                              
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

We compare the SE’s using the regular pooled model and the Cluster Robust Standard Errors. The latter are all larger, some more substantially so, than those of the original pooled model.

Fixed Effects Model

# Full Fixed Effect (id and time)
FE.full <- lm(wage~exper + weeks + gender + union + educ + year + id, data=earnings0)

# Only Time
FE.time <- lm(wage~exper + weeks + gender + union + educ + year, data=earnings0)

# Only id
FE.id <- lm(wage~exper + weeks + gender + union + educ + id, data=earnings0)

Create models with different effects and compare

fm_full <- plm(wage~exper + weeks + gender + union + educ, data=earnings_pd, model = "within", effect = "twoways")
fm_time <- plm(wage~exper + weeks + gender + union + educ, data=earnings_pd, model = "within", effect = "time")
fm_id <- plm(wage~exper + weeks + gender + union + educ, data=earnings_pd, model = "within", effect = "individual")

# see relevance of these effects
# Case 1:both time and firm effects jointly 
pFtest(fm_full, pooledreg)

## 
##  F test for twoways effects
## 
## data:  wage ~ exper + weeks + gender + union + educ
## F = 22.712, df1 = 17, df2 = 82, p-value < 2.2e-16
## alternative hypothesis: significant effects

# Case 2: Only time effects
pFtest(fm_time, pooledreg)

## 
##  F test for time effects
## 
## data:  wage ~ exper + weeks + gender + union + educ
## F = 10.858, df1 = 6, df2 = 93, p-value = 3.939e-09
## alternative hypothesis: significant effects

# Case 3: Only firm effects 
pFtest(fm_id, pooledreg)

## 
##  F test for individual effects
## 
## data:  wage ~ exper + weeks + gender + union + educ
## F = 31.696, df1 = 12, df2 = 87, p-value < 2.2e-16
## alternative hypothesis: significant effects

We conducted F-tests to see the significance of time effects, individual effects, and both effects together. We see that both time and individual effects are relevant and thus, we will use both effects for our fixed effects model.

fixedreg <- fm_full

Random Effects Model

randomreg <- plm(wage~exper + weeks + gender + union + educ, model="random", data=earnings_pd, index=c("id", "year"))
summary(randomreg)

## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = wage ~ exper + weeks + gender + union + educ, data = earnings_pd, 
##     model = "random", index = c("id", "year"))
## 
## Balanced Panel: n = 15, T = 7, N = 105
## 
## Effects:
##                    var  std.dev share
## idiosyncratic  9450.10    97.21 0.314
## individual    20616.25   143.58 0.686
## theta: 0.7521
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -304.009  -70.772  -25.142   87.237  484.436 
## 
## Coefficients:
##                 Estimate  Std. Error z-value  Pr(>|z|)    
## (Intercept)  -1317.50374   402.29424 -3.2750 0.0010567 ** 
## exper           46.09973     4.58227 10.0605 < 2.2e-16 ***
## weeks            0.67835     3.47675  0.1951 0.8453067    
## genderfemale  -543.17866   160.78435 -3.3783 0.0007293 ***
## unionyes      -257.79597    70.78551 -3.6419 0.0002706 ***
## educ           109.75818    24.28583  4.5194 6.201e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    4219000
## Residual Sum of Squares: 1783100
## R-Squared:      0.57737
## Adj. R-Squared: 0.55603
## Chisq: 135.248 on 5 DF, p-value: < 2.22e-16

Compare Models

# compare model coefficients
stargazer(pooled.cluster, fixedreg, randomreg, type = "text", column.labels = c("Pooled w/ Clustered SE", "Fixed Effects", "Random Effects", model.names = FALSE))

## 
## ========================================================================
##                                  Dependent variable:                    
##              -----------------------------------------------------------
##                                                     wage                
##                   coefficient                      panel                
##                       test                         linear               
##              Pooled w/ Clustered SE     Fixed Effects     Random Effects
##                       (1)                    (2)               (3)      
## ------------------------------------------------------------------------
## exper              22.433***                                46.100***   
##                     (3.515)                                  (4.582)    
##                                                                         
## weeks                13.703                 2.982             0.678     
##                     (9.159)                (2.753)           (3.477)    
##                                                                         
## genderfemale      -485.854***                              -543.179***  
##                     (80.659)                                (160.784)   
##                                                                         
## unionyes           -188.408**            -245.847***       -257.796***  
##                     (87.491)              (64.220)           (70.786)   
##                                                                         
## educ               63.100***                                109.758***  
##                     (12.245)                                 (24.286)   
##                                                                         
## Constant           -886.465**                             -1,317.504*** 
##                    (357.485)                                (402.294)   
##                                                                         
## ------------------------------------------------------------------------
## Observations                                 105               105      
## R2                                          0.173             0.577     
## Adjusted R2                                -0.049             0.556     
## F Statistic                         8.565*** (df = 2; 82)   135.248***  
## ========================================================================
## Note:                                        *p<0.1; **p<0.05; ***p<0.01

# F test for twoway effects
pFtest(fixedreg, pooledreg)

## 
##  F test for twoways effects
## 
## data:  wage ~ exper + weeks + gender + union + educ
## F = 22.712, df1 = 17, df2 = 82, p-value < 2.2e-16
## alternative hypothesis: significant effects

# LM test for random effects
plmtest(pooledreg, effect="individual")

## 
##  Lagrange Multiplier Test - (Honda)
## 
## data:  wage ~ exper + weeks + gender + union + educ
## normal = 5.5371, p-value = 1.537e-08
## alternative hypothesis: significant effects

# Hausman test for endogeneity
phtest(randomreg, fixedreg)

## 
##  Hausman Test
## 
## data:  wage ~ exper + weeks + gender + union + educ
## chisq = 1.9843, df = 2, p-value = 0.3708
## alternative hypothesis: one model is inconsistent

We first use the F-test for twoways effects. We reject the null that the coefficients are the same and conclude that fixed effects are preferred over the pooled model.

For the LM test, we are given a p-value of 4e-06. We reject the null, which implies a non-zero variance, i.e. heterogeneity among individuals may be significant. Thus, we conclude random effects are also preferred and we will not use the pooled model.

Finally, using the Hausman Test, we retrieve a p-value of 0.4 and fail to reject the null that random effects are exogeneous. Therefore, we want to use the random effects model.

Qualitative Dependent Variables

Briefly discuss your data and the question you are trying to answer with your model. The dataset we chose is called “Arrests” and is the data on police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana, including variables such as colour, year, age, sex, employed, citizen, and checks. The variables colour, sex, employed, and citizen are indicator variables, and the variables year, age, and checks are continuous. The dependent variable in this data set is released, which is an indicator variable that tells whether or not an arrestee was released. The question we are attempting to answer with our model is the estimated probability of an individual being released utilizing different models (i.e. Linear Probability Model, Probit Model, and Logit Model).

# alternative dataset
library(carData)
data("Arrests")
names(Arrests)

## [1] "released" "colour"   "year"     "age"      "sex"      "employed" "citizen" 
## [8] "checks"

Provide a descriptive analysis of your variables. This should include histograms and fitted distributions, correlation plot, boxplots, scatterplots, and statistical summaries (e.g., the five- number summary). All figures must include comments.

# 'colour', 'sex', 'employed', 'citizen' are categorical variables

# Convert categorical variables to factors
Arrests$colour <- as.factor(Arrests$colour)
Arrests$sex <- as.factor(Arrests$sex)
Arrests$employed <- as.factor(Arrests$employed)
Arrests$citizen <- as.factor(Arrests$citizen)

# Fit the linear regression model
regression <- lm(released ~ colour + year + age + sex + employed + citizen + checks, data = Arrests)

## Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
## response will be ignored

## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors

Histogram and Fitted Distribution of Variables

variables <- all.vars(formula(regression))
for (variable in variables) {
  # Get the values of the variable from the data and convert to numeric
  values <- as.numeric(Arrests[[variable]])
  
  # Check if there are at least 2 data points
  if (length(values) >= 2) {
    # Estimate the density of the values
    density <- density(values)
    
    # Create a new plot for each variable
    hist(values, freq = FALSE, col = "lightblue", main = paste("Fitted Histogram -", variable))
    lines(density, col = "red", lwd = 2)
  } else {
    # Skip this variable if there are fewer than 2 data points
    cat("Skipping variable:", variable, "- not enough data points for density estimation\n")
  }
}

Released: As seen in the histogram, a surprising ~80% of the individuals who were arrested end up being released. This means that not being released upon arrest is an unlikely event in comparison.

Colour: As seen in the histogram, about three-fourths of the arrested individuals are white, with only about a fourth of them being black.

Year: As seen in the histogram, the distribution seems to be relatively normal distributed, with a clear peak of data being located at the year 2000.

Age: As seen in the histogram, the distribution is clearly right skewed. There is a peak in the distribution between the ages of 15 and 20, with a steep dropoff in data points as the age increases. This indicates that most arrests in this data set are from young adults.

Sex: As seen in the histogram, an overwhelming majority of the arrested individuals in this dataset are males, with females making up only about 8% of the data points.

Employed: As seen in the histogram, a majority of the arrested individuals are employed, with roughly 20% of them being unemployed.

Citizen: As seen in the histogram, a majority of the arrested individuals are citizens, with only roughly 15% of them being non-citizens.

Checks: The distribution is very clearly right-skewed with a evident peak of data being centered around 0 to 1 checks conducted. This means that most arrested individuals have not been checked often.

Correlation Matrix of Non-Binary Variables

# Extract the variables used in pooledreg1
variables <- c("year", "age", "checks")

# Subset the data frame 'earnings_pd' to include only the selected variables
data_subset <- Arrests[, variables]

# Compute the correlation matrix
cor_matrix <- cor(data_subset)

# Print the correlation matrix
print(cor_matrix)

##                year          age      checks
## year    1.000000000 -0.004627123 -0.03686975
## age    -0.004627123  1.000000000  0.13507935
## checks -0.036869749  0.135079353  1.00000000

Based on the above correlation matrix, we observe that there seems to be no correlation at all between any of the continuous variables in this dataset. The only slightly positive correlation present is in between age and checks, but the value of the correlation is only about 0.135; therefore it is likely none of the continuous independent variables are correlated.

Scatterplot of Variables

# Create a list of variable names
variable_names <- names(Arrests)

# Create scatterplots for each variable
par(mfrow = c(2, 2))  # Adjust the layout of the plots

for (var in variable_names) {
  if (is.numeric(Arrests[[var]])) {
    plot(Arrests[[var]], main = var, xlab = var, ylab = "released")
  }
}

Boxplot of Non-Binary Variables

# Create a boxplot for each variable
boxplot(Arrests[, "year"], main = "Boxplot of Year", ylab = "Value")

boxplot(Arrests[, "age"], main = "Boxplot of Age", ylab = "Value")

boxplot(Arrests[, "checks"], main = "Boxplot of Checks", ylab = "Value")

Year: As shown in the boxplot, the center of the distribution seems to be around the year of 2000, with 50% of the data being located between the years 1998 and 2001. Overall, the dataset seems to have a lower and upper bound of the years 1997 to 2002.

Age: As shown in the boxplot, the median age seems to be 20 years old, with 50% of the data being located around the ages 18 to 25. There are however a good amount of outliers above the age of 40.

Checks: As shown in the boxplot, the median amount of checks seems to be 1 check, with 50% of the data being located in between 0 checks and 3 checks. Overall, the data seems to range from 0 checks to 6 checks.

Statistical Summary / Five Number Summary of Variables

variabless <- c("year", "age", "checks", "colour", "sex", "released", "employed", "citizen")
summary_data <- sapply(Arrests[, variabless], summary)
print(summary_data)

## $year
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1997    1998    2000    2000    2001    2002 
## 
## $age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   18.00   21.00   23.85   27.00   66.00 
## 
## $checks
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.636   3.000   6.000 
## 
## $colour
## Black White 
##  1288  3938 
## 
## $sex
## Female   Male 
##    443   4783 
## 
## $released
##   No  Yes 
##  892 4334 
## 
## $employed
##   No  Yes 
## 1115 4111 
## 
## $citizen
##   No  Yes 
##  771 4455

Year: The median or center of the data is at around the year 2000, with the mean also being 2000. This data ranges from the year 1997 to the year 2002, with a majority of the data being located between the years 1998 and 2001.

Age: The median or center of the data is the age of 21, with the mean age being slightly higher at 23.85 years old. This data ranges from the age of 12 to the age of 66, with a majority of the data being located between the ages of 18 to 27. It is extremely surprising to observe a case of an arrest for an individual who is only 12 years old.

Checks: The median or center of the data is at 1 check, with the mean amount of checks being slightly higher at 1.636 checks. The data ranges from 0 checks to 6 checks, with a majority of the data being located between 0 checks and 3 checks.

Colour: Roughly 25% of the arrested individuals are black while the other 75% are white. This is consistent with the histogram analysis.

Sex: Roughly 8% of the arrested individuals are female while a staggering 92% are male. This is consistent with the histogram analysis.

Released: Roughly 83% of the arrested individuals end up being released, with the other 17% remaining under arrest. This is consistent with the histogram analysis.

Employed: Roughly 79% of the arrested individuals are employed with the other 21% being unemployed. This is consistent with the histogram analysis.

Citizen: Roughly 85% of the arrested individuals are citizens with the other 15% being noncitizens. This is consistent with the histogram analysis.

Percentages of Binary Variables

black_colour <- 1288 / (1288+3938)
white_colour <- 3938 / (1288+3938)
female_sex <- 433 / (433+4783)
male_sex <- 4783 / (433+4783)
no_released <- 892 / (892+4334)
yes_released <- 4334 / (892+4334)
no_employed <- 1115 / (1115+4111)
yes_employed <- 4111 / (1115+4111)
no_citizen <- 771 / (771+4455)
yes_citizen <- 4455 / (771+4455)

black_colour

## [1] 0.24646

white_colour

## [1] 0.75354

female_sex

## [1] 0.0830138

male_sex

## [1] 0.9169862

no_released

## [1] 0.170685

yes_released

## [1] 0.829315

no_employed

## [1] 0.2133563

yes_employed

## [1] 0.7866437

no_citizen

## [1] 0.1475316

yes_citizen

## [1] 0.8524684

[1] 0.24646 [1] 0.75354 [1] 0.0830138 [1] 0.9169862 [1] 0.170685 [1] 0.829315 [1] 0.2133563 [1] 0.7866437 [1] 0.1475316 [1] 0.8524684 These percentages were used in the above analyses.

Fit the three models below, and identify which model is your preferred one and why. Make sure to include effects plots, statistical diagnostics, etc., to support your conclusion, and to comment on your findings.

Linear Probability Model
Probit Model
Logit Model Linear Probability (Regression) Model

# Convert released variable to binary (0 = No and 1 = Yes)
Arrests$released <- ifelse(Arrests$released == "Yes", 1, 0)

# Convert colour variable to binary (0 = Black and 1 = White)
Arrests$colour <- ifelse(Arrests$colour == "White", 1, 0)

# Convert sex variable to binary (0 = Female and 1 = Male)
Arrests$sex <- ifelse(Arrests$sex == "Male", 1, 0)

# Convert employed variable to binary (0 = No and 1 = Yes)
Arrests$employed <- ifelse(Arrests$employed == "Yes", 1, 0)

# Convert citizen variable to binary (0 = No and 1 = Yes)
Arrests$citizen <- ifelse(Arrests$citizen == "Yes", 1, 0)

linear.mod <- lm(released ~ ., data = Arrests)
coeftest(linear.mod, vcov. = vcovHC(linear.mod, type = "HC1"))

## 
## t test of coefficients:
## 
##                Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept)  1.51436547  7.87288768   0.1924    0.8475    
## colour       0.05725374  0.01334773   4.2894 1.824e-05 ***
## year        -0.00041761  0.00393910  -0.1060    0.9156    
## age          0.00048909  0.00065488   0.7468    0.4552    
## sex          0.00350297  0.01662660   0.2107    0.8331    
## employed     0.12364629  0.01497548   8.2566 < 2.2e-16 ***
## citizen      0.08974749  0.01732626   5.1799 2.304e-07 ***
## checks      -0.04999510  0.00375479 -13.3150 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logit Model

logit.mod <- glm(released ~ ., family = binomial(link = "logit"), data = Arrests)
summary(logit.mod)

## 
## Call:
## glm(formula = released ~ ., family = binomial(link = "logit"), 
##     data = Arrests)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3909   0.3579   0.4320   0.6047   1.7067  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  9.371821  56.717803   0.165    0.869    
## colour       0.389109   0.085663   4.542 5.56e-06 ***
## year        -0.004218   0.028379  -0.149    0.882    
## age          0.002236   0.004631   0.483    0.629    
## sex          0.007317   0.150189   0.049    0.961    
## employed     0.757302   0.084735   8.937  < 2e-16 ***
## citizen      0.576519   0.104246   5.530 3.20e-08 ***
## checks      -0.364101   0.025984 -14.013  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4776.3  on 5225  degrees of freedom
## Residual deviance: 4299.1  on 5218  degrees of freedom
## AIC: 4315.1
## 
## Number of Fisher Scoring iterations: 5

Probit Model

probit <- glm(released ~ ., family = binomial(link = "probit"), data = Arrests)
summary(probit)

## 
## Call:
## glm(formula = released ~ ., family = binomial(link = "probit"), 
##     data = Arrests)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4165   0.3449   0.4289   0.6120   1.6394  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.6932975 31.8192200   0.116    0.908    
## colour       0.2189986  0.0489607   4.473 7.72e-06 ***
## year        -0.0015688  0.0159208  -0.099    0.922    
## age          0.0008996  0.0025926   0.347    0.729    
## sex          0.0034287  0.0820532   0.042    0.967    
## employed     0.4366428  0.0492247   8.870  < 2e-16 ***
## citizen      0.3388906  0.0599838   5.650 1.61e-08 ***
## checks      -0.2034429  0.0144456 -14.083  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4776.3  on 5225  degrees of freedom
## Residual deviance: 4295.9  on 5218  degrees of freedom
## AIC: 4311.9
## 
## Number of Fisher Scoring iterations: 5

Table for linear model

tab.linear <- table("Linear model" = predict(linear.mod) >= 0.5,
"Actual" = Arrests$released)

Table for logit model

tab.logit <- table("Logit model" = predict(logit.mod, type = "response") >= 0.5,
"Actual" = Arrests$released)

Table for probit model

tab.probit <- table("Probit model" = predict(probit, type = "response") >= 0.5,
"Actual" = Arrests$released)

Display percentages of 0 and 1 observations that are marked as TRUE or FALSE by model

prop.table(tab.linear, margin = 2) * 100

##             Actual
## Linear model          0          1
##        FALSE  2.4663677  0.2768805
##        TRUE  97.5336323 99.7231195

prop.table(tab.logit, margin = 2) * 100

##            Actual
## Logit model         0         1
##       FALSE  6.165919  1.430549
##       TRUE  93.834081 98.569451

prop.table(tab.probit, margin = 2) * 100

##             Actual
## Probit model         0         1
##        FALSE  6.053812  1.407476
##        TRUE  93.946188 98.592524

Upon fitting the three models (i.e. linear probability model, logit model, and probit model), we determined that the preferred model is the linear probability model by comparing the percentage of correct predictions from the linear probability model, logit model, and probit model respectively using a predicted probability of 0.5 as the threshold.

project3

2023-06-08