DATA 110 Project 1

Author

Iris Wu

Introduction to the dataset

The dataset I chose for this project records exonerations in the US from 1989 to the present. An exoneration occurs when a person is officially cleared of a crime, typically because new evidence of innocence emerges. In the US, thousands of people have been wrongfully convicted of crimes they did not commit. These innocent people lose years of their lives to prison, and the legal process to free and exonerate them can take decades. For this project, I will examine four variables: state, race, age at crime (yrs), and years lost. Age at crime is how old each person was when the crime they were wrongfully accused of was committed. Years lost is the number of years each person wrongfully served in prison. I will first fit a linear regression of years lost on age at crime, then add race as a second predictor. Then, I will create a data visualization that illustrates the relationship between race and average years lost to prison in the District of Columbia (DC), Maryland, and Virginia. I downloaded this dataset from The National Registry of Exonerations.

Load the library and dataset

library(tidyverse)
library(ggfortify)
#load the dataset
setwd("C:/Users/rsaidi/Downloads")
#setwd("C:/Users/iwu80/OneDrive/Documents/Files/School/DATA 110 R Assignments")
exonerations <- read_csv("nre_export_2025-10-13.csv")

Clean the dataset

#rename the column "Age at Crime Yrs" to "Age at Crime" 
exonerations <- rename(exonerations, "Age at Crime" = "Age at Crime Yrs")
#remove rows with NA in any of the three variables used below
exonerate_nona <- exonerations |>
  filter(!is.na(`Age at Crime`) & !is.na(`Years lost`) & !is.na(Race))
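
The NA-removal step can also be written with tidyr's drop_na(), which drops rows that have a missing value in any of the columns named. The sketch below uses a small made-up data frame (toy and its Notes column are hypothetical, not part of the Registry export) to show that missing values in other columns are left alone.

```r
library(tidyr)

# Hypothetical toy data; the column names mimic the ones used in this project
toy <- data.frame(
  `Age at Crime` = c(19, NA, 34, 27),
  `Years lost`   = c(12.5, 8, NA, 3),
  Race           = c("Black", "White", "Hispanic", "White"),
  Notes          = c(NA, "a", "b", NA),
  check.names    = FALSE
)

# Drop rows with NA in the three analysis columns only;
# NA values in Notes do not cause a row to be dropped
toy_clean <- drop_na(toy, `Age at Crime`, `Years lost`, Race)
toy_clean
```

Restricting the check to the analysis columns keeps as many cases as possible, since a row is not discarded for a gap in a variable the analysis never uses.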

Linear Regression Analysis

I want to examine the relationship between age at crime and years lost to prison. Can the age at crime be a predictor for years lost to prison? For example, people who were minors at the time of a crime may receive shorter sentences, which in turn may lead to a shorter length of imprisonment.

#find the correlation coefficient 
cor(exonerate_nona$`Age at Crime`, exonerate_nona$`Years lost`)
[1] -0.2652258
#create the linear regression model
fit1 <- lm(exonerate_nona$`Years lost`~ exonerate_nona$`Age at Crime`, data = exonerate_nona)
summary(fit1)

Call:
lm(formula = exonerate_nona$`Years lost` ~ exonerate_nona$`Age at Crime`, 
    data = exonerate_nona)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.980  -7.072  -2.653   5.450  38.051 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   15.94473    0.42327   37.67   <2e-16 ***
exonerate_nona$`Age at Crime` -0.24710    0.01405  -17.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.154 on 4090 degrees of freedom
Multiple R-squared:  0.07034,   Adjusted R-squared:  0.07012 
F-statistic: 309.5 on 1 and 4090 DF,  p-value: < 2.2e-16
#the diagnostic plots get cut off when plotted with the summary(fit1), so I'm putting them on a new line of code 
autoplot(fit1, 1:4, nrow = 2, ncol = 2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

The equation for the model is (Predicted) Years Lost = -0.247(Age at Crime) + 15.945. The negative slope indicates that the two variables have an inverse relationship: for each additional year of age at the time of the crime, the predicted years lost to prison decreases by about 0.247 years, which is the opposite of what I expected. However, the correlation between the two variables is weak because the correlation coefficient, -0.265, is closer to 0 than to +/-1. The adjusted R-squared confirms the weakness of this model: only around 7% of the variation in years lost is explained by age at crime. The small p-value indicates that age at crime is a statistically significant predictor, but because it explains so little of the variation, there are likely stronger predictors of years lost to prison. Race may be one of those predictors.
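
To make the fitted line concrete, here is a minimal sketch that plugs ages into the equation using the intercept and slope reported by summary(fit1). The helper predict_years_lost is a hypothetical name introduced only for illustration. The last lines check that, for a one-predictor model, R-squared equals the square of the correlation coefficient.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 15.94473
slope     <- -0.24710

# Hypothetical helper: predicted years lost for a given age at crime
predict_years_lost <- function(age_at_crime) {
  intercept + slope * age_at_crime
}

# Predicted years lost for someone aged 20 vs. 40 at the time of the crime
predict_years_lost(c(20, 40))

# In simple linear regression, R-squared is the squared correlation
r <- -0.2652258
r_squared <- r^2
r_squared
```

The two predictions come out to roughly 11.0 and 6.1 years, and the squared correlation matches the Multiple R-squared of about 0.070 in the model summary.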

Adding race as a predictor

fit2 <- lm(exonerate_nona$`Years lost`~ exonerate_nona$`Age at Crime`+ exonerate_nona$Race, data = exonerate_nona)
summary(fit2)

Call:
lm(formula = exonerate_nona$`Years lost` ~ exonerate_nona$`Age at Crime` + 
    exonerate_nona$Race, data = exonerate_nona)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.089  -6.982  -2.489   5.345  38.259 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        12.86236    1.62111   7.934 2.71e-15 ***
exonerate_nona$`Age at Crime`      -0.23042    0.01442 -15.979  < 2e-16 ***
exonerate_nona$RaceBlack            3.45263    1.57405   2.193   0.0283 *  
exonerate_nona$RaceDon't Know      -3.55104    2.05783  -1.726   0.0845 .  
exonerate_nona$RaceHispanic         1.97024    1.61480   1.220   0.2225    
exonerate_nona$RaceNative American  4.27294    2.32261   1.840   0.0659 .  
exonerate_nona$RaceOther           -0.50860    2.52558  -0.201   0.8404    
exonerate_nona$RaceWhite            1.83881    1.57970   1.164   0.2445    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.099 on 4084 degrees of freedom
Multiple R-squared:  0.08295,   Adjusted R-squared:  0.08138 
F-statistic: 52.78 on 7 and 4084 DF,  p-value: < 2.2e-16
#diagnostic plots for fit2
autoplot(fit2, 1:4, nrow = 2, ncol = 2)

In this model, race enters as a set of indicator variables, each compared against the baseline category that is omitted from the output (the first level alphabetically, which appears to be Asian). Only the coefficient for Black is statistically significant at the 0.05 level (p = 0.028); the other categories do not differ significantly from the baseline. Race also cannot explain most of the variation in the data: the adjusted R-squared increased by only a little over one percentage point from the first model (from about 7.0% to about 8.1%). Perhaps years lost to prison is better predicted by variables that are not included in this dataset. The number of years each person loses to prison depends on many complex factors, like the availability of affordable legal services, the efficiency of the local court system, and the ease of procuring new evidence. However, it is difficult to collect exact data on these variables, especially at the scale required by the National Registry of Exonerations.
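
The race coefficients are easier to read once the baseline category is explicit. The sketch below uses a tiny made-up data frame (toy, with invented values) to show how lm() treats a categorical predictor: the first factor level becomes the baseline, every other level gets its own coefficient, and relevel() changes which level is the baseline. It also shows one common way to test a categorical predictor as a whole, by comparing nested models with anova().

```r
# Hypothetical toy data, not taken from the Registry
toy <- data.frame(
  years_lost = c(5, 12, 9, 20, 7, 15),
  race       = factor(c("Asian", "Black", "White", "Black", "Asian", "White"))
)

# The first factor level is the baseline and gets no coefficient of its own
baseline <- levels(toy$race)[1]
fit_toy  <- lm(years_lost ~ race, data = toy)
coef_names <- names(coef(fit_toy))
coef_names

# Choose a different baseline; the coefficients are now differences from White
toy2 <- toy
toy2$race <- relevel(toy2$race, ref = "White")
fit_toy2  <- lm(years_lost ~ race, data = toy2)
coef_names2 <- names(coef(fit_toy2))
coef_names2

# Test the predictor as a whole by comparing nested models
fit_null <- lm(years_lost ~ 1, data = toy)
anova(fit_null, fit_toy)
```

Applied to this project, the same idea would be to compare the age-only model with the age-plus-race model, which gives a single test for race rather than one test per category.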

Data Visualization

Filter the data

#Filter for DC, Maryland, and Virginia and the three race categories that appear most frequently (Black, White, and Hispanic)
exonerate_dmv <- exonerate_nona |>
  filter(State %in% c("District of Columbia", "Maryland", "Virginia"),
         Race %in% c("Black", "White", "Hispanic"))

Group and summarize the data

#Find the average years lost to prison for each race category
by_dmv <- exonerate_dmv |>
  group_by(State, Race) |>
  summarize(avg_yearslost = mean(`Years lost`),
            .groups = "drop") 
head(by_dmv)
# A tibble: 6 × 3
  State                Race     avg_yearslost
  <chr>                <chr>            <dbl>
1 District of Columbia Black            16.3 
2 District of Columbia Hispanic          5.7 
3 District of Columbia White            13.6 
4 Maryland             Black            15.0 
5 Maryland             White            11.9 
6 Virginia             Black             7.48
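
An average is easier to judge when the number of cases behind it is visible. As a sketch, the same group-and-summarize pattern can return a count alongside each mean by adding n(). The data frame toy and its values are made up for illustration, and n_cases is a hypothetical column name.

```r
library(dplyr)

# Hypothetical toy data with the same grouping structure as by_dmv
toy <- data.frame(
  State      = c("Maryland", "Maryland", "Maryland", "Virginia", "Virginia"),
  Race       = c("Black", "Black", "White", "Black", "Black"),
  years_lost = c(10, 20, 12, 5, 9)
)

# Mean years lost plus the number of cases in each State-Race group
toy_summary <- toy |>
  group_by(State, Race) |>
  summarize(avg_yearslost = mean(years_lost),
            n_cases = n(),
            .groups = "drop")
toy_summary
```

A group mean based on one or two cases deserves much less weight than one based on dozens, and a group with no cases does not appear in the summary at all.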

Plot the data

ggplot(by_dmv, aes(x = State, y = avg_yearslost, fill = Race)) +
  #geom_col() plots the pre-computed averages directly; binwidth does not apply to bars
  geom_col(position = "dodge") +
  labs(x = "State", y = "Avg. Number of Years in Prison", title = "Average Length of Wrongful Imprisonment \n in the DMV* by Race", subtitle = "*District of Columbia, Maryland & Virginia", caption = "Source: National Registry of Exonerations \n Data collected from 1989 to present") +
  scale_fill_manual(name = "Race", labels = c("Black", "Hispanic", "White"), values = c("skyblue1", "plum4", "seagreen")) +
  scale_y_continuous(limits = c(0, 20)) +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = .5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = .5, face = "italic")) +
  theme(axis.title.x = element_text(vjust = -1)) +
  theme(axis.title.y = element_text(vjust = 1.5)) +
  theme(plot.caption = element_text(hjust = .5, vjust = -1, face = "italic"))

Essay

Since the dataset has 40 variables, I wanted to focus on a select few. I was interested in the relationship between race and years lost to prison. However, it would be difficult to fit all 50 states in one graph, so I decided to narrow my focus to my local geographic zone: DC, Maryland, and Virginia. To clean the data, I first renamed a column so it was less wordy (“Age at Crime Yrs” became “Age at Crime”) and removed the rows with NA values in the variables I wanted to examine for my linear regression model and data visualization. Then, I filtered the data for DC, Maryland, and Virginia. I also filtered the data for the race categories “Black,” “Hispanic,” and “White” because they appear most frequently. I was worried that if I included all the race categories, like “Asian” and “Other,” they would not appear on the chart because they occur so infrequently in the data. To easily compare across races, I summarized the data by average years lost to prison.

My bar chart shows the average number of years that wrongfully convicted people in the DMV lost to prison, organized by state and race. In both DC and Maryland, Black people who were wrongfully convicted lost the most time on average: about 16.3 and 15.0 years, respectively. Surprisingly, there is no data for Hispanic people in Maryland. I find it hard to believe that from 1989 to the present, no Hispanic person in Maryland has ever been wrongfully convicted and imprisoned. The lack of data could be due to the Registry’s data collection methods, such as how it defines “Hispanic,” or to my removal of rows with missing values. The Registry also only records people who have been exonerated, so the gap may reflect wrongful convictions that have not yet been overturned. The chart does not provide sufficient evidence to conclude there is a correlation between race and average number of years in prison. However, it does show that wrongfully convicted people in the DMV can lose over a decade of their lives to prison. One day that an innocent person spends in prison is one day too many. The data in the chart is a damning indictment of our local criminal-legal systems.
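
Per-state averages cannot simply be added or averaged together to get a figure for the whole DMV, because each state contributes a different number of cases. A minimal sketch of the difference follows. It uses the three state averages for Black exonerees from the summary table, while the case counts in n_cases are invented for illustration and are not from the Registry.

```r
# State averages for Black exonerees, taken from the by_dmv table
state_avg <- c(DC = 16.3, Maryland = 15.0, Virginia = 7.48)

# Hypothetical case counts, for illustration only
n_cases <- c(DC = 10, Maryland = 20, Virginia = 30)

# Adding the averages produces a number that is not an average of anything
sum_of_avgs <- sum(state_avg)

# A pooled average weights each state by its number of cases
pooled_avg <- weighted.mean(state_avg, w = n_cases)

sum_of_avgs
pooled_avg
```

A pooled average always falls between the smallest and largest state averages. With the real data, the simplest route would likely be to group exonerate_dmv by Race alone and take the mean of Years lost, which weights every case equally.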

In the future, I would like to create an interactive map of the US that shows the average length of wrongful imprisonment for each racial group in each state. I think it would also be useful to combine the National Registry’s dataset with a dataset on population by race in each state. Doing so could reveal racial disparities in wrongful conviction data. For the linear regression model, I wanted to look at the relationship between years lost to prison and term of years (i.e., the length of the prison sentence), but I did not know which quantitative value to assign to life sentences. If I had more time, I would conduct a detailed breakdown of the term of years variable (for example, dividing it into periods, such as 5-10 years, 15-25 years, etc.). I would also like to create a visualization of the most common contributing factors to wrongful convictions in the US.
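
One possible way to handle the term-of-years idea is to turn the sentence into a categorical variable, so that life sentences become their own category instead of needing a numeric value. The sketch below assumes the sentence is stored as text with either a number of years or the word "Life". That format, the vector sentence, and the band boundaries are all assumptions made for illustration, not the Registry's actual coding.

```r
# Hypothetical sentence values; the real column may be coded differently
sentence <- c("5", "12", "Life", "20")

# Convert numeric sentences to numbers; "Life" becomes NA for now
years <- suppressWarnings(as.numeric(sentence))

# Bin the numeric sentences into bands (each band includes its upper bound)
band <- as.character(cut(
  years,
  breaks = c(0, 5, 10, 15, 25, Inf),
  labels = c("0-5", "5-10", "10-15", "15-25", "25+")
))

# Give life sentences their own category
band[sentence == "Life"] <- "Life"
band
```

The resulting variable could then be used as a categorical predictor or as a grouping variable in a summary table.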