Marie Daras LLC

EDAs of Murder Rates in All 50 States

In this notebook we use a native R dataset to walk through exploratory data analysis (EDA) of murder rates in all 50 states, on the way to a simple linear regression. We will also talk about the assumptions that go into a linear regression model.

Data Prep

Data prep for the State dataset, which comes with R. This combines three built-in datasets: state.abb, state.x77, and state.region. Here I combine them into one dataset, "states", in preparation for the EDA portion of this demonstration. These are the descriptions of the datasets:

state.abb: a vector with 2-letter abbreviations for the state names.
state.x77: a matrix with 50 rows and 8 columns giving the following statistics: Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, and Area.
state.region: a factor dividing the states into the regions Northeast, South, North Central, and West.

tem <- data.frame(state.x77)                   # Bring in state.x77 as a data frame
states <- cbind(state.abb, tem, state.region)  # Combine all three datasets into one data frame

Renaming Columns

#Rename the first and 10th column
colnames(states)[1] <- "State"               # Rename first column
colnames(states)[10] <- "Region"              # Rename the 10th column
head(states, 5)
##            State Population Income Illiteracy Life.Exp Murder HS.Grad Frost
## Alabama       AL       3615   3624        2.1    69.05   15.1    41.3    20
## Alaska        AK        365   6315        1.5    69.31   11.3    66.7   152
## Arizona       AZ       2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas      AR       2110   3378        1.9    70.66   10.1    39.9    65
## California    CA      21198   5114        1.1    71.71   10.3    62.6    20
##              Area Region
## Alabama     50708  South
## Alaska     566432   West
## Arizona    113417   West
## Arkansas    51945  South
## California 156361   West

Looking at our datatypes

str(states)
## 'data.frame':    50 obs. of  10 variables:
##  $ State     : chr  "AL" "AK" "AZ" "AR" ...
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...
##  $ Region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...

We will start by examining the distributions of the different variables, particularly for normality. If they meet the assumption of normality, then some methods, such as linear regression, are appropriate. A variable is normally distributed when its values follow a "normal" or bell-shaped curve: the mean, median, and mode (the measures of central tendency) fall at approximately the same point. About 68.27% of observations fall within +/- 1 standard deviation (SD) of the mean, 95.45% within +/- 2 SDs, and 99.73% within +/- 3 SDs. In a normal curve, half of the observations fall below the mean and half fall above.
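
As a quick sanity check, the 68-95-99.7 percentages come straight from the standard normal CDF, which base R exposes as pnorm():

# Proportion of a normal distribution within +/- k standard deviations of the mean
sapply(1:3, function(k) pnorm(k) - pnorm(-k))   # roughly 0.683, 0.954, 0.997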

The Normal Curve

Let's Look at Summary Statistics

summary(states)
##     State             Population        Income       Illiteracy   
##  Length:50          Min.   :  365   Min.   :3098   Min.   :0.500  
##  Class :character   1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625  
##  Mode  :character   Median : 2838   Median :4519   Median :0.950  
##                     Mean   : 4246   Mean   :4436   Mean   :1.170  
##                     3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575  
##                     Max.   :21198   Max.   :6315   Max.   :2.800  
##     Life.Exp         Murder          HS.Grad          Frost       
##  Min.   :67.96   Min.   : 1.400   Min.   :37.80   Min.   :  0.00  
##  1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25  
##  Median :70.67   Median : 6.850   Median :53.25   Median :114.50  
##  Mean   :70.88   Mean   : 7.378   Mean   :53.11   Mean   :104.46  
##  3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75  
##  Max.   :73.60   Max.   :15.100   Max.   :67.30   Max.   :188.00  
##       Area                  Region  
##  Min.   :  1049   Northeast    : 9  
##  1st Qu.: 36985   South        :16  
##  Median : 54277   North Central:12  
##  Mean   : 70736   West         :13  
##  3rd Qu.: 81162                     
##  Max.   :566432

Measuring Central Tendency

There are a few main ways of measuring the "central tendency" of a variable - that is, how we describe data by the typical data point within its distribution. The most common are the mean, median, and mode. The mean of a variable is a numerical measure of the central location of the data values: the sum of the data values divided by their count. The median is the value in the middle when the data is sorted in ascending order; it is an ordinal measure of central location. The mode is the value with the highest number of occurrences in a set of data; unlike the mean and median, the mode can be computed for both numeric and character data. Base R does not have a function to calculate the mode, so below we write one and calculate it for a couple of the numerical variables as a demonstration.

# Function to calculate the mode (most frequently occurring value) of a vector
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Mode of Population
pop <- states$Population
result <- getmode(pop)
print(result)
## [1] 3615

# Mode of Murder
murder <- states$Murder
result <- getmode(murder)
print(result)
## [1] 10.3

Variance & Standard Deviation

So far we can see that Income, Illiteracy, Life.Exp, and HS.Grad are likely to be normally distributed, judging by the Min/Max values and where the means are positioned. Another tell is that the mean and median (and often the mode) sit at roughly the same position. This is not the case for Population, where the mean, median, and mode differ and most of the observations fall below the mean. With Murder, the mean is a bit above the median, and the mode is not close to either. We will look at this in another way as well. In essence, we are looking to see whether each variable is normal or "skewed," where the data piles up on one side of the mean with a long tail on the other. When a variable is skewed, we sometimes need to transform the data or choose another method altogether.
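
A quick way to see this in code is to compare the mean and median of each numeric variable side by side; a mean sitting well above the median is a hint of right skew. A minimal sketch with base R:

# Mean vs. median for each numeric column: large gaps suggest skew
sapply(states[, 2:9], function(x) c(mean = mean(x), median = median(x)))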

When we talk about standard deviation and variance, we are discussing how the observations are spread about the mean. Variance is derived by subtracting the mean from each data point, squaring each of these differences (removing the +/- sign so we get a measure of distance from the mean), and then taking the mean of those squares. Standard deviation is the square root of the variance; in essence, we have standardized the variance back to the same units of measure as the dataset. A smaller standard deviation means the data is clustered tightly about the mean, while a larger standard deviation means the data is more spread out. (R's var() and sd() use the sample versions, dividing by n - 1 rather than n.)
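
To make the definition concrete, here is a small sketch that computes the variance and standard deviation of Murder by hand and compares them with R's built-in var() and sd(), using the sample (n - 1) denominator:

x <- states$Murder
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)  # average squared distance from the mean (sample version)
manual_sd  <- sqrt(manual_var)                # standard deviation = square root of the variance
c(manual_var = manual_var, var = var(x), manual_sd = manual_sd, sd = sd(x))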

Variance
lapply(states[, 2:7], var)
## $Population
## [1] 19931684
## 
## $Income
## [1] 377573.3
## 
## $Illiteracy
## [1] 0.3715306
## 
## $Life.Exp
## [1] 1.80202
## 
## $Murder
## [1] 13.62747
## 
## $HS.Grad
## [1] 65.23789

Standard Deviation

sdev <-lapply(states[, 2:7], sd)
print(sdev)
## $Population
## [1] 4464.491
## 
## $Income
## [1] 614.4699
## 
## $Illiteracy
## [1] 0.6095331
## 
## $Life.Exp
## [1] 1.342394
## 
## $Murder
## [1] 3.69154
## 
## $HS.Grad
## [1] 8.076998

Coefficient of Variation

The coefficient of variation (CoV) is the ratio of the standard deviation to the mean: CoV = SD / Mean, often expressed as a percentage. The higher the CoV, the greater the relative dispersion of the data around the mean; a very high CoV is often a sign of skew rather than a normal distribution.
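
For a single variable this is just sd() divided by mean(); for example, a quick sketch of the overall CoV of the Murder rate:

# Coefficient of variation for one variable, expressed as a percentage
sd(states$Murder) / mean(states$Murder) * 100

Below we use dplyr to compute the CoV of every numeric variable within each Region.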

library(dplyr)
summary_df <- states %>% group_by(Region) %>%
  summarise(across(where(is.numeric), list(cv = ~sd(.)/mean(.) * 100)))
print(summary_df)
## # A tibble: 4 × 9
##   Region  Population_cv Income_cv Illiteracy_cv Life.Exp_cv Murder_cv HS.Grad_cv
##   <fct>           <dbl>     <dbl>         <dbl>       <dbl>     <dbl>      <dbl>
## 1 Northe…         111.      12.2           27.8        1.04      56.6       7.28
## 2 South            66.1     15.1           31.8        1.47      24.8      12.9 
## 3 North …          77.1      6.14          20.2        1.44      67.7       6.65
## 4 West            191.      14.1           59.5        1.90      37.1       5.65
## # ℹ 2 more variables: Frost_cv <dbl>, Area_cv <dbl>

Visualization & The Shapiro-Wilk

We are going to visualize our data and run a Shapiro-Wilk test for normality, and we'll discuss the interpretation as we go along.

Plotting Quantiles - the Q-Q Plot

The first method for exploring the normality of our variables is to plot quantiles in a Q-Q plot. This is a quick visual check and is very easy in R. Quantiles are values that split sorted data or a probability distribution into equal parts; in general terms, a q-quantile divides the sorted data into q parts. Quantiles get special names when the data is divided into 4 parts (quartiles), 10 parts (deciles), and so forth. A Q-Q plot takes your data, sorted in ascending order, and plots it against the corresponding quantiles of a theoretical normal distribution.

So the x-axis shows the theoretical quantiles and the y-axis shows your sample quantiles. The closer the plotted points are to the reference line, the more normally distributed the data.
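
For example, the quartiles of Murder can be pulled out directly with base R's quantile():

# Quartiles (4-quantiles) of the Murder variable
quantile(states$Murder, probs = seq(0, 1, by = 0.25))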

We will also be eyeballing the histograms for a normal density curve. Let’s take a look…

a <- colnames(states)[2:9]  # All numeric columns/variables by name
par(mfrow = c(4, 4))        # 8 variables x 2 plots each = 16 panels
for (i in 1:length(a)) {
  sub <- states[[a[i]]]

  hist(sub, main = paste("Hist. of", a[i]), xlab = a[i])
  qqnorm(sub, main = paste("Q-Q Plot of", a[i]))
  qqline(sub)  # Add the reference line to the Q-Q plot

  if (i == 1) {
    s.t <- shapiro.test(sub)              # First test starts the results table
  } else {
    s.t <- rbind(s.t, shapiro.test(sub))  # Bind each new test result as a row
  }
}

Examining the Histograms and Q-Q Plots:

Histogram: Does the histogram approach a normal density curve? If yes, then the variable more likely follows a normal distribution.

Q-Q plot: Do the sample quantiles almost fall into a straight line? If yes, then the variable more likely follows a normal distribution.

We can see that the distributions of Population, Illiteracy, and Area are skewed to the right (the bulk of values sits on the left with a long right tail), while Income and Life.Exp are distributed close to normal.

Shapiro-Wilk Test

Next we are going to run the Shapiro-Wilk test, which tests the null hypothesis that a variable follows a normal distribution.

In statistics we cannot prove that something is true or that a difference exists. Instead we create a hypothesis that we wish to test, the alternative hypothesis (Ha): that there is a difference between our data and what we are testing against. The null hypothesis (H0) states that there is no difference. If we reject the null, we accept the alternative. We measure this with a p-value, the probability of obtaining results at least as extreme as ours if the null hypothesis were true; a common threshold for rejecting the null is 0.05.

With the Shapiro-Wilk test, our Ha is that there is a difference between the variable's distribution and a normal distribution; the null is that there is no difference. Therefore a small p-value means we reject the null and conclude the variable is not normal. We are looking for large p-values for normal distributions.
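
For a single variable the test is one call; for example, running it on Income alone returns the same W statistic and p-value that appear in the table below:

# Shapiro-Wilk normality test for one variable:
# a large p-value means we fail to reject the null hypothesis of normality
shapiro.test(states$Income)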

s.t <- s.t[, 1:2]     # Keep only the test statistic and p-value columns
s.t <- cbind(a, s.t)  # Attach the variable names

s.t
##     a            statistic p.value     
## s.t "Population" 0.769992  1.906393e-07
##     "Income"     0.9769037 0.4300105   
##     "Illiteracy" 0.8831491 0.0001396258
##     "Life.Exp"   0.97724   0.4423285   
##     "Murder"     0.9534691 0.04744626  
##     "HS.Grad"    0.9531029 0.04581562  
##     "Frost"      0.9545618 0.05267472  
##     "Area"       0.5717872 7.591835e-11

The Shapiro-Wilk tests show no evidence against normality for Income, Life.Exp, and Frost (p-values greater than 0.05), while Murder and HS.Grad are borderline, with p-values just under 0.05. There is no evidence that Population, Illiteracy, and Area are normally distributed. This is why both visual and statistical checks are important.

counts <- sort(table(states$Region), decreasing = TRUE)  # Number of states in each region
percentages <- 100 * counts / length(states$Region)      # Convert counts to percentages
barplot(percentages, ylab = "Percentage", col = "hot pink")
text(x = seq(0.7, 5, 1.2), y = 2, paste("n=", counts))   # Label each bar with its count

The graph above shows us the breakdown of states across regions. The South has the most states and the Northeast has the fewest.

Finishing Up!

To finish up this data exploration we isolate a few key variables and create a visualization to see if there might be a relationship, which we will explore further in the next notebook!

Here we create a subset of the original dataset that keeps just Region, Income, Illiteracy, and Murder, with one row per state.

We call this new dataset "tab1" to reflect that this aggregate is in tabular form.

library(dplyr)

tab1 <- states %>%
  group_by(Region, Income, Illiteracy, Murder) %>%
  summarise(total_count = n(), .groups = 'drop') %>%
  as.data.frame()
tab1 <- tab1[-c(5)]  # Drop the helper count column
head(tab1, 5)
##      Region Income Illiteracy Murder
## 1 Northeast   3694        0.7    2.7
## 2 Northeast   3907        0.6    5.5
## 3 Northeast   4281        0.7    3.3
## 4 Northeast   4449        1.0    6.1
## 5 Northeast   4558        1.3    2.4

Below is a heatmap. Generally a heatmap shows the relationship between numeric variables, but it can include categorical variables, as we do with Region here. On the x-axis we have Region, on the y-axis Illiteracy, and our fill is Murder. What we can see in our 1976 states data is that, fairly consistently across regions, there is a positive relationship between illiteracy and murder rates: as one goes up, the other does also. The only region where this differs is the West, where the relationship is less clear.

library(tidyverse)
library(repr)
options(repr.plot.width=10, repr.plot.height=10)
ggplot(tab1, aes(x = Region, y = Illiteracy, fill = Murder)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="hot pink")

library(vembedr)
embed_url("https://youtu.be/hAx6mYeC6pY?si=X7fywX2hTIyu2gTw")