Prologue

Would you have safety concerns if you decided to move to another state in the U.S.?

I do.

Would you be interested in knowing how weather can affect human behavior?

Hmm, fair.

What about getting to know the relationship between crimes and weather?

Yes, I do!

Synopsis

Everyone has concerns: education, jobs, marriage, children, housing, safety, living conditions, politics and so on. These concerns are tied to one another, and life is filled with correlations. Statisticians and data scientists dig into the numbers to find the strong or weak relationships among these concerns, and they want us to worry less.

My focus for the final project is the relationship between crime rates in the United States and climate (specifically temperature and precipitation). One of my friends used to tell me, “higher temperature will cause higher crime rates because people get excited and irritated easily under hot weather. So Hawaii has a relatively high crime rate, and summer definitely has higher crime rate than winter.” Is her statement scientific? I doubt it, which is why I chose this topic: to test her statement.

Data Sets Used

I’m going to use a couple of data sets, mainly from the FBI and Current Results. Current Results uses data from the NOAA National Climatic Data Center. Because the files on NOAA are very large, I decided to get the same table from Current Results instead.

Here are the data sets and their characteristics:

Methodology and Packages

tidyverse -> includes most of the packages that I will use for the analysis, such as ggplot2, tibble, tidyr and dplyr

readxl -> enables R to read Excel spreadsheets

DT -> enables R to create interactive data tables that can show multiple entries and be searched by keyword

plotly -> enables R to make interactive graphs

fiftystater -> provides the fifty_states map data; pass it as the map argument of geom_map to draw the fifty-state map

vembedr -> embeds YouTube videos in R Markdown

ggmap -> uses ggplot2 to plot data frames on top of Google Maps

Research Questions

I want to test my friend’s unsupported statement about the relationship between climate and crime rate. More specifically, I will answer these questions:

  • Do states like Hawaii and Florida have higher crime rates than states like Maine and Alaska? Why or why not?

  • Does summer have a higher crime rate than winter? Why or why not?

  • What is the relationship between crime numbers/rate and average temperature?

  • What is the relationship between crime numbers/rate and average precipitation?


Data Preparation

Crime Percentage Data

The following data set, which I downloaded from the FBI website, lists the percentage distribution of murders by month from 1996 to 2000. (I did not find the percentage distribution for total crimes, so murders serve as a stand-in.)

Interpretation example: the murders in January 1996 account for 8.7% of all murders that occurred in 1996.

The purpose of using this data set is to find out which months, on average, have more murders.

# load all the packages that I will be using for the project
library(tidyverse)
library(readxl)
library(DT)
library(plotly)
library(fiftystater)
library(vembedr)
library(ggmap)
# see how many sheets there are in the spreadsheet
excel_sheets("murderbymonth.xlsx")
## [1] "Sheet1" "Sheet2" "Sheet3"
# select the sheet and skip the useless rows above variable names
Murderbymonth_old <- read_excel("murderbymonth.xlsx", sheet = "Sheet1", skip = 3)

# Even after skipping 3 rows, there are still blank rows due to the special
# formatting in the original spreadsheet, so create a new data frame that
# omits all the rows containing NA.
mur_ex <- na.omit(Murderbymonth_old)
Murderbymonth <- as_tibble(mur_ex)

# create an interactive table using the cleaned-up data frame
datatable(Murderbymonth, rownames = FALSE, caption = "Murder Percentage of Each Month")
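
One caveat about the cleaning step above: na.omit drops every row that contains at least one NA, not only rows that are entirely blank. The sketch below uses a made-up three-row data frame (the column names and values are hypothetical, not the FBI figures) to show the difference and one way to drop only fully blank rows.

```r
# Hypothetical stand-in for a spreadsheet with one blank row and one
# partially filled row; the numbers are illustrative only.
raw <- data.frame(
  Month = c("January", NA, "February"),
  y1996 = c(1, NA, 2),
  y1997 = c(3, NA, NA)
)

# na.omit removes any row with at least one NA.
dropped_any <- na.omit(raw)

# Alternative: remove only rows where every column is NA.
all_blank <- rowSums(is.na(raw)) == ncol(raw)
dropped_blank <- raw[!all_blank, ]

nrow(dropped_any)    # rows with complete data only
nrow(dropped_blank)  # rows with at least one value
```

If the month table had a partially filled row, comparing the two row counts would reveal it before any month was silently lost.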

Weather Data

The weather data was downloaded from Current Results. The data set includes the average temperature of each state and its rank in the nation, along with the average precipitation of each state and its rank in the nation. The data set covers the period from 1971 to 2000.

Interpretation example: from 1971 to 2000, the average temperature in Alabama was 17.1 degrees Celsius, and the average precipitation was 58.3 inches.

The purpose of using this data set is to find out which states have higher average temperature and precipitation; the ranks are a good indicator of both. This spreadsheet is quite simple and does not require much cleanup.

#see how many sheets there are in the spreadsheet
excel_sheets("avgtempbystate.xlsx")
## [1] "Sheet1"
#create a data frame by reading the excel spreadsheet
weatherexcel <- read_excel("avgtempbystate.xlsx", sheet = "Sheet1")
weather <- as_tibble(weatherexcel)
#change the column names into names without special characters
colnames(weather) <- c("State", "Avg F", "Avg C", "Temp Rank", "Inches", "Millimeters", "Precipitation Rank" )
#change the state names into uppercase in order to merge with the other data set
weather$State <- toupper(weather$State)
#make sure the measurement and rank columns are numeric rather than character
weather[2:7] <- lapply(weather[2:7], as.numeric)
#create an interactive table based on "weather" data frame
datatable(weather, caption = 'Average Temperature and Precipitation By State')

Crimes by State Data

The following data set was downloaded from the FBI. It includes detailed information about crimes in 2000, such as crime categories and the crime index, as well as the population of each state.

There are two broad categories of crime. Murder and non-negligent manslaughter, forcible rape, robbery and aggravated assault are VIOLENT CRIME; burglary, larceny-theft, motor vehicle theft and arson are PROPERTY CRIME. Since I am only interested in overall crime rates, I keep only the total number of crimes, which is the Crime Index Total (Violent Crime + Property Crime).

The data set is nicely formatted in Excel, but it requires a lot of cleanup once it is imported into R.

See the following code for the cleaning procedure.

#to see how many sheets are included in the spreadsheet
excel_sheets("crimebystate2000.xlsx")
## [1] "TABLE9A"
#create the original data frame by reading the excel sheet
Crimebystate <- read_excel("crimebystate2000.xlsx", sheet = "TABLE9A", skip = 3)

#Check that the variables of interest have the correct type
str(Crimebystate)
## Classes 'tbl_df', 'tbl' and 'data.frame':    644 obs. of  15 variables:
##  $ Area                                       : chr  NA "ALABAMA" NA "Metropolitan Statistical Area" ...
##  $ Population                                 : chr  NA NA NA "3102822" ...
##  $ Crime Index Total                          : num  NA NA NA NA 149103 ...
##  $ Modified Crime Index Total1                : logi  NA NA NA NA NA NA ...
##  $ Violent Crime2                             : num  NA NA NA NA 15419 ...
##  $ Property Crime3                            : num  NA NA NA NA 133684 ...
##  $ Murder and non-negligent man-     slaughter: chr  NA NA NA NA ...
##  $ Forcible rape                              : num  NA NA NA NA 1076 ...
##  $ Robbery                                    : chr  NA NA NA NA ...
##  $ Aggravated assault                         : num  NA NA NA NA 9386 ...
##  $ X__1                                       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Burglary                                   : num  NA NA NA NA 29110 ...
##  $ Larceny-theft                              : num  NA NA NA NA 94554 ...
##  $ Motor vehicle theft                        : num  NA NA NA NA 10020 ...
##  $ Arson1                                     : logi  NA NA NA NA NA NA ...
#Population was read as 'character', so I need to convert it to 'numeric'
Crimebystate$Population <- as.numeric(as.character(Crimebystate$Population))

#Create a new data frame with the useful variables: "Area", "Population" and "Crime Index Total"
crimebystate_clean <- select(Crimebystate, c(1,2,3))

#Due to the special formatting of the spreadsheet, "State Total" and "Total" appear
#in the "Area" variable. Keep only the rows with "State Total" (the 50 states)
#and "Total" (DC and PR), and filter out all other rows.
crimebystate_clean2 <- filter(crimebystate_clean, Area %in% c("State Total", "Total"))

This is the most important step in cleaning this data set. I don’t want to type each state’s name into each row, so I retrieve all the state names by running the following code. The idea is to keep the rows whose text contains at least two consecutive capital letters (e.g. AB, KO).

crimebystate_clean1 <-  grep(pattern = ("[a-zA-Z0-9-]*[A-Z]{2}[a-zA-Z0-9-]"), x  = Crimebystate$Area)
crimebystate_clean1 <- Crimebystate[crimebystate_clean1,]

#Replace "State Total" and "Total" with their respective state names, taken from the data frame above
#(this relies on both data frames listing the states in the same order)
crimebystate_clean2$Area <- crimebystate_clean1$Area

#Change the variable names because they contain spaces; use "_" instead
colnames(crimebystate_clean2) <- c("Area", "Population", "Crime_index_total")

#Due to the special formatting in the spreadsheet, there are footnote numbers right next to some state names. Remove the digits so the "Area" variable only contains text.
crimebystate_clean2$Area <-  gsub(pattern = "[0-9]", replacement = "", x = crimebystate_clean2$Area )

#Create a new variable "crime_per_1000"
crimebystate_clean2 <- mutate(crimebystate_clean2, crime_per_1000 = (Crime_index_total / Population) * 1000)

#DC and PR are not in my Weather Data, so I have to exclude them from my analysis
crime1 <- filter(crimebystate_clean2, Area != "PUERTO RICO" & Area != "DISTRICT OF COLUMBIA")
crime <- as_tibble(crime1)
colnames(crime) <- c("State", "Population", "Crime_index_total", "crime_per_1000")

#create a data table for the 50 states with my variables of interest
datatable(crime, caption = 'Crime Rates By States')
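
Copying the state names onto the total rows by position only works if every state-name row is followed by exactly one total row. The sketch below checks that assumption on a small made-up Area vector (the values are illustrative and are not the real spreadsheet contents); the same checks could be run on Crimebystate$Area before the assignment.

```r
# Toy stand-in for the "Area" column; illustrative values only.
area <- c(NA, "ALABAMA", "Metropolitan Statistical Area", "State Total",
          "ALASKA", "Rural", "State Total",
          "DISTRICT OF COLUMBIA3", "Total")

# Same pattern as in the cleaning code: rows with consecutive capital letters.
header_rows <- grep("[a-zA-Z0-9-]*[A-Z]{2}[a-zA-Z0-9-]", area)
total_rows  <- which(area %in% c("State Total", "Total"))

# One total row per state-name row ...
stopifnot(length(header_rows) == length(total_rows))
# ... each total comes after its own state name ...
stopifnot(all(header_rows < total_rows))
# ... and before the next state name.
stopifnot(all(total_rows[-length(total_rows)] < header_rows[-1]))

# Strip footnote digits, as in the cleaning code.
state_names <- gsub("[0-9]", "", area[header_rows])
state_names
```

If any of the checks fails, the names and totals are misaligned and the positional assignment would label rows with the wrong state.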

Merged Data Sets of Weather and Crime

I am interested in knowing both the weather patterns (temperature and precipitation) and crime rates, so I decided to merge both “Weather Data” and “Crime by State Data”. However, the time period for “Weather Data” is from 1971 to 2000, and the time period for “Crime by State Data” is only 2000. Due to limited resources, I will use the average temperature and precipitation from 1971 to 2000, to represent the average temperature and precipitation in 2000.

#merge the Weather data frame with the Crime by State data frame, and create a new data frame Final with the variables of interest: State, Avg F (average temperature), Inches (average precipitation), crime_per_1000 (crimes per thousand people).
Final1 <- merge(weather, crime, by="State")
#select the columns by name rather than by position
Final <- Final1[c("State", "Avg F", "Inches", "crime_per_1000")]
datatable(Final, caption = 'Crimes and Weather Patterns by States')
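
Because merge keeps only the states that appear in both data frames, a name that differs by case or stray whitespace is dropped without a warning. The following sketch, using made-up toy data frames (weather_toy, crime_toy and their values are hypothetical), shows one way to spot unmatched names before merging.

```r
# Toy data; the values are placeholders, not real measurements.
weather_toy <- data.frame(State = c("ALABAMA", "ALASKA", "HAWAII"),
                          avg_f = c(1, 2, 3))
crime_toy <- data.frame(State = c("ALABAMA", "ALASKA ", "HAWAII"),
                        crime_per_1000 = c(10, 20, 30))

# Names present in the weather data but missing from the crime data.
unmatched_before <- setdiff(weather_toy$State, crime_toy$State)

# Trim stray whitespace and check again.
crime_toy$State <- trimws(crime_toy$State)
unmatched_after <- setdiff(weather_toy$State, crime_toy$State)

merged <- merge(weather_toy, crime_toy, by = "State")
nrow(merged)  # should equal the number of states expected
```

Checking that the merged table has the expected number of rows (50 in this project) is a quick way to confirm that no state was lost.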

Analysis & Visualizations

Here are the analysis steps, including some basic descriptive statistics and plots.

Crime Percentage Analysis

First, let’s see which months had the most murders, and which months had the least murders.

#The number printed is the month with the highest murder percentage of that year. For example, if the number is 7, the highest percentage of the year falls in July.
Murderbymonth %>%
  select(-Month) %>%
  summarise_all(which.max)
## # A tibble: 1 x 5
##   `1996` `1997` `1998` `1999` `2000`
##    <int>  <int>  <int>  <int>  <int>
## 1      8      7      8      7      7
#The number printed is the month with the lowest murder percentage of that year. For example, if the number is 1, the lowest percentage of the year falls in January.
Murderbymonth %>%
  select(-Month) %>%
  summarise_all(which.min)
## # A tibble: 1 x 5
##   `1996` `1997` `1998` `1999` `2000`
##    <int>  <int>  <int>  <int>  <int>
## 1      3      2      2      2      2

Here is the visualization of the murder percentage data table. As the points and lines show, each year seems to follow a similar trend.

#Plot the murder percentage of each month by year
Murderbymonth1 <- Murderbymonth %>%
  gather(Year, Rate, `1996`:`2000`) %>%
  mutate(Month = factor(Month, levels = month.name)) %>%
  ggplot(aes(x = Month, y = Rate, group = Year, color = Year)) +
  ggtitle("Murder Percentage By Month") +
  geom_point()+
  geom_line() +
  ylab(label="Percentage of the Year") + 
  xlab("Month") +
  scale_x_discrete(labels = month.abb)

ggplotly(Murderbymonth1)

Based on the output and the graph, we can tell that from 1996 to 2000, the most murders occurred in July and August. On the other hand, the fewest murders occurred in February and March.
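
The which.max and which.min calls above return row positions, which equal month numbers only when the rows run from January to December. As a small sketch, the toy table below (made-up shares that sum to 100 per year, not the FBI figures) shows how the positions can be turned into month names and how the shares can be averaged across years.

```r
# Toy monthly shares in percent; illustrative values only.
toy <- data.frame(
  Month  = month.name,
  `1996` = c(8, 7, 8, 8, 8, 9, 9, 10, 8, 8, 8, 9),
  `1997` = c(9, 7, 8, 8, 8, 8, 10, 9, 8, 8, 8, 9),
  check.names = FALSE
)

# Month name with the highest share in each year.
peak_by_year <- sapply(toy[-1], function(x) toy$Month[which.max(x)])

# Average share of each month across the years.
avg_share <- setNames(rowMeans(toy[-1]), toy$Month)
lowest_on_average <- names(avg_share)[which.min(avg_share)]

peak_by_year
lowest_on_average
```

Reporting month names instead of positions avoids a silent mix-up if the rows are ever reordered.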

Weather Analysis

Are temperature and precipitation correlated? My guess was “yes”. I come from a city in southern China that has a weather pattern similar to Winston-Salem’s. After living in Michigan for four and a half years, I noticed that Michigan is noticeably colder and drier than Winston-Salem. My assumption was that there is a positive correlation between temperature and precipitation. To check this assumption, I used ggplot to graph the data I collected.

#I'm interested in the relationship between temperature and precipitation
#(temperature is on the x axis and precipitation on the y axis)
a <- ggplot(weather, aes(x=`Avg F`, y=`Inches`)) + 
  geom_point() + 
  geom_smooth(method="loess", se=FALSE) + 
  labs(x="Average Temperature (F)", 
       y="Precipitation (Inches)", 
       title="Temperature VS. Precipitation", 
       caption = "Source: Current Results")
ggplotly(a)

Based on the graph, it looks like there is a positive relationship between temperature and precipitation.
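
A scatter plot only suggests a relationship; a correlation test puts a number on it. The sketch below wraps cor.test in a small helper (check_correlation is a hypothetical name) and runs it on made-up vectors; in the project it could be called with weather$`Avg F` and weather$Inches instead.

```r
# Hypothetical helper: Pearson correlation and its p-value.
check_correlation <- function(x, y) {
  test <- cor.test(x, y, method = "pearson")
  c(r = unname(test$estimate), p_value = test$p.value)
}

# Toy vectors; illustrative values only, not the Current Results data.
temp_toy   <- c(40, 45, 50, 55, 60, 65, 70)
precip_toy <- c(20, 28, 25, 35, 38, 45, 50)

res <- check_correlation(temp_toy, precip_toy)
res
```

A positive r with a small p-value would support the visual impression; a loess curve that bends strongly would be a reason to look beyond a linear correlation.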

Let’s look at the average temperature and precipitation on US maps. The higher the temperature, the lighter the color. The same rule applies to precipitation: the more precipitation a state has, the lighter its color.

#Use ggplot and geom_map to map data frames onto the US fifty-state map.
#Plot the average temperature of each state on the US map.
weather$State <- tolower(weather$State) #geom_map needs lower-case state names to match fifty_states
Temp <- ggplot(weather, aes(map_id = State)) + 
     geom_map(aes(fill = `Avg F`), map = fifty_states) + 
     expand_limits(x = fifty_states$long, y = fifty_states$lat) +
     coord_map() +
     scale_x_continuous(breaks = NULL) + 
     scale_y_continuous(breaks = NULL) +
     labs(x = "", y = "") +
     theme(legend.position = "bottom", 
          panel.background = element_blank())

Temp + fifty_states_inset_boxes() 

If you are familiar with the US map, you can see that Florida and Hawaii have the highest temperatures (no surprise). Some states in the South, such as Louisiana, Texas, Georgia, Mississippi and Alabama, also have comparatively high average temperatures. Alaska, of course, has the lowest temperature. After Alaska, North Dakota, Maine, Minnesota and Wyoming have comparatively low temperatures among the 50 states.

#Plot the average precipitation of each state on the US map.
Precipitation <- ggplot(weather, aes(map_id = State)) + 
     geom_map(aes(fill = `Inches`), map = fifty_states) + 
     expand_limits(x = fifty_states$long, y = fifty_states$lat) +
     coord_map() +
     scale_x_continuous(breaks = NULL) + 
     scale_y_continuous(breaks = NULL) +
     labs(x = "", y = "") +
     theme(legend.position = "bottom", 
          panel.background = element_blank())
Precipitation + fifty_states_inset_boxes()  #add a box for Hawaii and Alaska

We can tell from the map that Hawaii and some states such as Louisiana, Mississippi, Alabama, Florida and North Carolina have the most precipitation. On the other hand, Arizona, Nevada, Utah, Wyoming and New Mexico have the least precipitation.

Crimes by State Analysis

The crime index total is not a good indicator of crime rates; “crime_per_1000” is a better one because it accounts for the population of each state. Let’s see which states in the US have the most crimes per thousand people. The lighter the color, the higher the crime rate in the state.

#Plot the variable 'crime_per_1000' on the US map
crime$State <- tolower(crime$State) #in order to use the geom_map, I need to change the state names into lower case 
crimeperk <- ggplot(crime, aes(map_id = State)) + 
     geom_map(aes(fill = `crime_per_1000`), map = fifty_states) + 
     expand_limits(x = fifty_states$long, y = fifty_states$lat) +
     coord_map() +
     scale_x_continuous(breaks = NULL) + 
     scale_y_continuous(breaks = NULL) +
     labs(x = "", y = "") +
     theme(legend.position = "bottom", 
          panel.background = element_blank())
crimeperk+ fifty_states_inset_boxes() #add a box for Hawaii and Alaska

According to the map above, the southern part of the US has, on average, lighter colors than the northern part. Arizona, New Mexico and Florida in particular are lighter than the other states. North Dakota, South Dakota, New Hampshire and West Virginia have the darkest colors. Alaska is clearly darker than Hawaii.

Merged Data Sets Analysis

Referring back to my research questions, I’m interested in knowing the following:

  • What is the relationship between crime rate and average temperature?

  • What is the relationship between crime rate and average precipitation?

Crime and Temperature

Let’s first plot the scatter plot for crime rate and average temperature for each state, in order to find any possible relationship between them.

b <- ggplot(Final, aes(x=`Avg F`, y=`crime_per_1000`, color = `State`)) + 
  geom_point() + 
  labs(y="Crimes Per 1000", 
       x="Average Temperature", 
       title="Temperature VS. Crime Rate")
ggplotly(b)

There seems to be some sort of positive correlation. Let’s create another plot with the geom_smooth function to help “the eye in seeing patterns in the presence of overplotting” (?geom_smooth). I used the lm method for geom_smooth to see if there is any linear relationship.

c <- ggplot(Final, aes(x=`Avg F`, y=`crime_per_1000`)) + 
  geom_point() + 
  geom_smooth(method="lm", se=TRUE, color = "pink") + 
  labs(y="Crimes Per 1000", 
       x="Average Temperature", 
       title="Temperature VS. Crime Rate")
ggplotly(c)

The graph above suggests a positive linear relationship between crime rate and temperature.

Let’s run a simple linear regression to verify our observations.

#run a simple linear regression using crime per 1000 as the dependent variable and avg f as the independent variable
crime_temp <- lm(Final$crime_per_1000 ~ Final$`Avg F`)
summary(crime_temp)
## 
## Call:
## lm(formula = Final$crime_per_1000 ~ Final$`Avg F`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6748  -5.3248   0.4964   4.6881  19.4410 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.4791     6.4987   0.843    0.403    
## Final$`Avg F`   0.6607     0.1234   5.353  2.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.522 on 48 degrees of freedom
## Multiple R-squared:  0.3738, Adjusted R-squared:  0.3608 
## F-statistic: 28.65 on 1 and 48 DF,  p-value: 2.399e-06

Now, we can answer one of my questions: What is the relationship between crime rate and average temperature?

Since the P-value is very small (2.4e-06, far below 0.05), we can conclude that temperature is significantly associated with crime rate. The relationship is positive and linear, and the estimated model is as follows:

\(\widehat{Crime}_i = 5.48 + 0.66 \times Temperature_i\)

Crime: the number of crimes per 1000 people in the state (2000); Temperature: the average annual temperature of the state in degrees Fahrenheit (1971 to 2000 average)

Even though this is not a good model for predicting a state’s crime rate (it explains only about 37% of the variation, R-squared = 0.37), it still shows a clear positive relationship between the dependent and independent variables.
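
To make the fitted line concrete, the sketch below plugs the coefficients from the regression output above into a small helper (predict_crime is a hypothetical name) and computes the fitted crime rate at an arbitrary temperature, along with the change implied by a 10-degree difference.

```r
# Coefficients copied from the summary(crime_temp) output above.
intercept <- 5.4791
slope     <- 0.6607

# Hypothetical helper: fitted crimes per 1000 people at a given average
# temperature in degrees Fahrenheit.
predict_crime <- function(temp_f) intercept + slope * temp_f

fit_at_50  <- predict_crime(50)                      # 5.4791 + 0.6607 * 50
diff_10deg <- predict_crime(60) - predict_crime(50)  # 10 * slope

# With a single predictor, R-squared is the squared correlation.
r_from_r2 <- sqrt(0.3738)
```

Under this model, a state that is 10 degrees Fahrenheit warmer on average is associated with roughly 6.6 more crimes per 1000 people, and the R-squared of 0.3738 corresponds to a correlation of about 0.61. This describes an association across states, not a demonstrated causal effect.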

Crime and Precipitation

Let’s do the same thing for precipitation. First, let’s create a scatter plot for Crime Rate and Precipitation in order to find out if there is any correlation between the two variables.

d <- ggplot(Final, aes(x=`Inches`, y=`crime_per_1000`, color = `State`)) + 
  geom_point() + 
  labs(y="Crimes Per 1000", 
       x="Average Precipitation", 
       title="Precipitation VS. Crime Rate")
ggplotly(d)

From the plot we created, the points are randomly scattered, so there seems to be no obvious correlation between precipitation and crime rate. Let’s create another plot using geom_smooth.

e <- ggplot(Final, aes(x=`Inches`, y=`crime_per_1000`)) + 
  geom_point() + 
  geom_smooth(method="lm", se=TRUE, color = "orange") + 
  labs(y="Crimes Per 1000", 
       x="Average Precipitation", 
       title="Precipitation VS. Crime Rate")
ggplotly(e)

This matches our observation: the fitted line is almost horizontal, which means there is almost no linear relationship between precipitation and crime rate. Let’s again run a simple linear regression to check.

crime_prec <- lm(Final$crime_per_1000 ~ Final$`Inches`)
summary(crime_prec)
## 
## Call:
## lm(formula = Final$crime_per_1000 ~ Final$Inches)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9103  -8.1474   0.9972   6.7454  20.1460 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.19457    3.67147  10.131 1.65e-13 ***
## Final$Inches  0.07020    0.09223   0.761     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.449 on 48 degrees of freedom
## Multiple R-squared:  0.01193,    Adjusted R-squared:  -0.008658 
## F-statistic: 0.5794 on 1 and 48 DF,  p-value: 0.4503

The P-value for the regression is 0.45, which is greater than the significance level of 0.05. We can conclude that precipitation and crime rate do not have a significant linear relationship.
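
Another way to read the same result is through a 95% confidence interval for the precipitation slope. The sketch below rebuilds it from the estimate, standard error and degrees of freedom printed in the output above; on the fitted model, confint(crime_prec) should give the same interval.

```r
# Values copied from the summary(crime_prec) output above.
estimate  <- 0.07020
std_error <- 0.09223
df        <- 48

# Two-sided 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The interval runs from about -0.12 to 0.26 and contains zero, which agrees with the P-value of 0.45.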

Summary of Analysis

All of my research questions can be answered with the data visualizations and analysis above.


1. Do states like Hawaii and Florida have higher crime rates than states like Maine and Alaska? Why or why not?

According to the “Crime by State” data table, Arizona, Florida, New Mexico, Louisiana, South Carolina and Hawaii are the six states with the highest crime rates per 1000 people.

North Dakota, South Dakota, New Hampshire, West Virginia, Maine and Kentucky are the six states with the lowest crime rates per 1000 people.

Based on the “Crime by State” US map plotted with ggplot2, states in the southern part of the US have, on average, higher crime rates than those in the northern part. In addition, the South is on average warmer than the North (geography knowledge).

To conclude, on average, states in the North are safer than states in the South.


2. Does summer have a higher crime rate than winter? Why or why not?

First, which months are Spring, Summer, Fall and Winter?

According to timeanddate.com,

To be consistent and to make weather forecasting easier, meteorologists divide the year into 4 meteorological seasons of 3 months each:

  • Spring - from March 1 to May 31;
  • Summer - from June 1 to August 31;
  • Fall - from September 1 to November 30; and,
  • Winter - from December 1 to February 28 (February 29 in a leap year).

From the first part of the analysis (crime percentage analysis), the table and graph show that July and August have the most murders of the year, and December and January also have comparatively many; February and March have the fewest murders, and April, May and November have relatively few.

We can see that Summer had the most murders of the year from 1996 to 2000. However, Winter months like December and January also have a relatively high number of murders. There could be many reasons for this; for example, there may be more crimes around year-end because of the holidays.

We cannot rank the safety of each season, but we can conclude that Summer is a little more dangerous, and Spring is a safer season than other seasons because March, April and May all have lower murder rates.
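
The seasonal comparison above can also be done by summing the monthly shares within each meteorological season. The sketch below defines a hypothetical helper, season_of, that maps month names to the seasons listed above, and applies it to a placeholder vector in which every month has the same share; with the real table, one year's column of percentages would take the place of toy_share.

```r
# Hypothetical helper: meteorological season for each month name.
season_of <- function(month) {
  season_by_index <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                       "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
                       "Winter")
  season_by_index[match(month, month.name)]
}

# Placeholder shares: every month gets 100/12 percent (not real data).
toy_share <- setNames(rep(100 / 12, 12), month.name)

# Total share per season.
season_share <- tapply(toy_share, season_of(names(toy_share)), sum)
season_share
```

With equal monthly shares each season sums to 25 percent, so any season well above 25 in the real data has more than its proportional share of murders.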


3. What is the relationship between crime numbers/rate and average temperature?

There is a positive linear relationship between temperature and crime rates.

This result is consistent with my friend’s statement that “higher temperature will cause higher crime rates”, although a correlation across states does not by itself prove causation. Is it because people get excited and irritated easily in hot weather?

Maybe. If I have another chance to analyze people’s behaviors and moods under different temperatures, I will definitely look into it.


4. What is the relationship between crime numbers/rate and average precipitation?

There is no obvious linear relationship between precipitation and crime rates.

Even though precipitation is related to temperature, we cannot conclude that there is a linear relationship between precipitation and crime rates.

Epilogue

If you want to move to the safest place and you don’t mind living in cold weather, Alaska is the best choice, and you will get to enjoy the aurora.

If you also care about education, the economy or technology, Alaska might not be the best option, but you have other choices in the North.

Please refer to the map under the “Crimes by State Analysis” section if you decide to move to another state. The U.S. News website is a great resource if you are looking for rankings of the safest states and the best states.

Watch a YouTube video of police talking about crime and weather!

embed_youtube("4zpQKwx8i7E")