MATH2349 Data Wrangling

Required packages

# Load packages
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readxl)
library(tidyr)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

# Define function for testing whole numbers
# Ref: BaseR documentation "Integer Vectors"
# R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. <URL: https://www.R-project.org/>.

is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
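To illustrate why a tolerance-based check is used rather than `is.integer()`, the following sketch applies the function to a small, made-up vector. The values are hypothetical and are not drawn from the crime data.

```r
# Tolerance-based whole-number test, as defined above
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol

# Hypothetical counts stored as doubles
counts <- c(4, 0, 3.5, 2)

# is.integer() tests the storage type, not the value, so it is FALSE here
type_check <- is.integer(counts)

# is.wholenumber() tests each value and flags only the fractional entry
value_check <- is.wholenumber(counts)
print(type_check)
print(value_check)
```

The distinction matters because counts read from a spreadsheet are usually stored as doubles even when every value is a whole number.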

Executive Summary

The purpose of this investigation was to determine if there was a relationship between maximum daily temperatures and violent crimes in Perth, Western Australia.

Data published by the Western Australian police on offences between January 2007 and June 2020 was used along with data published by the Bureau of Meteorology for maximum temperatures recorded for the Perth metropolitan area between 1994 and 2020. These statistics included the mean maximum daily temperature and the mean number of days with maximum temperatures over 30C, 35C and 40C.

For both data sets, the data was imported and then subset to obtain the variables of interest. The weather data required some manipulation to convert it into a tidy format.

Variables were then renamed and variable types converted as required to make them suitable for further processing.

Both data sets were scanned for missing data, errors, inconsistencies, special values and numerical outliers. None were found.

The crime data was filtered to include only offences that had an aspect of violence associated with them. These values were then totalled to represent all violent offences each month.

The month of the year was then used to merge the two data sets, effectively linking the number of offences recorded each month with the temperature statistics.

As the histogram of the crime data showed the data was right skewed, a log10 transformation was used to adjust the data to a more normal distribution. This transformation was found to be the most effective on this particular data set.

Finally, linear regression models were fitted to the data and the Pearson correlation coefficient was used to check for a correlation. The fitted models were statistically significant (p < 0.001) but weak, with the temperature measurements explaining roughly 8% to 11% of the variation in the log10 of monthly violent crimes.

Data - Part 1 of 2

The weather data was collected from the Australian Government’s Bureau of Meteorology website.

The link to the data file used is: http://www.bom.gov.au/clim_data/cdio/tables/text/IDCJCM0039_009225.csv

The webpage the above file was downloaded from is: http://www.bom.gov.au/climate/averages/tables/cw_009225_All.shtml

The variables of interest from this file were:

Mean maximum temperature (Degrees C) for years 1994 to 2020

Mean number of days >= 30C

Mean number of days >= 35C

Mean number of days >= 40C

along with the month of the year for each observation.

The first variable is a record of the average daily maximum air temperature. A value is calculated for each month of the year as well as an annual value.

The other three variables record the average number of days in each month where the maximum daily air temperature was 30, 35 or 40 degrees C or higher.

The data relates to the “Perth Metro” site located at 31.92 degrees S latitude and 115.87 degrees E longitude, and covers recordings taken between 1994 and 2020.

The crime data was collected from the Western Australian Police department’s website.

The link to the data file used is: https://www.police.wa.gov.au/Crime/~/media/5BBD428073EC4651B0C4693CD21E532C.ashx

The webpage the above file was downloaded from is: https://www.police.wa.gov.au/Crime/CrimeStatistics#/

From the “Western Australia” sheet of the above file, the number of offences in each category was extracted for each month from January 2007 to June 2020. As only violent crimes were of interest, only the following variables were included:

Murder

Attempted / Conspiracy to murder

Manslaughter

Recent sexual assault

Historical sexual assault

Serious assault (family)

Common assault (family)

Serious assault (non-family)

Common assault (non-family)

Assault police officer

Threatening behaviour (family)

Possess weapon to cause fear (family)

Threatening behaviour (non-family)

Possess weapon to cause fear (non-family)

# Weather data
weather <- read.csv("IDCJCM0039_009225.csv", skip=11)
head(weather, 4)

# Extract the mean maximum temperatures by month of the year:
weatherByMonth <- weather[c(1,8:10),1:13]
attributes(weatherByMonth)

## $names
##  [1] "Statistic.Element" "January"           "February"         
##  [4] "March"             "April"             "May"              
##  [7] "June"              "July"              "August"           
## [10] "September"         "October"           "November"         
## [13] "December"         
## 
## $row.names
## [1]  1  8  9 10
## 
## $class
## [1] "data.frame"

# Crime data
crime <- read_xlsx("WA Police Force Crime Timeseries.xlsx", sheet="Western Australia", skip=7)

## New names:
## * `Sexual Assault` -> `Sexual Assault...8`
## * `Non-Assaultive Sexual Offences` -> `Non-Assaultive Sexual Offences...9`
## * `Sexual Assault` -> `Sexual Assault...11`
## * `Non-Assaultive Sexual Offences` -> `Non-Assaultive Sexual Offences...12`
## * `` -> ...32
## * ...

head(crime, 4)

# Extract the data for all the "violent" crimes
violentCrime <- crime[,c(1, 3:5, 8, 11, 14:15, 17:19, 21:22, 24:25)]
attributes(violentCrime)

## $names
##  [1] "Month and Year"                           
##  [2] "Murder"                                   
##  [3] "Attempted / Conspiracy to Murder"         
##  [4] "Manslaughter"                             
##  [5] "Sexual Assault...8"                       
##  [6] "Sexual Assault...11"                      
##  [7] "Serious Assault (Family)"                 
##  [8] "Common Assault (Family)"                  
##  [9] "Serious Assault (Non-Family)"             
## [10] "Common Assault (Non-Family)"              
## [11] "Assault Police Officer"                   
## [12] "Threatening Behaviour (Family)"           
## [13] "Possess Weapon to Cause Fear (Family)"    
## [14] "Threatening Behaviour (Non-Family)"       
## [15] "Possess Weapon to Cause Fear (Non-Family)"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Tidy & Manipulate Data I

The weather data set is untidy because the months of the year are used as variable names, rather than as values of a single variable, eg “Month”.

After tidying, the data set contains three variables: “Statistic.Element”, “Month” and “Reading”.

# Weather data
# Tidy the data, making three variables: Type of Measurement, Month and The Recorded Measurement
weatherByMonth <- weatherByMonth %>%
  gather(`January`, `February`, `March`, `April`, `May`, `June`,
         `July`, `August`, `September`, `October`, `November`, `December`, key="Month", value="Reading")
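In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. A minimal sketch of the same reshaping on a hypothetical two-month extract follows; the February values are made up for illustration. Note that `pivot_longer()` keeps the rows grouped by the original row rather than by month, so the row order differs from the `gather()` result.

```r
library(tidyr)

# Hypothetical wide extract with two statistics and two months
wide <- data.frame(
  Statistic.Element = c("Mean max temp", "Mean No days >=30C"),
  January  = c("31.2", "17.7"),
  February = c("31.7", "17.3"),
  stringsAsFactors = FALSE
)

# Every column except the statistic label becomes a Month/Reading pair
long <- pivot_longer(wide, cols = -Statistic.Element,
                     names_to = "Month", values_to = "Reading")
print(long)
```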

Understand

The temperature records in the weather data set have been imported as character variables and need to be converted to numeric.

The data in the month variable have been imported as characters, but need to be converted to an ordered factor variable.
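Base R ships the constant `month.name`, which holds the twelve English month names in calendar order, so the factor levels do not have to be typed out by hand. A small sketch on hypothetical values:

```r
# Hypothetical month labels in arbitrary order
months_chr <- c("March", "January", "December", "January")

# month.name is a built-in character vector: "January", ..., "December"
months_fct <- factor(months_chr, levels = month.name, ordered = TRUE)

# Ordered factors sort in calendar order and support comparisons
sorted_months <- as.character(sort(months_fct))
march_after_january <- months_fct[1] > months_fct[2]
print(sorted_months)
print(march_after_january)
```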

The name of the “Statistic.Element” variable was changed to “Measurement” and the descriptions within this variable were shortened for readability purposes.

Because of the way the original crime data was formatted, two variables had the same name (“Sexual Assault”). These were automatically renamed during the import process, and were then given meaningful names that reflect the data they represent.

# Weather data
str(weatherByMonth)

## 'data.frame':    48 obs. of  3 variables:
##  $ Statistic.Element: chr  "Mean maximum temperature (Degrees C) for years 1994 to 2020 " "Mean number of days >= 30 Degrees C for years 1994 to 2020 " "Mean number of days >= 35 Degrees C for years 1994 to 2020 " "Mean number of days >= 40 Degrees C for years 1994 to 2020 " ...
##  $ Month            : chr  "January" "January" "January" "January" ...
##  $ Reading          : chr  "31.2" "17.7" "7.5" "1.2" ...

# Convert data in "Reading" variable from character to numeric
weatherByMonth$Reading <- as.numeric(weatherByMonth$Reading)
# Convert the "Month" variable to an ordered and leveled factor variable
weatherByMonth$Month <- factor(weatherByMonth$Month,
                               levels=c("January", "February", "March", "April", "May",
                                        "June", "July", "August", "September", "October",
                                        "November", "December"), ordered=TRUE)

# Rename "Statistic.Element" variable and shorten the text in the observations
names(weatherByMonth)[names(weatherByMonth) == "Statistic.Element"] <- "Measurement"
weatherByMonth$Measurement[weatherByMonth$Measurement ==
                             "Mean maximum temperature (Degrees C) for years 1994 to 2020 "] <- "Mean max temp"
weatherByMonth$Measurement[weatherByMonth$Measurement ==
                             "Mean number of days >= 30 Degrees C for years 1994 to 2020 "] <- "Mean No days >=30C"
weatherByMonth$Measurement[weatherByMonth$Measurement ==
                             "Mean number of days >= 35 Degrees C for years 1994 to 2020 "] <- "Mean No days >=35C"
weatherByMonth$Measurement[weatherByMonth$Measurement ==
                             "Mean number of days >= 40 Degrees C for years 1994 to 2020 "] <- "Mean No days >=40C"

# Crime data
str(violentCrime)

## tibble [162 x 15] (S3: tbl_df/tbl/data.frame)
##  $ Month and Year                           : POSIXct[1:162], format: "2007-01-01" "2007-02-01" ...
##  $ Murder                                   : num [1:162] 4 0 3 2 4 1 1 4 0 1 ...
##  $ Attempted / Conspiracy to Murder         : num [1:162] 2 0 1 0 3 1 0 0 1 0 ...
##  $ Manslaughter                             : num [1:162] 0 0 0 0 1 2 1 0 1 1 ...
##  $ Sexual Assault...8                       : num [1:162] 138 131 146 165 152 134 123 114 122 132 ...
##  $ Sexual Assault...11                      : num [1:162] 134 132 109 215 129 168 191 162 106 190 ...
##  $ Serious Assault (Family)                 : num [1:162] 224 217 217 201 185 169 170 180 186 188 ...
##  $ Common Assault (Family)                  : num [1:162] 639 522 514 417 456 406 400 465 509 447 ...
##  $ Serious Assault (Non-Family)             : num [1:162] 330 367 405 337 329 298 337 352 350 385 ...
##  $ Common Assault (Non-Family)              : num [1:162] 662 680 781 639 617 634 593 614 651 696 ...
##  $ Assault Police Officer                   : num [1:162] 107 104 87 95 92 97 85 117 105 74 ...
##  $ Threatening Behaviour (Family)           : num [1:162] 72 69 62 47 67 58 50 68 74 57 ...
##  $ Possess Weapon to Cause Fear (Family)    : num [1:162] 22 17 18 12 18 14 15 16 18 14 ...
##  $ Threatening Behaviour (Non-Family)       : num [1:162] 160 161 166 132 164 116 130 143 176 132 ...
##  $ Possess Weapon to Cause Fear (Non-Family): num [1:162] 87 103 94 71 76 70 65 78 93 92 ...

# Rename the "Sexual Assault..." variables
names(violentCrime)[names(violentCrime) == "Sexual Assault...8"] <- "Recent Sexual Assault"
names(violentCrime)[names(violentCrime) == "Sexual Assault...11"] <- "Historical Sexual Assault"

Scan I

Both the weather and crime data sets were scanned for missing data and special values, ie values that are infinite or NaN.

The weather data was scanned for negative numbers. A negative mean maximum temperature recorded in Perth would be considered an obvious error given the climate. Similarly, a negative number of days above any set maximum temperature would indicate an obvious data error.

The crime data was scanned for negative numbers of offences as this would indicate an obvious error. The numbers of offences were also scanned for any non-integers, as it is not possible to have a partial offence.

# Weather data
# Scan for missing data and special values, ie infinite or NaN
# Result = 0 indicates no missing data or special values in each case
sum(is.na(weatherByMonth))

## [1] 0

sum(sapply(weatherByMonth, is.infinite))

## [1] 0

sum(sapply(weatherByMonth, is.nan))

## [1] 0

# Scan for obvious errors or inconsistencies
# Result = 0 indicates no negative numbers, ie no errors
sum(weatherByMonth$Reading<0)

## [1] 0

# Crime data
# Scan for missing data and special values, ie infinite or NaN
# Result = 0 indicates no missing data or special values in each case
sum(is.na(violentCrime))

## [1] 0

sum(sapply(violentCrime, is.infinite))

## [1] 0

sum(sapply(violentCrime, is.nan))

## [1] 0

# Scan for obvious errors or inconsistencies
# Result = 0 indicates no negative numbers, ie no errors
sum(violentCrime<0)

## [1] 0

# All values, other than the dates, should be integers, anything else would be an error
listOfWhole <- sapply(violentCrime, FUN = is.wholenumber)
length(listOfWhole)

## [1] 2430

# Result = NA indicates the table has no FALSE entry, ie only whole numbers were found
table(listOfWhole)["FALSE"]

## <NA> 
##   NA

# Should return the same value as length(listOfWhole) if all numbers are integers
table(listOfWhole)["TRUE"]

## TRUE 
## 2430
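Indexing `table()` by `"FALSE"` returns `NA` when that level is absent, which is easy to misread. One alternative is to count the failures directly. The sketch below uses a hypothetical data frame with one deliberately fractional entry, not the crime data.

```r
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol

# Hypothetical offence counts with one deliberately fractional entry
toy <- data.frame(Murder = c(4, 0, 3), Manslaughter = c(0, 1.5, 0))

# sapply() over the columns gives a logical matrix; sum(!...) counts failures
whole <- sapply(toy, is.wholenumber)
n_not_whole <- sum(!whole)

# which(..., arr.ind = TRUE) locates the offending row and column
bad_cells <- which(!whole, arr.ind = TRUE)
print(n_not_whole)
print(bad_cells)
```

A count of zero then reads unambiguously as "no non-integer values found".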

Tidy & Manipulate Data II

A new variable was created that took the sum of all the violent crimes, as described earlier in this report, for each month.

New date variables were also created to store the month and year separately.

The new month variable was converted to an ordered and leveled factor variable.

A new data set was created to hold the year, month and the total number of violent crimes recorded in that month.
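One compact way to compute such a monthly total is `rowSums()` over the offence columns. A minimal base R sketch on a hypothetical two-offence extract:

```r
# Hypothetical extract: a date column plus two offence counts
toy <- data.frame(
  `Month and Year` = as.Date(c("2007-01-01", "2007-02-01")),
  Murder = c(4, 0),
  `Common Assault (Family)` = c(639, 522),
  check.names = FALSE
)

# Sum every column except the date, row by row
offence_cols <- setdiff(names(toy), "Month and Year")
toy$`Total Violent Crimes` <- rowSums(toy[offence_cols])
print(toy$`Total Violent Crimes`)
```

Selecting the columns by name keeps the total correct if further offence categories are added later.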

# Crime data
# Create a new variable being the sum of all violent crimes for each observation
violentCrime <- mutate(violentCrime, "Total Violent Crimes"=`Murder` + `Attempted / Conspiracy to Murder` +
                         `Manslaughter` + `Recent Sexual Assault` + `Historical Sexual Assault` +
                         `Serious Assault (Family)` + `Common Assault (Family)` +
                         `Serious Assault (Non-Family)` + `Common Assault (Non-Family)` +
                         `Assault Police Officer` + `Threatening Behaviour (Family)` +
                         `Possess Weapon to Cause Fear (Family)` + `Threatening Behaviour (Non-Family)` +
                         `Possess Weapon to Cause Fear (Non-Family)`)

# Create new variables to store month and year separately
violentCrime <- mutate(violentCrime, "Year"=year(`Month and Year`))
violentCrime <- mutate(violentCrime, "Month"=month(`Month and Year`, label=TRUE, abbr=FALSE))

# Create a new data frame with just the variables: Year, month and total number of violent crimes
totalViolentCrimes <- select(violentCrime, Year, Month, `Total Violent Crimes`)
str(totalViolentCrimes)

## tibble [162 x 3] (S3: tbl_df/tbl/data.frame)
##  $ Year                : num [1:162] 2007 2007 2007 2007 2007 ...
##  $ Month               : Ord.factor w/ 12 levels "January"<"February"<..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total Violent Crimes: num [1:162] 2581 2503 2603 2333 2293 ...

totalViolentCrimes$Month <- factor(totalViolentCrimes$Month,
                               levels=c("January", "February", "March", "April", "May",
                                        "June", "July", "August", "September", "October",
                                        "November", "December"), ordered=TRUE)

Scan II - Part 1 of 2 - Weather Data

A box plot was produced for each of the four weather measurements. The plots suggest there are no numerical outliers in the weather data set, that is, no values lie outside the 1.5 * IQR fences.

# Scan for outliers
# Mean maximum temperatures
meanMaxTemps <- weatherByMonth %>% filter(Measurement=="Mean max temp")
# Mean number of days over 30C
meanDaysOver30 <- weatherByMonth %>% filter(Measurement=="Mean No days >=30C")
# Mean number of days over 35C
meanDaysOver35 <- weatherByMonth %>% filter(Measurement=="Mean No days >=35C")
# Mean number of days over 40C
meanDaysOver40 <- weatherByMonth %>% filter(Measurement=="Mean No days >=40C")

# Plot charts
meanMaxTemps$Reading %>% boxplot(main="Mean Maximum Temperatures", ylab="Degrees C")

meanDaysOver30$Reading %>% boxplot(main="Mean Number of Days >= 30C", ylab="Days")

meanDaysOver35$Reading %>% boxplot(main="Mean Number of Days >= 35C", ylab="Days")

meanDaysOver40$Reading %>% boxplot(main="Mean Number of Days >= 40C", ylab="Days")

Scan II - Part 2 of 2 - Crime Data

A box plot of the crime data set was produced. This plot suggests there are no numerical outliers in the crime data set. That is, there are no values outside the 1.5 * IQR fences.

# Scan for outliers
totalViolentCrimes$`Total Violent Crimes` %>% boxplot(main="Total Violent Crimes", ylab="Number of Violent Crimes")
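The visual check can be backed by a numeric one: `boxplot.stats()` returns the points that lie beyond the 1.5 * IQR whiskers in its `out` element. A sketch on hypothetical monthly totals, where the last value is deliberately extreme:

```r
# Hypothetical monthly totals; the last value is deliberately extreme
totals <- c(2581, 2503, 2603, 2333, 2293, 2400, 2450, 5000)

# boxplot.stats() uses coef = 1.5 by default, matching boxplot()
outliers <- boxplot.stats(totals)$out

# Without the extreme value, nothing falls outside the whiskers
clean_outliers <- boxplot.stats(totals[totals < 5000])$out
print(outliers)
print(length(clean_outliers))
```

An empty `out` vector corresponds to a box plot with no points drawn beyond the whiskers.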

Data - Part 2 of 2

The weather and crime data sets were merged using the left_join function. Observations are matched by the month variable that is common to both, which links the total number of violent crimes in each month to the four temperature measurements, ie mean maximum temperature, mean number of days >=30 C, mean number of days >=35 C and mean number of days >=40 C.

# Merge the weather and crime data sets.
# Data is merged based on the month of the year
MergedDataMMT <- meanMaxTemps %>% left_join(totalViolentCrimes, by="Month")
MergedData30 <- meanDaysOver30 %>% left_join(totalViolentCrimes, by="Month")
MergedData35 <- meanDaysOver35 %>% left_join(totalViolentCrimes, by="Month")
MergedData40 <- meanDaysOver40 %>% left_join(totalViolentCrimes, by="Month")
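Because the weather table holds one row per month and the crime table holds one row per month per year, the join repeats each monthly reading for every year. A minimal sketch with hypothetical values:

```r
library(dplyr)

# Hypothetical monthly readings (one row per month)
temps <- data.frame(Month = c("January", "February"),
                    Reading = c(31.2, 31.7))

# Hypothetical crime totals (one row per month per year)
crimes <- data.frame(Year  = c(2007, 2007, 2008, 2008),
                     Month = c("January", "February", "January", "February"),
                     Total = c(2581, 2503, 2600, 2550))

# Each temperature row is matched to every crime row for the same month
merged <- left_join(temps, crimes, by = "Month")
print(merged)
```

The result is ordered by month rather than by date, which is worth keeping in mind when vectors derived from one merged table are reused with another.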

Transform

The histogram of Total Violent Crimes is right skewed. The most effective transformation to improve the normality of the data was a log10 transformation.

# Crimes
MergedDataMMT$`Total Violent Crimes` %>% hist(main="Total Violent Crimes")

# Data is right skewed, apply log_10 transformation to improve normality of data
log_crimes <- log10(MergedDataMMT$`Total Violent Crimes`)
log_crimes %>% hist(main="Log10 of Total Violent Crimes")
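One way to quantify the effect of a transformation is to compare the sample skewness before and after. The sketch below uses simulated right-skewed data (an assumption, not the crime data) and a hand-written moment-based `skewness()` helper, which is hypothetical and not part of the report.

```r
# Moment-based sample skewness (hypothetical helper, not from the report)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
# Simulated right-skewed positive data standing in for the crime totals
x <- rlnorm(500, meanlog = 8, sdlog = 0.5)

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
print(c(raw = skew_raw, log10 = skew_log))
```

A skewness closer to zero after the transformation supports the visual impression from the histograms.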

Test for a linear relationship between violent crimes and mean maximum daily temperature

An attempt was made to fit a linear regression model to the data. The null hypothesis was that there is no linear relationship between the number of violent crimes in a month and the mean maximum temperature for the same month. The alternative hypothesis was that there is a relationship between the two variables.

# Testing for a linear relationship
plot(MergedDataMMT$Reading, log_crimes, main="Violent Crimes by Mean Maximum Temperature",
     xlab="Monthly Mean Maximum Temperature",
     ylab="log_10 of the Number of Violent Crimes")

# Since the data exhibits a positive linear trend, fit a linear regression model:
crime_model <- lm(log_crimes ~ MergedDataMMT$Reading)
crime_model %>% summary()

## 
## Call:
## lm(formula = log_crimes ~ MergedDataMMT$Reading)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14516 -0.06841 -0.01455  0.08940  0.14630 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.308780   0.034513  95.870  < 2e-16 ***
## MergedDataMMT$Reading 0.005979   0.001364   4.385  2.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08149 on 160 degrees of freedom
## Multiple R-squared:  0.1073, Adjusted R-squared:  0.1017 
## F-statistic: 19.22 on 1 and 160 DF,  p-value: 2.098e-05

# R^2 = 0.1073
# and F statistic = 19.22 on 1 and 160 DF
# Calculating p-value
# H0: The data does not fit a linear regression model
# Ha: The data fits a linear regression model
pf(q=19.22, 1, 160, lower.tail=FALSE)

## [1] 2.102242e-05

# Since p<0.001 we reject H0
# Get coefficients
crime_model %>% summary() %>% coef()

##                          Estimate Std. Error  t value      Pr(>|t|)
## (Intercept)           3.308780197 0.03451318 95.87004 2.908146e-143
## MergedDataMMT$Reading 0.005979091 0.00136367  4.38456  2.097959e-05

# Intercept (alpha) = 3.30878
# t statistic is 95.87 at p<0.001
# Slope (beta) = 0.00598
# t statistic is 4.38 at p<0.001
# The intercept of the linear regression model represents the log10 of the number of violent crimes when the mean maximum temperature is 0C. Given the climate in Perth, this is very unlikely to ever occur. Nevertheless, testing the statistical significance of the intercept:
# H0: alpha = 0
# Ha: alpha != 0
crime_model %>% confint()

##                             2.5 %      97.5 %
## (Intercept)           3.240620059 3.376940335
## MergedDataMMT$Reading 0.003285978 0.008672204

# The 95% CI for alpha is found to be [3.2406, 3.3769]. H0: alpha=0 is not captured by this interval, hence H0 is rejected.
# The slope of the linear regression model represents the average increase in the log10 of violent crimes per 1 degree C increase in mean maximum temperature
# The slope, beta, was found to be 0.005979
# Testing the statistical significance of the slope:
# H0: beta = 0
# Ha: beta !=0
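# The test statistic is the t value of the slope (4.38), not the slope estimate itself
t_beta <- coef(summary(crime_model))[2, "t value"]
2*pt(q=t_beta, df=162-2, lower.tail=FALSE)

## [1] 2.097959e-05

# p<0.001 so we reject H0, indicating a statistically significant positive relationship between the log10 of the number of violent crimes per month
# and the mean maximum temperature. With R^2 = 0.1073 the relationship is weak.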
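As a cross-check on how a slope test is built, the t statistic is the slope estimate divided by its standard error, and the two-sided p-value is taken from that statistic on n - 2 degrees of freedom. A sketch on R's built-in `cars` data set, used here only for illustration:

```r
# Simple linear regression on the built-in cars data (illustration only)
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))

# t statistic = estimate / standard error
t_slope <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# Two-sided p-value on the residual degrees of freedom (n - 2)
p_manual <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)
p_table  <- coefs["speed", "Pr(>|t|)"]
print(c(manual = p_manual, table = p_table))
```

The manual value should agree with the `Pr(>|t|)` column reported by `summary()`.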

Test for a correlation between violent crimes and mean maximum daily temperature

The Pearson correlation coefficient was used to determine if there was a correlation between the number of violent crimes and the mean maximum daily temperature. The null hypothesis was that there was no correlation between them and the alternative hypothesis was that there is a relationship between the two variables.

# Check for correlation using the Pearson correlation coefficient
cor(MergedDataMMT$Reading, log_crimes)

## [1] 0.3275122

# r = 0.3275 indicating a positive correlation
MergedDataMMT <- mutate(MergedDataMMT, "Log_V_Crimes"=log_crimes)
bivariate <- as.matrix(select(MergedDataMMT, Reading, Log_V_Crimes))
rcorr(bivariate, type="pearson")

##              Reading Log_V_Crimes
## Reading         1.00         0.33
## Log_V_Crimes    0.33         1.00
## 
## n= 162 
## 
## 
## P
##              Reading Log_V_Crimes
## Reading               0          
## Log_V_Crimes  0

# So r=0.33 and p<0.001
# H0: r = 0
# Ha: r != 0
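# Test statistic for r: t = r*sqrt((n-2)/(1-r^2)) on n-2 degrees of freedom
r <- cor(MergedDataMMT$Reading, log_crimes)
2*pt(q=abs(r)*sqrt((162-2)/(1-r^2)), df=162-2, lower.tail=FALSE)

## [1] 2.097959e-05

# The p-value for r is <0.001 hence we reject H0: there is a weak positive correlation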
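Base R's `cor.test()` performs a Pearson correlation test directly. The sketch below, again on the built-in `cars` data rather than the report's data, also rebuilds the statistic it is based on, t = r * sqrt((n - 2) / (1 - r^2)).

```r
# Pearson correlation test on the built-in cars data (illustration only)
res <- cor.test(cars$speed, cars$dist, method = "pearson")

# Rebuild the test statistic and p-value by hand
r <- cor(cars$speed, cars$dist)
n <- nrow(cars)
t_r <- r * sqrt((n - 2) / (1 - r^2))
p_manual <- 2 * pt(abs(t_r), df = n - 2, lower.tail = FALSE)
print(c(r = r, t = t_r, p = p_manual))
```

In a simple regression with one predictor, this t statistic equals the t value of the slope, so the two tests give the same p-value.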

Test for a linear relationship between violent crimes and average number of days >=30 C

In this scenario, the null hypothesis was that there is no linear relationship between the number of violent crimes in a month and the average number of days with maximum temperatures >=30 C. The alternative hypothesis was that there is a relationship between the two variables.

# Testing for a linear relationship
plot(MergedData30$Reading, log_crimes, main="Violent Crimes by Average Days >=30C",
     xlab="Average Number of Days >=30C",
     ylab="log_10 of the Number of Violent Crimes")

# Since the data exhibits a positive linear trend, fit a linear regression model:
crime_model <- lm(log_crimes ~ MergedData30$Reading)
crime_model %>% summary()

## 
## Call:
## lm(formula = log_crimes ~ MergedData30$Reading)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13120 -0.07245 -0.01680  0.08516  0.15299 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.4319024  0.0088155 389.303  < 2e-16 ***
## MergedData30$Reading 0.0039513  0.0009325   4.238 3.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08178 on 160 degrees of freedom
## Multiple R-squared:  0.1009, Adjusted R-squared:  0.09528 
## F-statistic: 17.96 on 1 and 160 DF,  p-value: 3.806e-05

# R^2 = 0.1009
# and F statistic = 17.96 on 1 and 160 DF
# Calculating p-value
# H0: The data does not fit a linear regression model
# Ha: The data fits a linear regression model
pf(q=17.96, 1, 160, lower.tail=FALSE)

## [1] 3.799594e-05

# Since p<0.001 we reject H0
# Get coefficients
crime_model %>% summary() %>% coef()

##                         Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)          3.431902403 0.0088155118 389.302684 4.429787e-240
## MergedData30$Reading 0.003951328 0.0009324635   4.237515  3.805824e-05

# Intercept (alpha) = 3.4319
# t statistic is 389.303 at p<0.001
# Slope (beta) = 0.00395
# t statistic is 4.24 at p<0.001
# The intercept of the linear regression model represents the log10 of the number of violent crimes when the number of days >=30C is zero. Testing the statistical significance of the intercept:
# H0: alpha = 0
# Ha: alpha != 0
crime_model %>% confint()

##                            2.5 %      97.5 %
## (Intercept)          3.414492635 3.449312171
## MergedData30$Reading 0.002109804 0.005792851

# The 95% CI for alpha is found to be [3.4145, 3.4493]. H0: alpha=0 is not captured by this interval, hence H0 is rejected
# The slope of the linear regression model represents the average increase in the log10 of violent crimes for each additional day above 30C during the month
# The slope, beta, was found to be 0.003951
# Testing the statistical significance of the slope:
# H0: beta = 0
# Ha: beta !=0
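# The test statistic is the t value of the slope (4.24), not the slope estimate itself
t_beta <- coef(summary(crime_model))[2, "t value"]
2*pt(q=t_beta, df=162-2, lower.tail=FALSE)

## [1] 3.805824e-05

# p<0.001 so we reject H0, indicating a statistically significant positive relationship between the log10 of the number of violent crimes per month and the number of days >=30C. With R^2 = 0.1009 the relationship is weak.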

Test for a correlation between violent crimes and average number of days >=30 C

In this scenario, the null hypothesis was that there is no correlation between the number of violent crimes in a month and the average number of days with maximum temperatures >=30 C. The alternative hypothesis was that there is a correlation between the two variables.

# Check for correlation using the Pearson correlation coefficient
cor(MergedData30$Reading, log_crimes)

## [1] 0.3176539

# r = 0.3176 indicating a positive correlation
MergedData30 <- mutate(MergedData30, "Log_V_Crimes"=log_crimes)
bivariate <- as.matrix(select(MergedData30, Reading, Log_V_Crimes))
rcorr(bivariate, type="pearson")

##              Reading Log_V_Crimes
## Reading         1.00         0.32
## Log_V_Crimes    0.32         1.00
## 
## n= 162 
## 
## 
## P
##              Reading Log_V_Crimes
## Reading               0          
## Log_V_Crimes  0

# So r=0.32 and p<0.001
# H0: r = 0
# Ha: r != 0
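# Test statistic for r: t = r*sqrt((n-2)/(1-r^2)) on n-2 degrees of freedom
r <- cor(MergedData30$Reading, log_crimes)
2*pt(q=abs(r)*sqrt((162-2)/(1-r^2)), df=162-2, lower.tail=FALSE)

## [1] 3.805824e-05

# The p-value for r is <0.001 hence we reject H0: there is a weak positive correlation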

Test for a linear relationship between violent crimes and average number of days >=35 C

In this scenario, the null hypothesis was that there is no linear relationship between the number of violent crimes in a month and the average number of days with maximum temperatures >=35 C. The alternative hypothesis was that there is a relationship between the two variables.

# Testing for a linear relationship
plot(MergedData35$Reading, log_crimes, main="Violent Crimes by Average Days >=35C",
     xlab="Average Number of Days >=35C",
     ylab="log_10 of the Number of Violent Crimes")

# Since the data exhibits a positive linear trend, fit a linear regression model:
crime_model <- lm(log_crimes ~ MergedData35$Reading)
crime_model %>% summary()

## 
## Call:
## lm(formula = log_crimes ~ MergedData35$Reading)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12870 -0.07389 -0.01908  0.08331  0.15966 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.436841   0.008239 417.118  < 2e-16 ***
## MergedData35$Reading 0.009410   0.002334   4.031 8.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08217 on 160 degrees of freedom
## Multiple R-squared:  0.09221,    Adjusted R-squared:  0.08654 
## F-statistic: 16.25 on 1 and 160 DF,  p-value: 8.557e-05

# R^2 = 0.0922
# and F statistic = 16.25 on 1 and 160 DF
# Calculating p-value
# H0: The data does not fit a linear regression model
# Ha: The data fits a linear regression model
pf(q=16.25, 1, 160, lower.tail=FALSE)

## [1] 8.567076e-05

# Since p<0.001 we reject H0
# Get coefficients
crime_model %>% summary() %>% coef()

##                         Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)          3.436840970 0.008239494 417.117965 7.171966e-245
## MergedData35$Reading 0.009409835 0.002334120   4.031427  8.557224e-05

# Intercept (alpha) = 3.4368
# t statistic is 417.118 at p<0.001
# Slope (beta) = 0.0094
# t statistic is 4.03 at p<0.001
# The intercept of the linear regression model represents the log10 of the number of violent crimes when there are zero days >=35C. Testing the statistical significance of the intercept:
# H0: alpha = 0
# Ha: alpha != 0
crime_model %>% confint()

##                            2.5 %     97.5 %
## (Intercept)          3.420568781 3.45311316
## MergedData35$Reading 0.004800177 0.01401949

# The 95% CI for alpha is found to be [3.4206, 3.4531]. H0: alpha=0 is not captured by this interval, hence H0 is rejected.
# The slope of the linear regression model represents the average increase in the log10 of violent crimes for each additional day above 35C during the month
# The slope, beta, was found to be 0.0094
# Testing the statistical significance of the slope:
# H0: beta = 0
# Ha: beta !=0
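# The test statistic is the t value of the slope (4.03), not the slope estimate itself
t_beta <- coef(summary(crime_model))[2, "t value"]
2*pt(q=t_beta, df=162-2, lower.tail=FALSE)

## [1] 8.557224e-05

# p<0.001 so we reject H0, indicating a statistically significant positive relationship between the log10 of the number of violent crimes per month and the number of days >=35C. With R^2 = 0.0922 the relationship is weak.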

Test for a correlation between violent crimes and average number of days >=35 C

In this scenario, the null hypothesis was that there is no correlation between the number of violent crimes in a month and the average number of days with maximum temperatures >=35 C. The alternative hypothesis was that there is a correlation between the two variables.

# Check for correlation using the Pearson correlation coefficient
cor(MergedData35$Reading, log_crimes)

## [1] 0.3036626

# r = 0.3037 indicating a positive correlation
MergedData35 <- mutate(MergedData35, "Log_V_Crimes"=log_crimes)
bivariate <- as.matrix(select(MergedData35, Reading, Log_V_Crimes))
rcorr(bivariate, type="pearson")

##              Reading Log_V_Crimes
## Reading          1.0          0.3
## Log_V_Crimes     0.3          1.0
## 
## n= 162 
## 
## 
## P
##              Reading Log_V_Crimes
## Reading               0          
## Log_V_Crimes  0

# So r=0.3 and p<0.001
# H0: r = 0
# Ha: r != 0
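# Test statistic for r: t = r*sqrt(n-2)/sqrt(1-r^2)
r <- cor(MergedData35$Reading, log_crimes)
t_r <- r*sqrt(162-2)/sqrt(1-r^2)
2*pt(q=abs(t_r), df=162-2, lower.tail=FALSE)

## [1] 8.557224e-05

# The p-value for r is <0.001 hence we reject H0: there is a statistically significant positive correlation, consistent with the rcorr output above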
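
With a single predictor, the t test for the Pearson correlation and the t test for the regression slope are the same test, so their p-values should agree. The sketch below checks this on simulated data; `x`, `y` and the other object names are hypothetical and are not taken from the analysis above.

```r
# Simulated data for illustration only (not the study data)
set.seed(2)
x <- rpois(162, lambda = 2)
y <- 3.4 + 0.01 * x + rnorm(162, sd = 0.08)
n <- length(x)

# Test statistic for H0: r = 0
r <- cor(x, y)
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_r <- 2 * pt(abs(t_r), df = n - 2, lower.tail = FALSE)

# The same test via cor.test() and via the slope of a simple linear regression
p_cor_test <- cor.test(x, y, method = "pearson")$p.value
p_slope <- coef(summary(lm(y ~ x)))["x", "Pr(>|t|)"]

c(manual = p_r, cor_test = as.numeric(p_cor_test), slope = as.numeric(p_slope))
```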

Test for a linear relationship between violent crimes and average number of days >=40 C

In this scenario, the null hypothesis was that there is no linear relationship between the number of violent crimes in a month and the average number of days with maximum temperatures >=40 C. The alternative hypothesis was that there is a relationship between the two variables.

# Testing for a linear relationship
plot(MergedData40$Reading, log_crimes, main="Violent Crimes by Average Days >=40C",
     xlab="Average Number of Days >=40C",
     ylab="log_10 of the Number of Violent Crimes")

# Since the data exhibits a positive linear trend, fit a linear regression model:
crime_model <- lm(log_crimes ~ MergedData40$Reading)
crime_model %>% summary()

## 
## Call:
## lm(formula = log_crimes ~ MergedData40$Reading)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13220 -0.07219 -0.01678  0.08351  0.16937 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.44033    0.00792  434.38  < 2e-16 ***
## MergedData40$Reading  0.05611    0.01485    3.78 0.000221 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08264 on 160 degrees of freedom
## Multiple R-squared:  0.08197,    Adjusted R-squared:  0.07624 
## F-statistic: 14.29 on 1 and 160 DF,  p-value: 0.0002212

# R^2 = 0.08197
# and F statistic = 14.29 on 1 and 160 DF
# Calculating p-value
# H0: The data does not fit a linear regression model
# Ha: The data does fit a linear regression model
pf(q=14.29, 1, 160, lower.tail=FALSE)

## [1] 0.0002208191

# Since p<0.001 we reject H0
# Get coefficients
crime_model %>% summary() %>% coef()

##                        Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)          3.44033301 0.007920195 434.374802 1.099493e-247
## MergedData40$Reading 0.05611321 0.014845533   3.779804  2.211508e-04

# Intercept (alpha) = 3.4403
# t statistic is 434.38 at p<0.001
# Slope (beta) = 0.05611
# t statistic is 3.78 at p<0.001
# The intercept of the linear regression model represents the log10 of the number of violent crimes in a month with zero days >=40C. Testing the statistical significance of the intercept:
# H0: alpha = 0
# Ha: alpha != 0
crime_model %>% confint()

##                           2.5 %     97.5 %
## (Intercept)          3.42469141 3.45597462
## MergedData40$Reading 0.02679474 0.08543168

# The 95% CI for alpha is found to be [3.4247, 3.4560]. H0: alpha=0 is not captured by this interval, hence H0 is rejected.
# The slope of the linear regression model represents the average increase in the log10 of violent crimes for each additional day >=40C during the month
# The slope, beta, was found to be 0.05611
# Testing the statistical significance of the slope:
# H0: beta = 0
# Ha: beta !=0
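# The t statistic for the slope (estimate / standard error), not the slope itself, is compared against the t distribution
t_beta <- coef(summary(crime_model))[2, "t value"]
2*pt(q=abs(t_beta), df=162-2, lower.tail=FALSE)

## [1] 0.0002211508

# p<0.001, so we reject H0: there is statistically significant evidence of a positive linear relationship between the log10 of violent crimes per month and the number of days >=40C. This agrees with the 95% CI for beta, [0.0268, 0.0854], which does not capture zero.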
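
Because the response is on a log10 scale, the slope can be back-transformed into a multiplicative change in the number of violent crimes per additional day >=40C. This is a sketch only: it reuses the slope and confidence limits printed above, the object names are hypothetical, and it assumes the linear model on the log10 scale is adequate.

```r
# Slope and 95% CI copied from the summary() and confint() output above
beta <- 0.05611321
beta_ci <- c(lower = 0.02679474, upper = 0.08543168)

# An increase of beta in log10(y) multiplies y by 10^beta
ratio <- 10^beta
pct_change <- (ratio - 1) * 100

# Back-transform the interval limits the same way
pct_change_ci <- (10^beta_ci - 1) * 100

round(c(estimate = pct_change, pct_change_ci), 1)
```

On these figures the fitted model corresponds to roughly a 14% higher monthly count for each additional day >=40C, which describes an association and not a causal effect.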

Test for a correlation between violent crimes and average number of days >=40 C

In this scenario, the null hypothesis was that there is no correlation between the number of violent crimes in a month and the average number of days with maximum temperatures >=40 C. The alternative hypothesis was that there is a correlation between the two variables.

# Check for correlation using the Pearson correlation coefficient
cor(MergedData40$Reading, log_crimes)

## [1] 0.2863103

# r = 0.29 indicating a positive correlation
MergedData40 <- mutate(MergedData40, "Log_V_Crimes"=log_crimes)
bivariate <- as.matrix(select(MergedData40, Reading, Log_V_Crimes))
rcorr(bivariate, type="pearson")

##              Reading Log_V_Crimes
## Reading         1.00         0.29
## Log_V_Crimes    0.29         1.00
## 
## n= 162 
## 
## 
## P
##              Reading Log_V_Crimes
## Reading              2e-04       
## Log_V_Crimes 2e-04

# So r=0.29 and p<0.001
# H0: r = 0
# Ha: r != 0
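# Test statistic for r: t = r*sqrt(n-2)/sqrt(1-r^2)
r <- cor(MergedData40$Reading, log_crimes)
t_r <- r*sqrt(162-2)/sqrt(1-r^2)
2*pt(q=abs(t_r), df=162-2, lower.tail=FALSE)

## [1] 0.0002211508

# The p-value for r is <0.001 hence we reject H0: there is a statistically significant positive correlation, consistent with the rcorr output above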