The aim of this report is to showcase the data wrangling skills obtained during this course, using two open-source, real-world datasets, one of which possesses some level of untidiness.
We will follow the required steps for the completion of this assessment.
First, we install (if needed) and load the packages required to produce the report.
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE)
# Install script, should installing packages be needed.
#install.packages("readr")
#install.packages("here")
#install.packages("knitr")
#install.packages("dplyr")
#install.packages("magrittr")
#install.packages("tidyr")
#install.packages("stringr")
#install.packages("lubridate")
#install.packages("Hmisc")
#install.packages("outliers")
#install.packages("MVN")
#install.packages("ggplot2")
#install.packages("forecast")
# Accessing the necessary libraries
library(knitr) # Useful for creating nice tables
library(readr) # Useful for importing and exporting data
library(here) # For specifying file paths
library(dplyr) # For data wrangling, e.g. mutating new columns
library(magrittr) # For pipes
library(tidyr) # For tidying data
library(stringr) # For working with character strings
library(lubridate) # For working with dates and times
library(Hmisc) # For miscellaneous statistical functions
library(outliers) # For outlier detection
library(MVN) # For multivariate normality tests
library(ggplot2) # For advanced visualisations
library(forecast) # For lambda parameter for the Box-Cox transformation
We initially searched for datasets whose content could support a possible story to explore. Since that search did not yield the desired results, particularly a set fulfilling the untidiness requirement, the two datasets below were found along the way and will be used to explore a different subject.
This is a compilation of average daily city temperatures, originally provided as text files by the University of Dayton, which SRK subsequently assembled into one large csv file. It can be found on Kaggle and has multiple authors (Creative Commons licence).
Its variables are Region, Country, State (for US cities), City, Month, Day, Year and Average Temperature (in degrees Fahrenheit); all self-explanatory.
avgdT <- read.csv(here::here("city_temperature.csv")) # assumes the csv sits in the project root that {here} resolves to
head(avgdT)
Given the large size of this data frame, we are going to subset it to the desired city, Bogota, which is the city we will be selecting from the second dataset.
bogotaT <- filter(avgdT, City == "Bogota")
head(bogotaT)
We will be cleaning and tidying the data in subsequent steps, but first let’s introduce the second dataset.
This is a compilation of data on reported cases of domestic violence in Colombia over time. It can be found on Kaggle and is in the public domain (Creative Commons licence).
The following variables can be found:
- Departments: the political and administrative divisions that make up the national territory.
- Municipalities: the second level of administrative division in Colombia.
- DANE Code: a standardised nomenclature, designed by DANE for the identification of Territorial Entities (departments, districts and municipalities), Non-Municipalised Areas and Populated Centres, assigning a unique numerical code to each of these territorial units.
- Weapons Medium: the type of weapon or item used to commit the act of violence.
- Date Made: the day on which the violent event took place.
- Gender: sex of the victim.
- Age Group: age bracket of the victim.
- Quantity: number of people who were injured.
dvcolombia <- read.csv(here::here("Reporte_Delito_Violencia_Intrafamiliar_Polic_a_Nacional.csv")) # assumes the csv sits in the project root that {here} resolves to
head(dvcolombia)
This table requires some cleaning and feature selection. First we need to focus on the values for Bogota, which we are going to filter with "BOGOTÁ D.C. (CT)", subsetting the data frame.
dvbog <- filter(dvcolombia, MUNICIPIO == "BOGOTÁ D.C. (CT)")
str(dvbog)
## 'data.frame': 53792 obs. of 8 variables:
## $ DEPARTAMENTO: chr "CUNDINAMARCA" "CUNDINAMARCA" "CUNDINAMARCA" "CUNDINAMARCA" ...
## $ MUNICIPIO : chr "BOGOTÁ D.C. (CT)" "BOGOTÁ D.C. (CT)" "BOGOTÁ D.C. (CT)" "BOGOTÁ D.C. (CT)" ...
## $ CODIGO.DANE : chr "11001000" "11001000" "11001000" "11001000" ...
## $ ARMAS.MEDIOS: chr "ARMA BLANCA / CORTOPUNZANTE" "CONTUNDENTES" "CONTUNDENTES" "CONTUNDENTES" ...
## $ FECHA.HECHO : chr "1/01/2010" "1/01/2010" "1/01/2010" "1/01/2010" ...
## $ GENERO : chr "FEMENINO" "FEMENINO" "FEMENINO" "MASCULINO" ...
## $ GRUPO.ETARIO: chr "ADULTOS" "ADULTOS" "MENORES" "ADULTOS" ...
## $ CANTIDAD : int 1 22 1 6 1 1 1 1 5 1 ...
To organise this set, we need to rename the variables to their English equivalents; we can also drop the redundant columns that provide no information after subsetting: Departments, Municipalities and DANE codes.
dvbog %<>% select(-(1:3))
names(dvbog)
## [1] "ARMAS.MEDIOS" "FECHA.HECHO" "GENERO" "GRUPO.ETARIO" "CANTIDAD"
In order to change the names, let’s assign the translations to a vector.
names(dvbog) <- c("Used_weapon", "Date", "Genre", "Age_group", "Num_victims")
head(dvbog)
Further preprocessing is required to translate some of the observation values into English, and then to turn the variables into factors.
uw <- dvbog$Used_weapon %>% unique()
uw
## [1] "ARMA BLANCA / CORTOPUNZANTE" "CONTUNDENTES"
## [3] "NO REPORTADO" "SIN EMPLEO DE ARMAS"
## [5] "CORTOPUNZANTES" "CORTANTES"
## [7] "PUNZANTES" "ARMA DE FUEGO"
## [9] "NO REPORTA" "ESCOPOLAMINA"
Some of them are redundant, so we are going to collapse these categories into their English translations, using the string replacement function str_replace_all() from {stringr}.
strep1 <- c("ARMA BLANCA / CORTOPUNZANTE" = "KNIFE",
"CONTUNDENTES" = "HEAVY",
"NO REPORTADO" = "UNREPORTED",
"SIN EMPLEO DE ARMAS" = "NO-WEAPON",
"CORTOPUNZANTES" = "KNIFE",
"CORTANTES" = "KNIFE",
"PUNZANTES" = "KNIFE",
"ARMA DE FUEGO" = "FIRE-ARM",
"NO REPORTA" = "UNREPORTED",
"ESCOPOLAMINA" = "POISON")
dvbog$Used_weapon %<>% str_replace_all(strep1)
# TO CHECK THE NEW VALUES
uw <- dvbog$Used_weapon %>% unique()
uw
## [1] "KNIFE" "HEAVY" "UNREPORTED" "NO-WEAPON" "FIRE-ARM"
## [6] "POISON"
Now we can turn the observations into factors.
dvbog$Used_weapon %<>% factor(.,
levels = uw)
str(dvbog$Used_weapon)
## Factor w/ 6 levels "KNIFE","HEAVY",..: 1 2 2 2 3 3 4 2 2 3 ...
We do the same for Genre.
ug <- dvbog$Genre %>% unique()
ug
## [1] "FEMENINO" "MASCULINO" "NO REPORTA" "-" ""
We can see some unique values that represent missing data, so we are going to map "-" to NA with str_replace_all(), then apply str_replace_na(), and deal with the remaining cases in Step 5.
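To be clear about what str_replace_na() does: it converts true NAs into the literal string "NA", rather than the other way round. A one-line illustration:
str_replace_na(c("FEMENINO", NA)) # returns c("FEMENINO", "NA")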
strep2 <- c("FEMENINO" = "FEMALE",
"MASCULINO" = "MALE",
"NO REPORTA" = "UNREPORTED",
"-" = NA_character_)
dvbog$Genre %<>% str_replace_all(strep2) %>% str_replace_na()
# TO CHECK THE NEW VALUES
ug <- dvbog$Genre %>% unique()
#View(ug)
dvbog$Genre %<>% factor(.,
levels = ug)
str(dvbog$Genre)
## Factor w/ 5 levels "FEMALE","MALE",..: 1 1 1 2 1 2 1 1 1 1 ...
For Age_group, we proceed just as before.
uag <- dvbog$Age_group %>% unique()
uag
## [1] "ADULTOS" "MENORES" "ADOLESCENTES" "NO REPORTA" ""
We can see there are missing values, encoded as zero-length strings, so we use the same methodology as for Genre.
strep3 <- c("ADULTOS" = "ADULTS",
"MENORES" = "MINORS",
"ADOLESCENTES" = "TEENS",
"NO REPORTA" = "UNREPORTED")
dvbog$Age_group %<>% str_replace_all(strep3) %>% str_replace_na()
# TO CHECK THE NEW VALUES
uag <- dvbog$Age_group %>% unique()
#View(uag)
dvbog$Age_group %<>% factor(.,
levels = uag)
str(dvbog$Age_group)
## Factor w/ 5 levels "ADULTS","MINORS",..: 1 1 2 1 1 1 1 3 1 1 ...
We know that the zero-length string values have not been transformed and will have to be dealt with in Step 5.
The only thing left to make this data frame completely clean and ready to join is converting the dates from strings to Date format, using the mutate() function to update the variable; we know the date format is the standard dd/mm/yyyy:
dvbog %<>%
mutate(Date = as.Date(Date, format = "%d/%m/%Y"))
str(dvbog)
## 'data.frame': 53792 obs. of 5 variables:
## $ Used_weapon: Factor w/ 6 levels "KNIFE","HEAVY",..: 1 2 2 2 3 3 4 2 2 3 ...
## $ Date : Date, format: "2010-01-01" "2010-01-01" ...
## $ Genre : Factor w/ 5 levels "FEMALE","MALE",..: 1 1 1 2 1 2 1 1 1 1 ...
## $ Age_group : Factor w/ 5 levels "ADULTS","MINORS",..: 1 1 2 1 1 1 1 3 1 1 ...
## $ Num_victims: int 1 22 1 6 1 1 1 1 5 1 ...
In order to be able to join the data frames, we will have to tidy up bogotaT.
We can see that the bogotaT dataset is untidy for a number of reasons. If we look at the structure of this table:
str(bogotaT)
## 'data.frame': 9266 obs. of 8 variables:
## $ Region : chr "South/Central America & Carribean" "South/Central America & Carribean" "South/Central America & Carribean" "South/Central America & Carribean" ...
## $ Country : chr "Colombia" "Colombia" "Colombia" "Colombia" ...
## $ State : chr "" "" "" "" ...
## $ City : chr "Bogota" "Bogota" "Bogota" "Bogota" ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ AvgTemperature: num 54.1 58.1 50.6 52.8 52.5 51.9 52.8 55.7 55.8 50.8 ...
We can see: 1.- Redundant variables, which do not provide relevant information. This is not a difficult item to fix; we just need to re-select the relevant columns, dropping Region, Country, State and City.
bogotaT %<>% select(-(1:4))
str(bogotaT)
## 'data.frame': 9266 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ AvgTemperature: num 54.1 58.1 50.6 52.8 52.5 51.9 52.8 55.7 55.8 50.8 ...
A quick check of the most recent date in dvbog, which we will use as a bound for filtering:
max(dvbog$Date)
## [1] "2021-05-31"
2.- Creating a Date variable that combines Month, Day and Year, with the make_date() function from lubridate; that way each observation will be unique and have its own cell. We can also filter the results to the min and max date range of dvbog, although this could also have been handled during the join.
bogotaT %<>%
mutate(Date = make_date(Year, Month, Day)) %>%
select(Date, AvgTemperature) %>%
# Filtering from the max and min dates in dvbog
filter(., Date >= min(dvbog$Date)) %>%
filter(., Date <= max(dvbog$Date))
str(bogotaT)
## 'data.frame': 3787 obs. of 2 variables:
## $ Date : Date, format: "2010-01-01" "2010-01-02" ...
## $ AvgTemperature: num 51.4 55.7 51.6 52.1 51.2 54.3 57.3 59.1 58.5 54.7 ...
Now we are ready to merge the datasets, as both are in a tidy format and share a common date column.
By merging these two data frames, we want to be able to generate statistics relating domestic violence cases to temperature. This means we trim dvbog to the maximum date available in bogotaT.
dvbog %<>% filter(., Date <= max(bogotaT$Date))
For easier manipulation at a later stage, we are going to use right_join() from {dplyr}, with dvbog as x; it may have several observations on a single day and no reported incidents on others, and with this type of join we keep all the observations from bogotaT, so we can account for the days with no cases.
mdf <- right_join(dvbog, bogotaT)
str(mdf)
## 'data.frame': 39538 obs. of 6 variables:
## $ Used_weapon : Factor w/ 6 levels "KNIFE","HEAVY",..: 1 2 2 2 3 3 4 2 2 3 ...
## $ Date : Date, format: "2010-01-01" "2010-01-01" ...
## $ Genre : Factor w/ 5 levels "FEMALE","MALE",..: 1 1 1 2 1 2 1 1 1 1 ...
## $ Age_group : Factor w/ 5 levels "ADULTS","MINORS",..: 1 1 2 1 1 1 1 3 1 1 ...
## $ Num_victims : int 1 22 1 6 1 1 1 1 5 1 ...
## $ AvgTemperature: num 51.4 51.4 51.4 51.4 51.4 51.4 51.4 55.7 55.7 55.7 ...
dplyr automatically detects the common column, Date, and joins by it.
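To make the join explicit rather than relying on the message, the equivalent call is:
mdf <- right_join(dvbog, bogotaT, by = "Date")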
From the structure showcased above, we can see three factor columns (Used_weapon, Genre and Age_group), an integer variable for the number of cases, and a numeric one for the average temperature in degrees Fahrenheit.
Let’s arrange the columns in a different manner with the next select statement.
mdf %<>% select(2, everything())
head(mdf)
In this case, we want the temperature in degrees Celsius, to better interpret the results.
We are going to create a new variable for the average temperature in degrees Celsius and drop the original one, using the mutate() function.
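In formula form, the conversion is \( T_{C} = (T_{F} - 32) \times \frac{5}{9} \).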
mdf %<>%
  mutate(avgTempC = round((AvgTemperature - 32) * (5/9), digits = 1)) %>% # rounding inside mutate() avoids the pipe error
  select(-AvgTemperature)
str(mdf)
## 'data.frame': 39538 obs. of 6 variables:
## $ Date : Date, format: "2010-01-01" "2010-01-01" ...
## $ Used_weapon: Factor w/ 6 levels "KNIFE","HEAVY",..: 1 2 2 2 3 3 4 2 2 3 ...
## $ Genre : Factor w/ 5 levels "FEMALE","MALE",..: 1 1 1 2 1 2 1 1 1 1 ...
## $ Age_group : Factor w/ 5 levels "ADULTS","MINORS",..: 1 1 2 1 1 1 1 3 1 1 ...
## $ Num_victims: int 1 22 1 6 1 1 1 1 5 1 ...
## $ avgTempC : num 10.8 10.8 10.8 10.8 10.8 10.8 10.8 13.2 13.2 13.2 ...
Here we are going to deal with the known missing values found previously, plus those newly created by the join, which correspond to the dates without cases.
First, we scan the data for missing values.
colSums(is.na(mdf))
## Date Used_weapon Genre Age_group Num_victims avgTempC
## 0 2 2 2 2 0
If we take a look at the rows containing those missing values:
mdf %>%
filter(is.na(Genre)) %>%
select(everything())
Only two missing values per column does not match what we encountered in previous steps with the zero-length strings, nor the fact that mdf contains 11 more rows than dvbog; those extra rows should be missing across all variables except Date and avgTempC.
mdf %>%
summary()
## Date Used_weapon Genre Age_group
## Min. :2010-01-01 KNIFE : 3434 FEMALE :25184 ADULTS :27597
## 1st Qu.:2015-11-13 HEAVY : 9850 MALE :14232 MINORS : 6527
## Median :2018-02-23 UNREPORTED:16470 UNREPORTED: 118 TEENS : 5290
## Mean :2017-07-27 NO-WEAPON : 7716 NA : 2 UNREPORTED: 117
## 3rd Qu.:2020-01-27 FIRE-ARM : 279 : 0 : 5
## Max. :2020-05-13 POISON : 1787 NA's : 2 NA's : 2
## NA's : 2
## Num_victims avgTempC
## Min. : 1.000 Min. :-72.80
## 1st Qu.: 1.000 1st Qu.: 13.60
## Median : 1.000 Median : 14.20
## Mean : 4.503 Mean : 13.79
## 3rd Qu.: 4.000 3rd Qu.: 14.80
## Max. :130.000 Max. : 17.90
## NA's :2
Looking at the summary, there are some missing-value-like entries in the Genre and Age_group columns. Note that the factor level literally named "NA" in Genre (produced by str_replace_na()) and the empty-string level are character values, not true missing values, which is why they do not appear in the colSums(is.na()) scan above.
mdf %>% filter(Age_group == "") %>%
select(everything())
Given the small number of NAs compared to the overall amount of data, we are going to drop those rows.
mdf %<>%
na.omit()
#mdf %>% summary()
Now we are going to deal with the other inconsistencies by changing the stray values to "UNREPORTED", using mutate() with case_when().
" mdf %<>%
mutate(Genre = case_when(Genre == NA_character_ ~ 'UNREPORTED',
TRUE ~ Genre)#,
Age_group = case_when(Age_group == '' ~ 'Unknown', TRUE ~ Age_group)
)
mdf %>%
summary()" # ERRORS I CANNOT CORRECT
## [1] " mdf %<>% \n mutate(Genre = case_when(Genre == NA_character_ ~ 'UNREPORTED', \n TRUE ~ Genre)#, \n Age_group = case_when(Age_group == '' ~ 'Unknown', TRUE ~ Age_group)\n )\nmdf %>% \n summary()"
The attempt raises errors about the length of the replacement values and the comparison with NA_character_, which I do not have enough knowledge to fix; given that we are talking about 5 rows out of almost 40 thousand, choosing to drop the values is still a good solution.
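For reference, a sketch of a version that should work, not run here since we proceed with dropping the rows. My reading of the errors (an assumption, not verified on this data) is that case_when() refuses to mix factors with character strings, and that the stray Genre values are the literal string "NA" left by str_replace_na() rather than true NAs:
mdf %<>%
  mutate(Genre = as.character(Genre), # case_when() needs character, not factor
         Genre = case_when(Genre == "NA" ~ "UNREPORTED",
                           TRUE ~ Genre),
         Genre = factor(Genre),
         Age_group = as.character(Age_group),
         Age_group = case_when(Age_group == "" ~ "UNREPORTED",
                               TRUE ~ Age_group),
         Age_group = factor(Age_group))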
# The zero-length Age_group rows can simply be filtered out
mdf %<>%
  filter(Age_group != "")
mdf %>% summary()
## Date Used_weapon Genre Age_group
## Min. :2010-01-01 KNIFE : 3434 FEMALE :25181 ADULTS :27597
## 1st Qu.:2015-11-13 HEAVY : 9849 MALE :14232 MINORS : 6527
## Median :2018-02-23 UNREPORTED:16470 UNREPORTED: 118 TEENS : 5290
## Mean :2017-07-27 NO-WEAPON : 7712 NA : 0 UNREPORTED: 117
## 3rd Qu.:2020-01-27 FIRE-ARM : 279 : 0 : 0
## Max. :2020-05-13 POISON : 1787
## Num_victims avgTempC
## Min. : 1.000 Min. :-72.80
## 1st Qu.: 1.000 1st Qu.: 13.60
## Median : 1.000 Median : 14.20
## Mean : 4.503 Mean : 13.79
## 3rd Qu.: 4.000 3rd Qu.: 14.80
## Max. :130.000 Max. : 17.90
Just from the summary above, we can clearly see a couple of outliers, at least one of them an inconsistency or a data error: 1.- the minimum avgTempC of -72.8 is well below 0, an impossibility at those latitudes, if not physically. 2.- the maximum Num_victims of 130 could be an input error, given that for domestic violence cases the value should not normally run far beyond double digits.
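A plausible explanation for the first one, worth checking: the University of Dayton source files use -99 °F as a missing-data sentinel, and (-99 - 32) * 5/9 is approximately -72.8. If that assumption is right, every sub-zero Celsius value should be exactly -72.8:
unique(mdf$avgTempC[mdf$avgTempC < 0]) # a single value, -72.8, if the sentinel theory holds
sum(mdf$avgTempC < 0)                  # number of affected rows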
For the temperature, we are going to impute the mean, and then we will take a look at a boxplot to assess the other possible cases in Num_victims.
#mdf$Num_victims[mdf$Num_victims > 20] <- mean(mdf$Num_victims)
mdf$avgTempC[mdf$avgTempC < 0] <- mean(mdf$avgTempC[mdf$avgTempC >= 0]) # mean of the valid readings only, so the outliers do not contaminate the imputed value
mdf %>% summary()
## Date Used_weapon Genre Age_group
## Min. :2010-01-01 KNIFE : 3434 FEMALE :25181 ADULTS :27597
## 1st Qu.:2015-11-13 HEAVY : 9849 MALE :14232 MINORS : 6527
## Median :2018-02-23 UNREPORTED:16470 UNREPORTED: 118 TEENS : 5290
## Mean :2017-07-27 NO-WEAPON : 7712 NA : 0 UNREPORTED: 117
## 3rd Qu.:2020-01-27 FIRE-ARM : 279 : 0 : 0
## Max. :2020-05-13 POISON : 1787
## Num_victims avgTempC
## Min. : 1.000 Min. :10.70
## 1st Qu.: 1.000 1st Qu.:13.70
## Median : 1.000 Median :14.20
## Mean : 4.503 Mean :14.16
## 3rd Qu.: 4.000 3rd Qu.:14.80
## Max. :130.000 Max. :17.90
After the imputation, avgTempC seems to have been resolved; for Num_victims, even allowing ample room above double digits, there still seem to be outliers, so let's have a better look before correcting them.
boxnv <- mdf$Num_victims %>%
boxplot(main = "Number of victims boxplot", ylab = "", col = "grey")
If we look at the out element of the boxplot object, we can see that 4,716 cases (over 10% of the rows) are flagged as outliers, of which 1,676 exceed 20 victims.
summary(boxnv)
## Length Class Mode
## stats 5 -none- numeric
## n 1 -none- numeric
## conf 2 -none- numeric
## out 4716 -none- numeric
## group 4716 -none- numeric
## names 1 -none- character
length(boxnv$out[boxnv$out > 20])
## [1] 1676
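To put that in context against the whole table:
length(boxnv$out) / nrow(mdf) # roughly 0.12, i.e. over 10% of all rows are flagged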
This poses a serious problem: roughly a third of the flagged outliers exceed the assumed cut-off of 20 victims. Let's check the summary of the outlier values.
summary(boxnv$out)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 11.00 15.00 23.29 32.00 130.00
In order to be able to produce a chart, we are going to assume that the outliers falling beyond their median of 15 are input errors, and we will adjust them down to the minimum outlier value of 9.
mdf$Num_victims[mdf$Num_victims > 15] <- min(boxnv$out)
boxnvw <- mdf$Num_victims %>%
boxplot(main = "Number of victims with adjusted outliers boxplot", ylab = "", col = "grey")
Now we have a different summary for the outliers: although their total number has not been reduced, the range is much shorter.
summary(boxnvw)
## Length Class Mode
## stats 5 -none- numeric
## n 1 -none- numeric
## conf 2 -none- numeric
## out 4716 -none- numeric
## group 4716 -none- numeric
## names 1 -none- character
It is possible that the problem with these outliers stems from a change in input methodology, or in the tabulation of the values, which could have turned the standard per-incident counts into daily, weekly or monthly running totals; investigating this is outside the scope of this study and would require domain knowledge of the source rather than more R expertise. A quick, non-conclusive way to eyeball the suspicious records is sketched below.
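This inspects the pre-capping dvbog data for the most extreme counts (the threshold of 50 is arbitrary):
dvbog %>%
  filter(Num_victims > 50) %>%
  arrange(desc(Num_victims)) %>%
  head(10)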
Let's look at the histogram of the outlier values in Num_victims, to gain better insight into their distribution.
hist(boxnv$out, breaks = 30, main = "Num_victims outliers histogram", xlab = NULL)
Most of the outliers are close to the original IQR, though there is a considerable volume of outliers between 20 and 60 victims per case.
From these results, we can see that transformations might be required to compare the values fairly, should we wish to produce a visualisation of the relationship between temperature and number of victims.
Let’s look at continuous variables with the aid of the z-score method and QQ-plots. Looking at the avgTempC:
hist(mdf$avgTempC, breaks = 30, main = "Avg Temp in Celsius histogram", xlab = NULL)
The distribution seems to follow the desired bell-curve shape. Let's see this on display after creating the z-scores and the QQ plot.
z.scores <- mdf$avgTempC %>%
outliers::scores(type = "z")
z.scores %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.72936 -0.49824 0.04028 0.00000 0.68650 4.02532
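For reference, scores(type = "z") should be equivalent to standardising by hand (assuming it uses the sample standard deviation):
z_manual <- (mdf$avgTempC - mean(mdf$avgTempC)) / sd(mdf$avgTempC)
all.equal(as.numeric(z.scores), z_manual) # expect TRUE under that assumption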
Note that we can plot the histogram of these z-scores against the theoretical normal density to check for normality:
hist(z.scores, freq = FALSE, breaks = 40, main = "Avg temp in C z-scores histogram with theoretical normal", xlab = NULL)
x <- seq(-10, 10, by = 0.001)
lines(x = x, y = dnorm(x), col = "red")
We can observe peaks in the middle and to the right, which might indicate left-skewness in this variable. The average temperature sits in the centre, with a tendency towards higher values, which is typical of the equatorial weather band, so there is no need to adjust outliers in this instance.
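The QQ plot mentioned earlier gives a second view of the same question; a quick sketch with base graphics:
qqnorm(mdf$avgTempC, main = "Avg temp in C QQ plot")
qqline(mdf$avgTempC, col = "red")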
Looking at the histogram for Num_victims, to gain a better insight into the distribution and the effect of the outliers.
hist(mdf$Num_victims, breaks = 23, main = "Number of Victims per case histogram", xlab = NULL)
We can see that the scale of the frequencies makes it hard to see the outlier effect, as well as the right-skewness of the data.
There is also an interesting effect: gaps in the distribution at the odd values.
Let's try the Box-Cox transformation, a type of power transformation; we consider it with the aim of transforming non-normal data into approximately normally distributed data, thus decreasing the skewness. The interesting part of this transformation is its parameter \(\lambda\), and with the aid of the {forecast} package and its functions for finding the best \(\lambda\), the task becomes easy.
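For reference, the classical Box-Cox transformation of a positive value \(y\) is

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log(y) & \lambda = 0 \end{cases} \]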
boxcox_nv <- BoxCox(mdf$Num_victims, lambda = "auto")
This function returns the transformed data values. From this output the optimum \(\lambda\) value is found as -0.4002659.
Now we can look at the histogram of the transformed values.
hist(boxcox_nv,
main = bquote("Histogram of Box-Cox transformed victim numbers"),
xlab = "Transformed victim numbers")
Still, it does not produce a great change in the skewness of the data.
Let's try some other, more direct methods.
par(mfrow = c(2, 2))
log10_nv <- log10(mdf$Num_victims)
hist(log10_nv,
main = "Histogram of log (base 10) victim numbers",
xlab = "Base 10 Log of victim numbers")
log_nv <- log(mdf$Num_victims)
hist(log_nv,
main = "Histogram of natural log of victim numbers",
xlab = "Natural Log of victim numbers")
insqr_nv <- mdf$Num_victims^(-2)
hist(insqr_nv,
main = "Histogram of victim numbers inv square",
xlab = "Inverse square of victim numbers")
sqrroot_nv <- mdf$Num_victims^(1/2)
hist(sqrroot_nv,
main = "Histogram of victim numbers squareroot",
xlab = "Inverse square root of victim numbers")
The results across the board are not very convincing; perhaps the outlier problem poses a larger threat than expected.
After making several attempts at plotting the chart, I am finding it impossible to get the temperature line right for comparison; I suspect the Date object on the x-axis of a histogram brings its own challenges, as I am not even sure the frequency is adding the daily cases correctly.
The major impediment seems to be providing the correct binning for the as.Date format.
ggplot(mdf, aes(x = Date)) +
  geom_histogram(binwidth = 30, colour = "white") +
  scale_x_date(date_labels = "%Y-%b", # year-month labels
               breaks = seq(min(mdf$Date), as.Date("2018-05-31"), 30),
               limits = c(as.Date("2010-01-01"), as.Date("2018-05-31"))) +
  xlab("Date") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  geom_line(aes(y = avgTempC), colour = "red") + # geom_line() takes no binwidth argument
  scale_y_continuous(
    # Features of the first axis
    name = "Number of cases",
    # Add a second axis and specify its features
    sec.axis = sec_axis(trans = ~ . / 10, name = "Scaled temperature")
  ) +
  ggtitle("Temperature up, violence up")
My initial plan was to collect a couple of datasets with which to tell a story about climate change and crop production or, as a second option, about the effect of being able to buy Tesla cars with Bitcoin on the cryptocurrency's price and trading volumes. I was not able to find an untidy dataset in those fields, and in the process I found the datasets utilised here.
Some of the questions are: Is there a correlation between violent incidents and higher temperatures? Is family violence affected by environmental factors as well as socio-economic ones?
The initial difficulty of finding the desired datasets disappeared with the change of subject. Another, as mentioned in the section above, was dealing with date-formatted data to generate an x-axis histogram. The remaining two major difficulties were the surprising number of outliers in the number of victims, which I was not expecting to encounter, and plotting the desired graph to visualise the possible correlation.
With the outliers, there is a lot more work to be done to get them right, although I think the major problem lies in the dataset itself. With the plot, I have tried a few methods, and this is the closest I have got to plotting the desired data; more learning would be needed to complete this task.
The major insight, one I already knew but here trusted the source too readily, is to check the data more thoroughly at the beginning, in order to be able to produce the desired results. Given that the aim of this project is to showcase our capabilities, this is not a major issue; it would have been, had an analysis been required. Another major insight is how much more R I need to learn to feel as comfortable as I am with Python when performing and delivering some of these key points.
- Perilla, Oscar David (2021), 'Domestic violence in Colombia', Kaggle (CC0: Public Domain). Downloaded on 19 April 2022: https://www.kaggle.com/datasets/oscardavidperilla/domestic-violence-in-colombia
- Kumar, Sudalai Rajkumar, aka SRK (2020), 'Daily Temperature of Major Cities', Kaggle (Data files: Original Authors). Downloaded on 19 April 2022: https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities?select=city_temperature.csv
- Wickham, Hadley & Grolemund, Garrett (2020), 'R for Data Science', O'Reilly.
- University of Dayton (n.d.), 'Environmental Protection Agency Average Daily Temperature Archive'. Accessed on 19 April 2022: http://academic.udayton.edu/kissock/http/Weather/default.htm