library(dplyr)
library(kableExtra)
library(DT)
library(ggplot2)
# Load data
data_url <- "https://raw.githubusercontent.com/monuchacko/cuny_msds/master/data_606/security_breaches.csv"
# Read the CSV, keeping text columns as character rather than factor
security_breaches <- read.csv(data_url, stringsAsFactors = FALSE)
# Display output (head() previews rows; sample() on a data frame would shuffle its columns)
# kable(head(security_breaches))
colnames(security_breaches)
## [1] "Number" "Name_of_Covered_Entity"
## [3] "State" "Business_Associate_Involved"
## [5] "Individuals_Affected" "Date_of_Breach"
## [7] "Type_of_Breach" "Location_of_Breached_Information"
## [9] "Date_Posted_or_Updated" "Summary"
## [11] "breach_start" "breach_end"
## [13] "year"
# Keep only the columns needed for the analysis
security_breaches_df <- security_breaches[c("Name_of_Covered_Entity", "State", "Individuals_Affected",
  "Date_of_Breach", "Type_of_Breach", "Location_of_Breached_Information",
  "breach_start", "breach_end", "year")]
datatable(security_breaches_df)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
What are the cases, and how many are there?
dim(security_breaches_df)
## [1] 1055 9
Describe the method of data collection.
What type of study is this (observational/experiment)?
If you collected the data, state self-collected. If not, provide a citation/link.
What is the response variable? Is it quantitative or qualitative?
You should have two independent variables, one quantitative and one qualitative.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, box plots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
quantile(security_breaches_df$Individuals_Affected)
## 0% 25% 50% 75% 100%
## 500 1000 2300 6941 4900000
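The quartiles above point to a strong right skew: the median is 2,300 while the maximum is 4,900,000. A minimal sketch of how that skew looks on a log10 scale follows. It uses only the five quantile values printed above (the vector name `q` is introduced here for illustration); in the analysis itself the same transform would be applied to `security_breaches_df$Individuals_Affected`.

```r
# The five quantiles reported above (0%, 25%, 50%, 75%, 100%); not the full data.
q <- c(`0%` = 500, `25%` = 1000, `50%` = 2300, `75%` = 6941, `100%` = 4900000)

# On a log10 scale the quantiles sit on a comparable footing.
round(log10(q), 2)

# Distance from the median to each extreme, in orders of magnitude.
upper_tail <- log10(q[["100%"]] / q[["50%"]])
lower_tail <- log10(q[["50%"]] / q[["0%"]])
c(lower = lower_tail, upper = upper_tail)
```

The upper tail spans more than three orders of magnitude while the lower tail spans less than one, which is why a log-scaled axis is a reasonable choice for plots of this variable.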
# Flag breaches that were likely remote: the breach type mentions unauthorized access or hacking
security_breaches_df <- security_breaches_df %>%
  mutate(IsRemote = grepl("Unauthorized Access|Hacking", Type_of_Breach, ignore.case = TRUE))
#Show data
datatable(security_breaches_df)
#Find quantile for IsRemote
quantile(security_breaches_df$IsRemote)
## 0% 25% 50% 75% 100%
## 0 0 0 1 1
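Quantiles of a logical flag are hard to interpret; a frequency table and a proportion are usually easier to read. The sketch below rebuilds the flag from the counts that `summary()` reports in this section (771 `FALSE`, 284 `TRUE`); the vector name `is_remote` is introduced here for illustration, and in the analysis it would be `security_breaches_df$IsRemote`.

```r
# Counts taken from the summary() output in this report: 771 FALSE, 284 TRUE.
is_remote <- rep(c(FALSE, TRUE), times = c(771, 284))

# Frequency table and proportions for the logical flag.
remote_counts <- table(is_remote)
remote_props <- prop.table(remote_counts)
round(remote_props, 3)

# The mean of a logical vector is the share of TRUE values.
remote_share <- mean(is_remote)
round(remote_share, 3)  # 0.269
```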
#summary
summary(security_breaches_df)
## Name_of_Covered_Entity State Individuals_Affected
## Length:1055 Length:1055 Min. : 500
## Class :character Class :character 1st Qu.: 1000
## Mode :character Mode :character Median : 2300
## Mean : 30262
## 3rd Qu.: 6941
## Max. :4900000
## Date_of_Breach Type_of_Breach Location_of_Breached_Information
## Length:1055 Length:1055 Length:1055
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## breach_start breach_end year IsRemote
## Length:1055 Length:1055 Min. :1997 Mode :logical
## Class :character Class :character 1st Qu.:2010 FALSE:771
## Mode :character Mode :character Median :2012 TRUE :284
## Mean :2011
## 3rd Qu.:2013
## Max. :2014
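The overall summary does not yet compare the two `IsRemote` groups. One possible way to do that is a per-group median, sketched below in base R. The `toy` data frame and its values are hypothetical and exist only to make the example self-contained; in the analysis `security_breaches_df` would take its place.

```r
# Hypothetical toy rows for illustration only; not taken from the breach data.
toy <- data.frame(
  Individuals_Affected = c(500, 1200, 640, 2300, 8000, 15000),
  IsRemote = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Median number of individuals affected within each IsRemote group.
# The median is used rather than the mean because the variable is heavily skewed.
group_medians <- aggregate(Individuals_Affected ~ IsRemote, data = toy, FUN = median)
group_medians

# Number of breaches in each group, for context.
group_sizes <- table(toy$IsRemote)
group_sizes
```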
# Histogram of individuals affected; the log10 x-axis handles the heavy right skew
ggplot(security_breaches_df, aes(x = Individuals_Affected)) + geom_histogram(bins = 30) + scale_x_log10()
# Total individuals affected, stacked by IsRemote
ggplot(security_breaches_df, aes(x = "", y = Individuals_Affected, fill = IsRemote)) + geom_col(width = 1)
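# Total individuals affected per year (rows are summed within each year before the line is drawn)
ggplot(security_breaches_df, aes(x = year, y = Individuals_Affected)) + stat_summary(fun = sum, geom = "line", color = "#FC4E07", linewidth = 2)
# Share of remote breaches per year (the mean of a logical flag is a proportion)
ggplot(security_breaches_df, aes(x = year, y = as.numeric(IsRemote))) + stat_summary(fun = mean, geom = "line", color = "#FC4E07", linewidth = 2)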