The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
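The chunks below rely on tidyr, dplyr, ggplot2, DT (for datatable()) and countrycode. A minimal setup chunk, assuming these packages are installed, would be:
# Packages used throughout this project
library(tidyr)        # gather()
library(dplyr)        # select(), rename(), bind_rows(), pipes
library(ggplot2)      # plots
library(DT)           # datatable()
library(countrycode)  # countrycode()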
The data is read from a .csv file located on my GitHub. It can be accessed here.
read.csv prefixes column names that start with a digit with an "X", so a regular expression is used to strip that prefix from the year columns.
data1 <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset1_project2.csv")
names(data1)[1] <- "Country"
names(data1) <- gsub(x = names(data1), pattern = "^X", replacement = "")
datatable(data1, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1)
## [1] 232
excess<- c('North America', 'Central & South America', 'Antarctica', 'Eurasia', 'Middle East', 'Asia & Oceania', 'World', 'Africa', 'Europe', 'Former Czechoslovakia', 'Former Serbia and Montenegro', 'Former Yugoslavia', 'East', 'Hawaiian Trade Zone', 'U.S. Pacific Islands', 'Wake Island', 'Former U.S.S.R.')
data1_new <- data1[ !grepl(paste(excess, collapse="|"), data1$Country),]
datatable(data1_new, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_new)
## [1] 212
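Note that grepl() does substring matching, so a short pattern such as "East" will also drop any country name that merely contains that word. If the entries in excess match the Country values exactly, an exact-match filter is a safer alternative (a sketch only, not a change to the results above):
# Exact-match alternative: keep only countries not listed in excess
data1_new <- data1[!data1$Country %in% excess, ]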
A separate Continent column is derived for each country using the countrycode package; it will be appended to the data in the next step.
Continent <- as.factor(countrycode(sourcevar = data1_new[, "Country"], origin = "country.name", destination = "continent"))
Continent<-data.frame(Continent)
datatable(Continent, options = list(scrollY = 300, scroller = TRUE ))
nrow(Continent)
## [1] 212
data1_append<-data.frame(Continent, data1_new)
names(data1_append) <- gsub(x = names(data1_append), pattern = "^X", replacement = "")
datatable(data1_append, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_append)
## [1] 212
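countrycode() returns NA (with a warning) for any name it cannot match, so a quick sanity check on the appended data is useful:
# Countries for which no continent could be assigned (ideally none)
subset(data1_append, is.na(Continent), select = Country)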
The tidyr gather() function is used to reshape the year columns into long format, producing a Year column and a Population column.
data1_long <- data1_append %>% gather(Year, Population, `1980`:`2010`)
datatable(data1_long, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_long)
## [1] 6572
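For reference, gather() has been superseded in current tidyr by pivot_longer(); an equivalent reshape would be:
# Same wide-to-long reshape with pivot_longer() (tidyr >= 1.0.0)
data1_long <- data1_append %>%
  pivot_longer(cols = `1980`:`2010`, names_to = "Year", values_to = "Population")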
data1_france<-subset(data1_long, Country=='France', select=c(Country, Year, Population))
ggplot(data=data1_france, aes(x=Year, y=Population, group=1)) +
geom_line()+
geom_point()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
data1_Martinique <-subset(data1_long, Country=='Martinique', select=c(Country, Year, Population))
ggplot(data=data1_Martinique, aes(x=Year, y=Population, group=1)) +
geom_line()+
geom_point()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
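Both countries can also be compared in a single panel by filtering for both and mapping colour to Country (optional, uses the same data1_long as above):
# France and Martinique on one plot
subset(data1_long, Country %in% c("France", "Martinique")) %>%
  ggplot(aes(x = Year, y = Population, colour = Country, group = Country)) +
  geom_line() +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))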
FRANCE shows a steady increase in population throughout the period. MARTINIQUE shows a steady increase until about 2003; after that the population stays roughly flat.
This dataset has two tables, one for property crime and one for violent crime. After both tables are read in, they are merged for easier analysis. The rates given here are reported offenses per 100,000 population.
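As an illustration of that unit (hypothetical numbers, not taken from the dataset), a rate of 500 in a city of 8.4 million people corresponds to roughly 42,000 reported offenses:
# Hypothetical example of the "per 100,000 population" rate
500 / 100000 * 8400000   # = 42,000 reported offenses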
The data is read from a .csv file located on my GitHub. It can be accessed here.
data2_property <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset2_project2.csv")
names(data2_property) <- gsub(x = names(data2_property), pattern = "^X", replacement = "")
datatable(data2_property, options = list(scrollY = 300, scroller = TRUE ))
data2_property_long <- data2_property %>%
select(-16) %>% #omitting the last empty column
gather("Year", "Total", 3:15)
datatable(data2_property_long, options = list(scrollY = 300, scroller = TRUE ))
The data is read from a .csv file located on my GitHub. It can be accessed here.
data2_violent <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset2.1_project2.csv")
names(data2_violent) <- gsub(x = names(data2_violent), pattern = "^X", replacement = "")
datatable(data2_violent, options = list(scrollY = 300, scroller = TRUE ))
data2_violent_long <- data2_violent %>%
select(-16) %>% #omitting the last empty column
gather("Year", "Total", 3:15)
datatable(data2_violent_long, options = list(scrollY = 300, scroller = TRUE ))
A Crime label is added to each table, and a for loop is then used to stack the two data frames into a single one.
data2_violent_long$Crime <- "Violent"
data2_property_long$Crime <- "Property"
# Empty 286 x 5 data frame to hold the rows of both tables combined
data2_crime = data.frame(matrix(vector(), 286, 5,
                dimnames=list(c(), c("Agency", "State", "Year", "Total", "Crime"))))
# Stack each column of the violent and property tables into the combined frame
for(i in 1:5){
  data2_crime[i] <- bind_rows(data2_violent_long[i], data2_property_long[i])
}
datatable(data2_crime, options = list(scrollX = TRUE))
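The loop above stacks the two tables column by column; dplyr's bind_rows() achieves the same result in one step:
# One-step equivalent of the loop above
data2_crime <- bind_rows(data2_violent_long, data2_property_long)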
ggplot(data2_crime, aes(Year, Total)) +
geom_bar(stat = "identity", position = position_dodge(), aes(fill = Crime)) +
facet_wrap(.~Agency) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
#crime rates for all cities over time
ggplot(data2_crime) +
geom_point(aes(Year, Total, color=Agency))
The crime rate is low for NEW YORK CITY, and it also changes relatively little over the years shown.
This dataset contains the unemployment rates of 88 countries from 1990 to 2017.
The data is read from a .csv file located on my GitHub. It can be accessed here.
data3 <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset3_project2.csv")
names(data3)[1] <- "Year"
names(data3) <- gsub(x = names(data3), pattern = "^X", replacement = "")
data3 <- data3[-c(1), ]
datatable(data3, options = list(scrollY = 300, scroller = TRUE ))
data3_new<-rename(data3,ARG=Argentina,SWE=Sweden,US=United.States,KOR=Korea..Rep.,JPN=Japan,BRA=Brazil,CAN=Canada,IND=India,MEX=Mexico,IRE=Ireland)
data3_new<-select(data3_new,Year,ARG,SWE,US,KOR,JPN,BRA,CAN,IND,MEX,IRE)
datatable(data3_new, options = list(scrollY = 300, scroller = TRUE ))
data3_long <- gather(data3_new,Country,Rate_Unemployment,-Year)
datatable(data3_long, options = list(scrollY = 300, scroller = TRUE ))
data3_long_ire<-subset(data3_long,Country=="IRE")
ggplot(data3_long_ire, aes(x=Year, y=Rate_Unemployment)) + geom_bar(stat='identity')
data3_long_ind<-subset(data3_long,Country=="IND")
ggplot(data3_long_ind, aes(x=Year, y=Rate_Unemployment)) + geom_bar(stat='identity')
## Warning: Removed 19 rows containing missing values (position_stack).
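The warning indicates that India's rate is missing for some years; those rows can be dropped explicitly before plotting, for example:
# Remove years with missing rates to avoid the position_stack warning
data3_long_ind <- subset(data3_long_ind, !is.na(Rate_Unemployment))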
Ireland's unemployment rate was highest in 1993, while India's was highest in 2009.
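These peak years can be checked directly from the long data (converting the rate to numeric first, in case read.csv parsed it as character):
# Year of the maximum unemployment rate for each selected country
data3_long %>%
  mutate(Rate_Unemployment = as.numeric(Rate_Unemployment)) %>%
  group_by(Country) %>%
  filter(Rate_Unemployment == max(Rate_Unemployment, na.rm = TRUE))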