Goal

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.

Your task is to:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

 Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

 Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

 Perform the analysis requested in the discussion item.

 Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Dataset 1 : The Population Change Dataset

Read the data

The data is read from a .csv file located in my github. It can be accessed here

The column names had a X before the name so I used some regular expression to clean that out.

data1 <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset1_project2.csv")
names(data1)[1] <- "Country"
names(data1) <- gsub(x = names(data1), pattern = "\\X", replacement = "")
datatable(data1, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1)
## [1] 232

Remove the rows which are not countries

excess<- c('North America', 'Central & South America', 'Antarctica', 'Eurasia', 'Middle East', 'Asia & Oceania', 'World', 'Africa', 'Europe', 'Former Czechoslovakia', 'Former Serbia and Montenegro', 'Former Yugoslavia', 'East', 'Hawaiian Trade Zone', 'U.S. Pacific Islands', 'Wake Island', 'Former U.S.S.R.')

data1_new <- data1[ !grepl(paste(excess, collapse="|"), data1$Country),]

datatable(data1_new, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_new)
## [1] 212

Create a list of country code

A separate list of the country codes of each of the countries is created and it will be appended to the original data later on.

Continent <- as.factor(countrycode(sourcevar = data1_new[, "Country"], origin = "country.name", destination = "continent"))
Continent<-data.frame(Continent)
datatable(Continent, options = list(scrollY = 300, scroller = TRUE ))
nrow(Continent)
## [1] 212

Append data and the country code list

data1_append<-data.frame(Continent, data1_new)
names(data1_append) <- gsub(x = names(data1_append), pattern = "\\X", replacement = "")
datatable(data1_append, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_append)
## [1] 212

convert to long

The tidyr gather() function is used.

data1_long <- df %>% gather(data=data1_append, Population, "1980":"2010")
colnames(data1_long)[colnames(data1_long)=="."]<-"Year" 

datatable(data1_long, options = list(scrollY = 300, scroller = TRUE ))
nrow(data1_long) 
## [1] 6572

Analyse the change in population in FRANCE

data1_france<-subset(data1_long, Country=='France', select=c(Country, Year, Population))

ggplot(data=data1_france, aes(x=Year, y=Population, group=1)) +
  geom_line()+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Analyse the change in population in Martinique

data1_Martinique <-subset(data1_long, Country=='Martinique', select=c(Country, Year, Population))
ggplot(data=data1_Martinique, aes(x=Year, y=Population, group=1)) +
  geom_line()+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Conclusion

In FRANCE there was a steady increase in population throughout the years. In MARTINIQUE there was a steady increase in population till 2003. After 2003 there was not much increase in population.

Dataset 2 : U.S. Department of Justice Crime Dataset

This dataset has 2 tables. One for property crime and one for violent crimes. After reading both the tables they are merged together for better analysis. The rates mentioned here are reported offenses per 100,000 population.

Read the property crime data

The data is read from a .csv file located in my github. It can be accessed here

data2_property <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset2_project2.csv")
names(data2_property) <- gsub(x = names(data2_property), pattern = "\\X", replacement = "")
datatable(data2_property, options = list(scrollY = 300, scroller = TRUE ))

Convert to long

data2_property_long <- data2_property %>% 
  select(-16) %>%                   #omitting the last empty column
  gather("Year", "Total", 3:15)
datatable(data2_property_long, options = list(scrollY = 300, scroller = TRUE ))

Read the violent crime data

The data is read from a .csv file located in my github. It can be accessed here

data2_violent <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset2.1_project2.csv")
names(data2_violent) <- gsub(x = names(data2_violent), pattern = "\\X", replacement = "")
datatable(data2_violent, options = list(scrollY = 300, scroller = TRUE ))

Convert to long

data2_violent_long <- data2_violent %>% 
  select(-16) %>%                   #omitting the last empty column
  gather("Year", "Total", 3:15)
datatable(data2_violent_long, options = list(scrollY = 300, scroller = TRUE ))

Combine the 2 dataframes

Using a for loop the two dataframes are merged together

data2_violent_long$Crime <- "Violent"
data2_property_long$Crime <- "Property"

data2_crime = data.frame(matrix(vector(), 286, 5,
                dimnames=list(c(), c("Agency", "State", "Year", "Total", "Crime"))))

for(i in 1:5){
data2_crime[i] <- bind_rows(data2_violent_long[i], data2_property_long[i])
}

datatable(data2_crime, options = list(scrollX = TRUE))

Analysis

The crime rate is low for NEW YORK CITY and the change is also low for this city.

ggplot(data2_crime, aes(Year, Total)) + geom_bar(stat = "identity", position=position_dodge(), aes(fill = Crime)) + facet_wrap(.~Agency) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Crime rate in all the cities

#crime rates for all cities over time
ggplot(data2_crime) + 
  geom_point(aes(Year, Total, color=Agency))

Conclusion

The crime rate is low for NEW YORK CITY and the change is also low for this city.

Dataset 3 : Unemployment Rates Dataset

This dataset contains the unemployment rates of 88 countries from 1990 - 2017

Read the data

The data is read from a .csv file located in my github. It can be accessed here

data3 <- read.csv("https://raw.githubusercontent.com/irene908/DATA-607/master/dataset3_project2.csv")
names(data3)[1] <- "Year"
names(data3) <- gsub(x = names(data3), pattern = "\\X", replacement = "")
data3 <- data3[-c(1), ]
datatable(data3, options = list(scrollY = 300, scroller = TRUE ))

Cleaning the data and creating a subset of the required data

data3_new<-rename(data3,ARG=Argentina,SWE=Sweden,US=United.States,KOR=Korea..Rep.,JPN=Japan,BRA=Brazil,CAN=Canada,IND=India,MEX=Mexico,IRE=Ireland)
data3_new<-select(data3_new,Year,ARG,SWE,US,KOR,JPN,BRA,CAN,IND,MEX,IRE)
datatable(data3_new, options = list(scrollY = 300, scroller = TRUE ))

Convert to long

data3_long <- gather(data3_new,Country,Rate_Unemployment,-Year)
datatable(data3_long, options = list(scrollY = 300, scroller = TRUE ))

Analysis of unemployment in IRELAND

data3_long_ire<-subset(data3_long,Country=="IRE")
ggplot(data3_long_ire, aes(x=Year, y=Rate_Unemployment)) + geom_bar(stat='identity')

Analysis of unemployment in INDIA

data3_long_ind<-subset(data3_long,Country=="IND")
ggplot(data3_long_ind, aes(x=Year, y=Rate_Unemployment)) + geom_bar(stat='identity')
## Warning: Removed 19 rows containing missing values (position_stack).

Conclusion

In 1993 Ireland had the highest unemployment rates. In 2009 India had the highest unemployment rates.