Introduction

  For this project, we were tasked with cleaning multiple untidy data sets into an R-friendly format that can be used for analysis. The following data sets are excellent examples of wild data that require special processing before they become useful. The sections below introduce the data, explain the issues with each set, and then demonstrate the methods used to fix those problems.

Data 1, Part One

  For the first data set, we will be looking at demographic data from the Data Center, a not-for-profit that collates and publishes data for southeastern Louisiana. Specifically, we use a file titled New Orleans Neighborhood Profiles, which contains the data tables that support the Neighborhood Statistical Area Data Profiles. It includes 252 data points for all 72 neighborhood statistical areas, with New Orleans and United States comparisons. The data originate from Census 2000 and 2010, the American Community Survey 2014–2018, and 2017 Local Employment Dynamics data, and cover housing and housing costs, income and poverty, transportation, educational attainment, language, employment, and more.

  This file contains 23 worksheets (tables), each holding parish data for a unique demographic variable (e.g., income distribution) compared between 2000 and 2014–2018. The data are nested by variable, year, and value. To fix the data in this file efficiently, we chose to read every table in as a single list of data frames. This approach offers many advantages, not the least of which are a massive reduction in repetitious code and an environment that is not bloated by dozens of separate data frames.

  Shown below is the function we created to read each table from the xlsx file into one large list of data frames. With it, a single call reads in the entire workbook; reading each table individually would have required 23 separate reader calls. The file can be found and downloaded here.

# Read every worksheet of an xlsx file into a named list of data frames.
read_excel_allsheets <- function(filename, tibble = FALSE) {
    sheets <- readxl::excel_sheets(filename)
    x <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X,
                       skip = 11, col_names = TRUE, .name_repair = "unique"))
    # Return plain data frames unless tibbles are requested.
    if (!tibble) x <- lapply(x, as.data.frame)
    names(x) <- sheets
    x
}

# Download the workbook once if it is not already saved locally.
if (!file.exists('New Orleans Neighborhood profiles data tables.xlsx')) {
    download.file(
'https://github.com/zachsfr/Cuny-SPS/blob/main/607/Projects/Project%202/Hurricane/New_Orleans_Demographics-main/New_Orleans_Demographics-main/New%20Orleans%20Neighborhood%20profiles%20data%20tables.xlsx?raw=true',
        destfile = 'New Orleans Neighborhood profiles data tables.xlsx', mode = "wb")
}

mysheets <- read_excel_allsheets("New Orleans Neighborhood profiles data tables.xlsx")

# Print numeric values with at most 3 significant digits.
options(digits = 3)


# Set aside the two introductory sheets, then drop them from the working list.
extra <- mysheets[c(1, 2)]

# Removing element 1 twice drops the original first and second sheets.
mysheets[[1]] <- NULL
mysheets[[1]] <- NULL

Data 1, Part Two

  Below, we see both the initial R list of data frames and an image of the data set open in Excel. Looking at both, we can start to identify some common problems. For one, because of the spacing between the male and female sections in the original file, the reader function added an empty column between them. The same problem occurs in every table in this file.

  While exploring the data, we begin to notice an increasing number of shared problems. Every table begins with an unnamed column containing the neighborhoods of New Orleans. In addition, because missing values were entered as “NaN%” and “na” even though the rest of the values are numeric, R has no choice but to treat those columns as character values. As a result, the global rounding option we set above does not apply to them, leaving many of the numbers with an unwieldy number of digits.
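  As a quick toy illustration of that last point (the values here are made up, not taken from the project data), a single text placeholder is enough to turn a whole column into character values, and the digits option only applies once the values are numeric:

# Toy example: one placeholder such as "na" forces the whole column to character,
# and options(digits = 3) only affects how numeric values are printed.
pct <- c("12.3456789", "NaN%", "na")
class(pct)        # "character" -- the rounding option does not apply here
options(digits = 3)
as.numeric(pct)   # placeholders become NA (with a coercion warning)
#> [1] 12.3   NA   NA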



Data 1, Part Three

  The choice to leave our data in a list of data frames allows us to use the approach shown below. The map function from the purrr library applies another function to every element of a list, so we can fix the common issues in every table with a minimal amount of code.

  First and foremost, using another library called janitor, we can remove every empty column in our data with remove_empty(), specifying “cols”. Then we can add the missing name to the first column of every table. For the purposes of this project, the MOE (margin of error) columns are unnecessary, so we remove those as well. The next problem to address is the missing values that were entered as character strings. dplyr’s function for converting values to NA, na_if(), only lets us specify one value at a time. The fauxnaif library was designed to fix exactly that, letting us list every value we want converted to NA. Using across() inside mutate(), we can find both “NaN%” and “na” and turn them into NA. With both placeholders properly set to NA, the base R function type.convert() restores the correct column types. The final piece, clean_names(), is another janitor function that transforms the column names into a format friendly for analysis.
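  Before running the full pipeline, here is a small toy comparison (made-up values, not the project data) of the two functions mentioned above:

# dplyr::na_if() targets a single value; fauxnaif::na_if_in() accepts several.
x <- c("12", "NaN%", "na", "7")
dplyr::na_if(x, "na")                   # only "na" becomes NA
fauxnaif::na_if_in(x, c("NaN%", "na"))  # both placeholders become NA

  With both placeholders handled, the full cleanup over the list of tables looks like this: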

mysheets <- map(mysheets, ~ .x %>%
    remove_empty("cols") %>%            # drop the blank spacer columns
    rename(neighborhood = ...1) %>%     # name the unnamed first column
    select(!starts_with("M")) %>%       # drop the margin-of-error columns
    mutate(across(where(is.character),
                  ~ fauxnaif::na_if_in(., c("NaN%", "na")))) %>%
    type.convert() %>%                  # restore numeric column types
    clean_names())

  The table below is a sample of the much tidier data. The neighborhood column is named, the empty columns are gone, and the digits are back to normal. We can now begin fixing the format of the tables in our data.

Data 1, Part Four

  One of the major problems in fixing this data comes from its original format. As seen in the Excel screenshot earlier, and in the tidier table above, the original data set not only placed the variable names above the year information, but also laid the variables out side by side. As a result, in the tidier table above, nothing tells us that the first pair of 2000 and 2014-2018 columns is for females and that the second pair is for males.

  However, fixing this problem becomes trivial with pivot_longer(). It lets us pull information out of the column names and use it to fill new columns. By renaming our columns to carry the information we want in the final data frame, we can then use pivot_longer()'s parameters to extract that information into the new columns we want to create. The completely tidied second table can be seen below the code.

  As an additional note, the extra filter() call in the first block was added to fix an issue visible in the original data inside Excel: a bit of extra information appended to the end of Table 1 only was read in by our reader function. It is extremely important to take extra care to notice abnormalities like this when cleaning untidy data.


mysheets[[1]] <- mysheets$`Table 1` %>%
  # Drop the footnote/source rows appended to the end of Table 1 only.
  filter(!str_detect(neighborhood, "\\*|^So|na")) %>%
  rename(p_2000 = x2000_2, `p_2014-2018` = x2014_2018_3,
         t_2000 = x2000_6, `t_2014-2018` = x2014_2018_7,
         f_2000 = x2000_10, `f_2014-2018` = x2014_2018_11) %>%
  pivot_longer(cols = p_2000:`f_2014-2018`,
               names_to = c("pop_type", "year"),
               names_pattern = "(.)_([0-9]+-[0-9]+|[0-9]+)")


mysheets[[2]] <- mysheets$`Table 2` %>%
  rename(m_2000 = x2000_2, `m_2014-2018` = x2014_2018_3,
         f_2000 = x2000_6, `f_2014-2018` = x2014_2018_7) %>%
  pivot_longer(cols = m_2000:`f_2014-2018`,
               names_to = c("gender", "year"),
               names_pattern = "(.)_([0-9]+-[0-9]+|[0-9]+)")


mysheets[[3]] <- mysheets$`Table 3` %>%
  rename(`0t4_2000` = x2000_2, `0t4_2014-2018` = x2014_2018_3,
         `5t9_2000` = x2000_6, `5t9_2014-2018` = x2014_2018_7,
         `10t14_2000` = x2000_10, `10t14_2014-2018` = x2014_2018_11,
         `15t17_2000` = x2000_14, `15t17_2014-2018` = x2014_2018_15,
         `18t34_2000` = x2000_18, `18t34_2014-2018` = x2014_2018_19,
         `35t49_2000` = x2000_22, `35t49_2014-2018` = x2014_2018_23,
         `50t64_2000` = x2000_26, `50t64_2014-2018` = x2014_2018_27,
         `65t74_2000` = x2000_30, `65t74_2014-2018` = x2014_2018_31,
         `75t84_2000` = x2000_34, `75t84_2014-2018` = x2014_2018_35,
         `85+_2000` = x2000_38, `85+_2014-2018` = x2014_2018_39) %>%
  pivot_longer(cols = `0t4_2000`:`85+_2014-2018`,
               names_to = c("age_bracket", "year"),
               names_pattern = "(.*?)_([0-9]+-[0-9]+|[0-9]+)")

Data 2, Part One

  The second data set we deal with comes from this link here: a Wikipedia article containing a list of nuclear power accidents by country, with each country separated into its own table.

  We can use the following functions from the rvest library to grab the data stored in the tables of the Wikipedia page.

  The image below gives some valuable information about the initial state of our data frames. Our goal with this data set is to combine the separate tables into one complete data frame. However, the default state of this data will not allow us to bind the tables together. Below, we discuss the various fixes that need to be made.

nuclear <- read_html("https://en.wikipedia.org/wiki/List_of_nuclear_power_accidents_by_country") %>%
  html_table()

Data 2, Part Two

  The first important thing to notice is that the 15th table isn’t a table at all; the scraping tool accidentally pulled in extra information that we can drop. The next thing to notice is that the date column type is not uniform: it is numeric in the first table and character in the others. We cannot bind the rows together until the types are the same, and fixing that with the map function mentioned above, combined with mutate(across()), is again very easy. The last thing to change is the column names themselves. While the values in each column are uniform, the names are not, and if we combined the data without fixing this, the resulting data set would have separate columns for each variant of the name.

  To fix this, we need to remove the excess information. The INES column, for example, is sometimes simply “INES” and other times “INES Rating”. Likewise, the Cost column is either simply “Cost” or “Cost” followed by a space and some other information. The key is that in both cases the additional information comes after a space, so we can simply grab the text before the first space and delete the rest. This almost entirely fixes our problem, with one exception: in table 13, “Cost” is not followed by a space, which means that even with our fix it will still have a different name. We correct that column manually.
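  As a quick demonstration of that idea, using a stringr equivalent of the qdap function applied below (shown purely for illustration), keeping the text before the first space collapses the INES variants, while the table 13 cost header slips through because it has no space right after “Cost”:

library(stringr)
str_extract(c("INES", "INES Rating"), "^[^ ]+")
#> [1] "INES" "INES"
str_extract("Cost(in millions2006 US$)", "^[^ ]+")   # table 13's header: the rule misfires
#> [1] "Cost(in"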

  With these changes, we have created a uniform data set from our list of separate tables. There are still some more fixes that could be made, but that goes outside the scope of this project. The final data set can be seen in the table below.


# Drop the 15th element, which is not a real table.
nuclear <- nuclear[c(1:14)]

# Table 13's cost header has no space after "Cost", so rename it by hand.
nuclear[[13]] <- nuclear[[13]] %>%
  rename(Cost = `Cost(in millions2006 US$)`)

# Make every column character so the tables can be row-bound, and keep only
# the text before the first space in each column name.
nuclear <- map(nuclear, ~ .x %>%
                 mutate(across(everything(), as.character)) %>%
                 rename_with(~ qdap::beg2char(.)))

nuclear <- nuclear %>%
  bind_rows()

Data 3, Part One

  The final data set is from here. It contains information on various apps, web scraped from the Google Play Store and last updated two years ago. The table below shows the initial data from our data set.
googleplaystore <- read_csv("https://raw.githubusercontent.com/zachsfr/Cuny-SPS/main/Data/archive/googleplaystore.csv")


  The first thing to notice is that the prices have a dollar sign in front of them, making them character values. We can create a corrected column using gsub.
googleplaystore %>%
  count(Price)  

googleplaystore$Price_ac <- as.numeric(gsub("\\$", "", googleplaystore$Price))

  We then notice that the web scraping method used created duplicate copies of various apps in the data.
googleplaystore %>%
  group_by(App) %>%
  filter(n() > 1) %>%
  arrange(desc(App))

  We fix this by selecting only one version of each App name using the function below.
googleplaystore <- googleplaystore %>%
  distinct(App, .keep_all = TRUE)

  We can now use ggplot to discover which categories are the most common in the app store.
googleplaystore %>%
  group_by(Category) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = reorder(Category, count), y = count)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(x = "Category", y = "Count", title = "Number of Apps in Each Category")


  This graphic shows the price distribution for the most common app categories. As we can see, the majority of these apps are free, which causes the data to be extremely right-skewed.
googleplaystore %>%
  filter(Category %in% c("FAMILY", "GAME", "TOOLS", "BUSINESS")) %>%
  ggplot(aes(x = Price_ac)) +
  geom_histogram(bins = 46) +
  facet_wrap(~Category) +
  labs(x = "Price", y = "Count", title = "Distribution of Price for Top Apps")

Part Two

  In this section, we explore another major advantage of keeping our data as a list of data frames. By doing so, we can efficiently upload our finished data to a SQL database.

  Creating the connection is a simple process once you have a database up and running. For the sake of simplicity, this example uses a dummy password; for a real database, we would read the password from a file so the value stays hidden from anyone looking at our code (a sketch of that approach follows the connection code below).

con <- dbConnect(RPostgres::Postgres(), user='postgres', password='zach', dbname='zach')
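  A minimal sketch of that file-based approach might look like the following; the file name pg_password.txt is purely illustrative, and such a file would be kept out of version control.

pw  <- readLines("pg_password.txt", n = 1)   # hypothetical local file holding only the password
con <- dbConnect(RPostgres::Postgres(), user = 'postgres', password = pw, dbname = 'zach')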

  As seen below, the process of uploading our data to the database is extremely simple. We start by adding the final versions of our second and third data sets to our list, and then we create a list of names based on the data frame names inside the list. We make sure there are no spaces and that all the names are lowercase, as SQL will not behave properly if we try to add tables with spaces or uppercase letters in their names.

  Once we have our list of names properly formatted, we can loop through our list of data frames and, using the dbWriteTable function, write all our data into the database. I am certain there is a way of doing this without looping; unfortunately, I could not figure out how to make lapply work for me this time (a loop-free sketch is included after the code for reference).

mysheets[["Nuclear"]] <- nuclear
mysheets[["googleplaystore"]] <- googleplaystore


str_replace_all(names(mysheets), fixed(" "), "") ->list_name


list_name <- tolower(list_name)

a = 1
 for(i in mysheets){
  dbWriteTable(con,name = list_name[a],value =i ,row.names=FALSE,overwrite=TRUE)
   a = a +1
   }
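  For reference, a loop-free alternative is sketched below. It uses purrr::walk2() rather than lapply; this is only a sketch of how the same writes could be expressed, not the approach used above.

# Sketch: walk2() pairs each cleaned table name with its data frame and
# performs the same dbWriteTable() call for every pair.
purrr::walk2(list_name, mysheets,
             ~ dbWriteTable(con, name = .x, value = .y,
                            row.names = FALSE, overwrite = TRUE))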

  As seen below, we have successfully added all our data to our SQL database.

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
Displaying records 1 - 10
table_name
googleplaystore
nuclear
table1
table10
table11
table12
table13
table14
table15
table16

  This SQL chunk verifies that all our values were also added properly.

 SELECT * FROM table1
Displaying records 1 - 10
neighborhood pop_type year value
Algiers Point p 2000 2381
Algiers Point p 2014-2018 2735
Algiers Point t 2000 1145
Algiers Point t 2014-2018 1326
Algiers Point f 2000 555
Algiers Point f 2014-2018 670
Audubon p 2000 14898
Audubon p 2014-2018 16374
Audubon t 2000 5700
Audubon t 2014-2018 4559

Conclusion

  This project had me working with various kinds of data and led me to discover several new libraries and functions. I have had to work with Excel sheets full of different tables in the past, and having to write the same read function multiple times to get all the data into R, then applying the same fixes many times over, always bothered me. Realizing that I could solve this with lists was an amazing revelation. The advantages you get from having a list of data frames will likely be something I continue to make use of from here on.

  Unfortunately, there was much I was unable to finish with this project. If I had more time, I would fix more of the New Orleans tables and do a full analysis with the fixed data. From there, I would make some nice visuals and create a story based on my findings.

  With the nuclear data, there is still some more cleaning to be done. In addition, we could attempt to fill in some of the missing information from the wiki pages with other data sources. With a more complete data set, we could analyze and work to discover some trends that occur with nuclear disasters.

  The final data set, the Google Play Store data, has many interesting uses. We could delve deeper into the data to discover important trends, such as the key aspects that lead to an app being successful. There are also plenty of other interesting applications of the data left to discover.

  With this in mind, there is still plenty more to investigate. We will leave that adventure for next time! Thank you very much for taking the time to read my project, and hopefully you learned something as well.