Instructions

Use the dataset named “MedianZIP.xlsx” that has median income by zip code (an excel file).
Read the MedianZIP.xlsx, use the proper function, and save it as mydata. Save the .xlsx file in this project’s data folder.


Step 1 - Load the Data

Step 1.1 - Read the data

Install the readxl and tidyverse package and load them. Read the MedianZIP.xlsx, use the proper function, and save it as mydata. If you need other packages for this lab, you may need to install them too. Try what works. There are so many ways to do this.

# Write your code below.

# Install and loading packages
options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages("readxl")
## 
## The downloaded binary packages are in
##  /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
install.packages("tidyverse")
## 
## The downloaded binary packages are in
##  /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(readxl)
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#file path for MedianZip
file_path <- "/Users/auz/Downloads/Week4_HW_zip/MedianZIP.xlsx"

# Read the data using read_excel function from readxl package
mydata <- read_excel(file_path)
## Warning: Expecting numeric in C5070 / R5070C3: got '.'
## Warning: Expecting numeric in C25733 / R25733C3: got '.'
## Warning: Expecting numeric in C25734 / R25734C3: got '.'
## Warning: Expecting numeric in C25735 / R25735C3: got '.'
## Warning: Expecting numeric in C25811 / R25811C3: got '.'
## Warning: Expecting numeric in C29646 / R29646C3: got '.'
## Warning: Expecting numeric in C29981 / R29981C3: got '.'
# Display a summary of the data
summary(mydata)
##       Zip            Median               Mean               Pop        
##  Min.   : 1001   Min.   :    32.98   Min.   :    53.6   Min.   :     1  
##  1st Qu.:27301   1st Qu.: 38462.00   1st Qu.: 48593.2   1st Qu.:   736  
##  Median :49875   Median : 46503.32   Median : 56949.6   Median :  2756  
##  Mean   :49875   Mean   : 50938.21   Mean   : 63452.2   Mean   :  9193  
##  3rd Qu.:72134   3rd Qu.: 58255.50   3rd Qu.: 70341.2   3rd Qu.: 12513  
##  Max.   :99929   Max.   :223106.17   Max.   :361842.3   Max.   :113916  
##                                      NA's   :7

Step 1.2 - Clean the data

Clean up the dataframe if needed and make sure the column names are zip, median, mean, population. Make sure the values in each column are “numeric”. If they are factors or characters, you must change them to quantitative data.

# Write your code below.

# Specify the file path
file_path <- "/Users/auz/Downloads/Week4_HW_zip/MedianZIP.xlsx"

# Read the data using read_excel function from readxl package
mydata <- readxl::read_excel(file_path, na = c(".", ""))

# Display a summary of the original data
summary(mydata)
##       Zip            Median               Mean               Pop        
##  Min.   : 1001   Min.   :    32.98   Min.   :    53.6   Min.   :     1  
##  1st Qu.:27301   1st Qu.: 38462.00   1st Qu.: 48593.2   1st Qu.:   736  
##  Median :49875   Median : 46503.32   Median : 56949.6   Median :  2756  
##  Mean   :49875   Mean   : 50938.21   Mean   : 63452.2   Mean   :  9193  
##  3rd Qu.:72134   3rd Qu.: 58255.50   3rd Qu.: 70341.2   3rd Qu.: 12513  
##  Max.   :99929   Max.   :223106.17   Max.   :361842.3   Max.   :113916  
##                                      NA's   :7
# Create a new data frame with cleaned and standardized zipcodes
cleaned_data <- data.frame(
  zip = as.character(mydata$Zip),  # Assuming 'Zip' is the correct column name
  median = as.numeric(mydata$Median),
  mean = as.numeric(mydata$Mean),
  population = as.numeric(mydata$Pop)
)

# Display the cleaned data
head(cleaned_data)
##     zip    median      mean population
## 1 10001  71244.61 123112.78      17678
## 2 10002  30843.96  46258.61      70878
## 3 10003  89998.53 139331.00      53609
## 4 10004 110183.69 156682.76       1271
## 5 10005 115133.29 163762.66       1517
## 6 10006 111220.00 156776.00        972
# Display a summary of the cleaned data
summary(cleaned_data)
##      zip                median               mean            population    
##  Length:32634       Min.   :    32.98   Min.   :    53.6   Min.   :     1  
##  Class :character   1st Qu.: 38462.00   1st Qu.: 48593.2   1st Qu.:   736  
##  Mode  :character   Median : 46503.32   Median : 56949.6   Median :  2756  
##                     Mean   : 50938.21   Mean   : 63452.2   Mean   :  9193  
##                     3rd Qu.: 58255.50   3rd Qu.: 70341.2   3rd Qu.: 12513  
##                     Max.   :223106.17   Max.   :361842.3   Max.   :113916  
##                                         NA's   :7
# Assign the cleaned data back to mydata
mydata <- cleaned_data

Step 1.3 - Load library and data

  1. Load the zipcode package (install the zipcode package first, which is an archived package). The zipcode package can be installed by doing the following. Alternatively, you can use more recent ‘zipcodeR’ package instead.
install.packages("remotes")
library(remotes)
install_version("zipcode", "1.0")
library(zipcode)
  1. Use data(zipcode) to load a dataframe that contains city, state, latitude, and longitude for US zip codes.
  2. Double-check your environment to find the zipcode dataframe with five variables and 44336 rows.
# Write your code below.
# Install and load the remotes package
install.packages("remotes")
## 
## The downloaded binary packages are in
##  /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(remotes)
install_version("zipcode", "1.0")
## Downloading package from url: https://cran.rstudio.com/src/contrib/Archive/zipcode/zipcode_1.0.tar.gz
library(zipcode)

# Load the data
data(zipcode)

Step 1.4 - Merge the data

Merge the zipcode information from the two dataframes (merge into one dataframe).

  1. First, clean up and standardize the zipcodes in mydata using the clean.zipcodes() function, and save the values to the zip column of mydata.
  2. Merge mydata and zipcode by the common column zip and store the new dataframe as dfNew.
  3. use the merge() function for this.
# Write your code below.
# Step 1.4 - Merge the data

# Clean up and standardize zipcodes in mydata
mydata$zip <- clean.zipcodes(mydata$zip)

# Merge mydata and zipcode by the common column 'zip'
dfNew <- merge(mydata, zipcode, by = 'zip')

# Display the merged dataframe
head(dfNew)
##     zip   median     mean population        city state latitude longitude
## 1 01001 56662.57 66687.75      16445      Agawam    MA 42.07061 -72.62029
## 2 01002 49853.42 75062.63      28069     Amherst    MA 42.37765 -72.50323
## 3 01003 28462.00 35121.00       8491     Amherst    MA 42.36956 -72.63599
## 4 01005 75423.00 82442.00       4798       Barre    MA 42.41209 -72.10443
## 5 01007 79076.35 85801.98      12962 Belchertown    MA 42.27842 -72.41100
## 6 01008 63980.00 78391.00       1244   Blandford    MA 42.17431 -72.94828

Step 1.5 - Clean the data again

Remove Hawaii and Alaska (just focus on the “lower 48” states). HINT: You can use the which() function we learned from Intro to Data Science or you can use dplyr to filter the proper rows (use of course the filter() function in the dplyr package).

  • After removing the two states, you should have 32321 rows in your new dataframe. (mydata has 32634 rows in it.)
# Write your code below.

dfLower48 <- filter(dfNew, state != "HI" & state != "AK")

# Check the number of rows after filtering
nrow(dfLower48)
## [1] 32321

Step 2 - Show the income and population per state

Step 2.1 - Create a simpler dataframe

Create a simpler dataframe (call it dfSimple), with just the average median income and the population for each state.

  • There are many ways to do this. But the simplest way is by using dplyr. Use group_by() and summarize() from “dplyr” to do this.
  • The new dataframe should look like this:
Step 2.1 Environment
Step 2.1 Environment
# Write your code below.

dfSimple <- dfLower48 %>%
  group_by(state) %>%
  summarize(avg_median_income = mean(median, na.rm = TRUE),
            total_population = sum(population, na.rm = TRUE))

print(dfSimple)
## # A tibble: 49 × 3
##    state avg_median_income total_population
##    <chr>             <dbl>            <dbl>
##  1 AL               40550.          4770242
##  2 AR               36961.          2936699
##  3 AZ               48132.          6360679
##  4 CA               62629.         36927999
##  5 CO               56303.          4979279
##  6 CT               78520.          3548308
##  7 DC               66834.           579471
##  8 DE               64299.           892487
##  9 FL               50188.         18593035
## 10 GA               44564.          9603179
## # ℹ 39 more rows

Step 2.2 - Update columns

Add the state abbreviations and the state names as new columns (make sure the state names are all lower case).

  1. Get the state name (not just the abbreviations). Use the built-in state.name and state.abb datasets. This is the code: dfSimple$stateName <- state.name[match(dfSimple$state, state.abb)]
  2. Convert stateName to lowercase and save the values in the stateName column.
# Write your code below.

# Add the state name
dfSimple$stateName <- tolower(state.name[match(dfSimple$state, state.abb)])

# Convert stateName to lowercase
dfSimple$stateName <- tolower(dfSimple$stateName)

Step 2.3 - Visualize the US (pt1)

Show the U.S. map, using color to represent the average median income of each state.

  1. Get the data on the state to be mapped. Use map_data() function to read "state" object and save the result as us.
  2. Use dfSimple to create a map and set stateName as map_id. (follow the course content practice, written in the textbook and in the video).

It should look like this (please do not forget to add the title of the map):

Step 2.3 Map
Step 2.3 Map
# Write your code below.
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
install.packages("ggmap")
## 
## The downloaded binary packages are in
##  /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(ggplot2)
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.2.3
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
##   Stadia Maps' Terms of Service: ]8;;https://stadiamaps.com/terms-of-service/<https://stadiamaps.com/terms-of-service/>]8;;
##   OpenStreetMap's Tile Usage Policy: ]8;;https://operations.osmfoundation.org/policies/tiles/<https://operations.osmfoundation.org/policies/tiles/>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
us <- map_data("state")

# Create a map using ggplot
map <- ggplot(dfSimple, aes(map_id = stateName)) +
  geom_map(map = us, fill = "white", color = "black") +
  expand_limits(x = us$long, y = us$lat) +
  coord_map() +
  ggtitle("Average Median Income by State")

# Display the map
print(map)

Step 2.4 - Visualize the US (pt2)

Create a second map with color representing the population of the state. It should look like this:

Step 2.4 Map
Step 2.4 Map
# Write your code below.

Step 3 - Show the income per zip code

Draw each zipcode on the map, where the color of the “dot” is based on the median income. To make the map look appealing, set the background of the map to black.

The graph should look like this:

Step 3 Map
Step 3 Map
# Write your code below.

Step 4 - Show zip code density

Now generate a different map, one where we can easily see where there are lots of zipcodes and where there are few (using the stat_density2d() function). We will name this as mapD.

It should look like this:

Step 4 Map
Step 4 Map
# Write your code below.

Step 5 - Zoom in to the region around Tampa

Repeat steps 3 and 4, but have the image/map of the Tampa Bay area.

Below I am giving out the code for this:

# Before using geocode function, you must create Google API key. Follow directions in this url:
https://setcompass.com/How-to-Get-Google-Maps-API-Key-Guide.htm
# Google map requires your credit card information to avoid excessive use of Google resources.It will not charge you money as long as you use this to do this homework. It does not use a lot of calls. 

#register_google(key = "your key here", write = TRUE) #### please delete your key information before submitting the compiled file. You can either compile as a docx file and delete this line, or use other software to hide the key information. 

# use geocode function to get latitude and longtitude of Tampa
latlon <- geocode("Tampa, fl")

# create the first zoomed map based on "mapZip", and plot a point representing Tampa
mapZipZoomed <- mapZip + geom_point(aes(x = latlon$lon, y = latlon$lat), color="darkred", size = 3)

# zoom into the region arount Tampa with 10 degrees latitude and longtitude fluctuation (+/- 10)
mapZipZoomed <- mapZipZoomed + xlim(latlon$lon-10, latlon$lon+10) + ylim(latlon$lat-10,latlon$lat+10) + coord_map()

# plot the map
mapZipZoomed
# Write your code below.