Instructions

Use the dataset named “MedianZIP.xlsx” that has median income by zip code (an excel file).
Read the MedianZIP.xlsx, use the proper function, and save it as mydata. Save the .xlsx file in this project’s data folder.


Step 1 - Load the Data

Step 1.1 - Read the data

Install the readxl and tidyverse package and load them. Read the MedianZIP.xlsx, use the proper function, and save it as mydata. If you need other packages for this lab, you may need to install them too. Try what works. There are so many ways to do this.

# Write your code below.
# Step 1.1 - Load readxl and tidyverse, read the Excel file into mydata
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.3
mydata <- read_excel("D:/Documents/Week4_HW_zip/data/DATA 1/MedianZIP.xlsx", na = ".")
# Show first few rows
head(mydata)
## # A tibble: 6 × 4
##     Zip  Median    Mean   Pop
##   <dbl>   <dbl>   <dbl> <dbl>
## 1 10001  71245. 123113. 17678
## 2 10002  30844.  46259. 70878
## 3 10003  89999. 139331. 53609
## 4 10004 110184. 156683.  1271
## 5 10005 115133. 163763.  1517
## 6 10006 111220  156776    972
# Check structure of the dataset
str(mydata)
## tibble [32,634 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Zip   : num [1:32634] 10001 10002 10003 10004 10005 ...
##  $ Median: num [1:32634] 71245 30844 89999 110184 115133 ...
##  $ Mean  : num [1:32634] 123113 46259 139331 156683 163763 ...
##  $ Pop   : num [1:32634] 17678 70878 53609 1271 1517 ...
# Summary statistics to verify numeric columns
summary(mydata)
##       Zip            Median               Mean               Pop        
##  Min.   : 1001   Min.   :    32.98   Min.   :    53.6   Min.   :     1  
##  1st Qu.:27301   1st Qu.: 38462.00   1st Qu.: 48593.2   1st Qu.:   736  
##  Median :49875   Median : 46503.32   Median : 56949.6   Median :  2756  
##  Mean   :49875   Mean   : 50938.21   Mean   : 63452.2   Mean   :  9193  
##  3rd Qu.:72134   3rd Qu.: 58255.50   3rd Qu.: 70341.2   3rd Qu.: 12513  
##  Max.   :99929   Max.   :223106.17   Max.   :361842.3   Max.   :113916  
##                                      NA's   :7

Step 1.2 - Clean the data

Clean up the dataframe if needed and make sure the column names are zip, median, mean, population. Make sure the values in each column are “numeric”. If they are factors or characters, you must change them to quantitative data.

# Write your code below.
library(dplyr)

colnames(mydata) <- c("zip", "median", "mean", "population")

mydata <- mydata %>%
  mutate(
    zip = as.character(zip),
    across(c(median, mean, population), ~ as.numeric(ifelse(. == ".", NA, .)))
  )
# Show first few rows with cleaned columns
head(mydata)
## # A tibble: 6 × 4
##   zip    median    mean population
##   <chr>   <dbl>   <dbl>      <dbl>
## 1 10001  71245. 123113.      17678
## 2 10002  30844.  46259.      70878
## 3 10003  89999. 139331.      53609
## 4 10004 110184. 156683.       1271
## 5 10005 115133. 163763.       1517
## 6 10006 111220  156776         972
# Check structure and types of columns
str(mydata)
## tibble [32,634 × 4] (S3: tbl_df/tbl/data.frame)
##  $ zip       : chr [1:32634] "10001" "10002" "10003" "10004" ...
##  $ median    : num [1:32634] 71245 30844 89999 110184 115133 ...
##  $ mean      : num [1:32634] 123113 46259 139331 156683 163763 ...
##  $ population: num [1:32634] 17678 70878 53609 1271 1517 ...
# Show summary statistics for numeric columns
summary(mydata)
##      zip                median               mean            population    
##  Length:32634       Min.   :    32.98   Min.   :    53.6   Min.   :     1  
##  Class :character   1st Qu.: 38462.00   1st Qu.: 48593.2   1st Qu.:   736  
##  Mode  :character   Median : 46503.32   Median : 56949.6   Median :  2756  
##                     Mean   : 50938.21   Mean   : 63452.2   Mean   :  9193  
##                     3rd Qu.: 58255.50   3rd Qu.: 70341.2   3rd Qu.: 12513  
##                     Max.   :223106.17   Max.   :361842.3   Max.   :113916  
##                                         NA's   :7

Step 1.3 - Load library and data

  1. Load the zipcode package (install the zipcode package first, which is an archived package). The zipcode package can be installed by doing the following. Alternatively, you can use more recent ‘zipcodeR’ package instead.
install.packages("remotes")
library(remotes)
install_version("zipcode", "1.0")
library(zipcode)
  1. Use data(zipcode) to load a dataframe that contains city, state, latitude, and longitude for US zip codes.
  2. Double-check your environment to find the zipcode dataframe with five variables and 44336 rows.
# Write your code below.
# Step 1.3 - Install and load the zipcode package, then load zipcode data
library(remotes)
## Warning: package 'remotes' was built under R version 4.4.3
library(zipcode)
data(zipcode)
head(zipcode)
##     zip       city state latitude longitude
## 1 00210 Portsmouth    NH  43.0059  -71.0132
## 2 00211 Portsmouth    NH  43.0059  -71.0132
## 3 00212 Portsmouth    NH  43.0059  -71.0132
## 4 00213 Portsmouth    NH  43.0059  -71.0132
## 5 00214 Portsmouth    NH  43.0059  -71.0132
## 6 00215 Portsmouth    NH  43.0059  -71.0132
str(zipcode)
## 'data.frame':    44336 obs. of  5 variables:
##  $ zip      : chr  "00210" "00211" "00212" "00213" ...
##  $ city     : chr  "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
##  $ state    : chr  "NH" "NH" "NH" "NH" ...
##  $ latitude : num  43 43 43 43 43 ...
##  $ longitude: num  -71 -71 -71 -71 -71 ...
ls()
## [1] "mydata"  "zipcode"

Step 1.4 - Merge the data

Merge the zipcode information from the two dataframes (merge into one dataframe).

  1. First, clean up and standardize the zipcodes in mydata using the clean.zipcodes() function, and save the values to the zip column of mydata.
  2. Merge mydata and zipcode by the common column zip and store the new dataframe as dfNew.
  3. use the merge() function for this.
# Write your code below.
# Step 1.4 - Clean zipcodes and merge mydata with zipcode
mydata$zip <- clean.zipcodes(mydata$zip)
dfNew <- merge(mydata, zipcode, by = "zip")
head(dfNew)
##     zip   median     mean population        city state latitude longitude
## 1 01001 56662.57 66687.75      16445      Agawam    MA 42.07061 -72.62029
## 2 01002 49853.42 75062.63      28069     Amherst    MA 42.37765 -72.50323
## 3 01003 28462.00 35121.00       8491     Amherst    MA 42.36956 -72.63599
## 4 01005 75423.00 82442.00       4798       Barre    MA 42.41209 -72.10443
## 5 01007 79076.35 85801.98      12962 Belchertown    MA 42.27842 -72.41100
## 6 01008 63980.00 78391.00       1244   Blandford    MA 42.17431 -72.94828
str(dfNew)
## 'data.frame':    32634 obs. of  8 variables:
##  $ zip       : chr  "01001" "01002" "01003" "01005" ...
##  $ median    : num  56663 49853 28462 75423 79076 ...
##  $ mean      : num  66688 75063 35121 82442 85802 ...
##  $ population: num  16445 28069 8491 4798 12962 ...
##  $ city      : chr  "Agawam" "Amherst" "Amherst" "Barre" ...
##  $ state     : chr  "MA" "MA" "MA" "MA" ...
##  $ latitude  : num  42.1 42.4 42.4 42.4 42.3 ...
##  $ longitude : num  -72.6 -72.5 -72.6 -72.1 -72.4 ...
nrow(dfNew)
## [1] 32634
nrow(mydata)
## [1] 32634
nrow(zipcode)
## [1] 44336
nrow(dfNew)
## [1] 32634
colnames(dfNew)
## [1] "zip"        "median"     "mean"       "population" "city"      
## [6] "state"      "latitude"   "longitude"
summary(dfNew$zip)
##    Length     Class      Mode 
##     32634 character character
summary(dfNew$city)  # from zipcode package
##    Length     Class      Mode 
##     32634 character character
summary(dfNew$median) # from your data
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     32.98  38462.00  46503.32  50938.21  58255.50 223106.17
sum(is.na(dfNew$zip))
## [1] 0
sum(is.na(dfNew$city))
## [1] 0

Step 1.5 - Clean the data again

Remove Hawaii and Alaska (just focus on the “lower 48” states). HINT: You can use the which() function we learned from Intro to Data Science or you can use dplyr to filter the proper rows (use of course the filter() function in the dplyr package).

  • After removing the two states, you should have 32321 rows in your new dataframe. (mydata has 32634 rows in it.)
# Write your code below.
# Step 1.5 - Remove Alaska and Hawaii to focus on "lower 48"
dfNew <- dfNew %>% filter(!state %in% c("AK", "HI"))
any(dfNew$state %in% c("AK", "HI"))  # Should be FALSE
## [1] FALSE
nrow_before <- nrow(dfNew) + sum(mydata$state %in% c("AK", "HI"))  # approximate before count
## Warning: Unknown or uninitialised column: `state`.
nrow_after <- nrow(dfNew)
print(nrow_before)  # Should be greater than nrow_after
## [1] 32321
print(nrow_after)
## [1] 32321
head(dfNew[dfNew$state %in% c("AK", "HI"), ])  # Should return zero rows or empty
## [1] zip        median     mean       population city       state      latitude  
## [8] longitude 
## <0 rows> (or 0-length row.names)

Step 2 - Show the income and population per state

Step 2.1 - Create a simpler dataframe

Create a simpler dataframe (call it dfSimple), with just the average median income and the population for each state.

  • There are many ways to do this. But the simplest way is by using dplyr. Use group_by() and summarize() from “dplyr” to do this.
  • The new dataframe should look like this:
Step 2.1 Environment
Step 2.1 Environment
# Write your code below.
# Step 2.1 - Create dfSimple with mean median income and population per state
dfSimple <- dfNew %>%
  group_by(state) %>%
  summarize(
    median = mean(median, na.rm = TRUE),
    population = sum(population, na.rm = TRUE)
  )
head(dfSimple)
## # A tibble: 6 × 3
##   state median population
##   <chr>  <dbl>      <dbl>
## 1 AL    40550.    4770242
## 2 AR    36961.    2936699
## 3 AZ    48132.    6360679
## 4 CA    62629.   36927999
## 5 CO    56303.    4979279
## 6 CT    78520.    3548308
str(dfSimple)
## tibble [49 × 3] (S3: tbl_df/tbl/data.frame)
##  $ state     : chr [1:49] "AL" "AR" "AZ" "CA" ...
##  $ median    : num [1:49] 40550 36961 48132 62629 56303 ...
##  $ population: num [1:49] 4770242 2936699 6360679 36927999 4979279 ...
summary(dfSimple)
##     state               median        population      
##  Length:49          Min.   :34210   Min.   :  555216  
##  Class :character   1st Qu.:43875   1st Qu.: 1811426  
##  Mode  :character   Median :49342   Median : 4464564  
##                     Mean   :51645   Mean   : 6080705  
##                     3rd Qu.:56303   3rd Qu.: 6645671  
##                     Max.   :82208   Max.   :36927999
nrow(dfSimple)
## [1] 49

Step 2.2 - Update columns

Add the state abbreviations and the state names as new columns (make sure the state names are all lower case).

  1. Get the state name (not just the abbreviations). Use the built-in state.name and state.abb datasets. This is the code: dfSimple$stateName <- state.name[match(dfSimple$state, state.abb)]
  2. Convert stateName to lowercase and save the values in the stateName column.
# Write your code below.
# Step 2.2 - Add state full names and convert to lower case
dfSimple$stateName <- tolower(state.name[match(dfSimple$state, state.abb)])
# Check that the new column stateName exists
"stateName" %in% colnames(dfSimple)
## [1] TRUE
# View first few rows to see lowercase state names with abbreviations
head(dfSimple[, c("state", "stateName")])
## # A tibble: 6 × 2
##   state stateName  
##   <chr> <chr>      
## 1 AL    alabama    
## 2 AR    arkansas   
## 3 AZ    arizona    
## 4 CA    california 
## 5 CO    colorado   
## 6 CT    connecticut
# Check the structure to verify the type is character and content makes sense
str(dfSimple$stateName)
##  chr [1:49] "alabama" "arkansas" "arizona" "california" "colorado" ...
# Check for any missing values in stateName (should be none if all matched correctly)
sum(is.na(dfSimple$stateName))
## [1] 1

Step 2.3 - Visualize the US (pt1)

Show the U.S. map, using color to represent the average median income of each state.

  1. Get the data on the state to be mapped. Use map_data() function to read "state" object and save the result as us.
  2. Use dfSimple to create a map and set stateName as map_id. (follow the course content practice, written in the textbook and in the video).

It should look like this (please do not forget to add the title of the map):

Step 2.3 Map
Step 2.3 Map
# Write your code below.
# Step 2.3 - Map of US with average median income per state

library(maps)
## Warning: package 'maps' was built under R version 4.4.3
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
us <- map_data("state")
library(ggplot2)
ggplot() +
  geom_map(data = us, map = us,
           aes(long, lat, map_id = region), fill = "white", color = "black") +
  geom_map(data = dfSimple, map = us,
           aes(fill = median, map_id = stateName)) +
  scale_fill_continuous(name = "Avg Median Income", na.value = "grey50") +
  ggtitle("Average Median Income by State") +
  theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y

Step 2.4 - Visualize the US (pt2)

Create a second map with color representing the population of the state. It should look like this:

Step 2.4 Map
Step 2.4 Map
# Write your code below.
# Step 2.4 - Map colored by population per state
ggplot() +
  geom_map(data = us, map = us,
           aes(long, lat, map_id = region), fill = "black", color = "white") +
  geom_map(data = dfSimple, map = us,
           aes(fill = population, map_id = stateName)) +
  scale_fill_continuous(name = "Population", na.value = "grey50") +
  ggtitle("Population by State") +
  theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y


Step 3 - Show the income per zip code

Draw each zipcode on the map, where the color of the “dot” is based on the median income. To make the map look appealing, set the background of the map to black.

The graph should look like this:

Step 3 Map
Step 3 Map
# Write your code below.
# Step 3 - Map zipcodes colored by median income with black background
mapZip <- ggplot() +
  geom_map(data = us, map = us,
           aes(long, lat, map_id = region),
           fill = "black", color = "white") +
  geom_point(data = dfNew, aes(x = longitude, y = latitude, color = median), alpha = 0.5) +
  scale_color_continuous(name = "Median Income", na.value = "grey50") +
  expand_limits(x = us$long, y = us$lat) +
  ggtitle("Median Income by Zip Code") +
  theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
mapZip


Step 4 - Show zip code density

Now generate a different map, one where we can easily see where there are lots of zipcodes and where there are few (using the stat_density2d() function). We will name this as mapD.

It should look like this:

Step 4 Map
Step 4 Map
# Write your code below.
# Step 4 - Density map of zipcode distribution using stat_density2d()
mapD <- ggplot() +
  geom_map(data = us, map = us,
           aes(long, lat, map_id = region), fill = "black", color = "white") +
  stat_density2d(data = dfNew,
                 aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..), 
                 geom = "polygon", contour = TRUE) +
  scale_fill_viridis_c() +
  scale_alpha(range = c(0.1, 0.7), guide = FALSE) +
  ggtitle("Zip Code Density") +
  theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
mapD
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.


Step 5 - Zoom in to the region around Tampa

Repeat steps 3 and 4, but have the image/map of the Tampa Bay area.

Below I am giving out the code for this:

# Before using geocode function, you must create Google API key. Follow directions in this url:
https://setcompass.com/How-to-Get-Google-Maps-API-Key-Guide.htm
# Google map requires your credit card information to avoid excessive use of Google resources.It will not charge you money as long as you use this to do this homework. It does not use a lot of calls. 

#register_google(key = "your key here", write = TRUE) #### please delete your key information before submitting the compiled file. You can either compile as a docx file and delete this line, or use other software to hide the key information. 

# use geocode function to get latitude and longtitude of Tampa
latlon <- geocode("Tampa, fl")

# create the first zoomed map based on "mapZip", and plot a point representing Tampa
mapZipZoomed <- mapZip + geom_point(aes(x = latlon$lon, y = latlon$lat), color="darkred", size = 3)

# zoom into the region arount Tampa with 10 degrees latitude and longtitude fluctuation (+/- 10)
mapZipZoomed <- mapZipZoomed + xlim(latlon$lon-10, latlon$lon+10) + ylim(latlon$lat-10,latlon$lat+10) + coord_map()

# plot the map
mapZipZoomed
# Write your code below.
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.4.3
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
# Register API key
register_google(key = "AIzaSyCDNcQrWtZu3e0v4i6v_Du_YfKy3NFv-UQ", write = TRUE)
## ℹ Replacing old key (AIzaSyCDNcQrWtZu3e0v4i6v_Du_YfKy3NFv) with new key in D:\Documents/.Renviron
# Get the lat/lon for Tampa
latlon <- geocode("Tampa, FL")
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=Tampa,+FL&key=xxx-UQ>
# Define bounding box or center with zoom level (zoom ranges 3-21; higher numbers zoom in further)
zoom_level <- 10  # adjust zoom level as needed (higher zoom = closer)

# Get the Google map centered on Tampa with zoom
map_tampa <- get_googlemap(center = c(lon = latlon$lon, lat = latlon$lat), zoom = zoom_level, maptype = "roadmap")
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=27.95169,-82.458753&zoom=10&size=640x640&scale=2&maptype=roadmap&key=xxx-UQ>
# Plot the map with Tampa point overlay
print(latlon)
## # A tibble: 1 × 2
##     lon   lat
##   <dbl> <dbl>
## 1 -82.5  28.0
# Check if lon and lat exist and are not NA
if (is.na(latlon$lon) || is.na(latlon$lat)) {
  stop("Geocoding failed: lon or lat is NA")
}
latlon <- data.frame(lon = -82.4572, lat = 27.9506) # Tampa coordinates
map_tampa <- get_googlemap(center = c(lon = latlon$lon, lat = latlon$lat), zoom = 10)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=27.9506,-82.4572&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx-UQ>
ggmap(map_tampa) + 
  geom_point(aes(x = latlon$lon, y = latlon$lat), color = "darkred", size = 3)
## Warning in geom_point(aes(x = latlon$lon, y = latlon$lat), color = "darkred", : All aesthetics have length 1, but the data has 4 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.