Use the dataset named “MedianZIP.xlsx” that has median income by zip
code (an excel file).
Read the MedianZIP.xlsx
, use the proper function, and save
it as mydata
. Save the .xlsx file in this project’s data
folder.
Install the readxl and tidyverse package and load them. Read the
MedianZIP.xlsx
, use the proper function, and save it as
mydata
. If you need other packages for this lab, you may
need to install them too. Try what works. There are so many ways to do
this.
# Write your code below.
# Step 1.1 - Load readxl and tidyverse, read the Excel file into mydata
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.3
mydata <- read_excel("D:/Documents/Week4_HW_zip/data/DATA 1/MedianZIP.xlsx", na = ".")
# Show first few rows
head(mydata)
## # A tibble: 6 × 4
## Zip Median Mean Pop
## <dbl> <dbl> <dbl> <dbl>
## 1 10001 71245. 123113. 17678
## 2 10002 30844. 46259. 70878
## 3 10003 89999. 139331. 53609
## 4 10004 110184. 156683. 1271
## 5 10005 115133. 163763. 1517
## 6 10006 111220 156776 972
# Check structure of the dataset
str(mydata)
## tibble [32,634 × 4] (S3: tbl_df/tbl/data.frame)
## $ Zip : num [1:32634] 10001 10002 10003 10004 10005 ...
## $ Median: num [1:32634] 71245 30844 89999 110184 115133 ...
## $ Mean : num [1:32634] 123113 46259 139331 156683 163763 ...
## $ Pop : num [1:32634] 17678 70878 53609 1271 1517 ...
# Summary statistics to verify numeric columns
summary(mydata)
## Zip Median Mean Pop
## Min. : 1001 Min. : 32.98 Min. : 53.6 Min. : 1
## 1st Qu.:27301 1st Qu.: 38462.00 1st Qu.: 48593.2 1st Qu.: 736
## Median :49875 Median : 46503.32 Median : 56949.6 Median : 2756
## Mean :49875 Mean : 50938.21 Mean : 63452.2 Mean : 9193
## 3rd Qu.:72134 3rd Qu.: 58255.50 3rd Qu.: 70341.2 3rd Qu.: 12513
## Max. :99929 Max. :223106.17 Max. :361842.3 Max. :113916
## NA's :7
Clean up the dataframe if needed and make sure the column names are
zip
, median
, mean
,
population
. Make sure the values in each column are
“numeric”. If they are factors or characters, you must change them to
quantitative data.
# Write your code below.
library(dplyr)
colnames(mydata) <- c("zip", "median", "mean", "population")
mydata <- mydata %>%
mutate(
zip = as.character(zip),
across(c(median, mean, population), ~ as.numeric(ifelse(. == ".", NA, .)))
)
# Show first few rows with cleaned columns
head(mydata)
## # A tibble: 6 × 4
## zip median mean population
## <chr> <dbl> <dbl> <dbl>
## 1 10001 71245. 123113. 17678
## 2 10002 30844. 46259. 70878
## 3 10003 89999. 139331. 53609
## 4 10004 110184. 156683. 1271
## 5 10005 115133. 163763. 1517
## 6 10006 111220 156776 972
# Check structure and types of columns
str(mydata)
## tibble [32,634 × 4] (S3: tbl_df/tbl/data.frame)
## $ zip : chr [1:32634] "10001" "10002" "10003" "10004" ...
## $ median : num [1:32634] 71245 30844 89999 110184 115133 ...
## $ mean : num [1:32634] 123113 46259 139331 156683 163763 ...
## $ population: num [1:32634] 17678 70878 53609 1271 1517 ...
# Show summary statistics for numeric columns
summary(mydata)
## zip median mean population
## Length:32634 Min. : 32.98 Min. : 53.6 Min. : 1
## Class :character 1st Qu.: 38462.00 1st Qu.: 48593.2 1st Qu.: 736
## Mode :character Median : 46503.32 Median : 56949.6 Median : 2756
## Mean : 50938.21 Mean : 63452.2 Mean : 9193
## 3rd Qu.: 58255.50 3rd Qu.: 70341.2 3rd Qu.: 12513
## Max. :223106.17 Max. :361842.3 Max. :113916
## NA's :7
zipcode
package (install the
zipcode
package first, which is an archived package). The
zipcode package can be installed by doing the following. Alternatively,
you can use more recent ‘zipcodeR’ package instead.install.packages("remotes")
library(remotes)
install_version("zipcode", "1.0")
library(zipcode)
data(zipcode)
to load a dataframe that contains
city
, state
, latitude
, and
longitude
for US zip codes.zipcode
dataframe with five variables and 44336 rows.# Write your code below.
# Step 1.3 - Install and load the zipcode package, then load zipcode data
library(remotes)
## Warning: package 'remotes' was built under R version 4.4.3
library(zipcode)
data(zipcode)
head(zipcode)
## zip city state latitude longitude
## 1 00210 Portsmouth NH 43.0059 -71.0132
## 2 00211 Portsmouth NH 43.0059 -71.0132
## 3 00212 Portsmouth NH 43.0059 -71.0132
## 4 00213 Portsmouth NH 43.0059 -71.0132
## 5 00214 Portsmouth NH 43.0059 -71.0132
## 6 00215 Portsmouth NH 43.0059 -71.0132
str(zipcode)
## 'data.frame': 44336 obs. of 5 variables:
## $ zip : chr "00210" "00211" "00212" "00213" ...
## $ city : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
## $ state : chr "NH" "NH" "NH" "NH" ...
## $ latitude : num 43 43 43 43 43 ...
## $ longitude: num -71 -71 -71 -71 -71 ...
ls()
## [1] "mydata" "zipcode"
Merge the zipcode information from the two dataframes (merge into one dataframe).
mydata
using the clean.zipcodes()
function, and save the values to
the zip
column of mydata
.mydata
and zipcode
by the common
column zip
and store the new dataframe as
dfNew
.merge()
function for this.# Write your code below.
# Step 1.4 - Clean zipcodes and merge mydata with zipcode
mydata$zip <- clean.zipcodes(mydata$zip)
dfNew <- merge(mydata, zipcode, by = "zip")
head(dfNew)
## zip median mean population city state latitude longitude
## 1 01001 56662.57 66687.75 16445 Agawam MA 42.07061 -72.62029
## 2 01002 49853.42 75062.63 28069 Amherst MA 42.37765 -72.50323
## 3 01003 28462.00 35121.00 8491 Amherst MA 42.36956 -72.63599
## 4 01005 75423.00 82442.00 4798 Barre MA 42.41209 -72.10443
## 5 01007 79076.35 85801.98 12962 Belchertown MA 42.27842 -72.41100
## 6 01008 63980.00 78391.00 1244 Blandford MA 42.17431 -72.94828
str(dfNew)
## 'data.frame': 32634 obs. of 8 variables:
## $ zip : chr "01001" "01002" "01003" "01005" ...
## $ median : num 56663 49853 28462 75423 79076 ...
## $ mean : num 66688 75063 35121 82442 85802 ...
## $ population: num 16445 28069 8491 4798 12962 ...
## $ city : chr "Agawam" "Amherst" "Amherst" "Barre" ...
## $ state : chr "MA" "MA" "MA" "MA" ...
## $ latitude : num 42.1 42.4 42.4 42.4 42.3 ...
## $ longitude : num -72.6 -72.5 -72.6 -72.1 -72.4 ...
nrow(dfNew)
## [1] 32634
nrow(mydata)
## [1] 32634
nrow(zipcode)
## [1] 44336
nrow(dfNew)
## [1] 32634
colnames(dfNew)
## [1] "zip" "median" "mean" "population" "city"
## [6] "state" "latitude" "longitude"
summary(dfNew$zip)
## Length Class Mode
## 32634 character character
summary(dfNew$city) # from zipcode package
## Length Class Mode
## 32634 character character
summary(dfNew$median) # from your data
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.98 38462.00 46503.32 50938.21 58255.50 223106.17
sum(is.na(dfNew$zip))
## [1] 0
sum(is.na(dfNew$city))
## [1] 0
Remove Hawaii and Alaska (just focus on the “lower 48” states).
HINT: You can use the which()
function we
learned from Intro to Data Science or you can use dplyr to filter the
proper rows (use of course the filter()
function in the
dplyr package).
mydata
has 32634 rows in it.)# Write your code below.
# Step 1.5 - Remove Alaska and Hawaii to focus on "lower 48"
dfNew <- dfNew %>% filter(!state %in% c("AK", "HI"))
any(dfNew$state %in% c("AK", "HI")) # Should be FALSE
## [1] FALSE
nrow_before <- nrow(dfNew) + sum(mydata$state %in% c("AK", "HI")) # approximate before count
## Warning: Unknown or uninitialised column: `state`.
nrow_after <- nrow(dfNew)
print(nrow_before) # Should be greater than nrow_after
## [1] 32321
print(nrow_after)
## [1] 32321
head(dfNew[dfNew$state %in% c("AK", "HI"), ]) # Should return zero rows or empty
## [1] zip median mean population city state latitude
## [8] longitude
## <0 rows> (or 0-length row.names)
Create a simpler dataframe (call it dfSimple
), with just
the average median income and the population for each state.
group_by()
and summarize()
from
“dplyr” to do this.# Write your code below.
# Step 2.1 - Create dfSimple with mean median income and population per state
dfSimple <- dfNew %>%
group_by(state) %>%
summarize(
median = mean(median, na.rm = TRUE),
population = sum(population, na.rm = TRUE)
)
head(dfSimple)
## # A tibble: 6 × 3
## state median population
## <chr> <dbl> <dbl>
## 1 AL 40550. 4770242
## 2 AR 36961. 2936699
## 3 AZ 48132. 6360679
## 4 CA 62629. 36927999
## 5 CO 56303. 4979279
## 6 CT 78520. 3548308
str(dfSimple)
## tibble [49 × 3] (S3: tbl_df/tbl/data.frame)
## $ state : chr [1:49] "AL" "AR" "AZ" "CA" ...
## $ median : num [1:49] 40550 36961 48132 62629 56303 ...
## $ population: num [1:49] 4770242 2936699 6360679 36927999 4979279 ...
summary(dfSimple)
## state median population
## Length:49 Min. :34210 Min. : 555216
## Class :character 1st Qu.:43875 1st Qu.: 1811426
## Mode :character Median :49342 Median : 4464564
## Mean :51645 Mean : 6080705
## 3rd Qu.:56303 3rd Qu.: 6645671
## Max. :82208 Max. :36927999
nrow(dfSimple)
## [1] 49
Add the state abbreviations and the state names as new columns (make sure the state names are all lower case).
state.name
and state.abb
datasets. This is the
code:
dfSimple$stateName <- state.name[match(dfSimple$state, state.abb)]
stateName
to lowercase and save the values in
the stateName
column.# Write your code below.
# Step 2.2 - Add state full names and convert to lower case
dfSimple$stateName <- tolower(state.name[match(dfSimple$state, state.abb)])
# Check that the new column stateName exists
"stateName" %in% colnames(dfSimple)
## [1] TRUE
# View first few rows to see lowercase state names with abbreviations
head(dfSimple[, c("state", "stateName")])
## # A tibble: 6 × 2
## state stateName
## <chr> <chr>
## 1 AL alabama
## 2 AR arkansas
## 3 AZ arizona
## 4 CA california
## 5 CO colorado
## 6 CT connecticut
# Check the structure to verify the type is character and content makes sense
str(dfSimple$stateName)
## chr [1:49] "alabama" "arkansas" "arizona" "california" "colorado" ...
# Check for any missing values in stateName (should be none if all matched correctly)
sum(is.na(dfSimple$stateName))
## [1] 1
Show the U.S. map, using color to represent the average median income of each state.
map_data()
function to read "state"
object and save the result as
us
.dfSimple
to create a map and set
stateName
as map_id
. (follow the course
content practice, written in the textbook and in the video).It should look like this (please do not forget to add the title of the map):
# Write your code below.
# Step 2.3 - Map of US with average median income per state
library(maps)
## Warning: package 'maps' was built under R version 4.4.3
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
us <- map_data("state")
library(ggplot2)
ggplot() +
geom_map(data = us, map = us,
aes(long, lat, map_id = region), fill = "white", color = "black") +
geom_map(data = dfSimple, map = us,
aes(fill = median, map_id = stateName)) +
scale_fill_continuous(name = "Avg Median Income", na.value = "grey50") +
ggtitle("Average Median Income by State") +
theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
Create a second map with color representing the population of the state. It should look like this:
# Write your code below.
# Step 2.4 - Map colored by population per state
ggplot() +
geom_map(data = us, map = us,
aes(long, lat, map_id = region), fill = "black", color = "white") +
geom_map(data = dfSimple, map = us,
aes(fill = population, map_id = stateName)) +
scale_fill_continuous(name = "Population", na.value = "grey50") +
ggtitle("Population by State") +
theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
Draw each zipcode on the map, where the color of the “dot” is based on the median income. To make the map look appealing, set the background of the map to black.
dfNew
to create a map and set
stateName
as the map IDexpand_limits
)geom_point
,
use color
attribute using Median
value).The graph should look like this:
# Write your code below.
# Step 3 - Map zipcodes colored by median income with black background
mapZip <- ggplot() +
geom_map(data = us, map = us,
aes(long, lat, map_id = region),
fill = "black", color = "white") +
geom_point(data = dfNew, aes(x = longitude, y = latitude, color = median), alpha = 0.5) +
scale_color_continuous(name = "Median Income", na.value = "grey50") +
expand_limits(x = us$long, y = us$lat) +
ggtitle("Median Income by Zip Code") +
theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
mapZip
Now generate a different map, one where we can easily see where there
are lots of zipcodes and where there are few (using the
stat_density2d()
function). We will name this as
mapD
.
It should look like this:
# Write your code below.
# Step 4 - Density map of zipcode distribution using stat_density2d()
mapD <- ggplot() +
geom_map(data = us, map = us,
aes(long, lat, map_id = region), fill = "black", color = "white") +
stat_density2d(data = dfNew,
aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
geom = "polygon", contour = TRUE) +
scale_fill_viridis_c() +
scale_alpha(range = c(0.1, 0.7), guide = FALSE) +
ggtitle("Zip Code Density") +
theme_minimal()
## Warning in geom_map(data = us, map = us, aes(long, lat, map_id = region), :
## Ignoring unknown aesthetics: x and y
mapD
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Repeat steps 3 and 4, but have the image/map of the Tampa Bay area.
Below I am giving out the code for this:
# Before using geocode function, you must create Google API key. Follow directions in this url:
https://setcompass.com/How-to-Get-Google-Maps-API-Key-Guide.htm
# Google map requires your credit card information to avoid excessive use of Google resources.It will not charge you money as long as you use this to do this homework. It does not use a lot of calls.
#register_google(key = "your key here", write = TRUE) #### please delete your key information before submitting the compiled file. You can either compile as a docx file and delete this line, or use other software to hide the key information.
# use geocode function to get latitude and longtitude of Tampa
latlon <- geocode("Tampa, fl")
# create the first zoomed map based on "mapZip", and plot a point representing Tampa
mapZipZoomed <- mapZip + geom_point(aes(x = latlon$lon, y = latlon$lat), color="darkred", size = 3)
# zoom into the region arount Tampa with 10 degrees latitude and longtitude fluctuation (+/- 10)
mapZipZoomed <- mapZipZoomed + xlim(latlon$lon-10, latlon$lon+10) + ylim(latlon$lat-10,latlon$lat+10) + coord_map()
# plot the map
mapZipZoomed
# Write your code below.
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.4.3
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
# Register API key
register_google(key = "AIzaSyCDNcQrWtZu3e0v4i6v_Du_YfKy3NFv-UQ", write = TRUE)
## ℹ Replacing old key (AIzaSyCDNcQrWtZu3e0v4i6v_Du_YfKy3NFv) with new key in D:\Documents/.Renviron
# Get the lat/lon for Tampa
latlon <- geocode("Tampa, FL")
## ℹ <https://maps.googleapis.com/maps/api/geocode/json?address=Tampa,+FL&key=xxx-UQ>
# Define bounding box or center with zoom level (zoom ranges 3-21; higher numbers zoom in further)
zoom_level <- 10 # adjust zoom level as needed (higher zoom = closer)
# Get the Google map centered on Tampa with zoom
map_tampa <- get_googlemap(center = c(lon = latlon$lon, lat = latlon$lat), zoom = zoom_level, maptype = "roadmap")
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=27.95169,-82.458753&zoom=10&size=640x640&scale=2&maptype=roadmap&key=xxx-UQ>
# Plot the map with Tampa point overlay
print(latlon)
## # A tibble: 1 × 2
## lon lat
## <dbl> <dbl>
## 1 -82.5 28.0
# Check if lon and lat exist and are not NA
if (is.na(latlon$lon) || is.na(latlon$lat)) {
stop("Geocoding failed: lon or lat is NA")
}
latlon <- data.frame(lon = -82.4572, lat = 27.9506) # Tampa coordinates
map_tampa <- get_googlemap(center = c(lon = latlon$lon, lat = latlon$lat), zoom = 10)
## ℹ <https://maps.googleapis.com/maps/api/staticmap?center=27.9506,-82.4572&zoom=10&size=640x640&scale=2&maptype=terrain&key=xxx-UQ>
ggmap(map_tampa) +
geom_point(aes(x = latlon$lon, y = latlon$lat), color = "darkred", size = 3)
## Warning in geom_point(aes(x = latlon$lon, y = latlon$lat), color = "darkred", : All aesthetics have length 1, but the data has 4 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.