Use the dataset named “MedianZIP.xlsx” that has median income by zip
code (an excel file).
Read the MedianZIP.xlsx, use the proper function, and save
it as mydata. Save the .xlsx file in this project’s data
folder.
Install the readxl and tidyverse package and load them. Read the
MedianZIP.xlsx, use the proper function, and save it as
mydata. If you need other packages for this lab, you may
need to install them too. Try what works. There are so many ways to do
this.
# Write your code below.
# Install and loading packages
options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages("readxl")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(readxl)
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#file path for MedianZip
file_path <- "/Users/auz/Downloads/Week4_HW_zip/MedianZIP.xlsx"
# Read the data using read_excel function from readxl package
mydata <- read_excel(file_path)
## Warning: Expecting numeric in C5070 / R5070C3: got '.'
## Warning: Expecting numeric in C25733 / R25733C3: got '.'
## Warning: Expecting numeric in C25734 / R25734C3: got '.'
## Warning: Expecting numeric in C25735 / R25735C3: got '.'
## Warning: Expecting numeric in C25811 / R25811C3: got '.'
## Warning: Expecting numeric in C29646 / R29646C3: got '.'
## Warning: Expecting numeric in C29981 / R29981C3: got '.'
# Display a summary of the data
summary(mydata)
## Zip Median Mean Pop
## Min. : 1001 Min. : 32.98 Min. : 53.6 Min. : 1
## 1st Qu.:27301 1st Qu.: 38462.00 1st Qu.: 48593.2 1st Qu.: 736
## Median :49875 Median : 46503.32 Median : 56949.6 Median : 2756
## Mean :49875 Mean : 50938.21 Mean : 63452.2 Mean : 9193
## 3rd Qu.:72134 3rd Qu.: 58255.50 3rd Qu.: 70341.2 3rd Qu.: 12513
## Max. :99929 Max. :223106.17 Max. :361842.3 Max. :113916
## NA's :7
Clean up the dataframe if needed and make sure the column names are
zip, median, mean,
population. Make sure the values in each column are
“numeric”. If they are factors or characters, you must change them to
quantitative data.
# Write your code below.
# Specify the file path
file_path <- "/Users/auz/Downloads/Week4_HW_zip/MedianZIP.xlsx"
# Read the data using read_excel function from readxl package
mydata <- readxl::read_excel(file_path, na = c(".", ""))
# Display a summary of the original data
summary(mydata)
## Zip Median Mean Pop
## Min. : 1001 Min. : 32.98 Min. : 53.6 Min. : 1
## 1st Qu.:27301 1st Qu.: 38462.00 1st Qu.: 48593.2 1st Qu.: 736
## Median :49875 Median : 46503.32 Median : 56949.6 Median : 2756
## Mean :49875 Mean : 50938.21 Mean : 63452.2 Mean : 9193
## 3rd Qu.:72134 3rd Qu.: 58255.50 3rd Qu.: 70341.2 3rd Qu.: 12513
## Max. :99929 Max. :223106.17 Max. :361842.3 Max. :113916
## NA's :7
# Create a new data frame with cleaned and standardized zipcodes
cleaned_data <- data.frame(
zip = as.character(mydata$Zip), # Assuming 'Zip' is the correct column name
median = as.numeric(mydata$Median),
mean = as.numeric(mydata$Mean),
population = as.numeric(mydata$Pop)
)
# Display the cleaned data
head(cleaned_data)
## zip median mean population
## 1 10001 71244.61 123112.78 17678
## 2 10002 30843.96 46258.61 70878
## 3 10003 89998.53 139331.00 53609
## 4 10004 110183.69 156682.76 1271
## 5 10005 115133.29 163762.66 1517
## 6 10006 111220.00 156776.00 972
# Display a summary of the cleaned data
summary(cleaned_data)
## zip median mean population
## Length:32634 Min. : 32.98 Min. : 53.6 Min. : 1
## Class :character 1st Qu.: 38462.00 1st Qu.: 48593.2 1st Qu.: 736
## Mode :character Median : 46503.32 Median : 56949.6 Median : 2756
## Mean : 50938.21 Mean : 63452.2 Mean : 9193
## 3rd Qu.: 58255.50 3rd Qu.: 70341.2 3rd Qu.: 12513
## Max. :223106.17 Max. :361842.3 Max. :113916
## NA's :7
# Assign the cleaned data back to mydata
mydata <- cleaned_data
zipcode package (install the
zipcode package first, which is an archived package). The
zipcode package can be installed by doing the following. Alternatively,
you can use more recent ‘zipcodeR’ package instead.install.packages("remotes")
library(remotes)
install_version("zipcode", "1.0")
library(zipcode)
data(zipcode) to load a dataframe that contains
city, state, latitude, and
longitude for US zip codes.zipcode
dataframe with five variables and 44336 rows.# Write your code below.
# Install and load the remotes package
install.packages("remotes")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(remotes)
install_version("zipcode", "1.0")
## Downloading package from url: https://cran.rstudio.com/src/contrib/Archive/zipcode/zipcode_1.0.tar.gz
library(zipcode)
# Load the data
data(zipcode)
Merge the zipcode information from the two dataframes (merge into one dataframe).
mydata
using the clean.zipcodes() function, and save the values to
the zip column of mydata.mydata and zipcode by the common
column zip and store the new dataframe as
dfNew.merge() function for this.# Write your code below.
# Step 1.4 - Merge the data
# Clean up and standardize zipcodes in mydata
mydata$zip <- clean.zipcodes(mydata$zip)
# Merge mydata and zipcode by the common column 'zip'
dfNew <- merge(mydata, zipcode, by = 'zip')
# Display the merged dataframe
head(dfNew)
## zip median mean population city state latitude longitude
## 1 01001 56662.57 66687.75 16445 Agawam MA 42.07061 -72.62029
## 2 01002 49853.42 75062.63 28069 Amherst MA 42.37765 -72.50323
## 3 01003 28462.00 35121.00 8491 Amherst MA 42.36956 -72.63599
## 4 01005 75423.00 82442.00 4798 Barre MA 42.41209 -72.10443
## 5 01007 79076.35 85801.98 12962 Belchertown MA 42.27842 -72.41100
## 6 01008 63980.00 78391.00 1244 Blandford MA 42.17431 -72.94828
Remove Hawaii and Alaska (just focus on the “lower 48” states).
HINT: You can use the which() function we
learned from Intro to Data Science or you can use dplyr to filter the
proper rows (use of course the filter() function in the
dplyr package).
mydata has 32634 rows in it.)# Write your code below.
dfLower48 <- filter(dfNew, state != "HI" & state != "AK")
# Check the number of rows after filtering
nrow(dfLower48)
## [1] 32321
Create a simpler dataframe (call it dfSimple), with just
the average median income and the population for each state.
group_by() and summarize() from
“dplyr” to do this.# Write your code below.
dfSimple <- dfLower48 %>%
group_by(state) %>%
summarize(avg_median_income = mean(median, na.rm = TRUE),
total_population = sum(population, na.rm = TRUE))
print(dfSimple)
## # A tibble: 49 × 3
## state avg_median_income total_population
## <chr> <dbl> <dbl>
## 1 AL 40550. 4770242
## 2 AR 36961. 2936699
## 3 AZ 48132. 6360679
## 4 CA 62629. 36927999
## 5 CO 56303. 4979279
## 6 CT 78520. 3548308
## 7 DC 66834. 579471
## 8 DE 64299. 892487
## 9 FL 50188. 18593035
## 10 GA 44564. 9603179
## # ℹ 39 more rows
Add the state abbreviations and the state names as new columns (make sure the state names are all lower case).
state.name and state.abb datasets. This is the
code:
dfSimple$stateName <- state.name[match(dfSimple$state, state.abb)]stateName to lowercase and save the values in
the stateName column.# Write your code below.
# Add the state name
dfSimple$stateName <- tolower(state.name[match(dfSimple$state, state.abb)])
# Convert stateName to lowercase
dfSimple$stateName <- tolower(dfSimple$stateName)
Show the U.S. map, using color to represent the average median income of each state.
map_data()
function to read "state" object and save the result as
us.dfSimple to create a map and set
stateName as map_id. (follow the course
content practice, written in the textbook and in the video).It should look like this (please do not forget to add the title of the map):
# Write your code below.
install.packages("ggplot2")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
install.packages("ggmap")
##
## The downloaded binary packages are in
## /var/folders/jz/p8s05vwx5ql1kc7g34pxkmz00000gn/T//RtmpNS0tEX/downloaded_packages
library(ggplot2)
library(ggmap)
## Warning: package 'ggmap' was built under R version 4.2.3
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
## Stadia Maps' Terms of Service: ]8;;https://stadiamaps.com/terms-of-service/<https://stadiamaps.com/terms-of-service/>]8;;
## OpenStreetMap's Tile Usage Policy: ]8;;https://operations.osmfoundation.org/policies/tiles/<https://operations.osmfoundation.org/policies/tiles/>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
us <- map_data("state")
# Create a map using ggplot
map <- ggplot(dfSimple, aes(map_id = stateName)) +
geom_map(map = us, fill = "white", color = "black") +
expand_limits(x = us$long, y = us$lat) +
coord_map() +
ggtitle("Average Median Income by State")
# Display the map
print(map)
Create a second map with color representing the population of the state. It should look like this:
# Write your code below.
Draw each zipcode on the map, where the color of the “dot” is based on the median income. To make the map look appealing, set the background of the map to black.
dfNew to create a map and set
stateName as the map IDexpand_limits)geom_point,
use color attribute using Median value).The graph should look like this:
# Write your code below.
Now generate a different map, one where we can easily see where there
are lots of zipcodes and where there are few (using the
stat_density2d() function). We will name this as
mapD.
It should look like this:
# Write your code below.
Repeat steps 3 and 4, but have the image/map of the Tampa Bay area.
Below I am giving out the code for this:
# Before using geocode function, you must create Google API key. Follow directions in this url:
https://setcompass.com/How-to-Get-Google-Maps-API-Key-Guide.htm
# Google map requires your credit card information to avoid excessive use of Google resources.It will not charge you money as long as you use this to do this homework. It does not use a lot of calls.
#register_google(key = "your key here", write = TRUE) #### please delete your key information before submitting the compiled file. You can either compile as a docx file and delete this line, or use other software to hide the key information.
# use geocode function to get latitude and longtitude of Tampa
latlon <- geocode("Tampa, fl")
# create the first zoomed map based on "mapZip", and plot a point representing Tampa
mapZipZoomed <- mapZip + geom_point(aes(x = latlon$lon, y = latlon$lat), color="darkred", size = 3)
# zoom into the region arount Tampa with 10 degrees latitude and longtitude fluctuation (+/- 10)
mapZipZoomed <- mapZipZoomed + xlim(latlon$lon-10, latlon$lon+10) + ylim(latlon$lat-10,latlon$lat+10) + coord_map()
# plot the map
mapZipZoomed
# Write your code below.