At the beginning of the R script, we load all the necessary libraries (packages) for the assignment. Base R already has many useful functions, but packages provide more. In this assignment, we will use the packages “{leaflet}”, “{magrittr}”, “{sf}”, “{geojsonio}” and “{graphics}” ({graphics} ships with R and does not need to be installed). Install the others with the “Install Packages” tool in the “Tools” tab (see Figure 2).
During the package installation, if you are asked,
{“Do you want to install from sources the package which needs compilation? (Yes/no/cancel)”}
you should type {no}.
(If you still encounter errors when installing packages, check your R version and update it to the latest.)
Execute the three lines below to load these packages in this R script. You don’t need to load every package you just installed, because some of them only support other packages.
library(leaflet)
library(magrittr) #provide the support for %>% operator
library(graphics)
Before we start with real data, let’s first review the “vector” data type.
Now you are asked to create a vector v1, with 6 elements 0, 0.2, 0.4, 0.6, 0.8 and 1.0.
(Hint: use the “c()” operator.)
#Complete the below line
v1=c(0,0.2,0.4,0.6,0.8,1.0)
#
print(v1)
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
Use an easy way to create a vector v2 in which each element is twice the corresponding element of v1, namely 0 \(\times\) 2, 0.2 \(\times\) 2, 0.4 \(\times\) 2, 0.6 \(\times\) 2, 0.8 \(\times\) 2 and 1.0 \(\times\) 2.
#Complete the below line
v2=2*v1
#
print(v2)
## [1] 0.0 0.4 0.8 1.2 1.6 2.0
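Arithmetic on an R vector is applied element by element, so no loop is needed to double every value. A minimal sketch of the idea; the names `v` and `doubled` are illustrative and not part of the assignment:

```r
# Build 0, 0.2, ..., 1.0 with seq() instead of typing each element
v <- seq(0, 1, by = 0.2)

# Arithmetic is element-wise: every element of v is multiplied by 2
doubled <- 2 * v

# 2 * v is already a vector, so wrapping it in c() is not required
print(length(doubled))
print(isTRUE(all.equal(doubled, c(0, 0.4, 0.8, 1.2, 1.6, 2.0))))
```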
Then, you are given two numeric variables a and b
a=2
b=3
You are asked to create a vector v3, the first element of which is a and second is b.
#Complete the below line
v3=c(a,b)
#
print(v3)
## [1] 2 3
Create a vector v4, the first part of which is v3 and the second part of which is v2.
#Complete the below line
v4=c(v3,v2)
#
Print out v4 to see if its element values are as expected.
print(v4)
## [1] 2.0 3.0 0.0 0.4 0.8 1.2 1.6 2.0
Import US_data.xls into R, using the “Import Dataset” tool located on the top left side of RStudio (see Figure 3).
library(readxl)
US_data <- read_excel("US_data.xls")
After importing the Excel file, you will see a dataframe named US_data in the variable viewer. Use the {summary()} function to examine the dataframe “US_data”.
#Type your code in the blank
summary(US_data)
##     State            Average_Income     High_school_graduate
##  Length:52          Length:52          Min.   :  534707
##  Class :character   Class :character   1st Qu.: 1541774
##  Mode  :character   Mode  :character   Median : 3735320
##                                        Mean   : 5472931
##                                        3rd Qu.: 6293786
##                                        Max.   :31550249
##
##  Bachelor_degree    Advanced_degree     Population
##  Min.   :  148883   Min.   :  49821   Min.   :  579315
##  1st Qu.:  456662   1st Qu.: 180405   1st Qu.: 1791128
##  Median : 1145565   Median : 424102   Median : 4298482
##  Mean   : 1912830   Mean   : 724625   Mean   : 6328007
##  3rd Qu.: 2558175   3rd Qu.:1031555   3rd Qu.: 7113638
##  Max.   :12414509   Max.   :4586251   Max.   :39536653
##  NA's   :1          NA's   :1
#
Answer the questions:
{1. How many rows and columns are there in this dataframe? What are the column names?} Type your answer here: 52 rows and 6 columns. The column names are State, Average_Income, High_school_graduate, Bachelor_degree, Advanced_degree and Population.
{2. What are the data types of Average_Income and Population, respectively?} Type your answer here: Average_Income is character; Population is numeric.
{3. What is the median of the population for all the states?} Type your answer here:4298482
The ‘$’ operator can be used to extract a single column of a data frame, for instance,
US_data$High_school_graduate
## [1] 4109411 681351 6033992 2547628 31550249 5085688 3225777 850354
## [9] 619716 18235443 8906689 1299059 1536663 10484856 5853466 2878325
## [17] 2627636 3750427 3906733 1223690 5410646 6160117 8926230 5152783
## [25] 2455914 5404362 974857 1741508 2551331 1235371 7979000 1758154
## [33] 16991085 8814593 692695 10387820 3415920 3720212 11422539 913408
## [41] 4300859 790526 5742166 24879739 2828871 572517 7479027 6694791
## [49] 1543478 5273889 534707 2436139
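This command extracts only the “High_school_graduate” column of “US_data” and returns it as a vector. The following selects the same values, but returns them as a one-column data frame rather than a vector,
US_data['High_school_graduate']
Since the ‘High_school_graduate’ column is the third column of the dataframe, the same one-column data frame can also be selected by position,
US_data[3]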
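The difference between these ways of selecting a column can be checked on a small example. This is a sketch on a made-up data frame; `df` and its column names are hypothetical and only stand in for US_data:

```r
# Toy data frame; the values are invented for this illustration
df <- data.frame(State = c("A", "B", "C"),
                 Population = c(100, 250, 50))

by_dollar <- df$Population        # numeric vector
by_double <- df[["Population"]]   # numeric vector, same result as `$`
by_single <- df["Population"]     # data frame with one column
by_index  <- df[2]                # data frame with one column, chosen by position

print(class(by_dollar))
print(class(by_single))
print(identical(by_dollar, by_double))
print(identical(by_single, by_index))
```

Functions such as mean() and sum() expect a vector, which is why the `$` form is used in the rest of this assignment.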
Now we can do some basic analysis on the ‘High_school_graduate’ data, for instance, getting the mean of it.
mean(US_data$High_school_graduate)
## [1] 5472931
Now it’s your turn.
{4. Calculate the total population of USA based on the “Population” column of “US_data”} and store the results in variable US_total_population.
(Hint: Use {sum()} function.)
#Complete the below line
US_total_population<-sum(US_data$Population)
#
Now you’ve obtained the total population, and it is stored in the variable “US_total_population”. We can print it out as follows.
print(US_total_population)
## [1] 329056355
Now let’s make a simple pie chart.
pie(x=US_data$Population,labels=US_data$State)
{5. Try to understand this line. What information is this pie chart giving?} Type your answer here: It shows each state’s share of the total US population: the larger a state’s population, the larger its slice of the pie.
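The size of each slice corresponds to that value's share of the total. A small sketch with invented numbers (the vector `pop` is hypothetical) makes the shares explicit before drawing them:

```r
# Hypothetical populations for three made-up states
pop <- c(A = 100, B = 250, C = 50)

# Share of the total, in percent; these are the proportions pie() draws
share <- pop / sum(pop) * 100
print(round(share, 1))

# Same chart type as in the assignment, using the names as slice labels
pie(x = pop, labels = names(pop))
```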
The “$” operator can also be used to add a column to a dataframe. For instance, the code below calculates the percentage of the population with a high school degree in each state and adds the result to US_data as a new column named “High_school_percent”.
US_data$High_school_percent<- US_data$High_school_graduate/US_data$Population*100
Now you are asked to add two more columns, “Bachelor_percent” and “Advanced_percent”, which are the percentages of the population with a bachelor’s degree and an advanced degree.
#Complete the below lines
US_data$Bachelor_percent<-US_data$Bachelor_degree/US_data$Population*100
US_data$Advanced_percent<-US_data$Advanced_degree/US_data$Population*100
#
print(US_data$Bachelor_percent)
## [1] 23.49999 27.99992 27.50000 21.09997 31.40000 38.09999 37.59999 29.99993
## [9] 54.59990 27.30000 28.80000 30.79995 25.89999 27.60000 24.10000 26.69997
## [17] 31.00000 22.30000 22.49998 29.00000 37.90000 40.49999 26.89999 33.70000
## [25] 20.69998 27.10000 29.49996 29.29999 22.99997 34.89997 36.79999 26.29998
## [33] 34.20000 28.39999 27.69989 26.09999 24.09999 30.80000 28.60000 31.89992
## [41] 25.80000 26.99991 24.90000 32.30000 31.10000 35.99992 36.30000 32.89999
## [49] 19.19997 27.80000 25.69984 NA
print(US_data$Advanced_percent)
## [1] 8.699980 10.099960 10.199992 7.499969 11.599998 13.999990 16.599985
## [8] 12.199942 31.299966 9.799999 10.699995 10.499966 8.199981 9.399999
## [15] 8.699998 8.499986 10.999982 9.199991 7.699986 10.299968 17.299990
## [22] 17.699986 10.499993 11.199984 7.699977 10.199996 9.499921 9.699981
## [29] 7.899997 12.999974 13.999998 11.499998 14.800000 9.899995 7.599885
## [36] 9.699999 7.999997 11.499994 11.199999 12.799925 9.299994 7.999968
## [43] 8.999992 12.399997 10.399980 14.299848 15.399999 11.999998 7.399977
## [50] 9.399993 8.599984 NA
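The last entry of both printed vectors is NA because the last row has no degree counts, and any arithmetic involving NA yields NA. A minimal sketch of how missing values propagate and how summary functions can skip them; the vector `counts` is made up for illustration:

```r
# Hypothetical counts; the last one is missing
counts <- c(10, 20, NA)
total  <- c(100, 100, 100)

# Element-wise arithmetic keeps the NA in the same position
ratio <- counts / total * 100
print(ratio)

# mean() returns NA unless missing values are explicitly removed
print(mean(ratio))
print(mean(ratio, na.rm = TRUE))
```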
{6. Will you be able to calculate the average income for the whole US with the “Average Income” column? Why or why not?} We would not be able to calculate it from the Average_Income column alone; we must also take the Population column into account.
You don’t need to calculate it at this time. First, we print the column out.
US_data$Average_Income
## [1] "$44,765" "$73,355" "$51,492" "$41,995" "$64,500" "$63,909" "$71,346"
## [8] "$61,255" "$75,628" "$49,426" "$51,244" "$73,486" "$48,275" "$59,588"
## [15] "$50,532" "$54,736" "$53,906" "$45,215" "$45,727" "$51,494" "$75,847"
## [22] "$70,628" "$51,084" "$63,488" "$40,593" "$50,238" "$49,509" "$54,996"
## [29] "$52,431" "$70,303" "$72,222" "$45,382" "$60,850" "$47,830" "$60,557"
## [36] "$51,075" "$48,568" "$54,148" "$55,702" "$58,073" "$47,238" "$53,017"
## [43] "$47,275" "$55,653" "$62,912" "$56,990" "$66,262" "$64,129" "$42,019"
## [50] "$55,638" "$60,214" "$18,626"
If we want the unit of income to be thousands of dollars, we can simply divide the income column by 1000,
US_data$Average_Income/1000
## Error in US_data$Average_Income/1000: non-numeric argument to binary operator
But R gives an error. Think about why, and answer question 6. Because the values are not numeric; they are read as characters.
Next, we convert the format of the “Average_Income” column.
We see that each “number” in the “Average_Income” column has a dollar sign ‘$’ in front of it and a comma “,” in the middle, so R recognizes it as “character” (“string”) rather than numeric. To convert all the “characters” to numeric, we first need to remove the dollar sign and the comma with the commands below.
# *
US_data$Average_Income <- gsub('\\$', '', US_data$Average_Income)
US_data$Average_Income <- gsub(',', '', US_data$Average_Income)
For now you don’t need to fully understand all of this code, but you can try to understand it.
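To see what the two gsub() calls do, here is a sketch on a two-element vector in the same format as the column. The name `income_text` is illustrative, and the single-call variant is an alternative shown for comparison, not the assignment's method:

```r
# Two strings in the same "$12,345" format as the Average_Income column
income_text <- c("$44,765", "$73,355")

# In a regular expression "$" means "end of string", so it is escaped as "\\$"
step1 <- gsub("\\$", "", income_text)
# The comma has no special meaning and can be matched directly
step2 <- gsub(",", "", step1)

# Alternative: the character class [$,] matches either symbol in one call
one_step <- gsub("[$,]", "", income_text)

print(step2)
print(identical(step2, one_step))
```

The result is still a character vector; only the symbols have been removed.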
{7. Now, what is the data type of “Average_Income”?} (Hint, use {class()} function to examine the “Average_Income” column) Type your answer here:Character
#Type your code in the blank below
class(US_data$Average_Income)
## [1] "character"
#
It seems that we need one more step to make the column “Average_Income” numeric.
{Before running the below line, make sure you have executed the above two lines with “gsub” function. Otherwise, you will ruin the “Average_Income” column and need to re-run the part from section III to VII. }
US_data$Average_Income <- as.numeric(US_data$Average_Income)
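The reason the order matters can be shown on a single value: as.numeric() turns any string it cannot parse into NA, so converting before the symbols are removed would wipe out the column. A sketch with illustrative names (`raw`, `cleaned`); the warning is suppressed here only to keep the output short:

```r
raw <- "$44,765"

# Converting the uncleaned string fails: the result is NA (R also warns)
bad <- suppressWarnings(as.numeric(raw))

# After removing "$" and ",", the conversion succeeds
cleaned <- gsub("[$,]", "", raw)
good <- as.numeric(cleaned)

print(bad)
print(good)
```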
{8. Now, examine the data type of “Average_Income”.}
#Type your code in the blank below
class(US_data$Average_Income)
## [1] "numeric"
#
{9. As you did in the GIS lab, create a scatterplot showing the relationship between average income and educational level.}
(Hint: Since we don’t have an indicator for educational level, we can use the percentage of the population with a bachelor’s degree, namely the “Bachelor_percent” column, as the indicator.)
(Hint: we can use the plot function “plot(x, y)”, where x can be the education level and y can be the average income.)
#Type your code in the blank below
plot(US_data$Bachelor_percent,US_data$Average_Income)
#
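A scatterplot is easier to read with axis labels, and the strength of the relationship can be summarised with a correlation coefficient. The sketch below uses invented numbers (`edu` and `income` are hypothetical, with one missing value to mimic the NA in Bachelor_percent); the label text is only a suggestion:

```r
# Hypothetical data: percent with a bachelor's degree and average income
edu    <- c(20, 25, 30, 35, NA)
income <- c(42000, 50000, 55000, 66000, 18000)

# plot() skips points with a missing coordinate
plot(edu, income,
     xlab = "Population with bachelor's degree (%)",
     ylab = "Average income ($)",
     main = "Income vs. education")

# Pearson correlation, using only the rows where both values are present
r <- cor(edu, income, use = "complete.obs")
print(r)
```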
{10. Calculate the average income for the whole US.}
(Hint: it is not just taking an average for the income column.)
(Hint: in the equation, you can use the variable “US_total_population” you created previously.)
#Complete the below line
Average_income_US=sum(US_data$Average_Income*US_data$Population)/US_total_population
#
print(Average_income_US)
## [1] 56424.95
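The calculation above is a population-weighted mean: each state's average income counts in proportion to the number of people it represents. A small sketch with two made-up states shows how this differs from a plain average, and that base R's weighted.mean() gives the same result as the manual formula:

```r
# Two hypothetical states: a small one and one three times larger
income <- c(40000, 60000)
pop    <- c(1000000, 3000000)

simple_mean      <- mean(income)                     # ignores population
weighted_manual  <- sum(income * pop) / sum(pop)     # formula used above
weighted_builtin <- weighted.mean(income, pop)       # same thing, built in

print(simple_mean)
print(weighted_manual)
print(weighted_builtin)
```

The weighted result is pulled toward the income of the larger state, which is why the simple mean of the income column would be misleading for the whole country.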
Now we start to make the data visible on a map!
First, we plan to create a US map in which the states are colored according to their income levels.
Run the code below to import the geodata file “us-states.json” and store the geodata in the variable “states_geodata”.
# *
states_geodata <- geojsonio::geojson_read("us-states.json", what = "sp")
If R gives the error “File does not exist”, you need to set the working directory to the folder that contains the “us-states.json” file.
If R gives an error such as “cannot find package named ‘geojsonio’”, it’s because you have not installed the package yet. Install it now.
Then, we create a variable “income_map” to hold the income level map.
# *
income_map <- leaflet(states_geodata) %>% #attach geodata to the map
setView(-96, 37.8, 4) %>% #set the mapview range to US
addTiles()
income_map #show the map
If R cannot recognize “%>%”, it’s because you forgot to load the “magrittr” library for this script. Go to the beginning of the R script and run the code “{library(magrittr)}”.
So far you should only see a map of the US without any other information, because we have not attached any socioeconomic data to it.
Before we attach the income data, we first need to classify it. The average income of each state is a unique number, so without classification each state would get its own color, making the map messy. To make the map readable, we need fewer classes, say 8.
There are many ways to classify data. Here, we use percentiles. We will use the function {quantile()} to classify the “Average_Income” column of US_data.
First, let’s look at the function {quantile()}.
{quantile(x,probs)} takes two main arguments. {x} is the data to be classified. {probs} is a vector of the desired percentiles, expressed as fractions between 0 and 1.
To illustrate, we create a vector x with 20 elements, each drawn from a uniform distribution between 0 and 10.
x=runif(20,0,10)
x
## [1] 0.05610275 2.96289078 3.95404715 9.45144048 3.84839898 7.24206201
## [7] 7.96813684 8.44514035 9.13738237 0.12466709 2.51096238 3.32485602
## [13] 3.86331292 3.25117045 7.82163496 9.31171799 3.54241519 5.22597458
## [19] 5.14136473 2.39476279
If we want to get the 20th and 60th percentiles of x, we call the function,
quantile(x,c(0.2,0.6))
## 20% 60%
## 2.872505 5.175209
If we want to divide x into 4 equal parts with respect to percentiles, we call
quantile(x,c(0.00,0.25,0.5,0.75,1.00))
## 0% 25% 50% 75% 100%
## 0.05610275 3.17910054 3.90868004 7.85826043 9.45144048
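Note that dividing data into 4 parts takes 5 break points, from the 0% point (the minimum) to the 100% point (the maximum). The sketch below uses a fixed vector instead of random numbers so the result is reproducible, and uses cut() to show which class each value lands in; `values`, `breaks` and `classes` are illustrative names:

```r
# Fixed data so the result is the same on every run
values <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Five break points (0%, 25%, 50%, 75%, 100%) define four classes
breaks <- quantile(values, probs = seq(0, 1, by = 0.25))

# Assign each value to a class; include.lowest keeps the minimum in class 1
classes <- cut(values, breaks = breaks, include.lowest = TRUE, labels = FALSE)

print(breaks)
print(table(classes))
```

Leaving out the 0% break point would drop the lowest class, and the smallest values would fall outside every bin.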
If we want to divide US_data$Average_Income into 8 equal parts with respect to percentiles, how should we write the code? (Eight parts need nine break points, from 0% to 100%.)
#Replace the ???? with correct code
bins_income <- quantile(US_data$Average_Income, seq(0, 1, by = 1/8))
#
print(bins_income)
## 0% 12.5% 25% 37.5% 50% 62.5% 75% 87.5% 100%
## 18626.00 45511.38 49211.50 51275.00 54442.00 57937.62 63056.00 70506.12 75847.00
Now the Average_Income has been classified into 8 categories and the quantile information has been stored in the variable “bins_income”, with which we can then assign a color to each of the 8 classes and attach the data to the map.
{11. Run the code below to generate the income level map.} You may try to understand this part if you are interested in it. It is not mandatory.
# *
pal_income <- colorBin("YlOrRd", #use a pre-defined color ramp
domain = US_data$Average_Income, #the data used to assign colors
bins = bins_income) #the class break points
# *
income_map %>% addPolygons( #add Polygons to the maps with below information
fillColor = ~pal_income(US_data$Average_Income), #colored income data
weight = 2, #weight of polygon (state) boundaries
opacity = 1, #opacity of polygon (state) boundaries
color = "white", #color of polygon (state) boundaries
dashArray = "3", #line type of polygon (state) boundaries
fillOpacity = 0.7 #Opacity of filled state colors
)
{{12.}} (Bonus part, extra 5 points)
{ Create a US map showing, for each state, the percentage of the population with a high school or higher degree.}
Some requirements:
The percentages of population should be divided into 5 classes.
The color of the state boundaries should be “orange”.
The weight of the state boundaries should be 3.
You just need to replace each ????? below with the correct code.
#Replace the ????? with correct code
edu_map <- leaflet(states_geodata) %>% #attach geodata to the map
setView(-96, 37.8, 4) %>% #set the map range to US
addTiles()
bins_edu <- quantile(?????, ?????)
pal_edu <- colorBin("YlGnBu", #use a pre-defined color ramp
domain = ?????, #the data used to assign colors
bins = bins_edu)
edu_map %>% addPolygons(
fillColor = ~pal_edu(?????),
weight = ?????,
opacity = 1,
color = ?????,
dashArray = "3",
fillOpacity = 0.7)
#
## Error: <text>:7:27: unexpected ','
## 6:
## 7: bins_edu <- quantile(?????,
## ^
#The End