I. Packages

At the beginning of the R script, we load all the necessary libraries (packages) for the assignment. R itself has many useful functions, but packages provide more specialized ones. In this assignment, we will make use of the libraries “{leaflet}”, “{magrittr}”, “{sf}”, “{geojsonio}” and “{graphics}” (“{graphics}” ships with R and needs no installation). Install the others with the “Install packages” tool in the “Tools” tab (see Figure 2).

During the package installation, if you are asked,

{“Do you want to install from sources the package which needs compilation? (Yes/no/cancel)”}

type {no}.

(If you still encounter errors when installing packages, check your R version and update it to the latest.)

Execute the three lines below to load these packages for this R script. You don’t need to load every package you just installed, because some of them only serve as dependencies of other packages.

library(leaflet)       
library(magrittr)    #provide the support for %>% operator
library(graphics)
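
As a minimal sketch of a scripted alternative to the “Install packages” tool, the lines below list which of the required packages are not installed yet. The names “needed”, “installed” and “missing_pkgs” are illustrative and not part of the assignment, and the commented-out install call assumes that binary builds are available for your platform (typically the case on Windows and macOS).

```r
# Packages this assignment asks you to install (graphics ships with R)
needed <- c("leaflet", "magrittr", "sf", "geojsonio")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]
print(missing_pkgs)

# Assumption: binary builds exist for your platform, so nothing is compiled from source
# install.packages(missing_pkgs, type = "binary")
```

An empty result means every package is already installed and the library() calls above should succeed.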

II. Review of vectors

Before we start with real data, let’s first review the “vector” data type.

Now you are asked to create a vector v1, with 6 elements 0, 0.2, 0.4, 0.6, 0.8 and 1.0.

(Hint: use the “c()” function.)

#Complete the below line
v1=c(0,0.2,0.4,0.6,0.8,1.0)
#
print(v1)

## [1] 0.0 0.2 0.4 0.6 0.8 1.0

Use an easy way to create a vector v2, each element of which is twice the corresponding element of v1, namely 0 $\times$ 2, 0.2 $\times$ 2, 0.4 $\times$ 2, 0.6 $\times$ 2, 0.8 $\times$ 2 and 1.0 $\times$ 2.

#Complete the below line
v2=2*v1      #arithmetic is applied to every element, so no c() is needed
#
print(v2)

## [1] 0.0 0.4 0.8 1.2 1.6 2.0

Then, you are given two numeric variables a and b.

a=2
b=3

You are asked to create a vector v3, the first element of which is a and second is b.

#Complete the below line
v3=c(a,b)
#
print(v3)

## [1] 2 3

Create a vector v4, the first part of which is v3 and the second part v2.

#Complete the below line
v4=c(v3,v2)
#

Print out v4 to see if its element values are as expected.

print(v4)

## [1] 2.0 3.0 0.0 0.4 0.8 1.2 1.6 2.0
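
The vector steps above can also be sketched with seq(), which generates evenly spaced values. This is only an illustration of an alternative to typing every element; the assignment itself asks for c().

```r
# seq() builds the same evenly spaced vector as c(0, 0.2, 0.4, 0.6, 0.8, 1.0)
v1 <- seq(0, 1, by = 0.2)

# Arithmetic on a vector is applied to every element
v2 <- 2 * v1

# c() flattens its arguments into a single vector
v4 <- c(c(2, 3), v2)

print(length(v4))
print(isTRUE(all.equal(v1, c(0, 0.2, 0.4, 0.6, 0.8, 1.0))))
```

all.equal() is used for the comparison because values produced by seq() can differ from typed decimals by tiny floating-point amounts.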

III. Data examination

Import US_data.xls into R, using the “Import Dataset” tool located on the top left side of RStudio (see Figure 3).

library(readxl)
US_data <- read_excel("US_data.xls")

After importing the Excel file, you will see a dataframe named US_data in the variable viewer. Use the {summary()} function to examine the dataframe “US_data”.

#Type your code in the blank
summary(US_data)
#

##  State           Average_Income     High_school_graduate
##  Length:52          Length:52          Min.   :  534707
##  Class :character   Class :character   1st Qu.: 1541774
##  Mode  :character   Mode  :character   Median : 3735320
##                                        Mean   : 5472931
##                                        3rd Qu.: 6293786
##                                        Max.   :31550249
##
##  Bachelor_degree    Advanced_degree     Population
##  Min.   :  148883   Min.   :  49821   Min.   :  579315
##  1st Qu.:  456662   1st Qu.: 180405   1st Qu.: 1791128
##  Median : 1145565   Median : 424102   Median : 4298482
##  Mean   : 1912830   Mean   : 724625   Mean   : 6328007
##  3rd Qu.: 2558175   3rd Qu.:1031555   3rd Qu.: 7113638
##  Max.   :12414509   Max.   :4586251   Max.   :39536653
##  NA's   :1          NA's   :1

Answer the questions:

{1. How many rows and columns are there in this dataframe? What are the column names?} Type your answer here: 52 rows and 6 columns. The column names are State, Average_Income, High_school_graduate, Bachelor_degree, Advanced_degree and Population.

{2. What are the data types of Average_Income and Population, respectively?} Type your answer here: Average_Income is character; Population is numeric.

{3. What is the median of the population for all the states?} Type your answer here:4298482
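
The three questions above can also be checked with dedicated functions rather than read off the summary() output. The sketch below uses a made-up data frame called “toy” as a stand-in for US_data; its values are hypothetical.

```r
# Hypothetical stand-in for US_data, with made-up values
toy <- data.frame(
  State = c("A", "B", "C"),
  Average_Income = c("$1,000", "$2,000", "$3,000"),
  Population = c(10, 20, 30),
  stringsAsFactors = FALSE
)

dim(toy)                 # number of rows, then number of columns
names(toy)               # column names
sapply(toy, class)       # data type of each column
median(toy$Population)   # median of one column
```

Running the same four calls on US_data should confirm the answers given above.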

IV. The $ operator

The ‘$’ operator can be used to extract a single column of a data frame, for instance,

US_data$High_school_graduate

##  [1]  4109411   681351  6033992  2547628 31550249  5085688  3225777   850354
##  [9]   619716 18235443  8906689  1299059  1536663 10484856  5853466  2878325
## [17]  2627636  3750427  3906733  1223690  5410646  6160117  8926230  5152783
## [25]  2455914  5404362   974857  1741508  2551331  1235371  7979000  1758154
## [33] 16991085  8814593   692695 10387820  3415920  3720212 11422539   913408
## [41]  4300859   790526  5742166 24879739  2828871   572517  7479027  6694791
## [49]  1543478  5273889   534707  2436139

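This command extracts only the “High_school_graduate” column of “US_data”, as a vector. A closely related form selects the same column by name, but returns it as a one-column data frame rather than a vector,

US_data['High_school_graduate']

Since the ‘High_school_graduate’ column is the third column of the dataframe, the same one-column data frame can also be selected by position,

US_data[3]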
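
A small check, on a made-up data frame called “toy”, of what each extraction form returns. The data and variable names are hypothetical; the point is that ‘$’ and double brackets give a vector, while single brackets give a one-column data frame.

```r
toy <- data.frame(State = c("A", "B", "C"), Population = c(10, 20, 30))

by_dollar <- toy$Population        # numeric vector
by_double <- toy[["Population"]]   # the same numeric vector
by_single <- toy["Population"]     # one-column data frame

print(identical(by_dollar, by_double))
print(is.data.frame(by_single))
print(mean(by_dollar))
```

Functions such as mean() and sum() expect a vector, which is why the examples in this section use the ‘$’ form.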

Now we can do some basic analysis on the ‘High_school_graduate’ data, for instance, getting the mean of it.

mean(US_data$High_school_graduate)

## [1] 5472931

Now it’s your turn.

{4. Calculate the total population of USA based on the “Population” column of “US_data”} and store the results in variable US_total_population.

(Hint: Use {sum()} function.)

#Complete the below line
US_total_population<-sum(US_data$Population)
#

Now you’ve obtained the total population, stored in the variable “US_total_population”. We can print it out:

print(US_total_population)

## [1] 329056355
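
The sum works here because the “Population” column has no missing values. Columns such as “Bachelor_degree” do contain an NA (see the summary() output), and for those sum() returns NA unless told to skip missing entries. A minimal sketch with made-up numbers:

```r
population <- c(100, 250, NA)   # made-up values; NA marks a missing entry

sum(population)                 # NA, because one value is unknown
sum(population, na.rm = TRUE)   # 350, missing values are skipped
```

The same na.rm argument is accepted by mean(), median() and quantile().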

V. Pie Chart Plot

Now let’s make a simple pie chart.

pie(x=US_data$Population,labels=US_data$State)

{5. Try to understand this line. What information is this pie chart giving?} Type your answer here: The pie chart shows each state’s share of the total US population; the size of each slice is proportional to that state’s population.
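
To see the numbers behind the slices, the share of each entry can be computed directly. The populations below are made up for illustration.

```r
population <- c(A = 10, B = 30, C = 60)   # made-up populations

# pie() draws each slice in proportion to its share of the total
share_percent <- population / sum(population) * 100
print(share_percent)
```

With these values the shares are 10, 30 and 60 percent, which is how pie(population) would divide the circle.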

VI. Adding a column to a dataframe

The “$” operator can also be used to add a column to a dataframe. For instance, the code below calculates the percentage of the population with a high school degree in each state and adds the result to US_data as a new column named “High_school_percent”.

US_data$High_school_percent<- US_data$High_school_graduate/US_data$Population*100

Now add two more columns, “Bachelor_percent” and “Advanced_percent”, which are the percentages of the population with a bachelor’s degree and with an advanced degree.

#Complete the below lines
US_data$Bachelor_percent<-US_data$Bachelor_degree/US_data$Population*100
US_data$Advanced_percent<-US_data$Advanced_degree/US_data$Population*100
#
print(US_data$Bachelor_percent)

##  [1] 23.49999 27.99992 27.50000 21.09997 31.40000 38.09999 37.59999 29.99993
##  [9] 54.59990 27.30000 28.80000 30.79995 25.89999 27.60000 24.10000 26.69997
## [17] 31.00000 22.30000 22.49998 29.00000 37.90000 40.49999 26.89999 33.70000
## [25] 20.69998 27.10000 29.49996 29.29999 22.99997 34.89997 36.79999 26.29998
## [33] 34.20000 28.39999 27.69989 26.09999 24.09999 30.80000 28.60000 31.89992
## [41] 25.80000 26.99991 24.90000 32.30000 31.10000 35.99992 36.30000 32.89999
## [49] 19.19997 27.80000 25.69984       NA

print(US_data$Advanced_percent)

##  [1]  8.699980 10.099960 10.199992  7.499969 11.599998 13.999990 16.599985
##  [8] 12.199942 31.299966  9.799999 10.699995 10.499966  8.199981  9.399999
## [15]  8.699998  8.499986 10.999982  9.199991  7.699986 10.299968 17.299990
## [22] 17.699986 10.499993 11.199984  7.699977 10.199996  9.499921  9.699981
## [29]  7.899997 12.999974 13.999998 11.499998 14.800000  9.899995  7.599885
## [36]  9.699999  7.999997 11.499994 11.199999 12.799925  9.299994  7.999968
## [43]  8.999992 12.399997 10.399980 14.299848 15.399999 11.999998  7.399977
## [50]  9.399993  8.599984        NA
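
The NA at position 52 in both outputs comes from the missing entries in “Bachelor_degree” and “Advanced_degree”: any calculation involving NA yields NA. A sketch, on a made-up data frame, of how to find which row is affected:

```r
toy <- data.frame(
  State = c("A", "B", "C"),
  Bachelor_degree = c(2, 5, NA),
  Population = c(10, 20, 30),
  stringsAsFactors = FALSE
)
toy$Bachelor_percent <- toy$Bachelor_degree / toy$Population * 100

# is.na() flags the missing entries, which then index the State column
missing_rows <- is.na(toy$Bachelor_percent)
print(toy$State[missing_rows])
```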

VII. Data type conversion

{6. Will you be able to calculate the average income for the whole US with the “Average_Income” column? Why or why not?} Type your answer here: Not from the “Average_Income” column alone; we must also take each state’s population (the “Population” column) into account.

You don’t need to calculate it at this time. First, we print the column out.

US_data$Average_Income

##  [1] "$44,765" "$73,355" "$51,492" "$41,995" "$64,500" "$63,909" "$71,346"
##  [8] "$61,255" "$75,628" "$49,426" "$51,244" "$73,486" "$48,275" "$59,588"
## [15] "$50,532" "$54,736" "$53,906" "$45,215" "$45,727" "$51,494" "$75,847"
## [22] "$70,628" "$51,084" "$63,488" "$40,593" "$50,238" "$49,509" "$54,996"
## [29] "$52,431" "$70,303" "$72,222" "$45,382" "$60,850" "$47,830" "$60,557"
## [36] "$51,075" "$48,568" "$54,148" "$55,702" "$58,073" "$47,238" "$53,017"
## [43] "$47,275" "$55,653" "$62,912" "$56,990" "$66,262" "$64,129" "$42,019"
## [50] "$55,638" "$60,214" "$18,626"

If we want the unit of income to be thousands of dollars, we can simply divide the income column by 1000,

US_data$Average_Income/1000

## Error in US_data$Average_Income/1000: non-numeric argument to binary operator

But R gives an error. Think about it and answer question 6. The division fails because the values are stored as character strings rather than as numbers.

Next, we convert the format of the “Average_Income” column.

Each “number” in the “Average_Income” column has a dollar sign ‘$’ in front of it and a comma “,” in the middle, so R recognizes it as “character” (“string”) rather than numeric. To convert the “characters” to numeric, we first need to remove the dollar signs and commas with the commands below.

#   *
US_data$Average_Income <- gsub('\\$', '', US_data$Average_Income)
US_data$Average_Income <- gsub(',', '', US_data$Average_Income)

For now, you don’t need to fully understand all of this code, but you can try to.

{7. Now, what is the data type of “Average_Income”?} (Hint: use the {class()} function to examine the “Average_Income” column.) Type your answer here: character

#Type your code in the blank below
class(US_data$Average_Income)

## [1] "character"

It seems that we need one more step to make the column “Average_Income” numeric.

{Before running the line below, make sure you have executed the two lines above that use the “gsub” function. Otherwise, as.numeric() turns every value that still contains ‘$’ or ‘,’ into NA, which ruins the “Average_Income” column, and you will need to re-run sections III to VII.}

US_data$Average_Income <- as.numeric(US_data$Average_Income)

{8. Now, examine the data type of “Average_Income”.}

#Type your code in the blank below
class(US_data$Average_Income)

## [1] "numeric"
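
The conversion steps above can be sketched on two values copied from the column. The single pattern "[$,]" is an alternative to the two gsub() calls used in this lab, not the lab’s own code: inside square brackets, ‘$’ and ‘,’ are matched literally.

```r
income <- c("$44,765", "$73,355")   # two values in the same format as Average_Income

# Converting too early: the "$" and "," make the strings unparseable
too_early <- suppressWarnings(as.numeric(income))
print(too_early)                    # NA NA

# "[$,]" matches either character, so one gsub() call removes both
cleaned <- as.numeric(gsub("[$,]", "", income))
print(cleaned)                      # 44765 73355
```

This also shows why the order matters: as.numeric() must come after the symbols are removed.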

VIII. Scatterplot

{9. As you did in the GIS lab, create a scatterplot showing the relationship between average income and educational level.}

(Hint: Since we don’t have a direct indicator of educational level, we can use the percentage of the population with a bachelor’s degree, namely the “Bachelor_percent” column, as the indicator.)

(Hint: we can make use of the plot function “plot(x, y)”, where x is the education level and y is the average income.)

#Type your code in the blank below
 plot(US_data$Bachelor_percent,US_data$Average_Income)

IX. Calculation

{10. Calculate the average income for the whole US.}

(Hint: it is not simply the average of the income column.)

(Hint: in the equation, you can make use of the variable “US_total_population” you created previously.)

#Complete the below line
Average_income_US=sum(US_data$Average_Income*US_data$Population)/US_total_population
#
print(Average_income_US)

## [1] 56424.95
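
The calculation above is a population-weighted mean: each state’s average income counts in proportion to its population. A minimal sketch with two made-up states shows how it differs from a plain mean, and that base R’s weighted.mean() gives the same result as the explicit formula.

```r
income <- c(40000, 60000)   # made-up average incomes of two states
population <- c(1, 3)       # made-up populations (in millions)

mean(income)                                   # 50000, ignores population
sum(income * population) / sum(population)     # 55000, population-weighted
weighted.mean(income, population)              # 55000, built-in equivalent
```

The weighted result is pulled toward 60000 because the second state has three times as many people.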

X. Map Visualization

Now we start to make the data visible on a map!

First, we plan to create a US map in which the states are colored according to their income levels.

Run the code below to import the geodata file “us-states.json” and store the geodata in the variable “states_geodata”.

#   *
states_geodata <- geojsonio::geojson_read("us-states.json", what = "sp")

If R gives the error “File does not exist”, set the working directory to the folder that contains the “us-states.json” file.
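
A short sketch of how to check the working directory before importing. The path in the commented-out setwd() call is a placeholder, not a real folder.

```r
getwd()                          # folder R currently reads files from
file.exists("us-states.json")    # TRUE only if the file is in that folder

# Hypothetical path: replace it with the folder that holds your lab files
# setwd("path/to/lab/folder")
```

If file.exists() returns FALSE, either move the file into the folder shown by getwd() or point setwd() at the file’s folder.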

If R gives an error like “cannot find package named ‘geojsonio’”, you have not installed the package yet. Install it now.

Then, we create a variable “income_map” to hold the income-level map.

#   *
income_map <- leaflet(states_geodata) %>%     #attach geodata to the map
  setView(-96, 37.8, 4) %>%                   #set the mapview range to US
  addTiles()
income_map                                    #show the map

If R cannot recognize “%>%”, you have not loaded the “magrittr” library for this script. Go to the beginning of the R script and run the code “{library(magrittr)}”.
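
For readers new to “%>%”: it passes the value on its left as the first argument of the function on its right, so a chain of calls reads from left to right. A minimal sketch with plain numbers, assuming “magrittr” is installed:

```r
library(magrittr)

nested <- sum(sqrt(c(1, 4, 9)))            # read from the inside out
piped  <- c(1, 4, 9) %>% sqrt() %>% sum()  # read from left to right

print(nested)   # 6
print(piped)    # 6
```

The map code in this section uses the same idea: the leaflet object is passed along to setView(), then to addTiles().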

So far you should see only a map of the US without any other information, because we have not attached any socioeconomic data to it.

Before we attach the income data, we first need to classify them. The average income of each state is a unique number, so without classification each state would get its own color, making the map messy. To make the map readable, we use fewer classes, say 8.

There are many ways to classify the data. Here, we use percentiles. We will use the function {quantile()} to classify the “Average_Income” column of US_data.

Quantile

Then, let’s look at the function {quantile()}.

{quantile(x,probs)} takes two main arguments. {x} is the data to be classified. {probs} is a vector containing the pre-set percentiles, written as fractions between 0 and 1.

To illustrate, we create a vector x with 20 elements, each drawn from a uniform distribution between 0 and 10.

x=runif(20,0,10)
x

##  [1] 0.05610275 2.96289078 3.95404715 9.45144048 3.84839898 7.24206201
##  [7] 7.96813684 8.44514035 9.13738237 0.12466709 2.51096238 3.32485602
## [13] 3.86331292 3.25117045 7.82163496 9.31171799 3.54241519 5.22597458
## [19] 5.14136473 2.39476279

If we want the 20th and 60th percentiles of x, we call the function,

quantile(x,c(0.2,0.6))

##      20%      60% 
## 2.872505 5.175209

If we want to divide x into 4 equal parts by percentile, we call

quantile(x,c(0.00,0.25,0.5,0.75,1.00))

##         0%        25%        50%        75%       100% 
## 0.05610275 3.17910054 3.90868004 7.85826043 9.45144048
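
As a sketch of how break points turn into classes, the code below uses a fixed vector instead of random numbers, so the result is reproducible. The use of cut() is an illustration chosen here and is not part of the lab; it shows why the 0% break point matters when the percentiles are used as class boundaries.

```r
x <- 1:9

# seq() generates the probabilities 0, 0.25, 0.5, 0.75, 1
breaks <- quantile(x, seq(0, 1, by = 0.25))
print(breaks)

# cut() assigns each value to the interval between two neighboring break points
classes <- cut(x, breaks = breaks, include.lowest = TRUE)
print(table(classes))

# Without the 0% break point, the smallest values belong to no class
no_floor <- cut(x, breaks = breaks[-1])
print(sum(is.na(no_floor)))   # 3
```

Four classes need five break points; in general, n classes need n + 1 break points, starting at 0% and ending at 100%.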

If we want to divide US_data$Average_Income into 8 equal parts by percentile, how should we write the code? Note that 8 classes need 9 break points, from the 0th percentile (the minimum) to the 100th percentile (the maximum).

#Replace the ???? with correct code
bins_income <- quantile(US_data$Average_Income, seq(0, 1, by = 1/8))
#
print(bins_income)

##       0%    12.5%      25%    37.5%      50%    62.5%      75%    87.5% 
## 18626.00 45511.38 49211.50 51275.00 54442.00 57937.62 63056.00 70506.12 
##     100% 
## 75847.00

Now the Average_Income has been classified into 8 categories and the quantile information has been stored in the variable “bins_income”, with which we can then assign colors to each of the 8 classes and attach the data to the map. If the 0% break point is left out, the states in the lowest class fall outside the color scale, and R warns that “Some values were outside the color scale and will be treated as NA”.

{11. Run the code below to generate the income level map.} You may try to understand this part if you are interested in it. It is not mandatory.

#   *
pal_income <- colorBin("YlOrRd",                  #use a pre-defined color ramp
                domain = US_data$Average_Income,  #the data used to assign colors 
                bins = bins_income)               #the class boundaries
#   *
income_map %>% addPolygons( #add Polygons to the maps with below information
  fillColor = ~pal_income(US_data$Average_Income),  #colored income data
  weight = 2,                     #weight of polygon (state) boundaries
  opacity = 1,                    #opacity of polygon (state) boundaries
  color = "white",                #color of polygon (state) boundaries
  dashArray = "3",                #line type of polygon (state) boundaries
  fillOpacity = 0.7               #Opacity of filled state colors
  )

{12.} (Bonus part, extra 5 points)

{Create a US map showing, for each state, the percentage of the population with a high school degree or higher.}

Some requirements:

The percentages should be divided into 5 classes.

The color of the state boundaries should be “orange”.

The weight of the state boundaries should be 3.

You just need to replace every ????? below with the correct code.

#Replace the ????? with correct code

edu_map <- leaflet(states_geodata) %>%        #attach geodata to the map
  setView(-96, 37.8, 4) %>%                   #set the map range to US
  addTiles()

bins_edu <- quantile(?????, ?????)

pal_edu <- colorBin("YlGnBu",                 #use a pre-defined color ramp
                domain = ?????,               #the data used to assign colors 
                bins = bins_edu)               

edu_map %>% addPolygons(
  fillColor = ~pal_edu(?????),
  weight = ?????,
  opacity = 1,
  color = ?????,
  dashArray = "3",
  fillOpacity = 0.7)
#

## Error: <text>:7:27: unexpected ','
## 6: 
## 7: bins_edu <- quantile(?????,
##                              ^

#The End

ECI016 R Lab Assignment 1