5122 Visual Analytics Assignment @ DSBA UNC Charlotte

Part 1: Installation

Task 1

Answer the following questions about RMarkdown from the short lessons

Details

Who did you collaborate with: Alone
Approximately how much time did you spend on this problem set: 2-3 hours.
What, if anything, gave you the most trouble: Language used in Code 4 was not clear. Knitting.

Instructions

For this problem set, you’ll have three possible things to do.

“### Question #” / “xxxxx” = Write your response to each question by replacing the “xxxxx”. 1-2 sentences is fine.
“### Code #” = Write R code in the chunk corresponding to the instructions.

For this problem set, there are 15 questions and 10 code parts. If you get one wrong, it will be -3 points.

Additional points will be deducted for problems with the formatting as outlined by the problem set 1 instructions (e.g., do not provide .html output, put data into data folder, etc.).

Question 1

What three (3) characters start and end the YAML information for an RMarkdown file?

Answer 1

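Three dashes (---) start and end the YAML header. An R Markdown file contains three kinds of content: 1- Code chunks to run: R code chunks surrounded by ```s. 2- Text to display: text mixed with simple text formatting. 3- YAML metadata to guide the R Markdown build process: a YAML header surrounded by ---s.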
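As a minimal sketch of the delimiter rule, the base R snippet below writes a tiny .Rmd file to a temporary location and finds its `---` lines. The file contents and the names `rmd_lines`, `delim_idx`, and `yaml_body` are illustrative assumptions, not part of the assignment.

```r
# Hypothetical minimal .Rmd content; the title and output values are placeholders.
rmd_lines <- c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "Some text."
)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# The YAML header sits between the first two lines that are exactly '---'.
delim_idx <- which(readLines(rmd_path) == "---")
yaml_body <- rmd_lines[(delim_idx[1] + 1):(delim_idx[2] - 1)]
print(yaml_body)
```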

Question 2

What does the code chunk parameter echo = FALSE do?

Answer 2

‘echo = FALSE’ prevents the code, but not the results, from appearing in the finished file. This is a useful way to embed figures.
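The option is normally written in a chunk header, but the same default can be set for a whole document from a setup chunk. The sketch below assumes the knitr package is installed; it only sets and reads back the option and does not knit anything.

```r
library(knitr)

# Set a document-wide default: hide code in later chunks, keep their results.
opts_chunk$set(echo = FALSE)

# Read the current default back to confirm it.
echo_default <- opts_chunk$get("echo")
echo_default
```

An individual chunk can still override this default by setting echo = TRUE in its own header.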

Question 3

How do you change the output type for an RMarkdown file?

Answer 3

Set the output_format argument of render() to render the .Rmd file into any of R Markdown’s supported formats. For example, the call below renders 1-example.Rmd to a Microsoft Word document.

library(rmarkdown)

render("1-example.Rmd", output_format = "word_document")

Task 2: Creating baseline plot

Code 1

Load the tidyverse package.

Code 1 Answer

# install.packages('tidyverse',repos="http://cran.us.r-project.org")
library(tidyverse)
library(ggplot2)

Question 4

What packages are loaded when you call tidyverse?

Answer 4

tidyverse_packages()

##  [1] "broom"      "cli"        "crayon"     "dbplyr"     "dplyr"     
##  [6] "forcats"    "ggplot2"    "haven"      "hms"        "httr"      
## [11] "jsonlite"   "lubridate"  "magrittr"   "modelr"     "pillar"    
## [16] "purrr"      "readr"      "readxl"     "reprex"     "rlang"     
## [21] "rstudioapi" "rvest"      "stringr"    "tibble"     "tidyr"     
## [26] "xml2"       "tidyverse"

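Calling library(tidyverse) attaches the eight core packages (versions at the time of knitting): ggplot2 3.3.2, purrr 0.3.4, tibble 3.0.3, dplyr 1.0.2, tidyr 1.1.2, stringr 1.4.0, readr 1.3.1, forcats 0.5.0. The tidyverse_packages() call above lists every package in the tidyverse, including those that are installed but not attached.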

Code 2

Read in the corrupt.csv file and assign it to corrupt.

Code 2 Answer

corrupt<-read.csv(file='data/corrupt.csv')
head(corrupt)

##       country                  region year cpi   hdi
## 1     Denmark Europe and Central Asia 2015  91 0.925
## 2 New Zealand            Asia Pacific 2015  91 0.915
## 3     Finland Europe and Central Asia 2015  90 0.895
## 4      Sweden Europe and Central Asia 2015  89 0.913
## 5 Switzerland Europe and Central Asia 2015  86 0.939
## 6      Norway Europe and Central Asia 2015  88 0.949

Code 3

Run the glimpse() function on the data to explore the column formats.

Code 3 Answer

glimpse(corrupt)

## Rows: 704
## Columns: 5
## $ country <chr> "Denmark", "New Zealand", "Finland", "Sweden", "Switzerland...
## $ region  <chr> "Europe and Central Asia", "Asia Pacific", "Europe and Cent...
## $ year    <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,...
## $ cpi     <int> 91, 91, 90, 89, 86, 88, 85, 84, 83, 81, 85, 81, 79, 79, 77,...
## $ hdi     <dbl> 0.925, 0.915, 0.895, 0.913, 0.939, 0.949, 0.925, 0.924, 0.9...

Question 5

How many rows (observations) and columns (variables) does the data frame include?

Answer 5

Way 5_1

dim(corrupt)

## [1] 704   5

Way 5_2

Rows (observation): 704 Columns (variables):5

Way 5_3

nrow(corrupt)

## [1] 704

ncol(corrupt)

## [1] 5

Way 5_4

summary(corrupt)

##    country             region               year           cpi       
##  Length:704         Length:704         Min.   :2012   Min.   : 8.00  
##  Class :character   Class :character   1st Qu.:2013   1st Qu.:28.00  
##  Mode  :character   Mode  :character   Median :2014   Median :38.00  
##                                        Mean   :2014   Mean   :42.88  
##                                        3rd Qu.:2014   3rd Qu.:55.00  
##                                        Max.   :2015   Max.   :92.00  
##                                                       NA's   :20     
##       hdi        
##  Min.   :0.3410  
##  1st Qu.:0.5507  
##  Median :0.7320  
##  Mean   :0.6947  
##  3rd Qu.:0.8263  
##  Max.   :0.9490  
##  NA's   :84

However, it’s not clear if there are duplicated records by year (i.e., this is panel data (record and time oriented)).
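One way to check for duplicated records directly is to count rows per country-year pair. The sketch below uses a small made-up data frame (`toy`) rather than corrupt.csv, so the values are purely illustrative.

```r
library(dplyr)

# Toy panel data (hypothetical values): "B" has two rows for 2015.
toy <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  year    = c(2014, 2015, 2014, 2015, 2015),
  stringsAsFactors = FALSE
)

# Any country-year pair that appears more than once is a duplicated record.
dupes <- toy %>%
  count(country, year) %>%
  filter(n > 1)
dupes
```

An empty result would mean each country appears at most once per year.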

Code 4

Run the count() function on corrupt and use year as the 2nd parameter. This will count how many records by each unique category in year (that is, each year)

Code 4 Answer

Way 4_1

corrupt%>%
  group_by(year)%>%
  summarise(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##    year count
##   <int> <int>
## 1  2012   176
## 2  2013   176
## 3  2014   176
## 4  2015   176

Way 4_2

corrupt%>%as_tibble()%>%count(year)

## # A tibble: 4 x 2
##    year     n
##   <int> <int>
## 1  2012   176
## 2  2013   176
## 3  2014   176
## 4  2015   176
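The two approaches above give the same counts because count() is shorthand for grouping and then summarising with n(). A small sketch on made-up data (the `toy` data frame is an assumption for illustration):

```r
library(dplyr)

# Hypothetical data: 2 rows for 2012, 1 for 2013, 3 for 2014.
toy <- data.frame(year = c(2012, 2012, 2013, 2014, 2014, 2014))

by_count <- toy %>% count(year)
by_summarise <- toy %>% group_by(year) %>% summarise(n = n())

# Both columns of counts should match row for row.
by_count$n
by_summarise$n
```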

Question 6

How many different years does the dataset include?

Answer 6

There are 4 different years (2012, 2013, 2014, 2015).

unique(corrupt$year)

## [1] 2015 2014 2013 2012

For simplicity, let’s only keep 2015 records.

corrupt <- corrupt %>%filter(year == 2015) %>%
na.omit()
corrupt

##                      country                       region year cpi   hdi
## 1                    Denmark      Europe and Central Asia 2015  91 0.925
## 2                New Zealand                 Asia Pacific 2015  91 0.915
## 3                    Finland      Europe and Central Asia 2015  90 0.895
## 4                     Sweden      Europe and Central Asia 2015  89 0.913
## 5                Switzerland      Europe and Central Asia 2015  86 0.939
## 6                     Norway      Europe and Central Asia 2015  88 0.949
## 7                  Singapore                 Asia Pacific 2015  85 0.925
## 8                Netherlands      Europe and Central Asia 2015  84 0.924
## 9                     Canada                     Americas 2015  83 0.920
## 10                   Germany      Europe and Central Asia 2015  81 0.926
## 11                Luxembourg      Europe and Central Asia 2015  85 0.898
## 12            United Kingdom      Europe and Central Asia 2015  81 0.910
## 13                 Australia                 Asia Pacific 2015  79 0.939
## 14                   Iceland      Europe and Central Asia 2015  79 0.921
## 15                   Belgium      Europe and Central Asia 2015  77 0.896
## 17                   Austria      Europe and Central Asia 2015  76 0.893
## 18             United States                     Americas 2015  76 0.920
## 19                   Ireland      Europe and Central Asia 2015  75 0.923
## 20                     Japan                 Asia Pacific 2015  75 0.903
## 21                   Uruguay                     Americas 2015  74 0.795
## 22                   Estonia      Europe and Central Asia 2015  70 0.865
## 23                    France      Europe and Central Asia 2015  70 0.897
## 25                     Chile                     Americas 2015  70 0.847
## 26      United Arab Emirates Middle East and North Africa 2015  70 0.840
## 27                    Bhutan                 Asia Pacific 2015  65 0.607
## 28                    Israel Middle East and North Africa 2015  61 0.899
## 29                    Poland      Europe and Central Asia 2015  63 0.855
## 30                  Portugal      Europe and Central Asia 2015  64 0.843
## 32                     Qatar Middle East and North Africa 2015  71 0.856
## 33                  Slovenia      Europe and Central Asia 2015  60 0.890
## 35                  Botswana           Sub Saharan Africa 2015  63 0.698
## 40                 Lithuania      Europe and Central Asia 2015  59 0.848
## 42                Costa Rica                     Americas 2015  55 0.776
## 43                     Spain      Europe and Central Asia 2015  58 0.884
## 44                   Georgia      Europe and Central Asia 2015  52 0.769
## 45                    Latvia      Europe and Central Asia 2015  56 0.830
## 47                    Cyprus      Europe and Central Asia 2015  61 0.856
## 48            Czech Republic      Europe and Central Asia 2015  56 0.878
## 49                     Malta      Europe and Central Asia 2015  60 0.856
## 50                 Mauritius           Sub Saharan Africa 2015  53 0.781
## 51                    Rwanda           Sub Saharan Africa 2015  54 0.498
## 53                   Namibia           Sub Saharan Africa 2015  53 0.640
## 54                  Slovakia      Europe and Central Asia 2015  51 0.845
## 55                   Croatia      Europe and Central Asia 2015  51 0.827
## 56                  Malaysia                 Asia Pacific 2015  50 0.789
## 57                   Hungary      Europe and Central Asia 2015  51 0.836
## 58                    Jordan Middle East and North Africa 2015  53 0.742
## 59                   Romania      Europe and Central Asia 2015  46 0.802
## 60                      Cuba                     Americas 2015  47 0.775
## 61                     Italy      Europe and Central Asia 2015  44 0.887
## 62     Sao Tome and Principe           Sub Saharan Africa 2015  42 0.574
## 63              Saudi Arabia Middle East and North Africa 2015  52 0.847
## 64                Montenegro      Europe and Central Asia 2015  44 0.807
## 65                      Oman Middle East and North Africa 2015  45 0.796
## 66                   Senegal           Sub Saharan Africa 2015  44 0.494
## 67              South Africa           Sub Saharan Africa 2015  44 0.666
## 68                  Suriname                     Americas 2015  36 0.725
## 69                    Greece      Europe and Central Asia 2015  46 0.866
## 70                   Bahrain Middle East and North Africa 2015  51 0.824
## 71                     Ghana           Sub Saharan Africa 2015  47 0.579
## 72              Burkina Faso           Sub Saharan Africa 2015  38 0.402
## 73                    Serbia      Europe and Central Asia 2015  40 0.776
## 75                  Bulgaria      Europe and Central Asia 2015  41 0.794
## 76                    Kuwait Middle East and North Africa 2015  49 0.800
## 77                   Tunisia Middle East and North Africa 2015  38 0.725
## 78                    Turkey      Europe and Central Asia 2015  42 0.767
## 79                   Belarus      Europe and Central Asia 2015  32 0.796
## 80                    Brazil                     Americas 2015  38 0.754
## 81                     China                 Asia Pacific 2015  37 0.738
## 82                     India                 Asia Pacific 2015  38 0.624
## 83                   Albania      Europe and Central Asia 2015  36 0.764
## 84    Bosnia and Herzegovina      Europe and Central Asia 2015  38 0.750
## 85                   Jamaica                     Americas 2015  41 0.730
## 86                   Lesotho           Sub Saharan Africa 2015  44 0.497
## 87                  Mongolia                 Asia Pacific 2015  39 0.735
## 88                    Panama                     Americas 2015  39 0.788
## 89                    Zambia           Sub Saharan Africa 2015  38 0.579
## 90                  Colombia                     Americas 2015  37 0.727
## 91                 Indonesia                 Asia Pacific 2015  36 0.689
## 92                   Liberia           Sub Saharan Africa 2015  37 0.427
## 93                   Morocco Middle East and North Africa 2015  36 0.647
## 95                 Argentina                     Americas 2015  32 0.827
## 96                     Benin           Sub Saharan Africa 2015  37 0.485
## 97               El Salvador                     Americas 2015  39 0.680
## 100                Sri Lanka                 Asia Pacific 2015  37 0.766
## 101                    Gabon           Sub Saharan Africa 2015  34 0.697
## 102                    Niger           Sub Saharan Africa 2015  34 0.353
## 103                     Peru                     Americas 2015  36 0.740
## 104              Philippines                 Asia Pacific 2015  35 0.682
## 105                 Thailand                 Asia Pacific 2015  38 0.740
## 106              Timor-Leste                 Asia Pacific 2015  28 0.606
## 107      Trinidad and Tobago                     Americas 2015  39 0.780
## 108                  Algeria Middle East and North Africa 2015  36 0.745
## 110                    Egypt Middle East and North Africa 2015  36 0.691
## 111                 Ethiopia           Sub Saharan Africa 2015  33 0.448
## 112                   Guyana                     Americas 2015  29 0.638
## 113                  Armenia      Europe and Central Asia 2015  35 0.743
## 116                     Mali           Sub Saharan Africa 2015  35 0.442
## 117                 Pakistan                 Asia Pacific 2015  30 0.550
## 119                     Togo           Sub Saharan Africa 2015  32 0.487
## 120       Dominican Republic                     Americas 2015  33 0.722
## 121                  Ecuador                     Americas 2015  32 0.739
## 122                   Malawi           Sub Saharan Africa 2015  31 0.476
## 123               Azerbaijan      Europe and Central Asia 2015  29 0.759
## 124                 Djibouti           Sub Saharan Africa 2015  34 0.473
## 125                 Honduras                     Americas 2015  31 0.625
## 127                   Mexico                     Americas 2015  31 0.762
## 129                 Paraguay                     Americas 2015  27 0.693
## 130             Sierra Leone           Sub Saharan Africa 2015  29 0.420
## 132               Kazakhstan      Europe and Central Asia 2015  28 0.794
## 133                    Nepal                 Asia Pacific 2015  27 0.558
## 135                  Ukraine      Europe and Central Asia 2015  27 0.743
## 136                Guatemala                     Americas 2015  28 0.640
## 137               Kyrgyzstan      Europe and Central Asia 2015  28 0.664
## 138                  Lebanon Middle East and North Africa 2015  28 0.763
## 139                  Myanmar                 Asia Pacific 2015  22 0.556
## 140                  Nigeria           Sub Saharan Africa 2015  26 0.527
## 141         Papua New Guinea                 Asia Pacific 2015  25 0.516
## 142                   Guinea           Sub Saharan Africa 2015  25 0.414
## 143               Mauritania Middle East and North Africa 2015  31 0.513
## 144               Mozambique           Sub Saharan Africa 2015  31 0.418
## 145               Bangladesh                 Asia Pacific 2015  25 0.579
## 146                 Cameroon           Sub Saharan Africa 2015  27 0.518
## 147                   Gambia           Sub Saharan Africa 2015  28 0.452
## 148                    Kenya           Sub Saharan Africa 2015  25 0.555
## 149               Madagascar           Sub Saharan Africa 2015  28 0.512
## 150                Nicaragua                     Americas 2015  27 0.645
## 151               Tajikistan      Europe and Central Asia 2015  26 0.627
## 152                   Uganda           Sub Saharan Africa 2015  25 0.493
## 153                  Comoros           Sub Saharan Africa 2015  26 0.498
## 154             Turkmenistan      Europe and Central Asia 2015  18 0.692
## 155                 Zimbabwe           Sub Saharan Africa 2015  21 0.516
## 156                 Cambodia                 Asia Pacific 2015  21 0.563
## 158               Uzbekistan      Europe and Central Asia 2015  19 0.701
## 159                  Burundi           Sub Saharan Africa 2015  21 0.404
## 160 Central African Republic           Sub Saharan Africa 2015  24 0.352
## 161                     Chad           Sub Saharan Africa 2015  22 0.396
## 162                    Haiti                     Americas 2015  17 0.493
## 164                   Angola           Sub Saharan Africa 2015  15 0.533
## 165                  Eritrea           Sub Saharan Africa 2015  18 0.420
## 166                     Iraq Middle East and North Africa 2015  16 0.649
## 168            Guinea-Bissau           Sub Saharan Africa 2015  17 0.424
## 169              Afghanistan                 Asia Pacific 2015  11 0.479
## 170                    Libya Middle East and North Africa 2015  16 0.716
## 171                    Sudan Middle East and North Africa 2015  12 0.490
## 172                    Yemen Middle East and North Africa 2015  18 0.482
## 175              South Sudan           Sub Saharan Africa 2015  15 0.418
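The gaps in the row numbers printed above (for example, there is no row 16) come from na.omit(), which drops incomplete rows but keeps the original row names of the rows that remain. A base R sketch with made-up values:

```r
# Hypothetical data: rows 2 and 3 each have one missing value.
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  cpi = c(90, NA, 70, 60),
  hdi = c(0.9, 0.8, NA, 0.6)
)

complete <- na.omit(toy)

rownames(complete)                     # row names of the surviving rows
n_dropped <- nrow(toy) - nrow(complete)  # how many rows were removed
n_dropped
```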

Let’s revise our existing region field. This will help us later on.

corrupt <- corrupt %>%
mutate(region = case_when(
region == "Middle East and North Africa" ~ "Middle East\nand North Africa",
region == "Europe and Central Asia" ~ "Europe and\nCentral Asia",
region == "Sub Saharan Africa" ~ "Sub-Saharan\nAfrica",
TRUE ~ region))
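In the chunk above, case_when() checks its conditions in order and the final TRUE ~ region line acts as a fallback that leaves every other region unchanged. A short sketch on a made-up character vector (the name `relabelled` is an assumption):

```r
library(dplyr)

region <- c("Americas", "Sub Saharan Africa", "Europe and Central Asia")

# Only the matching value is rewritten; everything else falls through to TRUE.
relabelled <- case_when(
  region == "Sub Saharan Africa" ~ "Sub-Saharan\nAfrica",
  TRUE ~ region
)
relabelled
```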

Let’s now see how many countries we have for each region.

Code 5

Using dplyr and piping (%>%), count the number of countries by region and assign it to the dataframe region_count. After running it, print it to the console by simply writing the name of the data frame.

Code 5 Answer

region_count<-corrupt%>%
  group_by(region)%>%
  summarise(count=n())

## `summarise()` ungrouping output (override with `.groups` argument)

region_count

## # A tibble: 5 x 2
##   region                          count
##   <chr>                           <int>
## 1 "Americas"                         24
## 2 "Asia Pacific"                     21
## 3 "Europe and\nCentral Asia"         46
## 4 "Middle East\nand North Africa"    18
## 5 "Sub-Saharan\nAfrica"              38

Question 7

How many total countries are in the “Asia Pacific” region?

Answer 7

Based on the above solution, it is 21.

Code 6

Create a scatterplot with the dataframe corrupt in which cpi is on the x axis, hdi is on the y axis, and the color of the points is region:

Code 6 Answer

ggplot(data = corrupt)+
  geom_point(mapping = aes(x=cpi, y=hdi, colour=region))

Question 8

What are three problems with this graph (or ways you could improve this graph)?

Answer 8

1- The values of cpi and hdi are rounded, so the points appear on a grid and some points overlap each other. This problem is known as overplotting. We can avoid this gridding by setting the position adjustment to "jitter". position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

ggplot(data = corrupt)+
geom_point(mapping = aes(x=cpi, y=hdi, colour=region),position='jitter')

2- We can also use the same variable (region) for both shape and color.

ggplot(data = corrupt)+
geom_point(mapping = aes(x=cpi, y=hdi, colour=region, shape=region),position='jitter')

3- We can use the smooth geom to fit a smooth line to each region's data.

ggplot(data = corrupt)+
geom_smooth (mapping = aes(x=cpi, y=hdi, color=region))

4- We can represent it using facet wrap.

ggplot(corrupt) +
  aes(x = cpi, 
      y = hdi,
      color = region) +
  geom_line(
    aes(group = region),
    color = "grey75"
  ) +
  geom_point(size = 0.25) +
  geom_smooth() + 
  scale_x_continuous(breaks = 
    seq(0, 100, 15)
  ) +  
  facet_wrap(~ region) +
  guides(color = "none")

5- We can use a world map.

Now, let’s modify our points.

First, let’s reshape each point.

Code 7

Within your geom_point function, add in the following fixed parameters:

size to 2.5
alpha to 0.5
shape to 21

hint: since these three values are fixed, should they be inside or outside the aesthetics (aes()) function?

Code 7 Answer

All three values are fixed, so size, alpha, and shape all go outside aes().

ggplot(data=corrupt)+
  geom_point(mapping = aes(x=cpi,y=hdi, color=region), size=2.5, alpha=0.5, shape=21)
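To see what a fixed parameter does, the sketch below builds a plot on a small made-up data frame and inspects the values ggplot2 actually draws with layer_data(). Arguments placed outside aes() are applied as constants to every point. The `toy` data and the `built` name are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical data with the same column names as corrupt.
toy <- data.frame(
  cpi = c(20, 50, 80),
  hdi = c(0.4, 0.7, 0.9),
  region = c("A", "B", "C")
)

# size, alpha, and shape are fixed values, so they sit outside aes().
p_fixed <- ggplot(toy) +
  geom_point(aes(x = cpi, y = hdi, color = region),
             size = 2.5, alpha = 0.5, shape = 21)

# layer_data() returns the computed aesthetics for the first layer.
built <- layer_data(p_fixed)
built[, c("size", "alpha", "shape")]
```

Values placed inside aes() are instead treated as data to be mapped through a scale, which is why they also produce legend entries.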

Question 9

What does the alpha parameter do?

Answer 9

Most geoms have an “alpha” parameter. Legal alpha values are any numbers from 0 (transparent) to 1 (opaque). The default alpha value is usually 1. To set the alpha to a constant value, use the alpha geom parameter; e.g., geom_point(data=d, mapping=aes(x=x, y=y), alpha=0.5) sets the alpha of all points in the layer to 0.5.

The plot is too transparent. The issue is the parameter color encodes the color of the border, not the color of the point.

That’s where we’ll need the fill parameter.
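For shape 21, colour controls the border and fill controls the interior. The sketch below, on made-up data, compares the computed fill values with and without a fill mapping; it assumes geom_point()'s default fill is NA (empty interior), which is worth confirming in your installed ggplot2 version.

```r
library(ggplot2)

# Hypothetical data with the same column names as corrupt.
toy <- data.frame(
  cpi = c(20, 50, 80),
  hdi = c(0.4, 0.7, 0.9),
  region = c("A", "B", "C")
)

# Only the border colour is mapped: the interior keeps the default fill.
p_border <- ggplot(toy, aes(x = cpi, y = hdi)) +
  geom_point(aes(color = region), shape = 21)

# Border and interior are both mapped to region.
p_filled <- ggplot(toy, aes(x = cpi, y = hdi)) +
  geom_point(aes(color = region, fill = region), shape = 21)

fill_border <- layer_data(p_border)$fill
fill_filled <- layer_data(p_filled)$fill
fill_border
fill_filled
```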

Code 8

Put these two parameters explicitly in the aes() function of the geom_point():

color = region
fill = region

Also, make sure to remove any mention of color or fill in the aes() of your main ggplot() function.

Code 8 Answer

ggplot(corrupt,aes(x=cpi,y=hdi))+
  geom_point(mapping=aes(color=region,fill=region),size=2.5,alpha=0.5,shape=21)

Last, let’s temporarily save this graph as an object g. We can use the same <- (gets arrow) assignment operator. This will enable us to view the object or we can use it to build additional layers (Task 2).

Code 9

Assign the ggplot from the previous part to g and then run g on its own within the chunk.

Code 9 Answer

g<-ggplot(corrupt,aes(x=cpi,y=hdi))+
  geom_point(mapping=aes(color=region,fill=region),size=2.5,alpha=0.5,shape=21)
g

Part 2: Re-designing

In this part, you’ll add additional layers to our plot to re-design it.

This part is much more complicated, so your job will be easier:

Remove the eval=F parameter from each chunk so that each chunk runs when you knit your output.
Answer questions on interpreting what’s going on.

For this, we’ll use the same g object you created in the last chunk and slowly add more layers to the plot.

Before starting, we’ll need two packages: cowplot and colorspace. You can install colorspace from CRAN (remember how to?). For cowplot, you need the most recent version which is on GitHub.

Installing packages from GitHub is relatively straight-forward. But you need an additional package: devtools. You can then run the line below to install it.

Code 10

Install cowplot and colorspace and call these libraries. Also, remove the eval=F parameter from each chunk so that each chunk runs when you knit your output. (hint: you can do this for all parts via Edit > Replace and Find or CTRL + F)

Code 10 Answer

# install.packages('colorspace',repos="http://cran.us.r-project.org")
library(colorspace)

# install.packages('devtools')
library(devtools)

# devtools::install_github("wilkelab/cowplot")

library(cowplot)

Question 10

What do the parameters warning=F and message=F do within the code chunk?

Answer 10

message = FALSE prevents messages that are generated by code from appearing in the finished file.

warning = FALSE prevents warnings that are generated by code from appearing in the finished file.
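Messages and warnings are two separate kinds of condition in R, which is why each has its own chunk option. The base R sketch below raises one of each and records which handler fires; the function `noisy()` and the vector `seen` are hypothetical names used only for illustration.

```r
# A function that emits one message and one warning before returning.
noisy <- function() {
  message("a message")   # the kind of output message = FALSE hides
  warning("a warning")   # the kind of output warning = FALSE hides
  "result"
}

seen <- character(0)

# Catch each condition type separately, note it, and let the code continue.
out <- withCallingHandlers(
  noisy(),
  message = function(m) {
    seen <<- c(seen, "message")
    invokeRestart("muffleMessage")
  },
  warning = function(w) {
    seen <<- c(seen, "warning")
    invokeRestart("muffleWarning")
  }
)
seen
out
```

In both cases the code still runs and returns its result; only the extra console text is suppressed.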

Import unique theme and font size

Modifying themes is very common in ggplot. A range of packages, such as ggthemes, can change plot themes.

For this plot, we’ll use a theme built within the cowplot package that is a minimal background with a horizontal grid.

g <- g +
  cowplot::theme_minimal_hgrid(12, rel_small = 1) 

g

Question 11

What does the cowplot:: pre-fix for theme_minimal_hgrid() mean? When would it be necessary?

Answer 11

theme_minimal_hgrid() is cowplot's minimal horizontal-grid theme, which draws only horizontal grid lines and no axis lines.

The cowplot:: prefix tells R to take the function from the cowplot package's namespace. Installing a package does not load it into memory, so the prefix is necessary when the package has not been attached with library(cowplot), or when another attached package defines a function with the same name. If you are going to use cowplot functions frequently, it is easier to load the package than to preface every call with double colons, as in cowplot::theme_minimal_hgrid().
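The name-conflict case can be shown with base R alone. In the sketch below, a local function deliberately masks median(), and the stats:: prefix still reaches the original. The masking function is a made-up example.

```r
# Mask a function from the stats package with a local definition of the same name.
median <- function(x) "masked"

masked_result <- median(c(1, 2, 3))        # calls the local definition
stats_result <- stats::median(c(1, 2, 3))  # the prefix reaches the stats version

masked_result
stats_result

# Remove the local definition so median() behaves normally again.
rm(median)
```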

Modify color scheme

Next, let’s modify the color scheme. Colors can be represented by hex colors.

Sometimes, color palettes come in as R packages (e.g., RColorBrewer). However, for this plot we’ll manually load up the colors.

# Okabe Ito colors
region_cols <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#999999")

g <- g +
  scale_fill_manual(values = region_cols) 

g
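A hex color is a compact way of writing red, green, and blue intensities as three two-digit hexadecimal numbers. As a quick illustration using base R's grDevices functions, the first color in region_cols decodes as follows:

```r
# Decode the hex string into red, green, and blue values on a 0-255 scale.
orange_rgb <- col2rgb("#E69F00")
orange_rgb[, 1]

# Going the other way: build the hex string from the three intensities.
orange_hex <- rgb(230, 159, 0, maxColorValue = 255)
orange_hex
```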

Alternative Modify color scheme

Alternative solution:

# remotes::install_github("clauswilke/colorblindr")
library(colorblindr)

# Okabe Ito colors via colorblindr
g<-g+scale_fill_OkabeIto()
g

Darken colors by 30%

We can also darken the color scheme automatically through colorspace’s darken() function.

g<-g+
  scale_color_manual(values=colorspace::darken(region_cols,0.3))
g

Add smoothing line.

Let’s now overlay a basic regression, using the geom_smooth() function.

For this, we’ll make the function a log transformation of x.

g<-g+
  geom_smooth(
    aes(color='y~log(x)',fill='y~log(x)'),
    method = 'lm', formula = y~log(x),se=FALSE,fullrange=TRUE)
g

Question 12

What would be the difference in the plot if our smoothed function was y~x instead of y~log(x)?

Answer 12

In many situations, the relationship between x and y is non-linear. To simplify the underlying model, we can transform x, y, or both so that the relationship becomes more linear. Common transformations include the logarithmic and the reciprocal; including higher-order terms in x may also help to linearize the relationship. With y~x, the fitted line would be a straight line on the plot instead of the curve produced by y~log(x).

g1<-g+
  geom_smooth(
    aes(color='y~log(x)',fill='y~log(x)'),
    method = 'lm', formula = y~x,se=FALSE,fullrange=TRUE)
g1
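The two formulas can also be compared numerically. The sketch below fits both models with lm() to simulated data that follow a logarithmic trend by construction; the data are made up and are not the corrupt data, so only the direction of the comparison is meaningful.

```r
set.seed(1)

# Simulated data: y depends on log(x) plus a little noise (an assumption).
x <- seq(10, 90, by = 5)
y <- 0.2 * log(x) + rnorm(length(x), sd = 0.01)

fit_lin <- lm(y ~ x)        # straight-line fit
fit_log <- lm(y ~ log(x))   # fit that is linear in log(x)

r2_lin <- summary(fit_lin)$r.squared
r2_log <- summary(fit_log)$r.squared
c(linear = r2_lin, log = r2_log)
```

When the underlying relationship flattens out at high x, the log model should explain more of the variance than the straight line.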

Set x and y scales, move legend on top.

Let’s now modify our scales, add scale labels, and modify the legend.

g<-g+
  scale_x_continuous(
    name='Corruption Perception Index, 2015 (100=least corrupt)',
    limits = c(10,95),
    breaks = c(20,40,60,80,100),
    expand=c(0,0))+
  scale_y_continuous(
    name='Human Development Index, 2015\n(1.0=most developed)',
    limits = c(0.3,1.05),
    breaks=c(0.2,0.4,0.6,0.8,1.0),
    expand=c(0,0))+
  theme(legend.position = 'top',
        legend.justification = 'right',
        legend.text = element_text(size=9),
        legend.box.spacing = unit(0,'pt'))+
guides(fill=guide_legend(
    nrow = 1,
    override.aes = list(
      linetype=c(rep(0,5),1),
      shape=c(rep(21,5),NA))))
  
g

Highlight select countries

Last, let’s add labels to highlight the countries.

We can use the ggrepel package that includes the geom_text_repel() function that makes sure not to overlap labels.

# install.packages("ggrepel")
library(ggrepel)

# don't assign this to g
# if you do, then simply recreate g by running the "Run All Chunks Above" button

g2<-g+
  geom_text_repel(
    aes(label=country),
    color='black',
    size=9/.pt,
    point.padding=0.1,
    box.padding=.6,
    min.segment.length=0,
    seed=7654)
g2

Obviously, this is too busy. We have too many labels.

Let’s instead create a vector of countries we want to plot. We can then add in a new column that has the country name only if we want to plot it and nothing ("") otherwise.

country_highlight <- c("Germany", "Norway", "United States", "Greece", "Singapore", "Rwanda", "Russia", "Venezuela", "Sudan", "Iraq", "Ghana", "Niger", "Chad", "Kuwait", "Qatar", "Myanmar", "Nepal", "Chile", "Argentina", "Japan", "China")

corrupt<-corrupt%>%
  mutate(
    label=if_else (country %in% country_highlight, country, ""))
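The labelling step can be checked on a tiny example. The sketch below applies the same if_else() pattern to a made-up data frame; `toy` and `highlight` are hypothetical names.

```r
library(dplyr)

toy <- data.frame(
  country = c("Norway", "Peru", "Chad"),
  stringsAsFactors = FALSE
)
highlight <- c("Norway", "Chad")

# Keep the name for highlighted countries, use an empty string for the rest.
toy <- toy %>%
  mutate(label = if_else(country %in% highlight, country, ""))
toy$label
```

Rows with an empty label are still plotted as points; they just get no text.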

# wow: %+%
# https://stackoverflow.com/questions/29336964/changing-the-dataset-of-a-ggplot-object

g <- g %+% 
  corrupt +
  geom_text_repel(
    aes(label = label),
    color = "black",
    size = 9/.pt, # font size 9 pt
    point.padding = 0.1,
    box.padding = .6,
    min.segment.length = 0,
    seed = 7654) 

g

Question 13

What do you think the %+% operator does (see the StackOverflow link)? Why is it necessary in this context?

Answer 13

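The %+% operator replaces the default data frame of an existing ggplot object while keeping its layers, scales, and theme. It is necessary here because g was built from the earlier version of corrupt, which has no label column; swapping in the updated data frame lets geom_text_repel() find label. (Other packages define %+% differently, for example to concatenate character vectors.)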

Save as a pdf file

ggsave('corrupt.pdf', plot=g, width=8, height=5)

Better size for corrupt.pdf

ggsave("corrupt.pdf", plot = g, width = 10, height = 5)

You now have this plot saved as a PDF. Setting the width and height explicitly will make your life much easier if you need to reproduce this plot (very likely).
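A minimal, self-contained sketch of saving a plot with explicit dimensions is shown below. It uses a made-up plot and writes to a temporary directory, so the file name and data are assumptions; width and height are in inches by default.

```r
library(ggplot2)

# Hypothetical plot to save.
toy <- data.frame(x = 1:3, y = c(2, 4, 3))
p <- ggplot(toy, aes(x, y)) + geom_point()

# Pass the plot explicitly instead of relying on the last plot displayed.
out_path <- file.path(tempdir(), "toy-plot.pdf")
ggsave(out_path, plot = p, width = 8, height = 5)

file.exists(out_path)
```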

Question 14

From this graph, how would you interpret countries that are above the regression line versus countries that are below?

Answer 14

fit.RP<-lm(hdi~cpi,data=corrupt)
summary(fit.RP)

## 
## Call:
## lm(formula = hdi ~ cpi, data = corrupt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28519 -0.06561  0.01051  0.08636  0.20099 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4311409  0.0213609   20.18   <2e-16 ***
## cpi         0.0060898  0.0004425   13.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1085 on 145 degrees of freedom
## Multiple R-squared:  0.5664, Adjusted R-squared:  0.5634 
## F-statistic: 189.4 on 1 and 145 DF,  p-value: < 2.2e-16

Countries above the regression line have a higher HDI than the model predicts from their CPI (positive residuals); countries below it have a lower HDI than predicted (negative residuals).

1- There is a cluster below the line, especially for the Sub-Saharan Africa region.

2- Sub-Saharan Africa seems to be the dominant region behind the low R^2 and high SSE.

3- Countries far from the regression line have larger residuals. Countries below the regression line appear to have the largest residuals.

4- Countries above the regression line tend to have smaller residuals, since they sit closer to it.

5- The majority of the Americas and almost all of the Europe and Central Asia countries have small residuals and tend to improve the fit (by increasing the R^2).

6- Rather than forming a cluster, as observed for countries below the regression line, the countries above the line show a roughly linear spread.
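Whether a country sits above or below the line can be read off the sign of its residual. The sketch below fits the same kind of model to a small made-up data frame; the five countries and their values are hypothetical.

```r
# Hypothetical data with the same column names as corrupt.
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  cpi = c(20, 35, 50, 65, 80),
  hdi = c(0.45, 0.70, 0.60, 0.85, 0.80)
)

fit <- lm(hdi ~ cpi, data = toy)

# Residual = observed hdi minus the hdi predicted from cpi.
toy$residual <- residuals(fit)
toy$position <- ifelse(toy$residual > 0, "above", "below")
toy[, c("country", "residual", "position")]
```

A positive residual means the country's HDI is higher than its CPI alone would predict; a negative one means it is lower.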

Question 15

What role does region have in the relationship between Corruption (Perception) and HDI?

Answer 15

1- The Sub-Saharan Africa region has the lowest human development index, whereas Europe and Central Asia seems to have the highest.

2- The most corrupt region is Sub-Saharan Africa, whereas the prevalence of corruption is lowest in Europe and Central Asia.

3- Although very few Europe and Central Asia countries (2) have significant corruption, they still have a human development index above 0.6. One possible explanation is that corruption there benefits a “selected group” within the country. Since reliable data are hard to obtain in corrupt countries, their human development index figures may also be unreliable.

4- Apart from the Sub-Saharan Africa region, most countries contribute less to the SSE and are close to the regression line.

5- The Americas have a human development index between 0.6 and 0.8.

6- If corruption can be reduced in the Sub-Saharan Africa region, a better human development index would be expected.

7- Ideally, a higher human development index would go with lower corruption throughout. In the ideal case, cpi and hdi would be 100 and 1.0, respectively.

8- Europe and Central Asia shows the most clearly linear relationship overall.

# install.packages('tinytex')
# tinytex::install_tinytex()
library(tinytex)

problem-set-1-demir

Yusuf Kemal Demir;Ph.D.

9/29/2020