The goal is to test your software installation and to demonstrate competency in Markdown and the basics of ggplot.

R and RStudio installation

You should successfully install R and RStudio on your computer. We will do all of our work in this class with the open source (and free!) programming language R. We will work in RStudio, an Integrated Development Environment (IDE) that lets us interact with R seamlessly and write code in a pleasant environment.

Install tidyverse and gapminder packages

The basic installation of R is known as base R. If you haven’t already done so, install two additional packages, tidyverse and gapminder. Go to the Packages panel in the bottom right of RStudio, click “Install,” type tidyverse, and press Enter. Once it finishes, install gapminder the same way. You’ll see a lot of output in the RStudio console as the packages are installed.

You can also just paste and run these two commands

  • install.packages("tidyverse")
  • install.packages("gapminder")

in the console (bottom left in RStudio) instead of using the packages panel.
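If you rerun your setup often, it can help to install only what is missing. The following is a minimal sketch, not a required step: missing_packages is a hypothetical helper name, and the interactive() guard is an assumption added so that knitting a document never triggers an installation.

```r
# Return the subset of `pkgs` that is not yet installed.
# `missing_packages` is a hypothetical helper, not part of base R.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c("tidyverse", "gapminder"))

# Install only in an interactive session, so that knitting an .Rmd
# never starts a package installation by accident.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Running this in the console a second time should do nothing, because to_install will be empty once both packages are present.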

You can find details on R packages here

Practice using Markdown

Written assignments will be submitted using Markdown. Markdown is a lightweight text formatting language that easily converts between file formats. It is integrated directly into R Markdown, which combines R code, output, and written text into a single document (.Rmd).

There is a very nice Markdown tutorial that I suggest you go through before working on your assignment. If you want to use a stand-alone Markdown editor, Typora is a lightweight option that natively supports pandoc-flavoured Markdown.

Pandoc

Pandoc is a program that converts Markdown files into almost any other format. It was created by John MacFarlane, a philosophy professor at the University of California, Berkeley, and is widely used as a writing tool and as a basis for publishing workflows. Kieran Healy’s Plain Text Social Science workflow describes how to write in Markdown and then convert the document to HTML, PDF, Word, etc.
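From within R, the conversion step is usually driven through the rmarkdown package, which calls Pandoc for you. The sketch below assumes rmarkdown is installed. The file content is purely illustrative, and the file is written to a temporary directory rather than to your project folder.

```r
# Illustrative file name and content; written to a temporary directory.
rmd_path <- file.path(tempdir(), "Name_Surname.Rmd")

writeLines(c(
  "---",
  "title: \"Short biography\"",
  "output: html_document",
  "---",
  "",
  "## About me",
  "",
  "This file is written in **Markdown** and converted by *Pandoc*.",
  "",
  "- First interest",
  "- Second interest"
), rmd_path)

# Render only when both rmarkdown and Pandoc can be found.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  message("Rendered: ", html_path)
} else {
  message("rmarkdown or Pandoc not found; skipping the render step.")
}
```

Swapping "html_document" for "word_document" produces a Word file instead. "pdf_document" additionally requires a LaTeX installation.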

You should create a file named after yourself, in the form Name_Surname.Rmd.

Task 1: Short biography written using markdown

You should write within this Rmd file a brief biography of yourself using markdown syntax. I know you have already achieved a lot, but a couple of paragraphs is more than enough.

To achieve full marks, you should include at least 4 of the following elements:

  • Headers
  • Emphasis (italics or bold)
  • Bullet points
  • Links
  • Embedding images

Please write your short biography after this blockquote.

Self intro

Hi there, my name is Liangxu (Liam) Yin.

I graduated from Fudan University in Shanghai, China and I am currently a Global Masters in Management (GMiM) student at London Business School.

My interests include:

  • Guitars
  • Movies
  • Rock N’ Roll Music
  • F1

For my professional experience, please check my LinkedIn here.

Task 2: gapminder country comparison

You have seen the gapminder dataset that has data on life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007. To get a glimpse of the dataframe, namely to see the variable names, variable types, etc., we use the glimpse function. We also want to have a look at the first 20 rows of data.

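library(tidyverse) # loads dplyr, ggplot2, readr, and friends
library(gapminder) # provides the gapminder dataset
glimpse(gapminder)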
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
head(gapminder, 20) # look at the first 20 rows of the dataframe
## # A tibble: 20 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## 11 Afghanistan Asia       2002    42.1 25268405      727.
## 12 Afghanistan Asia       2007    43.8 31889923      975.
## 13 Albania     Europe     1952    55.2  1282697     1601.
## 14 Albania     Europe     1957    59.3  1476505     1942.
## 15 Albania     Europe     1962    64.8  1728137     2313.
## 16 Albania     Europe     1967    66.2  1984060     2760.
## 17 Albania     Europe     1972    67.7  2263554     3313.
## 18 Albania     Europe     1977    68.9  2509048     3533.
## 19 Albania     Europe     1982    70.4  2780097     3631.
## 20 Albania     Europe     1987    72    3075321     3739.

Type your answer after this blockquote.

# Initial Country Plot
country_data <- gapminder |>
  filter(country == "China")

continent_data <- gapminder |>
  filter(continent == "Asia")

plot1 <- ggplot(country_data, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_bw()
print(plot1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Country Plot with Labels
plot1_labs <- ggplot(country_data, aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Life Expectancy of China",
    subtitle = "Gapminder 1952–2007",
    x = "Year", y = "Life Expectancy (years)" 
  ) +
  theme_bw()
print(plot1_labs)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Plot of Countries in the Selected Continent
plot2 <- ggplot(continent_data, aes(x = year, y = lifeExp, colour = country)) +
  geom_point() + 
  geom_smooth(se = FALSE) +
  labs(
    title = "Life Expectancy of Asia",
    subtitle = "Gapminder 1952–2007",
    x = "Year", y = "Life Expectancy (years)", colour = "Country"
  ) +
  theme_bw()
print(plot2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Life Expectancy by Continent
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, colour = country)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~continent) +
  theme_bw() +
  theme(legend.position = "none") # remove all legends
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

My observation

I come from China, so my plots are based on data from China and Asia.

  • In China, life expectancy rose sharply from the 1950s to about 1980, probably because living conditions improved after the founding of the PRC. After 1980, growth was steadier and slower, which is consistent with the law of diminishing returns.

  • In Asia, life expectancy grew overall, but individual countries diverged considerably. The main reason is probably their different socioeconomic conditions.

  • Comparing continents, countries on every continent except Africa all saw an increase in life expectancy. This reflects global improvements in health care, nutrition, sanitation, and technology.

    • Africa, however, has the lowest starting point and the widest gap between countries. Some countries even show declines or stagnation at certain points, likely due to HIV/AIDS epidemics, political instability, or conflict.
    • Europe and Oceania are consistently high, which indicates stable health care systems, economic development, and effective public health policies.
    • The Americas show moderate and consistent growth. The gap between countries like the U.S. and less developed countries in Latin America narrows slightly, though differences remain.
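The observations above are read off the plots, and they can be checked against the numbers directly. The following is a minimal sketch, assuming the dplyr and gapminder packages are installed. The object names continent_change and countries_with_decline are illustrative, and the median is just one reasonable choice of summary.

```r
library(dplyr)
library(gapminder)

# Median life expectancy per continent in the first and last survey years,
# and the gain between them.
continent_change <- gapminder |>
  group_by(continent) |>
  summarise(
    median_1952 = median(lifeExp[year == 1952]),
    median_2007 = median(lifeExp[year == 2007]),
    gain = median_2007 - median_1952
  )

# Number of countries per continent whose life expectancy fell between
# at least one pair of consecutive survey years.
countries_with_decline <- gapminder |>
  arrange(country, year) |>
  group_by(continent, country) |>
  summarise(any_decline = any(diff(lifeExp) < 0), .groups = "drop") |>
  group_by(continent) |>
  summarise(n_countries = n(), n_with_decline = sum(any_decline))

print(continent_change)
print(countries_with_decline)
```

The second table is a quick way to test the claim that declines are concentrated in Africa, rather than judging it by eye from the faceted plot.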

Task 3: Brexit voting

We will have a quick look at the results of the 2016 Brexit vote in the UK. First we read the data using read_csv() and have a quick glimpse at the data.

brexit_results <- read_csv("Data/brexit_results.csv")
glimpse(brexit_results)
## Rows: 632
## Columns: 11
## $ Seat        <chr> "Aldershot", "Aldridge-Brownhills", "Altrincham and Sale W…
## $ con_2015    <dbl> 50.592, 52.050, 52.994, 43.979, 60.788, 22.418, 52.454, 22…
## $ lab_2015    <dbl> 18.333, 22.369, 26.686, 34.781, 11.197, 41.022, 18.441, 49…
## $ ld_2015     <dbl> 8.824, 3.367, 8.383, 2.975, 7.192, 14.828, 5.984, 2.423, 1…
## $ ukip_2015   <dbl> 17.867, 19.624, 8.011, 15.887, 14.438, 21.409, 18.821, 21.…
## $ leave_share <dbl> 57.89777, 67.79635, 38.58780, 65.29912, 49.70111, 70.47289…
## $ born_in_uk  <dbl> 83.10464, 96.12207, 90.48566, 97.30437, 93.33793, 96.96214…
## $ male        <dbl> 49.89896, 48.92951, 48.90621, 49.21657, 48.00189, 49.17185…
## $ unemployed  <dbl> 3.637000, 4.553607, 3.039963, 4.261173, 2.468100, 4.742731…
## $ degree      <dbl> 13.870661, 9.974114, 28.600135, 9.336294, 18.775591, 6.085…
## $ age_18to24  <dbl> 9.406093, 7.325850, 6.437453, 7.747801, 5.734730, 8.209863…

The data comes from Elliott Morris, who cleaned it and made it available through his DataCamp class on analysing election and polling data in R.

Our main outcome variable (or y) is leave_share, which is the percent of votes cast in favour of Brexit, or leaving the EU. Each row is a UK parliament constituency.

To get a sense of the spread of the data, plot a histogram and a density plot of the leave share in all constituencies.

ggplot(brexit_results, aes(x = leave_share)) +
  geom_histogram(binwidth = 2.5) +
  labs(
    title = "Histogram of Leave Share across Constituencies",
    subtitle = "2016 Brexit Vote",
    x = "Leave Share", y = "Number of Constituencies" 
  ) +
  theme_bw()

ggplot(brexit_results, aes(x = leave_share)) +
  geom_density() +
  labs(
    title = "Density Plot of Leave Share across Constituencies",
    subtitle = "2016 Brexit Vote",
    x = "Leave Share", y = "Density"
  ) +
  theme_bw()

One common explanation for the Brexit outcome was fear of immigration and opposition to the EU’s more open border policy. We can check the relationship between the proportion of native born residents (born_in_uk) in a constituency and its leave_share. To do this, let us get the correlation between the two variables:

brexit_results %>% 
  select(leave_share, born_in_uk) %>% 
  cor()
##             leave_share born_in_uk
## leave_share   1.0000000  0.4934295
## born_in_uk    0.4934295  1.0000000

The correlation is about 0.49, which indicates a moderate positive association between the two variables.
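A single correlation coefficient says nothing about its own uncertainty. Base R's cor.test() reports a confidence interval and a p-value alongside the estimate. Because the CSV file is not bundled here, the sketch below uses simulated vectors as a stand-in for the two columns. The numbers it prints are not the Brexit results.

```r
set.seed(42)

# Simulated stand-in for the two columns (NOT the real Brexit data),
# constructed to have a population correlation of 0.5.
n <- 632
born_in_uk <- rnorm(n)
leave_share <- 0.5 * born_in_uk + sqrt(1 - 0.5^2) * rnorm(n)

# With the real data, the equivalent call would be:
# cor.test(brexit_results$born_in_uk, brexit_results$leave_share)
result <- cor.test(born_in_uk, leave_share)

result$estimate # sample Pearson correlation
result$conf.int # 95% confidence interval
result$p.value  # test of the null hypothesis of zero correlation
```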

We can also create a scatter plot between these two variables using geom_point. We also add the best fit line, using geom_smooth(method = "lm").

ggplot(brexit_results, aes(x = born_in_uk, y = leave_share)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(
    title = "Scatter Plot and Fit Line of UK Birth and Leave Share",
    subtitle = "2016 Brexit Vote",
    x = "Proportion of Native Born Residents", y = "Leave Share"
  ) +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

You have the code for the plots. I would like you to revisit all of them and use the labs() function to add an informative title, subtitle, and axis titles to each plot.

What can you say about the relationship shown above? Again, don’t just say what’s happening in the graph. Tell some sort of story and speculate about the differences in the patterns.

Type your answer after this blockquote.

My comment

  • Status quo: Constituencies with a larger share of UK-born residents tend to have a larger leave share. One explanation is fear of immigration and opposition to the EU’s more open border policy among native-born residents.

  • Caveat: Leave share is also shaped by place, economic history, lived experience, and many other factors, not just demographics. Although leave support correlates with the UK-born share, the relationship might not be causal.

  • Possible solution: Add control variables (e.g. household income) to a regression and check whether the coefficient on the UK-born share remains significant.
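The control-variable idea can be sketched with lm(). Household income is not among the columns shown by glimpse(), so this sketch assumes degree, unemployed, and age_18to24 as controls instead. compare_models is a hypothetical helper name. The data frame below is simulated (not the real results) so that degree drives both variables, which shows how a raw slope can shrink once a confounder is included.

```r
# `compare_models` is a hypothetical helper. It returns the slope on
# born_in_uk without and with the assumed control variables.
compare_models <- function(data) {
  naive <- lm(leave_share ~ born_in_uk, data = data)
  controlled <- lm(
    leave_share ~ born_in_uk + degree + unemployed + age_18to24,
    data = data
  )
  c(
    naive = unname(coef(naive)["born_in_uk"]),
    controlled = unname(coef(controlled)["born_in_uk"])
  )
}

# Simulated data (NOT the real Brexit results): `degree` influences both
# born_in_uk and leave_share, and born_in_uk has no direct effect.
set.seed(1)
n <- 632
degree <- rnorm(n)
sim <- data.frame(
  degree = degree,
  unemployed = rnorm(n),
  age_18to24 = rnorm(n),
  born_in_uk = -0.7 * degree + rnorm(n, sd = 0.7),
  leave_share = -1.0 * degree + rnorm(n)
)

slopes <- compare_models(sim)
print(slopes)
```

With the real data loaded, compare_models(brexit_results) would give the corresponding comparison. Whether the slope survives the controls there is an empirical question that this sketch does not answer.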

Details

If you want to, please answer the following

  • Who did you collaborate with: Myself
  • How much time did you spend on each of the 3 DataCamp chapters and on this preliminary assignment completion: 3 hours per chapter; 3 hours for this assignment
  • What, if anything, gave you the most trouble: /