p2: Tidy and Transform Three data set

Jie Zou

2021-03-14

Case I: covid_data.csv

Intro: The status of different cases of COVID-19

Read the file

covid <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_covid_data.csv")
head(covid, 8)
##   X sl_no       country total_cases new_cases total_deaths new_deaths
## 1 0    NA North America     5359473    +6,261       214740        531
## 2 1    NA South America     3952480    +1,207       139781         88
## 3 2    NA          Asia     4130442    +3,348        94073         49
## 4 3    NA        Europe     2833461      +673       202526          3
## 5 4    NA        Africa      895157                  18908         NA
## 6 5    NA       Oceania       18032      +717          213         13
## 7 6    NA                       721                     15         NA
## 8 7    NA         World    17189766   +12,206       670256        684
##   total_recovered active_cases serious_critical_cases total_cases_per_1m_pop
## 1         2747302         5870                2397431                  26093
## 2         2692664          535                1120035                  14274
## 3         3002028         4192                1034341                  19740
## 4         1702083           15                 928852                   5119
## 5          544168           NA                 332081                   1120
## 6           12246         1000                   5573                     39
## 7             651           NA                     55                      4
## 8        10701142        11612                5818368                  66389
##   total_deaths_per_1m_pop total_tests total_tests_per_1m_pop population
## 1                      NA          NA                     NA         NA
## 2                      NA          NA                     NA         NA
## 3                      NA          NA                     NA         NA
## 4                      NA          NA                     NA         NA
## 5                      NA          NA                     NA         NA
## 6                      NA          NA                     NA         NA
## 7                      NA          NA                     NA         NA
## 8                    2205          86                     NA         NA

Tidy covid data

1. remove some rows

the top 8 rows are showing data grouped by continents, however, here, we care more about the cases in different countries. Therefore, I am going to remove the first eight rows.

covid <- covid[-c(0:8),]
head(covid)
##     X sl_no      country total_cases new_cases total_deaths new_deaths
## 9   8     1          USA     4568037                 153840         NA
## 10  9     2       Brazil     2555518                  90188         NA
## 11 10     3        India     1584384                  35003         NA
## 12 11     4       Russia      828990                  13673         NA
## 13 12     5 South Africa      471123                   7497         NA
## 14 13     6       Mexico      408449    +5,752        45361        485
##    total_recovered active_cases serious_critical_cases total_cases_per_1m_pop
## 9          2245044           NA                2169153                  18852
## 10         1787419           NA                 677911                   8318
## 11         1021611           NA                 527770                   8944
## 12          620333           NA                 194984                   2300
## 13          297967           NA                 165659                    539
## 14          267147         5690                  95941                   3922
##    total_deaths_per_1m_pop total_tests total_tests_per_1m_pop population
## 9                    13794         465               56684379     171171
## 10                   12016         424               12601096      59250
## 11                    1147          25               18190382      13171
## 12                    5680          94               27569646     188911
## 13                    7936         126                2873163      48398
## 14                    3165         352                 953514       7389

2. remove some columns

from the table, there are many columns that contains most of NA, therefore, we aren’t going to use these variables to do analyze. Otherwise, it will cause lost of skewness.

covid <- covid %>% 
  select("sl_no","country","total_cases","total_deaths", "total_recovered", "serious_critical_cases", "total_tests")
covid <- covid %>% drop_na()

Perform analysis

lets take a overlook of the data, there are 7 variables and total 187 observations of cases in different countries.

glimpse(covid)
## Rows: 187
## Columns: 7
## $ sl_no                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ country                <chr> "USA", "Brazil", "India", "Russia", "South Afr…
## $ total_cases            <int> 4568037, 2555518, 1584384, 828990, 471123, 408…
## $ total_deaths           <dbl> 153840, 90188, 35003, 13673, 7497, 45361, 1881…
## $ total_recovered        <dbl> 2245044, 1787419, 1021611, 620333, 297967, 267…
## $ serious_critical_cases <int> 2169153, 677911, 527770, 194984, 165659, 95941…
## $ total_tests            <dbl> 465, 424, 25, 94, 126, 352, 570, 485, 608, 677…

1. how is the death cases in the top 10 listed in sl_no?

from the ratio of deaths, UK seems have high death rate followed by Mexico

top_10 <- covid %>%
  filter(sl_no <= 10)%>%
  select(country, total_cases, total_deaths)%>%
  mutate(ratio_of_deaths = total_deaths/total_cases)

ggplot(top_10, aes(x = country, y = ratio_of_deaths)) + geom_bar(stat = "identity") + labs(title = "the ratio of death in 10 countries")

2. how does test cases contribute to total cases?

as we can see that the average test cases all over countries is roughly 94 cases, the average total cases is 91908 each country. If we assume that all test cases were positive and every country have average total cases, it means that there were only in average 0.1% of population contribute to total cases, in other words, no people would like to do the test. And there are only 47 out of 187 countries have the test cases above average.

# the average of test cases all over the countries
avg_test<-sum(covid$total_tests)/nrow(covid)
avg_test
## [1] 93.6631
# the average of total cases all over the countries
avg_total<-sum(covid$total_cases)/nrow(covid)
avg_total
## [1] 91907.93
# the average ratio of test case
avg_ratio<- avg_test / avg_total
avg_ratio
## [1] 0.001019097
# number of countries whose test case is above average ratio
num_country <- covid %>% 
  filter(total_tests>avg_test) %>%
  count(country) %>%
  summarise(sum(n))
num_country
##   sum(n)
## 1     47

3. what is the ratio of test cases contribute to their own country of total cases?

here, each country remains its own ratio of total test cases, where Montserrat has the highest population who is willing to do a covid test. The test rate here is 18%. It is much higher than the one we calculate before.

ratio_test<- covid %>% 
  select(country) %>%
  mutate(test_rate = covid$total_tests/covid$total_cases)

ggplot(ratio_test, aes(x=country, y = test_rate)) + geom_bar(stat = "identity")

# top test rate
ratio_test %>% filter(test_rate == max(test_rate))
##      country test_rate
## 1 Montserrat  16.66667
# avg test rate contributed by each country 
sum(ratio_test$test_rate)/187
## [1] 0.1831692

Conclusion

to do a test is encouraged, in that way we can see better the pandemic situation, we can have a better prediction in the positive test population. Even though there are many deaths cases, the ratio of deaths is still relatively low. Don’t be scary.

Case II: vaccin_data.csv

Intro: covid vaccine injection situation

Read the file

vaccine <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_vaccin_data.csv")

Tidy and transfrom data

Overview of data: there are 15 variables and nearly 6k observations. the data is big.

glimpse(vaccine)
## Rows: 5,972
## Columns: 15
## $ country                             <chr> "Albania", "Albania", "Albania", …
## $ iso_code                            <chr> "ALB", "ALB", "ALB", "ALB", "ALB"…
## $ date                                <chr> "2021-01-10", "2021-01-11", "2021…
## $ total_vaccinations                  <dbl> 0, NA, 128, 188, 266, 308, 369, 4…
## $ people_vaccinated                   <dbl> 0, NA, 128, 188, 266, 308, 369, 4…
## $ people_fully_vaccinated             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily_vaccinations_raw              <dbl> NA, NA, NA, 60, 78, 42, 61, 36, 4…
## $ daily_vaccinations                  <dbl> NA, 64, 64, 63, 66, 62, 62, 58, 5…
## $ total_vaccinations_per_hundred      <dbl> 0.00, NA, 0.00, 0.01, 0.01, 0.01,…
## $ people_vaccinated_per_hundred       <dbl> 0.00, NA, 0.00, 0.01, 0.01, 0.01,…
## $ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ daily_vaccinations_per_million      <dbl> NA, 22, 22, 22, 23, 22, 22, 20, 1…
## $ vaccines                            <chr> "Pfizer/BioNTech", "Pfizer/BioNTe…
## $ source_name                         <chr> "Ministry of Health", "Ministry o…
## $ source_website                      <chr> "https://shendetesia.gov.al/covid…

take a look at the first 10 rows, we observe that country and iso code remain the same, but the date is increment by one. Therefore, we can compress data

head(vaccine, 10)
##    country iso_code       date total_vaccinations people_vaccinated
## 1  Albania      ALB 2021-01-10                  0                 0
## 2  Albania      ALB 2021-01-11                 NA                NA
## 3  Albania      ALB 2021-01-12                128               128
## 4  Albania      ALB 2021-01-13                188               188
## 5  Albania      ALB 2021-01-14                266               266
## 6  Albania      ALB 2021-01-15                308               308
## 7  Albania      ALB 2021-01-16                369               369
## 8  Albania      ALB 2021-01-17                405               405
## 9  Albania      ALB 2021-01-18                447               447
## 10 Albania      ALB 2021-01-19                483               483
##    people_fully_vaccinated daily_vaccinations_raw daily_vaccinations
## 1                       NA                     NA                 NA
## 2                       NA                     NA                 64
## 3                       NA                     NA                 64
## 4                       NA                     60                 63
## 5                       NA                     78                 66
## 6                       NA                     42                 62
## 7                       NA                     61                 62
## 8                       NA                     36                 58
## 9                       NA                     42                 55
## 10                      NA                     36                 51
##    total_vaccinations_per_hundred people_vaccinated_per_hundred
## 1                            0.00                          0.00
## 2                              NA                            NA
## 3                            0.00                          0.00
## 4                            0.01                          0.01
## 5                            0.01                          0.01
## 6                            0.01                          0.01
## 7                            0.01                          0.01
## 8                            0.01                          0.01
## 9                            0.02                          0.02
## 10                           0.02                          0.02
##    people_fully_vaccinated_per_hundred daily_vaccinations_per_million
## 1                                   NA                             NA
## 2                                   NA                             22
## 3                                   NA                             22
## 4                                   NA                             22
## 5                                   NA                             23
## 6                                   NA                             22
## 7                                   NA                             22
## 8                                   NA                             20
## 9                                   NA                             19
## 10                                  NA                             18
##           vaccines        source_name
## 1  Pfizer/BioNTech Ministry of Health
## 2  Pfizer/BioNTech Ministry of Health
## 3  Pfizer/BioNTech Ministry of Health
## 4  Pfizer/BioNTech Ministry of Health
## 5  Pfizer/BioNTech Ministry of Health
## 6  Pfizer/BioNTech Ministry of Health
## 7  Pfizer/BioNTech Ministry of Health
## 8  Pfizer/BioNTech Ministry of Health
## 9  Pfizer/BioNTech Ministry of Health
## 10 Pfizer/BioNTech Ministry of Health
##                                                                                                                                                        source_website
## 1  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 2  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 3  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 4  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 5  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 6  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 7  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 8  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 9  https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/
## 10 https://shendetesia.gov.al/covid19-ministria-e-shendetesise-702-te-vaksinuar-3035-testime-714-te-sheruar-631-raste-te-reja-dhe-17-humbje-jete-ne-24-oret-e-fundit/

1. drop some columns

There are many variables which cannot perform analysis later or are not so useful. For these variables, we are going to drop them to make data smaller. For example, [total_vaccinations_per_hundred], [people_vaccinated_per_hundred], [people_fully_vaccinated_per_hundred], [daily_vaccinations_per_million], [source_name], [source_website], [iso_code], [daily_vaccinations_raw], [vaccines]. After dropping columns, there are still 6k observations

vaccine <- vaccine %>% 
  select(-c(total_vaccinations_per_hundred,
            people_vaccinated_per_hundred,
            people_fully_vaccinated_per_hundred,
            daily_vaccinations_per_million,
            source_name,
            source_website,
            iso_code,daily_vaccinations_raw,
            vaccines))

#columns left
colnames(vaccine)
## [1] "country"                 "date"                   
## [3] "total_vaccinations"      "people_vaccinated"      
## [5] "people_fully_vaccinated" "daily_vaccinations"

2. group data by its country

take a close look at variable date, all observations for each country starts from different date and ends differently. In case we don’t miss any observation which would affect the number of vaccines, instead of dropping NA value, we group by them by its country.

approaches

  • split data into tables with primary key(country) and foreign key
# num of days
days <- vaccine %>% 
  group_by(country)%>%
  count(date) %>%
  summarize(num_of_days = sum(n))

# num of total vaccination
total_vaccine<-vaccine%>%
  group_by(country)%>%
  summarise(total_vaccine=max(total_vaccinations, na.rm = TRUE))

# num of people who vaccinated
people_vaccinated <- vaccine%>%
  group_by(country)%>%
  summarise(people_vaccinated= 
              if(max(people_vaccinated, na.rm = TRUE)==-Inf) 0 else max(people_vaccinated, na.rm = TRUE))

# num of people who got fully vaccinated
people_full_vacc<-vaccine%>%
  group_by(country)%>%
  summarise(people_full_vacc= 
              if(max(people_fully_vaccinated, na.rm = TRUE)==-Inf) 0 else max(people_fully_vaccinated, na.rm = TRUE))
  • join these tables by primary key(country)
df <- days
df<-left_join(df, total_vaccine, by = "country")
df<-left_join(df, people_vaccinated, by = "country")
df<-left_join(df, people_full_vacc, by = "country")
head(df)
## # A tibble: 6 x 5
##   country   num_of_days total_vaccine people_vaccinated people_full_vacc
##   <chr>           <int>         <dbl>             <dbl>            <dbl>
## 1 Albania            60         21613              6073              655
## 2 Algeria            22         75000                 0                0
## 3 Andorra            45          4914              3650             1264
## 4 Angola              4          6169              6169                0
## 5 Anguilla           23          3929              3929                0
## 6 Argentina          73       1919074           1564555           354519

Perform analysis

1. How many vaccines do each country get each day? and which country gets the most and the least vaccinations each day?

from avg_vacc table, we can see the average vaccination situation of each country. Meanwhile, the most vaccinated country is the United States, and the least vaccinated country is Montserrat

avg_vacc <- df %>% 
  select(country, num_of_days, total_vaccine)%>%
  mutate(avg_vacc = round(total_vaccine/num_of_days,0))
avg_vacc
## # A tibble: 133 x 4
##    country    num_of_days total_vaccine avg_vacc
##    <chr>            <int>         <dbl>    <dbl>
##  1 Albania             60         21613      360
##  2 Algeria             22         75000     3409
##  3 Andorra             45          4914      109
##  4 Angola               4          6169     1542
##  5 Anguilla            23          3929      171
##  6 Argentina           73       1919074    26289
##  7 Australia           25        125000     5000
##  8 Austria             75        929497    12393
##  9 Azerbaijan          54        436849     8090
## 10 Bahrain             79        520429     6588
## # … with 123 more rows
# the most vaccinated country
avg_vacc %>% filter(avg_vacc == max(avg_vacc))
## # A tibble: 1 x 4
##   country       num_of_days total_vaccine avg_vacc
##   <chr>               <int>         <dbl>    <dbl>
## 1 United States          82      98203893  1197608
# the least vaccinated country
avg_vacc %>% filter(avg_vacc == min(avg_vacc))
## # A tibble: 1 x 4
##   country    num_of_days total_vaccine avg_vacc
##   <chr>            <int>         <dbl>    <dbl>
## 1 Montserrat          16           652       41

2. What is the future demand for vaccines in the top five countries in terms of total vaccine doses?

approches

  • the top 5 countries are United States, China, India, United Kingdom, England.
df %>% 
  select(country, total_vaccine)%>%
  arrange(desc(total_vaccine))
## # A tibble: 133 x 2
##    country        total_vaccine
##    <chr>                  <dbl>
##  1 United States       98203893
##  2 China               52520000
##  3 India               26164920
##  4 United Kingdom      24405231
##  5 England             20526312
##  6 Brazil              10740859
##  7 Turkey              10502494
##  8 Israel               9179072
##  9 Germany              8716654
## 10 Russia               7273933
## # … with 123 more rows
  • select these 5 countries in vaccine table and create a new data set

the graph of China, England and United Kingdom show that the demand towards stability, however, the graph of India and United States seem still upward trend. So I assume that the demand for vaccines will continue to increase in these two countries

top_5 <- vaccine %>% 
  select(country,date,daily_vaccinations) %>%
  filter(country == "United States" | 
           country =="China" | 
           country == "India" | 
           country == "United Kingdom" | 
           country =="England")

ggplot(top_5, aes(x = date, y = daily_vaccinations)) + geom_histogram(stat = "identity")+ facet_wrap(~ country)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 5 rows containing missing values (position_stack).

Case III: gdp_data.csv

Intro: The GDP of top 10 countries in the past ten years

Read file

gdp <- read.csv("https://raw.githubusercontent.com/Sugarcane-svg/R/main/R607/Projects/p2/p2_gdp_data.csv")

Tidy and transform data

1. summary of data

compare to previous data set, this data has much more variables and much less observations.

glimpse(gdp)
## Rows: 264
## Columns: 64
## $ Country.Name   <chr> "Aruba", "Afghanistan", "Angola", "Albania", "Andorra"…
## $ Country.Code   <chr> "ABW", "AFG", "AGO", "ALB", "AND", "ARB", "ARE", "ARG"…
## $ Indicator.Name <chr> "GDP growth (annual %)", "GDP growth (annual %)", "GDP…
## $ Indicator.Code <chr> "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP.KD.ZG", "NY.GDP.MKTP…
## $ X1960          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ X1961          <dbl> NA, NA, NA, NA, NA, NA, NA, 5.4, NA, NA, NA, 2.5, 5.5,…
## $ X1962          <dbl> NA, NA, NA, NA, NA, NA, NA, -0.9, NA, NA, NA, 1.3, 2.6…
## $ X1963          <dbl> NA, NA, NA, NA, NA, NA, NA, -5.3, NA, NA, NA, 6.2, 4.1…
## $ X1964          <dbl> NA, NA, NA, NA, NA, NA, NA, 10.1, NA, NA, NA, 7.0, 6.1…
## $ X1965          <dbl> NA, NA, NA, NA, NA, NA, NA, 10.6, NA, NA, NA, 6.0, 3.5…
## $ X1966          <dbl> NA, NA, NA, NA, NA, NA, NA, -0.7, NA, NA, NA, 2.4, 5.6…
## $ X1967          <dbl> NA, NA, NA, NA, NA, NA, NA, 3.2, NA, NA, NA, 6.3, 3.0,…
## $ X1968          <dbl> NA, NA, NA, NA, NA, NA, NA, 4.8, NA, NA, NA, 5.1, 4.5,…
## $ X1969          <dbl> NA, NA, NA, NA, NA, NA, NA, 9.7, NA, NA, NA, 7.0, 6.3,…
## $ X1970          <dbl> NA, NA, NA, NA, NA, NA, NA, 3.0, NA, NA, NA, 7.2, 6.3,…
## $ X1971          <dbl> NA, NA, NA, NA, 4.6, NA, NA, 5.7, NA, NA, NA, 4.0, 5.1…
## $ X1972          <dbl> NA, NA, NA, NA, 8.1, NA, NA, 1.6, NA, NA, NA, 3.9, 6.2…
## $ X1973          <dbl> NA, NA, NA, NA, 7.8, NA, NA, 2.8, NA, NA, NA, 2.6, 4.9…
## $ X1974          <dbl> NA, NA, NA, NA, 5.6, NA, NA, 5.5, NA, NA, NA, 4.1, 3.9…
## $ X1975          <dbl> NA, NA, NA, NA, 0.5, NA, NA, 0.0, NA, NA, NA, 1.3, -0.…
## $ X1976          <dbl> NA, NA, NA, NA, 3.3, 15.7, 16.5, -2.0, NA, NA, NA, 2.6…
## $ X1977          <dbl> NA, NA, NA, NA, 2.8, 8.2, 21.4, 6.9, NA, NA, NA, 3.6, …
## $ X1978          <dbl> NA, NA, NA, NA, 1.5, -0.7, -1.6, -4.5, NA, NA, 4.4, 0.…
## $ X1979          <dbl> NA, NA, NA, NA, 0.0, 11.7, 20.9, 10.2, NA, NA, 8.1, 4.…
## $ X1980          <dbl> NA, NA, NA, NA, 2.2, 9.1, 23.9, 1.5, NA, NA, 8.2, 3.0,…
## $ X1981          <dbl> NA, NA, -4.4, 5.7, -0.1, 3.0, 4.7, -5.2, NA, NA, 3.8, …
## $ X1982          <dbl> NA, NA, 0.0, 2.9, 1.2, -9.1, -6.7, -0.7, NA, NA, -0.1,…
## $ X1983          <dbl> NA, NA, 4.2, 1.1, 1.8, -6.9, -4.7, 4.3, NA, NA, 5.4, -…
## $ X1984          <dbl> NA, NA, 6.0, -1.3, 1.8, 1.3, 4.0, 1.6, NA, NA, 10.2, 4…
## $ X1985          <dbl> NA, NA, 3.5, 1.8, 2.3, -2.4, -3.6, -5.2, NA, NA, 7.6, …
## $ X1986          <dbl> NA, NA, 2.9, 5.6, 3.3, 4.7, -15.0, 6.2, NA, NA, 11.5, …
## $ X1987          <dbl> 16.1, NA, 4.1, -0.8, 5.5, -0.5, 3.4, 2.7, NA, NA, 6.6,…
## $ X1988          <dbl> 18.6, NA, 6.1, -1.4, 5.1, 5.7, -2.6, -1.1, NA, NA, 5.2…
## $ X1989          <dbl> 12.1, NA, 0.0, 9.8, 4.8, 2.4, 12.3, -7.2, NA, NA, 5.3,…
## $ X1990          <dbl> 4.0, NA, -3.5, -9.6, 3.8, 13.1, 18.3, -2.5, NA, NA, 3.…
## $ X1991          <dbl> 8.0, NA, 1.0, -28.0, 2.5, 1.5, 0.9, 9.1, -11.7, NA, 2.…
## $ X1992          <dbl> 5.9, NA, -5.8, -7.2, 0.9, 4.9, 3.3, 7.9, -41.8, NA, 1.…
## $ X1993          <dbl> 7.3, NA, -24.0, 9.6, -1.0, 3.3, 1.3, 8.2, -8.8, NA, 5.…
## $ X1994          <dbl> 8.2, NA, 1.3, 8.3, 2.4, 3.2, 6.9, 5.8, 5.4, NA, 6.7, 4…
## $ X1995          <dbl> 2.5, NA, 15.0, 13.3, 2.8, 2.7, 6.7, -2.8, 6.9, NA, -4.…
## $ X1996          <dbl> 1.2, NA, 13.5, 9.1, 4.6, 4.6, 5.8, 5.5, 5.9, NA, 6.6, …
## $ X1997          <dbl> 7.0, NA, 7.3, -10.9, 9.1, 4.2, 8.2, 8.1, 3.3, NA, 5.5,…
## $ X1998          <dbl> 2.0, NA, 4.7, 8.8, 3.2, 5.3, 0.3, 3.9, 7.3, NA, 4.7, 4…
## $ X1999          <dbl> 1.2, NA, 2.2, 12.9, 4.1, 1.8, 2.9, -3.4, 3.3, NA, 3.7,…
## $ X2000          <dbl> 7.6, NA, 3.1, 7.0, 3.5, 5.5, 10.9, -0.8, 5.9, NA, 6.2,…
## $ X2001          <dbl> -3.0, NA, 4.2, 8.3, 8.1, 1.6, 1.4, -4.4, 9.6, NA, -4.5…
## $ X2002          <dbl> -3.3, NA, 13.7, 4.5, 4.5, 0.6, 2.4, -10.9, 13.2, NA, 1…
## $ X2003          <dbl> 2.0, 8.8, 3.0, 5.5, 8.7, 5.3, 8.8, 8.8, 14.0, 0.8, 6.1…
## $ X2004          <dbl> 7.9, 1.4, 11.0, 5.5, 8.1, 9.4, 9.6, 9.0, 10.5, 0.5, 5.…
## $ X2005          <dbl> 1.2, 11.2, 15.0, 5.5, 5.4, 5.7, 4.9, 8.9, 13.9, -0.4, …
## $ X2006          <dbl> 1.1, 5.4, 11.5, 5.9, 4.8, 6.5, 9.8, 8.0, 13.2, -4.2, 1…
## $ X2007          <dbl> 1.8, 13.8, 14.0, 6.0, 1.6, 4.5, 3.2, 9.0, 13.7, 2.0, 9…
## $ X2008          <dbl> -0.1, 3.9, 11.2, 7.5, -5.6, 5.8, 3.2, 4.1, 6.9, -2.6, …
## $ X2009          <dbl> -10.5, 21.4, 0.9, 3.4, -5.3, 0.4, -5.2, -5.9, -14.1, -…
## $ X2010          <dbl> -3.7, 14.4, 4.4, 3.7, -2.0, 4.7, 1.6, 10.1, 2.2, 0.4, …
## $ X2011          <dbl> 3.4, 0.4, 3.5, 2.5, 0.0, 3.7, 6.9, 6.0, 4.7, 0.3, -2.0…
## $ X2012          <dbl> -1.4, 12.8, 8.5, 1.4, -5.0, 6.7, 4.5, -1.0, 7.2, -4.4,…
## $ X2013          <dbl> 4.2, 5.6, 5.0, 1.0, -3.5, 3.3, 5.1, 2.4, 3.3, -2.8, -0…
## $ X2014          <dbl> 0.3, 2.7, 4.8, 1.8, 2.5, 2.5, 4.3, -2.5, 3.6, 0.9, 3.8…
## $ X2015          <dbl> 5.7, 1.5, 0.9, 2.2, 1.4, 3.2, 5.1, 2.7, 3.2, 1.4, 3.8,…
## $ X2016          <dbl> 2.1, 2.3, -2.6, 3.3, 3.7, 3.5, 3.1, -2.1, 0.2, -2.8, 5…
## $ X2017          <dbl> 2.0, 2.6, -0.1, 3.8, 0.3, 0.8, 2.4, 2.8, 7.5, -5.9, 3.…
## $ X2018          <dbl> NA, 1.2, -2.0, 4.1, 1.6, 2.1, 1.2, -2.6, 5.2, 2.2, 7.4…
## $ X2019          <dbl> NA, 3.9, -0.6, 2.2, 1.8, 1.5, 1.7, -2.1, 7.6, NA, 4.7,…

2. drop columns

we are going to see the status of GDP in the past ten years. So, we need to drop some unnecessary columns. For example, [Country.Code], I am not familiar with country code :(. [Indicator.Name] & [Indicator.Code], no calculated values. [X1960 : X2009], beyond the analysis.

gdp <- gdp %>% select(-c(2:54))
colnames(gdp)
##  [1] "Country.Name" "X2010"        "X2011"        "X2012"        "X2013"       
##  [6] "X2014"        "X2015"        "X2016"        "X2017"        "X2018"       
## [11] "X2019"

3. drop rows

there are some countries that don’t have data of gdp in the past ten years. We only care about the countries with gdp growth, so it no need to keep these countries in side the data set.

gdp <- gdp %>% drop_na()
head(gdp)
##           Country.Name X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017 X2018
## 1          Afghanistan  14.4   0.4  12.8   5.6   2.7   1.5   2.3   2.6   1.2
## 2               Angola   4.4   3.5   8.5   5.0   4.8   0.9  -2.6  -0.1  -2.0
## 3              Albania   3.7   2.5   1.4   1.0   1.8   2.2   3.3   3.8   4.1
## 4              Andorra  -2.0   0.0  -5.0  -3.5   2.5   1.4   3.7   0.3   1.6
## 5           Arab World   4.7   3.7   6.7   3.3   2.5   3.2   3.5   0.8   2.1
## 6 United Arab Emirates   1.6   6.9   4.5   5.1   4.3   5.1   3.1   2.4   1.2
##   X2019
## 1   3.9
## 2  -0.6
## 3   2.2
## 4   1.8
## 5   1.5
## 6   1.7

4. transfrom data

Even through the data set looks cleaner than before, it is still hard for us to plot or analyze it, as a result, I am going to transform data into [country name]: character [years]: numeric [gdp]: numeric

df.long <- pivot_longer(gdp, cols=2:11, names_to = "years", values_to = "gdp")
head(df.long)
## # A tibble: 6 x 3
##   Country.Name years   gdp
##   <chr>        <chr> <dbl>
## 1 Afghanistan  X2010  14.4
## 2 Afghanistan  X2011   0.4
## 3 Afghanistan  X2012  12.8
## 4 Afghanistan  X2013   5.6
## 5 Afghanistan  X2014   2.7
## 6 Afghanistan  X2015   1.5

Perform analysis

1. what is the mean and median of each countries’ GDP?

summary_df<-df.long %>% 
  group_by(Country.Name)%>%
  summarize(mean = mean(gdp), median = median(gdp))
head(summary_df)
## # A tibble: 6 x 3
##   Country.Name         mean median
##   <chr>               <dbl>  <dbl>
## 1 Afghanistan          4.74   2.65
## 2 Albania              2.6    2.35
## 3 Algeria              2.67   3.05
## 4 Andorra              0.08   0.85
## 5 Angola               2.18   2.2 
## 6 Antigua and Barbuda  2.13   3.6

2. how many countries have negative gdp growth?

summary_df %>% 
  filter(mean<0)%>%
  count(Country.Name)%>%
  summarize(number_of_country_have_negative_GDP = sum(n))
## # A tibble: 1 x 1
##   number_of_country_have_negative_GDP
##                                 <int>
## 1                                  10

3. which country has highest GDP growth in 2019?

that surprise me, the country which has the highest GDP growth in 2019 is Timor-Leste. the gdp of this country in 2018 is -1.1, and gdp of 2019 is 18.7.

growth18 <- df.long%>%
  filter(years == "X2018")%>%
  select(Country.Name, gdp)


growth19 <-df.long%>%
  filter(years == "X2019")%>%
  select(Country.Name, gdp)

highest19<-left_join(growth18, growth19, by = "Country.Name")

highest19 <- highest19 %>% 
  mutate(growth = (gdp.y - gdp.x))

highest19 %>% filter(growth == max(growth))
## # A tibble: 1 x 4
##   Country.Name gdp.x gdp.y growth
##   <chr>        <dbl> <dbl>  <dbl>
## 1 Timor-Leste   -1.1  18.7   19.8

let’s take la close look at the graph, Timor-Leste’s GDP had been going down from 2010 to 2018. However, before 2016, the growth is showing positive which means that it still make some profits. from 2016 to 2018, the GDP broke the freezing point. That must be something serious happpen.

tl<-df.long%>%filter(Country.Name == "Timor-Leste")

ggplot(tl, aes(x = years, y = gdp)) + geom_bar(stat = "identity")

4. what about countries with GDP growth like Timor-Leste?

Countries that are not major economies have increased their GDP by at least 1 percentage point between 2018 and 2019, surprisingly, basically no countries in the global economy have appeared.

greater1<-highest19 %>% filter(growth >1)
ggplot(greater1, aes(x = Country.Name, y = growth, fill = Country.Name)) + geom_bar(stat = "identity") + coord_flip()

5. What about countries like United States, China, Japan, Germany and France? (just list some countries’ names on the top of my head now)

The graph of China has similar trends to those mentioned above, the percentage of GDP has declined slightly since 2018. The plot of the United States is very close to uniform distribution, that the economy of this country is very stable.

listed_countries <- df.long %>% 
  filter(Country.Name == "United States" | 
           Country.Name == "China" | 
           Country.Name == "Japan" |
           Country.Name == "Germany" |
           Country.Name == "France")

ggplot(listed_countries, aes(x = years, y = gdp)) + geom_bar(stat = "identity") + facet_wrap(~Country.Name)