Statistics with R

class: center, middle, inverse, title-slide

# Statistics with R
## R for Actuarial Students

---

### Data

Consider the data set ‘Covid_2019.csv’. The first row of the CSV file
contains the headings for the columns. Import it into the R environment as <tt>covid19</tt>.

---

#### Exercises

1. Print the number of missing values in each of the columns and create a new data set
‘<tt>covid19_1</tt>’ by removing all rows that contain missing values. From <tt>covid19_1</tt>, use the columns from Population Density (8th column) to Life Expectancy (17th column) to answer the following questions.

2. Create a new data frame “<tt>Covid_Cluster</tt>” containing only the above-mentioned
columns. Normalize all the columns of the data frame using the <tt>scale</tt> function.

3. Classify the countries into five groups by applying the K-means clustering algorithm
to the values obtained in Part 2. You must set a seed value of 100 before
executing the algorithm. Print the number of countries in each cluster.

4. What proportion of the countries in each cluster are severe with respect to COVID-19?
You can use the “Severe” column from the original data set.

5. Print the total number of cases and total number of deaths for each cluster.

---

#### Part 1

```r
covid19 <- read.csv("Covid_2019.csv")
dim(covid19)
```

```
## [1] 208  18
```

```r
# Summary statistics (including NA counts) for columns 14 to 17
summary( covid19[ , 14:17] )
```

```
##  female_smokers   male_smokers   hospital_beds_per_thousand life_expectancy
##  Min.   : 0.10   Min.   : 7.70   Min.   : 0.100             Min.   :53.28  
##  1st Qu.: 1.90   1st Qu.:21.40   1st Qu.: 1.300             1st Qu.:69.02  
##  Median : 5.90   Median :31.20   Median : 2.320             Median :75.05  
##  Mean   :10.32   Mean   :32.63   Mean   : 3.013             Mean   :73.43  
##  3rd Qu.:18.95   3rd Qu.:41.30   3rd Qu.: 3.930             3rd Qu.:78.92  
##  Max.   :44.00   Max.   :78.10   Max.   :13.800             Max.   :86.75  
##  NA's   :69      NA's   :71      NA's   :45                 NA's   :3
```

---

```r
# Count the missing values (NA) in each column
missingvalues <- sapply(covid19, function(x) sum(is.na(x)))
missingvalues
```

```
##                  Continent                    Country 
##                          0                          0 
##                total_cases               total_deaths 
##                          0                          0 
##    total_cases_per_million   total_deaths_per_million 
##                          0                          0 
##                 population         population_density 
##                          0                         11 
##                 median_age              aged_65_older 
##                         24                         27 
##             gdp_per_capita      cardiovasc_death_rate 
##                         27                         24 
##        diabetes_prevalence             female_smokers 
##                         17                         69 
##               male_smokers hospital_beds_per_thousand 
##                         71                         45 
##            life_expectancy                     Severe 
##                          3                          0
```
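
The same per-column count can also be obtained with `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy` and its values are illustrative assumptions, not taken from `Covid_2019.csv`) to show that both forms agree.

```r
# Toy data frame standing in for covid19 (values are made up for illustration)
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  median_age = c(30.1, NA, 41.5, NA),
  smokers    = c(NA, 12.3, 20.0, 18.7)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts

# The sapply() form used above gives the same counts
all(na_counts == sapply(toy, function(x) sum(is.na(x))))
```

`colSums()` returns a numeric vector while the `sapply()` version returns integers, so the comparison uses `==` rather than `identical()`.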
---

#### Part 1

```r
# Keep only the rows that have no missing value in any column
covid19_1 <- covid19[complete.cases(covid19), ]
dim(covid19_1)
```

```
## [1] 126  18
```

```r
# Summary statistics for columns 8 to 11 (no NA's remain)
summary(covid19_1[ , 8:11] )
```

```
##  population_density   median_age    aged_65_older    gdp_per_capita    
##  Min.   :   1.98    Min.   :15.10   Min.   : 1.144   Min.   :   752.8  
##  1st Qu.:  43.10    1st Qu.:26.35   1st Qu.: 4.424   1st Qu.:  6404.7  
##  Median :  87.25    Median :32.40   Median : 8.213   Median : 15827.4  
##  Mean   : 227.30    Mean   :32.69   Mean   :10.084   Mean   : 22357.3  
##  3rd Qu.: 205.50    3rd Qu.:40.67   3rd Qu.:15.390   3rd Qu.: 32558.2  
##  Max.   :7915.73    Max.   :48.20   Max.   :27.049   Max.   :116935.6
```
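
`complete.cases()` drops a row as soon as any one of its columns is missing, which is why the data set shrinks from 208 to 126 rows here. A minimal sketch of that behaviour, on a made-up data frame (`toy` is an illustrative assumption, not part of the exercise data):

```r
# Toy data frame: rows A and B each have one missing value
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  median_age = c(30.1, NA, 41.5, 28.0),
  smokers    = c(NA, 12.3, 20.0, 18.7)
)

# TRUE only for rows with no NA in any column
keep <- complete.cases(toy)
keep

# Subset to the complete rows and count how many rows were dropped
toy_complete <- toy[keep, ]
nrow(toy) - nrow(toy_complete)
```

`na.omit(toy)` would keep the same rows; it additionally records the dropped row indices in an `na.action` attribute.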

---

#### Part 2

```r
Covid_Cluster <- covid19_1[, 8:17]
names(Covid_Cluster)
```

```
##  [1] "population_density"         "median_age"                
##  [3] "aged_65_older"              "gdp_per_capita"            
##  [5] "cardiovasc_death_rate"      "diabetes_prevalence"       
##  [7] "female_smokers"             "male_smokers"              
##  [9] "hospital_beds_per_thousand" "life_expectancy"
```

```r
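# Center each column to mean 0 and rescale to standard deviation 1
# (note that scale() returns a numeric matrix, not a data frame)
Covid_Cluster <- scale(Covid_Cluster)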
```
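
With its default arguments, `scale()` subtracts each column's mean and divides by its sample standard deviation. The sketch below checks this on a small made-up data frame (`x` and its values are illustrative assumptions):

```r
# Two made-up columns on very different scales
x <- data.frame(a = c(1, 2, 3, 4, 5),
                b = c(10, 20, 30, 40, 100))

z <- scale(x)

# Manual z-score for column a: (value - mean) / sd
manual_a <- (x$a - mean(x$a)) / sd(x$a)
all.equal(as.numeric(z[, "a"]), manual_a)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(z), 10)
apply(z, 2, sd)

# The result is a matrix; wrap it in as.data.frame() if a data frame is needed
is.matrix(z)
```

Putting the columns on a common scale matters for K-means, because the algorithm uses Euclidean distances and an unscaled column such as `gdp_per_capita` would otherwise dominate them.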

---

#### Part 3

Classify the countries into five groups by applying the K-means clustering algorithm
to the values obtained in Part 2. You must set a seed value of 100 before
executing the algorithm. Print the number of countries in each cluster.

```r
set.seed(100)
cluster1 <- kmeans(Covid_Cluster, centers = 5)
cluster1$size
```

```
## [1] 24 26 20 23 33
```
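
The seed matters because `kmeans()` picks its starting centers at random, so different runs can label or even split the clusters differently. A minimal sketch on simulated points (`pts` is made-up data, and the `nstart = 25` run is an optional variation that is not part of the exercise):

```r
# Simulate two loose groups of 20 points each in two dimensions
set.seed(1)
pts <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# Same seed before each call -> same starting centers -> same assignment
set.seed(100)
fit_a <- kmeans(pts, centers = 2)
set.seed(100)
fit_b <- kmeans(pts, centers = 2)
identical(fit_a$cluster, fit_b$cluster)

# Optional: try several random starts and keep the best solution
set.seed(100)
fit_multi <- kmeans(pts, centers = 2, nstart = 25)
c(single = fit_a$tot.withinss, multi = fit_multi$tot.withinss)
```

Using `nstart` on the exercise data could change the cluster sizes shown above, so it is best treated as a robustness check rather than a replacement for the required call.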

---

#### Part 4

What proportion of the countries in each cluster are severe with respect to COVID-19?
You can use the “Severe” column from the original data set.

```r
# Attach each country's cluster label (row order matches Covid_Cluster)
covid19_1$cluster <- cluster1$cluster
```

```r
table(covid19_1$cluster, covid19_1$Severe)
```

```
##    
##     No Yes
##   1 11  13
##   2 11  15
##   3 15   5
##   4 17   6
##   5 29   4
```

---

#### Part 4

```r
prop.table(table(covid19_1$cluster, covid19_1$Severe), margin = 1)
```

```
##    
##            No       Yes
##   1 0.4583333 0.5416667
##   2 0.4230769 0.5769231
##   3 0.7500000 0.2500000
##   4 0.7391304 0.2608696
##   5 0.8787879 0.1212121
```
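
With `margin = 1`, each count is divided by its row total, so every row sums to 1. For cluster 1 above this gives 13 / (11 + 13) = 0.5417. A small sketch with made-up labels (`cluster` and `severe` here are illustrative vectors, not the exercise data):

```r
# Made-up cluster labels and severity flags for nine countries
cluster <- c(1, 1, 1, 1, 2, 2, 2, 2, 2)
severe  <- c("Yes", "No", "Yes", "Yes", "No", "No", "No", "Yes", "No")

counts <- table(cluster, severe)

# margin = 1 divides each count by its row (cluster) total
row_props <- prop.table(counts, margin = 1)
row_props
rowSums(row_props)

# Equivalent shortcut: the mean of a logical vector is the share of TRUE values
tapply(severe == "Yes", cluster, mean)
```

Using `margin = 2` instead would answer a different question: what share of all severe countries falls into each cluster.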

---

#### Part 5

```r
cbind(
  aggregate(total_cases ~ cluster, data = covid19_1, FUN = "sum"),
  aggregate(total_deaths ~ cluster, data = covid19_1, FUN = "sum")
)
```

```
##   cluster total_cases cluster total_deaths
## 1       1     4833767       1       181850
## 2       2     6558014       2       318579
## 3       3     1446595       3        33944
## 4       4     1238502       4        32231
## 5       5     2643637       5        56431
```
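
Binding two `aggregate()` results side by side repeats the `cluster` column. One possible way to avoid this is to put both variables on the left-hand side of the formula with `cbind()`. A sketch on made-up numbers (`toy` is an illustrative data frame, not the exercise data):

```r
# Made-up totals for five countries in two clusters
toy <- data.frame(
  cluster      = c(1, 1, 2, 2, 2),
  total_cases  = c(100, 250, 40, 60, 10),
  total_deaths = c(5, 10, 1, 2, 0)
)

# cbind() on the left-hand side sums both columns in a single call,
# giving one cluster column followed by the two totals
totals <- aggregate(cbind(total_cases, total_deaths) ~ cluster,
                    data = toy, FUN = sum)
totals
```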

---

#### Using {tidyverse}

```r
library(tidyverse)
covid19_1 %>%
  group_by(cluster) %>%
  summarize(
    total_cases = sum(total_cases),
    total_deaths = sum(total_deaths)
  )
```

```
## # A tibble: 5 x 3
##   cluster total_cases total_deaths
##     <int>       <int>        <int>
## 1       1     4833767       181850
## 2       2     6558014       318579
## 3       3     1446595        33944
## 4       4     1238502        32231
## 5       5     2643637        56431
```

---