class: center, middle, inverse, title-slide

# Statistics with R
## R for Actuarial Students

---

### Data

Consider the data set ‘Covid_2019.csv’. The first row of the csv file contains the headings for the columns. Import it into the R environment as <tt>covid19</tt>.

---

#### Exercises

1. Print the number of missing values in each of the columns and create a new data set ‘<tt>covid19_1</tt>’ by removing every row that contains a missing value.

From <tt>covid19_1</tt>, use the columns from Population Density (8th column) to Life Expectancy (17th column) to answer the following questions.

2. Create a new data frame “<tt>Covid_Cluster</tt>” containing only the above-mentioned columns. Normalize all the columns of the data frame using the <tt>scale</tt> function.
3. Classify the countries into five groups by applying the K-Means clustering algorithm to the values obtained in Part 2. It is mandatory to set a seed value of 100 before executing the algorithm. Print the number of countries in each cluster.
4. What proportion of the countries in each cluster are severe with respect to COVID-19? You can use the “Severe” column from the original dataset.
5. Print the total number of cases and the total number of deaths for each cluster.

---

### Part 1

```r
# Read the data; the first row of the file holds the column headings
covid19 <- read.csv("Covid_2019.csv")
dim(covid19)
```

```
## [1] 208  18
```

```r
# Summary of some of the columns
summary(covid19[, 14:17])
```

```
##  female_smokers   male_smokers    hospital_beds_per_thousand  life_expectancy
##  Min.   : 0.10    Min.   : 7.70   Min.   : 0.100              Min.   :53.28
##  1st Qu.: 1.90    1st Qu.:21.40   1st Qu.: 1.300              1st Qu.:69.02
##  Median : 5.90    Median :31.20   Median : 2.320              Median :75.05
##  Mean   :10.32    Mean   :32.63   Mean   : 3.013              Mean   :73.43
##  3rd Qu.:18.95    3rd Qu.:41.30   3rd Qu.: 3.930              3rd Qu.:78.92
##  Max.   :44.00    Max.   :78.10   Max.   :13.800              Max.   :86.75
##  NA's   :69       NA's   :71      NA's   :45                  NA's   :3
```

---

```r
# Count the missing values in each column
missingvalues <- sapply(covid19, FUN = function(x) sum(is.na(x)))
missingvalues
```

```
##                  Continent                    Country
##                          0                          0
##                total_cases               total_deaths
##                          0                          0
##    total_cases_per_million   total_deaths_per_million
##                          0                          0
##                 population         population_density
##                          0                         11
##                 median_age              aged_65_older
##                         24                         27
##             gdp_per_capita      cardiovasc_death_rate
##                         27                         24
##        diabetes_prevalence             female_smokers
##                         17                         69
##               male_smokers hospital_beds_per_thousand
##                         71                         45
##            life_expectancy                     Severe
##                          3                          0
```

---

### Part 1

```r
# Keep only the rows that have no missing value in any column
covid19_1 <- covid19[complete.cases(covid19), ]
dim(covid19_1)
```

```
## [1] 126  18
```

```r
# Summary of some of the columns
summary(covid19_1[, 8:11])
```

```
##  population_density   median_age     aged_65_older     gdp_per_capita
##  Min.   :   1.98      Min.   :15.10  Min.   : 1.144    Min.   :   752.8
##  1st Qu.:  43.10      1st Qu.:26.35  1st Qu.: 4.424    1st Qu.:  6404.7
##  Median :  87.25      Median :32.40  Median : 8.213    Median : 15827.4
##  Mean   : 227.30      Mean   :32.69  Mean   :10.084    Mean   : 22357.3
##  3rd Qu.: 205.50      3rd Qu.:40.67  3rd Qu.:15.390    3rd Qu.: 32558.2
##  Max.   :7915.73      Max.   :48.20  Max.   :27.049    Max.   :116935.6
```

---

#### Part 2

```r
# Columns 8 to 17: population density through life expectancy
Covid_Cluster <- covid19_1[, 8:17]
names(Covid_Cluster)
```

```
## [1] "population_density"         "median_age"
## [3] "aged_65_older"              "gdp_per_capita"
## [5] "cardiovasc_death_rate"      "diabetes_prevalence"
## [7] "female_smokers"             "male_smokers"
## [9] "hospital_beds_per_thousand" "life_expectancy"
```

```r
# Centre and scale every column; note that scale() returns a matrix
Covid_Cluster <- scale(Covid_Cluster)
```

---

#### Part 3

Classify the countries into five groups by applying the K-Means clustering algorithm to the values obtained in Part 2. It is mandatory to set a seed value of 100 before executing the algorithm. Print the number of countries in each cluster.

```r
# Fix the seed so the random starting centres are reproducible
set.seed(100)
cluster1 <- kmeans(Covid_Cluster, centers = 5)
cluster1$size
```

```
## [1] 24 26 20 23 33
```

---

#### Part 4

What proportion of the countries in each cluster are severe with respect to COVID-19? You can use the “Severe” column from the original dataset.

```r
# Attach the cluster label of each country to the cleaned data set
covid19_1$cluster <- cluster1$cluster
```

```r
# Counts of severe and non-severe countries per cluster
table(covid19_1$cluster, covid19_1$Severe)
```

```
##
##     No Yes
##   1 11  13
##   2 11  15
##   3 15   5
##   4 17   6
##   5 29   4
```

---

#### Part 4

```r
# margin = 1 gives row proportions, i.e. proportions within each cluster
prop.table(table(covid19_1$cluster, covid19_1$Severe), margin = 1)
```

```
##
##            No       Yes
##   1 0.4583333 0.5416667
##   2 0.4230769 0.5769231
##   3 0.7500000 0.2500000
##   4 0.7391304 0.2608696
##   5 0.8787879 0.1212121
```

---

#### Part 5

```r
# Sum cases and deaths per cluster, then place the two results side by side
cbind(
  aggregate(total_cases ~ cluster, data = covid19_1, FUN = "sum"),
  aggregate(total_deaths ~ cluster, data = covid19_1, FUN = "sum")
)
```

```
##   cluster total_cases cluster total_deaths
## 1       1     4833767       1       181850
## 2       2     6558014       2       318579
## 3       3     1446595       3        33944
## 4       4     1238502       4        32231
## 5       5     2643637       5        56431
```

---

#### Using {tidyverse}

```r
library(tidyverse)

# Same totals as above, computed in one grouped summary
covid19_1 %>%
  group_by(cluster) %>%
  summarize(
    total_cases  = sum(total_cases),
    total_deaths = sum(total_deaths)
  )
```

```
## # A tibble: 5 x 3
##   cluster total_cases total_deaths
##     <int>       <int>        <int>
## 1       1     4833767       181850
## 2       2     6558014       318579
## 3       3     1446595        33944
## 4       4     1238502        32231
## 5       5     2643637        56431
```
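---

#### Using a single aggregate() call

Binding two `aggregate()` results with `cbind()`, as in Part 5, repeats the `cluster` column. As a minimal sketch, a two-column response on the left-hand side of the formula should give both sums in one data frame. The data frame `toy` and its values are hypothetical stand-ins for <tt>covid19_1</tt>; they are not taken from ‘Covid_2019.csv’.

```r
# Hypothetical toy data standing in for covid19_1 (values are made up)
toy <- data.frame(
  cluster      = c(1, 1, 2, 2, 3),
  total_cases  = c(100, 200, 50, 25, 10),
  total_deaths = c(5, 10, 2, 1, 0)
)

# cbind() on the left of the formula sums both columns in one call,
# so the cluster column appears only once in the result
totals <- aggregate(cbind(total_cases, total_deaths) ~ cluster,
                    data = toy, FUN = sum)
totals
```

With the real data, replacing `toy` by <tt>covid19_1</tt> would be the only change needed, assuming the `cluster` column has already been attached as in Part 4.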