Reading in the Data from Deliverable 1 Will take a subset of this data which includes country, health expenditure levels in 2018, life expectancy in 2018, and the country’s gini coefficient (measuring inequality in the country, with higher scores meaning higher inequality.)

NOTE FOR WHOLE DELIVERABLE: R instance could not load rgl, a required package for many of these sections (got errors with neverending loading.) The section taking out only the poorly scoring countries based on silhouettes would not run.

allData <- read.csv ("https://raw.githubusercontent.com/rhsu4/542_Deliv1/main/allDataWide.csv")
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
row.names(allData) = NULL

#check data types
str(allData)
'data.frame':   193 obs. of  9 variables:
 $ Country    : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ HE2002     : num  78 314 335 2196 119 ...
 $ HE2010     : num  138 452 648 2771 168 ...
 $ HE2018     : num  186 697 963 3607 165 ...
 $ GiniPercent: num  NA 33.2 27.6 NA 51.3 NA 41.4 34.4 34.4 29.7 ...
 $ GiniYear   : num  NA 2017 2011 NA 2018 ...
 $ LE2002     : chr  "56.784" "74.579" "71.605" ".." ...
 $ LE2010     : chr  "61.028" "76.562" "74.938" ".." ...
 $ LE2018     : chr  "64.486" "78.458" "76.693" ".." ...
#Have to make LE2018 numeric
allData$LE2018<-as.numeric(allData$LE2018)
NAs introduced by coercion
str(allData)
'data.frame':   193 obs. of  9 variables:
 $ Country    : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ HE2002     : num  78 314 335 2196 119 ...
 $ HE2010     : num  138 452 648 2771 168 ...
 $ HE2018     : num  186 697 963 3607 165 ...
 $ GiniPercent: num  NA 33.2 27.6 NA 51.3 NA 41.4 34.4 34.4 29.7 ...
 $ GiniYear   : num  NA 2017 2011 NA 2018 ...
 $ LE2002     : chr  "56.784" "74.579" "71.605" ".." ...
 $ LE2010     : chr  "61.028" "76.562" "74.938" ".." ...
 $ LE2018     : num  64.5 78.5 76.7 NA 60.8 ...

Preparing Data for clustering

#Setting row names as country names doesn't work because countries are repeating
selection=c("Country", "GiniPercent", "HE2018", "LE2018")
dataToCluster=allData[,selection]

#set labels as row index

row.names(dataToCluster) = dataToCluster$Country
dataToCluster$Country=NULL

#This will give you an error if the country is repeated (because row index names need to be unique)

#Decide if data needs to be transformed
boxplot(dataToCluster, horizontal=T)

#Yes, has to be transformed, on different scales

Standardizing Data, because on a different scale

#### standardizing
DataToClusterStd<-as.data.frame(scale(dataToCluster))

### or smoothing
#log(DataToClusterStd)

boxplot(DataToClusterStd)

  1. Compute the DISTANCE MATRIX
set.seed(999) # this is for replicability of results
dtcsFinal <- na.omit(DataToClusterStd)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
#This includes only instances that have no missing values (153 obs left)
library(cluster)
dtcsFinal_DM = daisy (x=dtcsFinal, metric = "gower")

Compute Clusters - Computer suggestions Using function fviz_nbclust from the library factoextra we can see how many clustered are suggested.

#for partitioning
library(factoextra)
Loading required package: ggplot2
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Missing values - have to either add imputed (mean) values, or remove variables that have missing data
##Life Expectancy - have to replace ".." with "NaN" (in python?)
##health expectancy - maybe just use this for now, and drop missing countries


fviz_nbclust(dtcsFinal, 
             pam,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,verbose = F)


#Optimal number of clusters = 9
#hierarchical - agglomerative 
fviz_nbclust(dtcsFinal, 
             hcut,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "agnes")

#agglomerative optimal # of clusters = 4
#hierarchical (divisive)
fviz_nbclust(dtcsFinal, 
             hcut,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "diana")


#divisive clusters = 2

Apply function a priori clusters: (partitioning? first kind) = 9 agglomerative = 4 (or 8, test both) divisive = 7 (or 2, test both)

Because the agglomerative and divisive approaches had local maximums, I tested both the suggested number of clusters and the later maximums to see which gave the highest average silhouette width.

NumberOfClusterDesired=4

#For clusters = 9, highest avg silhouette width = .33
#For clusters = 4, highest avg silhouette width = .37 (pam and agnes)
#For clusters = 2, highest avg silhouette width = .38 (pam and diana)

#For clusters = 7, highest avg silhouette width = .34 (agnes)
#For clusters = 8, highest avg silhouette width = .34 (diana)


#FLAG: Is 2 clusters enough clusters?
#Will go with 4, highest avg silhouette width while > 2 clusters

# Partitioning technique
res.pam = pam(x=dataToCluster_DM,
              k = NumberOfClusterDesired,
              cluster.only = F)

# Hierarchical technique- agglomerative approach

#library(factoextra)
res.agnes= hcut(dataToCluster_DM, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='agnes',
                hc_method = "ward.D2")

# Hierarchical technique- divisive approach
res.diana= hcut(dataToCluster_DM, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='diana',
                hc_method = "ward.D2")
#Is recoding supposed to put these in order?
library(dplyr)

Attaching package: ‘dplyr’

The following object is masked from ‘package:car’:

    recode

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
  1. Clustering Results Creating clusters using the different techniques and with the optimal cluster number (4).
#2.1 Add results to original data frame:
dtcsFinal$pam=as.factor(res.pam$clustering)
dtcsFinal$agn=as.factor(res.agnes$cluster)
dtcsFinal$dia=as.factor(res.diana$cluster)
#Verify ordinality in clusters
#aggregate(data=dtcsFinal,
#          Overallscore~pam, #how do we know what the dependent variable should be?
#          FUN=mean)

aggregate(data=dtcsFinal,
          LE2018~pam,
          FUN=mean)
#FLAG: Why is this not ordering low to high?
aggregate(data=dtcsFinal,
         LE2018~agn,
          FUN=mean)
aggregate(data=dtcsFinal,
          LE2018~dia,
          FUN=mean)

Reordering the clusters based on above

dtcsFinal$pam=dplyr::recode_factor(dtcsFinal$pam, 
                  `1` = '2',`2`='1',`3`='3',`4`='4')
dtcsFinal$agn=dplyr::recode_factor(dtcsFinal$agn, 
                  `1` = '2',`2`='4',`3`='4',`4`='3')
dtcsFinal$dia=dplyr::recode_factor(dtcsFinal$dia, 
                  `1` = '3',`2`='2',`3`='4',`4`='1')
  1. Evaluate Results

fviz_silhouette(res.agnes)
NA

library(factoextra)
fviz_silhouette(res.diana)
NA

3.2 Detecting badly clustered cases

save individual silhouettes

head(data.frame(res.pam$silinfo$widths),10)

#Reordering cluster indices doesn't seeem to affect nearest neighbor. Why reorder?

Keeping only the negative clusters

pamEval=data.frame(res.pam$silinfo$widths)
agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)

pamPoor=rownames(pamEval[pamEval$sil_width<0,])
agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])
#install."qpcr"
library("qpcR")

^^NOTE for homework: could not get “qpcr” to work because rgl wouldn’t work. Tried installing pckgs separately, uninstalling whole thing, didn’t work

#If I can ever get rgl to work - see which instances are badly clustered
bap_Clus=as.data.frame(qpcR:::cbind.na(sort(pamPoor), sort(agnPoor),sort(diaPoor)))
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  there is no package called ‘rgl’

HOW TO COMPARE CLUSTERING

Prepare a bidimensional map. The function cmdscale can produce a two dimension map of points using the distance matrix:

projectedData = cmdscale(dataToCluster_DM, k=2)

The object projectedData is saving coordinates for each element in the data:

#
# save coordinates to original data frame:
dtcsFinal$dim1 = projectedData[,1]
dtcsFinal$dim2 = projectedData[,2]

# see some:

dtcsFinal[,c('dim1','dim2')][1:10,]

Use those points and see the “map”:

#base= ggplot(data=dtcsFinal,
#             aes(x=dim1, y=dim2,
#                 label=Country)) 
#base + geom_text(size=2)

base= ggplot(data=dtcsFinal,
             aes(x=dim1, y=dim2,
                label = rownames(dtcsFinal))) 
base + geom_text(size=2)

Color the map using the labels from PAM:

pamPlot=base + labs(title = "PAM") + geom_point(size=2,
                                              aes(color=pam),
                                              show.legend = T) 

Color the map using the labels from Hierarchical AGNES:

agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
                                              aes(color=agn),
                                              show.legend = T) 

Color the map using the labels from Hierarchical DIANA:

diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
                                              aes(color=dia),
                                              show.legend = T) 

Compare visually:

library(ggpubr)

ggarrange(pamPlot, agnPlot, diaPlot,ncol = 3,common.legend = T)

Annotating outliers:

Prepare labels FLAG: need to have country var I think, this code doesn’t work without it. Or rowname?

# If name of country in black list, use it, else get rid of it
#LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
#LABELdia=ifelse(dtcsFinal$Country%in%diaPoor,dtcsFinal$Country,"")
#LABELagn=ifelse(dtcsFinal$Country%in%agnPoor,dtcsFinal$Country,"")

LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
LABELdia=ifelse(dtcsFinal$rowname%in%diaPoor,dtcsFinal$rowname,"")
LABELagn=ifelse(dtcsFinal$rowname%in%agnPoor,dtcsFinal$rowname,"")
library(ggrepel)
pamPlot + geom_text_repel(aes(label=LABELpam))
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (153): label
Backtrace:
 1. base `<fn>`(x)
 2. ggplot2:::print.ggplot(x)
 4. ggplot2:::ggplot_build.ggplot(x)
 5. ggplot2 by_layer(function(l, d) l$compute_aesthetics(d, plot))
 6. ggplot2 f(l = layers[[i]], d = data[[i]])
 7. l$compute_aesthetics(d, plot)
 8. ggplot2 f(..., self = self)
 9. ggplot2:::check_aesthetics(evaled, n)

diaPlot + geom_text_repel(aes(label=LABELdia))
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (153): label
Backtrace:
 1. base `<fn>`(x)
 2. ggplot2:::print.ggplot(x)
 4. ggplot2:::ggplot_build.ggplot(x)
 5. ggplot2 by_layer(function(l, d) l$compute_aesthetics(d, plot))
 6. ggplot2 f(l = layers[[i]], d = data[[i]])
 7. l$compute_aesthetics(d, plot)
 8. ggplot2 f(..., self = self)
 9. ggplot2:::check_aesthetics(evaled, n)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

agnPlot + geom_text_repel(aes(label=LABELagn))

NOTE: Above section doesn’t work because it depends on rgl (in qpcR) which won’t load or install

The Dendogram (for hierarchical approaches)

fviz_dend(res.agnes,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "AGNES approach")
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
In addition: Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

fviz_dend(res.diana,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "DIANA approach")
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.

Factor Analysis - Latent Concepts

#Creating FA data
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
#selection
dataForFA = dtcsFinal
names(dataForFA)
[1] "GiniPercent" "HE2018"      "LE2018"     
library(lavaan)
model = 'healthsys=~ HE2018 + LE2018 + GiniPercent'
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
fit <- cfa(model, data = dataForFA, std.lv = TRUE)
indexCFA=lavPredict(fit)
indexCFA[1:10]
 [1]  0.47060846  0.47371795 -1.30874907  0.35562305  0.22450684  1.39878912  1.52759672  0.14741568
 [9] -0.07564263  0.34613209
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
#Rescale to 0:10

library(scales)
indexCFANorm=rescale(as.vector(indexCFA), 
                     to = c(0, 10))
indexCFANorm[1:10]
 [1] 6.275947 6.283759 1.805721 5.987073 5.657673 8.607787 8.931387 5.463999 4.903617 5.963229
#This is our index

dataForFA$demo_FA=indexCFANorm

Comparing new index with original score

base=ggplot(data=dataForFA,
            aes(x=demo_FA,y=LE2018))
base+geom_point()


evalCFA1=parameterEstimates(fit, standardized =TRUE)

evalCFA1[evalCFA1$op=="=~",c('rhs','std.all','pvalue')]
NA
NA

Some coefficients

evalCFA2=as.list(fitMeasures(fit))

You want p.value of chi-square greater than .05

evalCFA2[c("chisq", "df", "pvalue")] 
$chisq
[1] 1.358913e-13

$df
[1] 0

$pvalue
[1] NA
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

This is not a greater p-value than .05

You want Tucker-Lewis > .9

evalCFA2$tli # > 0.90
[1] 1

This is does satisfy Tucker-Lewis criteria

You want RMSEA < 0.05:

evalCFA2[c( 'rmsea.ci.lower','rmsea','rmsea.ci.upper')] 
$rmsea.ci.lower
[1] 0

$rmsea
[1] 0

$rmsea.ci.upper
[1] 0
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

Does satisfy RMSEA critera

See how it looks:

#install.packages("semPlot")
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
library(semPlot)
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
semPaths(fit, what='std', nCharNodes=0, sizeMan=12,
         edge.label.cex=1.5, fade=T,residuals = F)

Health expenditure and Life expectancy are positively associated with model, gini percent is negatively associated. Life expectancy is most associated with the health system

REGRESSION

Outcome variable as Life Expectancy, HE2018 and GiniPercent as independent variables?

regData = dtcsFinal
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
  1. State hypothesis: Health Expenditure and the Gini Index of a country have an impact on the life expectancy in that country

Prepare your hypothesis:

#FLAG -  Gini is negatively coded (higher gini = higher inequality)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
# hypothesis 1: 2018LE increases as gini percent DECREASES (higher equality):
hypo1=formula(LE2018~ GiniPercent)

# hypothesis 2: 2018 LE increases as health expenditure of a country increases 

hypo2=formula(LE2018~ HE2018)

#hypothesis 3: 2018 LE increases with higher health expenditure and lower gini score
# -  does this mean needing to make gini negative?
hypo3=formula(LE2018~ HE2018 + GiniPercent)
  1. Compute regression models
#
# results
gauss1=glm(hypo1,
           data = regData,
           family = 'gaussian')

gauss2=glm(hypo2,
           data = regData,
           family = 'gaussian')


gauss3=glm(hypo3,
           data = regData,
           family = 'gaussian')
  1. See Results

Hypothesis 1:

summary(gauss1)

Call:
glm(formula = hypo1, family = "gaussian", data = regData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5296  -0.7054   0.2065   0.7032   1.5589  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02443    0.07543  -0.324    0.746    
GiniPercent -0.42332    0.07531  -5.621 8.93e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.8704231)

    Null deviance: 158.93  on 152  degrees of freedom
Residual deviance: 131.43  on 151  degrees of freedom
AIC: 416.95

Number of Fisher Scoring iterations: 2
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

– yes, negative coeff on gini percent, p<.05

Hypothesis 2

summary(gauss2)

Call:
glm(formula = hypo2, family = "gaussian", data = regData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2330  -0.3632   0.2508   0.5781   1.1319  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02489    0.06767  -0.368    0.714    
HE2018       0.56127    0.06444   8.710 4.97e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.7005489)

    Null deviance: 158.93  on 152  degrees of freedom
Residual deviance: 105.78  on 151  degrees of freedom
AIC: 383.73

Number of Fisher Scoring iterations: 2

coefficient on health expenditure is positive, p>.05

summary(gauss3)

Call:
glm(formula = hypo3, family = "gaussian", data = regData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.8734  -0.3875   0.1500   0.5842   1.4118  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02272    0.06516  -0.349 0.727837    
HE2018       0.47876    0.06618   7.234 2.25e-11 ***
GiniPercent -0.24870    0.06940  -3.584 0.000457 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.6496024)

    Null deviance: 158.93  on 152  degrees of freedom
Residual deviance:  97.44  on 150  degrees of freedom
AIC: 373.16

Number of Fisher Scoring iterations: 2

Hypothesis as expected, HE 2018 pos coeff & significant, GiniPercent negative coeff and significant

  1. Search for a better model
anova(gauss1, gauss2, test="Chisq")
Analysis of Deviance Table

Model 1: LE2018 ~ GiniPercent
Model 2: LE2018 ~ HE2018
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       151     131.43                     
2       151     105.78  0   25.651         
anova(gauss1, gauss3, test="Chisq")
Analysis of Deviance Table

Model 1: LE2018 ~ GiniPercent
Model 2: LE2018 ~ HE2018 + GiniPercent
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       151     131.43                          
2       150      97.44  1   33.994 4.692e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gauss2, gauss3, test="Chisq")
Analysis of Deviance Table

Model 1: LE2018 ~ HE2018
Model 2: LE2018 ~ HE2018 + GiniPercent
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       151     105.78                          
2       150      97.44  1   8.3425 0.0003388 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model for the third hypothesis is chosen (combined HE + gini)

#install.packages("rsq")
library(rsq)
rsq(gauss3, adj=T)
[1] 0.3787309

Model 3 has the highest r2 at .379

  1. Verify the situation of the chosen model:

5.1 Linearity between dependent variable and predictors is assumed, then these dots should follow a linear and horizontal trend

plot(gauss3,1)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

Linear relationship doesn’t hold

5.2 Normality of residuals is assumed Visual exploration

plot(gauss3,2)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

Mathematical exploration

# The data is normal if the p-value is above 0.05
shapiro.test(gauss3$residuals)

    Shapiro-Wilk normality test

data:  gauss3$residuals
W = 0.92613, p-value = 4.317e-07

so, data is not normally distributed

5.3. Homoscedasticity is assumed, so you need to check if residuals are spread equally along the ranges of predictors:

Visual exploration

plot(gauss2, 3)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

Mathematical exploration

#install.packages("lmtest")
library(lmtest)
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric
#pvalue<0.05 you cannot assume Homoscedasticity
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
bptest(gauss3) 

    studentized Breusch-Pagan test

data:  gauss3
BP = 17.152, df = 2, p-value = 0.0001886

Cannot assume homoskedacity; p<.05

5.4. We assume that there is no colinearity, that is, that the predictors are not correlated.

library(car)
Loading required package: carData
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
vif(gauss3)
     HE2018 GiniPercent 
   1.137649    1.137649 
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

yes, lower than 5 - collinearity is not a problem

5.5. Analize the effect of atypical values. Determine if outliers (points that are far from the rest, but still in the trend) or high-leverage points (far from the trend but close to the rest) are influential:

Visual exploration

plot(gauss3,5)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 

??What does this mean

Querying

gaussInf=as.data.frame(influence.measures(gauss3)$is.inf)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
gaussInf[gaussInf$cook.d,]
  1. Finally, a nice summary plot of your work:
#install.packages("sjPlot")
library(sjPlot)
plot_models(gauss3,vline.color="grey")
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...

PREDICTIVE APPROACH 1. Split the data set

Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
In addition: Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
#install.packages("caret")
Registered S3 methods overwritten by 'rgl':
  method               from
  knit_print.rglId         
  knit_print.rglOpen3d     
  sew.rglRecordedplot      
Error in dyn.load(dynlib <- getDynlib(dir)) : 
  unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
  Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning:    Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
set.seed(123)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
selection = createDataPartition(regData$LE2018,
                                p = 0.75,
                                list = FALSE)
#
trainGauss = regData[ selection, ]
#
testGauss  = regData[-selection, ]
ctrl = trainControl(method = 'cv',number = 5)

gauss3CV = train(hypo3,
                 data = trainGauss, 
                 method = 'glm',
                 trControl = ctrl)
non-uniform 'Rounding' sampler used
gauss3CV
Generalized Linear Model 

117 samples
  2 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 93, 94, 94, 93, 94 
Resampling results:

  RMSE       Rsquared  MAE     
  0.8785279  0.393698  0.638966

R2 = .394

  1. Evaluate performance
predictedVal<-predict(gauss3CV,testGauss)
Warning messages:
1:  Loading rgl's DLL failed. 
    This build of rgl depends on XQuartz, which failed to load.
 See the discussion in https://stackoverflow.com/a/66127391/2554330 
2: Trying without OpenGL... 
postResample(obs = testGauss$LE2018,
             pred=predictedVal)
     RMSE  Rsquared       MAE 
0.7520189 0.5470718 0.5711473 

From the information above, you do noy a better prediction: the Rsquared from postResample is below 0.5.

BINARY OUTCOME Ignore all below here, data is not binary If you have a binary dependent variable (this data is not binary)

#regData$LE2018dico=ifelse(regData$LE2018>median(regData$LE2018,
#                                       na.rm = T),
#                     1,0)

*EXPLANATORY APPROACH – IGNORE, data is not binary. Just including for practice 1. State hypothesis

hypoDico1=formula(LE2018dico~ GiniPercent)
hypoDico2=formula(LE2018dico~ HE2018 + GiniPercent)
  1. Reformat
demoidh$LE2018dico=factor(regData$LE2018dico)

Compute regression models

Log1=glm(hypoDico1,data = regData,
         family="binomial")

Logi2=glm(hypoDico2, data=regData,
          family="binomial")

See Results - First Hypothesis

summary(Logi1)
summary(Logi2)

Search for a better model:

lrtest(Logi1,Logi2)

Verify situation of a chosen model

9.1 Linearity assumption (Box-Tidwell test)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

---
title: "Deliverable 2: Cluster Analysis, Factor Analysis, Regression"
output: html_notebook
---
Reading in the Data from Deliverable 1
Will take a subset of this data which includes country, health expenditure levels in 2018, life expectancy in 2018, and the country's gini coefficient (measuring inequality in the country, with higher scores meaning higher inequality.)

NOTE FOR WHOLE DELIVERABLE: R instance could not load rgl, a required package for many of these sections (got errors with neverending loading.) The section taking out only the poorly scoring countries based on silhouettes would not run.
```{r}
allData <- read.csv ("https://raw.githubusercontent.com/rhsu4/542_Deliv1/main/allDataWide.csv")

row.names(allData) = NULL

#check data types
str(allData)

#Have to make LE2018 numeric
allData$LE2018<-as.numeric(allData$LE2018)

str(allData)

```
Preparing Data for clustering

```{r}
#Setting row names as country names doesn't work because countries are repeating
selection=c("Country", "GiniPercent", "HE2018", "LE2018")
dataToCluster=allData[,selection]

#set labels as row index

row.names(dataToCluster) = dataToCluster$Country
dataToCluster$Country=NULL

#This will give you an error if the country is repeated (because row index names need to be unique)

#Decide if data needs to be transformed
boxplot(dataToCluster, horizontal=T)
#Yes, has to be transformed, on different scales

```
Standardizing Data, because on a different scale

```{r}
#### standardizing
DataToClusterStd<-as.data.frame(scale(dataToCluster))

### or smoothing
#log(DataToClusterStd)

boxplot(DataToClusterStd)
```
II. Compute the DISTANCE MATRIX

```{r}
set.seed(999) # this is for replicability of results
```


```{r}
dtcsFinal <- na.omit(DataToClusterStd)
#This includes only instances that have no missing values (153 obs left)
```

```{r}
library(cluster)
dtcsFinal_DM = daisy (x=dtcsFinal, metric = "gower")
#This creates the distance metric
```


Compute Clusters - Computer suggestions
Using function fviz_nbclust from the library factoextra we can see how many clustered are suggested.


```{r}
#for partitioning
library(factoextra)
```

```{r}
#Missing values - have to either add imputed (mean) values, or remove variables that have missing data
#Here, removing NAs

fviz_nbclust(dtcsFinal, 
             pam,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,verbose = F)

#Optimal number of clusters = 9
```

```{r}
#hierarchical - agglomerative 
fviz_nbclust(dtcsFinal, 
             hcut,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "agnes")
#agglomerative optimal # of clusters = 4? 4 is a local average though, so maybe 8 or 9 is better
```

```{r}
#hierarchical (divisive)
fviz_nbclust(dtcsFinal, 
             hcut,
             diss=dataToCluster_DM,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "diana")

#divisive clusters = 2 with local average. Actual ideal might be 7
```

Apply function
a priori clusters:
(partitioning? first kind) = 9
agglomerative = 4 (or 8, test both)
divisive = 7 (or 2, test both)

Because the agglomerative and divisive approaches had local maximums, I tested both the suggested number of clusters and the later maximums to see which gave the highest average silhouette width.

```{r}
NumberOfClusterDesired=4

#For clusters = 9, highest avg silhouette width = .33
#For clusters = 4, highest avg silhouette width = .37 (pam and agnes)
#For clusters = 2, highest avg silhouette width = .38 (pam and diana)

#For clusters = 7, highest avg silhouette width = .34 (agnes)
#For clusters = 8, highest avg silhouette width = .34 (diana)


#FLAG: Is 2 clusters enough clusters?
#Will go with 4, highest avg silhouette width while > 2 clusters

# Partitioning technique
res.pam = pam(x=dataToCluster_DM,
              k = NumberOfClusterDesired,
              cluster.only = F)

# Hierarchical technique- agglomerative approach

#library(factoextra)
res.agnes= hcut(dataToCluster_DM, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='agnes',
                hc_method = "ward.D2")

# Hierarchical technique- divisive approach
res.diana= hcut(dataToCluster_DM, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='diana',
                hc_method = "ward.D2")

```

```{r}
#Is recoding supposed to put these in order?
library(dplyr)
```

2. Clustering Results
Creating clusters using the different techniques and with the optimal cluster number (4).

```{r}
#2.1 Add results to original data frame:
dtcsFinal$pam=as.factor(res.pam$clustering)
dtcsFinal$agn=as.factor(res.agnes$cluster)
dtcsFinal$dia=as.factor(res.diana$cluster)
```


```{r}
#Verify ordinality in clusters
#aggregate(data=dtcsFinal,
#          Overallscore~pam, #how do we know what the dependent variable should be?
#          FUN=mean)

aggregate(data=dtcsFinal,
          LE2018~pam,
          FUN=mean)
#These are not ordinal data
```

```{r}
aggregate(data=dtcsFinal,
         LE2018~agn,
          FUN=mean)
```

```{r}
aggregate(data=dtcsFinal,
          LE2018~dia,
          FUN=mean)
```

Reordering the clusters based on above
```{r}
dtcsFinal$pam=dplyr::recode_factor(dtcsFinal$pam, 
                  `1` = '2',`2`='1',`3`='3',`4`='4')
dtcsFinal$agn=dplyr::recode_factor(dtcsFinal$agn, 
                  `1` = '2',`2`='4',`3`='4',`4`='3')
dtcsFinal$dia=dplyr::recode_factor(dtcsFinal$dia, 
                  `1` = '3',`2`='2',`3`='4',`4`='1')
```



3. Evaluate Results
```{r}
fviz_silhouette(res.pam)

```



```{r}
fviz_silhouette(res.agnes)

```

```{r}
library(factoextra)
fviz_silhouette(res.diana)

```

3.2 Detecting badly clustered cases

save individual silhouettes

```{r}
head(data.frame(res.pam$silinfo$widths),10)
```
Keeping only the negative clusters

```{r}
pamEval=data.frame(res.pam$silinfo$widths)
agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)

pamPoor=rownames(pamEval[pamEval$sil_width<0,])
agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])
```

```{r}
#install."qpcr"
library("qpcR")

```

^^NOTE for homework: could not get "qpcr" to work because rgl wouldn't work. Tried installing pckgs separately, uninstalling whole thing, didn't work

```{r}
#If I can ever get rgl to work - see which instances are badly clustered
bap_Clus=as.data.frame(qpcR:::cbind.na(sort(pamPoor), sort(agnPoor),sort(diaPoor)))
names(bap_Clus)=c("pam","agn","dia")
bap_Clus

```


HOW TO COMPARE CLUSTERING

Prepare a bidimensional map. The function cmdscale can produce a two dimension map of points using the distance matrix:

```{r}
projectedData = cmdscale(dataToCluster_DM, k=2)

```

The object projectedData is saving coordinates for each element in the data:

```{r}
#
# save coordinates to original data frame:
dtcsFinal$dim1 = projectedData[,1]
dtcsFinal$dim2 = projectedData[,2]

# see some:

dtcsFinal[,c('dim1','dim2')][1:10,]
```

Use those points and see the “map”:

```{r}
#base= ggplot(data=dtcsFinal,
#             aes(x=dim1, y=dim2,
#                 label=Country)) 
#base + geom_text(size=2)

base= ggplot(data=dtcsFinal,
             aes(x=dim1, y=dim2,
                label = rownames(dtcsFinal))) 
base + geom_text(size=2)
```

Color the map using the labels from PAM:

```{r}
pamPlot=base + labs(title = "PAM") + geom_point(size=2,
                                              aes(color=pam),
                                              show.legend = T) 
```

Color the map using the labels from Hierarchical AGNES:

```{r}
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
                                              aes(color=agn),
                                              show.legend = T) 
```

Color the map using the labels from Hierarchical DIANA:
```{r}
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
                                              aes(color=dia),
                                              show.legend = T) 
```

Compare visually:

```{r}
library(ggpubr)

ggarrange(pamPlot, agnPlot, diaPlot,ncol = 3,common.legend = T)
```

Annotating outliers:

Prepare labels
FLAG: need to have country var I think, this code doesn't work without it. Or rowname?
```{r}
# If name of country in black list, use it, else get rid of it
#LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
#LABELdia=ifelse(dtcsFinal$Country%in%diaPoor,dtcsFinal$Country,"")
#LABELagn=ifelse(dtcsFinal$Country%in%agnPoor,dtcsFinal$Country,"")

#HOMEWORK NOTE: This isn't working because rgl package won't work
LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
LABELdia=ifelse(dtcsFinal$rowname%in%diaPoor,dtcsFinal$rowname,"")
LABELagn=ifelse(dtcsFinal$rowname%in%agnPoor,dtcsFinal$rowname,"")
```

```{r}
library(ggrepel)
pamPlot + geom_text_repel(aes(label=LABELpam))
```

```{r}
diaPlot + geom_text_repel(aes(label=LABELdia))
```

```{r}
agnPlot + geom_text_repel(aes(label=LABELagn))
```

NOTE: Above section doesn't work because it depends on rgl (in qpcR) which won't load or install

The Dendogram (for hierarchical approaches)
```{r}
fviz_dend(res.agnes,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "AGNES approach")
```

```{r}
fviz_dend(res.diana,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "DIANA approach")

```




Factor Analysis
- Latent Concepts
```{r}
#Creating FA data
#selection
dataForFA = dtcsFinal
names(dataForFA)

```

```{r}
library(lavaan)
```

```{r}
model = 'healthsys=~ HE2018 + LE2018 + GiniPercent'

fit <- cfa(model, data = dataForFA, std.lv = TRUE)
indexCFA=lavPredict(fit)

```

```{r}
indexCFA[1:10]

#Rescale to 0:10

library(scales)
indexCFANorm=rescale(as.vector(indexCFA), 
                     to = c(0, 10))
indexCFANorm[1:10]


#This is our index

dataForFA$demo_FA=indexCFANorm
```
 
Comparing new index with original score
```{r}
base=ggplot(data=dataForFA,
            aes(x=demo_FA,y=LE2018))
base+geom_point()
```

```{r}

evalCFA1=parameterEstimates(fit, standardized =TRUE)

evalCFA1[evalCFA1$op=="=~",c('rhs','std.all','pvalue')]


```

Some coefficients
```{r}
evalCFA2=as.list(fitMeasures(fit))
```


You want p.value of chi-square greater than .05
```{r}
evalCFA2[c("chisq", "df", "pvalue")] 
```
This is not a greater p-value than .05

You want Tucker-Lewis > .9
```{r}
evalCFA2$tli # > 0.90
```
This is does satisfy Tucker-Lewis criteria

You want RMSEA < 0.05:
```{r}
evalCFA2[c( 'rmsea.ci.lower','rmsea','rmsea.ci.upper')] 
```

Does satisfy RMSEA critera

See how it looks: 
```{r}
#install.packages("semPlot")
library(semPlot)

```


```{r}
semPaths(fit, what='std', nCharNodes=0, sizeMan=12,
         edge.label.cex=1.5, fade=T,residuals = F)
```
Health expenditure and Life expectancy are positively associated with model, gini percent is negatively associated. Life expectancy is most associated with the health system


REGRESSION 

Outcome variable as Life Expectancy, HE2018 and GiniPercent as independent variables?

```{r}
regData = dtcsFinal
`
```

* EXPLANATORY APPROACH
1. State hypothesis: Health Expenditure and the Gini Index of a country have an impact on the life expectancy in that country

Prepare your hypothesis:
```{r}
#FLAG -  Gini is negatively coded (higher gini = higher inequality)
# hypothesis 1: 2018LE increases as gini percent DECREASES (higher equality):
hypo1=formula(LE2018~ GiniPercent)

# hypothesis 2: 2018 LE increases as health expenditure of a country increases 

hypo2=formula(LE2018~ HE2018)

#hypothesis 3: 2018 LE increases with higher health expenditure and lower gini score
# -  does this mean needing to make gini negative?
hypo3=formula(LE2018~ HE2018 + GiniPercent)

```

2. Compute regression models

```{r}
#
# results
gauss1=glm(hypo1,
           data = regData,
           family = 'gaussian')

gauss2=glm(hypo2,
           data = regData,
           family = 'gaussian')


gauss3=glm(hypo3,
           data = regData,
           family = 'gaussian')
```

3. See Results

Hypothesis 1:
```{r}
summary(gauss1)

```

-- yes, negative coeff on gini percent, p<.05

Hypothesis 2

```{r}
summary(gauss2)
```

coefficient on health expenditure is positive, p>.05

```{r}
summary(gauss3)
```
Hypothesis as expected, HE 2018 pos coeff & significant, GiniPercent negative coeff and significant

4. Search for a better model
```{r}
anova(gauss1, gauss2, test="Chisq")
anova(gauss1, gauss3, test="Chisq")
anova(gauss2, gauss3, test="Chisq")

```

Model for the third hypothesis is chosen (combined HE + gini)
```{r}
#install.packages("rsq")
library(rsq)
```
```{r}
rsq(gauss3, adj=T)
```

Model 3 has the highest r2 at .379

5. Verify the situation of the chosen model:

5.1 Linearity between dependent variable and predictors is assumed, then these dots should follow a linear and horizontal trend
```{r}
plot(gauss3,1)
```
Linear relationship doesn't hold

5.2 Normality of residuals is assumed
Visual exploration

```{r}
plot(gauss3,2)
```
Mathematical exploration
```{r}
# The data is normal if the p-value is above 0.05
shapiro.test(gauss3$residuals)
```

so, data is not normally distributed


5.3. Homoscedasticity is assumed, so you need to check if residuals are spread equally along the ranges of predictors:

Visual exploration

```{r}
plot(gauss2, 3)
```

Mathematical exploration
```{r}
#install.packages("lmtest")
library(lmtest)
```

```{r}
#pvalue<0.05 you cannot assume Homoscedasticity
bptest(gauss3) 
```
Cannot assume homoskedacity; p<.05

5.4. We assume that there is no colinearity, that is, that the predictors are not correlated.

```{r}
library(car)
```

```{r}
vif(gauss3)
```
yes, lower than 5 - collinearity is not a problem

5.5. Analize the effect of atypical values. Determine if outliers (points that are far from the rest, but still in the trend) or high-leverage points (far from the trend but close to the rest) are influential:

Visual exploration
```{r}
plot(gauss3,5)
```
??What does this mean

Querying

```{r}
gaussInf=as.data.frame(influence.measures(gauss3)$is.inf)
gaussInf[gaussInf$cook.d,]
```
6. Finally, a nice summary plot of your work:
```{r}
#install.packages("sjPlot")
library(sjPlot)
```

```{r}
plot_models(gauss3,vline.color="grey")
```
PREDICTIVE APPROACH
1. Split the data set
```{r}
#install.packages("caret")
library(caret)
```

```{r}
set.seed(123)

selection = createDataPartition(regData$LE2018,
                                p = 0.75,
                                list = FALSE)
#
trainGauss = regData[ selection, ]
#
testGauss  = regData[-selection, ]
```


```{r}
ctrl = trainControl(method = 'cv',number = 5)

gauss3CV = train(hypo3,
                 data = trainGauss, 
                 method = 'glm',
                 trControl = ctrl)

gauss3CV
```

R2 = .394

3. Evaluate performance

```{r}
predictedVal<-predict(gauss3CV,testGauss)

postResample(obs = testGauss$LE2018,
             pred=predictedVal)
```

From the information above, you do noy a better prediction: the Rsquared from postResample is below 0.5.

BINARY OUTCOME
Ignore all below here, data is not binary
If you have a binary dependent variable (this data is not binary)

```{r}
#regData$LE2018dico=ifelse(regData$LE2018>median(regData$LE2018,
#                                       na.rm = T),
#                     1,0)
```

*EXPLANATORY APPROACH -- IGNORE, data is not binary. Just including for practice
1. State hypothesis
```{r}
hypoDico1=formula(LE2018dico~ GiniPercent)
hypoDico2=formula(LE2018dico~ HE2018 + GiniPercent)
```

2. Reformat
```{r}
demoidh$LE2018dico=factor(regData$LE2018dico)
```

Compute regression models
```{r}
Log1=glm(hypoDico1,data = regData,
         family="binomial")

Logi2=glm(hypoDico2, data=regData,
          family="binomial")

```
See Results
- First Hypothesis

```{r}
summary(Logi1)
```

- Second Hypothesis
```{r}
summary(Logi2)
```

Search for a better model:
```{r}
lrtest(Logi1,Logi2)
```

Verify situation of a chosen model

9.1 Linearity assumption (Box-Tidwell test)

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Cmd+Shift+K* to preview the HTML file). 

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

