Reading in the Data from Deliverable 1 Will take a subset of this data which includes country, health expenditure levels in 2018, life expectancy in 2018, and the country’s gini coefficient (measuring inequality in the country, with higher scores meaning higher inequality.)
NOTE FOR WHOLE DELIVERABLE: R instance could not load rgl, a required package for many of these sections (got errors with neverending loading.) The section taking out only the poorly scoring countries based on silhouettes would not run.
allData <- read.csv ("https://raw.githubusercontent.com/rhsu4/542_Deliv1/main/allDataWide.csv")
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
row.names(allData) = NULL
#check data types
str(allData)
'data.frame': 193 obs. of 9 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ HE2002 : num 78 314 335 2196 119 ...
$ HE2010 : num 138 452 648 2771 168 ...
$ HE2018 : num 186 697 963 3607 165 ...
$ GiniPercent: num NA 33.2 27.6 NA 51.3 NA 41.4 34.4 34.4 29.7 ...
$ GiniYear : num NA 2017 2011 NA 2018 ...
$ LE2002 : chr "56.784" "74.579" "71.605" ".." ...
$ LE2010 : chr "61.028" "76.562" "74.938" ".." ...
$ LE2018 : chr "64.486" "78.458" "76.693" ".." ...
#Have to make LE2018 numeric
allData$LE2018<-as.numeric(allData$LE2018)
NAs introduced by coercion
str(allData)
'data.frame': 193 obs. of 9 variables:
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ HE2002 : num 78 314 335 2196 119 ...
$ HE2010 : num 138 452 648 2771 168 ...
$ HE2018 : num 186 697 963 3607 165 ...
$ GiniPercent: num NA 33.2 27.6 NA 51.3 NA 41.4 34.4 34.4 29.7 ...
$ GiniYear : num NA 2017 2011 NA 2018 ...
$ LE2002 : chr "56.784" "74.579" "71.605" ".." ...
$ LE2010 : chr "61.028" "76.562" "74.938" ".." ...
$ LE2018 : num 64.5 78.5 76.7 NA 60.8 ...
Preparing Data for clustering
#Setting row names as country names doesn't work because countries are repeating
selection=c("Country", "GiniPercent", "HE2018", "LE2018")
dataToCluster=allData[,selection]
#set labels as row index
row.names(dataToCluster) = dataToCluster$Country
dataToCluster$Country=NULL
#This will give you an error if the country is repeated (because row index names need to be unique)
#Decide if data needs to be transformed
boxplot(dataToCluster, horizontal=T)
#Yes, has to be transformed, on different scales
Standardizing Data, because on a different scale
#### standardizing
DataToClusterStd<-as.data.frame(scale(dataToCluster))
### or smoothing
#log(DataToClusterStd)
boxplot(DataToClusterStd)
set.seed(999) # this is for replicability of results
dtcsFinal <- na.omit(DataToClusterStd)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
#This includes only instances that have no missing values (153 obs left)
library(cluster)
dtcsFinal_DM = daisy (x=dtcsFinal, metric = "gower")
Compute Clusters - Computer suggestions Using function fviz_nbclust from the library factoextra we can see how many clustered are suggested.
#for partitioning
library(factoextra)
Loading required package: ggplot2
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Missing values - have to either add imputed (mean) values, or remove variables that have missing data
##Life Expectancy - have to replace ".." with "NaN" (in python?)
##health expectancy - maybe just use this for now, and drop missing countries
fviz_nbclust(dtcsFinal,
pam,
diss=dataToCluster_DM,
method = "gap_stat",
k.max = 10,verbose = F)
#Optimal number of clusters = 9
#hierarchical - agglomerative
fviz_nbclust(dtcsFinal,
hcut,
diss=dataToCluster_DM,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "agnes")
#agglomerative optimal # of clusters = 4
#hierarchical (divisive)
fviz_nbclust(dtcsFinal,
hcut,
diss=dataToCluster_DM,
method = "gap_stat",
k.max = 10,
verbose = F,
hc_func = "diana")
#divisive clusters = 2
Apply function a priori clusters: (partitioning? first kind) = 9 agglomerative = 4 (or 8, test both) divisive = 7 (or 2, test both)
Because the agglomerative and divisive approaches had local maximums, I tested both the suggested number of clusters and the later maximums to see which gave the highest average silhouette width.
NumberOfClusterDesired=4
#For clusters = 9, highest avg silhouette width = .33
#For clusters = 4, highest avg silhouette width = .37 (pam and agnes)
#For clusters = 2, highest avg silhouette width = .38 (pam and diana)
#For clusters = 7, highest avg silhouette width = .34 (agnes)
#For clusters = 8, highest avg silhouette width = .34 (diana)
#FLAG: Is 2 clusters enough clusters?
#Will go with 4, highest avg silhouette width while > 2 clusters
# Partitioning technique
res.pam = pam(x=dataToCluster_DM,
k = NumberOfClusterDesired,
cluster.only = F)
# Hierarchical technique- agglomerative approach
#library(factoextra)
res.agnes= hcut(dataToCluster_DM,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='agnes',
hc_method = "ward.D2")
# Hierarchical technique- divisive approach
res.diana= hcut(dataToCluster_DM,
k = NumberOfClusterDesired,
isdiss=TRUE,
hc_func='diana',
hc_method = "ward.D2")
#Is recoding supposed to put these in order?
library(dplyr)
Attaching package: ‘dplyr’
The following object is masked from ‘package:car’:
recode
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
#2.1 Add results to original data frame:
dtcsFinal$pam=as.factor(res.pam$clustering)
dtcsFinal$agn=as.factor(res.agnes$cluster)
dtcsFinal$dia=as.factor(res.diana$cluster)
#Verify ordinality in clusters
#aggregate(data=dtcsFinal,
# Overallscore~pam, #how do we know what the dependent variable should be?
# FUN=mean)
aggregate(data=dtcsFinal,
LE2018~pam,
FUN=mean)
#FLAG: Why is this not ordering low to high?
aggregate(data=dtcsFinal,
LE2018~agn,
FUN=mean)
aggregate(data=dtcsFinal,
LE2018~dia,
FUN=mean)
Reordering the clusters based on above
dtcsFinal$pam=dplyr::recode_factor(dtcsFinal$pam,
`1` = '2',`2`='1',`3`='3',`4`='4')
dtcsFinal$agn=dplyr::recode_factor(dtcsFinal$agn,
`1` = '2',`2`='4',`3`='4',`4`='3')
dtcsFinal$dia=dplyr::recode_factor(dtcsFinal$dia,
`1` = '3',`2`='2',`3`='4',`4`='1')
fviz_silhouette(res.agnes)
NA
library(factoextra)
fviz_silhouette(res.diana)
NA
3.2 Detecting badly clustered cases
save individual silhouettes
head(data.frame(res.pam$silinfo$widths),10)
#Reordering cluster indices doesn't seeem to affect nearest neighbor. Why reorder?
Keeping only the negative clusters
pamEval=data.frame(res.pam$silinfo$widths)
agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)
pamPoor=rownames(pamEval[pamEval$sil_width<0,])
agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])
#install."qpcr"
library("qpcR")
^^NOTE for homework: could not get “qpcr” to work because rgl wouldn’t work. Tried installing pckgs separately, uninstalling whole thing, didn’t work
#If I can ever get rgl to work - see which instances are badly clustered
bap_Clus=as.data.frame(qpcR:::cbind.na(sort(pamPoor), sort(agnPoor),sort(diaPoor)))
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
there is no package called ‘rgl’
HOW TO COMPARE CLUSTERING
Prepare a bidimensional map. The function cmdscale can produce a two dimension map of points using the distance matrix:
projectedData = cmdscale(dataToCluster_DM, k=2)
The object projectedData is saving coordinates for each element in the data:
#
# save coordinates to original data frame:
dtcsFinal$dim1 = projectedData[,1]
dtcsFinal$dim2 = projectedData[,2]
# see some:
dtcsFinal[,c('dim1','dim2')][1:10,]
Use those points and see the “map”:
#base= ggplot(data=dtcsFinal,
# aes(x=dim1, y=dim2,
# label=Country))
#base + geom_text(size=2)
base= ggplot(data=dtcsFinal,
aes(x=dim1, y=dim2,
label = rownames(dtcsFinal)))
base + geom_text(size=2)
Color the map using the labels from PAM:
pamPlot=base + labs(title = "PAM") + geom_point(size=2,
aes(color=pam),
show.legend = T)
Color the map using the labels from Hierarchical AGNES:
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
aes(color=agn),
show.legend = T)
Color the map using the labels from Hierarchical DIANA:
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
aes(color=dia),
show.legend = T)
Compare visually:
library(ggpubr)
ggarrange(pamPlot, agnPlot, diaPlot,ncol = 3,common.legend = T)
Annotating outliers:
Prepare labels FLAG: need to have country var I think, this code doesn’t work without it. Or rowname?
# If name of country in black list, use it, else get rid of it
#LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
#LABELdia=ifelse(dtcsFinal$Country%in%diaPoor,dtcsFinal$Country,"")
#LABELagn=ifelse(dtcsFinal$Country%in%agnPoor,dtcsFinal$Country,"")
LABELpam=ifelse(dtcsFinal$rowname%in%pamPoor,dtcsFinal$rowname,"")
LABELdia=ifelse(dtcsFinal$rowname%in%diaPoor,dtcsFinal$rowname,"")
LABELagn=ifelse(dtcsFinal$rowname%in%agnPoor,dtcsFinal$rowname,"")
library(ggrepel)
pamPlot + geom_text_repel(aes(label=LABELpam))
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (153): label
Backtrace:
1. base `<fn>`(x)
2. ggplot2:::print.ggplot(x)
4. ggplot2:::ggplot_build.ggplot(x)
5. ggplot2 by_layer(function(l, d) l$compute_aesthetics(d, plot))
6. ggplot2 f(l = layers[[i]], d = data[[i]])
7. l$compute_aesthetics(d, plot)
8. ggplot2 f(..., self = self)
9. ggplot2:::check_aesthetics(evaled, n)
diaPlot + geom_text_repel(aes(label=LABELdia))
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (153): label
Backtrace:
1. base `<fn>`(x)
2. ggplot2:::print.ggplot(x)
4. ggplot2:::ggplot_build.ggplot(x)
5. ggplot2 by_layer(function(l, d) l$compute_aesthetics(d, plot))
6. ggplot2 f(l = layers[[i]], d = data[[i]])
7. l$compute_aesthetics(d, plot)
8. ggplot2 f(..., self = self)
9. ggplot2:::check_aesthetics(evaled, n)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
agnPlot + geom_text_repel(aes(label=LABELagn))
NOTE: Above section doesn’t work because it depends on rgl (in qpcR) which won’t load or install
The Dendogram (for hierarchical approaches)
fviz_dend(res.agnes,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "AGNES approach")
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
In addition: Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
fviz_dend(res.diana,k=NumberOfClusterDesired, cex = 0.45, horiz = T,main = "DIANA approach")
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Factor Analysis - Latent Concepts
#Creating FA data
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
#selection
dataForFA = dtcsFinal
names(dataForFA)
[1] "GiniPercent" "HE2018" "LE2018"
library(lavaan)
model = 'healthsys=~ HE2018 + LE2018 + GiniPercent'
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
fit <- cfa(model, data = dataForFA, std.lv = TRUE)
indexCFA=lavPredict(fit)
indexCFA[1:10]
[1] 0.47060846 0.47371795 -1.30874907 0.35562305 0.22450684 1.39878912 1.52759672 0.14741568
[9] -0.07564263 0.34613209
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
#Rescale to 0:10
library(scales)
indexCFANorm=rescale(as.vector(indexCFA),
to = c(0, 10))
indexCFANorm[1:10]
[1] 6.275947 6.283759 1.805721 5.987073 5.657673 8.607787 8.931387 5.463999 4.903617 5.963229
#This is our index
dataForFA$demo_FA=indexCFANorm
Comparing new index with original score
base=ggplot(data=dataForFA,
aes(x=demo_FA,y=LE2018))
base+geom_point()
evalCFA1=parameterEstimates(fit, standardized =TRUE)
evalCFA1[evalCFA1$op=="=~",c('rhs','std.all','pvalue')]
NA
NA
Some coefficients
evalCFA2=as.list(fitMeasures(fit))
You want p.value of chi-square greater than .05
evalCFA2[c("chisq", "df", "pvalue")]
$chisq
[1] 1.358913e-13
$df
[1] 0
$pvalue
[1] NA
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
This is not a greater p-value than .05
You want Tucker-Lewis > .9
evalCFA2$tli # > 0.90
[1] 1
This is does satisfy Tucker-Lewis criteria
You want RMSEA < 0.05:
evalCFA2[c( 'rmsea.ci.lower','rmsea','rmsea.ci.upper')]
$rmsea.ci.lower
[1] 0
$rmsea
[1] 0
$rmsea.ci.upper
[1] 0
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Does satisfy RMSEA critera
See how it looks:
#install.packages("semPlot")
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
library(semPlot)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
semPaths(fit, what='std', nCharNodes=0, sizeMan=12,
edge.label.cex=1.5, fade=T,residuals = F)
Health expenditure and Life expectancy are positively associated with model, gini percent is negatively associated. Life expectancy is most associated with the health system
REGRESSION
Outcome variable as Life Expectancy, HE2018 and GiniPercent as independent variables?
regData = dtcsFinal
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Prepare your hypothesis:
#FLAG - Gini is negatively coded (higher gini = higher inequality)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
# hypothesis 1: 2018LE increases as gini percent DECREASES (higher equality):
hypo1=formula(LE2018~ GiniPercent)
# hypothesis 2: 2018 LE increases as health expenditure of a country increases
hypo2=formula(LE2018~ HE2018)
#hypothesis 3: 2018 LE increases with higher health expenditure and lower gini score
# - does this mean needing to make gini negative?
hypo3=formula(LE2018~ HE2018 + GiniPercent)
#
# results
gauss1=glm(hypo1,
data = regData,
family = 'gaussian')
gauss2=glm(hypo2,
data = regData,
family = 'gaussian')
gauss3=glm(hypo3,
data = regData,
family = 'gaussian')
Hypothesis 1:
summary(gauss1)
Call:
glm(formula = hypo1, family = "gaussian", data = regData)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5296 -0.7054 0.2065 0.7032 1.5589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02443 0.07543 -0.324 0.746
GiniPercent -0.42332 0.07531 -5.621 8.93e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.8704231)
Null deviance: 158.93 on 152 degrees of freedom
Residual deviance: 131.43 on 151 degrees of freedom
AIC: 416.95
Number of Fisher Scoring iterations: 2
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
– yes, negative coeff on gini percent, p<.05
Hypothesis 2
summary(gauss2)
Call:
glm(formula = hypo2, family = "gaussian", data = regData)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2330 -0.3632 0.2508 0.5781 1.1319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02489 0.06767 -0.368 0.714
HE2018 0.56127 0.06444 8.710 4.97e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.7005489)
Null deviance: 158.93 on 152 degrees of freedom
Residual deviance: 105.78 on 151 degrees of freedom
AIC: 383.73
Number of Fisher Scoring iterations: 2
coefficient on health expenditure is positive, p>.05
summary(gauss3)
Call:
glm(formula = hypo3, family = "gaussian", data = regData)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8734 -0.3875 0.1500 0.5842 1.4118
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02272 0.06516 -0.349 0.727837
HE2018 0.47876 0.06618 7.234 2.25e-11 ***
GiniPercent -0.24870 0.06940 -3.584 0.000457 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.6496024)
Null deviance: 158.93 on 152 degrees of freedom
Residual deviance: 97.44 on 150 degrees of freedom
AIC: 373.16
Number of Fisher Scoring iterations: 2
Hypothesis as expected, HE 2018 pos coeff & significant, GiniPercent negative coeff and significant
anova(gauss1, gauss2, test="Chisq")
Analysis of Deviance Table
Model 1: LE2018 ~ GiniPercent
Model 2: LE2018 ~ HE2018
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 151 131.43
2 151 105.78 0 25.651
anova(gauss1, gauss3, test="Chisq")
Analysis of Deviance Table
Model 1: LE2018 ~ GiniPercent
Model 2: LE2018 ~ HE2018 + GiniPercent
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 151 131.43
2 150 97.44 1 33.994 4.692e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(gauss2, gauss3, test="Chisq")
Analysis of Deviance Table
Model 1: LE2018 ~ HE2018
Model 2: LE2018 ~ HE2018 + GiniPercent
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 151 105.78
2 150 97.44 1 8.3425 0.0003388 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model for the third hypothesis is chosen (combined HE + gini)
#install.packages("rsq")
library(rsq)
rsq(gauss3, adj=T)
[1] 0.3787309
Model 3 has the highest r2 at .379
5.1 Linearity between dependent variable and predictors is assumed, then these dots should follow a linear and horizontal trend
plot(gauss3,1)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Linear relationship doesn’t hold
5.2 Normality of residuals is assumed Visual exploration
plot(gauss3,2)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Mathematical exploration
# The data is normal if the p-value is above 0.05
shapiro.test(gauss3$residuals)
Shapiro-Wilk normality test
data: gauss3$residuals
W = 0.92613, p-value = 4.317e-07
so, data is not normally distributed
5.3. Homoscedasticity is assumed, so you need to check if residuals are spread equally along the ranges of predictors:
Visual exploration
plot(gauss2, 3)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Mathematical exploration
#install.packages("lmtest")
library(lmtest)
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
#pvalue<0.05 you cannot assume Homoscedasticity
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
bptest(gauss3)
studentized Breusch-Pagan test
data: gauss3
BP = 17.152, df = 2, p-value = 0.0001886
Cannot assume homoskedacity; p<.05
5.4. We assume that there is no colinearity, that is, that the predictors are not correlated.
library(car)
Loading required package: carData
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
vif(gauss3)
HE2018 GiniPercent
1.137649 1.137649
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
yes, lower than 5 - collinearity is not a problem
5.5. Analize the effect of atypical values. Determine if outliers (points that are far from the rest, but still in the trend) or high-leverage points (far from the trend but close to the rest) are influential:
Visual exploration
plot(gauss3,5)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
??What does this mean
Querying
gaussInf=as.data.frame(influence.measures(gauss3)$is.inf)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
gaussInf[gaussInf$cook.d,]
#install.packages("sjPlot")
library(sjPlot)
plot_models(gauss3,vline.color="grey")
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
PREDICTIVE APPROACH 1. Split the data set
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
In addition: Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
#install.packages("caret")
Registered S3 methods overwritten by 'rgl':
method from
knit_print.rglId
knit_print.rglOpen3d
sew.rglRecordedplot
Error in dyn.load(dynlib <- getDynlib(dir)) :
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so, 0x0006): Library not loaded: /opt/X11/lib/libGLU.1.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/rgl/libs/rgl.so
Reason: tried: '/opt/X11/lib/libGLU.1.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libGLU.1.dylib' (no such file), '/Users/Becca/lib/libGLU.1.dylib' (no such file), '/usr/local/lib/libGLU.1.dylib' (no such file), '/usr/lib/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/libGLU.1.dylib' (no such file), '/lib/libGLU.1.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server/libGLU.1.dylib' (no such file)
Warning: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
Warning: Trying without OpenGL...
set.seed(123)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
selection = createDataPartition(regData$LE2018,
p = 0.75,
list = FALSE)
#
trainGauss = regData[ selection, ]
#
testGauss = regData[-selection, ]
ctrl = trainControl(method = 'cv',number = 5)
gauss3CV = train(hypo3,
data = trainGauss,
method = 'glm',
trControl = ctrl)
non-uniform 'Rounding' sampler used
gauss3CV
Generalized Linear Model
117 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 93, 94, 94, 93, 94
Resampling results:
RMSE Rsquared MAE
0.8785279 0.393698 0.638966
R2 = .394
predictedVal<-predict(gauss3CV,testGauss)
Warning messages:
1: Loading rgl's DLL failed.
This build of rgl depends on XQuartz, which failed to load.
See the discussion in https://stackoverflow.com/a/66127391/2554330
2: Trying without OpenGL...
postResample(obs = testGauss$LE2018,
pred=predictedVal)
RMSE Rsquared MAE
0.7520189 0.5470718 0.5711473
From the information above, you do noy a better prediction: the Rsquared from postResample is below 0.5.
BINARY OUTCOME Ignore all below here, data is not binary If you have a binary dependent variable (this data is not binary)
#regData$LE2018dico=ifelse(regData$LE2018>median(regData$LE2018,
# na.rm = T),
# 1,0)
*EXPLANATORY APPROACH – IGNORE, data is not binary. Just including for practice 1. State hypothesis
hypoDico1=formula(LE2018dico~ GiniPercent)
hypoDico2=formula(LE2018dico~ HE2018 + GiniPercent)
demoidh$LE2018dico=factor(regData$LE2018dico)
Compute regression models
Log1=glm(hypoDico1,data = regData,
family="binomial")
Logi2=glm(hypoDico2, data=regData,
family="binomial")
See Results - First Hypothesis
summary(Logi1)
summary(Logi2)
Search for a better model:
lrtest(Logi1,Logi2)
Verify situation of a chosen model
9.1 Linearity assumption (Box-Tidwell test)
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.