Week 11

Getting started: Load libraries

The data used in this analysis are from the Human Freedom Index 2023

Importing the dataset

data <- read_csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/main/Human%20Freedom%20Index_2023.csv")

## Rows: 495 Columns: 146
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): iso, countries, region, ef_government_tax_income_data, ef_governm...
## dbl (139): year, hf_score, hf_rank, hf_quartile, pf_rol_procedural, pf_rol_c...
## lgl   (2): pf_identity_inheritance_widows, pf_identity_inheritance_daughters
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hfi <- data
hfi

## # A tibble: 495 × 146
##     year iso   countries   region hf_score hf_rank hf_quartile pf_rol_procedural
##    <dbl> <chr> <chr>       <chr>     <dbl>   <dbl>       <dbl>             <dbl>
##  1  2021 ALB   Albania     Easte…     7.67      49           2                NA
##  2  2021 DZA   Algeria     Middl…     4.82     155           4                NA
##  3  2021 AGO   Angola      Sub-S…     5.76     122           3                NA
##  4  2021 ARG   Argentina   Latin…     6.85      77           2                NA
##  5  2021 ARM   Armenia     Cauca…     7.99      33           1                NA
##  6  2021 AUS   Australia   Ocean…     8.52      14           1                NA
##  7  2021 AUT   Austria     Weste…     8.24      24           1                NA
##  8  2021 AZE   Azerbaijan  Cauca…     5.65     126           4                NA
##  9  2021 BHS   Bahamas, T… Latin…     7.82      41           2                NA
## 10  2021 BHR   Bahrain     Middl…     5.47     137           4                NA
## # ℹ 485 more rows
## # ℹ 138 more variables: pf_rol_civil <dbl>, pf_rol_criminal <dbl>,
## #   pf_rol_vdem <dbl>, pf_rol <dbl>, pf_ss_homicide <dbl>,
## #   pf_ss_homicide_data <dbl>, pf_ss_disappearances_disap <dbl>,
## #   pf_ss_disappearances_violent <dbl>,
## #   pf_ss_disappearances_violent_data <dbl>,
## #   pf_ss_disappearances_organized <dbl>, …

Data exploration

dim(hfi)

## [1] 495 146

head(hfi)

## # A tibble: 6 × 146
##    year iso   countries region    hf_score hf_rank hf_quartile pf_rol_procedural
##   <dbl> <chr> <chr>     <chr>        <dbl>   <dbl>       <dbl>             <dbl>
## 1  2021 ALB   Albania   Eastern …     7.67      49           2                NA
## 2  2021 DZA   Algeria   Middle E…     4.82     155           4                NA
## 3  2021 AGO   Angola    Sub-Saha…     5.76     122           3                NA
## 4  2021 ARG   Argentina Latin Am…     6.85      77           2                NA
## 5  2021 ARM   Armenia   Caucasus…     7.99      33           1                NA
## 6  2021 AUS   Australia Oceania       8.52      14           1                NA
## # ℹ 138 more variables: pf_rol_civil <dbl>, pf_rol_criminal <dbl>,
## #   pf_rol_vdem <dbl>, pf_rol <dbl>, pf_ss_homicide <dbl>,
## #   pf_ss_homicide_data <dbl>, pf_ss_disappearances_disap <dbl>,
## #   pf_ss_disappearances_violent <dbl>,
## #   pf_ss_disappearances_violent_data <dbl>,
## #   pf_ss_disappearances_organized <dbl>,
## #   pf_ss_disappearances_fatalities <dbl>, …

names(hfi)

##   [1] "year"                                
##   [2] "iso"                                 
##   [3] "countries"                           
##   [4] "region"                              
##   [5] "hf_score"                            
##   [6] "hf_rank"                             
##   [7] "hf_quartile"                         
##   [8] "pf_rol_procedural"                   
##   [9] "pf_rol_civil"                        
##  [10] "pf_rol_criminal"                     
##  [11] "pf_rol_vdem"                         
##  [12] "pf_rol"                              
##  [13] "pf_ss_homicide"                      
##  [14] "pf_ss_homicide_data"                 
##  [15] "pf_ss_disappearances_disap"          
##  [16] "pf_ss_disappearances_violent"        
##  [17] "pf_ss_disappearances_violent_data"   
##  [18] "pf_ss_disappearances_organized"      
##  [19] "pf_ss_disappearances_fatalities"     
##  [20] "pf_ss_disappearances_fatalities_data"
##  [21] "pf_ss_disappearances_injuries"       
##  [22] "pf_ss_disappearances_injuries_data"  
##  [23] "pf_ss_disappearances_torture"        
##  [24] "pf_ss_killings"                      
##  [25] "pf_ss_disappearances"                
##  [26] "pf_ss"                               
##  [27] "pf_movement_vdem_foreign"            
##  [28] "pf_movement_vdem_men"                
##  [29] "pf_movement_vdem_women"              
##  [30] "pf_movement_vdem"                    
##  [31] "pf_movement_cld"                     
##  [32] "pf_movement"                         
##  [33] "pf_religion_freedom_vdem"            
##  [34] "pf_religion_freedom_cld"             
##  [35] "pf_religion_freedom"                 
##  [36] "pf_religion_suppression"             
##  [37] "pf_religion"                         
##  [38] "pf_assembly_entry"                   
##  [39] "pf_assembly_freedom_house"           
##  [40] "pf_assembly_freedom_bti"             
##  [41] "pf_assembly_freedom_cld"             
##  [42] "pf_assembly_freedom"                 
##  [43] "pf_assembly_parties_barriers"        
##  [44] "pf_assembly_parties_bans"            
##  [45] "pf_assembly_parties_auton"           
##  [46] "pf_assembly_parties"                 
##  [47] "pf_assembly_civil"                   
##  [48] "pf_assembly"                         
##  [49] "pf_expression_direct_killed"         
##  [50] "pf_expression_direct_killed_data"    
##  [51] "pf_expression_direct_jailed"         
##  [52] "pf_expression_direct_jailed_data"    
##  [53] "pf_expression_direct"                
##  [54] "pf_expression_vdem_cultural"         
##  [55] "pf_expression_vdem_harass"           
##  [56] "pf_expression_vdem_gov"              
##  [57] "pf_expression_vdem_internet"         
##  [58] "pf_expression_vdem_selfcens"         
##  [59] "pf_expression_vdem"                  
##  [60] "pf_expression_house"                 
##  [61] "pf_expression_bti"                   
##  [62] "pf_expression_cld"                   
##  [63] "pf_expression"                       
##  [64] "pf_identity_same_m"                  
##  [65] "pf_identity_same_f"                  
##  [66] "pf_identity_same"                    
##  [67] "pf_identity_divorce"                 
##  [68] "pf_identity_inheritance_widows"      
##  [69] "pf_identity_inheritance_daughters"   
##  [70] "pf_identity_inheritance"             
##  [71] "pf_identity_fgm"                     
##  [72] "pf_identity"                         
##  [73] "pf_score"                            
##  [74] "pf_rank"                             
##  [75] "ef_government_consumption"           
##  [76] "ef_government_consumption_data"      
##  [77] "ef_government_transfers"             
##  [78] "ef_government_transfers_data"        
##  [79] "ef_government_investment"            
##  [80] "ef_government_investment_data"       
##  [81] "ef_government_tax_income"            
##  [82] "ef_government_tax_income_data"       
##  [83] "ef_government_tax_payroll"           
##  [84] "ef_government_tax_payroll_data"      
##  [85] "ef_government_tax"                   
##  [86] "ef_government_soa"                   
##  [87] "ef_government"                       
##  [88] "ef_legal_judicial"                   
##  [89] "ef_legal_courts"                     
##  [90] "ef_legal_protection"                 
##  [91] "ef_legal_military"                   
##  [92] "ef_legal_integrity"                  
##  [93] "ef_legal_enforcement"                
##  [94] "ef_legal_regulatory"                 
##  [95] "ef_legal_police"                     
##  [96] "ef_gender"                           
##  [97] "ef_legal"                            
##  [98] "ef_money_growth"                     
##  [99] "ef_money_growth_data"                
## [100] "ef_money_sd"                         
## [101] "ef_money_sd_data"                    
## [102] "ef_money_inflation"                  
## [103] "ef_money_inflation_data"             
## [104] "ef_money_currency"                   
## [105] "ef_money"                            
## [106] "ef_trade_tariffs_revenue"            
## [107] "ef_trade_tariffs_revenue_data"       
## [108] "ef_trade_tariffs_mean"               
## [109] "ef_trade_tariffs_mean_data"          
## [110] "ef_trade_tariffs_sd"                 
## [111] "ef_trade_tariffs_sd_data"            
## [112] "ef_trade_tariffs"                    
## [113] "ef_trade_regulatory_nontariff"       
## [114] "ef_trade_regulatory_costs"           
## [115] "ef_trade_regulatory"                 
## [116] "ef_trade_black"                      
## [117] "ef_trade_movement_open"              
## [118] "ef_trade_movement_capital"           
## [119] "ef_trade_movement_visit"             
## [120] "ef_trade_movement_assets"            
## [121] "ef_trade_movement"                   
## [122] "ef_trade"                            
## [123] "ef_regulation_credit_ownership"      
## [124] "ef_regulation_credit_private"        
## [125] "ef_regulation_credit_interest"       
## [126] "ef_regulation_credit"                
## [127] "ef_regulation_labor_minwage"         
## [128] "ef_regulation_labor_firing"          
## [129] "ef_regulation_labor_bargain"         
## [130] "ef_regulation_labor_hours"           
## [131] "ef_regulation_labor_dismissal"       
## [132] "ef_regulation_labor_conscription"    
## [133] "ef_regulation_labor_foreign"         
## [134] "ef_regulation_labor"                 
## [135] "ef_regulation_business_burden"       
## [136] "ef_regulation_business_costs"        
## [137] "ef_regulation_business_impartial"    
## [138] "ef_regulation_business_compliance"   
## [139] "ef_regulation_business"              
## [140] "ef_regulation_enter_openness"        
## [141] "ef_regulation_enter_permits"         
## [142] "ef_regulation_enter_distortion"      
## [143] "ef_regulation_enter"                 
## [144] "ef_regulation"                       
## [145] "ef_score"                            
## [146] "ef_rank"

Summary statistics

Correlation between human freedom score and police

Scatter plot:

# Scatter plot
ggplot(hfi, aes(x = pf_score, y = ef_legal_police))+
  geom_point(color = 4)

Finding the strength and direction of the Correlation

# correlation strength and direction
hfi %>%
  summarise(cor(hfi$pf_score, hfi$ef_legal_police , use = "complete.obs"))

## # A tibble: 1 × 1
##   `cor(hfi$pf_score, hfi$ef_legal_police, use = "complete.obs")`
##                                                            <dbl>
## 1                                                          0.381

Linear model

# Fitting the regression line
model <- lm(ef_legal_police ~ pf_score , data = hfi)
model

## 
## Call:
## lm(formula = ef_legal_police ~ pf_score, data = hfi)
## 
## Coefficients:
## (Intercept)     pf_score  
##      1.2836       0.5593

# Summary statistics of the model
summary(model)

## 
## Call:
## lm(formula = ef_legal_police ~ pf_score, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7577 -1.6531  0.0122  1.3590  6.4086 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.2836     0.4370   2.937  0.00347 ** 
## pf_score      0.5593     0.0612   9.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.244 on 493 degrees of freedom
## Multiple R-squared:  0.1449, Adjusted R-squared:  0.1431 
## F-statistic: 83.51 on 1 and 493 DF,  p-value: < 2.2e-16

Prediction and prediction errors : the 3 essential plots

To assess whether our linear model is reliable, we need to check for linearity, the near normal distribution of the residuals, and the constance in the variability, aka homoscedasticiticy.

Let’s create a scatter plot with the least squares line for our model laid on top.

# Fit our regression line
ggplot(data = hfi, aes(x = ef_legal_police, y = pf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Let’s build a residuals vs. fitted (predicted) values plot

model <- lm(ef_legal_police ~ pf_score , data = hfi)
# Get fitted values and residuals from the model
fitted_values <- fitted(model)
residuals <- resid(model)

# Create residual plot

ggplot(data = NULL, aes(x = fitted_values, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

3.Let’s create a quantile vs.quantile plot

# Fit a linear regression model
model <- lm(ef_legal_police ~ pf_score , data = hfi)

# Create Q-Q plot of residuals
qqnorm(resid(model))
qqline(resid(model))

model <- lm(ef_legal_police ~ pf_score , data = hfi)
model.lm <- model
# Set up the plot layout (to plot one graph at a time)
par(mfrow=c(1,1))
# Plot diagnostics for the linear regression model
plot(model.lm)

Our conclusion:

The model is not appropriate and needs to be improved because it does not meet all of the 3 fundamental assumptions for a linear model,and especially linearity and homoscedasticity. There is also a weak correlation coefficient between the two variables.

The scatter plot shows that many values are clustered far away from (but somehow around) the regression line. This indicates that our model may not be perfect and may not explain everything.
The homoscedasticity of the model cannot be confirmed either, although the residuals versus fitted values plot shows that the variability across the different values of x is almost about the same. However, that variability is not perfect and needs to be improved before we move ahead with inferential statistics.
Nonetheless, the quantile vs quantile plot shows residuals forming an almost straight line, which indicates a normal distribution with no outliers. As a result the model does not need any log transformation for normalization.

Week 11

Heleine Fouda

2024-04-05

Getting started: Load libraries

Importing the dataset

Data exploration

Summary statistics

Correlation between human freedom score and police

Scatter plot:

Finding the strength and direction of the Correlation

Linear model

Prediction and prediction errors : the 3 essential plots

To assess whether our linear model is reliable, we need to check for linearity, the near normal distribution of the residuals, and the constance in the variability, aka homoscedasticiticy.

Our conclusion:

The model is not appropriate and needs to be improved because it does not meet all of the 3 fundamental assumptions for a linear model,and especially linearity and homoscedasticity. There is also a weak correlation coefficient between the two variables.