Lab 1: Intro to R

library(tidyverse)
library(openintro)
library(statsr)
library(broom)
library(ggplot2)

Exercise 1

There are 1458 rows and 123 columns, each row represents a single country in a specific year.

data(hfi)
dim(hfi)

## [1] 1458  123

Exercise 2

hfi_2016 <- hfi %>%
  filter(year == 2016) %>%
  select(pf_score, pf_expression_control, hf_score, region, countries, ISO_code)
head(hfi_2016)

## # A tibble: 6 × 6
##   pf_score pf_expression_control hf_score region              countries ISO_code
##      <dbl>                 <dbl>    <dbl> <chr>               <chr>     <chr>   
## 1     7.60                  5.25     7.57 Eastern Europe      Albania   ALB     
## 2     5.28                  4        5.14 Middle East & Nort… Algeria   DZA     
## 3     6.11                  2.5      5.64 Sub-Saharan Africa  Angola    AGO     
## 4     8.10                  5.5      6.47 Latin America & th… Argentina ARG     
## 5     6.91                  4.25     7.24 Caucasus & Central… Armenia   ARM     
## 6     9.18                  7.75     8.58 Oceania             Australia AUS

Exercise 3

The form of the relationship is linear, the direction is positive, and the strength is strong, with a correlation of 0.845. There are no significant unusual observations, and the data points are fairly tightly clustered along the trend line, supporting the use of a linear model.

ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  labs(title = "pf_score vs pf_expression_control")

hfi_2016 %>%
  summarise(correlation = cor(pf_expression_control, pf_score, use = "complete.obs"))

## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.845

Exercise 4

The closer the fitted line is to the actual data points, the lower the sum of squared residuals (SSR).

plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

## Click two points to make a line.                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.2838       0.5418  
## 
## Sum of Squares:  102.213

Exercise 5

The regression model shows that for each 1-unit increase in expression control, the personal freedom score increases by about 0.542 points. This indicates a strong, positive linear relationship between the two variables.

m1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)
tidy(m1)

## # A tibble: 2 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              4.28     0.149       28.8 4.23e-65
## 2 pf_expression_control    0.542    0.0271      20.0 2.31e-45

glance(m1)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.714         0.712 0.799      400. 2.31e-45     1  -193.  391.  400.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Exercise 6

The predicted personal freedom score for a country with a pf_expression_control of 3 is approximately 5.91. If the actual score is, for example, 5.20, then the residual is −0.71, indicating the model overestimated the true value. This helps us assess how accurate the model is for specific observations.

ggplot(hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

predicted <- predict(m1, newdata = data.frame(pf_expression_control = 3))
actual <- hfi_2016$pf_score[hfi_2016$pf_expression_control == 3]
residual <- actual - predicted

predict(m1, newdata = data.frame(pf_expression_control = 3))

##        1 
## 5.909351

print(actual)

## [1] 5.465632

print(residual)

##          1 
## -0.4437186

Exercise 7

Using the regression line, the predicted personal freedom score for a country with a pf_expression_control score of 3 is approximately 5.91. If the actual score is 5.20, the model overestimates the value by 0.71. This results in a residual of −0.71, which indicates that the model’s prediction was too high.

m1_aug <- augment(m1)

ggplot(m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

Exercise 8

There is no apparent skew or strong pattern in the histogram of residuals, and the shape appears roughly symmetric and bell-shaped. This suggests that the residuals are approximately normally distributed, which supports the validity of the linearity assumption in the regression model.

ggplot(data = m1_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25, fill = "skyblue") +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Count")

Exercise 9

The histogram looks roughly symmetric and bell-shaped, so the nearly normal residuals condition is satisfied.

Exercise 10

Based on the residuals vs. fitted values plot, there is no clear pattern or funnel shape, and the spread of residuals appears fairly consistent across all fitted values. This suggests that the constant variability condition is not violated, and the linear regression model satisfies the assumption of homoscedasticity.

m2 <- lm(pf_score ~ hf_score, data = hfi_2016)
glance(m2)$r.squared

## [1] 0.8978487

LS0tCnRpdGxlOiAiTGFiIDE6IEludHJvIHRvIFIiCmF1dGhvcjogIkF1dGhvciBOYW1lIgpkYXRlOiAiYHIgU3lzLkRhdGUoKWAiCm91dHB1dDogb3BlbmludHJvOjpsYWJfcmVwb3J0Ci0tLQoKYGBge3IgbG9hZC1wYWNrYWdlcywgbWVzc2FnZT1GQUxTRX0KbGlicmFyeSh0aWR5dmVyc2UpCmxpYnJhcnkob3BlbmludHJvKQpsaWJyYXJ5KHN0YXRzcikKbGlicmFyeShicm9vbSkKbGlicmFyeShnZ3Bsb3QyKQpgYGAKCiMjIyBFeGVyY2lzZSAxCgpUaGVyZSBhcmUgMTQ1OCByb3dzIGFuZCAxMjMgY29sdW1ucywgZWFjaCByb3cgcmVwcmVzZW50cyBhIHNpbmdsZSBjb3VudHJ5IGluIGEgc3BlY2lmaWMgeWVhci4KCmBgYHtyIHZpZXctZ2lybHMtY291bnRzfQpkYXRhKGhmaSkKZGltKGhmaSkKYGBgCgoKIyMjIEV4ZXJjaXNlIDIKCmBgYHtyIHRyZW5kLWdpcmxzfQpoZmlfMjAxNiA8LSBoZmkgJT4lCiAgZmlsdGVyKHllYXIgPT0gMjAxNikgJT4lCiAgc2VsZWN0KHBmX3Njb3JlLCBwZl9leHByZXNzaW9uX2NvbnRyb2wsIGhmX3Njb3JlLCByZWdpb24sIGNvdW50cmllcywgSVNPX2NvZGUpCmhlYWQoaGZpXzIwMTYpCmBgYAoKCiMjIyBFeGVyY2lzZSAzCgpUaGUgZm9ybSBvZiB0aGUgcmVsYXRpb25zaGlwIGlzIGxpbmVhciwgdGhlIGRpcmVjdGlvbiBpcyBwb3NpdGl2ZSwgYW5kIHRoZSBzdHJlbmd0aCBpcyBzdHJvbmcsIHdpdGggYSBjb3JyZWxhdGlvbiBvZiAwLjg0NS4gVGhlcmUgYXJlIG5vIHNpZ25pZmljYW50IHVudXN1YWwgb2JzZXJ2YXRpb25zLCBhbmQgdGhlIGRhdGEgcG9pbnRzIGFyZSBmYWlybHkgdGlnaHRseSBjbHVzdGVyZWQgYWxvbmcgdGhlIHRyZW5kIGxpbmUsIHN1cHBvcnRpbmcgdGhlIHVzZSBvZiBhIGxpbmVhciBtb2RlbC4KCmBgYHtyIHBsb3QtcHJvcC1ib3lzLWFyYnV0aG5vdH0KZ2dwbG90KGhmaV8yMDE2LCBhZXMoeCA9IHBmX2V4cHJlc3Npb25fY29udHJvbCwgeSA9IHBmX3Njb3JlKSkgKwogIGdlb21fcG9pbnQoKSArCiAgbGFicyh0aXRsZSA9ICJwZl9zY29yZSB2cyBwZl9leHByZXNzaW9uX2NvbnRyb2wiKQoKaGZpXzIwMTYgJT4lCiAgc3VtbWFyaXNlKGNvcnJlbGF0aW9uID0gY29yKHBmX2V4cHJlc3Npb25fY29udHJvbCwgcGZfc2NvcmUsIHVzZSA9ICJjb21wbGV0ZS5vYnMiKSkKYGBgCgoKIyMjIEV4ZXJjaXNlIDQKClRoZSBjbG9zZXIgdGhlIGZpdHRlZCBsaW5lIGlzIHRvIHRoZSBhY3R1YWwgZGF0YSBwb2ludHMsIHRoZSBsb3dlciB0aGUgc3VtIG9mIHNxdWFyZWQgcmVzaWR1YWxzIChTU1IpLgoKYGBge3IgZGltLXByZXNlbnR9CnBsb3Rfc3MoeCA9IHBmX2V4cHJlc3Npb25fY29udHJvbCwgeSA9IHBmX3Njb3JlLCBkYXRhID0gaGZpXzIwMTYsIHNob3dTcXVhcmVzID0gVFJVRSkKYGBgCgoKIyMjIEV4ZXJjaXNlIDUKClRoZSByZWdyZXNzaW9uIG1vZGVsIHNob3dzIHRoYXQgZm9yIGVhY2ggMS11bml0IGluY3JlYXNlIGluIGV4cHJlc3Npb24gY29udHJvbCwgdGhlIHBlcnNvbmFsIGZyZWVkb20gc2NvcmUgaW5jcmVhc2VzIGJ5IGFib3V0IDAuNTQyIHBvaW50cy4gVGhpcyBpbmRpY2F0ZXMgYSBzdHJvbmcsIHBvc2l0aXZlIGxpbmVhciByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgdHdvIHZhcmlhYmxlcy4KCmBgYHtyIGNvdW50LWNvbXBhcmV9Cm0xIDwtIGxtKHBmX3Njb3JlIH4gcGZfZXhwcmVzc2lvbl9jb250cm9sLCBkYXRhID0gaGZpXzIwMTYpCnRpZHkobTEpCmdsYW5jZShtMSkKYGBgCgoKIyMjIEV4ZXJjaXNlIDYKClRoZSBwcmVkaWN0ZWQgcGVyc29uYWwgZnJlZWRvbSBzY29yZSBmb3IgYSBjb3VudHJ5IHdpdGggYSBwZl9leHByZXNzaW9uX2NvbnRyb2wgb2YgMyBpcyBhcHByb3hpbWF0ZWx5IDUuOTEuIElmIHRoZSBhY3R1YWwgc2NvcmUgaXMsIGZvciBleGFtcGxlLCA1LjIwLCB0aGVuIHRoZSByZXNpZHVhbCBpcyDiiJIwLjcxLCBpbmRpY2F0aW5nIHRoZSBtb2RlbCBvdmVyZXN0aW1hdGVkIHRoZSB0cnVlIHZhbHVlLiBUaGlzIGhlbHBzIHVzIGFzc2VzcyBob3cgYWNjdXJhdGUgdGhlIG1vZGVsIGlzIGZvciBzcGVjaWZpYyBvYnNlcnZhdGlvbnMuCgpgYGB7ciBwbG90LXByb3AtYm95cy1wcmVzZW50fQpnZ3Bsb3QoaGZpXzIwMTYsIGFlcyh4ID0gcGZfZXhwcmVzc2lvbl9jb250cm9sLCB5ID0gcGZfc2NvcmUpKSArCiAgZ2VvbV9wb2ludCgpICsKICBnZW9tX3Ntb290aChtZXRob2QgPSAibG0iLCBzZSA9IEZBTFNFKQoKcHJlZGljdGVkIDwtIHByZWRpY3QobTEsIG5ld2RhdGEgPSBkYXRhLmZyYW1lKHBmX2V4cHJlc3Npb25fY29udHJvbCA9IDMpKQphY3R1YWwgPC0gaGZpXzIwMTYkcGZfc2NvcmVbaGZpXzIwMTYkcGZfZXhwcmVzc2lvbl9jb250cm9sID09IDNdCnJlc2lkdWFsIDwtIGFjdHVhbCAtIHByZWRpY3RlZAoKcHJlZGljdChtMSwgbmV3ZGF0YSA9IGRhdGEuZnJhbWUocGZfZXhwcmVzc2lvbl9jb250cm9sID0gMykpCgpwcmludChhY3R1YWwpCnByaW50KHJlc2lkdWFsKQpgYGAKCgojIyMgRXhlcmNpc2UgNwoKVXNpbmcgdGhlIHJlZ3Jlc3Npb24gbGluZSwgdGhlIHByZWRpY3RlZCBwZXJzb25hbCBmcmVlZG9tIHNjb3JlIGZvciBhIGNvdW50cnkgd2l0aCBhIHBmX2V4cHJlc3Npb25fY29udHJvbCBzY29yZSBvZiAzIGlzIGFwcHJveGltYXRlbHkgNS45MS4gSWYgdGhlIGFjdHVhbCBzY29yZSBpcyA1LjIwLCB0aGUgbW9kZWwgb3ZlcmVzdGltYXRlcyB0aGUgdmFsdWUgYnkgMC43MS4gVGhpcyByZXN1bHRzIGluIGEgcmVzaWR1YWwgb2Yg4oiSMC43MSwgd2hpY2ggaW5kaWNhdGVzIHRoYXQgdGhlIG1vZGVsJ3MgcHJlZGljdGlvbiB3YXMgdG9vIGhpZ2guCgpgYGB7ciBmaW5kLW1heC10b3RhbH0KbTFfYXVnIDwtIGF1Z21lbnQobTEpCgpnZ3Bsb3QobTFfYXVnLCBhZXMoeCA9IC5maXR0ZWQsIHkgPSAucmVzaWQpKSArCiAgZ2VvbV9wb2ludCgpICsKICBnZW9tX2hsaW5lKHlpbnRlcmNlcHQgPSAwLCBsaW5ldHlwZSA9ICJkYXNoZWQiLCBjb2xvciA9ICJyZWQiKSArCiAgeGxhYigiRml0dGVkIHZhbHVlcyIpICsKICB5bGFiKCJSZXNpZHVhbHMiKQpgYGAKCgojIyMgRXhlcmNpc2UgOCAKClRoZXJlIGlzIG5vIGFwcGFyZW50IHNrZXcgb3Igc3Ryb25nIHBhdHRlcm4gaW4gdGhlIGhpc3RvZ3JhbSBvZiByZXNpZHVhbHMsIGFuZCB0aGUgc2hhcGUgYXBwZWFycyByb3VnaGx5IHN5bW1ldHJpYyBhbmQgYmVsbC1zaGFwZWQuIFRoaXMgc3VnZ2VzdHMgdGhhdCB0aGUgcmVzaWR1YWxzIGFyZSBhcHByb3hpbWF0ZWx5IG5vcm1hbGx5IGRpc3RyaWJ1dGVkLCB3aGljaCBzdXBwb3J0cyB0aGUgdmFsaWRpdHkgb2YgdGhlIGxpbmVhcml0eSBhc3N1bXB0aW9uIGluIHRoZSByZWdyZXNzaW9uIG1vZGVsLgoKYGBge3J9CmdncGxvdChkYXRhID0gbTFfYXVnLCBhZXMoeCA9IC5yZXNpZCkpICsKICBnZW9tX2hpc3RvZ3JhbShiaW53aWR0aCA9IDAuMjUsIGZpbGwgPSAic2t5Ymx1ZSIpICsKICBsYWJzKHRpdGxlID0gIkhpc3RvZ3JhbSBvZiBSZXNpZHVhbHMiLCB4ID0gIlJlc2lkdWFscyIsIHkgPSAiQ291bnQiKQpgYGAKCgojIyMgRXhlcmNpc2UgOQoKVGhlIGhpc3RvZ3JhbSBsb29rcyByb3VnaGx5IHN5bW1ldHJpYyBhbmQgYmVsbC1zaGFwZWQsIHNvIHRoZSBuZWFybHkgbm9ybWFsIHJlc2lkdWFscyBjb25kaXRpb24gaXMgc2F0aXNmaWVkLgoKIyMjIEV4ZXJjaXNlIDEwCgpCYXNlZCBvbiB0aGUgcmVzaWR1YWxzIHZzLiBmaXR0ZWQgdmFsdWVzIHBsb3QsIHRoZXJlIGlzIG5vIGNsZWFyIHBhdHRlcm4gb3IgZnVubmVsIHNoYXBlLCBhbmQgdGhlIHNwcmVhZCBvZiByZXNpZHVhbHMgYXBwZWFycyBmYWlybHkgY29uc2lzdGVudCBhY3Jvc3MgYWxsIGZpdHRlZCB2YWx1ZXMuIFRoaXMgc3VnZ2VzdHMgdGhhdCB0aGUgY29uc3RhbnQgdmFyaWFiaWxpdHkgY29uZGl0aW9uIGlzIG5vdCB2aW9sYXRlZCwgYW5kIHRoZSBsaW5lYXIgcmVncmVzc2lvbiBtb2RlbCBzYXRpc2ZpZXMgdGhlIGFzc3VtcHRpb24gb2YgaG9tb3NjZWRhc3RpY2l0eS4KCmBgYHtyfQptMiA8LSBsbShwZl9zY29yZSB+IGhmX3Njb3JlLCBkYXRhID0gaGZpXzIwMTYpCmdsYW5jZShtMikkci5zcXVhcmVkCmBgYA==