library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(statsr)
## Warning: package 'statsr' was built under R version 4.2.2
## Loading required package: BayesFactor
## Warning: package 'BayesFactor' was built under R version 4.2.2
## Loading required package: coda
## Warning: package 'coda' was built under R version 4.2.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.2
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
## 
## Attaching package: 'statsr'
## 
## The following objects are masked from 'package:openintro':
## 
##     calc_streak, evals, nycflights, present
data('hfi', package='openintro')

Exercise 1

dim(hfi)
## [1] 1458  123

The dimensions of the dataset are 1428 rows and 123 columns.

Exercise 2

plot(hfi$pf_score ~ hfi$pf_expression_control)

I would use a scatter plot to display the relationship between the two. The relationship looks somewhat linear and I would be comfortable to say that the lower the expression control, then the higher the personal freedom score.

Exercise 3

There seems to be a positive correlation between expression control and personal freedom. As expression control increases, or decresases as 0 is the highest, so does the personal freedom score. There does seem to be a fair degree of variation though it seems to lessen at the upper extremes.

Exercise 4

m1 <- lm(pf_score ~ pf_expression_control, data = hfi)

plot_ss(x = pf_expression_control, y = pf_score, data = hfi, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.6171       0.4914  
## 
## Sum of Squares:  952.153

The smallest sum of squares is 952.153

Exercise 5

lm_ec_score <- lm(hf_score ~ pf_expression_control, data =hfi)
summary(lm_ec_score)
## 
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6198 -0.4908  0.1031  0.4703  2.2933 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.153687   0.046070  111.87   <2e-16 ***
## pf_expression_control 0.349862   0.008067   43.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.667 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5772 
## F-statistic:  1881 on 1 and 1376 DF,  p-value: < 2.2e-16

The equation of the regression line is y = .349862x + 5.1537 where x is the expression control and y is the hf_score. The slope tells us with every unit increase in expression control, there is a .34 increase in the human preedom score.

Exercise 6

ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).

lm_pf_expression_control_pf_score<- lm(pf_score~ pf_expression_control, data = hfi)

summary(lm_pf_expression_control_pf_score)
## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16
filteredhfi<-filter(hfi, pf_expression_control == 6.75)
mean(filteredhfi$pf_score)
## [1] 8.006315

Based on this regression line, the personal freedom score with a 6.7 for expression control would be .49143*6.7 + 4.617 or 7.90871. Since the countries only have expression control scores in .25 increments, I filtered those with 6.75 as the closest and average their pf_scores. Doing so I got 8.006315, giving me an underestimate of .1.

Exercise 7

ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

There doesn’t seem to bea pattern in the residuals other than there seems to be more values towards the center which doesn’t tell us much. This indicates a linearity exists between the two variables.

Exercise 8

ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 1) +
  xlab("Residuals")

ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()

Linearity seems to be met. The histogram looks normal enough and the qqplot is almost linear with some very slight deviations near the extremes.

Exercise 9

Based on the residuals vs the fitted plot, it seems that constant variability is met.

Exercise 10

lm2 <- lm(pf_ss_homicide ~pf_ss_disappearances_violent, data = hfi)


ggplot(data = hfi, aes(x = pf_ss_disappearances_violent, y = pf_ss_homicide)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).

I chose the violent disappearances and homicide as two variables that would seem to be connected. Judging by the scatter plot, there seems to be a very low correlation between the two which is surprising.

Exercise 11

summary(m1)
## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16
summary(lm2)
## 
## Call:
## lm(formula = pf_ss_homicide ~ pf_ss_disappearances_violent, data = hfi)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.417 -1.020  1.221  2.042  2.509 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  7.325888   0.423633  17.293   <2e-16 ***
## pf_ss_disappearances_violent 0.009149   0.043773   0.209    0.834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.834 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  3.175e-05,  Adjusted R-squared:  -0.000695 
## F-statistic: 0.04368 on 1 and 1376 DF,  p-value: 0.8345

The relationship between the variables I chose to study were vastly different from the pf_expression_control and pf_score. The \(R^2\) value was 6.34 for the original as opposed to -.000695 for the variables I chose. My independent does not predict my dependent one at all since the \(R^2\) value is near zero.

Exercise 12

lm3 <- lm(pf_religion ~ pf_ss_disappearances_violent, data = hfi)


ggplot(data = hfi, aes(x = pf_religion, y = pf_ss_disappearances_violent)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 90 rows containing non-finite values (stat_smooth).
## Warning: Removed 90 rows containing missing values (geom_point).

The variables I chose to study were actually the most surprising to me as I would have expected that violent crimes and homicide would go hand in hand in one predicting the other. Though another relationship that I found surprising was religious freedom and disappearances. I would assume that all crime would decrease as soon as relgious freedom was prominent though the scatter plot would suggest a near positive correlation.