library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(statsr)
## Warning: package 'statsr' was built under R version 4.2.2
## Loading required package: BayesFactor
## Warning: package 'BayesFactor' was built under R version 4.2.2
## Loading required package: coda
## Warning: package 'coda' was built under R version 4.2.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 4.2.2
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
##
## Attaching package: 'statsr'
##
## The following objects are masked from 'package:openintro':
##
## calc_streak, evals, nycflights, present
data('hfi', package='openintro')
dim(hfi)
## [1] 1458 123
The dimensions of the dataset are 1428 rows and 123 columns.
plot(hfi$pf_score ~ hfi$pf_expression_control)
I would use a scatter plot to display the relationship between the two.
The relationship looks somewhat linear and I would be comfortable to say
that the lower the expression control, then the higher the personal
freedom score.
There seems to be a positive correlation between expression control and personal freedom. As expression control increases, or decresases as 0 is the highest, so does the personal freedom score. There does seem to be a fair degree of variation though it seems to lessen at the upper extremes.
m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
plot_ss(x = pf_expression_control, y = pf_score, data = hfi, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 4.6171 0.4914
##
## Sum of Squares: 952.153
The smallest sum of squares is 952.153
lm_ec_score <- lm(hf_score ~ pf_expression_control, data =hfi)
summary(lm_ec_score)
##
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6198 -0.4908 0.1031 0.4703 2.2933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153687 0.046070 111.87 <2e-16 ***
## pf_expression_control 0.349862 0.008067 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.667 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1881 on 1 and 1376 DF, p-value: < 2.2e-16
The equation of the regression line is y = .349862x + 5.1537 where x is the expression control and y is the hf_score. The slope tells us with every unit increase in expression control, there is a .34 increase in the human preedom score.
ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).
lm_pf_expression_control_pf_score<- lm(pf_score~ pf_expression_control, data = hfi)
summary(lm_pf_expression_control_pf_score)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
filteredhfi<-filter(hfi, pf_expression_control == 6.75)
mean(filteredhfi$pf_score)
## [1] 8.006315
Based on this regression line, the personal freedom score with a 6.7 for expression control would be .49143*6.7 + 4.617 or 7.90871. Since the countries only have expression control scores in .25 increments, I filtered those with 6.75 as the closest and average their pf_scores. Doing so I got 8.006315, giving me an underestimate of .1.
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
There doesn’t seem to bea pattern in the residuals other than there
seems to be more values towards the center which doesn’t tell us much.
This indicates a linearity exists between the two variables.
ggplot(data = m1, aes(x = .resid)) +
geom_histogram(binwidth = 1) +
xlab("Residuals")
ggplot(data = m1, aes(sample = .resid)) +
stat_qq()
Linearity seems to be met. The histogram looks normal enough and the
qqplot is almost linear with some very slight deviations near the
extremes.
Based on the residuals vs the fitted plot, it seems that constant variability is met.
lm2 <- lm(pf_ss_homicide ~pf_ss_disappearances_violent, data = hfi)
ggplot(data = hfi, aes(x = pf_ss_disappearances_violent, y = pf_ss_homicide)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).
I chose the violent disappearances and homicide as two variables that
would seem to be connected. Judging by the scatter plot, there seems to
be a very low correlation between the two which is surprising.
summary(m1)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
summary(lm2)
##
## Call:
## lm(formula = pf_ss_homicide ~ pf_ss_disappearances_violent, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.417 -1.020 1.221 2.042 2.509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.325888 0.423633 17.293 <2e-16 ***
## pf_ss_disappearances_violent 0.009149 0.043773 0.209 0.834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.834 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 3.175e-05, Adjusted R-squared: -0.000695
## F-statistic: 0.04368 on 1 and 1376 DF, p-value: 0.8345
The relationship between the variables I chose to study were vastly different from the pf_expression_control and pf_score. The \(R^2\) value was 6.34 for the original as opposed to -.000695 for the variables I chose. My independent does not predict my dependent one at all since the \(R^2\) value is near zero.
lm3 <- lm(pf_religion ~ pf_ss_disappearances_violent, data = hfi)
ggplot(data = hfi, aes(x = pf_religion, y = pf_ss_disappearances_violent)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 90 rows containing non-finite values (stat_smooth).
## Warning: Removed 90 rows containing missing values (geom_point).
The variables I chose to study were actually the most surprising to me as I would have expected that violent crimes and homicide would go hand in hand in one predicting the other. Though another relationship that I found surprising was religious freedom and disappearances. I would assume that all crime would decrease as soon as relgious freedom was prominent though the scatter plot would suggest a near positive correlation.