library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
##
## Attaching package: 'statsr'
##
## The following object is masked from 'package:infer':
##
## rep_sample_n
##
## The following objects are masked from 'package:openintro':
##
## calc_streak, evals, nycflights, present
data('hfi', package='openintro')
dim(hfi)
## [1] 1458 123
Looking at the scatterplot, I would say the relationship looks linear and thus a linear model can predict the personal freedom score.
hfi %>%
summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))
plot(hfi$pf_score ~ hfi$pf_expression_control,
xlab = "Expression control", ylab = "Pf score")
hfi1 <- hfi[c("pf_score", "pf_expression_control")] %>%
filter(!is.na(pf_expression_control)|!is.na(pf_score))
# Drop NAs
hfi1 <- drop_na(hfi1)
row.names <- NULL
DATA606::plot_ss(x = hfi1$pf_expression_control, y = hfi1$pf_score)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 4.6171 0.4914
##
## Sum of Squares: 952.153
Practically, there are some massive outliers, but for the most part it appears to be a positive linear relationship. The strength appears to be high, as they are very tightly correlated.
plot_ss(x = pf_expression_control, y = pf_score, data = hfi1)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 4.6171 0.4914
##
## Sum of Squares: 952.153
The Smallest sum of square I received was 952!
reverse <- lm(hf_score ~ pf_expression_control, data = hfi)
summary(reverse)
##
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6198 -0.4908 0.1031 0.4703 2.2933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153687 0.046070 111.87 <2e-16 ***
## pf_expression_control 0.349862 0.008067 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.667 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1881 on 1 and 1376 DF, p-value: < 2.2e-16
The slope tells us that human freedom and the amount of political pressure are directly related! The Equation is “4.61707 + 0.349862 * (pf_expression_control)
target <- 6.7
result <- 4.61707 + 0.349862 * target
result
## [1] 6.961145
hfi %>%
group_by(pf_score) %>%
filter(pf_expression_control > 6.50) %>%
filter(pf_expression_control < 7.00)
Seeing as there no exact 6.7 value, lets use a close value from Guyana
Residual <- 6.954978 - result
Residual
## [1] -0.0061674
The prediction overestimated by 0.0062.
m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
ggplot(data = m1, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
ggplot(data = m1, aes(sample = .resid)) +
stat_qq()
There appears to be no pattern in the residuals, so its likely its a linear relationship.
Yes, the chart demonsratres that the distrubutions are nearly normal.
The residuals vs. fitted plot shows the points scattered around zero, appearing to meet the variability condition!
ggplot(data = hfi, aes(x = year, y = hf_rank)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).
The relationship between year and hf_rank is a positive linear relationship.
target <- lm(year ~ hf_rank, data = hfi)
summary(target)
##
## Call:
## lm(formula = year ~ hf_rank, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3787 -2.1659 -0.0095 2.1083 4.2053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.012e+03 1.380e-01 14574.330 < 2e-16 ***
## hf_rank 4.171e-03 1.552e-03 2.687 0.00729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.563 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.005221, Adjusted R-squared: 0.004498
## F-statistic: 7.222 on 1 and 1376 DF, p-value: 0.007289
So my R-Squared value is significantly lower than R-Squared value between pf_expression_control and pf_score?
Given the low correlation, my independent variable does not predict my dependent one well.
legal <- lm(hfi$year ~ hfi$ef_legal_protection)
summary(legal)
##
## Call:
## lm(formula = hfi$year ~ hfi$ef_legal_protection)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3014 -2.1748 -0.1263 1.9067 3.9571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2012.35942 0.24023 8376.823 <2e-16 ***
## hfi$ef_legal_protection -0.03408 0.04127 -0.826 0.409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.547 on 1287 degrees of freedom
## (169 observations deleted due to missingness)
## Multiple R-squared: 0.0005295, Adjusted R-squared: -0.0002471
## F-statistic: 0.6818 on 1 and 1287 DF, p-value: 0.4091
Honestly, I’m finding the relationships of freedom relationships and time to be rather interesting. I really like the idea that legal protections are time independent, but at the same time I was hoping that they would be improving (indicating a positive linear relationship)
This model has an R-Square value akin to my previous time based metric, which seems to indicate that freedom and time do not have a relationship.