Loading in Necessary Packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5.9000     ✓ purrr   0.3.4     
## ✓ tibble  3.1.3          ✓ dplyr   1.0.7     
## ✓ tidyr   1.1.3          ✓ stringr 1.4.0     
## ✓ readr   2.0.1          ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
## 
## Attaching package: 'statsr'
## The following objects are masked from 'package:openintro':
## 
##     calc_streak, evals, nycflights, present
library(broom)

First Set of Questions

Exercise 1

What are the dimensions of the data set?

data(hfi)
dim(hfi)
## [1] 1458  123

Using the “dim()” code, I was able to determine that this data set has 123 variables and a total of 1458 observations

Exercise 2

The dataset spans a lot of years, but we are only interested in data from year 2016. Filter the data hfi data frame for year 2016, select the six variables, and assign the result to a data frame named hfi_2016.

hfi_2016 <- hfi %>%
  filter(year == "2016") %>%
  select(pf_expression_control, pf_score, year, countries, hf_score, pf_ss)

Year: The year in which the personal freedom rankings were collected Countries: The countries in which the personal freedom rankings were collected pf_expression_control: Amount of political pressure on media content score pf_score: Human individual (personal) freedom score hf_score: Human freedom score pf_ss: Security and safety score

Exercise 3

What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control? Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country’s pf_expression_control, or its score out of 10, with 0 being the most, of political pressures and controls on media content, would you be comfortable using a linear model to predict the personal freedom score?

hfi_2016 %>%
  ggplot(aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  labs(x = "Expression Control",
       y = "Pf Score",
       title = "Pf Score as a Function of Pf Expression Control") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))

I would use a scatterplot as its primary use is to observe and show relationships between 2 numeric variables. The relationship seems somewhat linear, but many of the points are scattered randomly. As of looking at the graph for the first time, I would not be comfortable with using a linear model to predict personal freedom score.

hfi_2016 %>%
  summarise(cor(pf_expression_control, pf_score))
## # A tibble: 1 × 1
##   `cor(pf_expression_control, pf_score)`
##                                    <dbl>
## 1                                  0.845

Exercise 4

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations

The plot displays a moderately strong, positive correlation between the two variables. As pf_express_control increases, the pf_score does as well. However, there seems to be a couple outliers in the plot.

Exercise 5

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got?

plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.2838       0.5418  
## 
## Sum of Squares:  102.213

The smallest I could get for the Sum of Sqaures was 104.469

Exercise 6

We cant take a quick look at some of the summary stats of the data set by using the commands “tidy()” and “glance()”. Some of these include the y-intercept and slope of the regression line and the \(R^2\) (0.714).

lm1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)

tidy(lm1)
## # A tibble: 2 × 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              4.28     0.149       28.8 4.23e-65
## 2 pf_expression_control    0.542    0.0271      20.0 2.31e-45
glance(lm1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.714         0.712 0.799      400. 2.31e-45     1  -193.  391.  400.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

And through this we can determine that the equation of the regression line is:

\(ŷ = 4.28 + 0.542x\)

This tells us 1. For countries with pf_expression_control of 0, (those with the largest amount of political pressure on media content), the expected mean personal freedom score is 4.28. 2. For every 1 unit increase in pf_expression_control, the expected mean personal freedom score should increase by 0.542 units.

hfi_2016 %>%
  ggplot(aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Expression Control",
       y = "Pf Score",
       title = "Pf Score as a Function of Pf Expression Control") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
## `geom_smooth()` using formula 'y ~ x'

This new plot only adds the least squares line on top of the original. This was done by adding one line of code - “geom_smooth(). We defined the line by using”lm" or linear model and hid the standard error, “se”, for now

Exercise 7

If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom school for one with a 3 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

pf_exp_control <- 3

pf_score6 <- 4.28 + 0.542 * pf_exp_control
pf_score6
## [1] 5.906

Then we can Check observed values of pf_score with 3.0 rating for pf_expression_control

hfi_2016 %>%
  group_by(pf_score) %>%
  filter(pf_expression_control == 3.0)
## # A tibble: 1 × 6
## # Groups:   pf_score [1]
##   pf_expression_control pf_score  year countries         hf_score pf_ss
##                   <dbl>    <dbl> <dbl> <chr>                <dbl> <dbl>
## 1                     3     5.47  2016 Central Afr. Rep.     5.29  4.37

There is only one observation with a pf_expression_control of 3.0. And the observed pf_score for this is 5.466, which is fairly close to the predicted pf_score of 5.906. Therefore, the residual for this would be \(5.466 - 5.906 = -0.440\). The prediction was an overestimation by 0.44

Exercise 8

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?

The “augment()” function helps us calculate the residuals for each point. All we need to do is create a scatterplot for those residuals.

lm1_aug <- augment(lm1)
ggplot(data = lm1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    x ="Fitted values", 
    y ="Residuals",
    title = "Plot of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))

Overall, there is not an apparent pattern in the residuals plot and this indicates there is a linear relationship between the two variables. The points look randomly scattered.

Exercise 9

Nearly normal residuals: To check this condition, we can look at a histogram of the residuals.

ggplot(data = lm1_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    x = "Residuals",
    y = "Count",
    title = "Histogram of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))

The histogram of residuals shows that the residuals are fairly normally distributed. Although there is some skew to the left. Generally, the normal residuals condition appears to be met.

Exercise 10

Constant Variability: Based on the residuals vs. fitted plot, does the constant variability condition appear to be met?

ggplot(data = lm1, aes(sample = .resid)) +
  stat_qq() +
  labs(
    x = "Residuals",
    y = "Fitted",
    title = "Residuals vs Fitted") +
  theme(plot.title = element_text(hjust = 0.50))

The points residuals vs. fitted plot show that points are scattered around 0, there is a constant variability. Thus, the constant variability condition appear to be met.

Additional Practice

Excerise 1

Choose another freedom variable and a variable you think would strongly correlate with it.. Produce a scatter plot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Use pf_ss as the predictor, and predict the hf_rank with it

ggplot(data = hfi, aes(x = pf_ss, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Security and Safety Score",
       y = "Human Freedom Ranking",
       title = "Human Freedom Ranking as a Function of Security and Safety Score") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).

At a glance, the relationship between these two variables seems linear. It displays a positive relationship, meaning as pf_ss increases, hf_score increases as well.

Exercise 2

How does this relationship compare to the relationship between pf_score and pf_expression_control? Use the R2 values from the two model summaries to compare. Does your independent variable seem to predict pf_score better? Why or why not?

lmPF <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(lmPF)
## 
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8467 -0.5704  0.1452  0.6066  3.2060 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.61707    0.05745   80.36   <2e-16 ***
## pf_expression_control  0.49143    0.01006   48.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8318 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.6342, Adjusted R-squared:  0.634 
## F-statistic:  2386 on 1 and 1376 DF,  p-value: < 2.2e-16
lmPF2 <- lm(pf_score ~ pf_ss, hfi)

summary(lmPF2)
## 
## Call:
## lm(formula = pf_score ~ pf_ss, data = hfi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6872 -0.6502  0.1600  0.7018  2.6868 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.10202    0.15058   7.319 4.24e-13 ***
## pf_ss        0.74553    0.01815  41.069  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9219 on 1376 degrees of freedom
##   (80 observations deleted due to missingness)
## Multiple R-squared:  0.5507, Adjusted R-squared:  0.5504 
## F-statistic:  1687 on 1 and 1376 DF,  p-value: < 2.2e-16

From the \(R^2\) values of both models, we have this:

pf_expression_control and pf_score model: 63.42% of the data fits the regression model

pf_ss and pf_score model: 55.07% of the data fits the regression model

The independent variable, pf_ss, does not seem to predict the dependent, pf_score, better than the first model. The \(R^2\) is higher on the first model (63.42% vs 55.07%). This indicates less variation on the first model.

Exercise 3

Check the model diagnostics using appropriate visualizations and evaluate if the model conditions have been met.

First we need to find the residuals using the augment function:

lm2_aug <- augment(lmPF2)

Linearity: We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.

ggplot(data = lm2_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    x ="Fitted values", 
    y ="Residuals",
    title = "Plot of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))

Overall, there is not an apparent pattern in the residuals plot and this indicates there is a linear relationship between the two variables. However, the points do become more concentrated in one area as the Fitted Values increase. This can also be seen in the scatterplot (as pf_ss increases). Nevertheless, overall the points look randomly scattered.

Normality of Residuals:

ggplot(data = lm2_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    x = "Residuals",
    y = "Count",
    title = "Histogram of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))

Generally, the histogram of residuals looks fairly normal with most of the residuals being concetrated in the middle.

Exercise 4

---
title: "DATA 101 Final Project - Human Freedom Index"
author: "Justin Park"
date: "8/9/2021"
output: openintro::lab_report
---

## Loading in Necessary Packages

```{r}
library(tidyverse)
library(openintro)
library(statsr)
library(broom)
```

## First Set of Questions

### Exercise 1

What are the dimensions of the data set?

```{r}
data(hfi)
dim(hfi)
```

**Using the "dim()" code, I was able to determine that this data set has 123 variables and a total of 1458 observations**


### Exercise 2

The dataset spans a lot of years, but we are only interested in data from year 2016. Filter the data hfi data frame for year 2016, select the six variables, and assign the result to a data frame named hfi_2016.

```{r}
hfi_2016 <- hfi %>%
  filter(year == "2016") %>%
  select(pf_expression_control, pf_score, year, countries, hf_score, pf_ss)
```

**Year:** The year in which the personal freedom rankings were collected
**Countries:** The countries in which the personal freedom rankings were collected
**pf_expression_control:** Amount of political pressure on media content score
**pf_score:** Human individual (personal) freedom score
**hf_score:** Human freedom score
**pf_ss:** Security and safety score

### Exercise 3

What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control? Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country’s pf_expression_control, or its score out of 10, with 0 being the most, of political pressures and controls on media content, would you be comfortable using a linear model to predict the personal freedom score?

```{r}
hfi_2016 %>%
  ggplot(aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  labs(x = "Expression Control",
       y = "Pf Score",
       title = "Pf Score as a Function of Pf Expression Control") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
```

**I would use a scatterplot as its primary use is to observe and show relationships between 2 numeric variables. The relationship seems somewhat linear, but many of the points are scattered randomly. As of looking at the graph for the first time, I would not be comfortable with using a linear model to predict personal freedom score.**

```{r}
hfi_2016 %>%
  summarise(cor(pf_expression_control, pf_score))
```


### Exercise 4

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations

**The plot displays a moderately strong, positive correlation between the two variables. As pf_express_control increases, the pf_score does as well. However, there seems to be a couple outliers in the plot.**


### Exercise 5

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got?

```{r}
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
```

**The smallest I could get for the Sum of Sqaures was 104.469**


### Exercise 6

We cant take a quick look at some of the summary stats of the data set by using the commands "tidy()" and "glance()". Some of these include the y-intercept and slope of the regression line and the  $R^2$ (0.714).
```{r}
lm1 <- lm(pf_score ~ pf_expression_control, data = hfi_2016)

tidy(lm1)

glance(lm1)
```

And through this we can determine that the equation of the regression line is:

$ŷ = 4.28 + 0.542x$

This tells us
1. For countries with pf_expression_control of 0, (those with the largest amount of political pressure on media content), the expected mean personal freedom score is 4.28.
2. For every 1 unit increase in pf_expression_control, the expected mean personal freedom score should increase by 0.542 units.

```{r}
hfi_2016 %>%
  ggplot(aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Expression Control",
       y = "Pf Score",
       title = "Pf Score as a Function of Pf Expression Control") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
```

**This new plot only adds the least squares line on top of the original. This was done by adding one line of code - "geom_smooth(). We defined the line by using "lm" or linear model and hid the standard error, "se", for now**


### Exercise 7

If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom school for one with a 3 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

```{r}
pf_exp_control <- 3

pf_score6 <- 4.28 + 0.542 * pf_exp_control
pf_score6
```

Then we can Check observed values of pf_score with 3.0 rating for `pf_expression_control`

```{r}
hfi_2016 %>%
  group_by(pf_score) %>%
  filter(pf_expression_control == 3.0)
```

**There is only one observation with a pf_expression_control of 3.0. And the observed pf_score for this is 5.466, which is fairly close to the predicted pf_score of 5.906.**
**Therefore, the residual for this would be $5.466 - 5.906 = -0.440$. The prediction was an overestimation by 0.44**


### Exercise 8

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?

**The "augment()" function helps us calculate the residuals for each point. All we need to do is create a scatterplot for those residuals.**

```{r}
lm1_aug <- augment(lm1)
```

```{r}
ggplot(data = lm1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    x ="Fitted values", 
    y ="Residuals",
    title = "Plot of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))
```

**Overall, there is not an apparent pattern in the residuals plot and this indicates there is a linear relationship between the two variables. The points look randomly scattered.**


### Exercise 9

**Nearly normal residuals:** To check this condition, we can look at a histogram of the residuals.

```{r}
ggplot(data = lm1_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    x = "Residuals",
    y = "Count",
    title = "Histogram of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))
```

**The histogram of residuals shows that the residuals are fairly normally distributed. Although there is some skew to the left. Generally, the normal residuals condition appears to be met.**

### Exercise 10

**Constant Variability:** Based on the residuals vs. fitted plot, does the constant variability condition appear to be met?

```{r}
ggplot(data = lm1, aes(sample = .resid)) +
  stat_qq() +
  labs(
    x = "Residuals",
    y = "Fitted",
    title = "Residuals vs Fitted") +
  theme(plot.title = element_text(hjust = 0.50))
```

**The points residuals vs. fitted plot show that points are scattered around 0, there is a constant variability. Thus, the constant variability condition appear to be met.**



## Additional Practice

### Excerise 1

Choose another freedom variable and a variable you think would strongly correlate with it.. Produce a scatter plot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

**Use pf_ss as the predictor, and predict the hf_rank with it**

```{r}
ggplot(data = hfi, aes(x = pf_ss, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Security and Safety Score",
       y = "Human Freedom Ranking",
       title = "Human Freedom Ranking as a Function of Security and Safety Score") +
  theme(plot.title = element_text(hjust = 0.50)) +
  scale_x_continuous(breaks = seq(0, 10, 1))
```

**At a glance, the relationship between these two variables seems linear.**
**It displays a positive relationship, meaning as pf_ss increases, hf_score increases as well.**

### Exercise 2

How does this relationship compare to the relationship between pf_score and pf_expression_control? Use the R2 values from the two model summaries to compare. Does your independent variable seem to predict pf_score better? Why or why not?

```{r}
lmPF <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(lmPF)
```

```{r}
lmPF2 <- lm(pf_score ~ pf_ss, hfi)

summary(lmPF2)
```

From the $R^2$ values of both models, we have this:

**pf_expression_control and pf_score model:** 63.42% of the data fits the regression model

**pf_ss and pf_score model:** 55.07% of the data fits the regression model

The independent variable, pf_ss, does not seem to predict the dependent, pf_score, better than the first model. The $R^2$ is higher on the first model (63.42% vs 55.07%). This indicates less variation on the first model.



### Exercise 3

Check the model diagnostics using appropriate visualizations and evaluate if the model conditions have been met.

First we need to find the residuals using the augment function:

```{r}
lm2_aug <- augment(lmPF2)
```

**Linearity:** We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.

```{r}
ggplot(data = lm2_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(
    x ="Fitted values", 
    y ="Residuals",
    title = "Plot of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))
```

**Overall, there is not an apparent pattern in the residuals plot and this indicates there is a linear relationship between the two variables. However, the points do become more concentrated in one area as the Fitted Values increase. This can also be seen in the scatterplot (as pf_ss increases). Nevertheless, overall the points look randomly scattered.**

**Normality of Residuals:** 

```{r}
ggplot(data = lm2_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  labs(
    x = "Residuals",
    y = "Count",
    title = "Histogram of Residuals") +
  theme(plot.title = element_text(hjust = 0.50))
```

**Generally, the histogram of residuals looks fairly normal with most of the residuals being concetrated in the middle.**

### Exercise 4