library(tidyverse)
library(openintro)
library(broom)
library(statsr)

Exercise 1

What are the dimensions of the dataset? What does each row represent?

dim(hfi)
## [1] 1458  123
nrow(hfi)
## [1] 1458
ncol(hfi)
## [1] 123

Each row of the hfi dataset represents one country in one year. The columns hold that country's freedom scores and ranks for the year, not counts of people. The dim function returns 2 numbers, 1458 and 123, and nrow and ncol show what each one means: 1458 is the number of rows (observations) and 123 is the number of columns (variables). I can also check my work by using the view function, which opens the hfi dataset in a separate page and shows the number of rows and columns. I could also look at the help page for hfi.

Exercise 2

The dataset spans a lot of years, but we are only interested in data from year 2016. Filter the data hfi data frame for year 2016, select the six variables, and assign the result to a data frame named hfi_2016.

We could say the 6 variables are pf_score, pf_rank, ef_score, ef_rank, hf_score, and hf_rank. But since the exercises only use year, pf_score, pf_expression_control, and hf_score, maybe we should only use those 4.

hfi_2016 <- hfi %>%
  filter(year == 2016) %>%
  select(year, hf_score, pf_expression_control, pf_score)
hfi_2016
## # A tibble: 162 x 4
##     year hf_score pf_expression_control pf_score
##    <dbl>    <dbl>                 <dbl>    <dbl>
##  1  2016     7.57                  5.25     7.60
##  2  2016     5.14                  4        5.28
##  3  2016     5.64                  2.5      6.11
##  4  2016     6.47                  5.5      8.10
##  5  2016     7.24                  4.25     6.91
##  6  2016     8.58                  7.75     9.18
##  7  2016     8.41                  8        9.25
##  8  2016     6.08                  0.25     5.68
##  9  2016     7.40                  7.25     7.45
## 10  2016     6.85                  0.75     6.14
## # … with 152 more rows

Exercise 3

What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control? Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country’s pf_expression_control, or its score out of 10, with 0 being the most, of political pressures and controls on media content, would you be comfortable using a linear model to predict the personal freedom score?

I would use a scatterplot because both variables are numerical. The predictor variable goes on the x-axis, so x is pf_expression_control and y is pf_score.

ggplot(data = hfi_2016) +
  geom_point(mapping = aes(x = pf_expression_control, y = pf_score))

Exercise 4

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

The scatterplot shows a positive, roughly linear relationship between the 2 variables. The points sit closer to the trend as pf_score and pf_expression_control increase, so the residuals get smaller at higher values. The correlation is about 0.85, which indicates a strong, positive linear relationship. The R^2 is about 0.71, which means that about 71% of the variability in one variable is explained by its linear relationship with the other.

fit_hfi_2016 <- lm(hfi_2016$pf_expression_control ~ hfi_2016$pf_score, data = hfi_2016)
summary(fit_hfi_2016)
## 
## Call:
## lm(formula = hfi_2016$pf_expression_control ~ hfi_2016$pf_score, 
##     data = hfi_2016)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1161 -0.7990  0.1404  0.7753  3.3175 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.22101    0.47075  -8.967 7.71e-16 ***
## hfi_2016$pf_score  1.31797    0.06592  19.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.247 on 160 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.7123 
## F-statistic: 399.7 on 1 and 160 DF,  p-value: < 2.2e-16
cor(hfi_2016$pf_expression_control, hfi_2016$pf_score)
## [1] 0.8450646
Rsq_hfi <- (cor(hfi_2016$pf_expression_control, hfi_2016$pf_score))^2
Rsq_hfi
## [1] 0.7141342

Exercise 5

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

attempt 1: Sum of Squares: 162.09
attempt 2: Sum of Squares: 121.158
attempt 3: Sum of Squares: 146.216
attempt 4: Sum of Squares: 156.404
attempt 5: Sum of Squares: 212.288
attempt 6: Sum of Squares: 164.168

The sum of squares that I got from my 6 attempts ranged from 121.2 to 212.3. Looking at the plots, the residuals seem the largest (meaning that I see the most distance between points and the line) for attempt 5 (sum of squares = 212.3) and the smallest for attempt 2 (sum of squares = 121.2). In the code chunks below, I did not pick the line myself. Instead, R fit the least squares line and found a lower sum of squares: 102.2.

plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.2838       0.5418  
## 
## Sum of Squares:  102.213
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      4.2838       0.5418  
## 
## Sum of Squares:  102.213

Exercise 6

Fit a new model that uses pf_expression_control to predict hf_score, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?

Using the estimates from the output below, the equation of the regression line is y_hat = 5.0534 + 0.36843x. In the context of the problem: mean hf_score = 5.0534 + (0.36843 × pf_expression_control). In this dataset, as pf_expression_control gets closer to 0, the amount of political pressure on the media increases. If pf_expression_control is 0, the predicted mean hf_score is 5.0534. For each 1-unit increase in pf_expression_control, the mean hf_score increases by 0.36843 points, so countries with less political pressure on media content tend to have higher human freedom scores.

m1 <- lm(hf_score ~ pf_expression_control, data = hfi_2016)
summary(m1)
## 
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi_2016)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68164 -0.45467  0.05692  0.46699  1.88128 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.05340    0.12293   41.11   <2e-16 ***
## pf_expression_control  0.36843    0.02236   16.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6595 on 160 degrees of freedom
## Multiple R-squared:  0.6291, Adjusted R-squared:  0.6268 
## F-statistic: 271.4 on 1 and 160 DF,  p-value: < 2.2e-16
tidy(m1)
## # A tibble: 2 x 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              5.05     0.123       41.1 5.97e-87
## 2 pf_expression_control    0.368    0.0224      16.5 2.73e-36
glance(m1)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.629         0.627 0.660      271. 2.73e-36     1  -161.  329.  338.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Exercise 7

If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom school for one with a 3 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

First, I think Open Intro meant to say “personal freedom score” not “personal freedom school” because the latter doesn’t make sense.

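Reading off the regression line, someone could estimate that if pf_expression_control is 3, the score is around 6. Using the m1 equation from Exercise 6 with x = 3 gives y = 5.0534 + 0.36843 × 3 = 6.15869. Because m1 predicts hf_score, this value is a predicted human freedom score; for pf_score, the least squares line that plot_ss reports in Exercise 5 gives 4.2838 + 0.5418 × 3 = 5.9092. In residuals(m1), the residual for the observation with x = 3 is -0.333814023. A residual is the observed value minus the predicted value, so a negative residual means the model overestimates, here by about 0.33.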

Exercise 8

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?

There doesn’t seem to be much of a pattern in the residuals plot. The residuals are scattered around the line representing 0, with no obvious curve. This suggests that the relationship between the two variables is reasonably linear, so a linear model is appropriate. The scatterplot below, with the least squares line added, tells the same story.

ggplot(data = hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Exercise 9

Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?

I don’t think that the nearly normal residuals condition has been violated because there doesn’t seem to be extreme skew in the histogram. Additionally, the shape of the histogram is similar to a bell curve.

m1_aug <- augment(m1)
ggplot(data = m1_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Residuals")

Exercise 10

Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?

Using the residuals vs. fitted plot, it seems like the variability is not constant throughout the plot, so the constant variability condition appears to be violated. The residuals are more spread out at fitted values near 5 (left side of the graph) and more tightly clustered near 8-9 (right side of the graph). This is not unexpected since the scatterplot from Exercise 8 showed the same trend: data points lie closer to the line as x and y get higher and farther from the line as x and y get lower.

ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

---
title: "Final Project - Lab 1: Intro to R"
author: Tara Bhat
date: "`r Sys.Date()`"
output: openintro::lab_report
---



```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(broom)
library(statsr)
```

### Exercise 1

**What are the dimensions of the dataset? What does each row represent?**


```{r Exercise 1}
dim(hfi)
nrow(hfi)
ncol(hfi)
```


Each row of the hfi dataset represents one country in one year. The columns hold that country's freedom scores and ranks for the year, not counts of people. The dim function returns 2 numbers, 1458 and 123, and nrow and ncol show what each one means: 1458 is the number of rows (observations) and 123 is the number of columns (variables). I can also check my work by using the view function, which opens the hfi dataset in a separate page and shows the number of rows and columns. I could also look at the help page for hfi.


### Exercise 2

**The dataset spans a lot of years, but we are only interested in data from year 2016. Filter the data hfi data frame for year 2016, select the six variables, and assign the result to a data frame named hfi_2016.**


We could say the 6 variables are pf_score, pf_rank, ef_score, ef_rank, hf_score, and hf_rank. But since the exercises only use year, pf_score, pf_expression_control, and hf_score, maybe we should only use those 4. 

```{r exercise 2}
hfi_2016 <- hfi %>%
  filter(year == 2016) %>%
  select(year, hf_score, pf_expression_control, pf_score)
hfi_2016
```



### Exercise 3

**What type of plot would you use to display the relationship between the personal freedom score, pf_score, and pf_expression_control? Plot this relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country’s pf_expression_control, or its score out of 10, with 0 being the most, of political pressures and controls on media content, would you be comfortable using a linear model to predict the personal freedom score?**


I would use a scatterplot because both variables are numerical. The predictor variable goes on the x-axis, so x is pf_expression_control and y is pf_score.

```{r exercise 3}
ggplot(data = hfi_2016) +
  geom_point(mapping = aes(x = pf_expression_control, y = pf_score))
```


### Exercise 4

**Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.**


The scatterplot shows a positive, roughly linear relationship between the 2 variables. The points sit closer to the trend as pf_score and pf_expression_control increase, so the residuals get smaller at higher values. The correlation is about 0.85, which indicates a strong, positive linear relationship. The R^2 is about 0.71, which means that about 71% of the variability in one variable is explained by its linear relationship with the other.

```{r exercise 4}
fit_hfi_2016 <- lm(pf_expression_control ~ pf_score, data = hfi_2016)
summary(fit_hfi_2016)
```

```{r Exercise 4}
cor(hfi_2016$pf_expression_control, hfi_2016$pf_score)
Rsq_hfi <- (cor(hfi_2016$pf_expression_control, hfi_2016$pf_score))^2
Rsq_hfi
```
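
As a quick arithmetic check on these two chunks: in simple linear regression, the R-squared reported by `summary()` equals the squared correlation between the two variables. The sketch below uses only the rounded values from the knitted output rather than re-fitting the model, and the names `r_printed` and `rsq_printed` are introduced here for illustration.

```r
# Values copied from the knitted output of the chunks above (rounded as displayed)
r_printed   <- 0.8450646   # cor(pf_expression_control, pf_score)
rsq_printed <- 0.7141      # Multiple R-squared from summary()

# In simple linear regression, R-squared is the square of the correlation
rsq_from_r <- r_printed^2
round(rsq_from_r, 4)

# The two agree to the precision that summary() prints
isTRUE(all.equal(round(rsq_from_r, 4), rsq_printed))
```

Because squaring removes the sign, R-squared only measures strength; the direction of the relationship comes from the sign of the correlation or the slope.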



### Exercise 5

**Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?**


attempt 1: Sum of Squares:  162.09
attempt 2: Sum of Squares:  121.158
attempt 3: Sum of Squares:  146.216
attempt 4: Sum of Squares:  156.404
attempt 5: Sum of Squares:  212.288
attempt 6: Sum of Squares:  164.168


The sum of squares that I got from my 6 attempts ranged from 121.2 to 212.3. Looking at the plots, the residuals seem the largest (meaning that I see the most distance between points and the line) for attempt 5 (sum of squares = 212.3) and the smallest for attempt 2 (sum of squares = 121.2). In the code chunks below, I did not pick the line myself. Instead, R fit the least squares line and found a lower sum of squares: 102.2.

```{r}
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)
```


```{r exercise 5}
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
```
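
To illustrate why the fitted line beats every hand-picked attempt, here is a minimal sketch of the sum of squares calculation. It uses a small made-up data set, not the hfi values, and `sum_sq` is a hypothetical helper written for this example rather than part of statsr.

```r
# Hypothetical toy data (not the hfi values), used only to illustrate the idea
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5.1, 5.4, 6.2, 6.3, 7.1, 7.4)

# Sum of squared residuals for a line with a given intercept and slope
sum_sq <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# A line chosen by eye, as one might do by clicking two points in plot_ss
ss_by_eye <- sum_sq(intercept = 4.5, slope = 0.5, x, y)

# The least squares line from lm() minimizes this quantity
fit <- lm(y ~ x)
ss_least_squares <- sum_sq(coef(fit)[1], coef(fit)[2], x, y)

ss_by_eye
ss_least_squares
ss_least_squares <= ss_by_eye  # TRUE: no other line has a smaller sum of squares
```

Any intercept and slope passed to `sum_sq` will give a value at least as large as `ss_least_squares`, which is why none of the six manual attempts could go below the fitted line's sum of squares.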


### Exercise 6

**Fit a new model that uses pf_expression_control to predict hf_score, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?**


Using the estimates from the output below, the equation of the regression line is y_hat = 5.0534 + 0.36843x. In the context of the problem: mean hf_score = 5.0534 + (0.36843 × pf_expression_control). In this dataset, as pf_expression_control gets closer to 0, the amount of political pressure on the media increases. If pf_expression_control is 0, the predicted mean hf_score is 5.0534. For each 1-unit increase in pf_expression_control, the mean hf_score increases by 0.36843 points, so countries with less political pressure on media content tend to have higher human freedom scores.


```{r Exercise 6}
m1 <- lm(hf_score ~ pf_expression_control, data = hfi_2016)
summary(m1)
tidy(m1)
glance(m1)
```
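
The slope interpretation can be checked with a small sketch that plugs values into the fitted equation. It assumes the rounded coefficients printed by `summary(m1)`, and `predict_hf` is a hypothetical helper introduced only for this example.

```r
# Coefficients copied from the summary(m1) output (rounded as printed)
b0 <- 5.05340   # intercept
b1 <- 0.36843   # slope for pf_expression_control

# Predicted mean hf_score for a given pf_expression_control value
predict_hf <- function(pf_expression_control) {
  b0 + b1 * pf_expression_control
}

# A one-unit increase in the predictor adds b1 to the prediction,
# regardless of the starting value (an additive change, not a multiplier)
predict_hf(4) - predict_hf(3)
predict_hf(8) - predict_hf(7)
```

Both differences equal the slope, which shows that the slope describes a constant additive change in the predicted score per unit of pf_expression_control.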


### Exercise 7

**If someone saw the least squares regression line and not the actual data, how would they predict a country’s personal freedom school for one with a 3 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?**


First, I think Open Intro meant to say "personal freedom score" not "personal freedom school" because the latter doesn't make sense. 

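Reading off the regression line, someone could estimate that if pf_expression_control is 3, the score is around 6. Using the m1 equation from Exercise 6 with x = 3 gives y = 5.0534 + 0.36843 × 3 = 6.15869. Because m1 predicts hf_score, this value is a predicted human freedom score; for pf_score, the least squares line that plot_ss reports in Exercise 5 gives 4.2838 + 0.5418 × 3 = 5.9092. In residuals(m1), the residual for the observation with x = 3 is -0.333814023. A residual is the observed value minus the predicted value, so a negative residual means the model overestimates, here by about 0.33.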
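
The arithmetic behind this answer can be sketched as follows. It assumes the rounded m1 coefficients from Exercise 6 and the residual value quoted above; `observed_implied` is a name introduced here, and the observed score is back-calculated from the residual rather than read from the data.

```r
# Coefficients from summary(m1), rounded as printed
b0 <- 5.05340
b1 <- 0.36843

# Prediction from the m1 regression line at pf_expression_control = 3
predicted <- b0 + b1 * 3
predicted   # 6.15869

# Residual quoted above for the observation with x = 3
residual_reported <- -0.333814023

# residual = observed - predicted, so the observed value implied here is
observed_implied <- predicted + residual_reported

# A negative residual means observed < predicted: the line overestimates
observed_implied < predicted
```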


### Exercise 8

**Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?**


There doesn't seem to be much of a pattern in the residuals plot. The residuals are scattered around the line representing 0, with no obvious curve. This suggests that the relationship between the two variables is reasonably linear, so a linear model is appropriate. The scatterplot below, with the least squares line added, tells the same story.


```{r exercise 8}
ggplot(data = hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```



### Exercise 9

**Based on the histogram, does the nearly normal residuals condition appear to be violated? Why or why not?**

I don't think that the nearly normal residuals condition has been violated because there doesn't seem to be extreme skew in the histogram. Additionally, the shape of the histogram is similar to a bell curve.

```{r exercise 9}
m1_aug <- augment(m1)
ggplot(data = m1_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.25) +
  xlab("Residuals")
```

### Exercise 10

**Based on the residuals vs. fitted plot, does the constant variability condition appear to be violated? Why or why not?**

Using the residuals vs. fitted plot, it seems like the variability is not constant throughout the plot, so the constant variability condition appears to be violated. The residuals are more spread out at fitted values near 5 (left side of the graph) and more tightly clustered near 8-9 (right side of the graph). This is not unexpected since the scatterplot from Exercise 8 showed the same trend: data points lie closer to the line as x and y get higher and farther from the line as x and y get lower.


```{r exercise 10}
ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")
```
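
One rough way to back up a visual impression like this is to compare the spread of the residuals in the lower and upper halves of the fitted values. The sketch below does that on simulated data whose noise shrinks as x grows. The data, the seed, and the split at the median are all assumptions made for illustration, not results from hfi.

```r
set.seed(1)  # hypothetical simulated data, not the hfi values

# Simulate a linear relationship whose noise shrinks as x grows
x <- runif(200, min = 0, max = 10)
y <- 5 + 0.4 * x + rnorm(200, sd = 1.2 - 0.1 * x)

fit <- lm(y ~ x)
fitted_vals <- fitted(fit)
resid_vals  <- resid(fit)

# Split observations at the median fitted value and compare residual spread
lower_half <- fitted_vals <= median(fitted_vals)
sd_lower <- sd(resid_vals[lower_half])
sd_upper <- sd(resid_vals[!lower_half])

c(lower = sd_lower, upper = sd_upper)
```

If the same comparison were run on `m1_aug$.fitted` and `m1_aug$.resid`, a clearly larger standard deviation in the lower half would be consistent with the fan shape described above, though it is an informal check and not a formal test.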

