library(tidyverse)
library(openintro)
Exercise 1
The dataset has 1458 rows (observations) and 123 columns (variables).
## [1] 1458 123
Exercise 2
The relationship between two numerical variables is usually best shown with a scatterplot. The relationship between pf_expression_control and pf_score appears roughly linear: across cases, higher values of one variable tend to be paired with higher values of the other.
Cases where one or the other of these variables included NA were dropped, shaving the dimensions of the dataset down to 1378 cases (dropping 80 rows).
hfi <- hfi %>% filter(!is.na(pf_expression_control), !is.na(pf_score))
ggplot(hfi, aes(pf_expression_control,pf_score)) + geom_point()
The correlation between the two variables is fairly high at 0.796. Correlation is covariance normalized by the standard deviations of the two variables, and it measures how the values of two quantitative variables vary together (i.e. their joint variability). It is important to visualize the relationship in a scatterplot first, because the correlation coefficient only measures the strength of a linear relationship.
hfi %>%
summarise(correlation = cor(pf_expression_control, pf_score))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.796
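As a small illustration of the definition above, the sketch below rescales a covariance by the two standard deviations and compares it with cor(). The vectors are hypothetical toy values, not the hfi data. The second part shows why plotting first matters: a perfectly curved relationship can still have a correlation of zero.

```r
# Hypothetical toy vectors (not taken from hfi)
x <- c(1.0, 2.5, 3.0, 4.5, 6.0, 7.5, 9.0)
y <- c(4.8, 5.9, 6.1, 6.4, 7.8, 8.1, 9.3)

# Correlation = covariance divided by the product of the standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)

# A purely quadratic relationship: strong dependence, but no linear trend
x2 <- -3:3
y2 <- x2^2
r_curved <- cor(x2, y2)
r_curved
```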
Exercise 3
I could not get the plot_ss() function to work, even after dropping the NA values.
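The quantity this exercise is about, the sum of squared residuals for a candidate line, can also be computed by hand. The following is a minimal sketch on hypothetical toy data; sum_sq() is a helper written for this illustration and is not a replacement for plot_ss(). It compares an eyeballed line with the least squares line from lm().

```r
# Hypothetical toy data standing in for pf_expression_control and pf_score
x <- c(2, 4, 5, 6, 8, 9)
y <- c(5.6, 6.4, 7.3, 7.4, 8.7, 9.0)

# Sum of squared residuals for a candidate line y = b0 + b1 * x
sum_sq <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# An eyeballed guess for the line
guess_ss <- sum_sq(b0 = 4.5, b1 = 0.5)

# The least squares line; no other line has a smaller sum of squares
fit   <- lm(y ~ x)
ls_ss <- sum_sq(coef(fit)[[1]], coef(fit)[[2]])

c(guess = guess_ss, least_squares = ls_ss)
```

With the real data, x and y would be hfi$pf_expression_control and hfi$pf_score after the NA rows are dropped.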
Exercise 4
model_pf_score <- lm(pf_score ~ pf_expression_control, data=hfi)
summary(model_pf_score)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
Exercise 5
The linear model to predict total human freedom score (hf_score) using pf_expression_control as the explanatory variable:
$\widehat{\text{hf\_score}} = 5.1537 + 0.3499 \times \text{pf\_expression\_control}$
The intercept of about 5.15 is the predicted total human freedom score when pf_expression_control is 0, and the predicted score increases by about 0.35 for each one-unit increase in pf_expression_control.
Only about 58% of the variation in hf_score is explained by this model (R^2 = 0.5775).
model_hf_score <- lm(hf_score ~ pf_expression_control, data=hfi)
summary(model_hf_score)
##
## Call:
## lm(formula = hf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6198 -0.4908 0.1031 0.4703 2.2933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.153687 0.046070 111.87 <2e-16 ***
## pf_expression_control 0.349862 0.008067 43.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.667 on 1376 degrees of freedom
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5772
## F-statistic: 1881 on 1 and 1376 DF, p-value: < 2.2e-16
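To make "variation explained" concrete, the sketch below computes R^2 from its definition, one minus the residual sum of squares over the total sum of squares, and checks it against the value reported by summary(). The data are hypothetical toy values, not hfi.

```r
# Hypothetical toy data (not taken from hfi)
x <- c(1.5, 2.0, 3.5, 5.0, 6.5, 7.0, 8.5, 9.5)
y <- c(5.9, 5.6, 6.8, 6.9, 7.2, 8.1, 8.0, 8.6)

fit <- lm(y ~ x)

# Intercept and slope, as they appear in the Estimate column of summary()
coef(fit)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

r2_reported <- summary(fit)$r.squared
all.equal(r2_manual, r2_reported)
```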
Exercise 6
At a pf_expression_control rating of 6.7, the linear model predicts a personal freedom score of 7.9097 (4.61707 + 0.49143 * 6.7).
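The prediction can be reproduced from the coefficients printed in the Exercise 4 summary. This sketch hard-codes those two estimates so that it runs without the fitted model; the commented line shows the equivalent predict() call, assuming model_pf_score is still in the workspace.

```r
# Coefficients copied from the summary(model_pf_score) output
b0 <- 4.61707   # intercept
b1 <- 0.49143   # slope for pf_expression_control

# Predicted pf_score at a pf_expression_control rating of 6.7
predicted <- b0 + b1 * 6.7
predicted

# Equivalent call on the fitted model (requires model_pf_score):
# predict(model_pf_score, newdata = data.frame(pf_expression_control = 6.7))
```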
hfi$predicted_pf_score <- predict(model_pf_score)
hfi$residuals_pf_score <- residuals(model_pf_score)
ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
geom_segment(aes(xend=pf_expression_control, yend=predicted_pf_score), alpha=.4) +
geom_point(aes(color=residuals_pf_score)) +
scale_color_gradient2(low = "blue", mid = "lightblue", high = "red") +
stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
How do you extract the residual for the model at a specific value of the explanatory variable within a large dataset?
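One possible answer, sketched on hypothetical toy data: store the fitted values and residuals as columns, then keep only the rows whose explanatory value matches the target. A value such as 6.7 can occur in several rows, so this returns one residual per matching observation. The comparison uses a small tolerance rather than == because the values are doubles.

```r
# Hypothetical toy data; the x value 6.7 appears in two rows
toy <- data.frame(
  x = c(2.0, 4.5, 6.7, 6.7, 8.0, 9.5),
  y = c(5.4, 6.9, 7.1, 8.6, 8.4, 9.4)
)

fit <- lm(y ~ x, data = toy)

# Keep fitted values and residuals next to the observations
toy$fitted <- unname(fitted(fit))
toy$resid  <- unname(residuals(fit))

# Rows whose explanatory value equals the target, within a tolerance
target    <- 6.7
at_target <- toy[abs(toy$x - target) < 1e-8, ]
at_target
```

Applied to this report, the same pattern would filter hfi on pf_expression_control and read off the residuals_pf_score column created above.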
Exercise 7
No pattern is apparent in the plot of the residuals against the fitted values, which supports treating the relationship as linear.
ggplot(data = model_pf_score, aes(x = .fitted, y = .resid)) +
geom_point(aes(color=.resid)) +
scale_color_gradient2(low = "blue", mid = "lightblue", high = "red") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

Exercise 8
Yes, the condition of nearly normal residuals appears to be reasonably met, although not perfectly.
The histogram of the residuals suggests the distribution is left skewed: the peak is right of center and the tail is much longer on the left.
The median residual is positive (0.1452), so slightly more than half of the residuals are positive. The long left tail reflects a smaller number of large negative residuals, in other words cases where the model predicted a personal freedom score well above the observed value.
Judging from the histogram, the skew looks mild (roughly within the common rule-of-thumb range of -1 to 1), so treating the residuals as approximately normal seems acceptable.
ggplot(data = model_pf_score, aes(x = .resid, fill= cut(.resid, 100))) +
geom_histogram(binwidth = .5, show.legend=FALSE) +
xlab("Residuals")
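The degree of skew can be checked numerically rather than by eye. The sketch below defines a simple moment-based sample skewness; skewness() here is a helper written for this illustration (base R has no built-in skewness function), and the input values are hypothetical. A cutoff of 1 in absolute value is only a rule of thumb.

```r
# Moment-based sample skewness: third central moment / (second central moment)^1.5
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Hypothetical residual-like values with a long left tail
res <- c(-3.8, -1.9, -0.6, -0.1, 0.1, 0.2, 0.4, 0.5, 0.6, 0.9)
skew_left <- skewness(res)
skew_left  # negative for a left-skewed sample

# A symmetric sample has zero skewness
skew_sym <- skewness(c(-2, -1, 0, 1, 2))
skew_sym
```

For the model in this report, the call would be skewness(residuals(model_pf_score)).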
Normal probability plots of the residuals help show whether outliers pull the distribution away from normality. There is deviation from the theoretical line at the lower end, where several points have residuals that are more negative (observed personal freedom scores further below the fitted values) than would be expected if the residuals were normally distributed.
These points can be attributed to unusually low values of the response variable, personal freedom score, that the model could not predict accurately. The model also leaves a fair amount of variation unexplained, with an adjusted R^2 of only 63%.
ggplot(data = model_pf_score, aes(sample = .resid)) +
stat_qq() +
stat_qq_line(line.p = c(0.25, 0.75), color="red",size=1)

Exercise 9
The residuals form a roughly horizontal band around the 0 line, which indicates that their variance is approximately constant. There are some clear outliers, and they are mostly negative residuals, meaning the model over-predicts personal freedom for a handful of low-scoring cases. One could argue that their presence undermines the assumption of nearly normal residuals. However, a check of the response variable using fences at 3 * IQR beyond the quartiles finds no extreme outliers, so the conditions can still be treated as reasonably met.
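# Quartiles (Tukey hinges) and interquartile range of the response
pf_score_Q1 <- fivenum(hfi$pf_score)[2]
pf_score_Q3 <- fivenum(hfi$pf_score)[4]
pf_score_IQR <- IQR(hfi$pf_score)
# Extreme-outlier fences: 3 * IQR below Q1 and above Q3
EOF1 <- pf_score_Q1 - 3*pf_score_IQR
EOF2 <- pf_score_Q3 + 3*pf_score_IQR
# Count observations beyond each fence
EO_min_count <- sum(hfi$pf_score < EOF1)
EO_max_count <- sum(hfi$pf_score > EOF2)
EO_min_count
## [1] 0
EO_max_count
## [1] 0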
More Practice
Exercise 1
At a glance, there does appear to be a linear relationship between the 2 variables pf_expression_influence and pf_score. Values appear to jointly increase for both variables.
ggplot(hfi, aes(pf_expression_influence,pf_score)) + geom_point()

Exercise 2
This model is comparable to the one using pf_expression_control as the predictor of personal freedom score. Adjusted R^2 is 62%, compared with 63% for pf_expression_control. The intercept is somewhat higher (5.06 versus 4.62), and the slope is lower: pf_score is predicted to increase by 0.41 for every unit increase in pf_expression_influence, compared with 0.49 for every unit increase in pf_expression_control.
This model does not predict personal freedom noticeably better than the one using pf_expression_control; its residual standard error is also slightly larger (0.8482 versus 0.8318).
model_pf_score_2 <- lm(pf_score ~ pf_expression_influence, data=hfi)
summary(model_pf_score_2)
##
## Call:
## lm(formula = pf_score ~ pf_expression_influence, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9688 -0.5830 0.1681 0.5903 3.6730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.06135 0.05064 99.95 <2e-16 ***
## pf_expression_influence 0.41150 0.00869 47.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8482 on 1376 degrees of freedom
## Multiple R-squared: 0.6197, Adjusted R-squared: 0.6195
## F-statistic: 2243 on 1 and 1376 DF, p-value: < 2.2e-16
hfi %>%
summarise(correlation = cor(pf_expression_influence, pf_score))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.787
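A comparison like the one above can be collected into a single table instead of being read off two summary() printouts. The sketch below uses simulated toy data (hypothetical; x1 and x2 merely stand in for two correlated predictors) and pulls the adjusted R^2 and residual standard error from each fit.

```r
set.seed(1)  # simulated toy data, not hfi

n  <- 200
x1 <- runif(n, 0, 10)
x2 <- x1 + rnorm(n, sd = 1.5)            # a noisier relative of x1
y  <- 4.5 + 0.5 * x1 + rnorm(n, sd = 0.8)

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x2)

# One row per model: adjusted R^2 and residual standard error
comparison <- data.frame(
  model  = c("y ~ x1", "y ~ x2"),
  adj_r2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  rse    = c(sigma(fit1), sigma(fit2))
)
comparison
```

In this report the two fits would be model_pf_score and model_pf_score_2. A higher adjusted R^2 and a lower residual standard error both point to the better-fitting model.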
Exercise 3
I would have expected the degree of political influence over newspapers to be a strong predictor of personal freedom. Freedom of the press has long been tied up with the idea of personal liberty and freedom. However, on graphing these two variables, the linear relationship is clearly much weaker than for the previous two predictors. Even countries whose newspapers rate a 10 (least exerted political influence) were associated with a wide range of personal freedom scores.
The linear model using pf_expression_newspapers as the predictor variable has an adjusted R^2 of only 39%, and 255 observations were dropped because of missing values. With only 39% of the variation in personal freedom scores accounted for, this model is a poorer fit for making predictions than the two models above.
ggplot(hfi, aes(pf_expression_newspapers,pf_score)) + geom_point()
## Warning: Removed 255 rows containing missing values (geom_point).

model_pf_score_3 <- lm(pf_score ~ pf_expression_newspapers, data=hfi)
summary(model_pf_score_3)
##
## Call:
## lm(formula = pf_score ~ pf_expression_newspapers, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9877 -0.8174 0.0780 0.9253 2.4888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.77611 0.13553 27.86 <2e-16 ***
## pf_expression_newspapers 0.40639 0.01532 26.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.095 on 1121 degrees of freedom
## (255 observations deleted due to missingness)
## Multiple R-squared: 0.3858, Adjusted R-squared: 0.3852
## F-statistic: 704.1 on 1 and 1121 DF, p-value: < 2.2e-16
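Because lm() silently drops rows with missing predictor values, this model is fit on fewer observations than the earlier ones, which makes a direct comparison slightly uneven. The sketch below, on hypothetical toy data, shows how nobs() reports the rows actually used and how restricting every model to the complete cases puts them on the same footing.

```r
# Hypothetical toy data; x2 has two missing values
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, NA, 3, 5, NA, 6, 8, 9),
  y  = c(5.1, 5.4, 6.2, 6.8, 7.1, 7.9, 8.2, 9.0)
)

# lm() omits rows with NA in the variables it uses
fit_x1 <- lm(y ~ x1, data = toy)
fit_x2 <- lm(y ~ x2, data = toy)
c(x1 = nobs(fit_x1), x2 = nobs(fit_x2))

# Refit both models on the same complete cases for a like-for-like comparison
complete  <- toy[complete.cases(toy), ]
fit_x1_cc <- lm(y ~ x1, data = complete)
fit_x2_cc <- lm(y ~ x2, data = complete)
c(x1 = nobs(fit_x1_cc), x2 = nobs(fit_x2_cc))
```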
