May 19

library(tidyverse)
library(openintro)
library(statsr)

What are the dimensions of the dataset?

Answer: It has 1458 observations and 123 variables.

dim(hfi)
## [1] 1458 123
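To check the variable names and types alongside the dimensions, a quick structural look helps; this is a minimal sketch using dplyr's glimpse() (output omitted here, since it prints all 123 columns):

# Preview each column of hfi with its type and first few values
glimpse(hfi)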
What type of plot would you use to display the relationship between the personal freedom score, pf_score, and one of the other numerical variables? Plot the relationship using the variable pf_expression_control as the predictor. Does the relationship look linear? If you knew a country's pf_expression_control, or its score out of 10 with 0 indicating the most political pressure and control on media content, would you be comfortable using a linear model to predict the personal freedom score?
Answer: A scatter plot would be appropriate. The plot shows that the relationship between pf_expression_control and pf_score appears to be linear. Even so, I would not be entirely comfortable using a linear model to predict pf_score from pf_expression_control alone, because the points still scatter substantially around the trend.
hfi %>%
  summarise(cor(pf_expression_control, pf_score, use = "complete.obs"))
## # A tibble: 1 × 1
##   `cor(pf_expression_control, pf_score, use = "complete.obs")`
##   <dbl>
## 1 0.796
# Create a scatter plot of pf_score against pf_expression_control
plot(hfi$pf_expression_control, hfi$pf_score,
     xlab = "pf_expression_control",
     ylab = "pf_score",
     main = "pf_expression_control vs pf_score",
     col = "red",
     pch = 18)

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
Answer 3: The scatter plot shows a roughly linear form: a straight line describes the overall trend well. As the values on the x-axis increase, the y-values also increase, so the direction of the relationship is positive. Most data points cluster fairly tightly around that trend, indicating a strong relationship. A few points that sit noticeably far from the main cluster may be potential outliers or anomalies.
plot_ss(x = pf_expression_control, y = pf_score, data = hfi)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept)           x
##      4.6171      0.4914
##
## Sum of Squares: 952.153
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
Answer 4: The smallest sum of squares I obtained was 952.153, which matches the least squares line reported in the output above.
plot_ss(x = pf_expression_control, y = pf_score, data = hfi, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept)           x
##      4.6171      0.4914
##
## Sum of Squares: 952.153
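As a reference point for the clicking exercise, lm() computes the line minimizing the sum of squared residuals analytically; this short sketch should reproduce the coefficients and the 952.153 sum of squares reported by plot_ss above:

# The least squares coefficients, computed directly rather than by clicking
coef(lm(pf_score ~ pf_expression_control, data = hfi))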
Fit a new model that uses pf_expression_control to predict hf_score, or the total human freedom score. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between human freedom and the amount of political pressure on media content?
Answer 5: Using the estimates from the R output, the equation of the regression line is predicted score = 4.61707 + 0.49143 * pf_expression_control. (Note that the model actually fit below, m1, uses pf_score rather than hf_score as the response; a sketch of the hf_score model follows.) The slope of 0.49143 indicates that a one-point increase in pf_expression_control, i.e., less political pressure and control on media content, is associated with an average increase of about 0.49 in the predicted freedom score.
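A minimal sketch of the model the exercise asks for, with hf_score as the response (output not reproduced here), would be:

# Sketch: regress the total human freedom score on pf_expression_control
m_hf <- lm(hf_score ~ pf_expression_control, data = hfi)
summary(m_hf)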
m1 <- lm(pf_score ~ pf_expression_control, data = hfi)
summary(m1)
##
## Call:
## lm(formula = pf_score ~ pf_expression_control, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8467 -0.5704 0.1452 0.6066 3.2060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.61707 0.05745 80.36 <2e-16 ***
## pf_expression_control 0.49143 0.01006 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8318 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
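As a follow-up to the unusual observations noted in Answer 3, this sketch lists the countries with the largest absolute residuals from m1; filtering to complete cases on the two variables reproduces the rows lm() kept, so resid(m1) aligns row by row:

# Countries farthest from the least squares line (largest |residual|)
hfi %>%
  filter(!is.na(pf_expression_control), !is.na(pf_score)) %>%
  mutate(residual = resid(m1)) %>%
  arrange(desc(abs(residual))) %>%
  select(countries, pf_expression_control, pf_score, residual) %>%
  head(5)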
If someone saw the least squares regression line and not the actual data, how would they predict a country's personal freedom score for one with a 6.7 rating for pf_expression_control? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
Answer 6: The prediction would be about 7.91. Taking the observed pf_score to be about 7.43, this is an overestimate; the residual for this prediction is approximately -0.48.
ggplot(data = hfi, aes(x = pf_expression_control, y = pf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 80 rows containing missing values (`geom_point()`).
intercept <- 4.61707
slope <- 0.49143
given_pf_expression_control <- 6.7
# Calculate the predicted pf_score
predicted_pf_score <- intercept + slope * given_pf_expression_control
predicted_pf_score
## [1] 7.909651
# Check observed values of pf_score for countries with a 6.7 rating
# for pf_expression_control (no rows match exactly)
hfi %>%
  group_by(pf_score) %>%
  filter(pf_expression_control == 6.7)
## # A tibble: 0 × 123
## # Groups: pf_score [0]
## # ℹ 123 variables: year <dbl>, ISO_code <chr>, countries <chr>, region <chr>,
## # pf_rol_procedural <dbl>, pf_rol_civil <dbl>, pf_rol_criminal <dbl>,
## # pf_rol <dbl>, pf_ss_homicide <dbl>, pf_ss_disappearances_disap <dbl>,
## # pf_ss_disappearances_violent <dbl>, pf_ss_disappearances_organized <dbl>,
## # pf_ss_disappearances_fatalities <dbl>, pf_ss_disappearances_injuries <dbl>,
## # pf_ss_disappearances <dbl>, pf_ss_women_fgm <dbl>,
## # pf_ss_women_missing <dbl>, pf_ss_women_inheritance_widows <dbl>, …
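Since the filter above returns zero rows, no country has a pf_expression_control rating of exactly 6.7; one way to ground the assumed observed value is to inspect countries with nearby ratings (the 0.3 window is an arbitrary choice for illustration):

# Observed pf_score values for ratings near 6.7
hfi %>%
  filter(abs(pf_expression_control - 6.7) < 0.3) %>%
  select(countries, pf_expression_control, pf_score)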
# Assume the actual observed pf_score is about 7.43
actual_pf_score <- 7.43
# Calculate the residual
residual <- actual_pf_score - predicted_pf_score
residual
## [1] -0.479651
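The same prediction can be obtained directly from the fitted model with predict(), which avoids copying coefficients by hand:

# Predicted pf_score at pf_expression_control = 6.7 from m1
predict(m1, newdata = data.frame(pf_expression_control = 6.7))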
Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between the two variables?
Answer 7: There is no apparent pattern in the residuals plot, which indicates that the relationship between the two variables is linear.
# Verify the linearity condition with a plot of the residuals vs. predicted values
ggplot(data = m1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

# Plot a histogram to check for nearly normal residuals
ggplot(data = m1, aes(x = .resid)) +
  geom_histogram(binwidth = 1) +
  xlab("Residuals")

# Or a normal probability plot of the residuals
ggplot(data = m1, aes(sample = .resid)) +
  stat_qq()

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
Answer 8: Yes, the nearly normal residuals condition appears to be met.
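To make that call easier to judge, a reference line can be added to the normal probability plot; this sketch uses ggplot2's stat_qq_line():

# Normal probability plot with a reference line
ggplot(data = m1, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()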
Based on the residuals vs. fitted plot, does the constant variability condition appear to be met?
Answer 9: Yes, the constant variability condition appears to be met.
Choose another freedom variable and a variable you think would strongly correlate with it. Produce a scatter plot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
Answer (a): Yes, the two variables show a linear relationship with a positive direction.
ggplot(data = hfi, aes(x = ef_score, y = hf_score)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 80 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 80 rows containing missing values (`geom_point()`).
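Note that the scatter plot above pairs ef_score with hf_score, while the model fit below (lma) uses pf_score as the predictor. For completeness, a sketch fitting the plotted pair directly (output not shown) would be:

# Sketch: fit the pair shown in the scatter plot above
m_ef <- lm(hf_score ~ ef_score, data = hfi)
summary(m_ef)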
How does this relationship compare to the relationship between pf_expression_control and pf_score? Use R squared values from the two model summaries to compare. Does your independent variable seem to predict your dependent one better? Why or why not?
Answer (b): The first model explains 63.4% of the variability in pf_score (Multiple R-squared = 0.6342), while the model relating hf_score and pf_score explains 88.9% of the variability in hf_score (Multiple R-squared = 0.8888). Yes, my independent variable (pf_score) seems to predict my dependent variable (hf_score) better: its Multiple R-squared is higher than that of the pf_expression_control and pf_score model, so it accounts for more of the variation. (In simple regression, R-squared is the same whichever of the two variables is treated as the response, which is why lm2 below reports the same 0.6342 as m1.)
lm2 <- lm(pf_expression_control ~ pf_score, data = hfi)
summary(lm2)
##
## Call:
## lm(formula = pf_expression_control ~ pf_score, data = hfi)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4021 -0.8727 0.1288 0.9325 6.2415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.03537 0.19370 -20.83 <2e-16 ***
## pf_score 1.29059 0.02642 48.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.348 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.6342, Adjusted R-squared: 0.634
## F-statistic: 2386 on 1 and 1376 DF, p-value: < 2.2e-16
lma <- lm(hfi$hf_score ~ hfi$pf_score)
summary(lma)
##
## Call:
## lm(formula = hfi$hf_score ~ hfi$pf_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63243 -0.18541 0.02243 0.19522 1.01876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.928227 0.049166 39.22 <2e-16 ***
## hfi$pf_score 0.703378 0.006706 104.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3422 on 1376 degrees of freedom
## (80 observations deleted due to missingness)
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8887
## F-statistic: 1.1e+04 on 1 and 1376 DF, p-value: < 2.2e-16
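The two R-squared values quoted in Answer (b) can also be pulled straight from the model summaries for a side-by-side comparison:

# Extract R-squared from each fitted model
c(lm2_r2 = summary(lm2)$r.squared, lma_r2 = summary(lma)$r.squared)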
What’s one freedom relationship you were most surprised about and why? Display the model diagnostics for the regression model analyzing this relationship.
Answer (c): The relationship between pf_identity_parental_marriage and pf_identity_parental_divorce surprised me, because I wanted to see how strongly the freedom-of-marriage and freedom-of-divorce measures track each other. This model's Multiple R-squared (0.7107) is lower than the previous model's. The summary and model diagnostics for the regression of pf_identity_parental_marriage on pf_identity_parental_divorce follow.
The two variables have a positive direction, but the relationship does not look strictly linear. Based on the histogram and the normal probability plot, the residuals are not normally distributed. The residuals vs. fitted plot shows points that are not evenly scattered around 0, so the constant variability condition does not appear to hold.
model1 <- lm(hfi$pf_identity_parental_marriage ~ hfi$pf_identity_parental_divorce)
summary(model1)
##
## Call:
## lm(formula = hfi$pf_identity_parental_marriage ~ hfi$pf_identity_parental_divorce)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4096 0.0925 0.5904 0.5904 5.0925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.40538 0.16069 2.523 0.0118 *
## hfi$pf_identity_parental_divorce 0.90042 0.01897 47.460 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.124 on 917 degrees of freedom
## (539 observations deleted due to missingness)
## Multiple R-squared: 0.7107, Adjusted R-squared: 0.7104
## F-statistic: 2252 on 1 and 917 DF, p-value: < 2.2e-16
hfi %>%
  summarise(cor(pf_identity_parental_marriage, pf_identity_parental_divorce, use = "complete.obs"))
## # A tibble: 1 × 1
## `cor(...)`
## <dbl>
## 1 0.843
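As a consistency check, in simple linear regression R-squared equals the squared correlation, so the correlation above should reproduce model1's Multiple R-squared:

# Squared correlation matches Multiple R-squared
0.843^2  # = 0.710649, close to the reported 0.7107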
### Model diagnostics
### Linearity: plot residuals vs. predicted values
ggplot(data = model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

# Nearly normal residuals: plot a histogram of the residuals
ggplot(data = model1, aes(x = .resid)) +
  geom_histogram(binwidth = 1) +
  xlab("Residuals")

# Or a normal probability plot of the residuals
ggplot(data = model1, aes(sample = .resid)) +
  stat_qq()

# Constant variability: residuals vs. fitted values
ggplot(data = model1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")