data <- read.csv("C:\\Users\\91630\\OneDrive\\Desktop\\statistics\\age_gaps.CSV")
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
model <- lm(actor_1_age ~ age_difference + release_year + couple_number, data = data)
summary(model)
##
## Call:
## lm(formula = actor_1_age ~ age_difference + release_year + couple_number,
##     data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.723  -4.943  -0.566   3.712  36.430
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -141.64135   26.70573  -5.304 1.36e-07 ***
## age_difference    0.92008    0.02640  34.857  < 2e-16 ***
## release_year      0.08553    0.01331   6.425 1.93e-10 ***
## couple_number     1.11478    0.29163   3.823 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.243 on 1151 degrees of freedom
## Multiple R-squared: 0.5185, Adjusted R-squared: 0.5172
## F-statistic: 413.1 on 3 and 1151 DF, p-value: < 2.2e-16
Age Difference: The age difference between romantic partners is strongly
and positively associated with actor_1’s age. Holding the other predictors
constant, each additional year of age difference corresponds to an increase
of about 0.92 years in actor_1’s age (p < 2e-16).
Year of the Movie’s Release: The positive coefficient of approximately
0.0855 means that, holding the other predictors constant, actor_1 is about
0.0855 years older for each one-year increase in release year, so actor_1
tends to be slightly older in newer films (p = 1.93e-10).
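Couple Number: Actor_1’s age is roughly 1.11 years higher for each
one-unit increase in couple_number, holding the other predictors constant
(p = 0.000139). This is an association within the fitted model; on its own
it does not show that older actors are more likely to be in relationships.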
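Point estimates like the ones above are easier to judge alongside their uncertainty. The following is a minimal sketch on simulated stand-in data, not the age_gaps data; the helper `coef_table` and the data frame `sim` are hypothetical names introduced only for illustration. It pairs each coefficient with a 95% confidence interval from `confint()`.

```r
# Hypothetical helper: coefficient estimates with confidence intervals
# for any fitted lm object.
coef_table <- function(fit, level = 0.95) {
  ci <- confint(fit, level = level)
  data.frame(
    term      = rownames(ci),
    estimate  = unname(coef(fit)),
    lower     = ci[, 1],
    upper     = ci[, 2],
    row.names = NULL
  )
}

# Simulated stand-in data (assumption: NOT the age_gaps data).
set.seed(1)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.9 * sim$x1 + 0.1 * sim$x2 + rnorm(200)

fit <- lm(y ~ x1 + x2, data = sim)
tab <- coef_table(fit)
print(tab)
```

Under these assumptions, the same helper could be called as `coef_table(model)` on the fitted model from this report.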
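Model Fit: The adjusted R-squared indicates that the model explains about 51.72% of the variation in actor_1’s age. The highly significant F-statistic (413.1 on 3 and 1151 DF, p < 2.2e-16) indicates that the three predictors jointly explain significantly more variation than an intercept-only model. It does not by itself show that the model’s assumptions hold, which is what the diagnostic plots are for.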
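The fit statistics can be reproduced by hand from the printed summary, which is a useful sanity check on how they relate to each other. The sketch below uses only the rounded values shown in the output (R-squared, residual degrees of freedom, number of predictors), so small rounding differences are expected.

```r
# Values copied from the printed summary output.
r2       <- 0.5185   # Multiple R-squared
df_resid <- 1151     # residual degrees of freedom
p        <- 3        # predictors, excluding the intercept
n        <- df_resid + p + 1   # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variation per predictor divided by
# unexplained variation per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(n)                 # 1155
print(round(adj_r2, 4))  # 0.5172
print(round(f_stat, 1))  # 413.1
```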
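The residual quartiles are roughly symmetric around zero (1Q = -4.943, 3Q = 3.712), but the largest residual (36.430) is about twice the magnitude of the smallest (-17.723), which hints at a longer right tail than a normal distribution would have. The residual standard error is 7.243 on 1151 degrees of freedom, so a typical prediction misses actor_1’s age by roughly 7 years.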
Visualizations:
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
A clear pattern or trend in the residuals plotted against the fitted values indicates a problem: a curved smoothing line points to nonlinearity, and a spread that widens or narrows across the fitted values means the residual variance is not constant.
Ideally, the plot should show residuals scattered randomly around the horizontal line at 0 with roughly the same spread everywhere. This indicates that the residuals do not depend on the fitted values and have constant variance, supporting the assumptions of linearity and homoscedasticity.
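A visual check of constant variance can be backed up with a formal test. The sketch below implements Koenker's studentized version of the Breusch-Pagan test in base R: it regresses the squared residuals on the model's predictors and compares n times the R-squared of that auxiliary regression with a chi-squared distribution. The helper name `bp_test` and the simulated data are assumptions for illustration; the `lmtest` package offers a packaged `bptest()` if it is installed.

```r
# Hypothetical helper: studentized Breusch-Pagan test for an lm fit.
bp_test <- function(fit) {
  u2  <- residuals(fit)^2          # squared residuals
  X   <- model.matrix(fit)         # intercept + predictors
  aux <- lm.fit(X, u2)             # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((u2 - mean(u2))^2)
  stat <- length(u2) * r2          # n * R^2
  df   <- ncol(X) - 1              # number of predictors
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated stand-in data whose error spread grows with x
# (assumption: NOT the age_gaps data).
set.seed(42)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)

het_fit <- lm(y ~ x)
res <- bp_test(het_fit)
print(res)
```

A small p-value is evidence against constant variance. Applied to this report, the call would be `bp_test(model)`.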
residual_plots <- gg_resX(model)
Points with large residuals, especially at extreme values of a predictor (high leverage), may be influential observations that have a strong effect on the regression coefficients.
The linearity assumption is supported if the residuals are randomly scattered around the horizontal line at 0 and show no discernible pattern or trend as the X values change. This suggests that the relationship between each predictor and the response variable is adequately described as linear.
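One way to probe linearity numerically is to add a squared term for a predictor and compare the two nested models with `anova()`. The sketch below uses simulated stand-in data with a deliberately curved relationship; the variable names are hypothetical and the data are not from age_gaps.

```r
# Simulated stand-in data with a truly curved relationship
# (assumption: NOT the age_gaps data).
set.seed(7)
x <- runif(300, 0, 10)
y <- 1 + 0.5 * x + 0.3 * x^2 + rnorm(300)

linear_fit <- lm(y ~ x)            # straight-line model
quad_fit   <- lm(y ~ x + I(x^2))   # adds a quadratic term

# F-test for whether the quadratic term improves the fit.
cmp <- anova(linear_fit, quad_fit)
print(cmp)

p_quad <- cmp[["Pr(>F)"]][2]
print(p_quad)
```

A small p-value suggests the straight-line form misses curvature. For the model in this report, a comparable check would refit with a term such as `I(age_difference^2)` and compare against `model`.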
Residuals Histogram
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The residuals are not normally distributed if the histogram deviates substantially from a bell-shaped curve. Departures from normality show up as asymmetry, multiple peaks, or heavy tails with a few very large residuals.
The assumption of normally distributed residuals is supported if the histogram is bell-shaped and centered at zero. A symmetric distribution with no discernible skewness indicates that the residuals are approximately normal.
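Skewness can also be quantified rather than judged by eye. The sketch below defines a moment-based sample skewness in base R; the helper name `skewness` is hypothetical (base R has no built-in function by that name), and the two toy vectors are illustrative only.

```r
# Hypothetical helper: moment-based sample skewness.
# Zero for a symmetric sample, positive for a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

sym   <- skewness(c(-2, -1, 0, 1, 2))   # symmetric sample
right <- skewness(c(0, 0, 0, 0, 10))    # one large value on the right

print(sym)    # 0
print(right)  # 1.5
```

For this report, `skewness(residuals(model))` would give a single number to set beside the histogram.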
gg_qqplot(model)
The residuals are not normal if the points on the Q-Q plot deviate noticeably from the diagonal line.
The points should ideally lie along the diagonal, which is what residuals drawn from a normal distribution would produce. Systematic departures from this line, particularly at the ends, suggest that the normality assumption is violated.
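The Q-Q plot can be complemented by a Shapiro-Wilk test and by computing the plotted coordinates directly. The sketch below uses a simulated right-skewed sample as a stand-in for residuals (an assumption, not the report's data). With samples in the hundreds or more, the test flags even mild departures, so it is best read together with the plot.

```r
# Simulated stand-in for residuals: centered but right-skewed
# (assumption: NOT the residuals of the report's model).
set.seed(3)
skewed <- rexp(300) - 1

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(skewed)
print(sw)

# Q-Q coordinates by hand: sorted sample against theoretical
# standard normal quantiles.
qq <- data.frame(
  theoretical = qnorm(ppoints(length(skewed))),
  sample      = sort(skewed)
)
print(head(qq))
```

For this report, the corresponding call would be `shapiro.test(residuals(model))`.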
gg_cooksd(model, threshold = 'matlab')
Influential observations, which can distort the regression coefficients, appear as points whose Cook’s distance is much higher than the rest. A high Cook’s distance means the estimated coefficients would change noticeably if that observation were removed.
By highlighting observations whose Cook’s distance exceeds a chosen
threshold (here the 'matlab' option of gg_cooksd), the Cook’s Distance
Plot identifies candidates for influential observations.
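The flagged observations can also be listed numerically. The sketch below computes Cook's distance with base R's `cooks.distance()` on simulated stand-in data containing one planted outlier. The cutoff of three times the mean Cook's distance is an assumption about what the 'matlab' threshold corresponds to and should be checked against the lindia documentation.

```r
# Simulated stand-in data (assumption: NOT the age_gaps data).
set.seed(11)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Plant one high-leverage point far from the true line.
x[100] <- 8
y[100] <- -20

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance.
cutoff  <- 3 * mean(cd)
flagged <- which(cd > cutoff)

print(cutoff)
print(flagged)
```

For this report, `which(cooks.distance(model) > 3 * mean(cooks.distance(model)))` would list the corresponding row indices under the same assumed cutoff.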