data <- read.csv("C:\\Users\\91630\\OneDrive\\Desktop\\statistics\\age_gaps.CSV")
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(boot)
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
model <- lm(actor_1_age ~ age_difference + release_year + couple_number, data = data)
summary(model)
##
## Call:
## lm(formula = actor_1_age ~ age_difference + release_year + couple_number,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.723 -4.943 -0.566 3.712 36.430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -141.64135 26.70573 -5.304 1.36e-07 ***
## age_difference 0.92008 0.02640 34.857 < 2e-16 ***
## release_year 0.08553 0.01331 6.425 1.93e-10 ***
## couple_number 1.11478 0.29163 3.823 0.000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.243 on 1151 degrees of freedom
## Multiple R-squared: 0.5185, Adjusted R-squared: 0.5172
## F-statistic: 413.1 on 3 and 1151 DF, p-value: < 2.2e-16
The intercept of -141.64135 is the predicted age of actor 1 when all predictors (age_difference, release_year, and couple_number) are zero. Because it is implausible for all predictors to be zero at once (a release year of 0 lies far outside the data), this value has no real-world interpretation; it only anchors the regression equation.
age_difference: The coefficient of 0.92008 indicates that actor 1’s predicted age increases by about 0.92 years for each one-year increase in the age difference, holding the other predictors constant.
release_year: The coefficient of 0.08553 indicates that actor 1’s predicted age increases by about 0.086 years for each additional release year, holding the other predictors constant.
couple_number: The coefficient of 1.11478 indicates that actor 1’s predicted age increases by about 1.11 years for each one-unit increase in couple_number, holding the other predictors constant.
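To make the coefficient interpretations concrete, the short sketch below plugs the estimates from the summary output into the regression equation by hand. The observation used (a 10-year age difference, a film released in 2000, couple number 1) is a hypothetical example chosen for illustration, not a row from the data.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(intercept      = -141.64135,
           age_difference = 0.92008,
           release_year   = 0.08553,
           couple_number  = 1.11478)

# Hypothetical observation (illustrative values, not taken from the data)
new_obs <- c(age_difference = 10, release_year = 2000, couple_number = 1)

# Prediction = intercept + sum of (slope * predictor value)
predicted_age <- coefs[["intercept"]] + sum(coefs[names(new_obs)] * new_obs)
predicted_age  # about 39.7 years

# Raising age_difference by one year shifts the prediction by its slope
new_obs_plus <- new_obs
new_obs_plus["age_difference"] <- 11
shift <- coefs[["intercept"]] + sum(coefs[names(new_obs_plus)] * new_obs_plus) - predicted_age
shift  # about 0.92
```

With the fitted model in memory, predict(model, newdata = ...) should give the same number up to the rounding of the printed coefficients.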
Visualizations:
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
A clear pattern or trend in the residuals, such as a curved smoothing line or a funnel shape, would indicate that the linear model misses part of the relationship or that the residual variance is not constant across the fitted values.
Ideally, the residuals scatter randomly around the horizontal line at 0. That pattern indicates that the residuals are unrelated to the fitted values and have constant variance, supporting the assumptions of linearity and homoscedasticity.
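A visual check of constant variance can be complemented by a numeric one. The sketch below hand-computes a studentized Breusch-Pagan style statistic (n times the R-squared of an auxiliary regression of squared residuals on the predictor) using base R only. It runs on simulated data that is heteroscedastic by construction, so the variable names and values are assumptions for illustration and say nothing about the age-gap model itself.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)

# Simulated response whose noise grows with x (heteroscedastic by design)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
sq_res <- resid(fit)^2
aux <- lm(sq_res ~ x)

# LM statistic n * R^2, compared with a chi-squared distribution
# whose degrees of freedom equal the number of predictors (here 1)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value points to non-constant variance. For the model in this report, the same steps would use resid(model) and the three predictors, with df = 3.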
X Values vs Residuals
residual_plots <- gg_resX(model)
Points with large residuals at extreme X values (high leverage) may be influential observations that strongly affect the regression coefficients.
The linearity assumption is supported if the residuals are randomly distributed around the horizontal line at 0 and show no discernible pattern or trend as the X values change. This suggests a linear relationship between each predictor and the response variable.
Residuals Histogram
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If the histogram deviates substantially from a bell-shaped curve, the residuals are not normally distributed. Marked asymmetry, multiple peaks, or a long tail of very large residuals all point to departures from normality. The residual summary above (minimum -17.7, maximum 36.4) hints at a longer right tail.
The assumption of normally distributed residuals is supported if the histogram is bell-shaped and centered at zero. A symmetric distribution with no discernible skewness indicates approximately normal residuals.
gg_qqplot(model)
If the points on the Q-Q plot deviate noticeably from the diagonal line, the residuals are not normally distributed.
Ideally, the points fall along the diagonal, indicating that the residual quantiles match those of a normal distribution. Systematic departures from the line, particularly in the tails, suggest that the normality assumption is violated.
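The histogram and Q-Q plot can be backed up with a formal test. The sketch below applies base R's shapiro.test() to the residuals of two simulated regressions, one with normal errors and one with right-skewed errors. The data are simulated purely for illustration; the names y_normal and y_skewed are hypothetical.

```r
set.seed(42)
n <- 300
x <- rnorm(n)

y_normal <- 1 + 2 * x + rnorm(n)         # normally distributed errors
y_skewed <- 1 + 2 * x + (rexp(n) - 1)    # right-skewed errors with mean zero

# Shapiro-Wilk test on the residuals of each fit
p_normal <- shapiro.test(resid(lm(y_normal ~ x)))$p.value
p_skewed <- shapiro.test(resid(lm(y_skewed ~ x)))$p.value
c(normal_errors = p_normal, skewed_errors = p_skewed)
```

A small p-value is evidence against normal residuals. For the model in this report the equivalent call would be shapiro.test(resid(model)). With more than a thousand observations, even mild departures can produce a small p-value, so the plots remain the more informative diagnostic.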
gg_cooksd(model, threshold = 'matlab')
Points with Cook’s distance values much higher than the rest mark influential observations that can distort the regression coefficients. A high Cook’s distance means the estimated coefficients would change substantially if that observation were removed.
By highlighting observations whose Cook’s distance exceeds the chosen threshold, the Cook’s distance plot makes influential observations easy to identify.
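The flagged observations can also be listed numerically with base R's cooks.distance(). The sketch below assumes that the 'matlab' threshold corresponds to three times the mean Cook's distance; that reading should be checked against the lindia documentation. The data are simulated, with one influential point injected on purpose.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Inject one high-leverage point that lies far from the true line
x[n] <- 6
y[n] <- -10
fit <- lm(y ~ x)

cd <- cooks.distance(fit)
threshold <- 3 * mean(cd)   # assumed equivalent of threshold = 'matlab'
flagged <- which(cd > threshold)
flagged

# Refit without the flagged points to see how much the slope moves
fit_clean <- lm(y ~ x, subset = -flagged)
rbind(all_points = coef(fit), without_flagged = coef(fit_clean))
```

Comparing the two rows of coefficients shows the practical meaning of influence: a large shift after removal indicates that a few observations were driving the fit.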