Datadive9.Rmd

data <- read.csv("C:\\Users\\varsh\\OneDrive\\Desktop\\Gitstuff\\age_gaps.CSV")

library(ggplot2)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)

model <- lm(actor_1_age ~ age_difference + release_year + couple_number, data = data)
summary(model)

## 
## Call:
## lm(formula = actor_1_age ~ age_difference + release_year + couple_number, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.723  -4.943  -0.566   3.712  36.430 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -141.64135   26.70573  -5.304 1.36e-07 ***
## age_difference    0.92008    0.02640  34.857  < 2e-16 ***
## release_year      0.08553    0.01331   6.425 1.93e-10 ***
## couple_number     1.11478    0.29163   3.823 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.243 on 1151 degrees of freedom
## Multiple R-squared:  0.5185, Adjusted R-squared:  0.5172 
## F-statistic: 413.1 on 3 and 1151 DF,  p-value: < 2.2e-16

release_year-

Reason for Inclusion:

The release year may capture long-term industry trends or changes. For example, there could be differences in how actors’ ages are shown or viewed between decades.
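Multicollinearity is unlikely to be a concern here, since release year measures something conceptually different from age difference and couple number. Being conceptually distinct does not rule out correlation, however, so this is best confirmed with pairwise correlations or variance inflation factors.

couple_number-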

Reason for Inclusion:

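This variable distinguishes the couples within a movie, so values above 1 occur only in movies with more than one couple. It may capture differences in the dynamics of movies involving multiple couples vs those with a single couple.
No multicollinearity issue is expected here, since couple_number is not derived from the other predictors.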
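
The multicollinearity claims above can be checked numerically with variance inflation factors (VIFs). The following is a minimal base-R sketch, not part of the original analysis: `vif_manual` is a hypothetical helper, and the built-in `mtcars` data stands in for `age_gaps` so that the example runs on its own.

```r
# Hypothetical helper: VIF for each predictor of a fitted lm object.
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in model on mtcars (assumption: used only for illustration).
demo_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
vifs <- vif_manual(demo_fit)
print(round(vifs, 2))
```

With the model from this report, `vif_manual(model)` would return one value per predictor. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.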

Conclusion-

Age Difference:

There is a significant positive relationship between the age difference within a romantic couple and the age of actor_1 in movies.

Holding the other predictors constant, actor_1’s age increases by about 0.92 years for every one-year increase in age difference (p < 2e-16).

Movie Release Year:

Newer movies tend to feature older actors: actor_1’s age increases by about 0.0855 years for each one-year increase in release year (p = 1.93e-10).

Number of couples:

Higher couple numbers are associated with older actors, with actor_1’s age increasing by about 1.11 years for each one-unit increase in couple_number (p = 0.000139).
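
To show how precisely each slope is estimated, approximate 95% confidence intervals can be rebuilt from the estimates and standard errors printed in the summary above. This is a sketch that re-enters those printed values by hand; the variable names are illustrative.

```r
# Estimates and standard errors copied from the summary() output above.
est <- c(age_difference = 0.92008, release_year = 0.08553, couple_number = 1.11478)
se  <- c(age_difference = 0.02640, release_year = 0.01331, couple_number = 0.29163)

# Two-sided 95% critical value on the residual degrees of freedom (1151).
tcrit <- qt(0.975, df = 1151)

# Interval: estimate +/- critical value * standard error.
ci <- cbind(lower = est - tcrit * se, estimate = est, upper = est + tcrit * se)
print(round(ci, 3))
```

With the fitted object available, `confint(model)` should give the same intervals up to rounding of the printed values.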

Model Fit:

The regression model explains approximately 51.72% of the variation in actor_1 age, as shown by the adjusted R-squared value.

The extremely significant F-statistic (p < 2.2e-16) indicates that the predictors jointly explain more variation than an intercept-only model. It does not by itself show that individual predictions are accurate.
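
The fit statistics in the summary are linked by simple formulas, and the printed values can be reproduced from the multiple R-squared alone. The sketch below uses only numbers from the output above (3 predictors, 1151 residual degrees of freedom, so 1155 observations).

```r
r2   <- 0.5185        # multiple R-squared from the summary
k    <- 3             # number of predictors
df_r <- 1151          # residual degrees of freedom
n    <- df_r + k + 1  # number of observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_r

# Overall F-statistic: explained variation per predictor over
# unexplained variation per residual degree of freedom.
f_stat <- (r2 / k) / ((1 - r2) / df_r)

print(c(adj_r2 = adj_r2, f_stat = f_stat))
```

Both results agree with the reported 0.5172 and 413.1 up to rounding of the printed R-squared.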

Residual Analysis:

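The residuals are roughly centered on zero (median -0.566), but the upper tail is longer than the lower one (maximum 36.430 versus minimum -17.723), which points to some right skew rather than exact normality. The residual standard error is 7.243, so predictions of actor_1 age are typically off by about 7 years.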

Visualizations-

1. Residuals vs. Fitted Values

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

A distinct pattern or trend in the residuals suggests that the model is misspecified: a curved trend points to nonlinearity, while a funnel shape means that the residual variance does not remain constant across the fitted values (heteroscedasticity).
The plot should ideally display random scattering of residuals around the horizontal line at 0. This shows that the residuals have constant variance and are unrelated to the fitted values, which supports the assumptions of homoscedasticity and independence.
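
The visual check for constant variance can be paired with a numeric one. The sketch below is a simple Breusch-Pagan-style check (studentized form, using the fitted values as the only auxiliary regressor), written in base R; it is an illustration rather than the method used in this report, and `mtcars` again stands in for the real data.

```r
# Stand-in model (assumption: illustration only; use `model` in practice).
bp_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Auxiliary regression: do squared residuals depend on the fitted values?
aux <- lm(resid(bp_fit)^2 ~ fitted(bp_fit))

# Test statistic n * R^2, compared to a chi-squared with 1 degree of freedom.
lm_stat <- nobs(bp_fit) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value would indicate that the residual variance changes with the fitted values, matching a funnel shape in the plot.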

2. Residuals vs. X Values

residual_plots <- gg_resX(model)

Extreme outliers, or points that combine high leverage with large residuals, may be influential observations that have a substantial impact on the regression coefficients.
If the residuals are randomly scattered around the horizontal line at 0, with no visible pattern or trend as the X values vary, the relationship between each predictor and the response is plausibly linear, which supports the linearity assumption.

3. Residuals Histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If the histogram deviates significantly from a bell-shaped curve, it indicates that the residuals are not normally distributed. An irregular shape or heavy tails in the histogram point to departures from normality.
If the histogram resembles a bell-shaped curve and is centered around zero, it supports the assumption of residual normality. A symmetric distribution with no noticeable skewness indicates that the residuals are roughly normally distributed.

4. QQ-Plots

gg_qqplot(model)

If the Q-Q plot’s points depart substantially from the diagonal line, it indicates that the residuals are not normal.
Ideally, the points should lie close to the diagonal line, suggesting that the residuals are normally distributed. Deviations from this line, especially in the tails, indicate departures from normality.
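
The histogram and Q-Q plot can be supplemented with numeric summaries of normality. A minimal sketch in base R, assuming a Shapiro-Wilk test and a moment-based skewness estimate are acceptable stand-ins for the visual judgment; it uses `mtcars` for illustration, and `resid(model)` would replace `res` for this report.

```r
# Stand-in model (assumption: illustration only).
norm_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- resid(norm_fit)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)

# Sample skewness: near 0 for a symmetric distribution,
# positive when the right tail is longer.
skew <- mean((res - mean(res))^3) / sd(res)^3

print(c(W = unname(sw$statistic), p_value = sw$p.value, skewness = skew))
```

With more than a thousand observations, as in this data set, the test can flag even small departures, so the plots remain the more informative guide.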

5. Cook’s Distance Plot

gg_cooksd(model, threshold = 'matlab')

Points with Cook’s distance values much higher than the rest are influential observations that may have a disproportionate effect on the regression coefficients. High Cook’s distance values indicate that removing these observations may noticeably change the estimated coefficients.
The Cook’s Distance Plot detects influential observations by flagging those whose Cook’s distance exceeds a chosen threshold. With threshold = 'matlab', the cutoff is three times the mean Cook’s distance.
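
The effect of influential observations can be measured directly by refitting without them. The following base-R sketch assumes the 'matlab' rule means three times the mean Cook’s distance, and uses `mtcars` as a stand-in data set; it is an illustration, not a step from the original analysis.

```r
# Stand-in model (assumption: illustration only).
cd_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Cook's distance for every observation and the assumed cutoff.
cd <- cooks.distance(cd_fit)
threshold <- 3 * mean(cd)
flagged <- names(cd)[cd > threshold]

# Refit without the flagged rows and compare coefficients.
kept <- mtcars[!(rownames(mtcars) %in% flagged), ]
fit_trimmed <- lm(formula(cd_fit), data = kept)
coef_shift <- coef(fit_trimmed) - coef(cd_fit)

print(flagged)
print(round(coef_shift, 4))
```

Large shifts in `coef_shift` relative to the standard errors would suggest that the conclusions depend on a handful of observations.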