library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Add 1-3 More Variables into your regression model
The numeric features we could try are listed below (a quick summary sketch follows the list):
year_of_birth: Useful for calculating age at selection.
total_number_of_missions: Could capture experience level, which could have an effect on hours_mission.
total_hrs_sum: Total cumulative hours in space, which may correlate with hours_mission.
eva_hrs_mission: Hours spent on extravehicular activities, which might impact mission hours.
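Before fitting anything, a quick numeric summary of these candidates can flag missing values or implausible ranges (a minimal sketch using the columns named above):

astro %>%
  select(year_of_birth, total_number_of_missions, total_hrs_sum, eva_hrs_mission) %>%
  summary()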
I am creating a new variable for easier modeling: let's create "age_at_selection".
astro$age_at_selection <- astro$year_of_selection - astro$year_of_birth
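A quick sanity check on the derived variable: ages at selection should fall in a plausible range, with no negative values.

summary(astro$age_at_selection)  # check range and NAs before modeling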
New linear regression model:
lm_model_new <- lm(hours_mission ~ age_at_selection + total_number_of_missions + eva_hrs_mission, data = astro)
summary(lm_model_new)
##
## Call:
## lm(formula = hours_mission ~ age_at_selection + total_number_of_missions +
## eva_hrs_mission, data = astro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3983.3 -718.8 -439.2 -117.1 9775.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2549.227 363.521 7.013 3.79e-12 ***
## age_at_selection -44.124 9.746 -4.527 6.54e-06 ***
## total_number_of_missions -110.497 31.893 -3.465 0.000549 ***
## eva_hrs_mission 89.386 6.032 14.818 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1570 on 1273 degrees of freedom
## Multiple R-squared: 0.1637, Adjusted R-squared: 0.1617
## F-statistic: 83.06 on 3 and 1273 DF, p-value: < 2.2e-16
The intercept is 2549.23, suggesting that an astronaut's predicted hours_mission is around this value when all other variables are zero.
age_at_selection: The coefficient for age_at_selection is -44.124, meaning that for each additional year of age at selection, hours_mission is expected to decrease by about 44.12 hours, holding all other factors constant. This negative coefficient could indicate that astronauts selected at younger ages log more mission hours.
total_number_of_missions: With a coefficient of -110.497, each additional mission correlates with a reduction of about 110.50 hours in hours_mission. This negative relationship could suggest that astronauts who have completed more missions undertake shorter or more specialized ones.
eva_hrs_mission: The coefficient for eva_hrs_mission is 89.386, suggesting that each additional hour of extravehicular activity (EVA) is associated with about 89.39 more hours in hours_mission, a strong positive relationship. This makes sense, as EVAs could be associated with longer missions overall.
Adjusted R-squared: The adjusted R-squared is 0.1617, meaning these variables explain roughly 16% of the variance in hours_mission; some explanatory power, but not much.
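To make these coefficients concrete, here is a worked prediction for a hypothetical astronaut (the input values below are made up purely for illustration):

new_astro <- data.frame(age_at_selection = 35,
                        total_number_of_missions = 2,
                        eva_hrs_mission = 10)
predict(lm_model_new, newdata = new_astro)
# by hand: 2549.227 - 44.124*35 - 110.497*2 + 89.386*10 = about 1678 hours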
Checking for Multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(lm_model_new)
## age_at_selection total_number_of_missions eva_hrs_mission
## 1.033392 1.033126 1.000288
From the above values, all close to 1, we don't see any issues with multicollinearity, so we can move ahead with our chosen variables.
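As a reminder of what VIF measures: each value is 1 / (1 - R_j^2), where R_j^2 comes from regressing one predictor on the other predictors. A minimal sketch reproducing the value for age_at_selection:

r2 <- summary(lm(age_at_selection ~ total_number_of_missions + eva_hrs_mission,
                 data = astro))$r.squared
1 / (1 - r2)  # should match the vif() output above, roughly 1.03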
Let us also test an interaction term:
lm_model_interaction <- lm(hours_mission ~ age_at_selection:total_number_of_missions + eva_hrs_mission, data = astro)
summary(lm_model_interaction)
##
## Call:
## lm(formula = hours_mission ~ age_at_selection:total_number_of_missions +
## eva_hrs_mission, data = astro)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3828.1 -693.7 -464.6 -179.9 9631.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1099.7681 108.0180 10.181 < 2e-16
## eva_hrs_mission 89.7790 6.0578 14.820 < 2e-16
## age_at_selection:total_number_of_missions -3.7766 0.9611 -3.929 8.98e-05
##
## (Intercept) ***
## eva_hrs_mission ***
## age_at_selection:total_number_of_missions ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1577 on 1274 degrees of freedom
## Multiple R-squared: 0.1557, Adjusted R-squared: 0.1544
## F-statistic: 117.5 on 2 and 1274 DF, p-value: < 2.2e-16
We could hypothesize that age at selection combined with total missions influences hours_mission, a compound effect of age and experience. The interaction term is significant, but the adjusted R-squared (0.1544) is slightly below the main-effects model's (0.1617), so the interaction on its own does not improve the fit.
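Note that the : operator fits only the interaction term, without the main effects of either variable. If we wanted the compound effect alongside both main effects, the * operator would expand to all three terms (a sketch; lm_model_full is a hypothetical name):

lm_model_full <- lm(hours_mission ~ age_at_selection * total_number_of_missions +
                      eva_hrs_mission, data = astro)
# expands to: age_at_selection + total_number_of_missions + their interaction + eva_hrs_mission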
library(lindia)
gg_resfitted(lm_model_new) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The plot shows a slight downward trend, suggesting that the variance of the residuals decreases as the fitted values increase; this is a violation of the homoscedasticity assumption.
The plot doesn't show any major outliers or extreme non-linearity, but there are still visible patterns in the residuals. The trend line is smooth, suggesting the linear model might be a reasonable fit, though we can't be sure of it from this plot alone.
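Since the car package is already loaded, a formal score test for non-constant variance can back up the visual impression (a sketch; a small p-value would support the heteroscedasticity concern):

ncvTest(lm_model_new)  # car's non-constant variance test against fitted values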
Residuals vs. X Values
plots <- gg_resX(lm_model_new, plot.all = FALSE)
plots$age_at_selection +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
plots$total_number_of_missions +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.
plots$eva_hrs_mission +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Here, I am going to focus on the plot of residuals against the predictor eva_hrs_mission.
While not as good as in the previous plot, there appears to be some variation in the spread of residuals across values of eva_hrs_mission: the plot shows larger residuals at higher EVA hours.
If the relationship between X and y were perfectly linear, the blue smoothed line would line up with the red dotted reference line; here the blue curve shows a clear non-linear pattern in the residuals. This suggests that the linear relationship assumed by the model might not be appropriate for this predictor.
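One hedged way to accommodate this curvature, without changing the rest of the model, would be to add a squared term for eva_hrs_mission (lm_model_sq is a hypothetical name):

lm_model_sq <- lm(hours_mission ~ age_at_selection + total_number_of_missions +
                    eva_hrs_mission + I(eva_hrs_mission^2), data = astro)
summary(lm_model_sq)  # check whether the quadratic term is significant and R-squared improves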
gg_reshist(lm_model_new)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram of residuals is not normally distributed (the residual summary above already showed a long right tail), which means the model might not be capturing the relationship between the variables accurately. Potential outliers are one likely cause: the histogram can reveal their presence, and such points can significantly impact the model's performance.
gg_qqplot(lm_model_new)
For this plot, the points should normally follow the straight reference line, but we can see clear deviation here, especially at the ends. This indicates that the residuals are not normally distributed, so the model's normality assumption is not met. My guess is that this is due to potential outliers or data-quality issues.
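A formal normality test can complement the QQ plot; with roughly 1,277 residuals we are within Shapiro-Wilk's sample-size limit (a sketch):

shapiro.test(residuals(lm_model_new))  # a small p-value indicates non-normal residuals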
gg_cooksd(lm_model_new, threshold = 'matlab')
Cook's distance is a measure of the influence of individual data points on a regression model: it quantifies how much the model's estimates would change if a particular data point were removed.
There are many points with high Cook's distances (shown with red labels), which means they have a significant impact on the model's fit. We can't simply remove these data points without a solid reason, and checking each one individually would take time. It is better to consider fitting alternative models, or to work out a fix from the plots above; as a first step, we can at least identify the influential rows, as sketched below.
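A sketch for pulling out those rows, using the common 4/n rule of thumb (note this cutoff differs from lindia's 'matlab' threshold, so the flagged set is an approximation):

cd <- cooks.distance(lm_model_new)
high_influence <- as.numeric(names(cd)[cd > 4 / length(cd)])
head(astro[high_influence, c("name", "hours_mission", "eva_hrs_mission")])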