library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
astro <- read_delim('/Users/sneha/H510-Statistics/astronaut-data.csv')
## Rows: 1277 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, nationality, military_civilian, selection, occupation, ...
## dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Add 1-3 More Variables into your regression model

The numeric features we could try is :

I am creating a new variable for better computation: lets create “age_at_selection”

astro$age_at_selection <- astro$year_of_selection - astro$year_of_birth

New linear regression model:

lm_model_new <- lm(hours_mission ~ age_at_selection + total_number_of_missions + eva_hrs_mission, data = astro)

summary(lm_model_new)
## 
## Call:
## lm(formula = hours_mission ~ age_at_selection + total_number_of_missions + 
##     eva_hrs_mission, data = astro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3983.3  -718.8  -439.2  -117.1  9775.8 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2549.227    363.521   7.013 3.79e-12 ***
## age_at_selection          -44.124      9.746  -4.527 6.54e-06 ***
## total_number_of_missions -110.497     31.893  -3.465 0.000549 ***
## eva_hrs_mission            89.386      6.032  14.818  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1570 on 1273 degrees of freedom
## Multiple R-squared:  0.1637, Adjusted R-squared:  0.1617 
## F-statistic: 83.06 on 3 and 1273 DF,  p-value: < 2.2e-16

The intercept is 2549.23 , suggesting that, on average, an astronaut’s hours_mission is around this value when all other variables are zero.

age_at_selection : The coefficient for age_at_selection is -44.124, meaning that for each additional year at selection, hours_mission is expected to decrease by about 44.12 hours, holding all other factors constant. This negative coefficient could indicate that younger astronauts spend more mission hours.

With a coefficient of -110.497 each additional mission seems to correlate with a reduction of 110.49 hours in hours_mission. This negative relationship could suggest that those who have completed more missions might undertake shorter or more specialized missions.

The coefficient for eva_hrs_missionis 89.386, suggesting that for each additional hour of extravehicular activity (EVA), hours_mission increases by about 89.386 hours, indicating a strong positive relationship. This makes sense, as EVAs could be associated with longer missions overall.

Adjusted R-squared: The adjusted R-squared is similar at 0.1617, which means that including these variables does explain some portion of the variance but not extensively.

Checking for Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(lm_model_new)
##         age_at_selection total_number_of_missions          eva_hrs_mission 
##                 1.033392                 1.033126                 1.000288

from the above values, we dont see any issues with multicollinearity, hence we could move with our chosen variables

Lets us also test an Interaction Term:

lm_model_interaction <- lm(hours_mission ~ age_at_selection:total_number_of_missions + eva_hrs_mission, data = astro)

summary(lm_model_interaction)
## 
## Call:
## lm(formula = hours_mission ~ age_at_selection:total_number_of_missions + 
##     eva_hrs_mission, data = astro)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3828.1  -693.7  -464.6  -179.9  9631.8 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               1099.7681   108.0180  10.181  < 2e-16
## eva_hrs_mission                             89.7790     6.0578  14.820  < 2e-16
## age_at_selection:total_number_of_missions   -3.7766     0.9611  -3.929 8.98e-05
##                                              
## (Intercept)                               ***
## eva_hrs_mission                           ***
## age_at_selection:total_number_of_missions ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1577 on 1274 degrees of freedom
## Multiple R-squared:  0.1557, Adjusted R-squared:  0.1544 
## F-statistic: 117.5 on 2 and 1274 DF,  p-value: < 2.2e-16

We could hypothesize that age at selection combined with total missions may influence hours_mission, giving a compound effect of age and experience.

Model Diagnostics

library(lindia)

Residuals vs. Fitted Values

gg_resfitted(lm_model_new) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The plot shows a slight downward trend, suggesting that the variance of the residuals decreases as the fitted values increase. This is a violation of homoscedasticity

The plot doesn’t show any major outliers or extreme non-linearity.

There are patterns in the residuals as well

The trend in the residuals seems to be smooth, suggesting that the linear model might be a reasonable fit but cant be sure of it.

Residuals vs. X Values

plots <- gg_resX(lm_model_new, plot.all = FALSE)

plots$age_at_selection +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plots$total_number_of_missions +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Failed to fit group -1.
## Caused by error in `smooth.construct.cr.smooth.spec()`:
## ! x has insufficient unique values to support 10 knots: reduce k.

plots$eva_hrs_mission +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Here, i am going to focus on the plot of residuals and the predictor variable “eva_hrs_mission” ;

While not as good as in the previous plot, there appears to be some variation in the spread of residuals across different values of “eva_hrs_mission”. this plot shows higher residuals for higher eva_hours.

If the relationship between X and y were perfectly linear, then the blue line would line up with the red dotted line, here the blue curve shows a clear non-linear pattern in the residuals. This suggests that the linear relationship assumed in the model might not be appropriate.

Residual Histogram

gg_reshist(lm_model_new)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is not normally distributed, which means that the model might not be capturing the relationship between the variables accurately

this can be due to potential outliers, the histogram might also reveal the presence of outliers, which can significantly impact the model’s performance.

QQ-Plots

gg_qqplot(lm_model_new)

For this plot normally the points on the plot should follow a straight line, but we can see a deviation here.

The points deviate significantly from the straight line, especially at the ends. This indicates that the residuals are not normally distributed, which means that models assumptions are not met.

I am guessing that these might be due to potential outliers or data quality issues.

Cook’s D by Observation

gg_cooksd(lm_model_new, threshold = 'matlab')

Cook’s distance is a measure of influence of individual data points on a regression model. It quantifies how much the model’s estimates would change if a particular data point were removed.

There are a lot of points with high Cook’s distances which are troublesome, shown in red labels, which means they have a significant impact on the model’s fit. Maybe focusing on other plots might lead to a solution here, because we cant simply remove these data points without a solid reason. and i feel that individually checking these data points might take some time. it is better to consider fitting alternative models or try to figure out a fix from above plots.