HW 1 Dheeraj Data Mining

# Homework 1
# Step 1 : Imported the csv

# First Line of Code ->
set.seed(123)

# Question 1 : 
# Explore the dataset and use an appropriate method to fill in the missing values, if there are any. Fully explain which method you chose and why.

# Answer 1 : 
# Method : Mean Imputation 
# I'll fill missing values in the PROTEIN and IRON columns with their mean, as it's a simple method suitable for continuous data with random missingness.

library(readr)
nutrition <- read_csv("/Users/shekardheeraj/Desktop/HW 1/nutrition.csv")

## Rows: 961 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): FOOD
## dbl (25): WT_GRAMS, PC_WATER, PROTEIN, FAT, SAT_FAT, MONUNSAT, POLUNSAT, CHO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nutrition)

## # A tibble: 6 × 26
##   FOOD   WT_GRAMS PC_WATER PROTEIN   FAT SAT_FAT MONUNSAT POLUNSAT CHOLEST CARBO
##   <chr>     <dbl>    <dbl>   <dbl> <dbl>   <dbl>    <dbl>    <dbl>   <dbl> <dbl>
## 1 GELAT…      7         13       6     0     0        0        0         0     0
## 2 SEAWE…     28.4        5      16     2     0.8      0.2      0.6       0     7
## 3 YEAST…      7          5       3     0     0        0.1      0         0     3
## 4 PARME…     28.4       18      12     9     5.4      2.5      0.2      22     1
## 5 PARME…    100         18      42    30    19.1      8.7      0.7      79     4
## 6 PARME…      5         18       2     2     1        0.4      0         4     0
## # ℹ 16 more variables: CALCIUM <dbl>, PHOSPHOR <dbl>, IRON <dbl>, POTASS <dbl>,
## #   SODIUM <dbl>, VIT_A_IU <dbl>, VIT_A_RE <dbl>, THIAMIN <dbl>,
## #   RIBOFLAV <dbl>, NIACIN <dbl>, ASCORBIC <dbl>, CAL_GRAM <dbl>,
## #   IRN_GRAM <dbl>, PRO_GRAM <dbl>, FAT_GRAM <dbl>, CALORIES <dbl>

str(nutrition)

## spc_tbl_ [961 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ FOOD    : chr [1:961] "GELATIN; DRY                  1 ENVELP" "SEAWEED; SPIRULINA; DRIED     1 OZ" "YEAST; BAKERS; DRY; ACTIVE    1 PKG" "PARMESAN CHEESE; GRATED       1 OZ" ...
##  $ WT_GRAMS: num [1:961] 7 28.4 7 28.4 100 ...
##  $ PC_WATER: num [1:961] 13 5 5 18 18 18 5 49 63 4 ...
##  $ PROTEIN : num [1:961] 6 16 3 12 42 2 3 17 30 24 ...
##  $ FAT     : num [1:961] 0 2 0 9 30 2 0 7 1 0 ...
##  $ SAT_FAT : num [1:961] 0 0.8 0 5.4 19.1 1 0 2.9 0.3 0.3 ...
##  $ MONUNSAT: num [1:961] 0 0.2 0.1 2.5 8.7 0.4 0 2.6 0.2 0.1 ...
##  $ POLUNSAT: num [1:961] 0 0.6 0 0.2 0.7 0 0 0.4 0.3 0 ...
##  $ CHOLEST : num [1:961] 0 0 0 22 79 4 0 59 48 12 ...
##  $ CARBO   : num [1:961] 0 7 3 1 4 0 3 0 0 35 ...
##  $ CALCIUM : num [1:961] 1 34 3 390 1376 ...
##  $ PHOSPHOR: num [1:961] 0 33 90 229 807 40 140 111 202 670 ...
##  $ IRON    : num [1:961] 0 8.1 1.1 0.3 1 0 1.4 1.3 0.6 0.2 ...
##  $ POTASS  : num [1:961] 2 386 140 30 107 5 152 162 255 1160 ...
##  $ SODIUM  : num [1:961] 6 297 4 528 1861 ...
##  $ VIT_A_IU: num [1:961] 0 160 0 200 700 40 0 0 110 1610 ...
##  $ VIT_A_RE: num [1:961] 0 16 0 49 173 9 0 0 32 483 ...
##  $ THIAMIN : num [1:961] 0 0.67 0.16 0.01 0.05 0 1.25 0.03 0.03 0.28 ...
##  $ RIBOFLAV: num [1:961] 0 1.04 0.38 0.11 0.39 0.02 0.34 0.13 0.1 1.19 ...
##  $ NIACIN  : num [1:961] 0 3.6 2.6 0.1 0.3 0 3 3 13.4 0.6 ...
##  $ ASCORBIC: num [1:961] 0 3 0 0 0 0 0 0 0 4 ...
##  $ CAL_GRAM: num [1:961] 3.57 2.82 2.86 4.59 4.55 ...
##  $ IRN_GRAM: num [1:961] 0 0.2857 0.1571 0.0106 0.01 ...
##  $ PRO_GRAM: num [1:961] 0.857 0.564 0.429 0.423 0.42 ...
##  $ FAT_GRAM: num [1:961] 0 0.0705 0 0.3175 0.3 ...
##  $ CALORIES: num [1:961] 25 80 20 130 455 25 25 135 135 245 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   FOOD = col_character(),
##   ..   WT_GRAMS = col_double(),
##   ..   PC_WATER = col_double(),
##   ..   PROTEIN = col_double(),
##   ..   FAT = col_double(),
##   ..   SAT_FAT = col_double(),
##   ..   MONUNSAT = col_double(),
##   ..   POLUNSAT = col_double(),
##   ..   CHOLEST = col_double(),
##   ..   CARBO = col_double(),
##   ..   CALCIUM = col_double(),
##   ..   PHOSPHOR = col_double(),
##   ..   IRON = col_double(),
##   ..   POTASS = col_double(),
##   ..   SODIUM = col_double(),
##   ..   VIT_A_IU = col_double(),
##   ..   VIT_A_RE = col_double(),
##   ..   THIAMIN = col_double(),
##   ..   RIBOFLAV = col_double(),
##   ..   NIACIN = col_double(),
##   ..   ASCORBIC = col_double(),
##   ..   CAL_GRAM = col_double(),
##   ..   IRN_GRAM = col_double(),
##   ..   PRO_GRAM = col_double(),
##   ..   FAT_GRAM = col_double(),
##   ..   CALORIES = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

library(ggplot2)
library(naniar)
gg_miss_var(nutrition)

sum(is.na(nutrition))

## [1] 44

nutrition$PROTEIN[is.na(nutrition$PROTEIN)] <- mean(nutrition$PROTEIN, na.rm = TRUE)
nutrition$IRON[is.na(nutrition$IRON)] <- mean(nutrition$IRON, na.rm = TRUE)
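The same idea can be sketched on a toy data frame. The values and the helper name `impute_mean` below are illustrative assumptions, not part of the homework data. The sketch first counts missing values per column, which shows where imputation is needed, and then fills each column with the mean of its observed values.

```r
# Toy stand-in for the nutrition data (values are made up for illustration)
toy <- data.frame(
  PROTEIN = c(6, NA, 3),
  IRON    = c(0, 8.1, NA),
  FAT     = c(0, 2, 0)
)

# Count missing values per column to see where imputation is needed
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Hypothetical helper: replace NA entries with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and confirm nothing is missing afterwards
toy_imputed <- as.data.frame(lapply(toy, impute_mean))
print(toy_imputed)
print(sum(is.na(toy_imputed)))
```

Checking `colSums(is.na(...))` before imputing is a cheap way to confirm that every column with missing values is actually handled.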

Question 2 : 

Why should the data be partitioned into training and testing sets? What will the training set be used for? What will the testing set be used for?

Answer 2 : 
 
Partitioning the data into training and testing sets is essential for evaluating model performance honestly: the model is fitted on one portion of the data and then applied to the held-out test data, which stands in for new data.

Training Set: Used to train the model, allowing it to learn patterns within the data.

Testing Set: It is used to assess the model's performance on unseen data, giving an unbiased estimate of how well the model will generalize to new inputs.

# Question 3 : Randomize the data set and then partition it into a training data set (80%) and a testing data set (20%)

# Answer 3 :

# Randomize data
nutrition_randomized_data <- nutrition[sample(nrow(nutrition),replace = FALSE),]

# The rows for the training data set (80%)
train_size <- floor(0.8 * nrow(nutrition_randomized_data))

# Splitting the training (80%) and testing (20%) data sets
train_data <- nutrition_randomized_data[1:train_size, ]
test_data <- nutrition_randomized_data[(train_size + 1):nrow(nutrition_randomized_data), ]
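As a quick sanity check on the split logic, the sketch below applies the same shuffle-then-slice approach to row indices only. It assumes nothing beyond the 961-row count reported by `read_csv` above; the variable names are illustrative.

```r
set.seed(123)

n <- 961                         # row count reported when the CSV was read
shuffled <- sample(n)            # random permutation of 1..n
train_size <- floor(0.8 * n)     # 80% of the rows, rounded down

train_idx <- shuffled[1:train_size]
test_idx  <- shuffled[(train_size + 1):n]

# The two index sets should cover all rows and share none
n_train   <- length(train_idx)
n_test    <- length(test_idx)
n_overlap <- length(intersect(train_idx, test_idx))
print(c(train = n_train, test = n_test, overlap = n_overlap))
```

This gives 768 training rows and 193 testing rows with no overlap, which agrees with the 761 residual degrees of freedom (768 minus 7 coefficients) reported for the model in Question 6.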

# Question 4 :
# Construct a scatter plot of calories versus Iron for the training data set.

# Answer 4 :
ggplot(train_data, aes(x = IRON, y = CALORIES)) + 
  geom_point(color = "black", fill = "orange", size = 3, shape = 21, alpha = 0.7) +  # shape = 21 allows both border and fill
  ggtitle("Calories vs Iron for the Training Dataset") +  
  theme_minimal() +                                     
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),  
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1),  # Updated to line width
    plot.margin = margin(1, 1, 1, 1, "cm"),
    axis.title = element_text(size = 12, face = "bold"),   # Bold axis labels
    panel.background = element_rect(fill = "lightblue")  # Set background to light blue
  ) + 
  xlab("Iron (mg)") +  
  ylab("Calories") +
  geom_smooth(method = "lm", color = "blue", linetype = "dashed")

## `geom_smooth()` using formula = 'y ~ x'

Question 5:

Based on the scatter plot in Q4, is there evidence of a relationship between the variables?

Answer 5 :

The scatter plot suggests a positive relationship between iron levels and calorie content, indicating that as iron concentration increases, calorie content tends to rise.

# Question 6:
# Construct a regression model for estimating calories using the values of Iron, Protein, Fat, Cholest, Carbo, Sodium.
# Write the equation for predicting Calories from the predictors in the model.


# Answer 6 :
regr_model <- lm(CALORIES ~ IRON + PROTEIN + FAT + CHOLEST + CARBO + SODIUM, data = train_data)
summary(regr_model)

## 
## Call:
## lm(formula = CALORIES ~ IRON + PROTEIN + FAT + CHOLEST + CARBO + 
##     SODIUM, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -245.771   -4.006    0.018    3.974  120.694 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.037984   0.870809   0.044   0.9652    
## IRON        -1.938469   0.346681  -5.592 3.14e-08 ***
## PROTEIN      4.296551   0.098542  43.601  < 2e-16 ***
## FAT          8.770248   0.025792 340.040  < 2e-16 ***
## CHOLEST      0.002153   0.007497   0.287   0.7741    
## CARBO        3.856309   0.014707 262.214  < 2e-16 ***
## SODIUM       0.004507   0.001438   3.133   0.0018 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.02 on 761 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 9.805e+04 on 6 and 761 DF,  p-value: < 2.2e-16

Question 6 : 

Equation for predicting Calories from the predictors in the model.

Answer 6 :

Equation -

CALORIES = 0.037984 + (-1.938469 × IRON) + (4.296551 × PROTEIN) + (8.770248 × FAT) + (0.002153 × CHOLEST) + (3.856309 × CARBO) + (0.004507 × SODIUM)

Predict_model_set <- lm(CALORIES ~FAT+ IRON + PROTEIN + CARBO + CHOLEST + SODIUM, data = train_data)
summary(Predict_model_set)

## 
## Call:
## lm(formula = CALORIES ~ FAT + IRON + PROTEIN + CARBO + CHOLEST + 
##     SODIUM, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -245.771   -4.006    0.018    3.974  120.694 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.037984   0.870809   0.044   0.9652    
## FAT          8.770248   0.025792 340.040  < 2e-16 ***
## IRON        -1.938469   0.346681  -5.592 3.14e-08 ***
## PROTEIN      4.296551   0.098542  43.601  < 2e-16 ***
## CARBO        3.856309   0.014707 262.214  < 2e-16 ***
## CHOLEST      0.002153   0.007497   0.287   0.7741    
## SODIUM       0.004507   0.001438   3.133   0.0018 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.02 on 761 degrees of freedom
## Multiple R-squared:  0.9987, Adjusted R-squared:  0.9987 
## F-statistic: 9.805e+04 on 6 and 761 DF,  p-value: < 2.2e-16

r_square <- summary(Predict_model_set)$r.squared
r_square

## [1] 0.9987081

Question 7 : 

What percentage of the variability in the calories does this model account for?

Answer 7 : Referring to regr_model above.

The model explains 99.87% of the variability in CALORIES (Multiple R-squared = 0.9987).
This indicates a very strong fit of the model to the training data.
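To make the meaning of R-squared concrete, the sketch below recomputes it by hand as 1 - RSS/TSS and compares it with the value reported by `summary()`. It uses the built-in `mtcars` data as a stand-in, since the nutrition file is not bundled here; the model and variable names are illustrative only.

```r
# Stand-in regression on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares: variation the model leaves unexplained
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of the response around its mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# Proportion of variation explained by the model
r2_manual <- 1 - rss / tss
r2_summary <- summary(fit)$r.squared

print(c(manual = r2_manual, summary = r2_summary))
```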

Question 8 :

Using the estimated regression model, how many calories are predicted for a certain food with the following nutrition values? What is the prediction error?

Answer 8 :

Calculating Predicted Calories & Calculating Errors

Values_nutrition <- data.frame(IRON = 8.1, PROTEIN = 16, FAT = 2, CHOLEST = 0, SODIUM = 297, CARBO = 7 )

Predicted_Calories <- predict(regr_model, Values_nutrition)
Actual_Calorie <- 80
Prediction_Error <- Actual_Calorie  - Predicted_Calories

Predicted_Calories

##        1 
## 98.95431

Prediction_Error

##         1 
## -18.95431

Prediction Error = Actual Calories − Predicted Calories = 80 − 98.954 = −18.954
Actual_Calories = 80
Predicted_Calories = 98.95
Prediction_Error = -18.95
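The `predict()` output above can be checked by hand. The sketch below plugs the rounded coefficients from the `summary(regr_model)` output and the nutrient values from `Values_nutrition` into the regression equation; the vector names are illustrative, and small differences from `predict()` are due to rounding of the printed coefficients.

```r
# Coefficients as printed in summary(regr_model)
coefs <- c(Intercept = 0.037984, IRON = -1.938469, PROTEIN = 4.296551,
           FAT = 8.770248, CHOLEST = 0.002153, CARBO = 3.856309,
           SODIUM = 0.004507)

# Predictor values for the food in question (1 multiplies the intercept)
values <- c(Intercept = 1, IRON = 8.1, PROTEIN = 16,
            FAT = 2, CHOLEST = 0, CARBO = 7, SODIUM = 297)

# Predicted calories: sum of coefficient times value, matched by name
predicted <- sum(coefs * values[names(coefs)])

# Prediction error: actual minus predicted
actual <- 80
prediction_error <- actual - predicted

print(round(c(predicted = predicted, error = prediction_error), 2))
```

The hand calculation gives about 98.95 predicted calories and an error of about -18.95, in line with the `predict()` result.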

Question 9:

What is the conclusion regarding the significance of the overall regression? How do you know? Does this mean that all of the predictors are important?

Answer 9:

The overall regression is statistically significant, as indicated by a very large F-statistic (9.805e+04, about 98,050, on 6 and 761 degrees of freedom) and a p-value < 2.2e-16, showing that the predictors jointly explain calorie variation.

However, not all predictors are individually significant. IRON, PROTEIN, FAT, CARBO, and SODIUM are important (p-values below 0.01), while CHOLEST is not (p-value of 0.7741), meaning it does not contribute significantly once the other predictors are in the model.
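The two kinds of test can be pulled out of a fitted model programmatically. The sketch below uses the built-in `mtcars` data as a stand-in (the model is illustrative, not the homework model): the overall p-value comes from the F-statistic via `pf()`, and the per-predictor p-values come from the coefficient table.

```r
# Stand-in model with several predictors
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_summary <- summary(fit)

# Overall F-test: are all slopes jointly zero?
fstat <- fit_summary$fstatistic   # named vector: value, numdf, dendf
overall_p <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
                lower.tail = FALSE)

# Individual t-tests: one p-value per coefficient
coef_p <- fit_summary$coefficients[, "Pr(>|t|)"]

print(overall_p)
print(round(coef_p, 4))
```

A small overall p-value only says that at least one predictor helps; the individual p-values are what show which predictors carry the signal.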

Question 10:

Which of the predictors probably does not belong in the model? Explain how you know this.
What might be your next step after viewing these results?

Answer 10: 

CHOLEST probably does not belong in the model.
High p-value: CHOLEST's p-value (0.7741) shows it is statistically insignificant for predicting CALORIES.
Simplifies Model: Removing CHOLEST reduces complexity without losing predictive accuracy.
Better Interpretability: Excluding it clarifies the model's significant predictors.
Next step: Refit the model without CHOLEST and compare the results.

# Refit without CHOLEST. Note: this call uses the full data set (nutrition), not train_data.
regr_model_refined <- lm(CALORIES ~ IRON + PROTEIN + FAT + CARBO + SODIUM, data = nutrition)
summary(regr_model_refined)

## 
## Call:
## lm(formula = CALORIES ~ IRON + PROTEIN + FAT + CARBO + SODIUM, 
##     data = nutrition)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -247.657   -4.264   -0.039    3.837  110.458 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.445881   0.761049  -0.586    0.558    
## IRON        -1.627759   0.301350  -5.402 8.34e-08 ***
## PROTEIN      4.300696   0.083033  51.795  < 2e-16 ***
## FAT          8.776212   0.021974 399.399  < 2e-16 ***
## CARBO        3.860732   0.012660 304.958  < 2e-16 ***
## SODIUM       0.005206   0.001303   3.995 6.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.87 on 955 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 1.587e+05 on 5 and 955 DF,  p-value: < 2.2e-16

Question 11:

Suppose we omit cholesterol from the model and rerun the regression. Explain what will happen to the value
of R^2.

Answer 11:

If we omit CHOLEST from the model, the value of R^2 will decrease slightly or stay nearly the same; it cannot increase. This is because R^2 measures the proportion of variance explained by the model, and removing a predictor can never increase the explained variance. Because CHOLEST has a minimal impact (p-value of 0.7741), the change in R^2 will be negligible, meaning the model's overall fit won't be significantly affected.
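This behaviour of R^2 can be seen on any pair of nested models. The sketch below uses the built-in `mtcars` data as a stand-in, with `qsec` playing the role of the dropped predictor; the models are illustrative assumptions, not the homework fit. Adjusted R^2 is printed as well because, unlike R^2, it may go up when a weak predictor is removed.

```r
# Full model and a reduced model that drops one predictor
full_fit    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced_fit <- lm(mpg ~ wt + hp, data = mtcars)

r2_full    <- summary(full_fit)$r.squared
r2_reduced <- summary(reduced_fit)$r.squared

adj_full    <- summary(full_fit)$adj.r.squared
adj_reduced <- summary(reduced_fit)$adj.r.squared

# R^2 of the reduced model is never larger than that of the full model
print(c(r2_full = r2_full, r2_reduced = r2_reduced))
print(c(adj_full = adj_full, adj_reduced = adj_reduced))
```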

Question 12:

Which predictor is negatively associated with the response? Explain how you know this.

Answer 12:

The predictor IRON has a negative relationship with CALORIES, shown by its coefficient of -1.938469 in regr_model; it is the only predictor with a negative coefficient. This means that for every one-unit increase in IRON, CALORIES decreases by about 1.938, assuming other variables are held constant.

Question 13:

Clearly and completely express the interpretation for the coefficient for sodium.

Answer 13:

The coefficient for SODIUM in regr_model is 0.004507, indicating a slight positive relationship with CALORIES. For each one-unit increase in SODIUM, CALORIES is expected to increase by 0.004507, assuming the other predictors are held constant. This implies that even large increases in SODIUM would lead to only a small rise in calories: a 1,000-unit increase corresponds to only about 4.5 additional calories.

Question 14:

Write the null and alternative hypothesis test for determining whether a linear relationship
exists between sodium and calories.

Answer 14 : 

Null Hypothesis (H0): No linear relationship exists between sodium and calories (β1 = 0).
Alternative Hypothesis (Ha): A linear relationship exists between sodium and calories (β1 ≠ 0).

relationship_model <- lm(CALORIES ~ SODIUM, data = nutrition)
summary(relationship_model)

## 
## Call:
## lm(formula = CALORIES ~ SODIUM, data = nutrition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3814.5  -110.3   -52.3    41.3  3989.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 105.98849   15.87579   6.676 4.15e-11 ***
## SODIUM        0.51321    0.02259  22.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 438 on 959 degrees of freedom
## Multiple R-squared:  0.3498, Adjusted R-squared:  0.3492 
## F-statistic:   516 on 1 and 959 DF,  p-value: < 2.2e-16

library(ggplot2)

ggplot(nutrition, aes(x = SODIUM, y = CALORIES)) +
  geom_point(color = "black", fill = "blue", size = 3, shape = 21, alpha = 0.7) +  # Black border with blue fill for points
  geom_smooth(method = "lm", color = "red") +  # Red linear regression line
  labs(title = "Scatter Plot of Sodium vs Calories",
       x = "Sodium (mg)",
       y = "Calories") +
  theme_minimal() +                                     
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),  # Title customization
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1),  # Black border around plot panel
    plot.margin = margin(1, 1, 1, 1, "cm"),  # Margin around the plot
    axis.title = element_text(size = 12, face = "bold"),  # Bold axis titles
    panel.background = element_rect(fill = "lightblue")  # Light blue background
  )

## `geom_smooth()` using formula = 'y ~ x'

Question 15: 

Construct and interpret a 95% confidence interval for all coefficients of the regression line.

Answer 15 :

# 95% confidence intervals for the coefficients
confint(relationship_model, level = 0.95)

##                  2.5 %      97.5 %
## (Intercept) 74.8331970 137.1437739
## SODIUM       0.4688706   0.5575432
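The intervals reported by `confint()` can be reproduced from the coefficient table as estimate ± t critical value × standard error. The sketch below uses the rounded estimates and standard errors printed in `summary(relationship_model)` with 959 residual degrees of freedom; the variable names are illustrative, and tiny differences from `confint()` come from rounding.

```r
# Estimates and standard errors as printed in summary(relationship_model)
estimates  <- c("(Intercept)" = 105.98849, SODIUM = 0.51321)
std_errors <- c("(Intercept)" = 15.87579,  SODIUM = 0.02259)
df_resid   <- 959

# Two-sided 95% interval uses the 97.5th percentile of the t distribution
t_crit <- qt(0.975, df = df_resid)

ci <- cbind(lower = estimates - t_crit * std_errors,
            upper = estimates + t_crit * std_errors)
print(round(ci, 4))
```

Read this way, the SODIUM interval of roughly (0.469, 0.558) means that, in this sodium-only model, each additional unit of sodium is associated with between about 0.47 and 0.56 additional calories at 95% confidence. The interval excludes zero, which is consistent with rejecting the null hypothesis in Answer 14. The intercept interval of roughly (74.8, 137.1) is the plausible range for mean calories when sodium is zero.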

Question 16 : 

Evaluate the performance of the constructed regression model using the testing dataset.
Report the root mean squared error (RMSE) as the performance criteria.

Answer 16 :

Regression <- predict(relationship_model, newdata = test_data)
Residuals <- test_data$CALORIES - Regression
Rmse <- sqrt(mean(Residuals^2))
Rmse

## [1] 428.2516

The root mean squared error (RMSE) is 428.2516 calories, indicating that the predictions of the sodium-only model (relationship_model) are typically off by about 428 calories in the testing dataset.
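The RMSE calculation can be wrapped in a small helper so the same code is reused for every model. The function name `rmse_of` and the toy numbers below are illustrative assumptions, not values from the nutrition data.

```r
# Hypothetical helper: root mean squared error of a set of predictions
rmse_of <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy example: errors of -10, 10 and -30 calories
actual    <- c(100, 150, 200)
predicted <- c(110, 140, 230)

toy_rmse <- rmse_of(actual, predicted)
print(toy_rmse)
```

Because the errors are squared before averaging, the single large miss of 30 dominates the result, which is why RMSE is sensitive to outliers.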


Question 17 : Apply three methods of variable selection (forward, backward, and stepwise) on the training dataset containing all variables. Considering the MSE and adjusted R^2 performance criteria, which regression model do you recommend and why?

Answer 17 :

# Fit a null model (intercept only)
null_model <- lm(CALORIES ~ 1, data = train_data)
# Fit a full model; `.` uses every other column, including the FOOD label
full_model <- lm(CALORIES ~ ., data = train_data)

forward_model <- step(lm(CALORIES ~ 1, data = train_data), 
                      scope = list(lower = ~1, upper = full_model), 
                      direction = "forward")

## Start:  AIC=9627.56
## CALORIES ~ 1
## 
##             Df Sum of Sq       RSS  AIC
## + FOOD     767 213057154         0 -Inf
## + CARBO      1 161554696  51502458 8539
## + FAT        1 153397514  59659640 8652
## + WT_GRAMS   1 132639234  80417920 8881
## + SAT_FAT    1 129950075  83107078 8907
## + MONUNSAT   1 128273939  84783215 8922
## + PHOSPHOR   1 114993835  98063319 9034
## + RIBOFLAV   1  88385646 124671508 9218
## + IRON       1  84606056 128451098 9241
## + THIAMIN    1  78826635 134230519 9275
## + PROTEIN    1  76591859 136465295 9287
## + CHOLEST    1  75352132 137705022 9294
## + POLUNSAT   1  67699068 145358086 9336
## + CALCIUM    1  66725491 146331662 9341
## + SODIUM     1  65106254 147950900 9349
## + POTASS     1  63150156 149906998 9360
## + NIACIN     1  50547694 162509460 9422
## + CAL_GRAM   1  16359391 196697763 9568
## + PC_WATER   1  12163084 200894070 9584
## + FAT_GRAM   1  11011370 202045784 9589
## + ASCORBIC   1   4109249 208947904 9615
## + VIT_A_RE   1   3108651 209948503 9618
## <none>                   213057154 9628
## + IRN_GRAM   1    489267 212567887 9628
## + VIT_A_IU   1    446574 212610580 9628
## + PRO_GRAM   1     92315 212964839 9629
## 
## Step:  AIC=-Inf
## CALORIES ~ FOOD

## Warning: attempting model selection on an essentially perfect fit is nonsense

##        Df Sum of Sq RSS  AIC
## <none>                0 -Inf

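# Backward selection
# Note: starting from null_model leaves nothing to remove, so the trace below stops at once;
# backward elimination normally starts from the largest model.
backward_model <- step(null_model, direction = "backward")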

## Start:  AIC=9627.56
## CALORIES ~ 1

# Stepwise selection
stepwise_model <- step(lm(CALORIES ~ 1, data = train_data), 
                       scope = list(lower = ~1, upper = full_model), 
                       direction = "both")

## Start:  AIC=9627.56
## CALORIES ~ 1
## 
##             Df Sum of Sq       RSS  AIC
## + FOOD     767 213057154         0 -Inf
## + CARBO      1 161554696  51502458 8539
## + FAT        1 153397514  59659640 8652
## + WT_GRAMS   1 132639234  80417920 8881
## + SAT_FAT    1 129950075  83107078 8907
## + MONUNSAT   1 128273939  84783215 8922
## + PHOSPHOR   1 114993835  98063319 9034
## + RIBOFLAV   1  88385646 124671508 9218
## + IRON       1  84606056 128451098 9241
## + THIAMIN    1  78826635 134230519 9275
## + PROTEIN    1  76591859 136465295 9287
## + CHOLEST    1  75352132 137705022 9294
## + POLUNSAT   1  67699068 145358086 9336
## + CALCIUM    1  66725491 146331662 9341
## + SODIUM     1  65106254 147950900 9349
## + POTASS     1  63150156 149906998 9360
## + NIACIN     1  50547694 162509460 9422
## + CAL_GRAM   1  16359391 196697763 9568
## + PC_WATER   1  12163084 200894070 9584
## + FAT_GRAM   1  11011370 202045784 9589
## + ASCORBIC   1   4109249 208947904 9615
## + VIT_A_RE   1   3108651 209948503 9618
## <none>                   213057154 9628
## + IRN_GRAM   1    489267 212567887 9628
## + VIT_A_IU   1    446574 212610580 9628
## + PRO_GRAM   1     92315 212964839 9629
## 
## Step:  AIC=-Inf
## CALORIES ~ FOOD

## Warning: attempting model selection on an essentially perfect fit is nonsense
## Warning: attempting model selection on an essentially perfect fit is nonsense

##         Df Sum of Sq       RSS  AIC
## <none>                       0 -Inf
## - FOOD 767 213057154 213057154 9628
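The traces above select FOOD because a label that is unique to each row fits the training data perfectly (RSS of 0, AIC of -Inf), which is what the warning refers to. One way to avoid this, sketched below as an assumption rather than as the method used in this homework, is to drop the label column before building the full model. The sketch uses the built-in `mtcars` data with an added `label` column standing in for FOOD; all names are illustrative.

```r
# Stand-in data with a unique text label per row, mimicking FOOD
dat <- mtcars
dat$label <- rownames(mtcars)

# Drop the label so that `.` expands to numeric predictors only
numeric_only <- dat[, names(dat) != "label"]

null_fit <- lm(mpg ~ 1, data = numeric_only)
full_fit <- lm(mpg ~ ., data = numeric_only)
upper_formula <- formula(full_fit)

# Forward: start empty and add terms; backward: start full and drop terms
forward_fit  <- step(null_fit, scope = list(lower = ~1, upper = upper_formula),
                     direction = "forward", trace = 0)
backward_fit <- step(full_fit, direction = "backward", trace = 0)
stepwise_fit <- step(null_fit, scope = list(lower = ~1, upper = upper_formula),
                     direction = "both", trace = 0)

# Compare the three selected models on training MSE and adjusted R^2
fits <- list(forward = forward_fit, backward = backward_fit, stepwise = stepwise_fit)
comparison <- data.frame(
  model  = names(fits),
  mse    = sapply(fits, function(f) mean(residuals(f)^2)),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared)
)
print(comparison)
```

With the label removed, all three procedures return ordinary models with finite MSE and adjusted R^2, so the criteria named in the question can actually be compared.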

Question 18 : Apply your recommended regression model to the testing dataset and discuss the
performance of the regression model with respect to the RMSE.

Answer 18 :

test_data$FOOD <- factor(test_data$FOOD, levels = levels(train_data$FOOD))
test_data <- test_data[!is.na(test_data$FOOD), ]
test_data$CALORIES <- ifelse(is.na(test_data$FOOD), mean(train_data$CALORIES, na.rm = TRUE), test_data$CALORIES)
test_predictions <- predict(forward_model, test_data)
rmse <- sqrt(mean((test_data$CALORIES - test_predictions)^2))
print(rmse)

## [1] NaN
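One plausible reading of the NaN above, offered as an assumption since the intermediate objects are not shown: FOOD is a character column (see the `str(nutrition)` output), so `levels(train_data$FOOD)` is NULL, every test value becomes NA when converted to a factor, all test rows are then filtered out, and the mean of an empty vector is NaN. The toy vectors below are hypothetical and only reproduce that chain of events.

```r
# Character vectors standing in for train_data$FOOD and test_data$FOOD
food_train <- c("APPLE", "BREAD")
food_test  <- c("APPLE", "CHEESE")

# A character vector has no factor levels
train_levels <- levels(food_train)
print(train_levels)                      # NULL

# With no levels supplied, every converted value is NA
food_test_factor <- factor(food_test, levels = train_levels)
print(food_test_factor)

# Filtering on non-missing labels therefore keeps zero rows
kept <- food_test[!is.na(food_test_factor)]
print(length(kept))

# The RMSE of zero rows is the square root of the mean of an empty vector
empty_rmse <- sqrt(mean(numeric(0)^2))
print(empty_rmse)
```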

Calculating MSE

null_mse <- mean((train_data$CALORIES - predict(null_model, train_data))^2)
full_mse <- mean((train_data$CALORIES - predict(full_model, train_data))^2)
forward_mse <- mean((train_data$CALORIES - predict(forward_model, train_data))^2)

null_mse

## [1] 277418.2

full_mse

## [1] 2.022461e-22

forward_mse

## [1] 2.022461e-22

Calculating Adjusted R^2

summary(null_model)$adj.r.squared

## [1] 0

summary(full_model)$adj.r.squared

## [1] NaN

summary(forward_model)$adj.r.squared

## [1] NaN

Question 19 : Compare the performance of the model you constructed in question 6 with the one you
recommend from question 17.

Answer 19 : 

The adjusted R-squared values for solutions 6 and 17 are nearly identical, indicating that both models explain almost all of the variance in the response variable (CALORIES).

In solution 6:
Predictor variables: IRON, PROTEIN, FAT, CHOLEST, CARBO, SODIUM
Adjusted R-Squared: 0.9987

In solution 17:
Predictor variables: CARBO, FAT, PROTEIN, IRON, SODIUM
Adjusted R-Squared: 0.9988

Excluding CHOLEST in solution 17 had essentially no impact on the model's performance.

2024-11-07