library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
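Because MASS is attached after dplyr, an unqualified `select()` now resolves to `MASS::select()`. A minimal sketch of one way to avoid surprises is to qualify the call with its namespace; the `mtcars` data and the name `cars_subset` are used only for illustration and are not part of the original script.

```r
# Illustration only: mtcars and cars_subset are stand-ins, not from the original script.
suppressPackageStartupMessages({
  library(dplyr)
  library(MASS)  # attached second, so MASS::select masks dplyr::select
})

# Qualifying the call with dplyr:: picks the dplyr version regardless of load order
cars_subset <- dplyr::select(mtcars, mpg, wt)
print(names(cars_subset))
```

The same pattern applies to any masked function, for example `dplyr::filter()` versus `stats::filter()`.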
# Step 1: Import a dataset and create a summary statistics table ---
# I am using the built-in diamonds dataset
df <- diamonds
cat("Summary Statistics of the Raw Data:\n")
Summary Statistics of the Raw Data:
print(summary(df))
     carat              cut          color       clarity          depth
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
                                    J: 2808   (Other): 2531
     table           price             x                y
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
       z
 Min.   : 0.000
 1st Qu.: 2.910
 Median : 3.530
 Mean   : 3.539
 3rd Qu.: 4.040
 Max.   :31.800
# Step 2: Create a new dataframe called 'clean_df'
# Goal: eliminate invalid records and fix missing values.
# A diamond cannot have a dimension of 0, so rows where x, y, or z is 0 are dropped.
clean_df <- df %>%
  filter(x > 0, y > 0, z > 0)
# Use na.omit to remove any rows with missing values (NA) that may exist
clean_df <- na.omit(clean_df)
cat("\nDimensions of the cleaned dataframe (clean_df):\n")
Dimensions of the cleaned dataframe (clean_df):
print(dim(clean_df))
[1] 53920 10
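The raw summary shows minimum values of 0 for x, y, and z, and the cleaned data has 53,920 rows. A minimal sketch of how to check the number of rows the filter drops, run on its own copy of the data (the names `zero_dim`, `rows_dropped`, and `rows_kept` are illustrative and not part of the original script):

```r
# Illustration only: zero_dim, rows_dropped and rows_kept are hypothetical helper names.
library(ggplot2)  # provides the diamonds dataset

# Flag rows where at least one physical dimension is recorded as 0
zero_dim <- with(diamonds, x == 0 | y == 0 | z == 0)

rows_dropped <- sum(zero_dim)
rows_kept <- nrow(diamonds) - rows_dropped

cat("Rows dropped:", rows_dropped, "\n")
cat("Rows kept:   ", rows_kept, "\n")
```

If `rows_kept` matches the first element of `dim(clean_df)`, then `na.omit()` removed nothing further and all of the dropped rows came from the zero-dimension filter.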
cat("\nSummary of the Cleaned Data:\n")
Summary of the Cleaned Data:
print(summary(clean_df))
     carat              cut          color       clarity          depth
 Min.   :0.2000   Fair     : 1609   D: 6774   SI1    :13063   Min.   :43.00
 1st Qu.:0.4000   Good     : 4902   E: 9797   VS2    :12254   1st Qu.:61.00
 Median :0.7000   Very Good:12081   F: 9538   SI2    : 9185   Median :61.80
 Mean   :0.7977   Premium  :13780   G:11284   VS1    : 8170   Mean   :61.75
 3rd Qu.:1.0400   Ideal    :21548   H: 8298   VVS2   : 5066   3rd Qu.:62.50
 Max.   :5.0100                     I: 5421   VVS1   : 3654   Max.   :79.00
                                    J: 2808   (Other): 2528
     table           price             x                y
 Min.   :43.00   Min.   :  326   Min.   : 3.730   Min.   : 3.680
 1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710   1st Qu.: 4.720
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
 Mean   :57.46   Mean   : 3931   Mean   : 5.732   Mean   : 5.735
 3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540   3rd Qu.: 6.540
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
       z
 Min.   : 1.07
 1st Qu.: 2.91
 Median : 3.53
 Mean   : 3.54
 3rd Qu.: 4.04
 Max.   :31.80
# Step 3: Run a 'kitchen sink' model on clean_df
# The dependent variable is 'price'.
# All other columns in clean_df will be independent variables.
independent_vars <- setdiff(names(clean_df), "price")
kitchen_sink_formula <- as.formula(
  paste("price ~", paste(independent_vars, collapse = " + "))
)
cat("\nKitchen Sink Model Formula:\n")
Kitchen Sink Model Formula:
print(kitchen_sink_formula)
price ~ carat + cut + color + clarity + depth + table + x + y +
    z
kitchen_sink_model <- lm(kitchen_sink_formula, data = clean_df)
cat("\nSummary of the Kitchen Sink Model:\n")
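The step that creates `backward_selection_model` does not appear in this output. A minimal sketch of how such a model could be produced, assuming (this is an assumption suggested by the `library(MASS)` call, not something the output confirms) that `MASS::stepAIC()` was run in the backward direction; the name `removed_terms` is illustrative:

```r
# Sketch only: assumes the unshown selection step used MASS::stepAIC().
library(ggplot2)  # provides the diamonds dataset
library(MASS)     # provides stepAIC()

# Rebuild the cleaned data and the full model so this block runs on its own
clean_df <- na.omit(subset(diamonds, x > 0 & y > 0 & z > 0))
kitchen_sink_model <- lm(price ~ ., data = clean_df)

# Backward elimination: start from the full model and drop terms while AIC improves
backward_selection_model <- stepAIC(kitchen_sink_model,
                                    direction = "backward",
                                    trace = FALSE)

# Terms present in the full model but absent from the selected one
removed_terms <- setdiff(attr(terms(kitchen_sink_model), "term.labels"),
                         attr(terms(backward_selection_model), "term.labels"))
print(removed_terms)
```

Setting `trace = TRUE` instead would print the AIC comparison at each elimination step, which helps explain why a given term was dropped.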
# To see which variables were removed, you can compare the final formula to the kitchen sink formula
cat("\nFinal model formula after backward selection:\n")
Final model formula after backward selection:
print(formula(backward_selection_model))
price ~ carat + cut + color + clarity + depth + table + x + z