The goal of this project is to build a model to predict camera prices. Tasks include:
EDA
Identifying outlier, high-leverage, and influential points
Verifying regression assumptions
Data transformation (log, standardization)
Creating test/train sets
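The data structure shown below comes from str(); here is a minimal setup sketch, assuming the data lives in a local CSV (the file name is hypothetical, and ggplot2 and psych are loaded for the plots and describe() summaries used later):

library(ggplot2)
library(psych)  # describe() is used later for summary statistics
# hypothetical file name; the original load code is not shown
cameras <- read.csv("camera_dataset.csv", stringsAsFactors = FALSE)
str(cameras)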
## 'data.frame': 1038 obs. of 13 variables:
## $ Model : chr "Agfa ePhoto 1280" "Agfa ePhoto 1680" "Agfa ePhoto CL18" "Agfa ePhoto CL30" ...
## $ Release.date : int 1997 1998 2000 1999 1999 2001 1999 1997 1996 2001 ...
## $ Max.resolution : num 1024 1280 640 1152 1152 ...
## $ Low.resolution : num 640 640 0 640 640 ...
## $ Effective.pixels : num 0 1 0 0 0 1 1 0 0 1 ...
## $ Zoom.wide..W. : num 38 38 45 35 43 51 34 42 50 35 ...
## $ Zoom.tele..T. : num 114 114 45 35 43 51 102 42 50 105 ...
## $ Normal.focus.range : num 70 50 0 0 50 50 0 70 40 76 ...
## $ Macro.focus.range : num 40 0 0 0 0 20 0 3 10 16 ...
## $ Storage.included : num 4 4 2 4 40 8 8 2 1 8 ...
## $ Weight..inc..batteries.: num 420 420 0 0 300 270 0 320 460 375 ...
## $ Dimensions : num 95 158 0 0 128 119 0 93 160 110 ...
## $ Price : num 179 179 179 269 1299 ...
This data set contains 1038 observations and 13 variables.
# Visualize the distribution of 'Release.date'
ggplot(cameras, aes(x = Release.date)) +
geom_histogram(binwidth = 1, fill = "#66c2a5", color = "#7570b3") +
labs(title = "Distribution of Camera Release Dates",
x = "Release Date",
y = "Frequency")# Explore the distribution of 'Max.resolution' and 'Effective.pixels'
ggplot(cameras, aes(x = Max.resolution, y = Effective.pixels)) +
geom_point(alpha = 0.6, color = "#fc8d62") +
labs(title = "Relationship between Max Resolution and Effective Pixels",
x = "Max Resolution",
y = "Effective Pixels")# Analyze the 'Price' variable
ggplot(cameras, aes(x = Price)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Price), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Prices",
x = "Price",
y = "Frequency") +
theme_minimal()
From the histogram of Price, we see potential outliers in the long right tail. The red dashed line marks the mean price.
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1038 457.38 760.45 199 288.25 103.78 14 7999 7985 5.17 36.82
## se
## X1 23.6
## Model Release.date Max.resolution
## 0 0 0
## Low.resolution Effective.pixels Zoom.wide..W.
## 0 0 0
## Zoom.tele..T. Normal.focus.range Macro.focus.range
## 0 0 1
## Storage.included Weight..inc..batteries. Dimensions
## 2 2 2
## Price
## 0
## Model Release.date Max.resolution Low.resolution
## 346 HP Photosmart R927 2006 3296 2592
## 347 HP Photosmart R937 2007 3298 0
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 346 8 35 105 50
## 347 8 39 118 50
## Macro.focus.range Storage.included Weight..inc..batteries. Dimensions Price
## 346 10 NA NA NA 179
## 347 NA NA NA NA 179
## Model Release.date Max.resolution
## 0.00000000 0.00000000 0.00000000
## Low.resolution Effective.pixels Zoom.wide..W.
## 0.00000000 0.00000000 0.00000000
## Zoom.tele..T. Normal.focus.range Macro.focus.range
## 0.00000000 0.00000000 0.09633911
## Storage.included Weight..inc..batteries. Dimensions
## 0.19267823 0.19267823 0.19267823
## Price
## 0.00000000
Data is missing from 4 columns across 2 rows. The missing data amounts to less than 1% of the data set, so deleting these rows should not present any issues for the analysis.
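A minimal sketch of the row removal, assuming base R (the original cleaning code is not shown):

# drop the 2 rows that contain missing values
cameras <- na.omit(cameras)
nrow(cameras)  # 1036 observations remain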
Price is negatively correlated with Zoom.wide..W. and positively correlated with Weight..inc..batteries.
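The correlation claims can be checked against a correlation matrix over the numeric columns; a sketch (the original correlation code is not shown):

# correlations of each numeric variable with Price
num_vars <- sapply(cameras, is.numeric)
round(cor(cameras[, num_vars], use = "complete.obs")["Price", ], 2)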
Distribution of Weight..inc..batteries.
# Visualize the distribution of 'Weight.inc.batteries'
ggplot(cameras, aes(x = Weight..inc..batteries.)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Weight..inc..batteries.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Weight..inc..batteries.",
x = "Weight (incl. batteries)",
y = "Frequency") +
theme_minimal()

## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1036 319.27 260.41 226 266.73 92.66 0 1860 1860 2.82 9.75
## se
## X1 8.09
Weight..inc..batteries. is right-skewed with a long tail. There appear to be outliers in the 1000-2000 range.
# Visualize the distribution of 'Zoom.wide..W.'
ggplot(cameras, aes(x = Zoom.wide..W.)) +
geom_histogram(bins = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Zoom.wide..W.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Zoom.wide..W.",
x = "Zoom wide (W)",
y = "Frequency") +
theme_minimal()

## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1036 32.96 10.34 36 35.55 2.97 0 52 52 -2.57 5.56 0.32
Zoom.wide..W. is somewhat normally distributed, but appears to have outliers at the 0 mark.
# Fit regression model with Zoom.wide..W. and Weight..inc..batteries
reg_model <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = cameras )
# Plot the data points with regression line for one predictor variable
plot(cameras$Zoom.wide..W., cameras$Price, main = "Price by Zoom Wide (W)",
xlab = "Zoom wide (W)",
ylab = "Price")
abline(reg_model, col = "red")

## Warning in abline(reg_model, col = "red"): only using the first two of 3
## regression coefficients
## # A tibble: 1,036 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 179 38 420 443. -264. 0.00220
## 2 2 179 38 420 443. -264. 0.00220
## 3 3 179 45 0 -37.3 216. 0.00260
## 4 4 269 35 0 156. 113. 0.00324
## 5 5 1299 43 300 248. 1051. 0.00257
## 6 6 179 51 270 68.8 110. 0.00591
## 7 7 179 34 0 175. 3.75 0.00350
## 8 8 149 42 320 284. -135. 0.00242
## 9 9 139 50 460 244. -105. 0.00900
## 10 10 139 35 375 464. -325. 0.00124
## # ℹ 1,026 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
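The diagnostics table above collects the fitted values, residuals, leverage, and Cook's distance per observation; a sketch of how it can be built, assuming broom's augment() was used (the original code is not shown):

library(broom)
# adds .fitted, .resid, .hat (leverage), .cooksd, and .std.resid per observation
out_lev_inf <- augment(reg_model)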
# Identify Outliers
# observations with std res absolute value > 2
which(abs(out_lev_inf$.std.resid) > 2)

## [1] 49 50 51 52 53 54 55 202 389 390 391 392 393 394 395 396 397 398 399
## [20] 400 401 402 456 457 458 459 460 461 462 552 555 556 557 558 566 567 568 569
## [39] 570 575 585 606 607 608 610 611 623 824 833 901 902
With a standardized residual whose absolute value exceeds 2, 51 cameras are flagged as outliers.
# Identify High-leverage points
# determine threshold for high leverage points
# predictors
m <- 1
# sample size
n <- length(cameras$Price)
# create thresholds
upper_threshold <- (3 * (m+1)) / n
lower_threshold <- (2 * (m+1)) / n
print(upper_threshold)

## [1] 0.005791506

print(lower_threshold)

## [1] 0.003861004
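The index vectors below were presumably produced with which() over the hat values; a sketch:

which(out_lev_inf$.hat >= upper_threshold)
which(out_lev_inf$.hat >= lower_threshold)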
## [1] 6 9 48 49 50 51 52 53 54 55 56 57 58 59 60
## [16] 61 62 63 64 65 75 186 202 205 206 287 298 299 300 301
## [31] 389 390 391 392 393 394 395 396 397 398 399 400 401 402 469
## [46] 474 475 549 550 551 552 553 554 555 556 557 558 559 560 561
## [61] 562 563 564 565 615 626 627 629 630 631 632 633 634 635 636
## [76] 711 714 717 718 746 747 748 749 750 751 752 753 754 852 853
## [91] 854 874 900 901 902 903 904 905 906 911 912 917 918 919 957
## [106] 1003 1004 1013 1016 1017 1018 1024
## [1] 6 9 48 49 50 51 52 53 54 55 56 57 58 59 60
## [16] 61 62 63 64 65 74 75 168 186 202 205 206 266 280 281
## [31] 287 298 299 300 301 319 380 381 383 389 390 391 392 393 394
## [46] 395 396 397 398 399 400 401 402 437 469 474 475 476 507 549
## [61] 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564
## [76] 565 576 601 615 626 627 628 629 630 631 632 633 634 635 636
## [91] 711 714 717 718 745 746 747 748 749 750 751 752 753 754 852
## [106] 853 854 874 900 901 902 903 904 905 906 907 908 909 911 912
## [121] 917 918 919 922 957 1003 1004 1007 1008 1009 1010 1011 1013 1016 1017
## [136] 1018 1024
# Filter rows within the upper threshold
high_lev_rows_upper <- out_lev_inf$.hat >= upper_threshold
# Display the rows
out_lev_inf[high_lev_rows_upper, ]

## # A tibble: 112 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 179 51 270 68.8 110. 0.00591
## 2 9 139 50 460 244. -105. 0.00900
## 3 48 1299 0 875 1552. -253. 0.0108
## 4 49 4499 0 1585 2135. 2364. 0.0239
## 5 50 4499 0 1565 2119. 2380. 0.0231
## 6 51 4499 0 1335 1930. 2569. 0.0161
## 7 52 4499 0 1565 2119. 2380. 0.0231
## 8 53 7999 0 1585 2135. 5864. 0.0239
## 9 54 7999 0 1565 2119. 5880. 0.0231
## 10 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 102 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
# Filter rows within the lower threshold
high_lev_rows_lower <- out_lev_inf$.hat >= lower_threshold
# Display the rows
out_lev_inf[high_lev_rows_lower, ]

## # A tibble: 136 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 179 51 270 68.8 110. 0.00591
## 2 9 139 50 460 244. -105. 0.00900
## 3 48 1299 0 875 1552. -253. 0.0108
## 4 49 4499 0 1585 2135. 2364. 0.0239
## 5 50 4499 0 1565 2119. 2380. 0.0231
## 6 51 4499 0 1335 1930. 2569. 0.0161
## 7 52 4499 0 1565 2119. 2380. 0.0231
## 8 53 7999 0 1585 2135. 5864. 0.0239
## 9 54 7999 0 1565 2119. 5880. 0.0231
## 10 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 126 more rows
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Following the upper threshold guideline, 112 rows are considered high-leverage. Following the lower threshold guideline, 136 rows are considered high-leverage.
# Identify influential Observations using Cooks Distance
# df2 = n-m-1 = 1036-13-1 = 1022
# to find median of F distribution with 1 and 1022 DF
cooksd50<- qf(0.5,1,1022)
#cooksd50
# for the 25th percentile; points between the 25th and 50th percentile cutoffs are "tending toward" influential
cooksd25<- qf(0.25,1,26)
#cooksd25
# Identify influential observations
inf_obs <- which(cooks.distance(reg_model) > cooksd50)
inf_obs

## 53 54 55
## 53 54 55
# Identify tending influential observations
inf_obs_25 <- which(cooks.distance(reg_model) > cooksd25)
inf_obs_25

## 49 50 52 53 54 55 392 402 560
## 49 50 52 53 54 55 390 400 558
# Filter rows within the upper threshold
inf_points <- which(cooks.distance(reg_model) > cooksd50)
# Display the rows
out_lev_inf[inf_points, ]

## # A tibble: 3 × 10
## .rownames Price Zoom.wide..W. Weight..inc..batteries. .fitted .resid .hat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 53 7999 0 1585 2135. 5864. 0.0239
## 2 54 7999 0 1565 2119. 5880. 0.0231
## 3 55 7999 0 1385 1971. 6028. 0.0173
## # ℹ 3 more variables: .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
Cook's distance was calculated to identify influential observations. The Cook's distance cutoff at the 25th percentile of the F distribution is 0.1037, and at the 50th percentile (the median) it is 0.455. Any Cook's distance above 0.455 is considered influential.
Observations 53, 54, and 55 are influential. These observations are extreme in both the x and y space, with large values for Weight..inc..batteries. (1585, 1565, and 1385) and Price (7999).
Observation 53 has a leverage value of 0.0239, a residual of 5863.68, a standardized residual well beyond 2 (9.00), and a Cook's distance of 0.66, which exceeds the F-median cutoff of 0.455.
Observation 54 has a leverage value of 0.0231, a residual of 5880.12, a standardized residual of 9.03, and a Cook's distance of 0.642, which exceeds the 0.455 cutoff.
Observation 55 has a leverage value of 0.0173, a residual of 6028.11, a standardized residual of 9.23, and a Cook's distance of 0.501, which exceeds the 0.455 cutoff.
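A quick visual confirmation (not part of the original output) could use base R's Cook's distance plot for lm objects, with the F-median cutoff overlaid:

# Cook's distance per observation; points above the red line are influential
plot(reg_model, which = 4, col = '#e78ac3')
abline(h = cooksd50, col = "red", lty = 2)  # the 0.455 cutoff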
The independence, constant variance, zero-mean, and normality assumptions are assessed below through the Residuals vs Fitted plot, the Normal Q-Q plot, and formal hypothesis tests.
# data shape is right-tailed/skews to the right, slight bell-shaped curve
# Q-Q plot
plot(reg_model, col = '#e78ac3', which = c(2))

The Q-Q plot of standardized residuals should follow the line if the residuals are drawn from a normally distributed process or population. In this Q-Q plot, the bulk of the points lie on the straight line; however, the points begin to diverge from the line as the theoretical quantiles increase. From this plot, one might conclude the residuals come from a right-skewed distribution such as a chi-square.
# Hypothesis test
ks.test(reg_model$residuals, pnorm, mean(reg_model$residuals), sd(reg_model$residuals))

## Warning in ks.test.default(reg_model$residuals, pnorm,
## mean(reg_model$residuals), : ties should not be present for the
## Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: reg_model$residuals
## D = 0.26335, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Under the null hypothesis, the observed residuals are assumed to be drawn from a normally distributed population. The p-value is < 2.2e-16, so we reject the null hypothesis: the residuals do not appear to come from a normal process or distribution.
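As a cross-check (not in the original analysis), a Shapiro-Wilk test could be run on the same residuals; it is applicable here since n is below the test's 5000-observation limit:

shapiro.test(reg_model$residuals)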
# Normality assumption does not hold.

# Assessing the zero-mean assumption
# red line is flat but negative, and then peaks up at the end
# Residuals are not evenly distributed about the 0 dotted line and data show a funnel shape
# residuals vs fitted
plot(reg_model, col = '#e78ac3', which = c(1))

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2232.42 -209.01 -106.62 0.00 34.45 6028.11
# The residuals vs fitted plot shows a slight funnel shape, which suggests the variance is not constant. Ideally the points would form a rectangular band; the funnel shape suggests that variance increases as the fitted values increase.
# The scale-location plot shows the same funnel shape, and the red line is not flat: it has a slight bump and then rises at the end, with an overall upward trend.
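The scale-location plot described above is the third diagnostic plot for lm objects; a sketch in the same style as the other diagnostic calls:

plot(reg_model, col = '#e78ac3', which = c(3))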
# Hypothesis Test: Breusch-Pagan Test
ols_test_breusch_pagan(reg_model)

##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 2511.4364
## Prob > Chi2 = 0.0000
# Breusch-Pagan test: Ho: the variance is constant; Ha: the variance is not constant.
# The small p-value (0.0000, below the 0.05 cutoff) leads us to reject Ho, accept Ha, and conclude the variance is not constant.
# Constant variance assumption violated.

# A visual inspection of the residuals vs fitted plot agrees: the red trend line is not flat, but dips negative and then peaks up at the end.
# Hypothesis test: Durbin Watson
# Ho = independent residuals
# Ha = dependent residuals (auto-correlated)
set.seed(1234)
durbinWatsonTest(reg_model, reps = 5000)

## lag Autocorrelation D-W Statistic p-value
## 1 0.6324554 0.7349031 0
## Alternative hypothesis: rho != 0
# reps increased to 5000 for a more stable bootstrap p-value
# The p-value is 0, below the 0.05 cutoff, so we reject the null hypothesis: the residuals are not independent.
# Independence assumption is violated.

Because this data does not satisfy all of the regression assumptions, we could normalize or log-transform the data, but let's first continue with linear regression after removing the influential points.
# remove influential rows from the data
cameras_sub <- cameras[-c(53:55),]
reg_model_sub <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = cameras_sub )
summary(reg_model_sub)

##
## Call:
## lm(formula = Price ~ Zoom.wide..W. + Weight..inc..batteries.,
## data = cameras_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1717.2 -200.9 -126.0 12.9 4598.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 918.36894 104.40543 8.796 < 2e-16 ***
## Zoom.wide..W. -19.36014 2.40191 -8.060 2.10e-15 ***
## Weight..inc..batteries. 0.49883 0.09698 5.143 3.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 571.5 on 1030 degrees of freedom
## Multiple R-squared: 0.2147, Adjusted R-squared: 0.2132
## F-statistic: 140.8 on 2 and 1030 DF, p-value: < 2.2e-16
The linear regression model is represented by the equation:
Price = 918.37 − 19.36 × Zoom.wide..W. + 0.499 × Weight..inc..batteries.
The residuals (differences between actual and predicted values) range from a minimum of -1717.2 to a maximum of 4598.6, with a 1st quartile (25th percentile) of -200.9, a median of -126.0, and a 3rd quartile (75th percentile) of 12.9.
The coefficients indicate that, holding other variables constant, each one-unit increase in Zoom.wide..W. decreases Price by 19.36 units. Similarly, each one-unit increase in Weight..inc..batteries. is expected to increase Price by about 0.50 units.
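As a worked example (the input values are made up for illustration), the fitted equation can be evaluated with predict():

# a hypothetical camera: 35 mm wide zoom, 300 g including batteries
new_cam <- data.frame(Zoom.wide..W. = 35, Weight..inc..batteries. = 300)
predict(reg_model_sub, newdata = new_cam)  # 918.37 - 19.36*35 + 0.499*300, about 390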
set.seed(12345) # Set seed for reproducibility
n <- nrow(cameras_sub)
# Create an index for the training set (80% of the data)
train_index <- sample(1:n, 0.8 * n)
# Create the training set
train_set <- cameras_sub[train_index, ]
# Create the test set (remaining 20% of the data)
test_set <- cameras_sub[-train_index, ]
# Check the dimensions of the sets
dim(train_set)

## [1] 826 13

dim(test_set)

## [1] 207 13
# create trained model
reg_model_trained <- lm(Price ~ Zoom.wide..W. + Weight..inc..batteries., data = train_set)
summary(reg_model_trained)

##
## Call:
## lm(formula = Price ~ Zoom.wide..W. + Weight..inc..batteries.,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1847.3 -195.1 -116.1 17.9 4519.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 952.591 118.455 8.042 3.07e-15 ***
## Zoom.wide..W. -20.949 2.718 -7.707 3.71e-14 ***
## Weight..inc..batteries. 0.567 0.108 5.250 1.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 573 on 823 degrees of freedom
## Multiple R-squared: 0.2535, Adjusted R-squared: 0.2517
## F-statistic: 139.7 on 2 and 823 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions <- predict(reg_model_trained, newdata = test_set)
# Evaluate the model
mse <- mean((test_set$Price - predictions)^2)
rmse <- sqrt(mse)
rsquared <- 1 - (sum((test_set$Price - predictions)^2) / sum((test_set$Price - mean(test_set$Price))^2))
# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse, "\n")## Mean Squared Error (MSE): 324489.9
## Root Mean Squared Error (RMSE): 569.6401
## R-squared (R²): -0.01273235
With a Multiple R-squared of 0.2535 on the training data, this model accounts for only 25.35% of the variability in Price using the variables Zoom.wide..W. and Weight..inc..batteries. On the test set, the RMSE is high at 569.6 and the R-squared is slightly negative (-0.013), meaning the model predicts no better than the test-set mean. Perhaps incorporating other variables might improve performance. Let's assess the variables Low.resolution and Zoom.tele..T. to see whether they would be appropriate additions to the model.
# Visualize the distribution of 'zoom.tele..T.'
ggplot(cameras, aes(x = Zoom.tele..T.)) +
geom_histogram(binwidth = 50, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Zoom.tele..T.), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Zoom.tele..T.",
x = "Zoom tele (T)",
y = "Frequency") +
theme_minimal()

Somewhat normal, but a bit uneven.
ggplot(cameras, aes(x = Low.resolution)) +
geom_histogram(binwidth = 100, fill = "#8da0cb", color = "#e78ac3") +
geom_vline(xintercept = mean(cameras$Low.resolution), color = "red", linetype = "dashed", linewidth = 1) +
labs(title = "Distribution of Camera Low.resolution",
x = "Low Resolution",
y = "Frequency") +
theme_minimal()

Lots of variability here, but an overall normal shape.
Let's add Low.resolution and Zoom.tele..T. to our model and call it Model 2.
# Build Model 2
model_2 <- lm(Price ~ Low.resolution + Zoom.tele..T. + Zoom.wide..W. + Weight..inc..batteries., data = train_set)
summary(model_2)

##
## Call:
## lm(formula = Price ~ Low.resolution + Zoom.tele..T. + Zoom.wide..W. +
## Weight..inc..batteries., data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1838.4 -206.8 -111.8 23.6 4418.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 423.57143 160.63936 2.637 0.008528 **
## Low.resolution 0.12434 0.02777 4.477 8.65e-06 ***
## Zoom.tele..T. -0.89916 0.25423 -3.537 0.000428 ***
## Zoom.wide..W. -11.10968 3.39382 -3.274 0.001107 **
## Weight..inc..batteries. 0.85603 0.12204 7.014 4.84e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 565.5 on 821 degrees of freedom
## Multiple R-squared: 0.2746, Adjusted R-squared: 0.2711
## F-statistic: 77.71 on 4 and 821 DF, p-value: < 2.2e-16
# Make predictions on the test set
predictions <- predict(model_2, newdata = test_set)
# Evaluate the model
mse <- mean((test_set$Price - predictions)^2)
rmse <- sqrt(mse)
rsquared <- 1 - (sum((test_set$Price - predictions)^2) / sum((test_set$Price - mean(test_set$Price))^2))
# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse, "\n")## Mean Squared Error (MSE): 332374.7
## Root Mean Squared Error (RMSE): 576.5194
## R-squared (R²): -0.03734071
Model 2 still has a poor R-squared value and a high RMSE. Let's try a log transformation.
# there was an error due to infinite values, so add 1 to avoid taking the log of zero
train_set$log_price <- log(train_set$Price + 1)
train_set$log_low.res <- log(train_set$Low.resolution + 1)
train_set$log_zoom.w <- log(train_set$Zoom.wide..W. + 1)
train_set$log_zoom.t <- log(train_set$Zoom.tele..T. + 1)
train_set$log_weight.batt <- log(train_set$Weight..inc..batteries. + 1)
log_model <- lm(log_price ~ log_low.res + log_zoom.w + log_zoom.t + log_weight.batt, data = train_set)
summary(log_model)

##
## Call:
## lm(formula = log_price ~ log_low.res + log_zoom.w + log_zoom.t +
## log_weight.batt, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8175 -0.5190 -0.1187 0.4116 2.8372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.83707 0.26860 21.732 < 2e-16 ***
## log_low.res 0.09672 0.01620 5.970 3.53e-09 ***
## log_zoom.w -0.29635 0.07217 -4.107 4.42e-05 ***
## log_zoom.t -0.03612 0.04941 -0.731 0.465
## log_weight.batt 0.03813 0.03232 1.180 0.238
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7749 on 821 degrees of freedom
## Multiple R-squared: 0.1908, Adjusted R-squared: 0.1868
## F-statistic: 48.38 on 4 and 821 DF, p-value: < 2.2e-16
A log transformation did not raise R-squared. Let's try standardizing the data.
std_camera <- cameras # Create a copy to avoid modifying the original dataset
# Variables to standardize
std_vars <- c("Low.resolution", "Zoom.wide..W.", "Zoom.tele..T.", "Weight..inc..batteries.")
# Standardize selected variables
std_camera[, std_vars] <- scale(std_camera[, std_vars])
# Check the first few rows of the standardized dataset
head(std_camera)

## Model Release.date Max.resolution Low.resolution
## 1 Agfa ePhoto 1280 1997 1024 -1.368157
## 2 Agfa ePhoto 1680 1998 1280 -1.368157
## 3 Agfa ePhoto CL18 2000 640 -2.139724
## 4 Agfa ePhoto CL30 1999 1152 -1.368157
## 5 Agfa ePhoto CL30 Clik! 1999 1152 -1.368157
## 6 Agfa ePhoto CL45 2001 1600 -1.368157
## Effective.pixels Zoom.wide..W. Zoom.tele..T. Normal.focus.range
## 1 0 0.4877952 -0.08065066 70
## 2 1 0.4877952 -0.08065066 50
## 3 0 1.1646973 -0.81826991 0
## 4 0 0.1976942 -0.92517124 0
## 5 0 0.9712967 -0.83965017 50
## 6 1 1.7448991 -0.75412910 50
## Macro.focus.range Storage.included Weight..inc..batteries. Dimensions Price
## 1 40 4 0.38683039 95 179
## 2 0 4 0.38683039 158 179
## 3 0 2 -1.22601004 0 179
## 4 0 4 -1.22601004 0 269
## 5 0 40 -0.07398116 128 1299
## 6 20 8 -0.18918405 119 179
# build std model
std_model <- lm(Price ~ Low.resolution + Zoom.wide..W. + Zoom.tele..T. + Weight..inc..batteries., data = std_camera)
summary(std_model)

##
## Call:
## lm(formula = Price ~ Low.resolution + Zoom.wide..W. + Zoom.tele..T. +
## Weight..inc..batteries., data = std_camera)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2219.7 -222.0 -113.0 41.7 5639.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 457.92 20.05 22.842 < 2e-16 ***
## Low.resolution 138.47 22.79 6.076 1.73e-09 ***
## Zoom.wide..W. -58.30 35.15 -1.659 0.0975 .
## Zoom.tele..T. -123.89 24.04 -5.153 3.07e-07 ***
## Weight..inc..batteries. 311.04 31.45 9.889 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 645.3 on 1031 degrees of freedom
## Multiple R-squared: 0.284, Adjusted R-squared: 0.2812
## F-statistic: 102.2 on 4 and 1031 DF, p-value: < 2.2e-16
Standardizing the predictor variables gives an R-squared of 0.284. Note, though, that this model was fit on the full data set rather than the training set, so the comparison with the earlier models is not exact. In any case, it is still not a strong model, and results regarding the significance of the variables should be interpreted with caution.
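The random forest whose summary is printed below was presumably fit along these lines; a sketch, with `rf_model` as an assumed object name and an assumed seed:

library(randomForest)
set.seed(1234)  # seed value assumed
rf_model <- randomForest(Price ~ Low.resolution + Zoom.tele..T., data = train_set)
rf_model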
##
## Call:
## randomForest(formula = Price ~ Low.resolution + Zoom.tele..T., data = train_set)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 268249.9
## % Var explained: 38.79
## [1] 1
## [1] 442.7539
The random forest explains 38.79% of the variance in Price. This does not suffice for model deployment, but it beats the standardized model's R-squared of 0.284 by about 10 percentage points and is the best model yet. Other approaches should be explored to model camera price.
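As a follow-up (not part of the original output), the forest could also be scored on the held-out test set, mirroring the linear-model evaluation; a sketch assuming the `rf_model` object from above:

rf_pred <- predict(rf_model, newdata = test_set)
rf_rmse <- sqrt(mean((test_set$Price - rf_pred)^2))
rf_rsq <- 1 - sum((test_set$Price - rf_pred)^2) / sum((test_set$Price - mean(test_set$Price))^2)
cat("Random forest test RMSE:", rf_rmse, "\n")
cat("Random forest test R-squared:", rf_rsq, "\n")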