Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

What factors play a role in the price of a mobile phone?

Load the data

library(tidyverse)  # dplyr (glimpse, mutate, %>%) and ggplot2
library(broom)      # augment()
phone <- read.csv('https://raw.githubusercontent.com/Kingtilon1/DATA607/main/Cellphone.csv')
glimpse(phone)
## Rows: 161
## Columns: 14
## $ Product_id   <int> 203, 880, 40, 99, 880, 947, 774, 947, 99, 1103, 289, 605,…
## $ Price        <int> 2357, 1749, 1916, 1315, 1749, 2137, 1238, 2137, 1315, 258…
## $ Sale         <int> 10, 10, 10, 11, 11, 12, 13, 13, 14, 15, 16, 16, 16, 16, 1…
## $ weight       <dbl> 135.0, 125.0, 110.0, 118.5, 125.0, 150.0, 134.1, 150.0, 1…
## $ resoloution  <dbl> 5.2, 4.0, 4.7, 4.0, 4.0, 5.5, 4.0, 5.5, 4.0, 5.1, 5.3, 5.…
## $ ppi          <int> 424, 233, 312, 233, 233, 401, 233, 401, 233, 432, 277, 20…
## $ cpu.core     <int> 8, 2, 4, 2, 2, 4, 2, 4, 2, 4, 8, 8, 4, 4, 4, 4, 4, 4, 4, …
## $ cpu.freq     <dbl> 1.350, 1.300, 1.200, 1.300, 1.300, 2.300, 1.200, 2.300, 1…
## $ internal.mem <dbl> 16, 4, 8, 4, 4, 16, 8, 16, 4, 16, 32, 4, 16, 32, 16, 8, 1…
## $ ram          <dbl> 3.000, 1.000, 1.500, 0.512, 1.000, 2.000, 1.000, 2.000, 0…
## $ RearCam      <dbl> 13.00, 3.15, 13.00, 3.15, 3.15, 16.00, 2.00, 16.00, 3.15,…
## $ Front_Cam    <dbl> 8.0, 0.0, 5.0, 0.0, 0.0, 8.0, 0.0, 8.0, 0.0, 2.0, 8.0, 0.…
## $ battery      <int> 2610, 1700, 2000, 1400, 1700, 2500, 1560, 2500, 1400, 280…
## $ thickness    <dbl> 7.4, 9.9, 7.6, 11.0, 9.9, 9.5, 11.7, 9.5, 11.0, 8.1, 7.7,…

Fit the model

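# Dichotomous term: 1 if the phone costs more than the mean price, 0 otherwise
phone <- phone %>%
  mutate(high_price = ifelse(Price > mean(Price), 1, 0))

# Quadratic term
phone$weight_squared <- phone$weight^2

# Dichotomous vs. quantitative interaction term
phone$battery_interaction <- phone$high_price * phone$battery

model <- lm(Price ~ weight + weight_squared + high_price + battery + battery_interaction, data = phone)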

summary(model)
## 
## Call:
## lm(formula = Price ~ weight + weight_squared + high_price + battery + 
##     battery_interaction, data = phone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -758.30 -207.85   -8.25  237.59  824.01 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         749.117736 117.655399   6.367 2.08e-09 ***
## weight                1.001932   1.521148   0.659    0.511    
## weight_squared       -0.007187   0.001676  -4.289 3.14e-05 ***
## high_price          719.758634 149.853704   4.803 3.66e-06 ***
## battery               0.397754   0.060165   6.611 5.82e-10 ***
## battery_interaction   0.040680   0.052486   0.775    0.439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 313.6 on 155 degrees of freedom
## Multiple R-squared:  0.8386, Adjusted R-squared:  0.8334 
## F-statistic: 161.1 on 5 and 155 DF,  p-value: < 2.2e-16
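
The hand-built columns can also be written directly in the model formula, using `I(weight^2)` for the quadratic term and `high_price * battery` for the two main effects plus their interaction. The sketch below is only an illustration on simulated data: the data frame `toy` and every value in it are made up, and only the column names mirror the phone data. It fits both specifications and checks that the coefficients agree.

```r
# Simulated stand-in data: all values are invented for illustration only
set.seed(1)
n <- 100
toy <- data.frame(
  weight     = runif(n, 100, 200),
  battery    = runif(n, 1500, 4000),
  high_price = rbinom(n, 1, 0.5)
)
toy$Price <- 700 + 1 * toy$weight - 0.007 * toy$weight^2 +
  700 * toy$high_price + 0.4 * toy$battery + rnorm(n, sd = 300)

# Specification 1: hand-built columns, as in the report
toy$weight_squared      <- toy$weight^2
toy$battery_interaction <- toy$high_price * toy$battery
manual <- lm(Price ~ weight + weight_squared + high_price + battery +
               battery_interaction, data = toy)

# Specification 2: the same terms built inside the formula
inline <- lm(Price ~ weight + I(weight^2) + high_price * battery, data = toy)

# Compare the estimates, ignoring the differing coefficient names
same_fit <- all.equal(unname(coef(manual)), unname(coef(inline)))
print(same_fit)
```

The formula version keeps the data frame free of derived columns, and functions such as `predict()` rebuild the squared and interaction terms automatically from new data.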

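The model explains about 84% of the variation in price (adjusted R-squared 0.8334). The intercept (749.12) is the predicted price of a low-priced phone with zero weight and zero battery capacity, so it has no practical meaning on its own. The linear weight term (1.00) is not statistically significant (p = 0.511), but the weight-squared term (-0.0072) is significantly negative, so the fitted price curve is concave in weight: the effect of one more unit of weight is 1.00 - 2(0.0072)(weight), which turns negative above a weight of about 70. The high_price coefficient is the gap between high- and low-priced phones at zero battery capacity, roughly $719.76; because high_price is defined from Price itself, this gap largely reflects how the two groups were constructed rather than an independent driver of price. Increased battery capacity is associated with higher prices, with each additional unit adding approximately $0.40 for low-priced phones. The interaction term adds about $0.04 per unit to that slope for high-priced phones, but it is not statistically significant (p = 0.439), so there is no evidence that the battery-price relationship differs between high- and low-priced products.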
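
The arithmetic behind these interpretations can be reproduced from the printed coefficients alone. The following is a minimal sketch: the numbers are copied from the `summary(model)` output, while the names `b`, `weight_slope`, `turning_point`, and `high_price_gap` are hypothetical helpers introduced here for illustration.

```r
# Coefficients copied from the summary(model) output
b <- c(intercept = 749.117736, weight = 1.001932, weight_squared = -0.007187,
       high_price = 719.758634, battery = 0.397754,
       battery_interaction = 0.040680)

# Marginal effect of weight: derivative of b1 * w + b2 * w^2 with respect to w
weight_slope <- function(w) b[["weight"]] + 2 * b[["weight_squared"]] * w

# Weight at which the fitted quadratic peaks (slope equals zero)
turning_point <- -b[["weight"]] / (2 * b[["weight_squared"]])

# Battery slope in each group: the interaction shifts the slope when high_price = 1
battery_slope_low  <- b[["battery"]]
battery_slope_high <- b[["battery"]] + b[["battery_interaction"]]

# Gap between high- and low-priced phones at a given battery capacity
high_price_gap <- function(battery) {
  b[["high_price"]] + b[["battery_interaction"]] * battery
}

print(round(turning_point, 1))
print(weight_slope(c(110, 135, 150)))
print(c(low = battery_slope_low, high = battery_slope_high))
print(high_price_gap(2000))
```

The weights passed to `weight_slope()` (110, 135, 150) are values visible in the `glimpse()` output; all of them sit above the turning point, so the fitted effect of extra weight is negative there.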

Was the linear model appropriate?

The linear model may not be fully appropriate, because the variability of the errors is uneven across price ranges: the errors look more consistent for mid-range smartphones than for high-end ones such as Apple products. The Q-Q plot, on the other hand, shows residuals that are approximately normally distributed, as linear regression assumes. The uneven variability is the main concern: it makes the standard errors less reliable and could limit how accurately the model predicts prices in the real world, particularly for premium products.

residuals_vs_fitted <- ggplot(data = augment(model), aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals") +
  ggtitle("Residuals vs Fitted")

qq_plot <- ggplot(data = augment(model), aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  ggtitle("Normal Q-Q Plot of Residuals")

# grid.arrange() draws the two plots side by side; printing its return value
# would only echo the TableGrob structure, so it is not stored or printed
gridExtra::grid.arrange(residuals_vs_fitted, qq_plot, ncol = 2)

In the residuals vs. fitted plot, the points form vertical clusters, particularly at fitted values of roughly $1500-2000 and $2500-3000, which suggests that the size of the errors varies across price levels. In the Q-Q plot, the points lie close to the diagonal line, indicating that the residuals are approximately normally distributed, in line with the assumptions of linear regression. So, while the model's accuracy is a concern in certain price ranges, the overall distribution of the errors is consistent with what linear regression assumes.
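
A visual impression of uneven error variance can be backed up with a formal test. The sketch below is not part of the original analysis: it implements the studentized (Koenker) form of the Breusch-Pagan test in base R, under the assumption that the variance may depend linearly on the model's own predictors. The function name `koenker_bp` is hypothetical, and the data used to exercise it are simulated, not the phone data.

```r
# Studentized Breusch-Pagan test: regress the squared residuals on the
# model's predictors; n * R^2 of that auxiliary regression is approximately
# chi-squared under the null hypothesis of constant error variance
koenker_bp <- function(fit) {
  X   <- model.matrix(fit)            # design matrix, including the intercept
  e2  <- resid(fit)^2                 # squared residuals
  aux <- lm.fit(X, e2)                # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1                 # number of predictors, intercept excluded
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example in which the error spread grows with x
set.seed(42)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)
result <- koenker_bp(lm(y ~ x))
print(result)
```

Calling `koenker_bp(model)` on the fitted phone model would apply the same check to the regression above; a small p-value would support the concern about non-constant variance. The `lmtest` package offers `bptest()` for the same purpose, and `shapiro.test(resid(model))` from base R gives a formal counterpart to the Q-Q plot.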

Resource

Kaggle