The purpose of this assignment is to conduct an extensive analysis of
the CarDB dataset using R.
We aim to explore relationships among features like engine
specifications, fuel efficiency, and car pricing using data
manipulation, statistical summaries, and visualizations.
This document includes:
- Data import and cleaning
- Statistical analysis of key variables
- Answering insightful questions using dplyr
- Visualizing data through various plots
We use the readr package for importing data,
dplyr for data manipulation, ggplot2 for
visualization, GGally for pair plots, and corrplot for
correlation plots.
require(readr)
## Loading required package: readr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
require(corrplot)
## Loading required package: corrplot
## corrplot 0.95 loaded
carDB <- read_csv("C:\\Users\\newsa\\Desktop\\data.csv")
## Rows: 11914 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Make, Model, Engine Fuel Type, Transmission Type, Driven_Wheels, Ma...
## dbl (8): Year, Engine HP, Engine Cylinders, Number of Doors, highway MPG, ci...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(carDB)
## Make Model Year Engine Fuel Type
## Length:11914 Length:11914 Min. :1990 Length:11914
## Class :character Class :character 1st Qu.:2007 Class :character
## Mode :character Mode :character Median :2015 Mode :character
## Mean :2010
## 3rd Qu.:2016
## Max. :2017
##
## Engine HP Engine Cylinders Transmission Type Driven_Wheels
## Min. : 55.0 Min. : 0.000 Length:11914 Length:11914
## 1st Qu.: 170.0 1st Qu.: 4.000 Class :character Class :character
## Median : 227.0 Median : 6.000 Mode :character Mode :character
## Mean : 249.4 Mean : 5.629
## 3rd Qu.: 300.0 3rd Qu.: 6.000
## Max. :1001.0 Max. :16.000
## NA's :69 NA's :30
## Number of Doors Market Category Vehicle Size Vehicle Style
## Min. :2.000 Length:11914 Length:11914 Length:11914
## 1st Qu.:2.000 Class :character Class :character Class :character
## Median :4.000 Mode :character Mode :character Mode :character
## Mean :3.436
## 3rd Qu.:4.000
## Max. :4.000
## NA's :6
## highway MPG city mpg Popularity MSRP
## Min. : 12.00 Min. : 7.00 Min. : 2 Min. : 2000
## 1st Qu.: 22.00 1st Qu.: 16.00 1st Qu.: 549 1st Qu.: 21000
## Median : 26.00 Median : 18.00 Median :1385 Median : 29995
## Mean : 26.64 Mean : 19.73 Mean :1555 Mean : 40595
## 3rd Qu.: 30.00 3rd Qu.: 22.00 3rd Qu.:2009 3rd Qu.: 42231
## Max. :354.00 Max. :137.00 Max. :5657 Max. :2065902
##
sum(is.na(carDB))
## [1] 108
colSums(is.na(carDB))
## Make Model Year Engine Fuel Type
## 0 0 0 3
## Engine HP Engine Cylinders Transmission Type Driven_Wheels
## 69 30 0 0
## Number of Doors Market Category Vehicle Size Vehicle Style
## 6 0 0 0
## highway MPG city mpg Popularity MSRP
## 0 0 0 0
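# Fill the missing fuel type for the Verona rows
carDB$`Engine Fuel Type`[carDB$Model == "Verona" & is.na(carDB$`Engine Fuel Type`)] <- "regular unleaded"
# Electric cars have no cylinders, so record 0
carDB$`Engine Cylinders`[carDB$`Engine Fuel Type` == "electric"] <- 0
# Drop rows where the cylinder count is still missing
carDB <- carDB[!is.na(carDB$`Engine Cylinders`), ]
# Exclude all Tesla rows
carDB <- carDB[carDB$Make != "Tesla", ]
# Set the Ferrari FF door count to 2
carDB$`Number of Doors`[carDB$Make == "Ferrari" & carDB$Model == "FF"] <- 2
# Drop rows with missing horsepower, then rows with an MSRP of exactly 2000
carDB <- carDB[!is.na(carDB$`Engine HP`), ]
carDB <- carDB[carDB$MSRP != 2000, ]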
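# Return min, max, mean, median, and mode for a numeric vector
get_statistics <- function(column) {
  # Mode: the distinct value with the highest count (the first one wins on ties)
  uniqv <- unique(column)
  mode_val <- uniqv[which.max(tabulate(match(column, uniqv)))]
  list(
    min = min(column),
    max = max(column),
    mean = mean(column),
    median = median(column),
    mode = mode_val
  )
}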
numerical_columns <- c("Year", "Engine HP", "highway MPG", "city mpg", "Popularity", "MSRP")
statistics_results <- lapply(carDB[numerical_columns], get_statistics)
statistics_results
## $Year
## $Year$min
## [1] 1990
##
## $Year$max
## [1] 2017
##
## $Year$mean
## [1] 2011.989
##
## $Year$median
## [1] 2015
##
## $Year$mode
## [1] 2015
##
##
## $`Engine HP`
## $`Engine HP`$min
## [1] 66
##
## $`Engine HP`$max
## [1] 1001
##
## $`Engine HP`$mean
## [1] 259.4756
##
## $`Engine HP`$median
## [1] 240
##
## $`Engine HP`$mode
## [1] 200
##
##
## $`highway MPG`
## $`highway MPG`$min
## [1] 12
##
## $`highway MPG`$max
## [1] 354
##
## $`highway MPG`$mean
## [1] 26.54148
##
## $`highway MPG`$median
## [1] 26
##
## $`highway MPG`$mode
## [1] 24
##
##
## $`city mpg`
## $`city mpg`$min
## [1] 7
##
## $`city mpg`$max
## [1] 137
##
## $`city mpg`$mean
## [1] 19.51784
##
## $`city mpg`$median
## [1] 18
##
## $`city mpg`$mode
## [1] 17
##
##
## $Popularity
## $Popularity$min
## [1] 2
##
## $Popularity$max
## [1] 5657
##
## $Popularity$mean
## [1] 1564.49
##
## $Popularity$median
## [1] 1385
##
## $Popularity$mode
## [1] 1385
##
##
## $MSRP
## $MSRP$min
## [1] 2002
##
## $MSRP$max
## [1] 2065902
##
## $MSRP$mean
## [1] 44274.69
##
## $MSRP$median
## [1] 31520
##
## $MSRP$mode
## [1] 25995
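The mode line inside `get_statistics` is compact, so a small base-R illustration may help. The vector `x` below is made-up data, not taken from CarDB; the sketch only shows how `match()` and `tabulate()` combine to count occurrences.

```r
# Toy vector (hypothetical values, not from CarDB)
x <- c(24, 26, 24, 30, 26, 24)

uniqv    <- unique(x)                 # distinct values in order of first appearance
idx      <- match(x, uniqv)           # position of each element within uniqv
counts   <- tabulate(idx)             # number of occurrences per distinct value
mode_val <- uniqv[which.max(counts)]  # distinct value with the highest count

print(counts)
print(mode_val)
```

Because `which.max()` returns the first maximum, a tie is resolved in favour of the value that appears first in the column.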
# Cars with more than 6 cylinders that still exceed 25 highway MPG
carDB %>%
  filter(`Engine Cylinders` > 6, `highway MPG` > 25)
## # A tibble: 59 × 16
## Make Model Year `Engine Fuel Type` `Engine HP` `Engine Cylinders`
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 BMW 7 Seri… 2017 premium unleaded … 445 8
## 2 Audi A8 2016 premium unleaded … 450 8
## 3 Audi A8 2017 premium unleaded … 450 8
## 4 Mercedes-Benz CLS-Cl… 2015 premium unleaded … 402 8
## 5 Mercedes-Benz CLS-Cl… 2016 premium unleaded … 402 8
## 6 Mercedes-Benz CLS-Cl… 2017 premium unleaded … 402 8
## 7 Chevrolet Corvet… 2014 premium unleaded … 455 8
## 8 Chevrolet Corvet… 2014 premium unleaded … 455 8
## 9 Chevrolet Corvet… 2014 premium unleaded … 455 8
## 10 Chevrolet Corvet… 2014 premium unleaded … 455 8
## # ℹ 49 more rows
## # ℹ 10 more variables: `Transmission Type` <chr>, Driven_Wheels <chr>,
## # `Number of Doors` <dbl>, `Market Category` <chr>, `Vehicle Size` <chr>,
## # `Vehicle Style` <chr>, `highway MPG` <dbl>, `city mpg` <dbl>,
## # Popularity <dbl>, MSRP <dbl>
# Ten most expensive cars by MSRP
carDB %>%
  arrange(desc(MSRP)) %>%
  head(10)
## # A tibble: 10 × 16
## Make Model Year `Engine Fuel Type` `Engine HP` `Engine Cylinders`
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Bugatti Veyron 1… 2008 premium unleaded … 1001 16
## 2 Bugatti Veyron 1… 2009 premium unleaded … 1001 16
## 3 Lamborghini Reventon 2008 premium unleaded … 650 12
## 4 Bugatti Veyron 1… 2008 premium unleaded … 1001 16
## 5 Maybach Landaulet 2012 premium unleaded … 620 12
## 6 Maybach Landaulet 2011 premium unleaded … 620 12
## 7 Ferrari Enzo 2003 premium unleaded … 660 12
## 8 Lamborghini Aventador 2014 premium unleaded … 720 12
## 9 Lamborghini Aventador 2015 premium unleaded … 720 12
## 10 Lamborghini Aventador 2016 premium unleaded … 750 12
## # ℹ 10 more variables: `Transmission Type` <chr>, Driven_Wheels <chr>,
## # `Number of Doors` <dbl>, `Market Category` <chr>, `Vehicle Size` <chr>,
## # `Vehicle Style` <chr>, `highway MPG` <dbl>, `city mpg` <dbl>,
## # Popularity <dbl>, MSRP <dbl>
# Average MSRP and number of rows per make (n() counts rows, not distinct models)
carDB %>%
  group_by(Make) %>%
  summarise(Average_MSRP = mean(MSRP),
            Total_Models = n())
## # A tibble: 47 × 3
## Make Average_MSRP Total_Models
## <chr> <dbl> <int>
## 1 Acura 36106. 243
## 2 Alfa Romeo 61600 5
## 3 Aston Martin 197910. 93
## 4 Audi 60395. 289
## 5 BMW 61547. 334
## 6 Bentley 247169. 74
## 7 Bugatti 1757224. 3
## 8 Buick 32215. 170
## 9 Cadillac 57347. 389
## 10 Chevrolet 30212. 1041
## # ℹ 37 more rows
# Average engine horsepower by fuel type
carDB %>%
  group_by(`Engine Fuel Type`) %>%
  summarise(Average_HP = mean(`Engine HP`))
## # A tibble: 9 × 2
## `Engine Fuel Type` Average_HP
## <chr> <dbl>
## 1 diesel 186.
## 2 electric 145.
## 3 flex-fuel (premium unleaded recommended/E85) 283.
## 4 flex-fuel (premium unleaded required/E85) 515.
## 5 flex-fuel (unleaded/E85) 286.
## 6 natural gas 110
## 7 premium unleaded (recommended) 270.
## 8 premium unleaded (required) 375.
## 9 regular unleaded 215.
# Rows that share the highest Popularity within each cylinder count (ties are all kept)
carDB %>%
  group_by(`Engine Cylinders`) %>%
  filter(Popularity == max(Popularity))
## # A tibble: 812 × 16
## # Groups: Engine Cylinders [9]
## Make Model Year `Engine Fuel Type` `Engine HP` `Engine Cylinders`
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 BMW 7 Series 2015 premium unleaded (requir… 535 12
## 2 BMW 8 Series 1995 regular unleaded 322 12
## 3 BMW 8 Series 1995 regular unleaded 372 12
## 4 BMW 8 Series 1996 regular unleaded 322 12
## 5 BMW 8 Series 1997 regular unleaded 322 12
## 6 Ford Bronco 1994 regular unleaded 185 8
## 7 Ford Bronco 1994 regular unleaded 185 8
## 8 Ford Bronco 1994 regular unleaded 185 8
## 9 Ford Bronco 1995 regular unleaded 205 8
## 10 Ford Bronco 1995 regular unleaded 205 8
## # ℹ 802 more rows
## # ℹ 10 more variables: `Transmission Type` <chr>, Driven_Wheels <chr>,
## # `Number of Doors` <dbl>, `Market Category` <chr>, `Vehicle Size` <chr>,
## # `Vehicle Style` <chr>, `highway MPG` <dbl>, `city mpg` <dbl>,
## # Popularity <dbl>, MSRP <dbl>
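The query above keeps every row tied for the group maximum, which is why it returns 812 rows. If a single representative per cylinder count is wanted, one option is dplyr's `slice_max()` with `with_ties = FALSE`. The sketch below uses a small made-up tibble (`toy` and its values are hypothetical) rather than CarDB.

```r
library(dplyr)

# Hypothetical miniature stand-in for carDB
toy <- tibble(
  Model = c("A", "B", "C", "D", "E"),
  `Engine Cylinders` = c(4, 4, 4, 8, 8),
  Popularity = c(100, 300, 300, 50, 70)
)

# Keeps all rows tied for the maximum, like the query above
all_ties <- toy %>%
  group_by(`Engine Cylinders`) %>%
  filter(Popularity == max(Popularity)) %>%
  ungroup()

# Keeps exactly one row per group, dropping the other tied rows
one_each <- toy %>%
  group_by(`Engine Cylinders`) %>%
  slice_max(Popularity, n = 1, with_ties = FALSE) %>%
  ungroup()

print(all_ties)
print(one_each)
```

Which tied row survives with `with_ties = FALSE` should not be relied on; if it matters, sort on a second column first.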
# Combined MPG: simple average of city and highway MPG
carDB <- carDB %>%
  mutate(`Combined MPG` = (`city mpg` + `highway MPG`) / 2)
head(carDB$`Combined MPG`)
## [1] 22.5 23.5 24.0 23.0 23.0 23.0
This section applies several visualization functions to CarDB to gain a deeper understanding of the variables and how they relate to each other.
This code creates a bar chart showing the count of cars for each Vehicle Style, using a steel-blue fill and a minimalistic theme for a clean look.
ggplot(carDB, aes(x = `Vehicle Style`)) +
geom_bar(fill = "steelblue") +
labs(title = "Number of Cars by Vehicle Style",
x = "Vehicle Style",
y = "Count of Cars") +
theme_minimal()
This code plots a bar chart displaying the frequency of cars across different Transmission Types, with an orange color scheme and a minimal clean theme.
ggplot(carDB, aes(x = `Transmission Type`)) +
geom_bar(fill = "orange") +
labs(title = "Number of Cars by Transmission Type",
x = "Transmission Type",
y = "Count of Cars") +
theme_minimal()
This code generates a scatter plot where each point represents a car, mapping Engine HP to the x-axis and MSRP to the y-axis, with red-colored points and a minimalistic theme for enhanced clarity.
ggplot(carDB, aes(x = `Engine HP`, y = MSRP)) +
  geom_point(color = "red", size = 2, shape = 20) +
  labs(title = "Engine Horsepower vs MSRP",
       x = "Engine Horsepower (HP)",
       y = "Price (MSRP)") +
  theme_minimal()
This code creates a scatter plot showing the relationship between Engine Horsepower and Highway MPG, using semi-transparent tomato-colored points and a minimalist theme for a clear and simple presentation.
ggplot(carDB, aes(x = `Engine HP`, y = `highway MPG`)) +
geom_point(color = "tomato", size = 2, alpha = 0.7) +
labs(title = "Engine Horsepower vs Highway Mileage",
x = "Engine Horsepower (HP)",
y = "Highway MPG") +
theme_minimal()
This code generates a pair plot to visualize pairwise relationships between Engine HP, Engine Cylinders, Highway MPG, City MPG, and MSRP, allowing quick detection of correlations and patterns across multiple variables.
library(GGally)
carDB_numeric <- carDB[, c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP")]
ggpairs(carDB_numeric)
This code creates a pair plot displaying relationships between key numeric variables, with points colored by Vehicle Size to highlight differences across categories while exploring correlations and distributions.
ggpairs(carDB,
columns = c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP"),
mapping = aes(color = `Vehicle Size`))
This code filters the dataset to include only cars with an MSRP less than or equal to $200,000 and creates a histogram that visualizes the distribution of car prices, using light blue bars with dark blue outlines and a minimalistic theme for clarity.
filteredCarDB <- carDB[carDB$MSRP <= 200000,]
ggplot(filteredCarDB, aes(x = MSRP)) +
geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") +
labs(title = "Distribution of Car Prices",
x = "Price (MSRP)",
y = "Frequency") +
theme_minimal()
This code creates a histogram to visualize the distribution of engine horsepower across cars, with orange bars outlined in black, and 30 bins to show the frequency of different horsepower values, all presented in a minimalistic theme for clear readability.
ggplot(carDB, aes(x = `Engine HP`)) +
geom_histogram(fill = "orange", color = "black", bins = 30) +
labs(title = "Distribution of Engine Horsepower",
x = "Horsepower (HP)",
y = "Number of Cars") +
theme_minimal()
This code plots the empirical cumulative distribution function (ECDF) for car prices (MSRP), using a blue step-line to show the cumulative probability of car prices, with clear axis labels and a minimalistic theme for a clean and easy-to-read visualization.
ggplot(carDB, aes(x = MSRP)) +
  stat_ecdf(geom = "step", color = "blue", linewidth = 1) +
  labs(title = "Cumulative Distribution of Car Prices",
       x = "Price (MSRP)",
       y = "Cumulative Probability") +
  theme_minimal()
This code plots the empirical cumulative distribution function (ECDF) for Engine Horsepower (HP), using a dark green step-line to show the cumulative probability of horsepower values, with clearly labeled axes and a minimalistic theme for a clean and professional look.
ggplot(carDB, aes(x = `Engine HP`)) +
stat_ecdf(geom = "step", color = "darkgreen", linewidth = 1) +
labs(title = "Cumulative Distribution of Engine Horsepower",
x = "Horsepower (HP)",
y = "Cumulative Probability") +
theme_minimal()
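To read an ECDF plot numerically, base R's `ecdf()` returns a function that gives the proportion of observations at or below a value. A minimal sketch on made-up prices (the vector `prices` is hypothetical, not CarDB data):

```r
# Hypothetical prices, for illustration only
prices <- c(20000, 25000, 25000, 40000, 90000)

Fn <- ecdf(prices)        # step function: share of values <= x

p_25k <- Fn(25000)        # 3 of the 5 values are <= 25000
p_50k <- Fn(50000)        # 4 of the 5 values are <= 50000

print(c(p_25k, p_50k))
```

Assuming `carDB` is loaded as in this report, `ecdf(carDB$MSRP)(50000)` would give the share of cars priced at or below $50,000 in the same way.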
This code creates a boxplot to visualize the distribution of car prices (MSRP) for different Vehicle Sizes, with light blue boxes and dark blue outlines, providing insights into the spread and central tendency of prices across vehicle size categories. The plot uses a minimalistic theme for clarity.
library(ggplot2)
ggplot(carDB, aes(x = `Vehicle Size`, y = MSRP)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
labs(title = "Car Prices by Vehicle Size",
x = "Vehicle Size",
y = "Price (MSRP)") +
theme_minimal()
This code generates a boxplot that visualizes city mileage (MPG) across different Transmission Types, using light green boxes with dark green outlines, to show the distribution, spread, and central tendency of city MPG for each transmission category. The plot is presented with a minimalistic theme for better clarity.
ggplot(carDB, aes(x = `Transmission Type`, y = `city mpg`)) +
geom_boxplot(fill = "lightgreen", color = "darkgreen") +
labs(title = "City Mileage by Transmission Type",
x = "Transmission Type",
y = "City MPG") +
theme_minimal()
This code creates a boxplot to visualize car prices (MSRP) across different Engine Cylinder categories, using purple boxes with black outlines. It also rotates the x-axis labels by 45° for better readability, especially when cylinder numbers are numerous. The plot uses a minimalistic theme for clarity.
ggplot(carDB, aes(x = as.factor(`Engine Cylinders`), y = MSRP)) +
geom_boxplot(fill = "purple", color = "black") +
labs(title = "Car Prices by Engine Cylinders",
x = "Engine Cylinders",
y = "Price (MSRP)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# One-way ANOVA: does mean MSRP differ across cylinder counts?
anova_result1 <- aov(MSRP ~ as.factor(`Engine Cylinders`), data = carDB)
summary(anova_result1)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(`Engine Cylinders`) 8 2.595e+13 3.244e+12 2277 <2e-16 ***
## Residuals 10780 1.535e+13 1.424e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# One-way ANOVA: does mean city MPG differ across vehicle sizes?
anova_result2 <- aov(`city mpg` ~ `Vehicle Size`, data = carDB)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## `Vehicle Size` 2 62026 31013 650.4 <2e-16 ***
## Residuals 10786 514311 48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
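The F-tests above indicate that the group means differ, but not how much of the variation the grouping accounts for. One common effect-size measure, eta squared, is the between-group sum of squares divided by the total sum of squares. The sketch below simply re-uses the rounded sums of squares printed in the two ANOVA tables, so the results are approximate; `eta_sq` is an illustrative helper, not part of the original analysis.

```r
# Sums of squares copied from the ANOVA summaries above (rounded as printed)
ss_cyl_between  <- 2.595e13   # as.factor(`Engine Cylinders`)
ss_cyl_resid    <- 1.535e13
ss_size_between <- 62026      # `Vehicle Size`
ss_size_resid   <- 514311

# Eta squared: share of total variation explained by the grouping factor
eta_sq <- function(ss_between, ss_resid) ss_between / (ss_between + ss_resid)

eta_cyl  <- eta_sq(ss_cyl_between, ss_cyl_resid)
eta_size <- eta_sq(ss_size_between, ss_size_resid)

print(round(c(cylinders_on_msrp = eta_cyl, size_on_city_mpg = eta_size), 3))
```

On these figures, cylinder count accounts for roughly 63% of the variation in MSRP, while Vehicle Size accounts for about 11% of the variation in city MPG, even though both F-tests are highly significant.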
This code computes and prints the correlation matrix for selected numeric variables, including Engine HP, Engine Cylinders, highway MPG, city mpg, and MSRP, to examine the pairwise relationships and strength of correlation between these variables.
# Compute correlation matrix for selected numeric variables
cor_matrix <- cor(carDB[, c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP")])
# Print correlation matrix
print(cor_matrix)
## Engine HP Engine Cylinders highway MPG city mpg MSRP
## Engine HP 1.0000000 0.7996033 -0.4462507 -0.4723677 0.6494039
## Engine Cylinders 0.7996033 1.0000000 -0.6267807 -0.6216346 0.5528937
## highway MPG -0.4462507 -0.6267807 1.0000000 0.8527788 -0.2160119
## city mpg -0.4723677 -0.6216346 0.8527788 1.0000000 -0.2270533
## MSRP 0.6494039 0.5528937 -0.2160119 -0.2270533 1.0000000
This code visualizes the correlation matrix for selected numeric variables using a colored correlation plot, where the upper triangle is shown with colored tiles representing correlation strength. The black text labels are rotated for better readability, making it easier to assess the relationships between Engine HP, Engine Cylinders, highway MPG, city mpg, and MSRP.
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)
This code fits a simple linear regression model to predict MSRP based on Engine HP. The model summary is displayed to evaluate the relationship between Engine HP and MSRP, including key statistics like coefficients, p-values, and R-squared values to assess the model’s performance.
# Simple Linear Regression Model: MSRP ~ Engine HP
model1 <- lm(MSRP ~ `Engine HP`, data = carDB)
# Show model summary
summary(model1)
##
## Call:
## lm(formula = MSRP ~ `Engine HP`, data = carDB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -151849 -18861 230 13552 1746791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51896.514 1175.121 -44.16 <2e-16 ***
## `Engine HP` 370.637 4.179 88.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47050 on 10787 degrees of freedom
## Multiple R-squared: 0.4217, Adjusted R-squared: 0.4217
## F-statistic: 7867 on 1 and 10787 DF, p-value: < 2.2e-16
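The coefficient table translates into the fitted line MSRP = -51896.514 + 370.637 × Engine HP. As a quick worked example, the sketch below plugs a hypothetical 300 HP car into that line, using the coefficients exactly as printed above. `predict_msrp` is an illustrative helper, not part of the original analysis.

```r
# Coefficients copied from summary(model1) above
intercept <- -51896.514
slope_hp  <- 370.637

# Predicted MSRP for a given horsepower under the fitted straight line
predict_msrp <- function(hp) intercept + slope_hp * hp

pred_300 <- predict_msrp(300)
print(pred_300)                       # -51896.514 + 370.637 * 300

# Each additional horsepower adds the slope to the predicted price
print(predict_msrp(301) - predict_msrp(300))
```

Because the intercept is negative, the line predicts a negative price below roughly 140 HP, one sign that a straight line is only a rough description of this relationship.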
This code fits a simple linear regression model to predict City MPG based on Engine HP. The model summary is displayed to examine the coefficients, p-values, and other key statistics that describe the relationship between Engine HP and City MPG.
# Simple Linear Regression Model: City MPG ~ Engine HP
model2 <- lm(`city mpg` ~ `Engine HP`, data = carDB)
# Show model summary
summary(model2)
##
## Call:
## lm(formula = `city mpg` ~ `Engine HP`, data = carDB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.444 -3.023 -0.431 2.180 114.633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.7814182 0.1608967 172.67 <2e-16 ***
## `Engine HP` -0.0318472 0.0005722 -55.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.443 on 10787 degrees of freedom
## Multiple R-squared: 0.2231, Adjusted R-squared: 0.2231
## F-statistic: 3098 on 1 and 10787 DF, p-value: < 2.2e-16
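For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables. This can be checked against the correlation matrix printed earlier; the values below are copied from that output.

```r
# Correlations with Engine HP, copied from the correlation matrix above
r_hp_msrp <- 0.6494039
r_hp_city <- -0.4723677

r2_msrp <- r_hp_msrp^2   # compare with Multiple R-squared of model1
r2_city <- r_hp_city^2   # compare with Multiple R-squared of model2

print(round(c(model1 = r2_msrp, model2 = r2_city), 4))
```

Both squares agree with the reported values (0.4217 and 0.2231), so the two simple regressions add no information about strength of association beyond what the correlation matrix already shows.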
This code creates a scatter plot showing the relationship between Engine Horsepower and MSRP. It also adds a linear regression line (in red) to highlight the trend between the two variables, with a minimalistic theme for clarity and focus.
# Visualize regression line for MSRP vs. Engine HP
ggplot(carDB, aes(x = `Engine HP`, y = MSRP)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Regression: MSRP vs Engine HP",
x = "Engine HP",
y = "Price (MSRP)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
This code fits a simple linear regression model to predict City MPG based on Engine Horsepower. The predicted values are then plotted as a red regression line, overlaying the scatter plot of actual data points. The plot is designed with a minimalistic theme, and the title and subtitle are styled for emphasis.
# Simple linear regression model: city mpg ~ Engine HP
model_mpg <- lm(`city mpg` ~ `Engine HP`, data = carDB)
# Predict values
carDB$predicted_city_mpg <- predict(model_mpg)
# Plot
ggplot(carDB, aes(x = `Engine HP`, y = `city mpg`)) +
geom_point(alpha = 0.4, color = "#00BFC4") + # scatter points
geom_line(aes(y = predicted_city_mpg), color = "#F8766D", linewidth = 1.5) + # regression line
labs(title = "Regression Prediction Line: Engine HP vs City MPG",
subtitle = "Red Line: Predicted City MPG from Engine HP",
x = "Engine Horsepower",
y = "City MPG") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12))
This code fits a multiple linear regression model to predict MSRP based on Engine HP, Engine Cylinders, and Vehicle Size. The model summary is displayed to examine the coefficients, statistical significance, and other key parameters that describe the relationship between the predictors and the target variable.
# Multiple Linear Regression Model: MSRP ~ Engine HP + Engine Cylinders + Vehicle Size
model3 <- lm(MSRP ~ `Engine HP` + `Engine Cylinders` + `Vehicle Size`, data = carDB)
# Show model summary
summary(model3)
##
## Call:
## lm(formula = MSRP ~ `Engine HP` + `Engine Cylinders` + `Vehicle Size`,
## data = carDB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -137202 -16943 461 13313 1694961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64222.331 1549.615 -41.44 <2e-16 ***
## `Engine HP` 332.619 6.759 49.21 <2e-16 ***
## `Engine Cylinders` 6388.191 431.169 14.82 <2e-16 ***
## `Vehicle Size`Large -32935.232 1305.613 -25.23 <2e-16 ***
## `Vehicle Size`Midsize -16737.506 1028.196 -16.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45560 on 10784 degrees of freedom
## Multiple R-squared: 0.4581, Adjusted R-squared: 0.4579
## F-statistic: 2279 on 4 and 10784 DF, p-value: < 2.2e-16
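With a categorical predictor, each listed `Vehicle Size` coefficient is an offset relative to the reference level, which is the size category that does not appear in the table. The sketch below works one prediction through by hand using the coefficients as printed; `coefs` and `predict_msrp3` are illustrative names, and the 300 HP, 6-cylinder car is a hypothetical input.

```r
# Coefficients copied from summary(model3) above
coefs <- c(intercept = -64222.331, hp = 332.619, cylinders = 6388.191,
           size_large = -32935.232, size_midsize = -16737.506)

# Predicted MSRP; "reference" stands for the size level absent from the table
predict_msrp3 <- function(hp, cylinders,
                          size = c("reference", "Large", "Midsize")) {
  size <- match.arg(size)
  size_effect <- switch(size,
                        reference = 0,
                        Large     = coefs[["size_large"]],
                        Midsize   = coefs[["size_midsize"]])
  coefs[["intercept"]] + coefs[["hp"]] * hp +
    coefs[["cylinders"]] * cylinders + size_effect
}

pred_midsize <- predict_msrp3(300, 6, "Midsize")
pred_large   <- predict_msrp3(300, 6, "Large")
print(c(Midsize = pred_midsize, Large = pred_large))
```

The negative size coefficients mean that, at the same horsepower and cylinder count, the model prices Large and Midsize cars below the reference size.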
This code fits a second-degree polynomial regression model to predict City MPG from Engine Horsepower. It then plots the actual data points with alpha transparency and overlays the fitted polynomial curve in red. The plot uses a minimalistic theme, and the title is styled for emphasis.
library(ggplot2)
# Fit polynomial regression model (degree 2)
model_poly <- lm(`city mpg` ~ poly(`Engine HP`, 2), data = carDB)
# Make predictions
carDB$predicted_city_mpg_poly <- predict(model_poly)
# Plot
ggplot(carDB, aes(x = `Engine HP`, y = `city mpg`)) +
geom_point(alpha = 0.4, color = "#00BFC4") +
geom_line(aes(y = predicted_city_mpg_poly), color = "#F8766D", linewidth = 1.5) +
labs(title = "Polynomial Regression: Engine HP vs City MPG",
subtitle = "2nd Degree Polynomial Fit",
x = "Engine Horsepower",
y = "City MPG") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12))
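Whether the quadratic term is worth keeping can be checked by comparing the linear and polynomial fits as nested models. The sketch below does this on simulated data (the data frame `sim` and its coefficients are invented for illustration); on the real data the same comparison would presumably use `model_mpg` and `model_poly`.

```r
set.seed(42)

# Simulated stand-in: mpg falls with hp, with some curvature
sim <- data.frame(hp = runif(200, 60, 600))
sim$mpg <- 35 - 0.06 * sim$hp + 0.00005 * sim$hp^2 + rnorm(200, sd = 2)

fit_lin  <- lm(mpg ~ hp, data = sim)
fit_poly <- lm(mpg ~ poly(hp, 2), data = sim)

# R-squared cannot decrease when a term is added to a nested model
r2 <- c(linear    = summary(fit_lin)$r.squared,
        quadratic = summary(fit_poly)$r.squared)
print(r2)

# F-test for the added quadratic term
print(anova(fit_lin, fit_poly))
```

Note that `poly()` uses orthogonal polynomials by default, so its coefficients are not the raw coefficients on `hp` and `hp^2`; passing `raw = TRUE` gives those, with identical fitted values.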