Introduction

The purpose of this assignment is to conduct an extensive analysis of the CarDB dataset using R.
We aim to explore relationships among features like engine specifications, fuel efficiency, and car pricing using data manipulation, statistical summaries, and visualizations.

This document includes:

- Data import and cleaning
- Statistical analysis of key variables
- Answering insightful questions using dplyr
- Visualizing data through various plots

Loading Required Libraries

We use the readr package for importing data, dplyr for data manipulation, ggplot2 for visualization, GGally for pair plots, and corrplot for the correlation plot.

require(readr)
## Loading required package: readr
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
require(corrplot)
## Loading required package: corrplot
## corrplot 0.95 loaded

Importing the Dataset

carDB <- read_csv("C:\\Users\\newsa\\Desktop\\data.csv")
## Rows: 11914 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Make, Model, Engine Fuel Type, Transmission Type, Driven_Wheels, Ma...
## dbl (8): Year, Engine HP, Engine Cylinders, Number of Doors, highway MPG, ci...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Initial Data Inspection

summary(carDB)
##      Make              Model                Year      Engine Fuel Type  
##  Length:11914       Length:11914       Min.   :1990   Length:11914      
##  Class :character   Class :character   1st Qu.:2007   Class :character  
##  Mode  :character   Mode  :character   Median :2015   Mode  :character  
##                                        Mean   :2010                     
##                                        3rd Qu.:2016                     
##                                        Max.   :2017                     
##                                                                         
##    Engine HP      Engine Cylinders Transmission Type  Driven_Wheels     
##  Min.   :  55.0   Min.   : 0.000   Length:11914       Length:11914      
##  1st Qu.: 170.0   1st Qu.: 4.000   Class :character   Class :character  
##  Median : 227.0   Median : 6.000   Mode  :character   Mode  :character  
##  Mean   : 249.4   Mean   : 5.629                                        
##  3rd Qu.: 300.0   3rd Qu.: 6.000                                        
##  Max.   :1001.0   Max.   :16.000                                        
##  NA's   :69       NA's   :30                                            
##  Number of Doors Market Category    Vehicle Size       Vehicle Style     
##  Min.   :2.000   Length:11914       Length:11914       Length:11914      
##  1st Qu.:2.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.436                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :4.000                                                           
##  NA's   :6                                                               
##   highway MPG        city mpg        Popularity        MSRP        
##  Min.   : 12.00   Min.   :  7.00   Min.   :   2   Min.   :   2000  
##  1st Qu.: 22.00   1st Qu.: 16.00   1st Qu.: 549   1st Qu.:  21000  
##  Median : 26.00   Median : 18.00   Median :1385   Median :  29995  
##  Mean   : 26.64   Mean   : 19.73   Mean   :1555   Mean   :  40595  
##  3rd Qu.: 30.00   3rd Qu.: 22.00   3rd Qu.:2009   3rd Qu.:  42231  
##  Max.   :354.00   Max.   :137.00   Max.   :5657   Max.   :2065902  
## 
sum(is.na(carDB))
## [1] 108
colSums(is.na(carDB))
##              Make             Model              Year  Engine Fuel Type 
##                 0                 0                 0                 3 
##         Engine HP  Engine Cylinders Transmission Type     Driven_Wheels 
##                69                30                 0                 0 
##   Number of Doors   Market Category      Vehicle Size     Vehicle Style 
##                 6                 0                 0                 0 
##       highway MPG          city mpg        Popularity              MSRP 
##                 0                 0                 0                 0

Data Cleaning

Handling missing Engine Fuel Type

carDB$`Engine Fuel Type`[carDB$Model == "Verona" & is.na(carDB$`Engine Fuel Type`)] <- "regular unleaded"

Assign 0 cylinders to electric cars

carDB$`Engine Cylinders`[carDB$`Engine Fuel Type` == "electric"] <- 0

Remove entries with NA Engine Cylinders

carDB <- carDB[!is.na(carDB$`Engine Cylinders`), ]

Remove Tesla cars

carDB <- carDB[carDB$Make != "Tesla", ]

Fix door count for Ferrari FF

carDB$`Number of Doors`[carDB$Make == "Ferrari" & carDB$Model == "FF"] <- 2

Remove entries with missing Engine HP

carDB <- carDB[!is.na(carDB$`Engine HP`), ]

Remove obsolete MSRP entries

carDB <- carDB[carDB$MSRP != 2000, ]
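The logical-index filters above behave as intended because Make and MSRP contain no missing values (see the colSums output earlier). As a hedged aside, base subsetting with a condition that evaluates to NA keeps an all-NA row instead of dropping it. The sketch below uses a small made-up data frame (toy, make, and msrp are illustrative names, not part of CarDB) to contrast a bare condition with one wrapped in which().

```r
# Toy data frame; column names and values are made up for illustration
toy <- data.frame(make = c("Audi", NA, "Tesla"),
                  msrp = c(30000, 25000, 80000))

# A bare condition yields an all-NA row wherever the comparison is NA
kept_naive <- toy[toy$make != "Tesla", ]

# which() returns only the positions that are TRUE, so NA comparisons are dropped
kept_safe <- toy[which(toy$make != "Tesla"), ]

nrow(kept_naive)
nrow(kept_safe)
```

dplyr's filter() behaves like the which() version: rows where the condition is NA are dropped.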

Custom function to compute basic statistics

get_statistics <- function(column) {
  # Mode: the most frequent value; on ties, the value that appears first wins
  uniqv <- unique(column)
  mode_val <- uniqv[which.max(tabulate(match(column, uniqv)))]
  
  list(
    min = min(column),
    max = max(column),
    mean = mean(column),
    median = median(column),
    mode = mode_val
  )
}
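A quick check on a small vector can make the function's behavior easier to trust. The sketch below repeats get_statistics so that it runs on its own; the input vector is made up for illustration. Two properties are worth noting: when several values tie for most frequent, the one that appears first is reported as the mode, and any NA in the input makes min, max, mean, and median return NA, which is why the cleaning steps come first.

```r
# Same function as in the report, repeated so this sketch is self-contained
get_statistics <- function(column) {
  uniqv <- unique(column)
  mode_val <- uniqv[which.max(tabulate(match(column, uniqv)))]
  list(min = min(column), max = max(column), mean = mean(column),
       median = median(column), mode = mode_val)
}

x <- c(3, 1, 3, 2, 1)        # 3 and 1 both occur twice; 3 appears first
stats <- get_statistics(x)
stats$mode                   # 3: the first of the tied values
stats$median                 # 2

# A single missing value propagates into the summary statistics
stats_na <- get_statistics(c(x, NA))
stats_na$mean                # NA
```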

Numerical columns for analysis

numerical_columns <- c("Year", "Engine HP", "highway MPG", "city mpg", "Popularity", "MSRP")

Applying function to each numerical column

statistics_results <- lapply(carDB[numerical_columns], get_statistics)
statistics_results
## $Year
## $Year$min
## [1] 1990
## 
## $Year$max
## [1] 2017
## 
## $Year$mean
## [1] 2011.989
## 
## $Year$median
## [1] 2015
## 
## $Year$mode
## [1] 2015
## 
## 
## $`Engine HP`
## $`Engine HP`$min
## [1] 66
## 
## $`Engine HP`$max
## [1] 1001
## 
## $`Engine HP`$mean
## [1] 259.4756
## 
## $`Engine HP`$median
## [1] 240
## 
## $`Engine HP`$mode
## [1] 200
## 
## 
## $`highway MPG`
## $`highway MPG`$min
## [1] 12
## 
## $`highway MPG`$max
## [1] 354
## 
## $`highway MPG`$mean
## [1] 26.54148
## 
## $`highway MPG`$median
## [1] 26
## 
## $`highway MPG`$mode
## [1] 24
## 
## 
## $`city mpg`
## $`city mpg`$min
## [1] 7
## 
## $`city mpg`$max
## [1] 137
## 
## $`city mpg`$mean
## [1] 19.51784
## 
## $`city mpg`$median
## [1] 18
## 
## $`city mpg`$mode
## [1] 17
## 
## 
## $Popularity
## $Popularity$min
## [1] 2
## 
## $Popularity$max
## [1] 5657
## 
## $Popularity$mean
## [1] 1564.49
## 
## $Popularity$median
## [1] 1385
## 
## $Popularity$mode
## [1] 1385
## 
## 
## $MSRP
## $MSRP$min
## [1] 2002
## 
## $MSRP$max
## [1] 2065902
## 
## $MSRP$mean
## [1] 44274.69
## 
## $MSRP$median
## [1] 31520
## 
## $MSRP$mode
## [1] 25995

Cars with more than 6 Cylinders and Highway MPG > 25

carDB %>%
  filter(`Engine Cylinders` > 6, `highway MPG` > 25)
## # A tibble: 59 × 16
##    Make          Model    Year `Engine Fuel Type` `Engine HP` `Engine Cylinders`
##    <chr>         <chr>   <dbl> <chr>                    <dbl>              <dbl>
##  1 BMW           7 Seri…  2017 premium unleaded …         445                  8
##  2 Audi          A8       2016 premium unleaded …         450                  8
##  3 Audi          A8       2017 premium unleaded …         450                  8
##  4 Mercedes-Benz CLS-Cl…  2015 premium unleaded …         402                  8
##  5 Mercedes-Benz CLS-Cl…  2016 premium unleaded …         402                  8
##  6 Mercedes-Benz CLS-Cl…  2017 premium unleaded …         402                  8
##  7 Chevrolet     Corvet…  2014 premium unleaded …         455                  8
##  8 Chevrolet     Corvet…  2014 premium unleaded …         455                  8
##  9 Chevrolet     Corvet…  2014 premium unleaded …         455                  8
## 10 Chevrolet     Corvet…  2014 premium unleaded …         455                  8
## # ℹ 49 more rows
## # ℹ 10 more variables: `Transmission Type` <chr>, Driven_Wheels <chr>,
## #   `Number of Doors` <dbl>, `Market Category` <chr>, `Vehicle Size` <chr>,
## #   `Vehicle Style` <chr>, `highway MPG` <dbl>, `city mpg` <dbl>,
## #   Popularity <dbl>, MSRP <dbl>

Top 10 Most Expensive Cars

carDB %>%
  arrange(desc(MSRP)) %>%
  head(10)
## # A tibble: 10 × 16
##    Make        Model      Year `Engine Fuel Type` `Engine HP` `Engine Cylinders`
##    <chr>       <chr>     <dbl> <chr>                    <dbl>              <dbl>
##  1 Bugatti     Veyron 1…  2008 premium unleaded …        1001                 16
##  2 Bugatti     Veyron 1…  2009 premium unleaded …        1001                 16
##  3 Lamborghini Reventon   2008 premium unleaded …         650                 12
##  4 Bugatti     Veyron 1…  2008 premium unleaded …        1001                 16
##  5 Maybach     Landaulet  2012 premium unleaded …         620                 12
##  6 Maybach     Landaulet  2011 premium unleaded …         620                 12
##  7 Ferrari     Enzo       2003 premium unleaded …         660                 12
##  8 Lamborghini Aventador  2014 premium unleaded …         720                 12
##  9 Lamborghini Aventador  2015 premium unleaded …         720                 12
## 10 Lamborghini Aventador  2016 premium unleaded …         750                 12
## # ℹ 10 more variables: `Transmission Type` <chr>, Driven_Wheels <chr>,
## #   `Number of Doors` <dbl>, `Market Category` <chr>, `Vehicle Size` <chr>,
## #   `Vehicle Style` <chr>, `highway MPG` <dbl>, `city mpg` <dbl>,
## #   Popularity <dbl>, MSRP <dbl>

Average MSRP and Total Number of Models per Make

carDB %>%
  group_by(Make) %>%
  summarise(Average_MSRP = mean(MSRP),
            Total_Models = n())
## # A tibble: 47 × 3
##    Make         Average_MSRP Total_Models
##    <chr>               <dbl>        <int>
##  1 Acura              36106.          243
##  2 Alfa Romeo         61600             5
##  3 Aston Martin      197910.           93
##  4 Audi               60395.          289
##  5 BMW                61547.          334
##  6 Bentley           247169.           74
##  7 Bugatti          1757224.            3
##  8 Buick              32215.          170
##  9 Cadillac           57347.          389
## 10 Chevrolet          30212.         1041
## # ℹ 37 more rows
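Note that n() counts rows, so Total_Models here is the number of listings per make rather than the number of distinct model names; a model can appear in several rows, as the repeated Corvette and Veyron entries in the earlier outputs show. A minimal sketch of the difference, using a made-up tibble (the makes, models, and prices below are illustrative, not taken from CarDB):

```r
library(dplyr)

# Toy data; values are made up for illustration
toy <- tibble(
  Make  = c("Audi", "Audi", "Audi", "BMW"),
  Model = c("A8", "A8", "A4", "M3"),
  MSRP  = c(80000, 82000, 40000, 65000)
)

per_make <- toy %>%
  group_by(Make) %>%
  summarise(Average_MSRP    = mean(MSRP),
            Total_Rows      = n(),                # rows (listings) per make
            Distinct_Models = n_distinct(Model))  # unique model names per make

per_make
```

Which count is appropriate depends on whether the question is about listings or about model lines.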

Average Engine HP per Engine Fuel Type

carDB %>%
  group_by(`Engine Fuel Type`) %>%
  summarise(Average_HP = mean(`Engine HP`))
## # A tibble: 9 × 2
##   `Engine Fuel Type`                           Average_HP
##   <chr>                                             <dbl>
## 1 diesel                                             186.
## 2 electric                                           145.
## 3 flex-fuel (premium unleaded recommended/E85)       283.
## 4 flex-fuel (premium unleaded required/E85)          515.
## 5 flex-fuel (unleaded/E85)                           286.
## 6 natural gas                                        110 
## 7 premium unleaded (recommended)                     270.
## 8 premium unleaded (required)                        375.
## 9 regular unleaded                                   215.

Feature Engineering

We create a new feature: Combined Average MPG as the mean of city and highway MPG.

carDB <- carDB %>%
  mutate(`Combined MPG` = (`city mpg` + `highway MPG`) / 2)

head(carDB$`Combined MPG`)
## [1] 22.5 23.5 24.0 23.0 23.0 23.0

Visualization

We apply a range of visualization functions to CarDB to better understand the variables and how they relate to each other.

Bar Plot: Number of Cars by Vehicle Style

This code creates a bar chart showing the count of cars for each Vehicle Style, using a steel-blue fill and a minimalistic theme for a clean look.

ggplot(carDB, aes(x = `Vehicle Style`)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Cars by Vehicle Style",
       x = "Vehicle Style",
       y = "Count of Cars") +
  theme_minimal()

Bar Plot: Number of Cars by Transmission Type

This code plots a bar chart displaying the frequency of cars across different Transmission Types, with an orange color scheme and a minimal clean theme.

ggplot(carDB, aes(x = `Transmission Type`)) +
  geom_bar(fill = "orange") +
  labs(title = "Number of Cars by Transmission Type",
       x = "Transmission Type",
       y = "Count of Cars") +
  theme_minimal()

Scatter Plot: Engine HP vs MSRP

This code generates a scatter plot where each point represents a car, mapping Engine HP to the x-axis and MSRP to the y-axis, with red-colored points and a minimalistic theme for enhanced clarity.

ggplot(carDB, aes(x = `Engine HP`, y = MSRP)) +
  geom_point(color = "red", size = 2, shape = 20) +
  labs(title = "Engine Horsepower vs MSRP",
       x = "Engine Horsepower (HP)",
       y = "Price (MSRP)") +
  theme_minimal()

Scatter Plot: Engine HP vs Highway MPG

This code creates a scatter plot showing the relationship between Engine Horsepower and Highway MPG, using semi-transparent tomato-colored points and a minimalist theme for a clear and simple presentation.

ggplot(carDB, aes(x = `Engine HP`, y = `highway MPG`)) +
  geom_point(color = "tomato", size = 2, alpha = 0.7) +
  labs(title = "Engine Horsepower vs Highway Mileage",
       x = "Engine Horsepower (HP)",
       y = "Highway MPG") +
  theme_minimal()

Pair Plot: Exploring Numeric Relationships in carDB

This code generates a pair plot to visualize pairwise relationships between Engine HP, Engine Cylinders, Highway MPG, City MPG, and MSRP, allowing quick detection of correlations and patterns across multiple variables.

library(GGally)

carDB_numeric <- carDB[, c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP")]

ggpairs(carDB_numeric)

Colored Pair Plot: Numeric Relationships by Vehicle Size

This code creates a pair plot displaying relationships between key numeric variables, with points colored by Vehicle Size to highlight differences across categories while exploring correlations and distributions.

ggpairs(carDB,
        columns = c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP"),
        mapping = aes(color = `Vehicle Size`))

Histogram: Distribution of Car Prices

This code filters the dataset to include only cars with an MSRP less than or equal to $200,000 and creates a histogram that visualizes the distribution of car prices, using light blue bars with dark blue outlines and a minimalistic theme for clarity.

filteredCarDB <- carDB[carDB$MSRP <= 200000,]
ggplot(filteredCarDB, aes(x = MSRP)) +
  geom_histogram(bins = 30, fill = "lightblue", color = "darkblue") +
  labs(title = "Distribution of Car Prices",
       x = "Price (MSRP)",
       y = "Number of Cars") +
  theme_minimal()

Histogram: Distribution of Engine Horsepower

This code creates a histogram to visualize the distribution of engine horsepower across cars, with orange bars outlined in black, and 30 bins to show the frequency of different horsepower values, all presented in a minimalistic theme for clear readability.

ggplot(carDB, aes(x = `Engine HP`)) +
  geom_histogram(fill = "orange", color = "black", bins = 30) +
  labs(title = "Distribution of Engine Horsepower",
       x = "Horsepower (HP)",
       y = "Number of Cars") +
  theme_minimal()

ECDF: Cumulative Distribution of Car Prices

This code plots the empirical cumulative distribution function (ECDF) for car prices (MSRP), using a blue step-line to show the cumulative probability of car prices, with clear axis labels and a minimalistic theme for a clean and easy-to-read visualization.

ggplot(carDB, aes(x = MSRP)) +
  stat_ecdf(geom = "step", color = "blue", linewidth = 1) +
  labs(title = "Cumulative Distribution of Car Prices",
       x = "Price (MSRP)",
       y = "Cumulative Probability") +
  theme_minimal()

ECDF: Cumulative Distribution of Engine Horsepower

This code plots the empirical cumulative distribution function (ECDF) for Engine Horsepower (HP), using a dark green step-line to show the cumulative probability of horsepower values, with clearly labeled axes and a minimalistic theme for a clean and professional look.

ggplot(carDB, aes(x = `Engine HP`)) +
  stat_ecdf(geom = "step", color = "darkgreen", linewidth = 1) +
  labs(title = "Cumulative Distribution of Engine Horsepower",
       x = "Horsepower (HP)",
       y = "Cumulative Probability") +
  theme_minimal()
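An ECDF can also be queried numerically rather than read off the plot. As a minimal sketch, base R's ecdf() returns a function that gives the proportion of observations at or below a given value; the price vector below is made up for illustration and is not taken from CarDB.

```r
# Toy prices, made up for illustration
prices <- c(10000, 20000, 20000, 30000, 50000)

# ecdf() returns a step function: Fn(v) is the share of values <= v
Fn <- ecdf(prices)
share_at_or_below_20k <- Fn(20000)
share_at_or_below_20k

# The inverse question (which price sits at a given cumulative share)
# is answered by quantile()
median_price <- quantile(prices, probs = 0.5, names = FALSE)
median_price
```

Applied to the report's data, the same idea would be ecdf(carDB$MSRP) evaluated at any price threshold of interest.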

Boxplot: Car Prices by Vehicle Size

This code creates a boxplot to visualize the distribution of car prices (MSRP) for different Vehicle Sizes, with light blue boxes and dark blue outlines, providing insights into the spread and central tendency of prices across vehicle size categories. The plot uses a minimalistic theme for clarity.

library(ggplot2)

ggplot(carDB, aes(x = `Vehicle Size`, y = MSRP)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(title = "Car Prices by Vehicle Size",
       x = "Vehicle Size",
       y = "Price (MSRP)") +
  theme_minimal()

Boxplot: City Mileage by Transmission Type

This code generates a boxplot that visualizes city mileage (MPG) across different Transmission Types, using light green boxes with dark green outlines, to show the distribution, spread, and central tendency of city MPG for each transmission category. The plot is presented with a minimalistic theme for better clarity.

ggplot(carDB, aes(x = `Transmission Type`, y = `city mpg`)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  labs(title = "City Mileage by Transmission Type",
       x = "Transmission Type",
       y = "City MPG") +
  theme_minimal()

Boxplot: Car Prices by Engine Cylinders

This code creates a boxplot to visualize car prices (MSRP) across different Engine Cylinder categories, using purple boxes with black outlines. It also rotates the x-axis labels by 45° for better readability, which helps when there are many cylinder categories. The plot uses a minimalistic theme for clarity.

ggplot(carDB, aes(x = as.factor(`Engine Cylinders`), y = MSRP)) +
  geom_boxplot(fill = "purple", color = "black") +
  labs(title = "Car Prices by Engine Cylinders",
       x = "Engine Cylinders",
       y = "Price (MSRP)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

One-way ANOVA for MSRP based on Engine Cylinders

anova_result1 <- aov(MSRP ~ as.factor(`Engine Cylinders`), data = carDB)

Summary of ANOVA result

summary(anova_result1)
##                                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(`Engine Cylinders`)     8 2.595e+13 3.244e+12    2277 <2e-16 ***
## Residuals                     10780 1.535e+13 1.424e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
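The p-value shows that mean MSRP differs across cylinder groups, but not how much of the price variation the grouping accounts for. One common effect-size measure is eta squared, the between-group sum of squares divided by the total sum of squares. From the table above this is roughly 2.595e13 / (2.595e13 + 1.535e13), or about 0.63. The sketch below shows how that ratio could be computed from an aov fit; it uses the built-in mtcars data as a stand-in for carDB, so the variable names are not the report's.

```r
# mtcars ships with R; it stands in for carDB in this sketch
fit <- aov(mpg ~ factor(cyl), data = mtcars)

# "Sum Sq" column of the ANOVA table: factor row first, Residuals row last
ss <- summary(fit)[[1]][["Sum Sq"]]

# Eta squared: share of total variation explained by the grouping factor
eta_sq <- ss[1] / sum(ss)
eta_sq

# For a one-way design this matches R-squared of the equivalent linear model
r_sq <- summary(lm(mpg ~ factor(cyl), data = mtcars))$r.squared
```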

One-way ANOVA for City MPG based on Vehicle Size

anova_result2 <- aov(`city mpg` ~ `Vehicle Size`, data = carDB)

Summary of ANOVA result

summary(anova_result2)
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## `Vehicle Size`     2  62026   31013   650.4 <2e-16 ***
## Residuals      10786 514311      48                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation Matrix: Selected Numeric Variables

This code computes and prints the correlation matrix for selected numeric variables, including Engine HP, Engine Cylinders, highway MPG, city mpg, and MSRP, to examine the pairwise relationships and strength of correlation between these variables.

# Compute correlation matrix for selected numeric variables
cor_matrix <- cor(carDB[, c("Engine HP", "Engine Cylinders", "highway MPG", "city mpg", "MSRP")])

# Print correlation matrix
print(cor_matrix)
##                   Engine HP Engine Cylinders highway MPG   city mpg       MSRP
## Engine HP         1.0000000        0.7996033  -0.4462507 -0.4723677  0.6494039
## Engine Cylinders  0.7996033        1.0000000  -0.6267807 -0.6216346  0.5528937
## highway MPG      -0.4462507       -0.6267807   1.0000000  0.8527788 -0.2160119
## city mpg         -0.4723677       -0.6216346   0.8527788  1.0000000 -0.2270533
## MSRP              0.6494039        0.5528937  -0.2160119 -0.2270533  1.0000000

Correlation Plot: Visualizing Pairwise Relationships

This code visualizes the correlation matrix for selected numeric variables using a colored correlation plot, where the upper triangle is shown with colored tiles representing correlation strength. The black text labels are rotated for better readability, making it easier to assess the relationships between Engine HP, Engine Cylinders, highway MPG, city mpg, and MSRP.

corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)

Simple Linear Regression: MSRP ~ Engine HP

This code fits a simple linear regression model to predict MSRP based on Engine HP. The model summary is displayed to evaluate the relationship between Engine HP and MSRP, including key statistics like coefficients, p-values, and R-squared values to assess the model’s performance.

# Simple Linear Regression Model: MSRP ~ Engine HP
model1 <- lm(MSRP ~ `Engine HP`, data = carDB)

# Show model summary
summary(model1)
## 
## Call:
## lm(formula = MSRP ~ `Engine HP`, data = carDB)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -151849  -18861     230   13552 1746791 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -51896.514   1175.121  -44.16   <2e-16 ***
## `Engine HP`    370.637      4.179   88.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47050 on 10787 degrees of freedom
## Multiple R-squared:  0.4217, Adjusted R-squared:  0.4217 
## F-statistic:  7867 on 1 and 10787 DF,  p-value: < 2.2e-16
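For a regression with a single predictor, R-squared equals the squared Pearson correlation between predictor and response. This is consistent with the outputs in this report: the correlation between Engine HP and MSRP is 0.6494, and 0.6494^2 is approximately 0.4217, the Multiple R-squared reported above. A minimal sketch of the identity, using the built-in mtcars data as a stand-in for carDB:

```r
# mtcars ships with R; hp and mpg stand in for Engine HP and MSRP
fit <- lm(mpg ~ hp, data = mtcars)

r_squared   <- summary(fit)$r.squared
cor_squared <- cor(mtcars$hp, mtcars$mpg)^2

# The two quantities agree up to floating-point error
all.equal(r_squared, cor_squared)
```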

Simple Linear Regression: City MPG ~ Engine HP

This code fits a simple linear regression model to predict City MPG based on Engine HP. The model summary is displayed to examine the coefficients, p-values, and other key statistics that describe the relationship between Engine HP and City MPG.

# Simple Linear Regression Model: City MPG ~ Engine HP
model2 <- lm(`city mpg` ~ `Engine HP`, data = carDB)

# Show model summary
summary(model2)
## 
## Call:
## lm(formula = `city mpg` ~ `Engine HP`, data = carDB)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.444  -3.023  -0.431   2.180 114.633 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.7814182  0.1608967  172.67   <2e-16 ***
## `Engine HP` -0.0318472  0.0005722  -55.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.443 on 10787 degrees of freedom
## Multiple R-squared:  0.2231, Adjusted R-squared:  0.2231 
## F-statistic:  3098 on 1 and 10787 DF,  p-value: < 2.2e-16
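The coefficients can be read as a prediction rule. With the estimates above, a car with 200 HP has a predicted city MPG of about 27.78 - 0.0318 * 200, or roughly 21.4. As a sketch, predict() with a newdata data frame performs the same arithmetic; mtcars stands in for carDB here, and 200 HP is an arbitrary illustrative value.

```r
# mtcars ships with R; it stands in for carDB in this sketch
fit <- lm(mpg ~ hp, data = mtcars)

# newdata must contain a column named exactly like the predictor
new_car <- data.frame(hp = 200)
pred <- predict(fit, newdata = new_car)

# The same prediction computed by hand from the coefficients
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["hp"]] * 200

pred
by_hand
```

For the report's model, whose predictor name contains a space, the newdata column would need that exact name, for example data.frame(`Engine HP` = 200, check.names = FALSE).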

Regression: MSRP vs Engine HP

This code creates a scatter plot showing the relationship between Engine Horsepower and MSRP. It also adds a linear regression line (in red) to highlight the trend between the two variables, with a minimalistic theme for clarity and focus.

# Visualize regression line for MSRP vs. Engine HP
ggplot(carDB, aes(x = `Engine HP`, y = MSRP)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Regression: MSRP vs Engine HP",
       x = "Engine HP",
       y = "Price (MSRP)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Simple Linear Regression: Engine HP vs City MPG

This code fits a simple linear regression model to predict City MPG based on Engine Horsepower. The predicted values are then plotted as a red regression line, overlaying the scatter plot of actual data points. The plot is designed with a minimalistic theme, and the title and subtitle are styled for emphasis.

# Simple linear regression model: city mpg ~ Engine HP
model_mpg <- lm(`city mpg` ~ `Engine HP`, data = carDB)

# Predict values
carDB$predicted_city_mpg <- predict(model_mpg)

# Plot
ggplot(carDB, aes(x = `Engine HP`, y = `city mpg`)) +
  geom_point(alpha = 0.4, color = "#00BFC4") + # scatter points
  geom_line(aes(y = predicted_city_mpg), color = "#F8766D", linewidth = 1.5) + # regression line
  labs(title = "Regression Prediction Line: Engine HP vs City MPG",
       subtitle = "Red Line: Predicted City MPG from Engine HP",
       x = "Engine Horsepower",
       y = "City MPG") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))

Multiple Linear Regression: MSRP ~ Engine HP + Engine Cylinders + Vehicle Size

This code fits a multiple linear regression model to predict MSRP based on Engine HP, Engine Cylinders, and Vehicle Size. The model summary is displayed to examine the coefficients, statistical significance, and other key parameters that describe the relationship between the predictors and the target variable.

# Multiple Linear Regression Model: MSRP ~ Engine HP + Engine Cylinders + Vehicle Size
model3 <- lm(MSRP ~ `Engine HP` + `Engine Cylinders` + `Vehicle Size`, data = carDB)

# Show model summary
summary(model3)
## 
## Call:
## lm(formula = MSRP ~ `Engine HP` + `Engine Cylinders` + `Vehicle Size`, 
##     data = carDB)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137202  -16943     461   13313 1694961 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -64222.331   1549.615  -41.44   <2e-16 ***
## `Engine HP`              332.619      6.759   49.21   <2e-16 ***
## `Engine Cylinders`      6388.191    431.169   14.82   <2e-16 ***
## `Vehicle Size`Large   -32935.232   1305.613  -25.23   <2e-16 ***
## `Vehicle Size`Midsize -16737.506   1028.196  -16.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45560 on 10784 degrees of freedom
## Multiple R-squared:  0.4581, Adjusted R-squared:  0.4579 
## F-statistic:  2279 on 4 and 10784 DF,  p-value: < 2.2e-16
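The rows labelled `Vehicle Size`Large and `Vehicle Size`Midsize appear because lm() expands a categorical predictor into indicator columns, one per level except the baseline. The level that does not appear in the output serves as the baseline, so each coefficient is a difference from that baseline at the same Engine HP and Engine Cylinders. The sketch below shows the expansion on a made-up data frame; the level name "Compact" is an assumption for illustration, since the baseline level is not shown in the output above.

```r
# Toy data; "Compact" is an assumed name for the baseline level
toy <- data.frame(size = c("Compact", "Large", "Midsize", "Large"))

# model.matrix() shows the design matrix that lm() would build
X <- model.matrix(~ size, data = toy)
colnames(X)

# The first row (Compact) has zeros in both indicator columns
X[1, ]
```

A different baseline can be chosen by converting the column to a factor and calling relevel() before fitting.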

Polynomial Regression: Engine HP vs City MPG

This code fits a 2nd degree polynomial regression model to predict City MPG based on Engine Horsepower. It then plots the actual data points with alpha transparency and overlays the predicted polynomial regression curve in red. The plot is presented with a minimalistic theme, and the title is styled for emphasis.

library(ggplot2)

# Fit polynomial regression model (degree 2)
model_poly <- lm(`city mpg` ~ poly(`Engine HP`, 2), data = carDB)

# Make predictions
carDB$predicted_city_mpg_poly <- predict(model_poly)

# Plot
ggplot(carDB, aes(x = `Engine HP`, y = `city mpg`)) +
  geom_point(alpha = 0.4, color = "#00BFC4") +
  geom_line(aes(y = predicted_city_mpg_poly), color = "#F8766D", linewidth = 1.5) +
  labs(title = "Polynomial Regression: Engine HP vs City MPG",
       subtitle = "2nd Degree Polynomial Fit",
       x = "Engine Horsepower",
       y = "City MPG") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))
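One detail worth keeping in mind: poly() builds orthogonal polynomial terms by default, so the individual coefficients of such a model are not the coefficients of HP and HP^2 on their original scale, although the fitted curve is the same as with raw terms. The sketch below, using the built-in mtcars data as a stand-in for carDB, checks that both parameterizations give identical fitted values and uses anova() to compare the quadratic fit against a straight line.

```r
# mtcars ships with R; it stands in for carDB in this sketch
fit_lin  <- lm(mpg ~ hp, data = mtcars)
fit_orth <- lm(mpg ~ poly(hp, 2), data = mtcars)              # orthogonal terms
fit_raw  <- lm(mpg ~ poly(hp, 2, raw = TRUE), data = mtcars)  # hp and hp^2

# Same fitted curve, different coefficient scales
same_fit <- isTRUE(all.equal(fitted(fit_orth), fitted(fit_raw)))
same_fit

# F-test for whether the quadratic term improves on the linear fit
comparison <- anova(fit_lin, fit_orth)
comparison
```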