Midterm Project: Cars2015 Analysis

Author

Cody Paulay-Simmons

Published

July 27, 2025

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

This midterm project analyzes the cars2015.csv dataset, which covers 110 car models and 20 variables, including make, model, fuel efficiency, dimensions, acceleration times, and price ranges. Using exploratory data analysis (EDA) techniques, I will explore the patterns and relationships in the data.

cars2015 <- read_csv("cars2015.csv")

Rows: 110 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Make, Model, Type, Drive, Size
dbl (15): LowPrice, HighPrice, CityMPG, HwyMPG, FuelCap, Length, Width, Whee...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(cars2015)

Rows: 110
Columns: 20
$ Make      <chr> "Chevrolet", "Hyundai", "Kia", "Mitsubishi", "Nissan", "Dodg…
$ Model     <chr> "Spark", "Accent", "Rio", "Mirage", "Versa Note", "Dart", "C…
$ Type      <chr> "Hatchback", "Hatchback", "Sedan", "Hatchback", "Hatchback",…
$ LowPrice  <dbl> 12.270, 14.745, 13.990, 12.995, 14.180, 16.495, 16.170, 19.3…
$ HighPrice <dbl> 25.560, 17.495, 18.290, 15.395, 17.960, 23.795, 25.660, 24.6…
$ Drive     <chr> "FWD", "FWD", "FWD", "FWD", "FWD", "FWD", "FWD", "FWD", "FWD…
$ CityMPG   <dbl> 30, 28, 28, 37, 31, 23, 24, 24, 28, 30, 27, 27, 25, 27, 30, …
$ HwyMPG    <dbl> 39, 37, 36, 44, 40, 35, 36, 33, 38, 35, 33, 36, 36, 37, 39, …
$ FuelCap   <dbl> 9.0, 11.4, 11.3, 9.2, 10.9, 14.2, 15.6, 13.1, 12.4, 11.1, 11…
$ Length    <dbl> 145, 172, 172, 149, 164, 184, 181, 167, 179, 154, 156, 180, …
$ Width     <dbl> 63, 67, 68, 66, 67, 72, 71, 70, 72, 67, 68, 69, 70, 68, 69, …
$ Wheelbase <dbl> 94, 101, 101, 97, 102, 106, 106, 103, 104, 99, 98, 104, 104,…
$ Height    <dbl> 61, 57, 57, 59, 61, 58, 58, 66, 58, 59, 58, 58, 57, 58, 59, …
$ UTurn     <dbl> 34, 37, 37, 32, 37, 38, 38, 37, 39, 34, 35, 38, 37, 36, 37, …
$ Weight    <dbl> 2345, 2550, 2575, 2085, 2470, 3260, 3140, 3330, 2990, 2385, …
$ Acc030    <dbl> 4.4, 3.7, 3.5, 4.4, 4.0, 3.4, 3.7, 3.9, 3.4, 3.9, 3.9, 3.7, …
$ Acc060    <dbl> 12.8, 10.3, 9.5, 12.1, 10.9, 9.3, 9.8, 9.5, 9.2, 10.8, 11.1,…
$ QtrMile   <dbl> 19.4, 17.8, 17.3, 19.0, 18.2, 17.2, 17.6, 17.4, 17.1, 18.3, …
$ PageNum   <dbl> 123, 148, 163, 188, 196, 128, 119, 131, 136, 216, 179, 205, …
$ Size      <chr> "Small", "Small", "Small", "Small", "Small", "Small", "Small…

Single Variable Exploration

1. What is the range and average of City and Highway MPG?

cars2015 |> 
  summarize(mean_city = mean(CityMPG),
            mean_highway = mean(HwyMPG),
            min_city = min(CityMPG),
            max_city = max(CityMPG),
            min_highway = min(HwyMPG),
            max_highway = max(HwyMPG))

# A tibble: 1 × 6
  mean_city mean_highway min_city max_city min_highway max_highway
      <dbl>        <dbl>    <dbl>    <dbl>       <dbl>       <dbl>
1      20.8         29.4       12       37          18          44

Data Analysis and Insights

City MPG averages just under 21 (20.8) and ranges from 12 to 37. Highway MPG averages just over 29 (29.4) and ranges from 18 to 44.

2. Distribution of Vehicle Types

cars2015 |> 
  count(Type) |> 
  ggplot(aes(x = reorder(Type,
                         n), 
             y = n)) +
  geom_bar(stat = "identity",
           fill = "skyblue") + 
  labs(title = "Count of Cars by Type", x = "Type", y = "Count")

Data Analysis and Insights

Sedans are by far the most common vehicle type. Wagons are the least common, although their count is not far below that of the other types.

I used ChatGPT to help me sort the bars by count instead of alphabetically. It suggested using reorder(Type, n) inside aes(), and it worked nicely. The first step, count(Type), counts how many cars belong to each of the six types and stores the result in n, which is then mapped to the y axis. Without reorder the bars appear in alphabetical order and are harder to compare, so this helped me understand how to make the plot cleaner. The stat = "identity" argument tells ggplot to use the n values as the bar heights instead of counting rows itself, which is what geom_bar() does by default.
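As a side note on the bar chart above, ggplot2 also provides geom_col(), which draws bars from precomputed values and so does not need stat = "identity". The sketch below is only an illustration: it uses the mpg dataset that ships with ggplot2 as a stand-in for cars2015 (an assumption, so the example runs without the CSV file), with class playing the role of Type.

```r
library(dplyr)
library(ggplot2)

# `mpg` ships with ggplot2 and stands in for cars2015 here (assumption);
# `class` plays the role of `Type`.
type_counts <- mpg |>
  count(class)

p <- type_counts |>
  ggplot(aes(x = reorder(class, n), y = n)) +
  geom_col(fill = "skyblue") +  # same bars as geom_bar(stat = "identity")
  labs(title = "Count of Cars by Class", x = "Class", y = "Count")

p
```

With the project data, the same pattern would presumably apply by swapping mpg for cars2015 and class for Type.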

3. Cars with Highest and Lowest Weight

cars2015 |> 
  select(Make, 
         Model, 
         Weight) |> 
  arrange(Weight) |> 
  slice(c(1, n()))  # first and last rows; n() avoids hard-coding the row count

# A tibble: 2 × 3
  Make       Model      Weight
  <chr>      <chr>       <dbl>
1 Mitsubishi Mirage       2085
2 Ford       Expedition   6265

Data Analysis and Insights

Mitsubishi’s Mirage is the lightest at 2,085 pounds, while Ford’s Expedition is the heaviest at 6,265 pounds, just over three times the Mirage’s weight.

The slice function makes it quick to pick the first and last observations after arranging by weight, but I do wonder whether there is another function that would pick the lightest and heaviest cars automatically.
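On the question of picking the extremes automatically: dplyr has slice_min() and slice_max(), which return the rows with the smallest or largest values of a column without sorting first. The sketch below is a minimal illustration using the built-in mtcars dataset as a stand-in (an assumption, so it runs without cars2015.csv); there, wt is the weight in thousands of pounds.

```r
library(dplyr)

# mtcars is a built-in stand-in for cars2015 (assumption).
# Model names are stored as row names, so move them into a column first.
cars <- tibble::rownames_to_column(mtcars, "model") |>
  select(model, wt)

lightest <- cars |> slice_min(wt, n = 1)  # row with the smallest weight
heaviest <- cars |> slice_max(wt, n = 1)  # row with the largest weight

extremes <- bind_rows(lightest, heaviest)
extremes
```

With the project data, the analogous calls would be slice_min(Weight, n = 1) and slice_max(Weight, n = 1) on cars2015. Note that both functions keep ties by default, so they can return more than one row if several cars share the extreme weight.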

4. Average 0–60 Acceleration Time

cars2015 |> 
  summarize(mean(Acc060))

# A tibble: 1 × 1
  `mean(Acc060)`
           <dbl>
1           7.94

Data Analysis and Insights

The average 0–60 mph acceleration time in this dataset is just under 8 seconds (7.94).

Variable Relationships

5. Weight vs. City MPG

cars2015 |> 
  ggplot(aes(x = Weight, 
             y = CityMPG)) +
  geom_point() +
  labs(title = "Relationship Between Weight and City MPG")

Data Analysis and Insights

Smaller, lighter cars tend to have better fuel economy, while heavier cars usually have lower MPG. It is a tradeoff between size and efficiency.
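One way to put a number on a downward trend like this is the correlation coefficient. The sketch below is a hedged illustration on the built-in mtcars dataset rather than cars2015 (an assumption, so it runs without the CSV file); wt is weight in thousands of pounds and mpg is overall miles per gallon, so the value it prints is not the value for this project's data.

```r
library(dplyr)

# mtcars is a built-in stand-in for cars2015 (assumption).
weight_mpg <- mtcars |>
  summarize(correlation = cor(wt, mpg))

# A value close to -1 means heavier cars strongly tend to have lower MPG.
weight_mpg
```

With the project data, the analogous call would be summarize(correlation = cor(Weight, CityMPG)) on cars2015.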

6. High Price vs. Acceleration Time

cars2015 |> 
  ggplot(aes(x = Acc060, 
             y = HighPrice)) +
  geom_point(alpha = 0.6,
             color = "maroon") +
  geom_smooth(method = "lm", 
              se = FALSE, 
              color = "gold") +
  labs(title = "Price vs. 0–60 MPH Acceleration Time",
       x = "0–60 MPH Acceleration Time (seconds)",
       y = "High Price") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Data Analysis and Insights

Cars that accelerate faster usually cost more: the more powerful the car, the pricier it gets.

I dressed up this plot compared with the previous one. I used geom_smooth(method = "lm") to add a linear trend line and set se = FALSE to hide the standard-error band, so the relationship is easier to see. I also added axis labels, changed the point color and transparency, and applied the minimal theme.
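The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so the same line can be fitted directly with lm() to read off its slope. The sketch below is an illustration on the built-in mtcars dataset (an assumption, so it runs without cars2015.csv); the numbers it produces describe mtcars, not this project's data.

```r
# mtcars is a built-in stand-in for cars2015 (assumption).
# With the project data, the analogous call would be
# lm(HighPrice ~ Acc060, data = cars2015).
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted line
coef(fit)

# R-squared: share of the variation in mpg explained by the line
summary(fit)$r.squared
```

A negative slope here means the response falls as the predictor rises, which is the same direction as the trend line in the price plot.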

Conclusion

There are clear trends linking weight to efficiency and price to performance. These plots suggest how manufacturers position vehicles in the market based on tradeoffs between cost, performance, and usage. This was a fun project overall because I am familiar with many of these cars, and it was interesting to see more of their statistics firsthand. The data is 10 years old, but the used car market is on the rise, so it is still good to know!