Data110 Project 2

Author

Nathaniel Nguyen

Introduction

In this project, I explore a data set of used cars. The data set is interesting because it includes many variables, such as make, model, year, price, transmission, and kilometer (mileage). Some variables are stored as the wrong type: some numerical variables are listed as categorical, and some categorical variables are listed as numerical, so I check that they are read in as the correct variable types. I use the filter() function along with other dplyr functions to clean the data and create graphs from only the observations I want. I specifically use the variables make, kilometer, year, and transmission. From this, I aim to find which makes have the cheapest used cars, which makes have the most mileage, which make has the most used cars, and what the average selling price is per make. My source for this project is “Information is Web scrapped from Various Car Websites”.

Getting the Working Directory

getwd()
[1] "C:/Users/Nathaniel/DATA110/Data110Project2Dataset"

Loading packages, setting the working directory, and reading the CSV file

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
setwd("C:/Users/Nathaniel/DATA110/Data110Project2Dataset")
data <- read_csv("cardataset.csv")
Rows: 2059 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Make, Model, Fuel Type, Transmission, Location, Color, Owner, Sell...
dbl  (8): Price, Year, Kilometer, Length, Width, Height, Seating Capacity, F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
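
The column specification message above suggests declaring column types explicitly. The following is a minimal sketch of one way to do that with readr's `col_types` argument. The inline three-row sample and the chosen types are assumptions for illustration, not the project's actual file or settings.

```r
library(readr)

# Hypothetical three-row sample standing in for cardataset.csv
sample_csv <- I("Make,Price,Year,Transmission
Toyota,865000,2015,Manual
Hyundai,591000,2017,Manual
Ford,535000,2015,Automatic
")

# Declare column types explicitly instead of relying on readr's guesses
cars_typed <- read_csv(
  sample_csv,
  col_types = cols(
    Make = col_factor(),
    Price = col_double(),
    Year = col_integer(),
    Transmission = col_factor()
  )
)

# Inspect the resulting column classes
sapply(cars_typed, class)
```

Supplying `col_types` also silences the column specification message, because readr no longer has to guess.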

Creating a subset with specific filter data

#Filters for observations only with the makes Toyota, Honda, Nissan, Ford, and Hyundai
cleanedData1 <- data |> filter(Make %in% c("Toyota", "Honda", "Nissan", "Ford", "Hyundai")) |>
#Filters for cars only under 50,000 kilometers
  filter(Kilometer < 50000) |>
#Filters for cars only 2015 and newer
  filter(Year >= 2015) |>
#Filters for cars with prices only between 300,000 and 750,000
  filter(Price < 750000 & Price > 300000)
#Shows the first couple of observations with the specific filters
head(cleanedData1)
# A tibble: 6 × 20
  Make    Model    Price  Year Kilometer `Fuel Type` Transmission Location Color
  <chr>   <chr>    <dbl> <dbl>     <dbl> <chr>       <chr>        <chr>    <chr>
1 Hyundai Elite … 591000  2017     20281 Petrol      Manual       Mumbai   Red  
2 Hyundai Santro… 589000  2018     13772 Petrol      Automatic    Bangalo… White
3 Hyundai Elite … 551000  2015     47752 Petrol      Manual       Mumbai   White
4 Ford    Ecospo… 535000  2015     28000 Petrol      Manual       Mumbai   Silv…
5 Hyundai Xcent … 490000  2015     23103 Petrol      Manual       Udupi    White
6 Hyundai Grand … 715000  2020     17000 Petrol      Automatic    Bangalo… Silv…
# ℹ 11 more variables: Owner <chr>, `Seller Type` <chr>, Engine <chr>,
#   `Max Power` <chr>, `Max Torque` <chr>, Drivetrain <chr>, Length <dbl>,
#   Width <dbl>, Height <dbl>, `Seating Capacity` <dbl>,
#   `Fuel Tank Capacity` <dbl>
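
The subset above is built from several filter() calls in a row. As a rough sketch on made-up data (the `toy_cars` tibble and `wanted_makes` vector are hypothetical names, not part of the project), chained filter() calls keep the same rows as a single call with comma-separated conditions, and `%in%` can stand in for repeated `==` comparisons joined by `|`.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy_cars <- tibble(
  Make = c("Toyota", "Honda", "BMW", "Ford", "Hyundai"),
  Kilometer = c(20000, 60000, 10000, 30000, 45000),
  Year = c(2016, 2018, 2020, 2014, 2019),
  Price = c(500000, 600000, 700000, 400000, 800000)
)

wanted_makes <- c("Toyota", "Honda", "Nissan", "Ford", "Hyundai")

# Chained filter() calls, mirroring the pipeline above
chained <- toy_cars |>
  filter(Make %in% wanted_makes) |>
  filter(Kilometer < 50000) |>
  filter(Year >= 2015) |>
  filter(Price < 750000 & Price > 300000)

# One filter() call; comma-separated conditions are combined with AND
combined <- toy_cars |>
  filter(
    Make %in% wanted_makes,
    Kilometer < 50000,
    Year >= 2015,
    Price > 300000, Price < 750000
  )

# Both approaches keep the same rows
all.equal(chained, combined)
```

Either style works; the single call is shorter, while the chained form leaves room for a comment above each condition.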
graph1 <- cleanedData1 |>
#Makes the price smaller and easier to read
  mutate(Price = Price / 10^3) |>
#Creates graph with x axis, y axis, and legend
  ggplot(aes(x = Year, y = Price, color = Make)) +
#Changes the color palette to "Dark2"
  scale_color_brewer(palette = "Dark2") +
#Changes the names of the axis and legend
  labs(x = "Year of the Car",
       y = "Price of the Car (By Thousands)",
       color = "Make of the Car") +
#Changes the theme to minimal
  theme_minimal(base_size = 12) +
  geom_point() +
  geom_line() +
#Creates the linear regression line and changes the color to yellow
  geom_smooth(method = "lm", color = "yellow") +
#Adds title to the graph
  ggtitle("Significance of Car's Price Based on Car's Year")
#Calls the graph to be displayed
graph1
`geom_smooth()` using formula = 'y ~ x'

Correlation between the car's year and its price

cor(cleanedData1$Year, cleanedData1$Price)
[1] 0.3739758

Gets the Linear Regression Equation

fit1 <- lm(Year ~ Price, data = cleanedData1)
summary(fit1)

Call:
lm(formula = Year ~ Price, data = cleanedData1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2040 -1.3920 -0.2138  1.4792  4.8923 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 2.014e+03  1.014e+00 1986.020  < 2e-16 ***
Price       6.851e-06  1.771e-06    3.868 0.000205 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.775 on 92 degrees of freedom
Multiple R-squared:  0.1399,    Adjusted R-squared:  0.1305 
F-statistic: 14.96 on 1 and 92 DF,  p-value: 0.0002048

Linear Regression Equation

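Year = 0.00000685(Price) + 2014 (taken from the coefficients above, where Year is the response and Price is in its original units)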
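
In the lm() call above, the formula is Year ~ Price, so the model predicts year from price. If the goal is to predict price from year, the formula would be written the other way around. The sketch below shows that form on invented numbers (the `toy_cars` data frame and its prices are hypothetical), so its coefficients say nothing about the real data set.

```r
# Hypothetical toy data; prices are invented and in thousands
toy_cars <- data.frame(
  Year = c(2015, 2016, 2017, 2018, 2019, 2020),
  Price = c(310, 345, 405, 450, 495, 555)
)

# Price is the response (left of ~), Year is the predictor
fit_price <- lm(Price ~ Year, data = toy_cars)

# Intercept and slope of Price = slope * Year + intercept
coef(fit_price)

# Predicted price for a 2018 car under this toy model
predict(fit_price, newdata = data.frame(Year = 2018))
```

The slope from `coef()` is then read as the change in price for each additional model year.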

Diagnosis of Correlation

Based on the p-value, the scatterplot, and the regression output, the year of the vehicle has a statistically significant positive relationship with its price. The small p-value (0.0002) means a relationship like this is unlikely to be due to chance alone, but it does not measure how strong the correlation is. The correlation itself is 0.37, which is weak to moderate, and the R-squared of 0.14 means the model explains only about 14% of the variation. On the plot, the points trend upward along with the regression line, but with an R-squared this low there is considerable scatter around it.
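
The correlation from cor() and the Multiple R-squared and t value from summary() are linked: in a regression with one predictor, R-squared is the squared correlation, and the slope's t statistic can be recovered from the correlation and the sample size. A short check using only the numbers printed in the output above (the sample size of 94 is inferred from the 92 residual degrees of freedom):

```r
# Values copied from the cor() and summary() output above
r <- 0.3739758
n <- 94  # 92 residual degrees of freedom + 2 estimated coefficients

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)

# The slope's t statistic can be recovered from r and n
t_stat <- r * sqrt((n - 2) / (1 - r^2))
round(t_stat, 3)
```

Both results should agree with the reported Multiple R-squared (0.1399) and t value (3.868), which is why a significant p-value and a modest correlation can appear together.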

Filtering for manual, rear-wheel-drive cars under 100,000 kilometers, 2015 and newer, in white, black, red, blue, or silver

cleanedData2 <- data |> filter(Transmission == "Manual") |>
  filter(Drivetrain == "RWD") |>
  filter(Kilometer < 100000) |>
  filter(Year >= 2015) |>
  filter(Color %in% c("White", "Black", "Red", "Blue", "Silver"))
head(cleanedData2)
# A tibble: 6 × 20
  Make     Model   Price  Year Kilometer `Fuel Type` Transmission Location Color
  <chr>    <chr>   <dbl> <dbl>     <dbl> <chr>       <chr>        <chr>    <chr>
1 Toyota   Innov… 8.65e5  2015     43634 Diesel      Manual       Delhi    Silv…
2 Mahindra Scorp… 1.16e6  2018     32000 Diesel      Manual       Patna    White
3 Mahindra Scorp… 1.22e6  2019     46000 Diesel      Manual       Patna    White
4 Mahindra Scorp… 1.48e6  2020     78000 Diesel      Manual       Yamunan… White
5 Tata     Hexa … 1   e6  2018     38000 Diesel      Manual       Mumbai   White
6 Toyota   Innov… 1.96e6  2020     26000 Petrol      Manual       Delhi    Silv…
# ℹ 11 more variables: Owner <chr>, `Seller Type` <chr>, Engine <chr>,
#   `Max Power` <chr>, `Max Torque` <chr>, Drivetrain <chr>, Length <dbl>,
#   Width <dbl>, Height <dbl>, `Seating Capacity` <dbl>,
#   `Fuel Tank Capacity` <dbl>
Graph2 <- cleanedData2 |>
#Creates bar graph with x axis and legend
  ggplot() +
  geom_bar(aes(x = Color, fill = Make, text = Make)) +
#Changes the names for the axis, legend, title, and caption
  labs(x = "Color of the Car",
       y = "Number of Cars",
       fill = "Make of the Car",
       title = "Bar Chart of How Many Cars Were Listed In Each Color and their Make",
       caption = "Source: Information is Web scrapped from Various Car Websites") +
#Changes the theme to minimal
  theme_minimal(base_size = 12) +
#Changes the color palette to "Pastel1"
  scale_fill_brewer(palette = "Pastel1")
Warning in geom_bar(aes(x = Color, fill = Make, text = Make)): Ignoring unknown
aesthetics: text
#Calls the graph to be displayed
Graph2
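
The bar chart relies on geom_bar() counting rows for each color and make. As a sketch of the same tally in table form, dplyr's count() can be used; the `toy_cars` tibble below is made up for illustration and does not reproduce the project's counts.

```r
library(dplyr)

# Hypothetical toy listings; values are made up for illustration
toy_cars <- tibble(
  Make = c("Toyota", "Mahindra", "Mahindra", "Tata", "Toyota"),
  Color = c("Silver", "White", "White", "White", "Silver")
)

# One row per Color/Make pair; n is the height of that bar segment
segment_counts <- toy_cars |>
  count(Color, Make, name = "n")

# Total bar height per color
bar_totals <- toy_cars |>
  count(Color, name = "total")

segment_counts
bar_totals
```

A table like this is a quick way to confirm that the stacked segments in the chart add up to the expected number of cars per color.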

Adds interactivity

library(plotly)
Warning: package 'plotly' was built under R version 4.3.3

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
ggplotly(Graph2)

End of Project Essay

In this project, I focused on a data set of used cars. I really liked this data set and had a lot of fun exploring its variables. I used categorical variables such as make, transmission, color, and drivetrain, and numerical variables such as price, year, and kilometers. All of this data was web scraped from several car websites. I used dplyr functions such as filter() and mutate() to get exactly the data I wanted. By filtering each variable on specific conditions, I kept only cars from 2015 and newer, or only cars from selected makes such as Toyota. I chose this data set because I love cars, I knew I wanted to do something with them, and I found the data interesting.

Used cars have been around for a long time. Buying one can be difficult at first, but it gets easier after some research. Through RaptorSearch in the MC library, I found a book that provides more information about buying a used car. The book “Buying a Used Car” covers the process of buying a used car and offers tips for anyone who has wondered what it is like. Source: Buying a Used Car. Federal Trade Commission, 2016.

Overall, I am very happy with my visualizations in this project. Almost everything I wanted worked, and the project went smoothly. I enjoyed adjusting the filters to make the cleaned data as specific as possible and to limit unnecessary observations. I experimented with the year and kilometer cutoffs in particular to get exactly what I was looking for. I am especially happy with my second visualization because it was exactly what I wanted. My plan was to filter for a car that I would be interested in buying if I were a customer looking for a used car on a website. Just like my car now, I looked for rear-wheel-drive manual cars, and I also filtered by specific colors. From this, I got a clean subset of the data, which I used to create a bar chart with a pastel color palette. I then used plotly to make the bar chart interactive.