getwd()
[1] "C:/Users/Nathaniel/DATA110/Data110Project2Dataset"
In this project, I explore a data set of used cars. The data set is interesting because it includes many variables, such as make, model, year, price, transmission, and kilometer (mileage), among others. Some variables are stored as the wrong type, with numerical variables listed as categorical and categorical variables listed as numerical, but I can read them in as the correct types. I plan to use filter() along with other dplyr functions to clean the data and to build graphs from only the information I want. I will focus on the variables make, kilometer, year, and transmission. From these, I aim to determine which makes have the cheapest used cars, which makes have the most mileage, which make has the most used cars, and what the average selling price is per make. My source for this project is “Information is Web scrapped from Various Car Websites”.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
setwd("C:/Users/Nathaniel/DATA110/Data110Project2Dataset")
data <- read_csv("cardataset.csv")
Rows: 2059 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Make, Model, Fuel Type, Transmission, Location, Color, Owner, Sell...
dbl (8): Price, Year, Kilometer, Length, Width, Height, Seating Capacity, F...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
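The introduction mentions reading variables in as the correct types, and the message above suggests specifying column types. One way this could be done is with the col_types argument of read_csv(). The sketch below is illustrative only: it parses a small inline CSV with hypothetical rows rather than cardataset.csv, and the choice of which columns become integers or factors is an assumption, not something the project itself does.

```r
library(readr)

# Hypothetical rows for illustration; this is not cardataset.csv
csv_text <- "Make,Price,Year,Transmission
Toyota,865000,2015,Manual
Hyundai,591000,2017,Manual
Hyundai,589000,2018,Automatic"

# I() tells readr to treat the string as literal data instead of a file path
cars <- read_csv(
  I(csv_text),
  col_types = cols(
    Make = col_character(),
    Price = col_double(),
    Year = col_integer(),         # assumed: whole-number model years
    Transmission = col_factor()   # assumed: a categorical variable
  )
)

# Shows the type that each column was read in as
str(cars)
```

Passing col_types also silences the column specification message, because readr no longer has to guess the types.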
#Filters for observations only with the makes Toyota, Honda, Nissan, Ford, and Hyundai
cleanedData1 <- data |> filter(Make %in% c("Toyota", "Honda", "Nissan", "Ford", "Hyundai")) |>
#Filters for cars only under 50,000 kilometers
filter(Kilometer < 50000) |>
#Filters for cars only 2015 and newer
filter(Year >= 2015) |>
#Filters for cars with prices only between 300,000 and 750,000
filter(Price < 750000 & Price > 300000)
#Shows the first couple of observations with the specific filters
head(cleanedData1)
# A tibble: 6 × 20
Make Model Price Year Kilometer `Fuel Type` Transmission Location Color
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 Hyundai Elite … 591000 2017 20281 Petrol Manual Mumbai Red
2 Hyundai Santro… 589000 2018 13772 Petrol Automatic Bangalo… White
3 Hyundai Elite … 551000 2015 47752 Petrol Manual Mumbai White
4 Ford Ecospo… 535000 2015 28000 Petrol Manual Mumbai Silv…
5 Hyundai Xcent … 490000 2015 23103 Petrol Manual Udupi White
6 Hyundai Grand … 715000 2020 17000 Petrol Automatic Bangalo… Silv…
# ℹ 11 more variables: Owner <chr>, `Seller Type` <chr>, Engine <chr>,
# `Max Power` <chr>, `Max Torque` <chr>, Drivetrain <chr>, Length <dbl>,
# Width <dbl>, Height <dbl>, `Seating Capacity` <dbl>,
# `Fuel Tank Capacity` <dbl>
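The per-make questions raised in the introduction (cheapest makes, mileage per make, number of cars per make, and average price per make) could be answered with a grouped summary. The following is a minimal sketch using a hypothetical toy table rather than cleanedData1; the column names match the data set, but the values and the names toy, by_make, n_cars, mean_price, and mean_km are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are not taken from cardataset.csv
toy <- tibble(
  Make      = c("Toyota", "Toyota", "Honda", "Honda", "Ford"),
  Price     = c(600000, 500000, 450000, 550000, 400000),
  Kilometer = c(20000, 40000, 30000, 10000, 45000)
)

by_make <- toy |>
  # One group per make
  group_by(Make) |>
  # Number of cars, average price, and average mileage within each make
  summarise(
    n_cars     = n(),
    mean_price = mean(Price),
    mean_km    = mean(Kilometer)
  ) |>
  # Cheapest makes first
  arrange(mean_price)

by_make
```

Replacing toy with cleanedData1 would apply the same summary to the filtered cars.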
graph1 <- cleanedData1 |>
#Makes the price smaller and easier to read
mutate(Price = Price / 10^3) |>
#Creates graph with x axis, y axis, and legend
ggplot(aes(x = Year, y = Price, color = Make)) +
#Changes the color palette to "Dark2"
scale_color_brewer(palette = "Dark2") +
#Changes the names of the axis and legend
labs(x = "Year of the Car",
y = "Price of the Car (By Thousands)",
color = "Make of the Car") +
#Changes the theme to minimal
theme_minimal(base_size = 12) +
geom_point() +
geom_line() +
#Creates the linear regression line and changes the color to yellow
geom_smooth(method = "lm", color = "yellow") +
#Adds title to the graph
ggtitle("Significance of Car's Price Based on Car's Year")
#Calls the graph to be displayed
graph1
`geom_smooth()` using formula = 'y ~ x'
cor(cleanedData1$Year, cleanedData1$Price)
[1] 0.3739758
fit1 <- lm(Year ~ Price, data = cleanedData1)
summary(fit1)
Call:
lm(formula = Year ~ Price, data = cleanedData1)
Residuals:
Min 1Q Median 3Q Max
-3.2040 -1.3920 -0.2138 1.4792 4.8923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.014e+03 1.014e+00 1986.020 < 2e-16 ***
Price 6.851e-06 1.771e-06 3.868 0.000205 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.775 on 92 degrees of freedom
Multiple R-squared: 0.1399, Adjusted R-squared: 0.1305
F-statistic: 14.96 on 1 and 92 DF, p-value: 0.0002048
Year = 0.000006851(Price) + 2014
Based on the p-value, the scatter plot, and the linear regression line, the year of the vehicle is positively related to its price: newer cars tend to sell for more. The p-value for the slope is very small (0.0002), so the relationship is statistically significant. A small p-value does not by itself mean the correlation is strong, though. The correlation is about 0.37 and the R-squared is about 0.14, so the model explains only about 14% of the variation. On the plot, the points follow the upward trend of the regression line, but with a good deal of scatter around it.
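The fit above uses Year ~ Price, so it predicts year from price. If the goal is to describe how price changes with year, the formula could be reversed. The sketch below uses hypothetical toy data (not cardataset.csv) to show two facts about simple linear regression: the R-squared equals the squared correlation whichever variable is the response, while the slope depends on the direction. This matches the output above, where 0.3739758 squared is about 0.1399.

```r
# Hypothetical toy data; the values are not taken from cardataset.csv
toy <- data.frame(
  Year  = c(2015, 2016, 2017, 2018, 2019, 2020),
  Price = c(420, 380, 510, 470, 600, 560)  # in thousands
)

# Correlation between year and price
r <- cor(toy$Year, toy$Price)

# Price as the response: the slope is the change in price per year
fit_price <- lm(Price ~ Year, data = toy)

# Year as the response: the slope is the change in year per unit of price
fit_year <- lm(Year ~ Price, data = toy)

r2_price    <- summary(fit_price)$r.squared
r2_year     <- summary(fit_year)$r.squared
slope_price <- coef(fit_price)[["Year"]]
slope_year  <- coef(fit_year)[["Price"]]

# Both R-squared values equal r^2, but the two slopes differ
c(r_squared = r^2, r2_price = r2_price, r2_year = r2_year)
c(slope_price = slope_price, slope_year = slope_year)
```

With the real data, lm(Price ~ Year, data = cleanedData1) would give a slope in price units per model year, which is easier to interpret for the question asked here.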
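#Filters for manual cars only
cleanedData2 <- data |> filter(Transmission == "Manual") |>
#Filters for rear wheel drive cars only
filter(Drivetrain == "RWD") |>
#Filters for cars only under 100,000 kilometers
filter(Kilometer < 100000) |>
#Filters for cars only 2015 and newer
filter(Year >= 2015) |>
#Filters for cars only in white, black, red, blue, or silver
filter(Color %in% c("White", "Black", "Red", "Blue", "Silver"))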
head(cleanedData2)
# A tibble: 6 × 20
Make Model Price Year Kilometer `Fuel Type` Transmission Location Color
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 Toyota Innov… 8.65e5 2015 43634 Diesel Manual Delhi Silv…
2 Mahindra Scorp… 1.16e6 2018 32000 Diesel Manual Patna White
3 Mahindra Scorp… 1.22e6 2019 46000 Diesel Manual Patna White
4 Mahindra Scorp… 1.48e6 2020 78000 Diesel Manual Yamunan… White
5 Tata Hexa … 1 e6 2018 38000 Diesel Manual Mumbai White
6 Toyota Innov… 1.96e6 2020 26000 Petrol Manual Delhi Silv…
# ℹ 11 more variables: Owner <chr>, `Seller Type` <chr>, Engine <chr>,
# `Max Power` <chr>, `Max Torque` <chr>, Drivetrain <chr>, Length <dbl>,
# Width <dbl>, Height <dbl>, `Seating Capacity` <dbl>,
# `Fuel Tank Capacity` <dbl>
Graph2 <- cleanedData2 |>
#Creates bar graph with x axis and legend
ggplot() +
geom_bar(aes(x = Color, fill = Make, text = Make)) +
#Changes the names for the axis, legend, title, and caption
labs(x = "Color of the Car",
y = "Number of Cars",
fill = "Make of the Car",
title = "Bar Chart of How Many Cars Were Produced In Each Color and their Make",
caption = "Source: Information is Web scrapped from Various Car Websites") +
#Changes the theme to minimal
theme_minimal(base_size = 12) +
#Changes the color palette to "Pastel1"
scale_fill_brewer(palette = "Pastel1")
Warning in geom_bar(aes(x = Color, fill = Make, text = Make)): Ignoring unknown
aesthetics: text
#Calls the graph to be displayed
Graph2
library(plotly)
Warning: package 'plotly' was built under R version 4.3.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
ggplotly(Graph2)
In this project, I focused on a data set of used cars. I really liked this data set and had a lot of fun working with its variables. I used several of them, from categorical ones such as make, transmission, color, and drivetrain to numerical ones such as price, year, and kilometers. All of the data was web scraped from several car websites. I used dplyr functions such as filter() and mutate() to get exactly the data I wanted. By filtering on each variable, I narrowed the data to, for example, cars from 2015 and newer or cars from specific makes such as Toyota. I chose this data set because I love cars, I knew I wanted to do something with them, and I found the data interesting.
Used cars have been around for a long time. Buying one can be difficult at first, but it gets easier after some research. Through the MC library's RaptorSearch, I found a book that provides more information about buying a used car. “Buying a Used Car” covers the process of buying a used car and offers tips for anyone wondering what it is like. Source: Buying a Used Car. Federal Trade Commission, 2016.
I am very happy with my visualizations in this project. Almost everything I wanted worked, and it all went smoothly. I enjoyed adjusting the filters to make the cleaned data as specific as possible and to limit unnecessary observations. I experimented with the year and kilometer cutoffs in particular to get exactly what I was looking for. I am especially happy with my second visualization because it was exactly what I wanted. My plan was to filter for a car that I would be interested in buying if I were a customer looking for a used car on a website. I looked for rear-wheel-drive manual cars, just like my current car, and also filtered by specific colors. From this, I got a clean set of data that I used to create a beautiful graph with a pastel color palette. Finally, I used plotly to make my bar chart interactive.