Have you ever wanted to have a collection of something? It could be trading cards, board games, art pieces, or whatever you want them to be. In this project, we’ll suppose you collect something with a lot of data - cars!

Your car collection is almost complete - it’s just missing one thing. You think about it for a while, then it comes to you - you don’t have a car from 1985! But there are so many cars from 1985 available for you to collect! How will you know which 1985 car you will buy for your collection?

Luckily for you, we have a dataset that stores precisely that information. This dataset comes from the UCI Machine Learning Repository and is linked here. It has been adapted so that the variable names appear in the first row.

As you’ll notice in this project, though, this dataset isn’t ready for analysis yet. You will have to make some changes along the way to clean and tidy up the dataset. Do that, and you’ll have the ability to perform a great analysis!

Loading and Inspecting the Data

1.What good is an analysis if we don’t even have the tools to perform it? Some of the tools you will need for this analysis are the readr and dplyr packages from the tidyverse. Load the libraries at the top of the notebook.Rmd file so you can access the functions you will need later on.

Task 1

# load libraries

library(readr)
library(dplyr)

2.The last tool we need is the data itself! The file cars85.csv stores the data that comes from the UCI Machine Learning Repository. Load the file into a dataframe called cars to get started.

Task 2

# load data
cars <- read_csv('cars85.csv')
cars

3.It’s always a good idea to inspect the data you load into R. It helps you to know what you are working with. Inspect cars with head() and summary().

What kind of information do you have? What can you do with this information?

Each row in this dataframe is a single car, and each column stores some characteristic about that car. You want to get the best value for your collection, so you want to analyze as much as you can before buying. Doing so will help you make your choice easier!

Task 3

# inspect data
head(cars)
summary(cars)

Clean the Data

4.After inspecting the dataframe, you notice something odd about the normalized_losses column. Many of its entries are question marks (?). Because the expected losses are missing for so many cars, this variable is not worth analyzing.

Let’s remove this column from the dataset. Select all columns from cars but normalized_losses. Save your new dataframe to cars.
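One way to check how many placeholder entries a column holds is to count them. The sketch below uses a small made-up tibble (toy_cars and its values are hypothetical, not taken from cars85.csv) and assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Hypothetical toy data standing in for cars; the values are made up
toy_cars <- tibble(
  make = c("audi", "bmw", "honda", "mazda"),
  normalized_losses = c("164", "?", "?", "101"),
  highway_mpg = c(30, 29, 42, 31)
)

# Count the "?" placeholders in every column
question_marks <- toy_cars %>%
  summarize(across(everything(), ~ sum(.x == "?", na.rm = TRUE)))
question_marks
```

Running the same summarize() call on cars would show which columns contain question marks and how many.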

Task 4

# select columns
cars <- cars %>%
  select(-normalized_losses)
cars
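As an alternative to spotting the placeholders after loading, read_csv() can treat them as missing values through its na argument. The sketch below is only an illustration: the CSV text is made up, and passing literal data with I() assumes readr 2.0 or later.

```r
library(readr)

# Made-up CSV text for illustration; not the contents of cars85.csv
csv_text <- "make,normalized_losses,price\naudi,164,13950\nbmw,?,16430\n"

# Treat "?" (along with empty strings and "NA") as missing while reading
toy <- read_csv(I(csv_text), na = c("", "NA", "?"), show_col_types = FALSE)
toy
```

With the placeholder read as NA, the column is parsed as numeric rather than text, so functions such as summary() can report how many values are missing.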

5.Print the column names of cars. Are they clear and descriptive?

Task 5

# view columns
colnames(cars)
 [1] "symboling"         "make"              "fuel_type"         "aspiration"        "num_of_doors"     
 [6] "body_style"        "drive_wheels"      "engine_location"   "wheel_base"        "length"           
[11] "width"             "height"            "curb_weight"       "engine_type"       "num_of_cylinders" 
[16] "engine_size"       "fuel_system"       "bore"              "stroke"            "compression_ratio"
[21] "horsepower"        "peak_rpm"          "city_mpg"          "highway_mpg"       "price"            

6.The name symboling doesn’t tell you much at first glance. According to the UCI webpage, the symboling variable represents the car’s risk factor. The variable name doesn’t match that description, so you should rename it to something more meaningful.

Update that column name in cars as follows:

symboling -> risk_factor

Print the column names of cars to confirm that the column name has changed.

Task 6

# rename column
cars <- cars %>%
  rename(risk_factor = symboling)
colnames(cars)
 [1] "risk_factor"       "make"              "fuel_type"         "aspiration"        "num_of_doors"     
 [6] "body_style"        "drive_wheels"      "engine_location"   "wheel_base"        "length"           
[11] "width"             "height"            "curb_weight"       "engine_type"       "num_of_cylinders" 
[16] "engine_size"       "fuel_system"       "bore"              "stroke"            "compression_ratio"
[21] "horsepower"        "peak_rpm"          "city_mpg"          "highway_mpg"       "price"            

7.Your car collection means a lot to you. You want each car to be of value to you. What better way to do that than to buy a car that gets high miles per gallon on the highway? To determine this, first suppose that only cars exceeding 30 mpg on the highway interest you. You want to measure how far each car’s highway mpg is from your 30 mpg threshold. Create a variable called mpg_threshold with the value 30.

Task 7

# define threshold
mpg_threshold <- 30
mpg_threshold
[1] 30

8.Add a new column to cars called mpg_diff_from_threshold. This will measure how far each car’s highway mpg is from 30 mpg. View the updated cars dataframe.

Task 8

# add column
cars <- cars %>%
  mutate(mpg_diff_from_threshold = highway_mpg - mpg_threshold)
cars

Filter and Arrange Rows

9.You’ll add a car to your collection only if it gets more than 30 miles per gallon on the highways. Filter the rows of cars to find all the cars where mpg_diff_from_threshold is greater than 0. Save this new dataframe to mpg_exceeds_threshold and view it.

Task 9

# filter rows
mpg_exceeds_threshold <- cars %>%
  filter(mpg_diff_from_threshold > 0)
mpg_exceeds_threshold

10.Which cars have the highest miles per gallon on the highways? To find this, arrange the rows of mpg_exceeds_threshold by mpg_diff_from_threshold descending. Save this new dataframe as mpg_exceeds_threshold.

Task 10

# arrange rows
mpg_exceeds_threshold <- mpg_exceeds_threshold %>%
  arrange(desc(mpg_diff_from_threshold))
mpg_exceeds_threshold
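The steps from Tasks 7 through 10 can also be chained into a single pipeline. The sketch below runs on a small made-up tibble (toy_cars and its mpg values are hypothetical) so the effect of each step is easy to follow.

```r
library(dplyr)

# Hypothetical toy data; the mpg values are made up for illustration
toy_cars <- tibble(
  make = c("audi", "bmw", "honda", "mazda"),
  highway_mpg = c(30, 25, 42, 31)
)
mpg_threshold <- 30

toy_exceeds <- toy_cars %>%
  # distance of each car's highway mpg from the threshold
  mutate(mpg_diff_from_threshold = highway_mpg - mpg_threshold) %>%
  # keep only cars strictly above the threshold
  filter(mpg_diff_from_threshold > 0) %>%
  # largest difference first
  arrange(desc(mpg_diff_from_threshold))
toy_exceeds
```

Note that the car at exactly 30 mpg is dropped, because its difference is 0 and the filter keeps only values greater than 0.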

11.Now suppose you want your next car to have a large engine. Order the rows of cars by engine_size descending. Save the new data frame to ordered_by_engine_size. View ordered_by_engine_size.

Task 11

# order rows by engine size
ordered_by_engine_size <- cars %>%
  arrange(desc(engine_size))
ordered_by_engine_size

Specifying the Make of the Car

12.There are a lot of makes of cars to choose from, but you may prefer one over the others. Which make do you prefer the most? Create a variable called chosen_make that contains the make you want to check. The hint below provides the list of makes to choose from.

Hint: Pick one of the following makes and assign it to chosen_make:
alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo

Task 12

# choose make
chosen_make <- "mercedes-benz"
chosen_make
[1] "mercedes-benz"

13.Filter cars to only include rows where the make column is equal to chosen_make. Save the new dataframe to chosen_make_details.

Task 13

# filter rows by make
chosen_make_details <- cars %>%
  filter(make == chosen_make)
chosen_make_details

14.Order the rows of chosen_make_details by engine_size descending and save the new dataframe to chosen_make_details. View chosen_make_details.

How large are the engines in the cars from the make you chose? You can change the make stored in chosen_make to check out the engine sizes for other makes.

Task 14

# order filtered rows by engine size
chosen_make_details <- chosen_make_details %>%
  arrange(desc(engine_size))
chosen_make_details
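If you plan to compare several makes, the filter and ordering steps could be wrapped in a small helper. The function name engines_for_make, its arguments, and the toy data below are hypothetical and not part of the project; this is one possible sketch.

```r
library(dplyr)

# Hypothetical helper: rows for one make, largest engine first
engines_for_make <- function(data, target_make) {
  data %>%
    filter(make == target_make) %>%
    arrange(desc(engine_size))
}

# Made-up toy data for illustration
toy_cars <- tibble(
  make = c("bmw", "audi", "bmw", "audi"),
  engine_size = c(108, 131, 209, 109)
)

bmw_engines <- engines_for_make(toy_cars, "bmw")
bmw_engines
```

With the real data, a call such as engines_for_make(cars, chosen_make) would give the same rows as Tasks 13 and 14 combined.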

15.The process of buying a new car can cause a lot of stress - you don’t want to buy a car you won’t like! You’ve now seen how performing an analysis can ease some of the stress of deciding which car to buy. You also get to add a nice new car to your collection! Great work!

