Have you ever wanted to collect something? It could be trading cards,
board games, art pieces, or anything else you like. In this project,
we’ll suppose you collect something that comes with a lot of data:
cars!
Your car collection is almost complete - it’s just missing one thing.
You think about it for a while, then it comes to you - you don’t have a
car from 1985! But there are so many 1985 cars available to collect.
How will you decide which one to buy for your collection?
Luckily for you, we have a dataset that stores precisely that
information. This dataset comes from the UCI Machine Learning Repository
and is linked here. This dataset has been adapted so that the variable
names are in the first row.
As you’ll notice in this project, though, this dataset isn’t ready
for analysis yet. You will have to make some changes along the way to
clean and tidy up the dataset. Do that, and you’ll have the ability to
perform a great analysis!
Loading and Inspecting the Data
1. What good is an analysis if we don’t have the tools to perform it?
Two of the tools you will need for this analysis are the readr and
dplyr tidyverse packages. Load the libraries at the top of the
notebook.Rmd file so you can access the functions you will need later
on.
Task 1
# load libraries
library(readr)
library(dplyr)
2. The last tool we need is the data itself! The file cars85.csv
stores the data that comes from the UCI Machine Learning Repository.
Load the file into a dataframe called cars to get started.
Task 2
# load data
cars <- read_csv('cars85.csv')
cars
3. It’s always a good idea to inspect the data you load into R. It
helps you to know what you are working with. Inspect cars with head()
and summary().
What kind of information do you have? What can you do with this
information?
Each row in this dataframe is a single car, and each column stores a
characteristic of that car. You want to get the best value for your
collection, so you want to analyze as much as you can before buying.
Doing so will make your choice easier!
Task 3
# inspect data
head(cars)
summary(cars)
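As a minimal sketch of what these inspection calls return, the snippet below runs them on a tiny stand-in data frame. The name toy_cars and its values are made up for illustration and are not taken from cars85.csv. glimpse() is an extra dplyr function that the task does not require.

```r
library(dplyr)

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  make = c("audi", "bmw", "honda"),
  highway_mpg = c(30, 29, 42)
)

# First rows of the data frame (here limited to 2)
first_rows <- head(toy_cars, 2)
first_rows

# Per-column summary: quartiles and mean for numeric columns
summary(toy_cars)

# Compact overview: one line per column with its type
glimpse(toy_cars)
```

Note that summary() needs the data frame as its argument; calling it with empty parentheses raises an error about a missing "object" argument.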
Clean the Data
4. After inspecting the dataframe, you notice something odd about the
normalized_losses column: many of its entries are question marks (?).
Because the expected losses are missing for so many cars, this variable
is not worth analyzing.
Let’s remove this column from the dataset. Select all columns from
cars except normalized_losses. Save your new dataframe to cars.
Task 4
# select columns
cars <- cars %>%
select(-normalized_losses)
cars
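Before dropping a column, it can help to check how many placeholder entries it really holds. The sketch below does that on a made-up data frame (toy_cars and its values are hypothetical, not the project data), assuming the column was read as text because of the question marks.

```r
library(dplyr)

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  make = c("audi", "bmw", "honda"),
  normalized_losses = c("164", "?", "?"),
  price = c(13950, 16430, 6479)
)

# Count how many entries are the "?" placeholder
n_unknown <- sum(toy_cars$normalized_losses == "?")
n_unknown

# Drop the column and keep everything else
toy_clean <- toy_cars %>%
  select(-normalized_losses)
colnames(toy_clean)
```

The minus sign inside select() means "every column except this one", which is shorter than listing all the columns to keep.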
5. Print the column names of cars. Are they clear and descriptive?
Task 5
# view columns
colnames(cars)
[1] "symboling" "make" "fuel_type" "aspiration" "num_of_doors"
[6] "body_style" "drive_wheels" "engine_location" "wheel_base" "length"
[11] "width" "height" "curb_weight" "engine_type" "num_of_cylinders"
[16] "engine_size" "fuel_system" "bore" "stroke" "compression_ratio"
[21] "horsepower" "peak_rpm" "city_mpg" "highway_mpg" "price"
6. At first glance, the name symboling doesn’t tell you much. According
to the UCI webpage, the symboling variable represents the car’s risk
factor, so the name doesn’t match the description. Rename the variable
so that it makes more sense.
Update that column name in cars as follows:
symboling -> risk_factor
Print the column names of cars to confirm the names of the columns have
changed.
Task 6
# rename column
cars <- cars %>%
rename(risk_factor = symboling)
colnames(cars)
[1] "risk_factor" "make" "fuel_type" "aspiration" "num_of_doors"
[6] "body_style" "drive_wheels" "engine_location" "wheel_base" "length"
[11] "width" "height" "curb_weight" "engine_type" "num_of_cylinders"
[16] "engine_size" "fuel_system" "bore" "stroke" "compression_ratio"
[21] "horsepower" "peak_rpm" "city_mpg" "highway_mpg" "price"
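The argument order of rename() is easy to get backwards, so here is a small sketch on made-up data (toy_cars and its values are hypothetical). The new name goes on the left of the equals sign and the existing name on the right.

```r
library(dplyr)

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  symboling = c(3, 1, 0),
  make = c("audi", "bmw", "honda")
)

# rename() takes new_name = old_name
toy_renamed <- toy_cars %>%
  rename(risk_factor = symboling)
colnames(toy_renamed)
```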
7. Your car collection means a lot to you, and you want each car to be
of value. What better way to ensure that than to buy a car with high
highway fuel economy? Suppose only cars exceeding 30 miles per gallon
(mpg) on the highway interest you. You want to measure how far each
car’s highway mpg is from your 30 mpg threshold.
Create a variable called mpg_threshold with the value 30.
Task 7
# define threshold
mpg_threshold <- 30
mpg_threshold
[1] 30
8. Add a new column to cars called mpg_diff_from_threshold, computed as
highway_mpg minus mpg_threshold. A positive value means the car exceeds
30 mpg on the highway; a negative value means it falls short. View the
updated cars dataframe.
Task 8
# add column
cars <- cars %>%
mutate(mpg_diff_from_threshold = highway_mpg - mpg_threshold)
cars
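To make the arithmetic concrete, the sketch below applies the same mutate() call to three made-up highway mpg values (toy_cars and its values are hypothetical, not the project data).

```r
library(dplyr)

mpg_threshold <- 30

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  make = c("audi", "bmw", "honda"),
  highway_mpg = c(25, 30, 41)
)

# Subtract the threshold from every row's highway mpg
toy_cars <- toy_cars %>%
  mutate(mpg_diff_from_threshold = highway_mpg - mpg_threshold)

toy_cars$mpg_diff_from_threshold
# [1] -5  0 11
```

A car at exactly 30 mpg gets a difference of 0, so a later filter for values greater than 0 will exclude it.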
Filter and Arrange Rows
9. You’ll add a car to your collection only if it gets more than 30
miles per gallon on the highways. Filter the rows of cars to find all
the cars where mpg_diff_from_threshold is greater than 0. Save this new
dataframe to mpg_exceeds_threshold and view it.
Task 9
# filter rows
mpg_exceeds_threshold <- cars %>%
filter(mpg_diff_from_threshold > 0)
mpg_exceeds_threshold
10. Which cars have the highest miles per gallon on the highways? To
find this, arrange the rows of mpg_exceeds_threshold by
mpg_diff_from_threshold descending. Save this new dataframe as
mpg_exceeds_threshold.
Task 10
# arrange rows
mpg_exceeds_threshold <- mpg_exceeds_threshold %>%
arrange(desc(mpg_diff_from_threshold))
mpg_exceeds_threshold
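The filter and arrange steps can also be chained in one pipeline. The sketch below shows this on made-up data (toy_cars, the car_a to car_d labels, and the values are hypothetical). The key point is that arrange() must receive the filtered rows, not the full dataframe, or the cars at or below the threshold come back.

```r
library(dplyr)

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  make = c("car_a", "car_b", "car_c", "car_d"),
  mpg_diff_from_threshold = c(-5, 11, 0, 4)
)

# Keep only rows above the threshold, then sort largest first
toy_exceeds <- toy_cars %>%
  filter(mpg_diff_from_threshold > 0) %>%
  arrange(desc(mpg_diff_from_threshold))

toy_exceeds$make
# [1] "car_b" "car_d"
```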
11. Now suppose you want your next car to have a large engine. Order
the rows of cars by engine_size descending. Save the new data frame to
ordered_by_engine_size. View ordered_by_engine_size.
Task 11
# order rows by engine size
ordered_by_engine_size <- cars %>%
arrange(desc(engine_size))
ordered_by_engine_size
Specifying the Make of the Car
12. There are many makes of cars to choose from, but you may prefer
one over the others. Which make do you prefer the most? Create a
variable called chosen_make that contains the make you want to check.
Hint: Pick one of the makes listed below and assign it to chosen_make,
spelled exactly as written here.
alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda,
mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche,
renault, saab, subaru, toyota, volkswagen, volvo
Task 12
# choose make
chosen_make <- "mercedes-benz"
chosen_make
[1] "mercedes-benz"
13. Filter cars to only include rows where the make column is equal to
chosen_make. Save the new dataframe to chosen_make_details.
Task 13
# filter rows by make
chosen_make_details <- cars %>%
filter(make == chosen_make)
chosen_make_details
14. Order the rows of chosen_make_details by engine_size descending
and save the new dataframe to chosen_make_details. View
chosen_make_details.
How large are the engines in the cars of the make you chose? You can
change the make stored in chosen_make to check out the engine sizes for
other makes.
Task 14
# order filtered rows by engine size
chosen_make_details <- chosen_make_details %>%
arrange(desc(engine_size))
chosen_make_details
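Tasks 13 and 14 together amount to "filter by make, then sort by engine size". A minimal sketch of that combined pipeline on made-up data follows (toy_cars and its values are hypothetical, not the project data).

```r
library(dplyr)

chosen_make <- "bmw"

# Made-up stand-in data, not the real cars85.csv
toy_cars <- tibble(
  make = c("bmw", "audi", "bmw", "bmw"),
  engine_size = c(108, 136, 209, 164)
)

# Keep only the chosen make, then sort by engine size, largest first
toy_make_details <- toy_cars %>%
  filter(make == chosen_make) %>%
  arrange(desc(engine_size))

toy_make_details$engine_size
# [1] 209 164 108
```

Changing chosen_make and rerunning the pipeline gives the same view for another make without editing the filter itself.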
15. Buying a new car can be stressful: you don’t want to buy a car you
won’t like! You’ve now seen how an analysis can ease some of the stress
of deciding which car to buy. You also get to add a nice new car to
your collection! Great work!
