# Load Tidyverse
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
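Calling install.packages() every time a document is knitted re-downloads the package and can stop the render if no CRAN mirror is set. A minimal sketch of a guarded install, assuming dplyr is the only package the rest of this activity needs:

```r
# Install dplyr only when it is not already available on this machine.
# requireNamespace() returns FALSE (instead of stopping) if the package is missing.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so that mutate(), select(), filter(), etc. can be called directly.
library(dplyr)
```

With this pattern the install step is skipped on every run after the first.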
## Load the Dataset
For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset.
data("mtcars")
Take a moment to look at the dataset.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Before cleaning data, it is important to understand its structure. View the structure of your variables.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Answer the following questions in text below (not as code):
What does each row represent? Each row represents one car model (for example, the Mazda RX4).
Name two variables that are numeric. Horsepower (hp) and miles per gallon (mpg) are numeric.
Name one variable that represents a category, even if it is currently stored as a number. Number of cylinders (cyl) is a category, but it is stored as a number in the current dataset.
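One way to spot numeric columns that may really be categories is to count how many distinct values each one holds. The sketch below uses only base R on the untouched mtcars data; the name n_distinct_values is made up for this illustration.

```r
# Count the distinct values in each column of the built-in mtcars data.
n_distinct_values <- sapply(mtcars, function(column) length(unique(column)))

# Columns with only a handful of distinct values (such as cyl and am)
# are candidates for conversion to factors.
sort(n_distinct_values)
```

A low count is only a hint: whether a column is a category still depends on what it measures.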
Some variables in this dataset are stored as numbers even though they represent categories.
Convert the following variables to factors:
cyl (number of cylinders)
am (transmission type)
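As an optional variation on this conversion, factor() can attach readable labels to the codes. The sketch below works on a copy named cars_labeled (a hypothetical name) and assumes the coding given in the mtcars help page, where am is 0 for automatic and 1 for manual.

```r
# Work on a copy so the original mtcars object is left untouched.
cars_labeled <- mtcars

# Map the numeric codes to labels: 0 = automatic, 1 = manual (per ?mtcars).
cars_labeled$am <- factor(
  cars_labeled$am,
  levels = c(0, 1),
  labels = c("automatic", "manual")
)

# cyl has no separate labels, so a plain conversion is enough.
cars_labeled$cyl <- factor(cars_labeled$cyl)

# Inspect the result: the level names and how many cars fall in each.
levels(cars_labeled$am)
table(cars_labeled$am)
```

Labeled levels carry through to summary tables and plot legends, so "manual" appears in place of "1".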
``` r
mtcars <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am)
  )
Check that the conversion worked by looking at the structure again.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
For visualization, it is often helpful to work with only the variables you need.
Create a new object called cars_clean that contains only:
mpg
hp
wt
cyl
am
cars_clean <- mtcars %>%
  select(mpg, hp, wt, cyl, am)
head(cars_clean)
## mpg hp wt cyl am
## Mazda RX4 21.0 110 2.620 6 1
## Mazda RX4 Wag 21.0 110 2.875 6 1
## Datsun 710 22.8 93 2.320 4 1
## Hornet 4 Drive 21.4 110 3.215 6 0
## Hornet Sportabout 18.7 175 3.440 8 0
## Valiant 18.1 105 3.460 6 0
Now filter the dataset to include only cars with:
horsepower (hp) greater than 100
Save the result as a new object called cars_hp.
cars_hp <- cars_clean %>%
  filter(hp > 100)
head(cars_hp)
## mpg hp wt cyl am
## Mazda RX4 21.0 110 2.620 6 1
## Mazda RX4 Wag 21.0 110 2.875 6 1
## Hornet 4 Drive 21.4 110 3.215 6 0
## Hornet Sportabout 18.7 175 3.440 8 0
## Valiant 18.1 105 3.460 6 0
## Duster 360 14.3 245 3.570 8 0
Check how many rows remain.
str(cars_hp)
## 'data.frame': 23 obs. of 5 variables:
## $ mpg: num 21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
## $ hp : num 110 110 110 175 105 245 123 123 180 180 ...
## $ wt : num 2.62 2.88 3.21 3.44 3.46 ...
## $ cyl: Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
## $ am : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
Create a new variable called power_to_weight defined as:
horsepower / weight
Add this variable to cars_hp.
cars_hp <- cars_hp %>%
  mutate(
    power_to_weight = hp / wt
  )
head(cars_hp)
## mpg hp wt cyl am power_to_weight
## Mazda RX4 21.0 110 2.620 6 1 41.98473
## Mazda RX4 Wag 21.0 110 2.875 6 1 38.26087
## Hornet 4 Drive 21.4 110 3.215 6 0 34.21462
## Hornet Sportabout 18.7 175 3.440 8 0 50.87209
## Valiant 18.1 105 3.460 6 0 30.34682
## Duster 360 14.3 245 3.570 8 0 68.62745
To prepare data for figures, we often summarize values by group.
Create a summary table that shows:
Mean miles per gallon (mpg)
Mean horsepower (hp)
Number of observations
Grouped by:
cyl
Save this as summary_cyl.
summary_cyl <- cars_hp %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp),
    # n() counts rows within each group; nrow(cars_hp) would repeat the total (23)
    nrows = n()
  )
Display the table.
head(summary_cyl)
## # A tibble: 3 × 4
## cyl mean_mpg mean_hp nrows
## <fct> <dbl> <dbl> <int>
## 1 4 25.9 111 2
## 2 6 19.7 122. 7
## 3 8 15.1 209. 14
Now create a second summary table grouped by:
am
Include:
Mean miles per gallon
Mean power-to-weight ratio
Save this as summary_transmission.
summary_transmission <- cars_hp %>%
  group_by(am) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_power_to_weight_ratio = mean(power_to_weight)
  )
head(summary_transmission)
## # A tibble: 2 × 3
## am mean_mpg mean_power_to_weight_ratio
## <fct> <dbl> <dbl>
## 1 0 16.1 44.6
## 2 1 20.6 62.1
```
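Group sizes in a summary table can be cross-checked with dplyr's count(), which returns one row per group and the number of rows in a column named n. The sketch below rebuilds the filtered data under a separate, hypothetical name (cars_hp_check) so that it runs on its own.

```r
library(dplyr)

# Rebuild the filtered data so this sketch does not depend on earlier chunks.
cars_hp_check <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  filter(hp > 100)

# count() groups by cyl and stores each group's size in column n.
cyl_counts <- cars_hp_check %>%
  count(cyl)
cyl_counts

# Sanity check: the group sizes should add up to the total number of rows.
sum(cyl_counts$n) == nrow(cars_hp_check)
```

If every group in a summary shows the same count, that usually means the total was computed instead of the per-group size.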
Answer the following questions in text:
Which group appears to have higher fuel efficiency? Manual cars (am = 1) appear more fuel-efficient, averaging about 20.6 miles per gallon versus about 16.1 for automatic cars (am = 0) in this filtered dataset.
Which summary table would be useful for making a bar plot? The summary_cyl table would be more useful for a bar plot because it has one row per cylinder group (4, 6, 8), so mean miles per gallon and mean horsepower can each be drawn as bars and compared across the groups.
Which would work better for a boxplot later? The transmission grouping used for summary_transmission would work better for a boxplot later, because miles per gallon and power-to-weight ratio may overlap considerably between automatic and manual cars. A boxplot is drawn from the raw rows of cars_hp rather than from the summary table, so its quartiles and outliers can show that overlap.
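The difference between the two plot types can be sketched with ggplot2, assuming it is installed (it ships with the tidyverse). The object names cars_plot, mpg_by_cyl, bar_plot, and box_plot are made up for this illustration; the data are rebuilt so the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Rebuild the cleaned, filtered data so this sketch is self-contained.
cars_plot <- mtcars %>%
  mutate(cyl = factor(cyl), am = factor(am)) %>%
  filter(hp > 100)

# Bar plot: needs one value per group, so it is drawn from a summary table.
mpg_by_cyl <- cars_plot %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

bar_plot <- ggplot(mpg_by_cyl, aes(x = cyl, y = mean_mpg)) +
  geom_col()

# Boxplot: needs every observation, so it is drawn from the raw rows.
# geom_boxplot() computes the quartiles and flags outliers itself.
box_plot <- ggplot(cars_plot, aes(x = am, y = mpg)) +
  geom_boxplot()

bar_plot
box_plot
```

The bar plot consumes a table that has already been summarized, while the boxplot does its own summarizing from the individual cars.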
In a short paragraph, describe:
One thing that was confusing
One thing that makes more sense now
Why cleaning and summarizing data before plotting is important
The most confusing aspect of this project was the specific syntax that each function requires. For example, I often forgot to add commas between lines when writing summaries, and I was unsure where indentation belongs. One thing that makes more sense now is knowing which function to use for a given instruction. I have a better command of the tools and functions in R Markdown, and even though I may struggle with syntax, I know which function does what. Cleaning and summarizing data before plotting is important because you need to determine which aspects of the raw data are valuable and influential for the results, and which are incomplete or add little. Cleaning the data also presents it in a way that is easier to read and interpret. Summarizing the data is important because it gives insight into potential conclusions and produces values that can be compared and prepared for a figure.
Before submitting, confirm that:
The document renders without errors
All code chunks run successfully
All written responses are complete