Data Cleaning Practice

# Install the tidyverse (one-time setup; it already includes dplyr)
install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Load the Dataset

For today’s activity, we will use a built-in dataset so that everyone is working with the same data. Load the mtcars dataset.

data("mtcars")

Take a moment to look at the dataset.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Inspecting the Data

Before cleaning data, it is important to understand its structure. View the structure of your variables.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
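Beyond str(), a few base R calls can round out the inspection. The sketch below is an illustrative addition rather than part of the assignment. It works on a fresh copy of the data under the made-up name cars_raw, so it does not touch the mtcars object used in the rest of the activity.

``` r
# Fresh copy of the built-in data; `cars_raw` is a name chosen for this sketch.
cars_raw <- datasets::mtcars

dim(cars_raw)             # number of rows and columns (32 and 11, as str() reports)
colSums(is.na(cars_raw))  # count of missing values in each column
summary(cars_raw$mpg)     # range and quartiles for one numeric column
```

Checking for missing values early is a common habit, because functions such as mean() return NA when any input is missing unless na.rm = TRUE is set.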

Answer the following questions in text below (not as code):

What does each row represent? Each row represents one car model.
Name two variables that are numeric. Horsepower (hp) and miles per gallon (mpg) are numeric.
Name one variable that represents a category, even if it is currently stored as a number. Number of cylinders (cyl) represents a category, but it is stored as a number in the current dataset.
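One rough way to spot category-like columns is to count how many distinct values each one holds. This is a heuristic sketch, not a rule; the names cars_raw and n_distinct_values are invented for the example.

``` r
# Fresh copy of the built-in data; both object names are illustrative.
cars_raw <- datasets::mtcars

# Number of distinct values in each column
n_distinct_values <- sapply(cars_raw, function(x) length(unique(x)))

# Columns with only a handful of distinct values are candidates for factors
sort(n_distinct_values)
```

A column such as cyl, which takes only three values, stands out against a continuous measurement like mpg.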

Cleaning the Data

Some variables in this dataset are stored as numbers even though they represent categories.

Convert Variables to Factors

Convert the following variables to factors:

cyl (number of cylinders)
am (transmission type)


``` r
mtcars <- mtcars %>%
  mutate(
    cyl = factor(cyl),
    am = factor(am)
  )

Check that the conversion worked by looking at the structure again.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
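# Optional sketch (not part of the assignment): factor() can also attach
# readable labels. The mtcars help page documents am as 0 = automatic and
# 1 = manual. The object name cars_labeled is made up for this example,
# and a fresh copy of the data is used so mtcars above is left unchanged.
library(dplyr)

cars_labeled <- datasets::mtcars %>%
  mutate(
    am = factor(am, levels = c(0, 1), labels = c("automatic", "manual"))
  )

levels(cars_labeled$am)  # the labels, in level order
table(cars_labeled$am)   # how many cars fall in each level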

Selecting Relevant Variables

For visualization, it is often helpful to work with only the variables you need.

Create a new object called cars_clean that contains only:

mpg
hp
wt
cyl
am

cars_clean <- mtcars %>%
  select(mpg, hp, wt, cyl, am)
head(cars_clean)

##                    mpg  hp    wt cyl am
## Mazda RX4         21.0 110 2.620   6  1
## Mazda RX4 Wag     21.0 110 2.875   6  1
## Datsun 710        22.8  93 2.320   4  1
## Hornet 4 Drive    21.4 110 3.215   6  0
## Hornet Sportabout 18.7 175 3.440   8  0
## Valiant           18.1 105 3.460   6  0
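# Optional sketch: the same selection driven by a character vector.
# all_of() raises an error if a listed column is missing, which catches typos.
# keep_cols and cars_subset are illustrative names, not part of the assignment.
library(dplyr)

keep_cols <- c("mpg", "hp", "wt", "cyl", "am")

cars_subset <- datasets::mtcars %>%
  select(all_of(keep_cols))

names(cars_subset)  # columns come back in the order listed in keep_cols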

Filtering Observations

Now filter the dataset to include only cars with:

More than 100 horsepower

Save the result as a new object called cars_hp.

cars_hp <- cars_clean %>%
  filter(hp > 100)
head(cars_hp)

##                    mpg  hp    wt cyl am
## Mazda RX4         21.0 110 2.620   6  1
## Mazda RX4 Wag     21.0 110 2.875   6  1
## Hornet 4 Drive    21.4 110 3.215   6  0
## Hornet Sportabout 18.7 175 3.440   8  0
## Valiant           18.1 105 3.460   6  0
## Duster 360        14.3 245 3.570   8  0

Check how many rows remain.

str(cars_hp)

## 'data.frame':    23 obs. of  5 variables:
##  $ mpg: num  21 21 21.4 18.7 18.1 14.3 19.2 17.8 16.4 17.3 ...
##  $ hp : num  110 110 110 175 105 245 123 123 180 180 ...
##  $ wt : num  2.62 2.88 3.21 3.44 3.46 ...
##  $ cyl: Factor w/ 3 levels "4","6","8": 2 2 2 3 2 3 2 2 3 3 ...
##  $ am : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
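# Optional sketch: nrow() reports the row count directly, without the
# per-column detail that str() prints. cars_over_100 is an illustrative
# name; the filter is rebuilt from a fresh copy of the data.
library(dplyr)

cars_over_100 <- datasets::mtcars %>%
  filter(hp > 100)

nrow(cars_over_100)  # 23, matching the str() output above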

Creating New Variables

Create a new variable called power_to_weight defined as:


horsepower / weight

Add this variable to cars_hp.

cars_hp <- cars_hp %>%
  mutate(
    power_to_weight = hp / wt
  )

head(cars_hp)

##                    mpg  hp    wt cyl am power_to_weight
## Mazda RX4         21.0 110 2.620   6  1        41.98473
## Mazda RX4 Wag     21.0 110 2.875   6  1        38.26087
## Hornet 4 Drive    21.4 110 3.215   6  0        34.21462
## Hornet Sportabout 18.7 175 3.440   8  0        50.87209
## Valiant           18.1 105 3.460   6  0        30.34682
## Duster 360        14.3 245 3.570   8  0        68.62745
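# Optional sketch: sorting by the new ratio is a quick sanity check.
# The mtcars help page records wt in units of 1000 lbs, so the ratio is
# horsepower per 1000 lbs. cars_ranked is an illustrative name.
library(dplyr)

cars_ranked <- datasets::mtcars %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  arrange(desc(power_to_weight))

# Three highest ratios, with the columns that produced them
head(cars_ranked[, c("hp", "wt", "power_to_weight")], 3)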

Grouping and Summarizing Data

To prepare data for figures, we often summarize values by group.

Summary by Number of Cylinders

Create a summary table that shows:

Mean miles per gallon (mpg)
Mean horsepower (hp)
Number of observations

Grouped by:

cyl

Save this as summary_cyl.

summary_cyl <- cars_hp %>%
  group_by(cyl) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_hp = mean(hp),
    nrows = n()  # n() counts rows within each group; nrow(cars_hp) would repeat the overall total of 23
  )

Display the table.

head(summary_cyl)

## # A tibble: 3 × 4
##   cyl   mean_mpg mean_hp nrows
##   <fct>    <dbl>   <dbl> <int>
## 1 4         25.9    111      2
## 2 6         19.7    122.     7
## 3 8         15.1    209.    14
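# Optional cross-check: count() tallies rows per group in one step.
# cyl_counts is an illustrative name; the data are rebuilt from a fresh copy.
library(dplyr)

cyl_counts <- datasets::mtcars %>%
  filter(hp > 100) %>%
  count(cyl)

cyl_counts
sum(cyl_counts$n)  # the group counts add up to the 23 filtered rows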

Summary by Transmission Type

Now create a second summary table grouped by:

am

Include:

Mean miles per gallon
Mean power-to-weight ratio

Save this as summary_transmission.

summary_transmission <- cars_hp %>%
  group_by(am) %>%
  summarize(
    mean_mpg = mean(mpg),
    mean_power_to_weight_ratio = mean(power_to_weight)
  )
head(summary_transmission)

## # A tibble: 2 × 3
##   am    mean_mpg mean_power_to_weight_ratio
##   <fct>    <dbl>                      <dbl>
## 1 0         16.1                       44.6
## 2 1         20.6                       62.1

```
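When several columns need the same summary function, across() can apply it to all of them at once. The sketch below is an alternative way to build the transmission summary, not the assignment's required form. It keeps the original column names instead of adding a mean_ prefix, and transmission_means is a made-up object name.

``` r
library(dplyr)

# Rebuild the filtered data from a fresh copy so the sketch runs on its own
transmission_means <- datasets::mtcars %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  group_by(am) %>%
  summarize(
    across(c(mpg, power_to_weight), mean),  # mean of each listed column
    n = n()                                 # rows in each group
  )

transmission_means
```

Including the group size alongside the means helps a reader judge how much weight each average deserves.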

Interpreting the Summaries

Answer the following questions in text:

Which group appears to have higher fuel efficiency? Manual cars (am = 1) appear more efficient, averaging about 20.6 miles per gallon versus about 16.1 miles per gallon for automatic cars (am = 0) in this filtered dataset.
Which summary table would be useful for making a bar plot? The summary_cyl table would be more useful for a bar plot because it has one row per cylinder group (4, 6, 8), so mean miles per gallon and mean horsepower can each be drawn as bars and compared across groups.
Which would work better for a boxplot later? The transmission comparison would work better for a boxplot, because miles per gallon and power-to-weight ratio may overlap considerably between automatic and manual cars. The boxplot itself has to be drawn from the raw rows in cars_hp grouped by am, since quartiles and outliers cannot be recovered from the two means stored in summary_transmission.
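The difference between the two plot types can be sketched with ggplot2, which is installed as part of the tidyverse. This is an illustrative example under assumed names (cars_plot, cyl_means, bar_plot, box_plot), not a required part of the activity: the bar plot is fed a summary table, while the boxplot is fed row-level data.

``` r
library(dplyr)
library(ggplot2)

# Row-level data: filtered cars with the grouping variables as factors
cars_plot <- datasets::mtcars %>%
  filter(hp > 100) %>%
  mutate(cyl = factor(cyl), am = factor(am))

# Summary table: one row per cylinder group
cyl_means <- cars_plot %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Bar plot: one bar per group, drawn from the summary table
bar_plot <- ggplot(cyl_means, aes(x = cyl, y = mean_mpg)) +
  geom_col()

# Boxplot: uses the raw rows so quartiles and outliers can be computed
box_plot <- ggplot(cars_plot, aes(x = am, y = mpg)) +
  geom_boxplot()

bar_plot
box_plot
```

geom_col() draws bar heights from values already in the data, which is why it pairs naturally with a pre-computed summary.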

Reflection

In a short paragraph, describe:

One thing that was confusing
One thing that makes more sense now
Why cleaning and summarizing data before plotting is important

The most confusing aspect of this project was the specific syntax that each function requires. For example, I often forgot to add commas between the lines inside summarize(), and I was unsure where indentation belongs. One thing that makes more sense now is knowing which function to use for a given instruction. I have a better command of the tools and functions in R Markdown, and even though I may struggle with syntax, I know which function does what. Cleaning data before plotting is important because you need to determine which aspects of the raw data are actually valuable and influential for the results, and which are incomplete or add little. Cleaning also presents the data in a way that is easier to read and interpret. Summarizing is important because it gives insight into potential conclusions and puts the data in a form that can be compared and readied for a figure.

Final Check

Before submitting, confirm that:

The document renders without errors
All code chunks run successfully
All written responses are complete