Develop an R program to quickly explore a given dataset, including categorical analysis using the ‘group_by()’ command, and visualize the findings using ‘ggplot2’.
## What we will do
In this program, we will:
1. Load the required libraries and dataset.
2. Explore the structure of the dataset.
3. Convert a numeric variable into a categorical variable.
4. Perform a categorical analysis using ‘group_by()’ and ‘summarise()’.
5. Visualize the results using ‘ggplot2’.
## Step 1: Load required libraries and dataset
‘tidyverse’ is a collection of packages for data science. ‘dplyr’, which is attached as part of it, is used for grouping and summarizing data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
data <- mtcars
## Step 2: Explore the dataset
Before performing any analysis, we should understand the dataset.
We will check:
- number of rows and columns
- column names
- data types
- summary statistics
- first few rows
# number of rows and columns
dim(data)
# column names
names(data)
# structure of the dataset (column types)
str(data)
# summary statistics
summary(data)
# first six rows
head(data)
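The checks listed above can be complemented with a compact overview and a missing-value count. This is a minimal sketch rather than part of the original program: ‘glimpse()’ comes from ‘dplyr’, and the object name ‘missing_per_column’ is an illustrative assumption.

```r
library(dplyr)

# Compact overview: one line per column with its type and first values
glimpse(mtcars)

# Count missing values in each column
# (`missing_per_column` is an illustrative name, not from the original program)
missing_per_column <- colSums(is.na(mtcars))
missing_per_column
```

A column with many missing values would need attention before grouping and averaging, since ‘mean()’ returns NA when any input is NA unless ‘na.rm = TRUE’ is set.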
## Step 3: Convert a numeric variable to categorical
The variable ‘cyl’ represents the number of cylinders in a car. Although it is numeric (4, 6, 8), it represents categories. For categorical analysis, we convert it into a factor.
# Convert 'cyl' to a factor
data$cyl <- as.factor(data$cyl)
# Confirm the conversion
str(data$cyl)
 Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
levels(data$cyl)
[1] "4" "6" "8"
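If more readable category names are wanted in tables and plot legends, ‘factor()’ accepts explicit ‘levels’ and ‘labels’. The following is a minimal sketch, not part of the original program; the label text and the object name ‘cyl_labelled’ are illustrative assumptions.

```r
# Sketch: convert cyl to a factor with descriptive labels.
# The label strings and the name `cyl_labelled` are illustrative assumptions.
cyl_labelled <- factor(
  mtcars$cyl,
  levels = c(4, 6, 8),
  labels = c("4 cylinders", "6 cylinders", "8 cylinders")
)

# Inspect the categories and how many cars fall into each
levels(cyl_labelled)
table(cyl_labelled)
```

Listing the levels explicitly also fixes their order, which later controls the order of bars in a plot.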
## Step 4: Perform categorical analysis
We calculate the average miles per gallon (‘mpg’) for each cylinder category.
## How these functions work together
- ‘%>%’ passes the output of one function to the next.
- ‘group_by(cyl)’ splits the dataset into groups.
- ‘summarise()’ calculates statistics per group.
- ‘mean(mpg)’ computes the average mileage.
- ‘.groups = "drop"’ removes the grouping afterwards.
summary_data <- data %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    .groups = "drop"
  )
summary_data
# A tibble: 3 × 2
  cyl   avg_mpg
  <fct>   <dbl>
1 4        26.7
2 6        19.7
3 8        15.1
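The effect of the ‘.groups’ argument is easier to see with two grouping variables. The sketch below is an illustration under stated assumptions, not part of the original program: it groups by ‘cyl’ and by ‘am’ (the transmission column of ‘mtcars’), and the object names ‘by_cyl_am’, ‘kept’, and ‘dropped’ are hypothetical.

```r
library(dplyr)

# Sketch: group by two variables to see what .groups changes.
# Using `am` (0 = automatic, 1 = manual) is an illustrative choice.
by_cyl_am <- mtcars %>% group_by(cyl, am)

# Without .groups, only the last grouping level (am) is dropped here,
# so the result is still grouped by cyl (dplyr prints a message about this)
kept <- by_cyl_am %>% summarise(avg_mpg = mean(mpg))
group_vars(kept)

# With .groups = "drop", the result is an ordinary ungrouped tibble
dropped <- by_cyl_am %>% summarise(avg_mpg = mean(mpg), .groups = "drop")
group_vars(dropped)
```

With a single grouping variable, dropping the last level already leaves the result ungrouped, so in that case ‘.groups = "drop"’ mainly makes the intent explicit.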
## Step 5: Visualize using a bar plot
ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average MPG by cylinder Count",
    x = "Number of cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()
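When bar heights come from values that are already computed, ‘geom_col()’ is a shorter way to write ‘geom_bar(stat = "identity")’. The following is a minimal sketch of that alternative; it recomputes the averages from ‘mtcars’ so that it runs on its own, and the object name ‘p’ is an illustrative assumption.

```r
library(dplyr)
library(ggplot2)

# Recompute the per-cylinder averages so the sketch is self-contained
summary_data <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() takes bar heights directly from the y aesthetic
# (`p` is an illustrative name for the plot object)
p <- ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  labs(
    title = "Average MPG by Cylinder Count",
    x = "Number of cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()

# Printing the object draws the plot
p
```

Storing the plot in an object also makes it possible to add layers later or to pass it to ‘ggsave()’.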