Develop an R program to quickly explore a given dataset, including categorical analysis using the 'group_by()' command, and visualize the findings using 'ggplot2'.
## What we will do
In this program, we will:
1. Load the required libraries and dataset.
2. Explore the structure of the dataset.
3. Convert a numeric variable into a categorical variable.
4. Perform categorical analysis using 'group_by()' and 'summarise()'.
5. Visualize the results using 'ggplot2'.
## Step 1: Load required libraries and dataset
- 'tidyverse' is a collection of packages for data science.
- 'dplyr' is used for grouping and summarizing data.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr is already attached by tidyverse; loading it again is harmless
library(dplyr)
# Load the built-in dataset
data <- mtcars
## Step 2: Explore the dataset
Before performing any analysis, we should understand the dataset.
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
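The table above has the shape of what 'summary()' prints for a data frame, although the call itself is not shown. As a minimal sketch of this exploration step, assuming 'data' is a copy of the built-in 'mtcars' dataset as in Step 1, the following calls are a common starting point (the choice of 'dim()', 'str()' and 'head()' alongside 'summary()' is an assumption, not something the original specifies):

```r
# Assumption: `data` is a copy of the built-in mtcars dataset, as in Step 1.
data <- mtcars

dim(data)      # number of rows and columns
str(data)      # type of each column and a preview of its values
head(data)     # first six rows
summary(data)  # min, quartiles, median, mean and max for each numeric column
```

Running 'str()' first is often useful because it shows that every column in 'mtcars' is stored as numeric, including ones such as 'cyl' that behave like categories.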
### Convert a numeric variable to a categorical variable
The variable cyl represents the number of cylinders in a car. Although it is numeric (4, 6, 8), it represents categories.
For categorical analysis, we convert it into a factor.
# Convert 'cyl' to a factor
data$cyl <- as.factor(data$cyl)
# Confirm the conversion
str(data$cyl)
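To see what the conversion changes, here is a small self-contained sketch, again assuming 'data' starts as a fresh copy of 'mtcars'. It checks the type before and after, then lists the factor levels and counts the cars in each level:

```r
# Assumption: start from a fresh copy of mtcars so the sketch runs on its own.
data <- mtcars

is.numeric(data$cyl)             # cyl is numeric before the conversion

data$cyl <- as.factor(data$cyl)  # same conversion as in the step above

is.factor(data$cyl)              # cyl is now a factor
levels(data$cyl)                 # the distinct categories: "4" "6" "8"
table(data$cyl)                  # number of cars in each category
```

Once 'cyl' is a factor, 'ggplot2' treats it as a discrete variable, which gives one bar and one fill colour per category instead of a continuous colour scale.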
We calculate the average miles per gallon (mpg) for each cylinder category.
### How these functions work together
- %>% passes the output of one function to the next as its first argument.
- group_by(cyl) splits the dataset into groups, one per cylinder count.
- summarise() calculates statistics per group.
- mean(mpg) computes the average mileage.
- .groups = "drop" removes the grouping afterward, so the result is an ungrouped table.
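One way to see what the pipe does is to write the same computation with and without it. The sketch below is illustrative only and uses the raw 'mtcars' data; the names 'piped' and 'nested' are hypothetical:

```r
library(dplyr)

# With the pipe: read left to right, one step at a time.
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# Without the pipe: the same calls, nested inside out.
nested <- summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg), .groups = "drop")

# Compare the two results.
all.equal(piped, nested)
```

Both forms perform the same operations; the piped version is usually easier to read because the steps appear in the order they run.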
summary_data <- data %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")
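The same pattern extends to several statistics at once. As a hedged sketch (the columns 'n_cars' and 'sd_mpg' and the name 'summary_stats' are additions for illustration, not part of the original program), 'summarise()' can report the group size and spread alongside the mean:

```r
library(dplyr)

# Assumption: rebuild `data` here so the sketch is self-contained.
data <- mtcars
data$cyl <- as.factor(data$cyl)

summary_stats <- data %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),        # number of cars in each cylinder group
    avg_mpg = mean(mpg),  # average mileage per group
    sd_mpg  = sd(mpg),    # spread of mileage within the group
    .groups = "drop"      # return an ungrouped tibble
  )

print(summary_stats)
```

Reporting the group size next to the mean helps when judging how much weight each average deserves, since the groups in 'mtcars' are not the same size.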
ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Average MPG by cylinder count",
    x = "Number of cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()
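A bar chart of averages hides how much the mileage varies within each group. As an optional variation, not part of the original program, a box plot of the raw values shows the distribution for each cylinder count; the object name 'p' is hypothetical:

```r
library(ggplot2)

# Assumption: rebuild `data` here so the sketch is self-contained.
data <- mtcars
data$cyl <- as.factor(data$cyl)

# Box plot of the individual mpg values, one box per cylinder category.
p <- ggplot(data, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  labs(
    title = "MPG distribution by cylinder count",
    x = "Number of cylinders",
    y = "MPG"
  ) +
  theme_minimal()

print(p)
```

This plot uses the full dataset rather than 'summary_data', because a box plot needs the individual observations to compute medians and quartiles.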