Develop an R program to quickly explore a given dataset, including categorical analysis using the group_by() command, and visualize the findings using ggplot2.
What will we do?
In this program, we will:
Load the required libraries
Explore the structure of the dataset
Convert a numeric variable into a categorical variable
Perform categorical analysis using group_by() and summarize()
Visualize the results using ggplot2
Step 1: Load the required libraries and dataset
tidyverse - a collection of packages for data science; it includes dplyr and ggplot2.
dplyr - used for grouping and summarizing data.
ggplot2 - used for visualization.
library(tidyverse)
Warning: package 'dplyr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
# Loading the built-in dataset
data = mtcars
# First few rows
head(data)
# Summary statistics for every column
summary(data)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
?str
# Get the first and last few rows - already done - head and tail
head(data)
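To round out the structure check, the base R calls below are a minimal sketch of what we might run at this point. It assumes `data` is the unmodified `mtcars` copy loaded in Step 1; the missing-value count is an extra check not taken from the original steps.

```r
# Built-in dataset, as loaded in Step 1
data <- mtcars

str(data)             # column types and a preview of the values
dim(data)             # number of rows and columns
names(data)           # column names
tail(data)            # last six rows (head() shows the first six)
colSums(is.na(data))  # extra check: missing values per column
```

`str()` is usually the quickest of these: it shows that every column in mtcars, including cyl, is stored as numeric.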
The variable cyl represents the number of cylinders in a car. Although it is numeric (4, 6, 8), it represents categories. For categorical analysis, we need to convert it to a factor.
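The conversion itself could look like the sketch below. It uses dplyr's `mutate()` with base R's `as.factor()`; this is one reasonable way to do it, not necessarily the exact call used in the original program.

```r
library(dplyr)

# Built-in dataset, as loaded in Step 1
data <- mtcars

# Convert cyl from numeric to a factor so it is treated as categories
data <- data %>% mutate(cyl = as.factor(cyl))

# Check the result: cyl is now a factor with one level per cylinder count
class(data$cyl)
levels(data$cyl)
table(data$cyl)   # number of cars in each category
```

After this step, `group_by(cyl)` and ggplot2's colour and fill scales treat cyl as three discrete groups rather than a continuous number.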
Here we can compute additional statistics:
n() -> number of cars per group
sd() -> standard deviation
min() and max() -> range of values in any column of interest
summary_data_optional = data %>%
  group_by(cyl) %>%
  summarize(n = n(), avg_mpg = mean(mpg), min_mpg = min(mpg), max_mpg = max(mpg), .groups = "drop")
summary_data_optional
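The plotting steps that follow use an object named `summary_data`, whose definition does not appear in this walkthrough. Here is a minimal sketch, assuming it holds one row per cylinder group with the average mpg in a column called `avg_mpg`, as the plots expect. The `sd_mpg` column is an extra added here to illustrate `sd()` from the list above; the plots do not need it.

```r
library(dplyr)

# Built-in dataset with cyl converted to a factor
data <- mtcars %>% mutate(cyl = as.factor(cyl))

# Assumed definition of summary_data: one row per cylinder group
summary_data <- data %>%
  group_by(cyl) %>%
  summarize(
    n = n(),              # number of cars in the group
    avg_mpg = mean(mpg),  # average miles per gallon
    sd_mpg = sd(mpg),     # spread of mpg within the group
    .groups = "drop"      # return an ungrouped tibble
  )

summary_data
```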
# Basic bar chart: colour sets the outline of each bar
ggplot(summary_data, aes(x = cyl, y = avg_mpg, colour = cyl)) +
  geom_bar(stat = "identity") +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel")
# Same chart with a minimal theme
ggplot(summary_data, aes(x = cyl, y = avg_mpg, colour = cyl)) +
  geom_bar(stat = "identity") +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel") +
  theme_minimal()
# Add fill so the inside of each bar is coloured as well
ggplot(summary_data, aes(x = cyl, y = avg_mpg, colour = cyl, fill = cyl)) +
  geom_bar(stat = "identity") +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel") +
  theme_minimal()
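As a side note, ggplot2 also provides `geom_col()`, which plots the y values as given and so does the same job as `geom_bar(stat = "identity")`. The sketch below is an alternative spelling of the final bar chart, not a change to the approach; it rebuilds `summary_data` under the assumption that it contains the average mpg per cylinder group.

```r
library(dplyr)
library(ggplot2)

# Assumed shape of summary_data: average mpg for each cylinder group
summary_data <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), .groups = "drop")

# geom_col() uses the values in avg_mpg directly as bar heights
p_col <- ggplot(summary_data, aes(x = cyl, y = avg_mpg, fill = cyl)) +
  geom_col() +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel") +
  theme_minimal()

p_col
```

Only `fill` is mapped here because, for bars, `fill` colours the interior while `colour` affects only the outline.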
Step 7: Alternative Visualization (Optional)
Instead of a bar chart, we can use points and lines to show the trend.
# Points joined by a line; group = 1 puts all points on one line
ggplot(summary_data, aes(x = cyl, y = avg_mpg, group = 1)) +
  geom_point() +
  geom_line() +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel")
# Same chart with a minimal theme
ggplot(summary_data, aes(x = cyl, y = avg_mpg, group = 1)) +
  geom_point() +
  geom_line() +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel") +
  theme_minimal()
# Colour the points and line by cylinder count
ggplot(summary_data, aes(x = cyl, y = avg_mpg, group = 1, color = cyl)) +
  geom_point() +
  geom_line() +
  labs(title = "Average MPG by cylinder count",
       x = "Number of cylinders in car",
       y = "Avg. miles per gallon of fuel") +
  theme_minimal()
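The `group = 1` in the calls above matters because cyl is a factor. By default ggplot2 forms one group per level of a discrete variable, so each cylinder count would be its own single-point group and `geom_line()` would have nothing to connect. The sketch below, which again assumes `summary_data` holds the average mpg per cylinder group, checks that all three points end up in one group.

```r
library(dplyr)
library(ggplot2)

# Assumed shape of summary_data: average mpg for each cylinder group
summary_data <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg), .groups = "drop")

# group = 1 places every row in the same group, so one line joins all points
p_line <- ggplot(summary_data, aes(x = cyl, y = avg_mpg, group = 1)) +
  geom_point() +
  geom_line()

# Inspect the data ggplot2 computed for the line layer (layer 2)
line_groups <- unique(layer_data(p_line, 2)$group)
line_groups
```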
Final Conclusion
Loaded the required external libraries
Explored the dataset
Converted the numerical cyl column to factor (categorical)