Program_4

Author

Anusha Yogisha Kumara

4. Develop a script in R to produce a bar graph displaying the frequency distribution of categorical data in a given data set, grouped by a specific variable, using ggplot2.

## Steps

In this program, we will follow these steps:

- Load the required libraries.
- Load and inspect the data set.
- Perform EDA (Exploratory Data Analysis).
- Convert the numerical variables to factors.
- Examine the frequency distribution.
- Create the bar graph.
- Examine and interpret the graph.

Step 1: Load required libraries

We load ggplot2, a package used for data visualization in R.

#install.packages('ggplot2')
library(ggplot2)
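If ggplot2 might not be installed on the machine running the script, the load step can be guarded. The following is a minimal sketch rather than part of the original program; the variable name `pkg_available` is illustrative.

```r
# Check whether ggplot2 is installed without attaching it
pkg_available <- requireNamespace("ggplot2", quietly = TRUE)

if (pkg_available) {
  # Attach the package so that ggplot(), aes(), geom_bar() etc. are available
  library(ggplot2)
} else {
  # Tell the user how to install it instead of failing with a cryptic error
  message("ggplot2 is not installed; run install.packages('ggplot2') first.")
}
```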

Step 2: Load and inspect the dataset

We load the built-in mtcars data set and view the first few rows to understand its structure.

data <- mtcars
head(data)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
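A few additional checks can complement head(). The sketch below is an optional addition, not part of the original program: it reports the dimensions, the column names, and the number of missing values per column for the same copy of mtcars.

```r
# Work on a copy of the built-in data set, as in the step above
data <- mtcars

# Number of rows and columns (32 observations of 11 variables)
dim(data)

# Column names
names(data)

# Count of missing values in each column
colSums(is.na(data))
```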

Step 3: Exploratory data analysis

Before creating any visualization, we explore the data set to understand the variables and their types.

str(data)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary(data)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000

- str(data) shows the data type of each variable.
- summary(data) provides summary statistics (minimum, quartiles, median, mean, and maximum) for each variable.
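One way to spot numeric columns that are really categorical is to count the distinct values in each column. The sketch below is illustrative only; the threshold `max_levels` and the names `distinct_counts` and `factor_candidates` are assumptions introduced here, not part of the original program.

```r
data <- mtcars

# Class of every column (all "numeric" in the raw mtcars data)
column_types <- sapply(data, class)

# Number of distinct values per column
distinct_counts <- sapply(data, function(column) length(unique(column)))

# Assumed cut-off: columns with at most this many distinct values
# are treated as candidates for conversion to factors
max_levels <- 5
factor_candidates <- names(distinct_counts)[distinct_counts <= max_levels]

print(column_types)
print(distinct_counts)
print(factor_candidates)
```

With this cut-off, cyl and gear (three distinct values each) are among the candidates, which matches the variables converted in the next step.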

Step 4: Convert the numerical variables into factors

To visualize categorical data correctly, we convert the relevant variables (cyl and gear, which take only three distinct values each) into factors.

data$cyl

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

table(data$cyl)


 4  6  8 
11  7 14

data$gear

 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

table(data$gear)


 3  4  5 
15 12  5

class(data$cyl)

[1] "numeric"

class(data$gear)

[1] "numeric"

data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

class(data$cyl)

[1] "factor"

data$cyl

 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Levels: 4 6 8

class(data$gear)

[1] "factor"

data$gear

 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5

summary(data)

      mpg        cyl         disp             hp             drat      
 Min.   :10.40   4:11   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   6: 7   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   8:14   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09          Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80          3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90          Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am         gear  
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   3:15  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   4:12  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   5: 5  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062         
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000         
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000         
      carb      
 Min.   :1.000  
 1st Qu.:2.000  
 Median :2.000  
 Mean   :2.812  
 3rd Qu.:4.000  
 Max.   :8.000

str(data)

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
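The str() output above shows that a factor stores integer level codes (2 2 1 ...) rather than the original values (6 6 4 ...). This matters if the numbers are needed again later. The sketch below illustrates the difference; the variable names are illustrative.

```r
# Convert cyl to a factor, as in the step above
cyl_factor <- as.factor(mtcars$cyl)

# as.numeric() on a factor returns the internal level codes,
# not the original cylinder counts
level_codes <- as.numeric(cyl_factor)

# Going through as.character() first recovers the original values
original_values <- as.numeric(as.character(cyl_factor))

print(head(level_codes, 3))      # codes for the first three cars
print(head(original_values, 3))  # cylinder counts for the first three cars
```

For the first three cars, the level codes are 2, 2, 1 while the recovered values are 6, 6, 4.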

Step 5: Examine the frequency distribution

Before plotting, we analyze how the data is distributed across categories.

table(data$cyl)


 4  6  8 
11  7 14

table(data$gear)


 3  4  5 
15 12  5

table(data$cyl,data$gear)

These frequency tables:

- Help us understand the count of each category.
- Provide insight into relationships between variables.
- Prepare us for interpreting the visualization.
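The two-way table can be extended with totals and row proportions to make the relationship between cylinders and gears easier to read. This is a minimal sketch using base R; the names `counts` and `row_share` are illustrative.

```r
data <- mtcars
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

# Cross-tabulation with labelled dimensions: rows = cyl, columns = gear
counts <- table(cyl = data$cyl, gear = data$gear)
print(counts)

# Same table with row and column totals added
print(addmargins(counts))

# Share of each gear count within each cylinder group (rows sum to 1)
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

For mtcars, the counts for 3, 4, and 5 gears are 1, 8, and 2 among 4-cylinder cars, 2, 4, and 1 among 6-cylinder cars, and 12, 0, and 2 among 8-cylinder cars. Most 8-cylinder cars therefore have 3 gears, and none have 4.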

Step 6: Create the bar graph

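# Set up the plot: cyl on the x-axis, bars filled by gear.
# With no layer added yet, this draws only an empty panel.
ggplot(data, aes(x = cyl, fill = gear))

# Add bars; by default the gear groups are stacked within each cylinder bar.
ggplot(data, aes(x = cyl, fill = gear)) + geom_bar()

# Place the gear groups side by side instead of stacking them.
ggplot(data, aes(x = cyl, fill = gear)) + geom_bar(position = "dodge")

# Use a minimal theme and move the legend to the top.
ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(legend.position = "top")

# Add a title and axis labels.
ggplot(data, aes(x = cyl, fill = gear)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "Bar graph displaying the frequency distribution of categorical data",
       x = "Number of cylinders", y = "Count")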
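Because no 8-cylinder car in mtcars has 4 gears, the dodged layout draws only two bars for that group and widens them to fill the space. The sketch below is an optional variant, not the original program: it uses `position_dodge(preserve = "single")` to keep all bars the same width, adds a legend title through `fill` in labs(), and reads the bar heights back with `layer_data()` to confirm that they match the frequency table. The names `final_plot` and `bar_counts` are illustrative.

```r
library(ggplot2)

data <- mtcars
data$cyl <- as.factor(data$cyl)
data$gear <- as.factor(data$gear)

final_plot <- ggplot(data, aes(x = cyl, fill = gear)) +
  # preserve = "single" keeps every bar as wide as a single dodged bar,
  # even when a cyl/gear combination is missing
  geom_bar(position = position_dodge(preserve = "single")) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(title = "Bar graph displaying the frequency distribution of categorical data",
       x = "Number of cylinders",
       y = "Count",
       fill = "Number of gears")

print(final_plot)

# Heights of the bars actually drawn (one value per non-empty cyl/gear pair)
bar_counts <- layer_data(final_plot, 1)$count
print(bar_counts)
```

The bar heights sum to 32, the number of cars in the data set, and the tallest bar (12) corresponds to 8-cylinder cars with 3 gears.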