
Descriptive analysis

library(dslabs)
library(gridExtra)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggthemes)
data(heights)
# note: this replaces the dslabs copy loaded above with a local CSV file
heights <- read.csv("Heights.csv")
# check the ratio of males to females in the dataset
female <- heights$sex == "Female"
male <- heights$sex == "Male"
ratio <- sum(male) / sum(female)
# ratio shows that for every female there are 3.411765 males in the dataset
# visual description of the same counts
male_count <- sum(male)
female_count <- sum(female)
# create a new data frame holding the counts
ratio_df <- data.frame(Gender = c("Male", "Female"),
                       Count = c(male_count, female_count))
# bar plot
ggplot(ratio_df, aes(x = Gender, y = Count, fill = Gender)) +
  geom_col() +
  labs(title = "Fig 1.0: Distribution of males and females in the data", x = "Gender", y = "Count") +
  theme_economist()

From the figure above we can see that the data are not balanced between the sexes. It would be unfair, however, to conclude that there is sampling bias without first considering problems that may have arisen during data collection, or differences in the population from which the data were collected.
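The imbalance can also be put in numbers. This is a minimal sketch, assuming the counts are 812 males and 238 females; those counts are not printed anywhere in this report, they are the values implied by the reported ratio of 3.411765 together with the 1050 observations behind the model summary later on.

```r
# Counts implied by the reported ratio (3.411765) and n = 1050.
# They are an assumption, not values printed in the report.
counts <- c(Male = 812, Female = 238)

# Share of each sex in the sample
shares <- prop.table(counts)
round(shares, 3)

# Male-to-female ratio, for comparison with the value stored in `ratio`
ratio_check <- counts[["Male"]] / counts[["Female"]]
round(ratio_check, 6)
```

Under these assumed counts, roughly three quarters of the sample is male, which is the imbalance visible in Fig 1.0.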

Moving on to a closer analysis of the behavior of the data:

# distribution of male height
# building blocks on which the plots are built
male_plot_precursor <- heights %>% filter(sex == "Male")
female_plot_precursor <- heights %>% filter(sex == "Female")
# density plot for male height
male_height_dist <- male_plot_precursor %>% ggplot(aes(x = height)) +
  geom_density(fill = "blue", col = "black") +
  xlab("male heights in inches") +
  ggtitle("Fig 1.2: Distribution of male heights in inches")
# Q-Q plot for male height, against a normal with the sample mean and sd
male_params <- male_plot_precursor %>%
  summarize(mean = mean(height), sd = sd(height))
male_height_qqplot <- male_plot_precursor %>% ggplot(aes(sample = height)) +
  geom_qq(dparams = male_params) +
  ggtitle("Fig 1.4: Q-Q plot of male heights") +
  geom_abline()
# distribution of female height
female_height_dist <- female_plot_precursor %>%
  ggplot(aes(x = height)) +
  geom_density(fill = "pink", col = "black") +
  xlab("female heights in inches") +
  ggtitle("Fig 1.3: Distribution of female heights in inches")
# Q-Q plot for female height, against a normal with the sample mean and sd
female_params <- female_plot_precursor %>%
  summarize(mean = mean(height), sd = sd(height))
female_height_qqplot <- female_plot_precursor %>% ggplot(aes(sample = height)) +
  geom_qq(dparams = female_params) +
  ggtitle("Fig 1.5: Q-Q plot of female heights") +
  geom_abline()
# use the gridExtra package to arrange the four plots in two columns
grid.arrange(male_height_dist, male_height_qqplot,
             female_height_dist, female_height_qqplot, ncol = 2)

From the diagrams above, the heights of both sexes appear to be approximately normally distributed. To explore the data further, I will use a boxplot to compare the distribution of heights in the two sexes.
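A visual impression of normality can be backed up with a quick numeric check. The sketch below uses a simulated sample as a stand-in, not the real heights; the sample size, mean, sd and seed are all assumptions chosen only for illustration. It compares the share of values within one and two standard deviations of the mean with the normal benchmarks, then runs a Shapiro-Wilk test.

```r
set.seed(1)  # arbitrary seed so the simulated sample is reproducible

# Simulated stand-in for one sex's heights (NOT the real data);
# n, mean and sd are illustrative assumptions
sim_height <- rnorm(500, mean = 69.3, sd = 3.6)

# Standardise, then compare the share of values within 1 and 2 sd of the mean
# with what a normal distribution predicts
z <- (sim_height - mean(sim_height)) / sd(sim_height)
observed <- c(within_1sd = mean(abs(z) < 1), within_2sd = mean(abs(z) < 2))
expected <- c(within_1sd = pnorm(1) - pnorm(-1), within_2sd = pnorm(2) - pnorm(-2))
rbind(observed, expected)

# Shapiro-Wilk test: a small p-value would be evidence against normality
sw_p <- shapiro.test(sim_height)$p.value
sw_p
```

With the real data, `sim_height` would be replaced by the heights of one sex, for example `heights$height[heights$sex == "Male"]`.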

boxplot_colors <- c("pink", "blue")
boxplot(height~sex, data = heights,
        main = "Fig 2.1: Boxplot of height by sex in inches",
        xlab= "Sex", ylab="heights in inches",
        col=boxplot_colors)

From the plot above, we can see that the interquartile range of male heights is larger than that of female heights, and that there are some females whose height is above the average male height.
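The two observations read off the boxplot can also be computed directly. This is a small sketch on a made-up data frame (`toy` and its ten values are hypothetical, chosen so the arithmetic is easy to follow); the same calls should work on `heights`, assuming it has `sex` and `height` columns as used above.

```r
# Hypothetical toy data, NOT the real heights
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 5),
  height = c(60, 62, 64, 66, 72, 64, 67, 70, 73, 76)
)

# Interquartile range of height within each sex
iqr_by_sex <- tapply(toy$height, toy$sex, IQR)
iqr_by_sex

# Number of females taller than the average male
male_mean <- mean(toy$height[toy$sex == "Male"])
n_tall_females <- sum(toy$sex == "Female" & toy$height > male_mean)
n_tall_females
```

In this toy example the male interquartile range is the larger of the two and one female is taller than the male average, which mirrors the pattern described for the real data.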

Lastly, I want to build a predictive model that shows the relationship between height and sex.

model <- lm(height~sex, data= heights)
summary(model)
## 
## Call:
## lm(formula = height ~ sex, data = heights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.3148  -2.3148  -0.3148   2.6852  14.0606 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.9394     0.2363  274.82   <2e-16 ***
## sexMale       4.3753     0.2687   16.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.645 on 1048 degrees of freedom
## Multiple R-squared:  0.2019, Adjusted R-squared:  0.2012 
## F-statistic: 265.1 on 1 and 1048 DF,  p-value: < 2.2e-16

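The residual standard error of 3.645 inches measures how far individual heights typically fall from the value predicted for their sex; it is not the precision of the estimated means. Together with the R-squared of 0.2019, it tells me that sex explains only about 20% of the variation in height. So although males are on average about 4.4 inches taller than females (p < 2e-16), I conclude that sex alone is not a perfect predictor of height.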
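The numbers in the summary can be unpacked with a little arithmetic. The sketch below copies the coefficients, standard error, degrees of freedom and R-squared from the output above; the variable names are my own, and the confidence interval is the usual t-based interval, which assumes the standard linear-model assumptions hold.

```r
# Values copied from the summary(model) output
intercept <- 64.9394   # estimated mean height of the reference group (Female)
sex_male  <- 4.3753    # estimated Male minus Female difference in mean height
se_diff   <- 0.2687    # standard error of that difference
r_squared <- 0.2019

# Predicted (mean) height for males: intercept plus the sexMale coefficient
male_mean <- intercept + sex_male
male_mean

# Approximate 95% confidence interval for the difference, using 1048 residual df
ci <- sex_male + c(-1, 1) * qt(0.975, df = 1048) * se_diff
ci

# Share of the variation in height NOT explained by sex
unexplained <- 1 - r_squared
unexplained
```

Because the model has a single two-level predictor, its two predictions are simply the two group means, so it cannot distinguish between people of the same sex.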

Thank you.