Descriptive analysis
library(dslabs)
library(gridExtra)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggthemes)
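# Load the heights data shipped with dslabs
data(heights)
# Note: this replaces the dslabs object with a local copy read from "Heights.csv"
heights <- read.csv("Heights.csv")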
# Check the ratio of males to females in the dataset
female <- heights$sex == "Female"
male <- heights$sex == "Male"
ratio <- sum(male) / sum(female)
# The ratio shows that there are 3.411765 males for every female in the dataset
# Visual description of the same counts
male_count <- sum(male)
female_count <- sum(female)
# Build a small data frame that holds the counts
ratio_df <- data.frame(Gender = c("Male", "Female"),
                       Count = c(male_count, female_count))
# Bar plot of the counts
ggplot(ratio_df, aes(x = Gender, y = Count, fill = Gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Fig 1.0: Distribution of males and females in the data",
       x = "Gender", y = "Count") +
  theme_economist()
From the figure above we can see that the data is not balanced: males
clearly outnumber females. It would be unfair, however, to conclude that
the sample is biased without first considering problems that may have
arisen during data collection, or differences in the population from
which the data was collected.
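The imbalance can also be expressed as proportions. The sketch below is only an illustration, not part of the original analysis: it rebuilds the group counts from two figures reported in this document, the male-to-female ratio of 3.411765 and the 1,050 observations implied by the 1,048 residual degrees of freedom in the regression output. It assumes every row is labelled either Male or Female, and the variable names are hypothetical.

```r
# Assumed inputs, both taken from figures reported in this document
n_total <- 1048 + 2   # residual degrees of freedom + 2 estimated coefficients
ratio   <- 3.411765   # males per female

# If male = ratio * female and male + female = n_total,
# then female = n_total / (1 + ratio)
female_n <- round(n_total / (1 + ratio))
male_n   <- n_total - female_n

# Counts and their share of the whole sample
counts <- c(Female = female_n, Male = male_n)
props  <- prop.table(counts)

print(counts)
print(round(props, 3))
```

Under these assumptions roughly 77% of the respondents are male, which quantifies the imbalance visible in the bar plot.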
Next, I look more closely at how the heights are distributed.
# Distribution of male heights
# Subsets that serve as building blocks for the plots
male_plot_precursor <- heights %>% filter(sex == "Male")
female_plot_precursor <- heights %>% filter(sex == "Female")
# Density plot of male heights
male_height_dist <- male_plot_precursor %>%
  ggplot(aes(x = height)) +
  geom_density(fill = "blue", col = "black") +
  xlab("Male heights in inches") +
  ggtitle("Fig 1.2: Distribution of male heights in inches")
# QQ plot of male heights against a normal with the sample mean and sd
male_params <- male_plot_precursor %>%
  summarize(mean = mean(height), sd = sd(height))
male_height_qqplot <- male_plot_precursor %>%
  ggplot(aes(sample = height)) +
  geom_qq(dparams = male_params) +
  ggtitle("Fig 1.4: QQ plot of male heights") +
  geom_abline()  # reference line y = x
# Distribution of female heights
female_height_dist <- female_plot_precursor %>%
  ggplot(aes(x = height)) +
  geom_density(fill = "pink", col = "black") +
  xlab("Female heights in inches") +
  ggtitle("Fig 1.3: Distribution of female heights in inches")
# QQ plot of female heights against a normal with the sample mean and sd
female_params <- female_plot_precursor %>%
  summarize(mean = mean(height), sd = sd(height))
female_height_qqplot <- female_plot_precursor %>%
  ggplot(aes(sample = height)) +
  geom_qq(dparams = female_params) +
  ggtitle("Fig 1.5: QQ plot of female heights") +
  geom_abline()  # reference line y = x
# Arrange the four plots in a 2 x 2 grid with gridExtra
grid.arrange(male_height_dist, male_height_qqplot,
             female_height_dist, female_height_qqplot, ncol = 2)
From the plots above we can see that heights are approximately normally distributed within each sex. To explore the data further, I use a boxplot to compare the distribution of heights in both sexes.
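A QQ plot compares the sorted observations with the quantiles that a normal distribution with the same mean and standard deviation would produce; points close to the line y = x indicate approximate normality. The sketch below reproduces that calculation in base R. It runs on simulated heights (the data, the seed, and the variable names are assumptions for illustration, not the survey data) and adds a Shapiro-Wilk test as one possible numeric check.

```r
# Simulated heights, NOT the survey data (assumed mean and sd)
set.seed(42)
toy_heights <- rnorm(200, mean = 69, sd = 3.6)

# Theoretical quantiles of a normal with the sample mean and sd
probs       <- ppoints(length(toy_heights))
theoretical <- qnorm(probs, mean = mean(toy_heights), sd = sd(toy_heights))

# Sample quantiles are simply the sorted observations
observed <- sort(toy_heights)

# A correlation near 1 means the points hug the line y = x
qq_cor <- cor(theoretical, observed)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(toy_heights)

print(qq_cor)
print(sw$p.value)
```

The same steps could be applied to the male and female subsets separately. With large samples, formal tests tend to flag even small departures from normality, so the visual check remains useful.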
# Colours follow the group order on the x-axis: Female, then Male
boxplot_colors <- c("pink", "blue")
boxplot(height ~ sex, data = heights,
        main = "Fig 2.1: Boxplot of height by sex in inches",
        xlab = "Sex", ylab = "Height in inches",
        col = boxplot_colors)
From the plot above, we can see that the interquartile range of male
heights is larger than that of female heights. We can also see that
some females are taller than the average male.
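Both observations can be checked numerically instead of being read off the plot. The sketch below uses a small made-up data frame (the values and the name `toy` are assumptions for illustration, not the survey data) to compute the interquartile range for each sex and to count the females who are taller than the mean male height.

```r
# Made-up data for illustration only
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 8),
  height = c(60, 62, 63, 64, 65, 66, 67, 70,
             64, 66, 68, 69, 70, 71, 73, 76)
)

# Interquartile range of height within each sex
iqr_by_sex <- tapply(toy$height, toy$sex, IQR)

# Number of females taller than the mean male height
male_mean      <- mean(toy$height[toy$sex == "Male"])
n_tall_females <- sum(toy$sex == "Female" & toy$height > male_mean)

print(iqr_by_sex)
print(n_tall_females)
```

Replacing `toy` with the `heights` data frame should give the corresponding figures for the real data, assuming its columns are named `sex` and `height` as in the code above.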
Lastly, I fit a linear model that describes the relationship between height and sex.
model <- lm(height~sex, data= heights)
summary(model)
##
## Call:
## lm(formula = height ~ sex, data = heights)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.3148  -2.3148  -0.3148   2.6852  14.0606
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.9394     0.2363  274.82   <2e-16 ***
## sexMale       4.3753     0.2687   16.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.645 on 1048 degrees of freedom
## Multiple R-squared: 0.2019, Adjusted R-squared: 0.2012
## F-statistic: 265.1 on 1 and 1048 DF, p-value: < 2.2e-16
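The coefficient for sexMale shows that males are on average 4.3753 inches taller than females, and this difference is statistically significant. However, the residual standard error of 3.645 inches shows that individual heights still vary considerably around the mean of their sex, and the R-squared of 0.2019 means that sex explains only about 20% of the variation in height. I therefore conclude that sex is a useful but far from perfect predictor of height.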
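When the only predictor is a two-level factor, the intercept of the linear model is the mean of the reference group and the slope is the difference between the two group means. Applied to the estimates reported above, the mean male height is 64.9394 + 4.3753 = 69.3147 inches. The sketch below checks this property on a small made-up data frame (the values and the name `toy` are assumptions for illustration, not the survey data).

```r
# Made-up data for illustration only
toy <- data.frame(
  sex    = rep(c("Female", "Male"), times = c(4, 6)),
  height = c(62, 64, 65, 67,
             66, 68, 69, 70, 71, 73)
)

# Linear model with sex as the only predictor (Female is the reference level)
fit <- lm(height ~ sex, data = toy)

# Group means computed directly, for comparison with the coefficients
group_means <- tapply(toy$height, toy$sex, mean)

# Intercept = Female mean; sexMale = Male mean - Female mean
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["sexMale"])

# Share of the variance in height explained by sex
r_squared <- summary(fit)$r.squared

print(coef(fit))
print(group_means)
print(r_squared)
```

Read this way, the model is a comparison of two group means, which is why a significant coefficient can coexist with a modest R-squared.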
Thank you.