library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("starwars")
There are 87 cases (rows) and 14 variables (columns) in the data set.
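The row and column counts can be read directly from the data frame. The following is a minimal sketch; the variable names `n_cases` and `n_variables` are illustrative only.

```r
library(dplyr)
data("starwars")

# dim() returns c(number of rows, number of columns)
dims <- dim(starwars)
n_cases <- dims[1]
n_variables <- dims[2]

cat("Cases:", n_cases, " Variables:", n_variables, "\n")
```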
In columns 2-6, height and mass are numeric: height is a continuous variable measured in centimeters, and mass is a continuous variable measured in kilograms. Hair color, skin color, and eye color are all categorical. Each of the three is un-ordered, because colors have no natural ranking, and not binary, because each takes more than two values.
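One way to check these variable types is to inspect columns 2-6 programmatically. This is a sketch, not part of the original assignment; `cols_2_to_6`, `col_is_numeric`, and `n_levels` are names chosen here for illustration.

```r
library(dplyr)
data("starwars")

# Columns 2 through 6: height, mass, hair_color, skin_color, eye_color
cols_2_to_6 <- starwars[, 2:6]

# TRUE for numeric columns, FALSE for the categorical (character) ones
col_is_numeric <- sapply(cols_2_to_6, is.numeric)
print(col_is_numeric)

# Number of distinct non-missing values in each categorical column;
# more than 2 means the variable is not binary
n_levels <- sapply(cols_2_to_6[!col_is_numeric],
                   function(x) length(unique(na.omit(x))))
print(n_levels)
```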
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
There are 6 characters in the Star Wars data set whose height is unknown.
summary(starwars$mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 15.00 55.60 79.00 97.31 84.50 1358.00 28
There are 28 characters whose mass is unknown.
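The NA counts reported by `summary()` can also be obtained on their own. A minimal sketch, with illustrative variable names:

```r
library(dplyr)
data("starwars")

# is.na() flags missing values; summing the logical vector counts them
n_missing_height <- sum(is.na(starwars$height))
n_missing_mass <- sum(is.na(starwars$mass))

cat("Missing height:", n_missing_height, "\n")
cat("Missing mass:", n_missing_mass, "\n")
```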
The two measures of center shown are mean and median. The mean is 174.6 and shows the average height of all the characters with known values. The median is 180.0 and is the middle value when all the heights are ordered from smallest to largest. Both of these help summarize where the middle of a numeric distribution lies.
hist(starwars$height, col="gold", xlab= "Height of characters (in centimeters)", main= "Distribution of Character Heights in Star Wars")
The distribution of heights in the data set is unimodal. The histogram shows one clear peak where most of the characters are concentrated. The tallest bars are grouped around the 170-190 cm range.
The distribution of heights appears to be slightly left skewed: it has a longer tail on the left side. In a left-skewed distribution the mean is usually pulled below the median. In the output, the median is 180.0 and the mean is 174.6, which is consistent with left skew.
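The comparison between mean and median can be made explicit in code. This is only a rough check under the usual rule of thumb (mean below median suggests left skew); it is not a formal skewness test, and the variable names are illustrative.

```r
library(dplyr)
data("starwars")

# na.rm = TRUE drops the characters with unknown height
mean_height <- mean(starwars$height, na.rm = TRUE)
median_height <- median(starwars$height, na.rm = TRUE)

# A negative difference means the mean sits below the median
mean_minus_median <- mean_height - median_height

cat("Mean:", round(mean_height, 1), "\n")
cat("Median:", median_height, "\n")
cat("Mean - median:", round(mean_minus_median, 1), "\n")
```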
I would use the median as the measure of center to describe the height of the Star Wars characters in this data set. This is the best choice because the distribution is slightly left skewed. The mean is then pulled in the direction of the skew, which can distort the true center of the data. The median is also a good choice because it is resistant to outliers and better represents the “typical” height of the characters.
boxplot(starwars$height, col="skyblue", xlab= "Height of characters (in centimeters)", horizontal= TRUE)
The median is the measure of center depicted in a box plot. It appears as a line inside the box, which spans from Q1 to Q3.
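Yes, there are several outliers in the height variable. With Q1 = 167 and Q3 = 191, 1.5 × IQR = 36, so the fences are 167 - 36 = 131 cm and 191 + 36 = 227 cm. The outliers that stand out are unusually short characters on the left side of the box plot (the minimum is 66 cm), but the maximum of 264 cm also lies beyond the upper fence, so there is at least one outlier on the right as well.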
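To see exactly which characters are flagged, the 1.5 × IQR rule can be applied by hand. This sketch assumes quartiles from `quantile()`; `boxplot()` itself uses hinges from `fivenum()`, which can differ slightly in general. The names `lower_fence`, `upper_fence`, and `height_outliers` are illustrative.

```r
library(dplyr)
data("starwars")

# First and third quartiles of the known heights
q <- unname(quantile(starwars$height, c(0.25, 0.75), na.rm = TRUE))
iqr_height <- q[2] - q[1]

# Fences: points beyond these are treated as outliers
lower_fence <- q[1] - 1.5 * iqr_height
upper_fence <- q[2] + 1.5 * iqr_height

# Characters whose height falls outside the fences, shortest first
height_outliers <- starwars %>%
  filter(!is.na(height),
         height < lower_fence | height > upper_fence) %>%
  select(name, height) %>%
  arrange(height)

print(height_outliers)
```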
boxplot(starwars$mass, col="lightgreen", xlab="Mass of characters (in kilograms)", horizontal=TRUE, main="Boxplot of Star Wars Character Mass")
The character with the very large body mass is Jabba Desilijic Tiure, who has a mass of 1358.0 kg and is an extreme outlier in the data set.
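The heaviest character can be looked up rather than read off the plot. A minimal sketch using base R indexing; `heaviest` is an illustrative name.

```r
library(dplyr)
data("starwars")

# which.max() ignores NA values and returns the row index of the largest mass
heaviest <- starwars[which.max(starwars$mass), c("name", "mass")]

print(heaviest)
```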
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
IQR = Q3 - Q1 = 191.0 - 167.0 = 24.0 cm
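The hand calculation can be cross-checked against R's built-in `IQR()` function. This is a sketch with illustrative variable names; both versions use the default quantile method.

```r
library(dplyr)
data("starwars")

# Built-in interquartile range, ignoring unknown heights
iqr_builtin <- IQR(starwars$height, na.rm = TRUE)

# Same quantity computed as Q3 - Q1
q <- unname(quantile(starwars$height, c(0.25, 0.75), na.rm = TRUE))
iqr_manual <- q[2] - q[1]

cat("IQR():", iqr_builtin, "\n")
cat("Q3 - Q1:", iqr_manual, "\n")
```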
The histogram shows modality and skew, which the box plot does not. The box plot shows quartiles and individual outliers, which the histogram does not. Overall, a histogram shows the shape of the distribution: whether it is unimodal or multimodal, and whether it is left skewed, right skewed, or symmetric. It also shows the approximate frequency of height values in each interval (bin). The box plot, by contrast, shows the 5-number summary and the outliers.
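The numbers behind each plot can be extracted without drawing anything, which makes the contrast concrete. This is a sketch; `height_hist` and `height_box` are illustrative names, and the bin breaks are whatever `hist()` chooses by default.

```r
library(dplyr)
data("starwars")

# Histogram data: bin boundaries and the count of heights in each bin
height_hist <- hist(starwars$height, plot = FALSE)
print(height_hist$breaks)
print(height_hist$counts)

# Box plot data: the five plotted values (lower whisker end, lower hinge,
# median, upper hinge, upper whisker end) and the points drawn as outliers
height_box <- boxplot.stats(starwars$height)
print(height_box$stats)
print(height_box$out)
```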