We are interested in the Gapminder data set, which records measurements (such as life expectancy, GDP per capita, and population) for different countries over different years. Specifically, we will focus on the values from the year 2007. This will require us to create a new data set, gap_2007, which we will do here:
gap_2007 <- gap %>% filter(year == 2007)
Here, we list the variable names in gap_2007, calculate its dimensions, preview its first rows, inspect the type of each variable, and count the missing values in each column. The results are recorded below:
names(gap_2007) # what are the variables?
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
dim(gap_2007) # how many rows/columns?
## [1] 142 6
head(gap_2007) # first 6 rows
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 2007 43.8 31889923 975.
## 2 Albania Europe 2007 76.4 3600523 5937.
## 3 Algeria Africa 2007 72.3 33333216 6223.
## 4 Angola Africa 2007 42.7 12420476 4797.
## 5 Argentina Americas 2007 75.3 40301927 12779.
## 6 Australia Oceania 2007 81.2 20434176 34435.
str(gap_2007) # variable properties
## tibble [142 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
## $ year : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
## $ lifeExp : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
## $ pop : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
## $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
colSums(is.na(gap_2007))
## country continent year lifeExp pop gdpPercap
## 0 0 0 0 0 0
We are interested in looking at histograms of each of our quantitative variables. The results are below:
# Here is a histogram of GDP per capita
ggplot(gap_2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
# Here is a histogram of population
ggplot(gap_2007, aes(x = pop)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
# Here is a histogram of life expectancy
ggplot(gap_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
We decide to home in on one of our variables, namely gdpPercap. For this variable, we calculate the mean, median, range, IQR, and standard deviation in the space below:
#mean
mean(gap_2007$gdpPercap)
## [1] 11680.07
#median
median(gap_2007$gdpPercap)
## [1] 6124.371
#range
range(gap_2007$gdpPercap)
## [1] 277.5519 49357.1902
#IQR
IQR(gap_2007$gdpPercap)
## [1] 16383.99
#Standard deviation
sd(gap_2007$gdpPercap)
## [1] 12859.94
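As a reminder of what IQR() measures, it is the distance between the third and first quartiles. The sketch below checks this on a small made-up vector; the values in x are hypothetical and are not taken from gap_2007, and the quartiles use R's default quantile method.

```r
# Hypothetical toy values, not taken from gap_2007
x <- c(2, 4, 4, 5, 7, 9, 30)

# First and third quartiles (default quantile method in R)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# The IQR is the distance between those two quartiles
iqr_manual <- q[2] - q[1]
iqr_builtin <- IQR(x)

# Both approaches should agree
all.equal(iqr_manual, iqr_builtin)
```

Because the IQR depends only on the middle half of the data, the single large value (30) in this toy vector does not affect it, whereas it would inflate the standard deviation.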
The histograms show that GDP per capita and population are right-skewed, while life expectancy is left-skewed.
For GDP per capita, the mean (11680.07) is well above the median (6124.371) because the long right tail pulls the mean toward the largest values. The median is therefore the more robust measure of center for this variable.
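To see how a long right tail affects the two measures of center, here is a minimal sketch on made-up numbers. The vectors balanced and skewed are hypothetical and are not taken from gap_2007; they differ only in their largest value.

```r
# Hypothetical toy values, not taken from gap_2007
balanced <- c(1, 2, 3, 4, 5)
skewed   <- c(1, 2, 3, 4, 50)  # same values, but with one large value in the right tail

# Measures of center for the balanced data
mean_balanced   <- mean(balanced)
median_balanced <- median(balanced)

# Measures of center after stretching the right tail
mean_skewed   <- mean(skewed)
median_skewed <- median(skewed)

# The mean moves with the extreme value; the median does not
c(mean = mean_skewed - mean_balanced,
  median = median_skewed - median_balanced)
```

In this toy case the mean rises from 3 to 12 while the median stays at 3, which is the same pattern seen in the gdpPercap summary.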