We are interested in the Gapminder data set, which records measurements (such as life expectancy, GDP per capita, and population) for countries across multiple years. Specifically, we focus on the values from 2007, so we first create a new data set, gap_2007, containing only that year's rows:
gap_2007 <- gap %>% filter(year == 2007)
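As a quick sanity check on the filtering step, the sketch below confirms that the subset contains only 2007 rows and that each country appears once. It assumes that `gap` is the `gapminder` data frame from the gapminder package, which the text does not state explicitly.

```r
library(dplyr)
library(gapminder)  # assumption: `gap` comes from this package

gap <- gapminder
gap_2007 <- gap %>% filter(year == 2007)

# Every remaining row should be from 2007
stopifnot(all(gap_2007$year == 2007))

# Each country should appear exactly once (anyDuplicated() returns 0 if none)
stopifnot(anyDuplicated(gap_2007$country) == 0)

# Number of countries in the 2007 subset
nrow(gap_2007)
```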
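Here, we list the variable names, calculate the dimensions, and inspect the structure of the data. Note that the commands below are run on the full gap data set rather than on gap_2007, so the output describes all 1,704 rows across every year; gap_2007 has the same six variables.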
names(gap) # what are the variables?
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
dim(gap) # how many rows/columns?
## [1] 1704 6
head(gap) # first 6 rows
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
str(gap) # variable properties
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Next, we look at histograms of each quantitative variable in gap_2007 (gdpPercap, pop, and lifeExp):
ggplot(gap_2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
ggplot(gap_2007, aes(x = pop)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
ggplot(gap_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
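Because pop spans several orders of magnitude, most countries end up in the first few bins of a histogram drawn on the raw scale. One option, not used in the original analysis, is to draw the same histogram on a log10 axis. The sketch below assumes the data come from the gapminder package; `p_log` is just an illustrative name.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: source of the data

gap_2007 <- gapminder %>% filter(year == 2007)

# Same histogram as above, but with a log10-scaled x axis so that
# small and large countries are both visible
p_log <- ggplot(gap_2007, aes(x = pop)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  scale_x_log10() +
  labs(x = "pop (log10 scale)") +
  theme_minimal()

p_log
```

The log scale changes only how the axis is drawn, not the data, so the summary statistics are unaffected.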
We decide to home in on one of our variables, namely pop. For this variable, we calculate the mean, median, range, IQR, and standard deviation below:
#mean
mean(gap_2007$pop)
## [1] 44021220
#median
median(gap_2007$pop)
## [1] 10517531
#range
range(gap_2007$pop)
## [1] 199579 1318683096
#IQR
IQR(gap_2007$pop)
## [1] 26702008
#standard deviation
sd(gap_2007$pop)
## [1] 147621398
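The histogram of pop is unimodal and strongly right-skewed (positively skewed). The summary statistics are consistent with this: the mean (about 44.0 million) is far larger than the median (about 10.5 million), because a few very large values pull the mean upward. The outliers lie on the right side of the graph, so they correspond to countries with very large populations, such as China and India, rather than to small countries like Iceland or Albania; the maximum is about 1.32 billion. Because of these outliers, the median and IQR describe a typical country better than the mean and standard deviation do.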
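To see which countries sit in the right tail, one common convention is to flag values above Q3 + 1.5 × IQR. This rule is not part of the original analysis and is shown only as a minimal sketch. It assumes the data come from the gapminder package; `upper_fence` and `pop_outliers` are illustrative names.

```r
library(dplyr)
library(gapminder)  # assumption: source of the data

gap_2007 <- gapminder %>% filter(year == 2007)

# First and third quartiles (default quantile type, the same one IQR() uses)
q <- quantile(gap_2007$pop, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Countries whose population exceeds the upper fence, largest first
pop_outliers <- gap_2007 %>%
  filter(pop > upper_fence) %>%
  arrange(desc(pop)) %>%
  select(country, pop)

pop_outliers
```

Since pop cannot be negative and the distribution is right-skewed, only the upper fence is checked here.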