Scenario

We are interested in the Gapminder data set, which records measurements (such as life expectancy, GDP per capita, and population) for different countries over different years. Specifically, we will focus on the values from the year 2007. This will require us to create a new data set, gap_2007, which we will do here:

gap_2007 <- gap %>% filter(year == 2007)
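The filter above relies on an object named gap and on the dplyr package, neither of which is defined in this section. Here is a minimal sketch of the setup, assuming gap is the gapminder data frame from the gapminder package. The variable names and the 142 rows reported in the next section are consistent with that data set, but the source of gap is an assumption.

```r
# Assumption: `gap` is the data frame shipped with the gapminder package.
library(dplyr)      # provides %>% and filter()
library(gapminder)  # provides the `gapminder` data frame

gap <- gapminder

# Keep only the rows recorded for the year 2007
gap_2007 <- gap %>% filter(year == 2007)

# Quick check: every remaining row should be from 2007
all(gap_2007$year == 2007)
```

If gap was loaded some other way (for example, from a CSV file), only the line that creates gap needs to change.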

Exploring the Data

Here, we list the variable names and dimensions of gap_2007, preview its first six rows, inspect the type of each variable, and count the missing values in each column. The results are recorded below:

names(gap_2007)      # what are the variables?
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
dim(gap_2007)        # how many rows/columns?
## [1] 142   6
head(gap_2007)       # first 6 rows
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.
str(gap_2007)        # variable properties
## tibble [142 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 4 1 1 2 5 4 3 3 4 ...
##  $ year     : int [1:142] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
##  $ pop      : int [1:142] 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ...
##  $ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
colSums(is.na(gap_2007))   # missing values per column
##   country continent      year   lifeExp       pop gdpPercap 
##         0         0         0         0         0         0

We are interested in looking at histograms of each of our quantitative variables. The results are below:

# Here is a histogram of GDP per capita
ggplot(gap_2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()

# Here is a histogram of population
ggplot(gap_2007, aes(x = pop)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()

# Here is a histogram of life expectancy
ggplot(gap_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
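The three histograms differ only in the variable mapped to the x-axis, so the repeated code could be wrapped in a small function. The sketch below is one way to do that. The helper name plot_hist is hypothetical, and gap_2007 is rebuilt from the gapminder package on the assumption that this is where the data come from.

```r
library(ggplot2)
library(dplyr)
library(gapminder)  # assumption: source of the data used in this report

gap_2007 <- gapminder %>% filter(year == 2007)

# plot_hist() is a hypothetical helper, not part of any package.
# `var` is the name of a numeric column, given as a string.
plot_hist <- function(data, var) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
    labs(x = var) +
    theme_minimal()
}

# Build one histogram per quantitative variable
vars <- c("gdpPercap", "pop", "lifeExp")
hists <- lapply(vars, plot_hist, data = gap_2007)
names(hists) <- vars

# Printing an element draws the corresponding plot
hists$gdpPercap
```

The styling arguments are copied from the plots above, so each histogram should look the same as before, apart from the explicit x-axis label.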

Calculating Statistics for One Variable

We home in on one of our variables, gdpPercap. For this variable, we calculate the mean, median, range, IQR, and standard deviation in the space below:

#mean
mean(gap_2007$gdpPercap)
## [1] 11680.07
#median
median(gap_2007$gdpPercap)
## [1] 6124.371
#range
range(gap_2007$gdpPercap)
## [1]   277.5519 49357.1902
#IQR
IQR(gap_2007$gdpPercap)
## [1] 16383.99
#Standard deviation
sd(gap_2007$gdpPercap)
## [1] 12859.94
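The same statistics can also be collected in a single table with dplyr's summarise(). This is only a sketch. The column names (mean_gdp, median_gdp, and so on) are made up for illustration, and gap_2007 is rebuilt from the gapminder package on the assumption that this is where the data come from.

```r
library(dplyr)
library(gapminder)  # assumption: source of the data used in this report

gap_2007 <- gapminder %>% filter(year == 2007)

# One row, one column per statistic (column names are illustrative)
gdp_summary <- gap_2007 %>%
  summarise(
    mean_gdp   = mean(gdpPercap),
    median_gdp = median(gdpPercap),
    min_gdp    = min(gdpPercap),
    max_gdp    = max(gdpPercap),
    iqr_gdp    = IQR(gdpPercap),
    sd_gdp     = sd(gdpPercap)
  )

gdp_summary
```

Keeping the results in one data frame makes it easier to compare them side by side or to reuse them later in the report.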

Summary

The histograms of GDP per capita and population are right-skewed, while the histogram of life expectancy is left-skewed.

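For GDP per capita, the mean (11680.07) is much larger than the median (6124.371) because the long right tail pulls the mean toward the largest values. The median is barely affected by those extreme values, so it is the more robust measure of center for this variable.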
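To illustrate how the mean and the median respond to a single large value, here is a small sketch. It uses only the six GDP per capita values printed by head(gap_2007) above, rounded as shown there, so it is a toy example and not a summary of the full data set.

```r
# GDP per capita for the six countries shown by head(gap_2007),
# rounded as printed in the output above
gdp_six <- c(
  Afghanistan = 975,
  Albania     = 5937,
  Algeria     = 6223,
  Angola      = 4797,
  Argentina   = 12779,
  Australia   = 34435
)

mean(gdp_six)     # about 10857.67
median(gdp_six)   # 6080

# Drop the largest value (Australia) and recompute
gdp_five <- gdp_six[names(gdp_six) != "Australia"]

mean(gdp_five)    # 6142.2
median(gdp_five)  # 5937
```

Removing one large value lowers the mean by more than 4700 but moves the median by only 143. That sensitivity to the tail is why the mean sits above the median in right-skewed data.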