We will analyze data from the Gapminder data set, which records economic and demographic indicators for different countries over different periods of time. We will focus entirely on the year 2007, the most recent year available in this data set. Our main variable of interest is GDP per capita. Since GDP is a measure of economic activity, a higher GDP per capita typically indicates that a country is better off economically.
Our goal is to examine the data in general, and to do some more detailed analysis of the GDP per capita variable (gdpPercap).
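The report does not show how the gap_2007 data frame used in the code was created. A minimal sketch, assuming the data come from the CRAN gapminder package (an assumption; the original may have loaded the data another way), could look like this:

```r
# Assumption: the data come from the CRAN `gapminder` package
# (run install.packages("gapminder") first if it is not installed).
library(gapminder)

# Keep only the 2007 observations, one row per country
gap_2007 <- subset(gapminder, year == 2007)

# Inspect the result: number of rows and the variables available
nrow(gap_2007)
names(gap_2007)
```

The plotting calls in this report additionally need the ggplot2 package to be loaded with library(ggplot2).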
We start by looking at a histogram of the GDP per capita data:
ggplot(gap_2007, aes(x = gdpPercap)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()
As we can see, the distribution is unimodal and skewed to the right: most values lie close to the minimum, with a handful of values forming a long tail at the higher end. Since our observational units are countries in 2007, this right skew indicates that most countries have a low GDP per capita, while only a handful have a high GDP per capita. Because of the skew, we anticipate that the median will be lower than the mean, and we confirm this in the next section.
# Calculating statistics for our GDP data
We now calculate some statistics for our GDP data.
# Measures of Center
mean(gap_2007$gdpPercap)
## [1] 11680.07
median(gap_2007$gdpPercap)
## [1] 6124.371
Starting with our measures of center, we see that the mean GDP per capita is 11680.07 and the median is 6124.371. This confirms our earlier hypothesis that the mean is higher than the median: the few very large values in the right tail pull the mean upward, while the median is barely affected by them.
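To see why a right tail pulls the mean above the median, here is a small illustration. The values are hypothetical toy numbers chosen for this sketch, not Gapminder data:

```r
# Hypothetical toy values (not from the Gapminder data): four small
# values and one very large one, giving a right-skewed sample.
toy <- c(1, 2, 3, 4, 100)

toy_mean   <- mean(toy)    # pulled upward by the single large value
toy_median <- median(toy)  # the middle value, unaffected by the tail

# TRUE when the mean exceeds the median, as expected under right skew
toy_mean > toy_median
```

The same mechanism is at work in the GDP data, just with many more observations.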
# Measures of Spread
sd(gap_2007$gdpPercap)
## [1] 12859.94
range(gap_2007$gdpPercap)
## [1] 277.5519 49357.1902
IQR(gap_2007$gdpPercap)
## [1] 16383.99
We see that our standard deviation is 12859.94. This is a large number (it is even larger than the mean of 11680.07), which suggests that our data are widely spread out. The data range from a minimum of 277.5519 to a maximum of 49357.1902, with an interquartile range of 16383.99.
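The comparison between the standard deviation and the mean can be made concrete with a little arithmetic. The sketch below uses only the values reported above; the coefficient of variation (standard deviation divided by mean) is an extra summary introduced here for illustration and was not part of the original analysis:

```r
# Values copied from the output reported above
gdp_mean  <- 11680.07
gdp_sd    <- 12859.94
gdp_range <- c(277.5519, 49357.1902)

# Width of the range: maximum minus minimum
range_width <- diff(gdp_range)

# Coefficient of variation: standard deviation relative to the mean.
# A value above 1 means the standard deviation exceeds the mean.
cv <- gdp_sd / gdp_mean

range_width
cv
```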
We conclude our look at GDP per capita with a numerical summary (the five-number summary plus the mean) and a box plot.
summary(gap_2007$gdpPercap)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   277.6  1624.8  6124.4 11680.1 18008.8 49357.2
ggplot(gap_2007, aes(x = gdpPercap)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal()
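By default, geom_boxplot() draws each whisker out to the most extreme observation within 1.5 times the IQR of the box and plots anything beyond that as an individual point. The sketch below works through that rule by hand using the rounded values from the summary() output above. The results are approximate, since ggplot2 computes the cut-offs from the unrounded data:

```r
# Rounded quartiles and extremes copied from the summary() output above
gdp_min <- 277.6
gdp_q1  <- 1624.8
gdp_q3  <- 18008.8
gdp_max <- 49357.2

# Interquartile range and the 1.5 * IQR cut-offs ("fences")
iqr         <- gdp_q3 - gdp_q1
lower_fence <- gdp_q1 - 1.5 * iqr
upper_fence <- gdp_q3 + 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# Does any country fall outside the fences?
gdp_min < lower_fence  # unusually low values
gdp_max > upper_fence  # unusually high values
```

Because the lower fence is negative and GDP per capita cannot be, no country can be flagged on the low side. The maximum lies above the upper fence, so at least one country appears as a separate point on the right of the box plot.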
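We now repeat the same analysis for life expectancy (the lifeExp variable), starting with a histogram.
ggplot(gap_2007, aes(x = lifeExp)) +
  geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
  theme_minimal()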
# Measures of Center
mean(gap_2007$lifeExp)
## [1] 67.00742
median(gap_2007$lifeExp)
## [1] 71.9355
# Measures of Spread
sd(gap_2007$lifeExp)
## [1] 12.07302
range(gap_2007$lifeExp)
## [1] 39.613 82.603
IQR(gap_2007$lifeExp)
## [1] 19.253
summary(gap_2007$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   39.61   57.16   71.94   67.01   76.41   82.60
ggplot(gap_2007, aes(x = lifeExp)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal()
The histogram of life expectancy is bimodal and skewed to the left: it has two peaks, and most values are concentrated at the higher end, with a tail stretching toward lower life expectancies. Consistent with this left skew, the mean (67.01) is lower than the median (71.94), because the low values in the tail pull the mean down but have little effect on the median.
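One rough way to put a number on the direction of skew is Pearson's second skewness coefficient, 3 * (mean - median) / sd. It was not part of the original analysis and is only a crude summary, but it can be computed directly from the statistics reported above. The helper name pearson_skew is made up for this sketch:

```r
# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
# Positive values suggest right skew, negative values suggest left skew.
pearson_skew <- function(mean_value, median_value, sd_value) {
  3 * (mean_value - median_value) / sd_value
}

# Inputs are the mean, median, and standard deviation reported above
gdp_skew  <- pearson_skew(11680.07, 6124.371, 12859.94)
life_skew <- pearson_skew(67.00742, 71.9355, 12.07302)

c(gdpPercap = gdp_skew, lifeExp = life_skew)
```

The sign is positive for GDP per capita and negative for life expectancy, matching the right skew and left skew seen in the two histograms.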
In this report, we analyzed data collected from countries all around the world in 2007. Our initial work focused on GDP per capita, while our later work focused on life expectancy. For each variable, we calculated measures of center and measures of spread, and looked at two kinds of graphs (histograms and box plots).