We will analyze data from the Gapminder data set, which includes economic indicators for different countries over different periods of time. We focus entirely on 2007, the most recent year available in the data set, and on the variable GDP per capita. Since GDP is a measurement of economic activity, a higher GDP per capita typically indicates that a country is better off economically.
Our goal is to examine the data in general, and to do some more detailed analysis of the GDP per capita variable (gdpPercap).
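The report does not show how gap_2007 is created. A minimal sketch, assuming the data come from the gapminder package on CRAN (whose gapminder data frame has year and gdpPercap columns) and that gap_2007 is simply the 2007 rows of that data frame:

```r
# Assumption: the data come from the `gapminder` package; the report
# itself does not show this step.
library(gapminder)
library(ggplot2)  # needed for the ggplot() calls used in the report

# Keep only the rows for 2007, giving one row per country.
gap_2007 <- subset(gapminder, year == 2007)

nrow(gap_2007)  # number of countries in the 2007 subset
```

Using base R's subset() keeps the sketch free of extra dependencies; dplyr::filter(gapminder, year == 2007) would give the same rows.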
We start by looking at a histogram of the GDP data.
ggplot(gap_2007, aes(x = gdpPercap)) +
geom_histogram(bins = 30, alpha = 0.8, fill = "lightblue", color = "black") +
theme_minimal()
As we can see, the distribution is unimodal and skewed to the right: most of our data values are close to the minimum, with a handful of values (the tail of the distribution) at the higher end. Since our observational units are countries in the year 2007, the right skew indicates that most countries have a lower GDP per capita (closer to the minimum value), while only a handful of countries have a high GDP per capita. We anticipate that this means the median will be lower than the mean, but we will confirm this in the next section.
# Calculating statistics for our GDP data
We now calculate some statistics for our GDP data.
# Measures of Center
mean(gap_2007$gdpPercap)
## [1] 11680.07
median(gap_2007$gdpPercap)
## [1] 6124.371
Starting with our measures of center, we see that the mean of our data set was 11680.07 and the median of our data set was 6124.371. This confirms our earlier hypothesis that the mean is higher than the median, thanks to the right skew of the data.
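To see why a right skew pulls the mean above the median, here is a small sketch on made-up values (these are not Gapminder data). One large value drags the mean upward, while the median depends only on the middle of the ordered data:

```r
# Hypothetical toy values (not Gapminder data): six small values and one large one.
x <- c(1, 2, 2, 3, 3, 4, 50)

mean(x)    # pulled upward by the single large value
median(x)  # the middle value, unaffected by how large the maximum is

# Making the largest value even larger changes the mean but not the median.
y <- replace(x, which.max(x), 500)
c(mean = mean(y), median = median(y))
```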
# Measures of Spread
sd(gap_2007$gdpPercap)
## [1] 12859.94
range(gap_2007$gdpPercap)
## [1] 277.5519 49357.1902
IQR(gap_2007$gdpPercap)
## [1] 16383.99
We see that our standard deviation is 12859.94. This large number (it is larger than the mean of 11680.07) suggests that our data are relatively spread out. Our range goes from a minimum of 277.5519 to a maximum of 49357.1902, with an interquartile range of 16383.99.
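The spread measures can be checked against the other numbers in this report. The sketch below uses only values copied from the report's own output (the quartiles are the rounded ones printed by summary()). It shows that the IQR is the gap between the quartiles, and it expresses the standard deviation relative to the mean:

```r
# Values copied from the output shown in this report.
gdp_mean <- 11680.07
gdp_sd   <- 12859.94
gdp_q1   <- 1624.8   # 1st quartile, as rounded by summary()
gdp_q3   <- 18008.8  # 3rd quartile, as rounded by summary()

# The IQR is the distance between the third and first quartiles.
gdp_q3 - gdp_q1    # agrees with the IQR() output up to rounding

# Coefficient of variation: standard deviation relative to the mean.
gdp_sd / gdp_mean  # a value above 1 means the SD exceeds the mean
```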
We conclude the GDP analysis with a 5-number summary (summary() also reports the mean alongside it) and a box plot display.
summary(gap_2007$gdpPercap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 277.6 1624.8 6124.4 11680.1 18008.8 49357.2
ggplot(gap_2007, aes(x = gdpPercap)) +
geom_boxplot(fill="lightblue") +
theme_minimal()
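By default, geom_boxplot() draws each whisker out to the most extreme value within 1.5 × IQR of the box, and plots anything beyond that as a separate point. As a rough check, the sketch below computes those fences from the rounded quartiles printed by summary() above, so the results are approximate:

```r
# Quartiles and extremes copied from the summary() output in this report.
q1      <- 1624.8
q3      <- 18008.8
gdp_min <- 277.6
gdp_max <- 49357.2

iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

# TRUE means at least one value lies beyond that fence and would be
# drawn as a separate point instead of being covered by the whisker.
gdp_min < lower_fence
gdp_max > upper_fence
```

With these numbers only the upper fence is exceeded, which is consistent with the right skew seen in the histogram.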
(Your work here. Copy/paste the code up above, but change the variable you’re looking at to lifeExp (which measures Life Expectancy). Include a histogram; boxplot; measures of center; and measures of spread. Briefly comment on the shape of the data, (what direction is the skew?), and note whether the mean is less than, or greater than, the median.)
ggplot(gap_2007, aes(x = lifeExp)) +
geom_histogram(bins = 30, alpha = 0.8, fill = "lightgreen", color = "black") +
theme_minimal()
# Measures of Center
mean(gap_2007$lifeExp)
median(gap_2007$lifeExp)
# Measures of Spread
sd(gap_2007$lifeExp)
range(gap_2007$lifeExp)
IQR(gap_2007$lifeExp)
summary(gap_2007$lifeExp)
ggplot(gap_2007, aes(x = lifeExp)) +
geom_boxplot(fill = "lightgreen") +
theme_minimal()
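Since the same statistics are computed for both gdpPercap and lifeExp, the repeated calls could be wrapped in one function. The helper below is hypothetical (describe_variable is not part of the original report) and is demonstrated on made-up values:

```r
# Hypothetical helper (not part of the original report): collects the
# measures of center and spread used above for any numeric vector.
describe_variable <- function(x) {
  c(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    min    = min(x),
    max    = max(x),
    iqr    = IQR(x)
  )
}

# Made-up values, used only to show the call.
toy <- c(2, 4, 4, 6, 9)
describe_variable(toy)
```

With the data loaded, describe_variable(gap_2007$lifeExp) would return the same six statistics for life expectancy.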
In this report, we analyzed data collected from countries all around the world in 2007. Our initial work focused on GDP per capita, while our later work focused on life expectancy. For each variable, we calculated measures of center and measures of spread, and looked at two kinds of graphs (histograms and boxplots).