For this problem set we’ll use the Economic Freedom dataset, sourced from the Fraser Institute: https://www.fraserinstitute.org/economic-freedom/.
Your first order of action is loading the data into your R session. First download the data file and save it somewhere convenient on your computer (e.g. the same directory your code file is in). To load csv data you can use the read.csv() function. You must specify the file path to your data, in the form read.csv('path-to-my-data/mydata.csv'). If you don’t know how to find a file path, give it a google.
If you’re having trouble, try running the command getwd(), which shows you what directory your current R session is in. You can also use setwd() to manually specify a working directory.
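If you are unsure whether R can see your file, it can help to build the path explicitly and check that it exists before reading it. The sketch below is only an illustration: it writes a tiny stand-in file (the hypothetical toy_efw.csv, with two rows copied from the data preview) to a temporary directory, then reads it back.

```r
# Hypothetical stand-in for efw.csv, written to a temporary directory.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "toy_efw.csv")
write.csv(
  data.frame(
    year = c(2017, 2017),
    ISO = c("AGO", "ALB"),
    economic_freedom = c(4.83, 7.67)
  ),
  csv_path,
  row.names = FALSE
)

# Check that the file exists at this path before trying to read it.
if (file.exists(csv_path)) {
  toy <- read.csv(csv_path)
} else {
  stop("File not found; current working directory is ", getwd())
}

print(toy)
```

The same file.exists() check works on your own path, e.g. file.exists('efw.csv').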
efwdata <- read.csv('efw.csv')
head(efwdata)
## year ISO country continent economic_freedom rank
## 1 2017 AGO Angola Africa 4.83 158
## 2 2017 ALB Albania Europe 7.67 30
## 3 2017 ARE United Arab Emirates Asia 7.17 61
## 4 2017 ARG Argentina South America 5.67 146
## 5 2017 ARM Armenia Asia 7.70 27
## 6 2017 ARM Armenia Europe 7.70 27
## quartile govt_size judicial_independence property_rights
## 1 4 6.76 1.44 3.30
## 2 1 7.53 2.48 4.57
## 3 2 5.85 7.16 7.41
## 4 4 5.71 3.59 4.38
## 5 1 7.40 4.03 5.79
## 6 1 7.40 4.03 5.79
## military_interference reliable_police gender_legal_rights money_growth
## 1 3.33 3.36 0.81 9.44
## 2 8.33 6.82 0.95 9.25
## 3 8.33 8.33 0.48 9.22
## 4 7.50 3.70 0.79 5.01
## 5 5.83 5.84 1.00 8.56
## 6 5.83 5.84 1.00 8.56
## inflation sound_money tariffs foreign_ownership_investment_restrictions
## 1 3.66 5.57 7.07 2.95
## 2 9.60 9.65 9.01 6.31
## 3 9.61 9.06 8.44 7.60
## 4 4.86 6.47 6.60 5.36
## 5 9.81 9.48 8.63 5.11
## 6 9.81 9.48 8.63 5.11
## freedom_to_trade_internationally credit_market_regulations
## 1 3.21 6.73
## 2 8.34 9.72
## 3 8.05 6.70
## 4 6.55 6.09
## 5 8.20 9.26
## 6 8.20 9.26
## tax_compliance business_regulation
## 1 6.78 4.88
## 2 7.18 6.65
## 3 9.87 8.31
## 4 6.51 5.72
## 5 7.06 6.95
## 6 7.06 6.95
Categorical: Year, ISO, Country, Continent
Discrete: Rank, Quartile
Continuous: Everything else: economic freedom, govt. size, judicial independence, property rights, military interference, reliable police, gender rights, money growth, inflation, sound money, tariffs …
unique() function on the year variable, as shown in the sample code below:
unique(efwdata$year)
## [1] 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004
## [15] 2003 2002 2001 2000 1995 1990 1985 1980 1975 1970
colnames() to check the column names; use class() to check the data type of a particular column; use str() to check the structure of your dataset, which produces a table showing each column, its data type(s), and a preview of its observations. How are different variables (e.g. continuous, categorical, etc.) stored in R?
colnames(efwdata)
## [1] "year"
## [2] "ISO"
## [3] "country"
## [4] "continent"
## [5] "economic_freedom"
## [6] "rank"
## [7] "quartile"
## [8] "govt_size"
## [9] "judicial_independence"
## [10] "property_rights"
## [11] "military_interference"
## [12] "reliable_police"
## [13] "gender_legal_rights"
## [14] "money_growth"
## [15] "inflation"
## [16] "sound_money"
## [17] "tariffs"
## [18] "foreign_ownership_investment_restrictions"
## [19] "freedom_to_trade_internationally"
## [20] "credit_market_regulations"
## [21] "tax_compliance"
## [22] "business_regulation"
str(efwdata)
## 'data.frame': 4032 obs. of 22 variables:
## $ year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ ISO : Factor w/ 161 levels "AGO","ALB","ARE",..: 1 2 3 4 5 5 6 7 8 8 ...
## $ country : Factor w/ 161 levels "Albania","Algeria",..: 3 1 153 4 5 5 6 7 8 8 ...
## $ continent : Factor w/ 6 levels "Africa","Asia",..: 1 3 2 6 2 3 5 3 2 3 ...
## $ economic_freedom : num 4.83 7.67 7.17 5.67 7.7 7.7 8.07 7.71 6.34 6.34 ...
## $ rank : int 158 30 61 146 27 27 9 26 116 116 ...
## $ quartile : int 4 1 2 4 1 1 1 1 3 3 ...
## $ govt_size : num 6.76 7.53 5.85 5.71 7.4 7.4 6.96 5.66 4.84 4.84 ...
## $ judicial_independence : num 1.44 2.48 7.16 3.59 4.03 4.03 8.68 7.68 5.68 5.68 ...
## $ property_rights : num 3.3 4.57 7.41 4.38 5.79 5.79 8.16 8.18 6.32 6.32 ...
## $ military_interference : num 3.33 8.33 8.33 7.5 5.83 5.83 10 10 5 5 ...
## $ reliable_police : num 3.36 6.82 8.33 3.7 5.84 5.84 8.62 8.46 6.17 6.17 ...
## $ gender_legal_rights : num 0.81 0.95 0.48 0.79 1 1 1 1 0.67 0.67 ...
## $ money_growth : num 9.44 9.25 9.22 5.01 8.56 8.56 8.95 8.22 9.19 9.19 ...
## $ inflation : num 3.66 9.6 9.61 4.86 9.81 9.81 9.61 9.58 7.42 7.42 ...
## $ sound_money : num 5.57 9.65 9.06 6.47 9.48 9.48 9.46 9.42 6.85 6.85 ...
## $ tariffs : num 7.07 9.01 8.44 6.6 8.63 8.63 8.84 8.23 7.98 7.98 ...
## $ foreign_ownership_investment_restrictions: num 2.95 6.31 7.6 5.36 5.11 5.11 6.85 6.9 6.23 6.23 ...
## $ freedom_to_trade_internationally : num 3.21 8.34 8.05 6.55 8.2 8.2 7.56 8.09 7.29 7.29 ...
## $ credit_market_regulations : num 6.73 9.72 6.7 6.09 9.26 9.26 9.64 9.24 8 8 ...
## $ tax_compliance : num 6.78 7.18 9.87 6.51 7.06 7.06 8.82 8.53 8.22 8.22 ...
## $ business_regulation : num 4.88 6.65 8.31 5.72 6.95 6.95 8.05 7.5 7.46 7.46 ...
ISO, country and continent are categorical and are stored as factors (ISO and country with 161 levels each, continent with 6 levels)
Year, rank and quartile are stored as integers
Continuous data are stored as numeric (num)
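To check these storage types column by column, class() can be applied to every column at once. The sketch below uses a small made-up data frame (toy is hypothetical, with three rows copied from the data preview). Note that the factor columns in the output above depend on how the file was read: from R 4.0.0 onwards, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is set, which is why it is set explicitly here.

```r
# Hypothetical toy data; stringsAsFactors = TRUE mimics the factor columns
# shown in the str() output.
toy <- data.frame(
  ISO = c("AGO", "ALB", "ARE"),
  continent = c("Africa", "Europe", "Asia"),
  rank = c(158L, 30L, 61L),
  economic_freedom = c(4.83, 7.67, 7.17),
  stringsAsFactors = TRUE
)

# Storage class of each column.
col_classes <- sapply(toy, class)
print(col_classes)

# Factors store each distinct category as a level.
n_continent_levels <- nlevels(toy$continent)
print(levels(toy$continent))
```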
## filter() and %>% come from the dplyr package
library(dplyr)
## one way
efw2017 <- filter(efwdata, year == 2017)
## another way, using a pipe
efw2017 <- efwdata %>% filter(year == 2017)
The latter method uses a pipe, %>%. This is a dplyr feature that forwards or “pipes” the value on its left-hand side into the expression(s) on its right-hand side. The advantage of using a pipe may not be apparent in the simple example above, but when chaining many functions in sequence it is much easier to read than nesting the calls (as you will soon see).
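As a rough illustration of why chaining helps, the sketch below computes the same quantity twice on a made-up data frame (toy and its values are hypothetical): once with a nested base R expression, and once as a left-to-right pipeline. The dplyr part only runs if dplyr is installed.

```r
# Hypothetical toy data standing in for efwdata (values are made up).
toy <- data.frame(
  year = c(2016, 2017, 2017, 2017),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  economic_freedom = c(6.0, 7.0, 8.0, 6.0)
)

# Base R: subsetting nested inside mean() reads inside-out.
base_result <- mean(toy[toy$year == 2017, "economic_freedom"])

# dplyr: the same steps read left to right, one per line.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  piped_result <- toy %>%
    filter(year == 2017) %>%
    summarize(avg = mean(economic_freedom)) %>%
    pull(avg)
  # Both approaches should agree.
  stopifnot(isTRUE(all.equal(base_result, piped_result)))
}

print(base_result)
```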
To make a plot use the ggplot() function (note that although the package name is ggplot2, the function call is not appended with a “2”). The ggplot() function takes two arguments:
data – a data frame whose variables you want to plot
mapping – an aesthetic mapping for which variable(s) go on which axes
You’ll also need to specify a geometric mapping, in the form + geom_XXX(), to specify the kind of plot you want (i.e. to geometrically map the data onto the \(x\) and \(y\) axes). The following examples will demonstrate how to use ggplot().
## ggplot() comes from the ggplot2 package
library(ggplot2)
ggplot(data = efw2017, mapping = aes(x = economic_freedom)) +
geom_histogram(bins = 50)
aes(y = ..density..) in the geometric mapping. Note also the additional aesthetic parameters, which change the labels/colors/theme of the plot.
ggplot(data = efw2017, mapping = aes(x = economic_freedom)) +
geom_histogram(bins = 50, aes(y = ..density..), fill = 'violet') +
ggtitle('distribution of economic freedom scores 2017') +
xlab('economic freedom index (out of 10)') +
theme_light()
The total area of the bars in this relative (density) histogram is 1.
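This area property can be checked numerically. The sketch below uses base R's hist() on made-up scores (not the EFW data) rather than ggplot2, purely because hist() returns the bar heights and bin edges directly; the area is the sum of height times width over all bars.

```r
# Made-up scores for illustration only.
set.seed(1)
scores <- runif(200, min = 2, max = 9)

# Compute the histogram without drawing it.
h <- hist(scores, breaks = 20, plot = FALSE)

# Area of each bar = density (height) * bin width; the bars sum to 1.
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```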
facet_wrap() to create individual plots for each continent:
ggplot(data = efw2017, aes(x = economic_freedom)) +
geom_histogram(bins = 50) +
facet_wrap(~continent)
The distributions for Africa, Asia and Europe look similar to one another, while the distributions for North America, Oceania and South America look similar to one another.
economic_freedom (numeric) on continent (categorical). Use geom_boxplot().
ggplot(data = efwdata, mapping = aes(x = continent, y = economic_freedom)) +
geom_boxplot()
## Warning: Removed 765 rows containing non-finite values (stat_boxplot).
The boxplot shows whether a distribution is skewed and whether it contains outliers. The box shows the centre (median) and the spread of the middle half of the data (the interquartile range). The “whiskers” extend to the most extreme points within 1.5 times the interquartile range of the box, and any points beyond them are drawn individually as possible outliers.
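The numbers behind a boxplot can be inspected with base R's boxplot.stats(), which follows the same 1.5 × IQR whisker convention that geom_boxplot() uses by default. The scores below are made up for illustration, with one deliberately low value.

```r
# Made-up scores; 1.0 is deliberately far below the rest.
scores <- c(5, 6, 6.5, 7, 7.2, 7.5, 8, 1)

bs <- boxplot.stats(scores)

# stats = lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out = points beyond the whiskers, drawn individually on a boxplot
print(bs$out)
```

Here the lower whisker stops at 5 rather than at the minimum, because 1 lies more than 1.5 times the box height below the box and is flagged as an outlier instead.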
Line plots are a good way to visualize how a variable evolves over time. To make a line plot you should specify + geom_line() as the geometric mapping.
economic_freedom (on the \(y\)-axis) on year (on the \(x\)-axis) for this country’s data. Use geom_line(). What do you see? Does the score change over the years recorded in the data?
koreaefw <- filter(efwdata, country == "Korea, South")
ggplot(data = koreaefw, aes(x = year, y = economic_freedom)) +
geom_line()
The economic freedom score decreases from 1970-1975, then starts on an overall increasing trend. While there are ups and downs after 2005, the overall trend is positive.
ggplot(data = koreaefw, aes(x = year, y = economic_freedom)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mapping = aes(x = ..., y = ..., color = country)—this will use a different color for each country.
The mean would be multiplied by 2 if all the observations are multiplied by 2. Likewise, if 5 is added to every observation, the mean increases by 5.
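A quick numerical check of both claims, using a handful of made-up scores (not taken from the dataset):

```r
# Made-up scores, for illustration only.
x <- c(4.8, 7.7, 7.2, 5.7, 7.7)

mean_x <- mean(x)
mean_scaled <- mean(2 * x)    # every observation doubled
mean_shifted <- mean(x + 5)   # 5 added to every observation

print(c(original = mean_x, doubled = mean_scaled, shifted = mean_shifted))
```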
mean() and median() to compute the mean and median values of the variable sound_money in 2017 (use the filtered 2017 dataset you created earlier). Are the values similar, or are they different? If they are different, can you think of a reason why?
mean(efw2017$sound_money)
## [1] 8.313214
median(efw2017$sound_money)
## [1] 8.825
The mean and median are similar but not equal. The difference could be due to outliers at the low end that pull the mean below the median.
sound_money in 2017 to visualize its distribution. Add two vertical lines to your plot showing where the mean and median are. You can use geom_vline() to add vertical lines and geom_text() to annotate your plot. Below is some sample code to help you do this.
ggplot(data = efw2017, aes(x = sound_money)) +
geom_histogram(bins = 30, aes(y = ..density..)) +
geom_vline(xintercept = mean(efw2017$sound_money), color = 'red') +
## y = Inf with vjust pins the label to the top of the panel, so it does not stretch the density axis
geom_text(x = mean(efw2017$sound_money)-0.5, y = Inf, vjust = 1.5, color = 'red', label = 'mean') +
geom_vline(xintercept = median(efw2017$sound_money), color = 'blue') +
geom_text(x = median(efw2017$sound_money)+0.5, y = Inf, vjust = 1.5, color = 'blue', label = 'median')
The median splits the data set in half: half of the observations have a sound money score above the median and half have a score below it. Here, that means half of the data set has a sound money score higher than 8.825.
I feel like the median is a more appropriate measure of central tendency, since there is a major outlier on the left that is pulling the mean down. Therefore, to measure “central tendency” the median would be a better choice, as it reflects the “central” data rather than being pulled by the small outlier.
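The effect of a single low outlier on each measure can be seen in a small made-up example (these scores are illustrative, not taken from the dataset): dropping the outlier moves the mean a lot and the median only slightly.

```r
# Made-up scores; 2.0 is a deliberately low outlier.
scores <- c(8.5, 8.8, 9.0, 9.2, 9.5, 2.0)
trimmed <- scores[scores > 5]   # same scores without the outlier

mean_all <- mean(scores)
median_all <- median(scores)
mean_trimmed <- mean(trimmed)
median_trimmed <- median(trimmed)

print(c(mean_all = mean_all, median_all = median_all))
print(c(mean_trimmed = mean_trimmed, median_trimmed = median_trimmed))
```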
sd() to compute the standard deviation of the variable sound_money in 2017.
sd(efw2017$sound_money)
## [1] 1.392292
The standard deviation would be multiplied by 2 if all the observations are multiplied by 2, since the distance from each observation to the mean is also multiplied by 2. However, the standard deviation would stay the same if 5 is added to every value, since the distance from each value to the mean wouldn’t change.
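Both statements are easy to verify on a few made-up values (illustrative only):

```r
# Made-up scores, for illustration only.
x <- c(4.8, 7.7, 7.2, 5.7, 7.7)

sd_x <- sd(x)
sd_scaled <- sd(2 * x)    # spread doubles when every value is doubled
sd_shifted <- sd(x + 5)   # spread is unchanged by adding a constant

print(c(original = sd_x, doubled = sd_scaled, shifted = sd_shifted))
```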
continent and year, and you’ll want to calculate a mean for the nongrouping variable, economic_freedom. First, select the relevant variables from the full dataset. Then use group_by() to specify the grouping variables and summarize() to specify the function you want to perform on the nongrouping variables. Below is some sample code to help you do this.
efw_aggregated <- efwdata %>%
select(continent, year, economic_freedom) %>%
group_by(continent, year) %>%
summarize(avg_economic_freedom = mean(economic_freedom, na.rm = TRUE))
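For comparison, a similar grouped mean can be computed in base R with aggregate(). This is only a sketch on a made-up data frame (toy is hypothetical); with the formula interface, rows whose economic_freedom is NA are dropped by default, which plays the role of na.rm = TRUE in the dplyr version.

```r
# Hypothetical toy data with one missing score.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  year = c(2017, 2017, 2017, 2017, 2016),
  economic_freedom = c(7.0, 6.0, 8.0, NA, 7.5)
)

# One row per continent-year combination, holding the group mean.
toy_aggregated <- aggregate(economic_freedom ~ continent + year,
                            data = toy, FUN = mean)
names(toy_aggregated)[3] <- "avg_economic_freedom"

print(toy_aggregated)
```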
avg_economic_freedom on year, using different colors for each continent. What do you see?
ggplot(data = efw_aggregated, mapping = aes(x = year, y = avg_economic_freedom, color = continent)) +
geom_smooth() +
theme_light()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
\(H_1\): The true mean economic freedom score of 2017 isn’t 4.5
t.test() to perform the hypothesis test. Below is some sample code to help you:
t.test(efw2017$economic_freedom, mu = 4.5)
##
## One Sample t-test
##
## data: efw2017$economic_freedom
## t = 32.377, df = 167, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4.5
## 95 percent confidence interval:
## 6.675011 6.957489
## sample estimates:
## mean of x
## 6.81625
The observed sample mean is 6.81625. It is very far from the proposed value of 4.5.
Since the p-value is less than 0.05, a sample mean this far from 4.5 would be very unlikely if the null hypothesis were true. Therefore, we reject the null hypothesis that the true mean economic freedom score in 2017 is 4.5.
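To see where the t statistic and p-value come from, the one-sample test can be reproduced by hand: \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\), with a two-sided p-value from the t distribution on \(n - 1\) degrees of freedom. The sketch below does this on a few made-up scores (not the EFW data) and compares the result with t.test().

```r
# Made-up scores, for illustration only.
x <- c(6.1, 7.3, 5.9, 6.8, 7.7, 6.4, 7.0, 6.6)
mu0 <- 4.5   # value of the mean under the null hypothesis
n <- length(x)

# t statistic: distance of the sample mean from mu0, in standard errors.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The built-in test should give the same numbers.
result <- t.test(x, mu = mu0)
print(c(t_manual = t_manual, t_builtin = unname(result$statistic)))
print(c(p_manual = p_manual, p_builtin = result$p.value))
```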
economic_freedom on business_regulation. Use geom_point(). What do you see? Are the variables associated? If so, what kind of relationship do they have?
ggplot(data = efwdata, mapping = aes(x = business_regulation, y = economic_freedom)) +
geom_point()
## Warning: Removed 1450 rows containing missing values (geom_point).
cor(efwdata$business_regulation, efwdata$economic_freedom,
use = "complete.obs")
## [1] 0.7321065
When two variables are related, you can build a statistical model that identifies the mathematical relationship between them. When variables are linearly related, you can model them with a linear function (a straight line). This modeling technique is known as linear regression.
You may be familiar with the mathematical equation for a straight line, \(y = mx + c\), where \(m\) is the slope and \(c\) is the \(y\)-intercept. In this notation \(y\) is usually called the dependent variable and \(x\) the independent variable, since \(y\) is expressed as a function of \(x\).
In linear regression there is slightly different terminology—but the idea is the same. A simple linear model has one response variable (\(y\)) and one predictor variable or explanatory variable (\(x\)). The response variable is specified as a function of the predictor.
The \(x\) (explanatory) variable would be business_regulation and the \(y\) (response) variable would be economic_freedom, since in the previous scatterplot economic freedom tended to increase as business regulation increased.
geom_point(). Add another geometric mapping, stat_smooth(method = 'lm'). This will overlay the scatterplot with a straight line that is fitted using the “lm” method (linear model).
ggplot(data = efwdata, mapping = aes(x = business_regulation, y = economic_freedom)) +
geom_point()+
stat_smooth(method = 'lm')
## Warning: Removed 1450 rows containing non-finite values (stat_smooth).
## Warning: Removed 1450 rows containing missing values (geom_point).
The sum of all the error terms or “residuals” is 0
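For a least-squares line that includes an intercept, this can be confirmed numerically. The sketch below fits a line to a few made-up points (not the EFW data) and sums the residuals; the result is zero up to floating-point rounding.

```r
# Made-up points, for illustration only.
x <- c(2, 4, 5, 7, 9)
y <- c(4.1, 5.9, 6.0, 7.6, 8.2)

fit <- lm(y ~ x)

# Residual = observed y minus the value predicted by the fitted line.
resid_sum <- sum(residuals(fit))
print(resid_sum)
```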
lm() to compute the coefficients of the regression line you plotted in q2. Below is some sample code to help you. Report your results by calling summary() on the saved regression data. Are the results what you expect?
reg1 <- lm(economic_freedom ~ business_regulation, data = efwdata)
summary(reg1)
##
## Call:
## lm(formula = economic_freedom ~ business_regulation, data = efwdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5380 -0.3992 0.0630 0.4634 1.9145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.450669 0.063054 54.73 <2e-16 ***
## business_regulation 0.542963 0.009946 54.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6332 on 2580 degrees of freedom
## (1450 observations deleted due to missingness)
## Multiple R-squared: 0.536, Adjusted R-squared: 0.5358
## F-statistic: 2980 on 1 and 2580 DF, p-value: < 2.2e-16
\(y = 0.543x + 3.45\)
If business regulation = 7, the predicted economic freedom score is 0.543 × 7 + 3.45 = 7.251.
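The same prediction can be computed in R from the unrounded coefficients reported by summary(reg1). The helper predict_freedom() below is a hypothetical name introduced just for this sketch.

```r
# Coefficients copied from the summary(reg1) output.
intercept <- 3.450669
slope <- 0.542963

# Hypothetical helper: predicted economic freedom for a given
# business regulation score.
predict_freedom <- function(business_regulation) {
  intercept + slope * business_regulation
}

pred_7 <- predict_freedom(7)
print(round(pred_7, 3))
```

With the fitted model object available, predict(reg1, newdata = data.frame(business_regulation = 7)) should give the same value without copying coefficients by hand.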
The range of the data is from 2 to 10. Extrapolating beyond that range is a bad idea, since the fitted relationship is only supported by the observed data and may not hold for values outside it.
lm() to create a multiple regression model, adding these predictors to the one you have already. Show the results.
reg2 <- lm(economic_freedom ~ business_regulation + property_rights + sound_money, data = efwdata)
summary(reg2)
##
## Call:
## lm(formula = economic_freedom ~ business_regulation + property_rights +
## sound_money, data = efwdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65323 -0.23488 0.01183 0.25243 1.13545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.171766 0.046344 46.86 <2e-16 ***
## business_regulation 0.238101 0.009106 26.15 <2e-16 ***
## property_rights 0.079963 0.005987 13.36 <2e-16 ***
## sound_money 0.341376 0.005872 58.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.365 on 2434 degrees of freedom
## (1594 observations deleted due to missingness)
## Multiple R-squared: 0.8365, Adjusted R-squared: 0.8363
## F-statistic: 4150 on 3 and 2434 DF, p-value: < 2.2e-16
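Yes. Since there are now 2 additional variables that also affect the response variable, it makes sense that the coefficient (slope) on business_regulation is different (0.238 instead of 0.543): it now describes the relationship with economic freedom while holding the other predictors constant.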
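A small simulation can illustrate why a coefficient shrinks when a correlated predictor is added. Everything below is made up (x1, x2 and y are simulated, not EFW variables): y depends on both predictors with a true slope of 1 each, and x2 is correlated with x1, so the simple regression credits x1 with part of x2's effect.

```r
# Simulated data, for illustration only.
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)        # x2 is correlated with x1
y <- 1 * x1 + 1 * x2 + rnorm(n, sd = 0.5)  # true slope on each predictor is 1

# Slope on x1 when x2 is left out vs. when x2 is included.
coef_simple <- coef(lm(y ~ x1))["x1"]
coef_multiple <- coef(lm(y ~ x1 + x2))["x1"]

print(c(simple = unname(coef_simple), multiple = unname(coef_multiple)))
```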
reg3 <- lm(economic_freedom ~ business_regulation + property_rights + sound_money + continent, data = efwdata)
summary(reg3)
##
## Call:
## lm(formula = economic_freedom ~ business_regulation + property_rights +
## sound_money + continent, data = efwdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.46902 -0.20597 0.00467 0.21808 1.26360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.314391 0.045172 51.235 < 2e-16 ***
## business_regulation 0.216441 0.008756 24.718 < 2e-16 ***
## property_rights 0.085050 0.005620 15.133 < 2e-16 ***
## sound_money 0.310400 0.005751 53.977 < 2e-16 ***
## continentAsia 0.177680 0.020677 8.593 < 2e-16 ***
## continentEurope 0.317070 0.020927 15.151 < 2e-16 ***
## continentNorth America 0.440531 0.027466 16.039 < 2e-16 ***
## continentOceania 0.557941 0.052955 10.536 < 2e-16 ***
## continentSouth America 0.217968 0.028031 7.776 1.1e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3398 on 2429 degrees of freedom
## (1594 observations deleted due to missingness)
## Multiple R-squared: 0.8585, Adjusted R-squared: 0.8581
## F-statistic: 1843 on 8 and 2429 DF, p-value: < 2.2e-16