I – Some Warmup Questions

For this problem set we’ll use the Economic Freedom dataset, sourced from the Fraser Institute: https://www.fraserinstitute.org/economic-freedom/.

Your first order of action is loading the data into your R session. First download the data file and save it somewhere convenient on your computer (e.g. the same directory your code file is in). To load csv data you can use the read.csv() function. You must specify the file path to your data, in the form read.csv('path-to-my-data/mydata.csv'). If you don’t know how to find a file path, give it a google.

If you’re having trouble, try running the command getwd(), which will show you what directory your current R session is in. You can also use setwd() to manually specify a working directory.
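As a minimal sketch of how file paths and the working directory interact, the snippet below writes a tiny made-up csv to a temporary directory and reads it back in two ways. The file name demo.csv and its contents are invented purely for illustration.

```r
# Write a tiny made-up csv to a temporary directory (hypothetical data).
demo_dir  <- tempdir()
demo_path <- file.path(demo_dir, "demo.csv")
write.csv(data.frame(year = c(2016, 2017), score = c(7.1, 7.3)),
          demo_path, row.names = FALSE)

# Option 1: pass the full path to read.csv().
demo_full <- read.csv(demo_path)

# Option 2: move the session into that directory, then use the bare file name.
old_wd   <- setwd(demo_dir)   # setwd() returns the previous working directory
demo_rel <- read.csv("demo.csv")
setwd(old_wd)                 # restore the original working directory

identical(demo_full, demo_rel)
```

Both calls read the same file; the only difference is whether the directory is spelled out in the path or supplied by the session's working directory.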

In the chunk below load the dataset into R. If your data file is in the same directory as your code file, the sample code below should work—simply uncomment it and run it.

efwdata <- read.csv('efw.csv')
head(efwdata)

##   year ISO              country     continent economic_freedom rank
## 1 2017 AGO               Angola        Africa             4.83  158
## 2 2017 ALB              Albania        Europe             7.67   30
## 3 2017 ARE United Arab Emirates          Asia             7.17   61
## 4 2017 ARG            Argentina South America             5.67  146
## 5 2017 ARM              Armenia          Asia             7.70   27
## 6 2017 ARM              Armenia        Europe             7.70   27
##   quartile govt_size judicial_independence property_rights
## 1        4      6.76                  1.44            3.30
## 2        1      7.53                  2.48            4.57
## 3        2      5.85                  7.16            7.41
## 4        4      5.71                  3.59            4.38
## 5        1      7.40                  4.03            5.79
## 6        1      7.40                  4.03            5.79
##   military_interference reliable_police gender_legal_rights money_growth
## 1                  3.33            3.36                0.81         9.44
## 2                  8.33            6.82                0.95         9.25
## 3                  8.33            8.33                0.48         9.22
## 4                  7.50            3.70                0.79         5.01
## 5                  5.83            5.84                1.00         8.56
## 6                  5.83            5.84                1.00         8.56
##   inflation sound_money tariffs foreign_ownership_investment_restrictions
## 1      3.66        5.57    7.07                                      2.95
## 2      9.60        9.65    9.01                                      6.31
## 3      9.61        9.06    8.44                                      7.60
## 4      4.86        6.47    6.60                                      5.36
## 5      9.81        9.48    8.63                                      5.11
## 6      9.81        9.48    8.63                                      5.11
##   freedom_to_trade_internationally credit_market_regulations
## 1                             3.21                      6.73
## 2                             8.34                      9.72
## 3                             8.05                      6.70
## 4                             6.55                      6.09
## 5                             8.20                      9.26
## 6                             8.20                      9.26
##   tax_compliance business_regulation
## 1           6.78                4.88
## 2           7.18                6.65
## 3           9.87                8.31
## 4           6.51                5.72
## 5           7.06                6.95
## 6           7.06                6.95

Identify some variables in your data that are categorical, discrete, and continuous.

Categorical: Year, ISO, Country, Continent

Discrete: Rank, Quartile

Continuous: Everything else: economic freedom, govt. size, judicial independence, property rights, military interference, reliable police, gender rights, money growth, inflation, sound money, tariffs …

Which years are recorded in the data? Use the unique() function on the year variable, as shown in the sample code below:

unique(efwdata$year)

##  [1] 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004
## [15] 2003 2002 2001 2000 1995 1990 1985 1980 1975 1970

(optional) Some additional checks you can run on your data: use colnames() to check the column names; use class() to check the data type of a particular column; use str() to check the structure of your dataset—this will produce a table showing each column, its data type(s), and a preview of its observations. How are different variables (e.g. continuous, categorical, etc) stored in R?

colnames(efwdata)

##  [1] "year"                                     
##  [2] "ISO"                                      
##  [3] "country"                                  
##  [4] "continent"                                
##  [5] "economic_freedom"                         
##  [6] "rank"                                     
##  [7] "quartile"                                 
##  [8] "govt_size"                                
##  [9] "judicial_independence"                    
## [10] "property_rights"                          
## [11] "military_interference"                    
## [12] "reliable_police"                          
## [13] "gender_legal_rights"                      
## [14] "money_growth"                             
## [15] "inflation"                                
## [16] "sound_money"                              
## [17] "tariffs"                                  
## [18] "foreign_ownership_investment_restrictions"
## [19] "freedom_to_trade_internationally"         
## [20] "credit_market_regulations"                
## [21] "tax_compliance"                           
## [22] "business_regulation"

str(efwdata)

## 'data.frame':    4032 obs. of  22 variables:
##  $ year                                     : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ ISO                                      : Factor w/ 161 levels "AGO","ALB","ARE",..: 1 2 3 4 5 5 6 7 8 8 ...
##  $ country                                  : Factor w/ 161 levels "Albania","Algeria",..: 3 1 153 4 5 5 6 7 8 8 ...
##  $ continent                                : Factor w/ 6 levels "Africa","Asia",..: 1 3 2 6 2 3 5 3 2 3 ...
##  $ economic_freedom                         : num  4.83 7.67 7.17 5.67 7.7 7.7 8.07 7.71 6.34 6.34 ...
##  $ rank                                     : int  158 30 61 146 27 27 9 26 116 116 ...
##  $ quartile                                 : int  4 1 2 4 1 1 1 1 3 3 ...
##  $ govt_size                                : num  6.76 7.53 5.85 5.71 7.4 7.4 6.96 5.66 4.84 4.84 ...
##  $ judicial_independence                    : num  1.44 2.48 7.16 3.59 4.03 4.03 8.68 7.68 5.68 5.68 ...
##  $ property_rights                          : num  3.3 4.57 7.41 4.38 5.79 5.79 8.16 8.18 6.32 6.32 ...
##  $ military_interference                    : num  3.33 8.33 8.33 7.5 5.83 5.83 10 10 5 5 ...
##  $ reliable_police                          : num  3.36 6.82 8.33 3.7 5.84 5.84 8.62 8.46 6.17 6.17 ...
##  $ gender_legal_rights                      : num  0.81 0.95 0.48 0.79 1 1 1 1 0.67 0.67 ...
##  $ money_growth                             : num  9.44 9.25 9.22 5.01 8.56 8.56 8.95 8.22 9.19 9.19 ...
##  $ inflation                                : num  3.66 9.6 9.61 4.86 9.81 9.81 9.61 9.58 7.42 7.42 ...
##  $ sound_money                              : num  5.57 9.65 9.06 6.47 9.48 9.48 9.46 9.42 6.85 6.85 ...
##  $ tariffs                                  : num  7.07 9.01 8.44 6.6 8.63 8.63 8.84 8.23 7.98 7.98 ...
##  $ foreign_ownership_investment_restrictions: num  2.95 6.31 7.6 5.36 5.11 5.11 6.85 6.9 6.23 6.23 ...
##  $ freedom_to_trade_internationally         : num  3.21 8.34 8.05 6.55 8.2 8.2 7.56 8.09 7.29 7.29 ...
##  $ credit_market_regulations                : num  6.73 9.72 6.7 6.09 9.26 9.26 9.64 9.24 8 8 ...
##  $ tax_compliance                           : num  6.78 7.18 9.87 6.51 7.06 7.06 8.82 8.53 8.22 8.22 ...
##  $ business_regulation                      : num  4.88 6.65 8.31 5.72 6.95 6.95 8.05 7.5 7.46 7.46 ...

ISO and Country are stored as factors with 161 levels each; Continent is stored as a factor with 6 levels.

Year, Rank and Quartile are stored as integers (int).

Continuous variables are stored as numeric values (num).
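A small sketch of the same check on a toy data frame (the values are made up, not taken from the EFW data). One caveat: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so on a recent R installation text columns load as character rather than factor unless you pass stringsAsFactors = TRUE.

```r
# Toy data frame with made-up values (not from the EFW data).
toy <- data.frame(
  year      = c(2016L, 2017L, 2017L),               # whole numbers -> integer
  continent = factor(c("Asia", "Europe", "Asia")),  # categories    -> factor
  score     = c(7.12, 6.80, 7.45)                   # decimals      -> numeric
)

sapply(toy, class)       # storage class of every column
levels(toy$continent)    # the distinct categories of the factor
nlevels(toy$continent)   # how many categories there are
```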

II – Simple Visualizations

Filter your dataset for observations from the year 2017, and assign your filtered observations to a new object with an appropriate name. Below is some sample code showing two ways to do this:

## load dplyr, which provides filter() and the pipe
library(dplyr)

## one way 
efw2017 <- filter(efwdata, year == 2017)

## another way, using a pipe
efw2017 <- efwdata %>% filter(year == 2017)

The latter method uses a pipe, %>%. The pipe comes from the magrittr package and is re-exported by dplyr; it forwards or “pipes” the value on its left hand side into the expression(s) on its right hand side. The advantage of using a pipe may not be apparent in the simple example above, but when chaining many functions one after another it’s far more readable than nesting the calls (as you will soon see).
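To make the readability point concrete, here is a small sketch using R's built-in mtcars data (chosen only because it ships with R; it is unrelated to the EFW data). It assumes dplyr is installed, and runs the same three steps once as nested calls and once as a pipe chain.

```r
library(dplyr)

# Nested calls read inside-out: filter first, then select, then arrange.
nested <- arrange(select(filter(mtcars, cyl == 4), mpg, wt), desc(mpg))

# The piped version reads top to bottom in the order the steps run.
piped <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt) %>%
  arrange(desc(mpg))

identical(nested, piped)
```

The two objects are the same; the pipe only changes how the code is written, not what it computes.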

To make a plot use the ggplot() function (note although the package name is ggplot2, the function call is not appended with a “2”). The ggplot() function takes two arguments:

data – a data frame whose variables you want to plot
mapping – an aesthetic mapping for which variable(s) go on which axes

You’ll also need to specify a geometric mapping, in the form + geom_XXX(), to specify the kind of plot you want (i.e. to geometrically map the data onto the \(x\) and \(y\) axes). The following examples will demonstrate how to use ggplot().

Let’s now visualize the distribution of economic freedom scores in 2017 by plotting a histogram. Below is some sample code to help you do this. Note histograms only require an aesthetic mapping for the \(x\)-axis (since the \(y\)-axis is simply frequency).

ggplot(data = efw2017, mapping = aes(x = economic_freedom)) + 
   geom_histogram(bins = 50)

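Now run the following code, which produces a histogram similar to the one above, but with density (relative frequency divided by bin width) on the \(y\)-axis instead of frequency; this is often called a relative frequency histogram. You can do this by specifying aes(y = ..density..) in the geometric mapping (from ggplot2 3.4.0 onward the preferred spelling is aes(y = after_stat(density)), and the older form is deprecated). Note also the additional aesthetic parameters, which change the labels/colors/theme of the plot.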

 ggplot(data = efw2017, mapping = aes(x = economic_freedom)) + 
   geom_histogram(bins = 50, aes(y = ..density..), fill = 'violet') +
   ggtitle('distribution of economic freedom scores 2017') +
   xlab('economic freedom index (out of 10)') +
   theme_light()

What is the area contained by this relative frequency histogram?

The total area contained by this relative frequency histogram is 1.
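The claim can be checked numerically. The sketch below uses base R's hist() on simulated scores (made up, not the EFW data) and adds up bar height times bar width across all bins.

```r
set.seed(1)
x <- rnorm(200, mean = 7, sd = 1)   # made-up scores, not the EFW data

h <- hist(x, breaks = 20, plot = FALSE)   # compute the bins without drawing

bar_widths <- diff(h$breaks)              # width of each bin
bar_areas  <- h$density * bar_widths      # area of each bar
total_area <- sum(bar_areas)
total_area
```

Each bar's area equals the share of observations falling in that bin, so the areas must sum to 1 whatever the bin width.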

Now run the following code, which uses facet_wrap() to create individual plots for each continent:

ggplot(data = efw2017, aes(x = economic_freedom)) +
   geom_histogram(bins = 50) +
   facet_wrap(~continent)

Describe what you see. Are the distributions similar across the categories?

The distributions for Africa, Asia and Europe are similar to one another, while those for North America, Oceania and South America are similar to one another.

Make a box and whisker plot of economic_freedom (numeric) on continent (categorical). Use geom_boxplot().

ggplot(data = efwdata, mapping = aes(x = continent, y = economic_freedom)) + 
  geom_boxplot()

## Warning: Removed 765 rows containing non-finite values (stat_boxplot).

What do box and whisker plots show that histograms don’t? What do the lines on a box and whisker plot represent?

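The boxplot summarizes the centre and spread of a distribution and makes skew and outliers easy to spot, which also makes several groups easy to compare side by side. The line inside the box is the median, and the edges of the box are the first and third quartiles, so the box spans the interquartile range (IQR). By default in geom_boxplot(), the “whiskers” extend to the most extreme observations within 1.5 × IQR of the box edges; any points beyond the whiskers are drawn individually as possible outliers.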
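As a rough illustration of the numbers behind a box and whisker plot, the sketch below uses base R's boxplot.stats() on a small made-up vector with one extreme value. It applies the 1.5 × IQR rule; ggplot2 follows the same idea, although its quartile calculation can differ slightly on small samples.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up values with one extreme point

bs <- boxplot.stats(x)   # uses the 1.5 * IQR rule by default (coef = 1.5)

# lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
bs$stats

# points beyond the whiskers, which a boxplot draws individually
bs$out
```

Here the upper whisker stops at 9 rather than at the maximum, and 30 is reported separately as an outlier.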

Line plots are a good way to visualize how a variable evolves over time. To make a line plot you should specify + geom_line() as the geometric mapping.

Let’s visualize how the economic freedom score has evolved over time in a particular country. Choose a country from the dataset, and create a filtered dataset with observations from that country only. Then make a line plot of economic_freedom (on the \(y\)-axis) on year (on the \(x\)-axis) for this country’s data. Use geom_line(). What do you see? Does the score change over the years recorded in the data?

koreaefw <- filter(efwdata, country == "Korea, South")
ggplot(data = koreaefw, aes(x = year, y = economic_freedom)) + 
   geom_line()

The economic freedom score decreases from 1970-1975, then starts on an overall increasing trend. While there are ups and downs after 2005, the overall trend is positive.

(optional) Make the same plot as in q9, this time fitting a smoothed line to the data.

ggplot(data = koreaefw, aes(x = year, y = economic_freedom)) + 
   geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

(optional) Compare how economic freedom has evolved over time across several countries. Choose some countries from the dataset, and create a filtered dataset with observations from these countries only. Make a line plot similar to the ones above, this time specifying a color argument in the aesthetic mapping, e.g. mapping = aes(x = ..., y = ..., color = country)—this will use a different color for each country.
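A minimal sketch of such a plot, assuming ggplot2 is installed. The two countries and their scores are entirely made up so that the example runs on its own; they are not values from the EFW data.

```r
library(ggplot2)

# Made-up scores for two hypothetical countries (not from the EFW data).
toy_countries <- data.frame(
  country          = rep(c("Country A", "Country B"), each = 4),
  year             = rep(c(2000, 2005, 2010, 2015), times = 2),
  economic_freedom = c(6.1, 6.4, 6.6, 6.9, 7.2, 7.1, 7.4, 7.5)
)

# color = country draws one line per country, each in its own color.
p <- ggplot(data = toy_countries,
            mapping = aes(x = year, y = economic_freedom, color = country)) +
  geom_line()
p
```

With the real data, the filtered dataset could be built with something like filter(efwdata, country %in% c(...)), using country names spelled exactly as they appear in the country column.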

III – Summary Statistics

Say you had a set of observations, \(x_1, x_2, ..., x_n\), which had some mean \(\bar x\). How would you anticipate the mean changing if you multiplied all the observations by 2? What about if you added 5 to all the observations?

The mean would be multiplied by 2 if all the observations were multiplied by 2. If 5 were added to all the observations, the mean would increase by 5.
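A quick check on a small made-up vector:

```r
x <- c(1, 2, 3, 6)   # made-up observations

mean(x)       # 3
mean(2 * x)   # 6, which is 2 * mean(x)
mean(x + 5)   # 8, which is mean(x) + 5
```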

Use mean() and median() to compute the mean and median values of the variable sound_money in 2017 (use the filtered 2017 dataset you created earlier). Are the values similar, or are they different? If they are different, can you think of a reason why?

mean(efw2017$sound_money)

## [1] 8.313214

median(efw2017$sound_money)

## [1] 8.825

The mean and median are similar, but not identical: the mean (8.31) is lower than the median (8.83). A likely reason is that a few low outliers skew the distribution to the left, pulling the mean below the median.

Plot a relative frequency histogram of sound_money in 2017 to visualize its distribution. Add two vertical lines to your plot showing where the mean and median are. You can use geom_vline() to add vertical lines and geom_text() to annotate your plot. Below is some sample code to help you do this.

ggplot(data = efw2017, aes(x = sound_money)) +
   geom_histogram(bins = 30, aes(y = ..density..)) +
   geom_vline(xintercept = mean(efw2017$sound_money), color = 'red') +
   ## y = Inf pins each label to the top of the panel, whatever the density scale
   geom_text(x = mean(efw2017$sound_money)-0.5, y = Inf, vjust = 1.5, color = 'red', label = 'mean') + 
   geom_vline(xintercept = median(efw2017$sound_money), color = 'blue') +
   geom_text(x = median(efw2017$sound_money)+0.5, y = Inf, vjust = 1.5, color = 'blue', label = 'median')

In the above plot, what is the area contained by the histogram to the left of the median? What does this imply?

The area to the left of the median is 0.5. This implies that half of the observations have a sound money score below the median of 8.825, and the other half have a score above it.

Based on your results, comment on whether the mean or median is a more appropriate measure of central tendency in this case.

I feel like the median is a more appropriate measure of central tendency, since there is a major outlier on the left that pulls the mean down. To measure “central tendency”, the median is the better choice because it reflects the “central” data and is not affected by the small outlier.

Use sd() to compute the standard deviation of the variable sound_money in 2017.

sd(efw2017$sound_money)

## [1] 1.392292

How would you anticipate the standard deviation changing if you multiplied all the observations by 2? What about if you added 5 to all the observations?

The standard deviation would be multiplied by 2 if all the observations were multiplied by 2, since the distance from each observation to the mean is also multiplied by 2. However, the standard deviation would stay the same if 5 were added to all the observations, since the distance from each observation to the mean wouldn’t change.
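The same reasoning can be checked on a small made-up vector:

```r
x <- c(1, 2, 3, 6)   # made-up observations

sd(x)
sd(2 * x)   # equals 2 * sd(x): every distance to the mean doubles
sd(x + 5)   # equals sd(x): shifting the data leaves the distances unchanged
```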

(optional) Let’s say you want to compare the average economic freedom score across each of the continents for each year in the data. To do this you’ll need to create an aggregated dataset. In this case your grouping variables are continent and year, and you’ll want to calculate a mean for the nongrouping variable, economic_freedom. First, select the relevant variables from the full dataset. Then use group_by() to specify the grouping variables and summarize() to specify the function you want to perform on the nongrouping variables. Below is some sample code to help you do this.

 efw_aggregated <- efwdata %>%
   select(continent, year, economic_freedom) %>%
   group_by(continent, year) %>%
   summarize(avg_economic_freedom = mean(economic_freedom, na.rm = TRUE))

(optional) Using your aggregated dataset, make a smoothed line plot of avg_economic_freedom on year, using different colors for each continent. What do you see?

ggplot(data = efw_aggregated, mapping = aes(x = year, y = avg_economic_freedom, color= continent)) + 
   geom_smooth() + 
   theme_light()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

IV – Tests

For the test described above, state the null and alternative hypotheses.
\(H_0\): The true mean economic freedom score of 2017 is 4.5

\(H_1\): The true mean economic freedom score of 2017 isn’t 4.5

Use t.test() to perform the hypothesis test. Below is some sample code to help you:

t.test(efw2017$economic_freedom, mu = 4.5)

## 
##  One Sample t-test
## 
## data:  efw2017$economic_freedom
## t = 32.377, df = 167, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 4.5
## 95 percent confidence interval:
##  6.675011 6.957489
## sample estimates:
## mean of x 
##   6.81625

What is the observed value in this test? Is it close to your hypothesized value, or is it far off?

The observed value is 6.81625. It is very far off from the proposed 4.5

What is the \(p\)-value of this test? How might you interpret this \(p\)-value? Does it say the observed result is likely or unlikely under the null hypothesis?

The \(p\)-value is smaller than 2.2e-16, which is far below 0.05. Therefore, the observed result is extremely unlikely under the null hypothesis.

Decide whether to reject your null hypothesis, and say why.

Since the \(p\)-value is less than 0.05, a result at least as extreme as the one observed would occur less than 5% of the time if the null hypothesis were true. Therefore, we reject the null hypothesis that the true mean economic freedom score of 2017 is 4.5.
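For intuition about where the t statistic and \(p\)-value come from, the sketch below recomputes them by hand on a simulated sample (made up, not the EFW data) and compares the result with t.test(). The names scores and mu0 are hypothetical.

```r
set.seed(42)
scores <- rnorm(50, mean = 6.8, sd = 0.9)   # made-up sample, not the EFW data
mu0    <- 4.5                                # hypothesized mean under H0

n      <- length(scores)
se     <- sd(scores) / sqrt(n)               # standard error of the mean
t_stat <- (mean(scores) - mu0) / se          # distance from mu0 in standard errors
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value

res <- t.test(scores, mu = mu0)
c(by_hand = t_stat, t_test = unname(res$statistic))
```

A large t statistic means the sample mean sits many standard errors away from the hypothesized value, which is why the \(p\)-value becomes so small.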

V – Association

Make a scatterplot of economic_freedom on business_regulation. Use geom_point(). What do you see? Are the variables associated? If so, what kind of relationship do they have?

ggplot(data = efwdata, mapping = aes(x = business_regulation, y = economic_freedom)) + 
   geom_point()

## Warning: Removed 1450 rows containing missing values (geom_point).

Compute Pearson’s correlation coefficient between the two variables you plotted in q1. What is the result? What does it imply?

cor(efwdata$business_regulation, efwdata$economic_freedom, 
    use = "complete.obs")

## [1] 0.7321065
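As a sketch of what cor() computes, the snippet below evaluates Pearson's coefficient both with the built-in function and directly from its definition, on made-up paired values that include one missing entry. It also shows what use = "complete.obs" does: pairs with a missing value are dropped before the calculation.

```r
# Made-up paired values with one missing entry (not the EFW data).
x <- c(2.0, 4.5, 5.0, 6.5, NA, 8.0)
y <- c(4.1, 5.9, 6.2, 6.8, 7.0, 8.1)

r_builtin <- cor(x, y, use = "complete.obs")   # drops the pair containing NA

# Same calculation by hand on the complete pairs.
ok <- complete.cases(x, y)
xc <- x[ok]
yc <- y[ok]
r_manual <- sum((xc - mean(xc)) * (yc - mean(yc))) /
  sqrt(sum((xc - mean(xc))^2) * sum((yc - mean(yc))^2))

c(r_builtin, r_manual)
```

The coefficient always lies between -1 and 1. For a simple regression with one predictor, its square equals the model's R-squared: here 0.7321065 squared is about 0.536.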

VI – Linear Models

When two variables are related, you can build a statistical model that identifies the mathematical relationship between them. When variables are linearly related, you can model them with a linear function (a straight line). This modeling technique is known as linear regression.

You may be familiar with the mathematical equation for a straight line, \(y = mx + c\), where \(m\) is the slope and \(c\) is the \(y\)-intercept. In this notation \(y\) is usually called the dependent variable and \(x\) the independent variable, since \(y\) is expressed as a function of \(x\).

In linear regression there is slightly different terminology—but the idea is the same. A simple linear model has one response variable (\(y\)) and one predictor variable or explanatory variable (\(x\)). The response variable is specified as a function of the predictor.

First let’s try to visualize a linear model. Pick two variables you found in the previous section that were demonstrably correlated with each other. If you were going to model their relationship with a linear function, which variable should you use as the predictor (\(x\)) and which the response (\(y\))? Why?

The \(x\) or explanatory variable would be business_regulation and the \(y\) or response variable would be economic_freedom, because in the previous graph economic freedom increased as business regulation increased.

Make a scatterplot of these two variables using geom_point(). Add another geometric mapping, stat_smooth(method = 'lm'). This will overlay the scatterplot with a straight line that is fitted using the “lm” method (linear model).

ggplot(data = efwdata, mapping = aes(x = business_regulation, y = economic_freedom)) + 
   geom_point()+
   stat_smooth(method = 'lm')

## Warning: Removed 1450 rows containing non-finite values (stat_smooth).

## Warning: Removed 1450 rows containing missing values (geom_point).

The error term describes the vertical distance from the data points to the regression line. Each point has its own error term. What is the sum of all the error terms?

The sum of all the error terms, or “residuals”, is 0 (for a least-squares line fitted with an intercept).
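This can be verified on any least-squares fit that includes an intercept. The sketch below uses R's built-in mtcars data purely for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # built-in data, used only for illustration

res <- residuals(fit)   # observed y minus fitted y, one value per point
sum(res)                # zero up to floating-point rounding
```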

Use lm() to compute the coefficients of the regression line you plotted in q2. Below is some sample code to help you. Report your results by calling summary() on the saved regression data. Are the results what you expect?

reg1 <- lm(economic_freedom ~ business_regulation, data = efwdata)
summary(reg1)

## 
## Call:
## lm(formula = economic_freedom ~ business_regulation, data = efwdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5380 -0.3992  0.0630  0.4634  1.9145 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.450669   0.063054   54.73   <2e-16 ***
## business_regulation 0.542963   0.009946   54.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6332 on 2580 degrees of freedom
##   (1450 observations deleted due to missingness)
## Multiple R-squared:  0.536,  Adjusted R-squared:  0.5358 
## F-statistic:  2980 on 1 and 2580 DF,  p-value: < 2.2e-16

Predict the value of your chosen response variable for some arbitrary value of your explanatory variable. You don’t need to write code to do this—the computation should be simple enough to do by hand (just plug in your coefficients to the linear function).

\(y\) = 0.543\(x\) + 3.45

If business_regulation = 7, the predicted economic freedom score is 0.543 × 7 + 3.45 = 7.251.
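The same hand calculation can be written in R. The coefficients below are copied from the summary(reg1) output above; the variable names intercept, slope and x_new are just illustrative.

```r
# Coefficients copied from the summary(reg1) output above.
intercept <- 3.450669
slope     <- 0.542963

x_new <- 7                            # chosen business_regulation score
y_hat <- intercept + slope * x_new    # predicted economic_freedom
y_hat
```

With the fitted model object in your session, predict(reg1, newdata = data.frame(business_regulation = 7)) should give the same value, about 7.251.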

When answering q5, hopefully you chose a value for the predictor that is within the range of observed data. Take a look at your scatterplot—what is the range of values on the \(x\)-axis? Making predictions outside this range is known as extrapolation. Can you think of a reason why extrapolation might be a bad idea?

The range of the data is from 2 to 10. Extrapolation is a bad idea because the fitted relationship is only supported by observations within this range; outside it, the relationship may not hold or may no longer be linear.

Choose a few numeric variables that are correlated with the response variable—other than the one you’ve already used. Use lm() to create a multiple regression model, adding these predictors to the one you have already. Show the results.

reg2 <- lm(economic_freedom ~ business_regulation + property_rights + sound_money, data = efwdata)
summary(reg2)

## 
## Call:
## lm(formula = economic_freedom ~ business_regulation + property_rights + 
##     sound_money, data = efwdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65323 -0.23488  0.01183  0.25243  1.13545 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.171766   0.046344   46.86   <2e-16 ***
## business_regulation 0.238101   0.009106   26.15   <2e-16 ***
## property_rights     0.079963   0.005987   13.36   <2e-16 ***
## sound_money         0.341376   0.005872   58.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.365 on 2434 degrees of freedom
##   (1594 observations deleted due to missingness)
## Multiple R-squared:  0.8365, Adjusted R-squared:  0.8363 
## F-statistic:  4150 on 3 and 2434 DF,  p-value: < 2.2e-16

Is the coefficient on your original predictor (the one in the simple model from q4) different in the multiple model? If so, can you think of a reason why?

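Yes: it falls from 0.543 in the simple model to 0.238 in the multiple model. In a multiple regression each coefficient describes the change in the response for a one-unit change in that predictor while the other predictors are held constant. Since property_rights and sound_money also help explain economic_freedom and are likely correlated with business_regulation, part of the association that the simple model attributed to business_regulation is now attributed to the other two predictors.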
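One common reason a coefficient changes when predictors are added is that the predictors are correlated with each other. The simulation below is a sketch of that effect with made-up variables x1, x2 and y (not the EFW data): the true effect of x1 is 0.5, but the simple model overstates it because x1 also stands in for the omitted x2.

```r
set.seed(123)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)   # x2 is correlated with x1
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(n, sd = 0.3)

simple   <- lm(y ~ x1)        # x1 also absorbs the effect of the omitted x2
multiple <- lm(y ~ x1 + x2)   # x1's coefficient now holds x2 constant

coef(simple)["x1"]     # well above 0.5
coef(multiple)["x1"]   # close to the true value of 0.5
```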

Add a categorical variable to your multiple regression model and show the results (don’t use country; try to choose one with relatively few categories, like continent). You’ll see that R treats each category, apart from one baseline category, as a separate predictor with its own coefficient.

reg3 <- lm(economic_freedom ~ business_regulation + property_rights + sound_money + continent, data = efwdata)
summary(reg3)

## 
## Call:
## lm(formula = economic_freedom ~ business_regulation + property_rights + 
##     sound_money + continent, data = efwdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46902 -0.20597  0.00467  0.21808  1.26360 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.314391   0.045172  51.235  < 2e-16 ***
## business_regulation    0.216441   0.008756  24.718  < 2e-16 ***
## property_rights        0.085050   0.005620  15.133  < 2e-16 ***
## sound_money            0.310400   0.005751  53.977  < 2e-16 ***
## continentAsia          0.177680   0.020677   8.593  < 2e-16 ***
## continentEurope        0.317070   0.020927  15.151  < 2e-16 ***
## continentNorth America 0.440531   0.027466  16.039  < 2e-16 ***
## continentOceania       0.557941   0.052955  10.536  < 2e-16 ***
## continentSouth America 0.217968   0.028031   7.776  1.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3398 on 2429 degrees of freedom
##   (1594 observations deleted due to missingness)
## Multiple R-squared:  0.8585, Adjusted R-squared:  0.8581 
## F-statistic:  1843 on 8 and 2429 DF,  p-value: < 2.2e-16
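Note that Africa does not appear among the coefficients above: as the first level of the continent factor it serves as the baseline, and each continent coefficient is the estimated difference from Africa with the other predictors held constant. The sketch below uses model.matrix() on a few made-up rows to show the 0/1 indicator columns that lm() builds from a factor.

```r
# Made-up rows; only the continent labels matter here.
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Europe", "Asia", "Africa"))
)

# model.matrix() shows the 0/1 indicator columns that lm() builds internally.
mm <- model.matrix(~ continent, data = toy)
mm
colnames(mm)
```

Rows for the baseline level have 0 in every indicator column, so their prediction relies on the intercept and the numeric predictors alone.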