Welcome to R. R is very different from SAS. R is an interpreted language, not a compiled one. This means, you type something into R and it does it. There is no data step. There are no procs. The SAS and R book is very useful for going between the two programs. R uses libraries to do different types of analysis, so we will need to install lots of different libraries to do different things. These need to be downloaded from the internet, using the install.packages() command. You only need to install a package once. E.g.

install.packages("lme4") will install the lme4 library. To use the functions within it, type

library(lme4)

Now you have access to those fuctions.

Below we will go through a simple R session where we basically review much of what we did last semester. We will load a dataset, print some cases, do some descriptice statistics and plots, t tests, and some linear models and diagnostics of those models.

#Load some libraries that I need
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library (car)
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
library(sandwich)
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: TH.data
#Read in a Commma separated values file, In this case, I am using the Population Reference Bureau's Population Data sheet from 2008.
dat<-read.csv("~/Google Drive/dem7283/PRB2008_All.csv")


#print all of the variable names in the data set
names(dat)
##  [1] "Y"                                
##  [2] "X"                                
##  [3] "ID"                               
##  [4] "Country"                          
##  [5] "Continent"                        
##  [6] "Region"                           
##  [7] "Year"                             
##  [8] "Population."                      
##  [9] "CBR"                              
## [10] "CDR"                              
## [11] "Rate.of.natural.increase"         
## [12] "Net.Migration.Rate"               
## [13] "ProjectedPopMid2025"              
## [14] "ProjectedPopMid2050"              
## [15] "ProjectedPopChange_08_50Perc"     
## [16] "IMR"                              
## [17] "WomandLifeTimeRiskMaternalDeath"  
## [18] "TFR"                              
## [19] "PercPopLT15"                      
## [20] "PercPopGT65"                      
## [21] "e0Total"                          
## [22] "e0Male"                           
## [23] "e0Female"                         
## [24] "PercUrban"                        
## [25] "PercPopinUrbanGT750k"             
## [26] "PercPop1549HIVAIDS2001"           
## [27] "PercPop1549HIVAIDS2007"           
## [28] "PercMarWomContraALL"              
## [29] "PercMarWomContraModern"           
## [30] "PercPpUnderNourished0204"         
## [31] "MotorVehper1000Pop0005"           
## [32] "PercPopwAccessImprovedWaterSource"
## [33] "GNIPPPperCapitaUSDollars"         
## [34] "PopDensPerSqKM"                   
## [35] "PopDensPerSqMile"
#look at the first 5 cases
head(dat, n=5)
##   Y X  ID     Country Continent             Region Year Population. CBR
## 1 1 1 115 Afghanistan      Asia South Central Asia 2008        32.7  47
## 2 2 2 178     Albania    Europe    Southern Europe 2008         3.2  13
## 3 3 3   1     Algeria    Africa  NORTHERN AFRICA   2008        34.7  22
## 4 4 4 179     Andorra    Europe    Southern Europe 2008         0.1  10
## 5 5 5  43      Angola    Africa      MIDDLE AFRICA 2008        16.8  47
##   CDR Rate.of.natural.increase Net.Migration.Rate ProjectedPopMid2025
## 1  21                      2.6                  0                50.3
## 2   6                      0.7                 -3                 3.5
## 3   4                      1.8                 -1                43.3
## 4   3                      0.7                 26                 0.1
## 5  21                      2.7                  2                26.2
##   ProjectedPopMid2050 ProjectedPopChange_08_50Perc   IMR
## 1                81.9                          150 163.0
## 2                 3.6                           11   8.0
## 3                50.1                           44  27.0
## 4                 0.1                           -4   2.5
## 5                42.7                          155 132.0
##   WomandLifeTimeRiskMaternalDeath TFR PercPopLT15 PercPopGT65 e0Total
## 1                               8 6.8          45           2      43
## 2                             490 1.6          27           8      75
## 3                             220 2.3          30           5      72
## 4                              NA 1.2          15          12      NA
## 5                              12 6.8          46           2      43
##   e0Male e0Female PercUrban PercPopinUrbanGT750k PercPop1549HIVAIDS2001
## 1     43       43        20                   12                     NA
## 2     72       79        45                   NA                     NA
## 3     71       74        63                   12                    0.1
## 4     NA       NA        90                   NA                     NA
## 5     41       44        57                   27                    1.6
##   PercPop1549HIVAIDS2007 PercMarWomContraALL PercMarWomContraModern
## 1                     NA                  10                      9
## 2                     NA                  75                      8
## 3                    0.1                  61                     52
## 4                     NA                  NA                     NA
## 5                    2.1                   6                      5
##   PercPpUnderNourished0204 MotorVehper1000Pop0005
## 1                       NA                      6
## 2                        6                     85
## 3                        4                     91
## 4                       NA                    750
## 5                       35                     NA
##   PercPopwAccessImprovedWaterSource GNIPPPperCapitaUSDollars
## 1                                22                       NA
## 2                                97                     6580
## 3                                85                     5490
## 4                               100                       NA
## 5                                51                     4400
##   PopDensPerSqKM PopDensPerSqMile
## 1             50           129.50
## 2            113           292.67
## 3             15            38.85
## 4            182           471.38
## 5             13            33.67
#Frequency Table of # of Contries by Continent
table(dat$Continent)
## 
##        Africa          Asia        Europe North America       Oceania 
##            56            51            45            27            17 
## South America 
##            13
#basic summary statistics
summary(dat)
##        Y             X             ID                       Country   
##  Min.   :  1   Min.   :  1   Min.   :  1.0   Afghanistan        :  1  
##  1st Qu.: 53   1st Qu.: 53   1st Qu.: 53.0   Albania            :  1  
##  Median :105   Median :105   Median :104.0   Algeria            :  1  
##  Mean   :105   Mean   :105   Mean   :105.2   Andorra            :  1  
##  3rd Qu.:157   3rd Qu.:157   3rd Qu.:158.0   Angola             :  1  
##  Max.   :209   Max.   :209   Max.   :210.0   Antigua and Barbuda:  1  
##                                              (Other)            :203  
##          Continent               Region         Year     
##  Africa       :56   EASTERN AFRICA  : 19   Min.   :2008  
##  Asia         :51   Western Asia    : 18   1st Qu.:2008  
##  Europe       :45   Carribean       : 17   Median :2008  
##  North America:27   Oceania         : 17   Mean   :2008  
##  Oceania      :17   WESTERN AFRICA  : 16   3rd Qu.:2008  
##  South America:13   Southern Europe : 15   Max.   :2008  
##                     (Other)         :107                 
##   Population.           CBR             CDR       
##  Min.   :   0.01   Min.   : 8.00   Min.   : 2.00  
##  1st Qu.:   0.90   1st Qu.:13.00   1st Qu.: 6.00  
##  Median :   5.90   Median :21.00   Median : 8.00  
##  Mean   :  32.08   Mean   :23.17   Mean   : 9.11  
##  3rd Qu.:  20.70   3rd Qu.:31.00   3rd Qu.:10.00  
##  Max.   :1324.70   Max.   :50.00   Max.   :31.00  
##                                                   
##  Rate.of.natural.increase Net.Migration.Rate ProjectedPopMid2025
##  Min.   :-0.600           Min.   :-18        Min.   :   0.01    
##  1st Qu.: 0.500           1st Qu.: -1        1st Qu.:   1.10    
##  Median : 1.400           Median :  0        Median :   7.70    
##  Mean   : 1.407           Mean   :  1        Mean   :  38.27    
##  3rd Qu.: 2.200           3rd Qu.:  2        3rd Qu.:  26.20    
##  Max.   : 3.600           Max.   : 41        Max.   :1476.00    
##                           NA's   :1                             
##  ProjectedPopMid2050 ProjectedPopChange_08_50Perc      IMR        
##  Min.   :   0.02     Min.   :-35.00               Min.   :  1.30  
##  1st Qu.:   1.30     1st Qu.:  6.00               1st Qu.:  7.20  
##  Median :   8.80     Median : 38.00               Median : 20.00  
##  Mean   :  44.74     Mean   : 47.84               Mean   : 35.39  
##  3rd Qu.:  34.70     3rd Qu.: 75.00               3rd Qu.: 58.50  
##  Max.   :1755.20     Max.   :263.00               Max.   :163.00  
##                                                   NA's   :2       
##  WomandLifeTimeRiskMaternalDeath      TFR         PercPopLT15   
##  Min.   :    7.0                 Min.   :1.000   Min.   :13.00  
##  1st Qu.:   46.5                 1st Qu.:1.775   1st Qu.:20.00  
##  Median :  290.0                 Median :2.500   Median :30.00  
##  Mean   : 3462.9                 Mean   :3.032   Mean   :29.93  
##  3rd Qu.: 4450.0                 3rd Qu.:4.000   3rd Qu.:39.00  
##  Max.   :47600.0                 Max.   :7.100   Max.   :49.00  
##  NA's   :38                      NA's   :1                      
##   PercPopGT65        e0Total          e0Male        e0Female    
##  Min.   : 1.000   Min.   :33.00   Min.   :33.0   Min.   :34.00  
##  1st Qu.: 4.000   1st Qu.:61.50   1st Qu.:59.0   1st Qu.:63.50  
##  Median : 6.000   Median :72.00   Median :69.0   Median :74.00  
##  Mean   : 7.512   Mean   :67.85   Mean   :65.6   Mean   :70.15  
##  3rd Qu.:11.000   3rd Qu.:75.00   3rd Qu.:73.0   3rd Qu.:79.00  
##  Max.   :22.000   Max.   :82.00   Max.   :80.0   Max.   :86.00  
##                   NA's   :2       NA's   :2      NA's   :2      
##    PercUrban      PercPopinUrbanGT750k PercPop1549HIVAIDS2001
##  Min.   : 10.00   Min.   :  3.00       Min.   : 0.100        
##  1st Qu.: 35.75   1st Qu.: 12.00       1st Qu.: 0.100        
##  Median : 57.50   Median : 19.00       Median : 0.500        
##  Mean   : 55.94   Mean   : 24.12       Mean   : 2.434        
##  3rd Qu.: 74.50   3rd Qu.: 31.00       3rd Qu.: 1.725        
##  Max.   :100.00   Max.   :100.00       Max.   :26.500        
##  NA's   :1        NA's   :87           NA's   :81            
##  PercPop1549HIVAIDS2007 PercMarWomContraALL PercMarWomContraModern
##  Min.   : 0.100         Min.   : 3.00       Min.   : 1.00         
##  1st Qu.: 0.100         1st Qu.:31.00       1st Qu.:17.75         
##  Median : 0.500         Median :53.00       Median :39.00         
##  Mean   : 2.145         Mean   :49.44       Mean   :38.61         
##  3rd Qu.: 1.700         3rd Qu.:70.00       3rd Qu.:58.00         
##  Max.   :26.100         Max.   :90.00       Max.   :90.00         
##  NA's   :69             NA's   :50          NA's   :41            
##  PercPpUnderNourished0204 MotorVehper1000Pop0005
##  Min.   : 2.50            Min.   :  1.0         
##  1st Qu.: 3.00            1st Qu.: 54.5         
##  Median : 8.00            Median :159.0         
##  Mean   :14.83            Mean   :249.8         
##  3rd Qu.:23.00            3rd Qu.:454.5         
##  Max.   :75.00            Max.   :787.0         
##  NA's   :36               NA's   :78            
##  PercPopwAccessImprovedWaterSource GNIPPPperCapitaUSDollars
##  Min.   : 22.00                    Min.   :  290           
##  1st Qu.: 73.00                    1st Qu.: 2110           
##  Median : 91.00                    Median : 5950           
##  Mean   : 84.04                    Mean   :11878           
##  3rd Qu.: 99.00                    3rd Qu.:15185           
##  Max.   :100.00                    Max.   :64400           
##  NA's   :31                        NA's   :33              
##  PopDensPerSqKM    PopDensPerSqMile  
##  Min.   :    2.0   Min.   :    5.18  
##  1st Qu.:   31.0   1st Qu.:   80.29  
##  Median :   77.0   Median :  199.43  
##  Mean   :  472.9   Mean   : 1224.67  
##  3rd Qu.:  195.0   3rd Qu.:  505.05  
##  Max.   :34000.0   Max.   :88060.00  
## 
#just want a mean
mean(dat$IMR, na.rm=T)
## [1] 35.38744
#standard deviation
sd(dat$IMR, na.rm=T)
## [1] 35.4421
#Quantiles
quantile(dat$IMR, na.rm=T)
##    0%   25%   50%   75%  100% 
##   1.3   7.2  20.0  58.5 163.0
#histogram of the infant mortality rate
hist(dat$IMR, main="Histogram of Infant Mortality Rate")

#Box plot for IMR* Continent
boxplot(IMR~Continent, dat,main="Boxplot of Infant Mortality Rate by continent")

#scatter plot of IMR * TFR
plot(IMR~TFR, data=dat, main="Bivariate Association between TFR and IMR")

#t-test for Africa vs Rest of the world
#Useing the I() funciton, which generates a T/F value, i.e. 2 groups, Africa or Not Africa
t.test(IMR~I(Continent=="Africa"), dat)
## 
##  Welch Two Sample t-test
## 
## data:  IMR by I(Continent == "Africa")
## t = -11.5883, df = 74.547, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -64.50901 -45.58185
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            20.76184            75.80727
#Simple ANOVA model version of a t-test
fit<-lm(IMR~I(Continent=="Africa"), dat)
summary(fit)
## 
## Call:
## lm(formula = IMR ~ I(Continent == "Africa"), data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.807 -15.612  -4.762  11.238 142.238 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    20.762      2.092   9.924   <2e-16 ***
## I(Continent == "Africa")TRUE   55.045      4.059  13.562   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.79 on 205 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4729, Adjusted R-squared:  0.4703 
## F-statistic: 183.9 on 1 and 205 DF,  p-value: < 2.2e-16
#ANOVA model for IMR by continent
fit<-lm(IMR~Continent, dat)
anova(fit)
## Analysis of Variance Table
## 
## Response: IMR
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Continent   5 139806 27961.1  47.245 < 2.2e-16 ***
## Residuals 201 118960   591.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Post hoc tests using Tukey comparisons
glht(fit, linfct=mcp(Continent="Tukey"))
## 
##   General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Linear Hypotheses:
##                                    Estimate
## Asia - Africa == 0                  -43.084
## Europe - Africa == 0                -69.716
## North America - Africa == 0         -57.874
## Oceania - Africa == 0               -50.384
## South America - Africa == 0         -52.538
## Europe - Asia == 0                  -26.633
## North America - Asia == 0           -14.790
## Oceania - Asia == 0                  -7.300
## South America - Asia == 0            -9.454
## North America - Europe == 0          11.842
## Oceania - Europe == 0                19.333
## South America - Europe == 0          17.178
## Oceania - North America == 0          7.490
## South America - North America == 0    5.336
## South America - Oceania == 0         -2.154
#Basic OLS models
#fit the basic regression model
fit1<-lm(IMR ~ TFR + log(GNIPPPperCapitaUSDollars)+ log(PopDensPerSqMile), data=dat)
summary(fit1)
## 
## Call:
## lm(formula = IMR ~ TFR + log(GNIPPPperCapitaUSDollars) + log(PopDensPerSqMile), 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.374  -7.184   0.466   6.207  61.161 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    65.4360    16.1225   4.059 7.48e-05 ***
## TFR                            14.1539     1.1307  12.518  < 2e-16 ***
## log(GNIPPPperCapitaUSDollars)  -7.9888     1.3964  -5.721 4.60e-08 ***
## log(PopDensPerSqMile)          -0.5812     0.7857  -0.740     0.46    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.42 on 172 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.8331, Adjusted R-squared:  0.8302 
## F-statistic: 286.3 on 3 and 172 DF,  p-value: < 2.2e-16
#do some diagnostic plots
plot(fit1)

#Normality of residuals from the first fit?
shapiro.test(rstudent(fit1))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(fit1)
## W = 0.9562, p-value = 2.683e-05
plot(density(rstudent(fit1)))

#test for heteroskedasticity
bptest(fit1)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 26.6895, df = 3, p-value = 6.839e-06
#variance inflation factors
vif(fit1)
##                           TFR log(GNIPPPperCapitaUSDollars) 
##                      2.836245                      2.720001 
##         log(PopDensPerSqMile) 
##                      1.071165
#make White-corrected t-statistics and p values
coeftest(fit1, vcov=vcovHC(fit1, type = "HC0"))
## 
## t test of coefficients:
## 
##                               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                   65.43597   17.17594  3.8097 0.0001934 ***
## TFR                           14.15386    1.39263 10.1634 < 2.2e-16 ***
## log(GNIPPPperCapitaUSDollars) -7.98885    1.53293 -5.2115 5.318e-07 ***
## log(PopDensPerSqMile)         -0.58116    0.63402 -0.9166 0.3606190    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#now we fit a model that includes differences between continents
fit2<-lm(IMR ~ TFR + log(GNIPPPperCapitaUSDollars)+ log(PopDensPerSqMile)+ Continent, data=dat)
summary(fit2)
## 
## Call:
## lm(formula = IMR ~ TFR + log(GNIPPPperCapitaUSDollars) + log(PopDensPerSqMile) + 
##     Continent, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.344  -5.561  -0.035   5.857  58.758 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    83.6778    15.1184   5.535 1.19e-07 ***
## TFR                            11.6411     1.1912   9.772  < 2e-16 ***
## log(GNIPPPperCapitaUSDollars)  -7.8487     1.3118  -5.983 1.29e-08 ***
## log(PopDensPerSqMile)          -0.8502     0.7691  -1.105  0.27054    
## ContinentAsia                 -10.9400     3.4574  -3.164  0.00185 ** 
## ContinentEurope               -13.1368     4.3141  -3.045  0.00270 ** 
## ContinentNorth America        -17.6359     4.1083  -4.293 2.98e-05 ***
## ContinentOceania              -25.8905     4.7398  -5.462 1.68e-07 ***
## ContinentSouth America        -17.3111     4.8089  -3.600  0.00042 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 167 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.8653, Adjusted R-squared:  0.8588 
## F-statistic: 134.1 on 8 and 167 DF,  p-value: < 2.2e-16
#We can compare how model 2 performs compared to model 1 by using an F test
anova (fit1, fit2, test="F")
## Analysis of Variance Table
## 
## Model 1: IMR ~ TFR + log(GNIPPPperCapitaUSDollars) + log(PopDensPerSqMile)
## Model 2: IMR ~ TFR + log(GNIPPPperCapitaUSDollars) + log(PopDensPerSqMile) + 
##     Continent
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    172 35745                                  
## 2    167 28859  5    6886.1 7.9698 9.276e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Now we fit a model with an interaction term between continent and GDP, in R this is easy, all we have to do is use the * operation
fit3<-lm(IMR ~ TFR + log(GNIPPPperCapitaUSDollars)*Continent+ log(PopDensPerSqMile), data=dat)
summary(fit3)
## 
## Call:
## lm(formula = IMR ~ TFR + log(GNIPPPperCapitaUSDollars) * Continent + 
##     log(PopDensPerSqMile), data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.432  -6.263   0.448   5.030  59.455 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                           69.6034    23.2830
## TFR                                                   11.9535     1.3409
## log(GNIPPPperCapitaUSDollars)                         -6.4823     2.2798
## ContinentAsia                                         23.6205    22.9200
## ContinentEurope                                      -23.0881    38.1155
## ContinentNorth America                               -49.9613    34.0903
## ContinentOceania                                     -24.8931    36.7178
## ContinentSouth America                                63.9326    71.9055
## log(PopDensPerSqMile)                                 -0.3578     0.8147
## log(GNIPPPperCapitaUSDollars):ContinentAsia           -4.0986     2.7276
## log(GNIPPPperCapitaUSDollars):ContinentEurope          0.7370     3.9652
## log(GNIPPPperCapitaUSDollars):ContinentNorth America   3.4640     3.9116
## log(GNIPPPperCapitaUSDollars):ContinentOceania        -0.1979     4.4117
## log(GNIPPPperCapitaUSDollars):ContinentSouth America  -9.2290     8.1168
##                                                      t value Pr(>|t|)    
## (Intercept)                                            2.989  0.00323 ** 
## TFR                                                    8.914 9.83e-16 ***
## log(GNIPPPperCapitaUSDollars)                         -2.843  0.00504 ** 
## ContinentAsia                                          1.031  0.30428    
## ContinentEurope                                       -0.606  0.54554    
## ContinentNorth America                                -1.466  0.14471    
## ContinentOceania                                      -0.678  0.49877    
## ContinentSouth America                                 0.889  0.37526    
## log(PopDensPerSqMile)                                 -0.439  0.66116    
## log(GNIPPPperCapitaUSDollars):ContinentAsia           -1.503  0.13488    
## log(GNIPPPperCapitaUSDollars):ContinentEurope          0.186  0.85278    
## log(GNIPPPperCapitaUSDollars):ContinentNorth America   0.886  0.37716    
## log(GNIPPPperCapitaUSDollars):ContinentOceania        -0.045  0.96427    
## log(GNIPPPperCapitaUSDollars):ContinentSouth America  -1.137  0.25720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 162 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:   0.87,  Adjusted R-squared:  0.8596 
## F-statistic: 83.39 on 13 and 162 DF,  p-value: < 2.2e-16
#We can compare out model 2 with the interaction model to see if there is a significant difference in how GDP operates across continents
anova (fit2, fit3, test="F")
## Analysis of Variance Table
## 
## Model 1: IMR ~ TFR + log(GNIPPPperCapitaUSDollars) + log(PopDensPerSqMile) + 
##     Continent
## Model 2: IMR ~ TFR + log(GNIPPPperCapitaUSDollars) * Continent + log(PopDensPerSqMile)
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)
## 1    167 28858                          
## 2    162 27848  5    1010.7 1.176 0.3232
#This test suggests that there is not a significant interaction between GDP and Continent

#examine a log transformed outcome, this can be easily done within the lm() function
lm2<-lm(log(IMR) ~ TFR + log(GNIPPPperCapitaUSDollars)+ log(PopDensPerSqMile), data=dat)
summary(lm2)
## 
## Call:
## lm(formula = log(IMR) ~ TFR + log(GNIPPPperCapitaUSDollars) + 
##     log(PopDensPerSqMile), data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80051 -0.29549 -0.00936  0.28241  1.61830 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.76378    0.54737  14.184  < 2e-16 ***
## TFR                            0.21874    0.03839   5.698 5.16e-08 ***
## log(GNIPPPperCapitaUSDollars) -0.57417    0.04741 -12.111  < 2e-16 ***
## log(PopDensPerSqMile)         -0.08209    0.02667  -3.078  0.00243 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4894 on 172 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.8292, Adjusted R-squared:  0.8263 
## F-statistic: 278.4 on 3 and 172 DF,  p-value: < 2.2e-16
#Normality of errors
shapiro.test(rstudent(lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstudent(lm2)
## W = 0.9815, p-value = 0.01955
#Heterosckedasticity test
bptest(lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm2
## BP = 14.9121, df = 3, p-value = 0.001893
#diagnostic plots
plot(lm2)