R Markdown

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")

str(AllCountries)
## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...

Question 1

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")
linear <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(linear)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

-Based on the results, the intercept (68.42) is the predicted life expectancy for a country with a GDP of 0. The slope (2.476e-04) is the predicted change in life expectancy for each one-unit increase in GDP, which works out to about 0.25 years for every 1,000-unit increase, so a higher GDP is associated with a higher life expectancy. The R^2 value (0.4304) is the proportion of the variation in life expectancy between countries that the model explains, so GDP alone accounts for about 43% of the differences between countries and the rest is left unexplained.
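The pieces of this interpretation can be pulled out of a fitted model directly. The sketch below is only an illustration: the `toy` data frame is simulated and stands in for AllCountries.csv, so its numbers will not match the output above.

```r
# Simulated stand-in data (an assumption for illustration, not the AllCountries data)
set.seed(1)
toy <- data.frame(GDP = runif(50, min = 500, max = 50000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(50, sd = 5)

fit <- lm(LifeExpectancy ~ GDP, data = toy)

intercept <- unname(coef(fit)["(Intercept)"])  # predicted value when GDP = 0
slope     <- unname(coef(fit)["GDP"])          # change per one-unit increase in GDP
r2        <- summary(fit)$r.squared            # share of variance explained

# Predicted change in life expectancy for a 1,000-unit increase in GDP
slope * 1000

# In simple regression, R^2 equals the squared correlation between x and y
all.equal(r2, cor(toy$GDP, toy$LifeExpectancy)^2)
```

Rescaling the slope this way is often easier to read than a coefficient written in scientific notation.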

Question 2

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")
multi <- lm(LifeExpectancy ~ GDP + Health + Internet,
            data = AllCountries)
summary(multi)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

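-The Health coefficient (0.2479) is the predicted change in life expectancy for each one-unit increase in government health spending, holding GDP and Internet constant. The adjusted R^2 (0.7164) is also larger than in the model from Question 1 (0.4272), which suggests that adding Health and Internet explains much more of the variation in life expectancy than GDP alone. This is an association, so it does not by itself show that putting healthcare and internet access in place would improve conditions.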
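A comparison of adjusted R^2 between a small and a larger model could be sketched as follows. The data are simulated (the column names mirror the report, but the values are made up), so this only illustrates the mechanics.

```r
# Simulated stand-in data (an assumption for illustration, not the AllCountries data)
set.seed(2)
n <- 80
toy <- data.frame(GDP      = runif(n, 500, 50000),
                  Health   = runif(n, 2, 20),
                  Internet = runif(n, 5, 99))
toy$LifeExpectancy <- 59 + 0.25 * toy$Health + 0.19 * toy$Internet +
  rnorm(n, sd = 4)

small <- lm(LifeExpectancy ~ GDP, data = toy)
big   <- lm(LifeExpectancy ~ GDP + Health + Internet, data = toy)

# Adjusted R^2 penalizes extra predictors, so it is the fairer comparison
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)
adj_r2

# Change in LifeExpectancy per one-unit increase in Health,
# holding GDP and Internet constant
coef(big)["Health"]
```

In the report, the two models drop different numbers of rows (38 and 44), so a stricter comparison would refit both models on the same complete cases.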

Question 3

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")
model <- lm(LifeExpectancy ~ GDP, data = AllCountries)

plot(model$fitted.values, model$residuals,
     main = "Residual Plot",
     xlab = "Predicted Values",
     ylab = "Residuals")

hist(model$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals")

-To check homoscedasticity, I would plot the residuals against the fitted values and look at whether the vertical spread of the points stays consistent across the range of predictions. Ideally the points scatter randomly around zero with no real pattern, such as a funnel or a curve. To check the normality of the residuals, I use the histogram to see whether it looks roughly symmetric and bell-shaped; a strong skew would lower the reliability of the statistical conclusions, such as the p-values.
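These checks could also be run with a Q-Q plot and a Shapiro-Wilk test as complements to the histogram. The sketch below uses simulated data as a stand-in, and the choice of `qqnorm()` and `shapiro.test()` is my own addition, not part of the original analysis.

```r
# Simulated stand-in data (an assumption for illustration, not the AllCountries data)
set.seed(3)
toy <- data.frame(GDP = runif(60, 500, 50000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(60, sd = 5)
fit <- lm(LifeExpectancy ~ GDP, data = toy)

res <- residuals(fit)

# Homoscedasticity: look for a constant vertical spread around zero
plot(fitted(fit), res,
     main = "Residual Plot",
     xlab = "Predicted Values",
     ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality: points close to the reference line suggest normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test; a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```

The plots remain the main tool here; the test is sensitive to sample size, so it works better as a supplement than as a replacement.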

Question 4

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")
model <- lm(LifeExpectancy ~ GDP + Health + Internet,
            data = AllCountries)
rmse <- sqrt(mean(model$residuals^2))
rmse
## [1] 4.056417

-The RMSE (4.06) shows the typical size of the prediction errors: on average, the model's predicted life expectancy is off by about 4 years from the actual value. Countries with large residuals are the ones where my predictions are least accurate, so I would look into factors the model leaves out, such as the education infrastructure in those countries, to see how it compares.
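Finding the countries with the largest residuals could be sketched like this. The data and country labels are simulated placeholders, and two values are set to NA on purpose to mimic the rows that `lm()` drops in the report.

```r
# Simulated stand-in data (an assumption for illustration, not the AllCountries data)
set.seed(4)
toy <- data.frame(Country  = paste("Country", 1:40),
                  Internet = runif(40, 5, 99))
toy$LifeExpectancy <- 60 + 0.2 * toy$Internet + rnorm(40, sd = 4)
toy$Internet[c(3, 17)] <- NA   # lm() silently drops these rows

fit <- lm(LifeExpectancy ~ Internet, data = toy)
res <- residuals(fit)

# RMSE divides by the number of rows used; sigma() divides by residual df
rmse <- sqrt(mean(res^2))
rmse
sigma(fit)

# Residual names are the row names of the rows lm() kept,
# so they can be matched back to the countries
kept <- toy[names(res), ]
kept$residual <- res
head(kept[order(-abs(kept$residual)), c("Country", "residual")], 5)
```

The different divisor is why the residual standard error in the Question 2 summary (4.104) is slightly larger than the RMSE here (4.056).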

Question 5

AllCountries <- read.csv("/Users/ShelsyChouakong/Downloads/AllCountries.csv")
hypo <- lm(CO2 ~ Energy + Electricity,
           data = AllCountries)
summary(hypo)
## 
## Call:
## lm(formula = CO2 ~ Energy + Electricity, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7559  -1.1406  -0.2020   0.7143   7.3751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.998e-01  2.655e-01   3.012  0.00311 ** 
## Energy       3.122e-03  1.066e-04  29.290  < 2e-16 ***
## Electricity -7.044e-04  5.526e-05 -12.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.331 on 131 degrees of freedom
##   (83 observations deleted due to missingness)
## Multiple R-squared:  0.899,  Adjusted R-squared:  0.8974 
## F-statistic: 582.8 on 2 and 131 DF,  p-value: < 2.2e-16
cor(AllCountries$Energy,
    AllCountries$Electricity)
## [1] NA

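-The cor() call above returns NA because Energy and Electricity both contain missing values; adding use = "complete.obs" would compute the correlation from the countries that have both values. If the two predictors are strongly correlated, this affects the interpretation of the regression coefficients, because the model has difficulty separating the effect of one from the other when they move almost together. This can make the individual coefficient estimates and their standard errors unreliable, even when the overall fit is high (R^2 = 0.899).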
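Handling the missing values and gauging the collinearity could be sketched as follows. The data are simulated as a stand-in, and the variance inflation factor (VIF) is my own addition as one common way to quantify the problem.

```r
# Simulated stand-in data (an assumption for illustration, not the AllCountries data)
set.seed(5)
n <- 100
toy <- data.frame(Energy = runif(n, 100, 8000))
toy$Electricity <- 1.5 * toy$Energy + rnorm(n, sd = 800)
toy$Energy[c(2, 10)] <- NA
toy$Electricity[c(5, 10, 20)] <- NA

# The default returns NA as soon as either column has a missing value
cor(toy$Energy, toy$Electricity)

# Use only the rows where both values are present
r <- cor(toy$Energy, toy$Electricity, use = "complete.obs")
r

# With exactly two predictors, VIF = 1 / (1 - r^2)
vif <- 1 / (1 - r^2)
vif
```

A VIF well above 5 or 10 is a common rule of thumb for problematic collinearity, although the cutoff is a convention rather than a strict rule.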