Global Carbon Emissions

This report was compiled with the combined efforts of Sauban Kidwai, Jasmine Trinh and Marvel Sukadis.

Introduction

Background

With the rise in industrialization and massive population growth, the world has experienced constant and rapid increases in CO2 emission levels. This has been a critical international issue for many years and must be managed accordingly, as CO2 emissions are one of the major drivers of climate change and of other pressing environmental issues, including air pollution, ocean acidification and global warming. Beyond the environment, rising CO2 emissions also have large impacts on the economy, agricultural systems, and human health, as diseases and natural disasters increase with the warming planet. The scientific community widely agrees that this rise is caused by increased human activity, which releases huge amounts of greenhouse gases. As fossil fuel use increases to meet energy demands, countries should be aware of the trajectory of emission levels and regulate them with research, policies, and new laws that restrict industry-produced emissions, since unchecked emissions could affect all corners of the globe for the worse. With frequent and careful analysis of CO2 emissions data, the research community can better understand and monitor the global carbon cycle, and the resulting report provides a highly valuable resource within a climate policy framework (GCP, 2010).

Objective

The focus of this project is to predict global levels of CO2 and greenhouse gas emissions using statistical modeling, visual analysis, and exploratory data analysis. Doing so helps us understand how the world can progress towards a more sustainable society by reducing production-based annual emissions on a global scale. We will attempt to predict CO2 levels 2 years into the future using the ‘best’ predictive model that we select. This model will produce values outside the time range covered by our dataset.

The Data

Our chosen dataset is sourced from the Global Carbon Project, whose community researchers produce a report announcing a global carbon budget that quantifies CO2 emissions for the prior year (ICOS, 2016). The data ranges from 1750 to 2020 and is updated annually with global carbon dioxide emissions, specifically production-based annual emissions from the burning of fossil fuels for energy (e.g., gas, oil, coal) and industrial production (e.g., steel, cement), balanced by carbon stored in land and ocean reservoirs. It does not include emissions from land use change, because of uncertainties in the data and the difficulty of monitoring these emissions accurately each year. The data has been analysed before because of the increasing significance of managing and reducing the greenhouse emissions contributed by each country. In the data, CO2 levels are continuous numerical values plotted against discrete points in time (years). Global CO2 levels are separated by country, which is nominal categorical data.

Methodology

Model selection

Since our goal is to predict global CO2 values, we decided it would be most valuable to model the data using a linear regression algorithm, which yields specific numeric values. Because our response variable (global CO2 emissions) is continuous, the problem is suited to predictive regression rather than classification of discrete data. A regression model has parameters that can be interpreted, while supervised machine learning focuses on forecasting future values of the response variable as a function of explanatory variables. Predictive modeling is then implemented as a cross between the two disciplines. As the dataset contains many years of historical data, supervised machine learning is the best choice for predicting future values: the algorithm learns patterns from past data with known outcomes, rather than identifying groupings in unlabeled data as unsupervised learning would.

Data Wrangling

Located on the “Our World in Data” website, the dataset was obtained as a .csv file consisting of 25,191 rows and 60 columns. Figure 1 shows the names of all columns contained in the original dataset, before data cleaning, together with their index numbers as extracted in RStudio. Before the dataset could be analyzed, it had to be cleaned by removing gaps, duplicates and missing values, as these would reduce prediction accuracy significantly. Here, we substituted dummy 0 values for missing NA values. The dataset originally had many variables that were unnecessary for our analysis, so these were removed by deleting certain columns and rows. This makes the analysis more meaningful, as a smaller number of variables can be examined more closely.

Figure 1: Extract of column names of all variables within original dataset

Removing the unnecessary columns and rows produced a revised dataset with a clearer focus on our selected problem. The dataset was then decomposed into smaller subsets for the USA, Australia, and China. Using 3 countries of significance allows their CO2 emissions to be compared and specific relationships to be identified.

Figure 2: Extract of column names of all variables within new dataset
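The cleaning and subsetting steps described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the small `Data` frame is invented toy data standing in for the real CSV, the list of retained columns in `keep` is our guess at a reasonable selection, and the country labels are assumed to match those in the file.

```r
# Toy stand-in for the real CSV; every value here is invented for illustration.
Data <- data.frame(
  country        = c("Australia", "Australia", "China", "China",
                     "United States", "France"),
  year           = c(2019, 2020, 2019, 2020, 2020, 2020),
  co2_per_capita = c(16.3, NA, 7.1, 7.4, 14.2, 4.2),
  population     = c(25.2e6, 25.5e6, 1.43e9, 1.44e9, 331e6, 65e6),
  gdp            = c(NA, 1.2e12, 1.9e13, NA, 1.8e13, 2.4e12),
  methane        = c(1, 2, 3, 4, 5, 6)  # example of a column that is dropped
)

# Keep only the columns used in the analysis (assumed selection).
keep  <- c("country", "year", "co2_per_capita", "population", "gdp")
Clean <- Data[, keep]

# Drop exact duplicate rows, then substitute 0 for the remaining NA values,
# mirroring the substitution described in the text.
Clean <- unique(Clean)
Clean[is.na(Clean)] <- 0

# One subset per country of interest.
Australia <- subset(Clean, country == "Australia")
China     <- subset(Clean, country == "China")
USA       <- subset(Clean, country == "United States")
```

Note that replacing NA with 0 treats a missing measurement as a true zero, so an alternative worth considering is to drop incomplete rows with `na.omit()` instead.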

Analysis

Correlation plots

Various correlation plots were produced for each subset to understand the relationships between the chosen variables (i.e., CO2 per capita, population and GDP), and these proved significant in helping us find the best explanatory variables to focus on. CO2 per capita, population and GDP appeared to be relatively interlinked, as more people contribute to more CO2 emissions and GDP influences where the different populations live.

We used a scatterplot matrix to visualize multiple variables and determine which had the strongest correlation for prediction analysis. Looking at the correlation coefficients, we identified that each relationship was positive and strong, with the lowest correlation coefficient in every subset being between CO2 per capita and GDP, and the highest being between CO2 per capita and population. Based on this, for all three countries we used CO2 per capita and population as our 2 main variables. We also used line graphs to display the local trends of the response variable over time for each subset. This was useful for identifying the general positive trend of CO2 emissions per capita (tonnes per person) in a visual form and for determining whether we needed to apply a transformed scale (e.g., logarithms) to reduce the effects of large differences between points. After plotting each country’s subset of CO2 per year, it was safe to assume that the data did not contain many significant outliers that needed to be removed and did not need rescaling. With CO2 per capita and population consistently having the strongest correlation in each of the 3 subsets, as well as a positive trend over time as seen in the line graphs, it was safe to select these two variables to base our predictions on.

Figure 3: Correlation graph for Australia

Figure 4: Correlation graph for China

Figure 5: Correlation graph for USA
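As a rough illustration of how correlation coefficients and a scatterplot matrix like those in the figures above can be produced, the sketch below uses the base R functions `cor()` and `pairs()`. The `Subset` data frame is simulated toy data, not the real country subsets, so the coefficients it prints are not the ones reported in this report.

```r
set.seed(1)  # make the simulated toy data reproducible

# Simulated stand-in for one country subset (all values invented).
year           <- 1971:2020
population     <- 13e6 + 2.5e5 * (year - 1971) + rnorm(50, sd = 1e5)
gdp            <- 2e11 + 2e10 * (year - 1971) + rnorm(50, sd = 5e10)
co2_per_capita <- 10 + 0.15 * (year - 1971) + rnorm(50, sd = 0.8)
Subset <- data.frame(co2_per_capita, population, gdp)

# Pairwise Pearson correlation coefficients between the three variables.
corr <- cor(Subset)
print(round(corr, 3))

# Scatterplot matrix of the same variables.
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script
pairs(Subset, main = "Scatterplot matrix (toy data)")
```

Reading the off-diagonal entries of `corr` alongside the matching panels of the scatterplot matrix is one way to pick the explanatory variable most strongly related to the response.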

Diagnostic graphs

We then fitted a linear model for each subset and plotted it to produce various diagnostic graphs:

Residual vs Fitted: For most subsets, the graphs showed no distinct patterns, with equally spread residuals around a horizontal line. This indicated that no non-linear relationship was left unexplained by the linear model.
- China, however, had a significantly large curve in its line and clear patterns in its residuals.
Normal Q-Q: We analyzed the Q-Q plots for each subset, and the residuals appeared to be normally distributed, with only minor deviations from the straight diagonal line at the tails of the plot, where the values level out. Based on these plots, we assumed the residuals are normally distributed, although the values have leveled out in recent years.
Scale-Location: The residuals appeared to be spread equally along the range of predictors, with a relatively horizontal line, supporting the assumption of equal variance.
Residual vs Leverage: The plots helped us identify potentially influential outliers that could alter the results if not removed from the analysis. For Australia and the USA, all values sat well inside the Cook’s distance line; for China, however, one value lay well outside the Cook’s distance line and was identified as an influential case. Excluding this observation from the analysis changed the slope coefficient from 2.14 to 2.68 and R2 from 0.7757 to 0.851.
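The four diagnostic graphs listed above are what base R draws when `plot()` is called on a fitted `lm` object, and Cook's distance can be extracted with `cooks.distance()`. The sketch below shows the general pattern on simulated toy data with one deliberately planted unusual point. The data, the variable names `Toy`, `fit` and `refit`, and the choice to drop only the single most influential observation are illustrative assumptions, not the exact procedure used for the China subset.

```r
set.seed(42)  # reproducible toy data

# Simulated yearly series with one planted unusual observation at the end.
year <- 1961:2020
co2  <- 0.1 * (year - 1960) + rnorm(60, sd = 0.4)
co2[60] <- co2[60] + 6
Toy <- data.frame(year, co2)

# Fit the simple linear model and draw the four standard diagnostic plots:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
fit <- lm(co2 ~ year, data = Toy)
if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script
par(mfrow = c(2, 2))
plot(fit)

# Find the observation with the largest Cook's distance and refit without it.
cd    <- cooks.distance(fit)
worst <- which.max(cd)
refit <- lm(co2 ~ year, data = Toy[-worst, ])

# Compare slope and R-squared before and after excluding that observation.
comparison <- rbind(
  before = c(slope = unname(coef(fit)["year"]),   r2 = summary(fit)$r.squared),
  after  = c(slope = unname(coef(refit)["year"]), r2 = summary(refit)$r.squared)
)
print(comparison)
```

Comparing the two rows of `comparison` is the same kind of check as the slope and R2 change reported for China.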

Predictive analysis

In the project, simple linear regression was used to build a predictive model of global CO2 emissions. Supervised learning on training data sourced from the dataset was used to produce various predictive models. Using machine learning in R, we were able to analyze the data and predict future outcomes based on the selected variables. First, the data was split into two non-overlapping sets: the training set, containing 75% of the data points, and the testing set, containing the remaining 25%. The training set was used to fit the models that predict future values, and the testing set was used to validate them. Using the training dataset, the two variables, global CO2 emissions and time, were plotted against each other and, despite some residual scatter, a linear regression line could be drawn. The model was then evaluated for accuracy by applying it to the test dataset. With this, we were able to tweak and adjust the hyperparameters of the model, that is, the values set before the algorithm begins learning. The points produced seemed more spread out than in the plot for the training dataset; however, a linear relationship was clearly shown, and the model’s performance seemed satisfactory. To evaluate the model, its prediction interval was determined and checked to see whether the values fell within a certain level of certainty. The ‘best model’ was then selected to forecast the trajectory of global CO2 emissions 2 years into the future.
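The workflow described above (a 75/25 split, a simple linear regression on time, evaluation on the test set, and a forecast with a prediction interval) might look roughly like the following in base R. The `Global` data frame is simulated, the split is drawn at random (an assumption, since the splitting method is not stated), and the 95% level for the interval is likewise an assumed choice.

```r
set.seed(123)  # reproducible split and toy data

# Simulated stand-in for global yearly emissions (all values invented).
Global <- data.frame(year = 1950:2020)
Global$co2 <- 5000 + 450 * (Global$year - 1950) + rnorm(nrow(Global), sd = 800)

# Non-overlapping 75% training / 25% testing split (random split assumed).
train_idx <- sample(seq_len(nrow(Global)), size = floor(0.75 * nrow(Global)))
train <- Global[train_idx, ]
test  <- Global[-train_idx, ]

# Fit simple linear regression of emissions on time using the training set.
fit <- lm(co2 ~ year, data = train)

# Evaluate on the held-out test set with the root mean squared error of prediction.
pred_test <- predict(fit, newdata = test)
rmsep     <- sqrt(mean((test$co2 - pred_test)^2))

# Forecast 2 years beyond the data, with a 95% prediction interval.
future   <- data.frame(year = c(2021, 2022))
forecast <- predict(fit, newdata = future, interval = "prediction", level = 0.95)
print(cbind(future, forecast))
```

The `lwr` and `upr` columns of `forecast` give the range in which a new observation is expected to fall at the chosen level, which is wider than a confidence interval for the fitted line itself.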

Results

Data Comparison

Looking at the results of the 3 subsets, we can see that CO2 levels for China are significantly lower than those for Australia and the USA, peaking at around 8 tonnes per person by 2020, whereas Australia and the USA have much higher peaks of around 20 tonnes per person around the 2000s. From what we have gathered, the USA has produced the highest levels of CO2 of the three countries, reaching 15 tonnes per person between 1900 and 1950. The USA’s CO2 levels are also much higher because they began rising much earlier, between 1800 and 1850, whereas China’s did not start rising dramatically until around 1960. From these graphs, we can predict that Australia will keep lowering its CO2 levels, as it appears to have hit its peak in the year 2000 and has been dropping since. For China, we can predict continued growth in CO2 levels, which could be because China’s levels did not rise until much later than those of Australia and the USA. China also does not appear to have hit its peak in the data, as its levels were still growing in 2020. With this data we can predict that once China reaches 20 tonnes per person, we would see a steady decrease in its CO2 levels, similar to Australia and the USA. In contrast to China, the USA shows a major drop after peaking at 20 tonnes per person between 1950 and 2000. While the USA has the least consistent CO2 levels, rising and dropping frequently, we can see that by the end of the 2000s its levels are still decreasing by a significant amount. Figures 6, 7 and 8 show the line graphs generated for CO2 versus year for each country.

Figure 6: Australian Line Graph

Figure 7: China Line Graph

Figure 8: USA Line Graph

For Australia:

Based on the predictions from the actual (full) model and our forward step model, the RMSEP values are quite similar. The forward step model differs from the actual model by 12%, which is acceptable because it fits the model created before. We can safely assume that the predictions of our full model match those of the forward step model, which indicates that CO2 levels will still be high and will keep increasing. Figure 9 shows the test data RMSEP.

Figure 9: RMSEP values for the Australian Dataset

For China:

The predictions from the full model and the forward step model differed from those produced for the Australia dataset. The RMSEP difference between the forward step model and the full model is larger, at 33%. This means that although the model fits, it predicts a higher rate of CO2 production than Australia’s in the future. This makes sense, as China is currently one of the few countries producing CO2 on such a massive scale, and the more CO2 it produces, the more damaging it is to Earth’s environment. This is concerning because if China does not start reducing its CO2 levels, the impacts on the climate are likely to become worse than they are now. Figure 10 shows the test data RMSEP.

Figure 10: RMSEP values for the China Dataset

For USA:

The predictions from the full model and the forward step model for the USA are very similar to the output for the Australia dataset. The RMSEP difference between the forward step model and the full model is only 7%, which shows that the model works very well. It also shows that the amount of CO2 produced by the USA is very steady. This is not good for the climate, but it is good for the analysis of our data. Based on the predictive models produced, a forecast of total CO2 levels on a global scale could be produced for 2021, 2022 and 2023. Figure 11 shows the test data RMSEP.

Figure 11: RMSEP values for the USA Dataset

Our RMSEP values were in the millions because we were measuring population. Our analysis is based on the test set of our data, which includes 25% of the data points.
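To make the comparison between a full model and a forward step model more concrete, the sketch below fits both with base R (`lm()` and `step()` with `direction = "forward"`) and computes the RMSEP of each on a held-out 25% test set. The data is simulated, and the choice of response and candidate predictors (`year`, `population`, `gdp`) is an assumption made for illustration; the helper `rmsep()` is a hypothetical function defined here, not part of any package.

```r
set.seed(7)  # reproducible toy data and split

# Simulated stand-in for one country subset (all values invented).
n   <- 80
Toy <- data.frame(year = 1941:2020)
Toy$population     <- 7e6 + 2.3e5 * (Toy$year - 1941) + rnorm(n, sd = 2e5)
Toy$gdp            <- 5e10 + 1.5e10 * (Toy$year - 1941) + rnorm(n, sd = 4e10)
Toy$co2_per_capita <- 5 + 6e-7 * Toy$population + rnorm(n, sd = 0.7)

# 75% training / 25% testing split.
idx   <- sample(n, size = 0.75 * n)
train <- Toy[idx, ]
test  <- Toy[-idx, ]

# Full model with all candidate predictors, and an intercept-only start model.
full_model <- lm(co2_per_capita ~ year + population + gdp, data = train)
null_model <- lm(co2_per_capita ~ 1, data = train)

# Forward stepwise selection by AIC, adding terms up to the full model.
forward_model <- step(null_model,
                      scope = list(lower = ~ 1, upper = formula(full_model)),
                      direction = "forward", trace = 0)

# Hypothetical helper: root mean squared error of prediction on new data.
rmsep <- function(model, newdata) {
  sqrt(mean((newdata$co2_per_capita - predict(model, newdata = newdata))^2))
}

results <- c(full = rmsep(full_model, test), forward = rmsep(forward_model, test))
print(results)

# Relative difference between the two RMSEP values, as a percentage.
print(100 * abs(results["forward"] - results["full"]) / results["full"])
```

A small relative difference suggests the simpler forward step model predicts about as well as the full model on unseen data.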

Conclusion

In the project, we used a range of statistical modeling and data analysis methods to train and test predictive models that forecast global levels of CO2 emissions, using data extracted from a large dataset found on the Global Carbon Project website. The tests conducted have led us to believe that CO2 emission levels are now far greater than before, even with each country’s efforts to manage and reduce greenhouse emissions. With a linear regression algorithm, specific numeric values and model parameters were obtained to forecast future values based on the historic data collected over many years. By splitting the data into smaller sets specific to three countries of our choice, we based global CO2 emissions on those countries, which may create inaccurate predictions. Since the subsets are much smaller than the actual population (global CO2 emissions), there is some uncertainty regarding the predictive model’s accuracy. This could have been reduced by analyzing global CO2 emissions directly, including all countries, rather than decomposing the data into the 3 subsets for the countries of our choice: Australia, China and the USA. In future analyses, we could also examine the effects of GDP on CO2 levels, given its strong correlation coefficient as seen in the correlation plots, or analyze global CO2 emissions based solely on their changes over time.

References

Ritchie, H. and Roser, M. (2017). CO2 and other Greenhouse Gas Emissions. [online] Our World in Data. Available at: https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions.

careerfoundry.com. (2010). What Is Data Wrangling? A Complete Introductory Guide. [online] Available at: https://careerfoundry.com/en/blog/data-analytics/data-wrangling/#what-is-the-data-wrangling-process.

IEA. (2022). Global Energy Review: CO2 Emissions in 2021. [online] IEA, Paris. Available at: https://www.iea.org/reports/global-energy-review-co2-emissions-in-2021-2.

ICOS. (2018). Global Carbon Budget. [online] Available at: https://www.icos-cp.eu/science-and-impact/global-carbon-budget.

Appendices

The following lines of code explain briefly what was done to achieve our results. Each piece of code has a comment associated with it to make it easier to understand what is being done.

Data <- read.csv("owid-co2-data.csv", header = TRUE) #start off by loading the CSV file

dim(Data) #check the amount of rows and columns

## [1] 25191    60

summary(Data) #get a summary for each column

##    iso_code           country               year           co2          
##  Length:25191       Length:25191       Min.   :1750   Min.   :    0.00  
##  Class :character   Class :character   1st Qu.:1924   1st Qu.:    0.53  
##  Mode  :character   Mode  :character   Median :1967   Median :    4.86  
##                                        Mean   :1953   Mean   :  267.86  
##                                        3rd Qu.:1995   3rd Qu.:   42.82  
##                                        Max.   :2020   Max.   :36702.50  
##                                                       NA's   :1242      
##  co2_per_capita      trade_co2           cement_co2       cement_co2_per_capita
##  Min.   :  0.000   Min.   :-1657.998   Min.   :   0.000   Min.   :0.000        
##  1st Qu.:  0.253   1st Qu.:   -0.892   1st Qu.:   0.129   1st Qu.:0.020        
##  Median :  1.250   Median :    1.953   Median :   0.557   Median :0.070        
##  Mean   :  4.171   Mean   :   -2.416   Mean   :  12.889   Mean   :0.113        
##  3rd Qu.:  4.657   3rd Qu.:    9.700   3rd Qu.:   2.897   3rd Qu.:0.156        
##  Max.   :748.639   Max.   : 1028.487   Max.   :1626.371   Max.   :2.738        
##  NA's   :1884      NA's   :21215       NA's   :12943      NA's   :12973        
##     coal_co2         coal_co2_per_capita  flaring_co2     
##  Min.   :    0.000   Min.   : 0.000      Min.   :  0.000  
##  1st Qu.:    0.322   1st Qu.: 0.054      1st Qu.:  0.253  
##  Median :    3.981   Median : 0.442      Median :  2.071  
##  Mean   :  175.358   Mean   : 1.552      Mean   : 15.000  
##  3rd Qu.:   35.533   3rd Qu.: 2.149      3rd Qu.: 12.604  
##  Max.   :15062.902   Max.   :34.184      Max.   :435.034  
##  NA's   :8003        NA's   :8331        NA's   :20809    
##  flaring_co2_per_capita    gas_co2         gas_co2_per_capita
##  Min.   : 0.000         Min.   :   0.000   Min.   : 0.000    
##  1st Qu.: 0.021         1st Qu.:   0.385   1st Qu.: 0.031    
##  Median : 0.068         Median :   4.199   Median : 0.282    
##  Mean   : 0.875         Mean   : 108.751   Mean   : 1.413    
##  3rd Qu.: 0.203         3rd Qu.:  30.830   3rd Qu.: 1.436    
##  Max.   :94.711         Max.   :7553.394   Max.   :52.484    
##  NA's   :20810          NA's   :16346      NA's   :16356     
##     oil_co2          oil_co2_per_capita other_industry_co2 other_co2_per_capita
##  Min.   :    0.000   Min.   :  0.000    Min.   :  0.000    Min.   :0.000       
##  1st Qu.:    0.311   1st Qu.:  0.121    1st Qu.:  0.748    1st Qu.:0.036       
##  Median :    2.100   Median :  0.630    Median :  2.861    Median :0.071       
##  Mean   :  106.254   Mean   :  2.635    Mean   : 15.754    Mean   :0.080       
##  3rd Qu.:   17.369   3rd Qu.:  2.474    3rd Qu.:  9.902    3rd Qu.:0.108       
##  Max.   :12229.642   Max.   :748.639    Max.   :303.858    Max.   :0.357       
##  NA's   :4652        NA's   :5010       NA's   :23192      NA's   :23192       
##  co2_growth_prct     co2_growth_abs       co2_per_gdp    co2_per_unit_energy
##  Min.   :   -99.64   Min.   :-1895.244   Min.   :0.000   Min.   :0.005      
##  1st Qu.:    -0.45   1st Qu.:   -0.011   1st Qu.:0.140   1st Qu.:0.178      
##  Median :     3.35   Median :    0.059   Median :0.276   Median :0.218      
##  Mean   :    21.10   Mean   :    5.147   Mean   :0.422   Mean   :0.239      
##  3rd Qu.:    10.46   3rd Qu.:    1.103   3rd Qu.:0.534   3rd Qu.:0.256      
##  Max.   :102318.51   Max.   : 1736.258   Max.   :7.776   Max.   :4.644      
##  NA's   :260         NA's   :1606        NA's   :9802    NA's   :16050      
##  consumption_co2    consumption_co2_per_capita consumption_co2_per_gdp
##  Min.   :    0.20   Min.   : 0.055             Min.   :0.006          
##  1st Qu.:   10.32   1st Qu.: 1.240             1st Qu.:0.216          
##  Median :   57.09   Median : 4.359             Median :0.315          
##  Mean   :  916.76   Mean   : 6.568             Mean   :0.370          
##  3rd Qu.:  276.38   3rd Qu.: 9.848             3rd Qu.:0.447          
##  Max.   :36702.50   Max.   :57.792             Max.   :3.543          
##  NA's   :21215      NA's   :21215              NA's   :21430          
##  cumulative_co2      cumulative_cement_co2 cumulative_coal_co2
##  Min.   :      0.0   Min.   :    0.00      Min.   :     0.0   
##  1st Qu.:      7.0   1st Qu.:    1.61      1st Qu.:     5.5   
##  Median :     91.3   Median :   10.45      Median :    98.2   
##  Mean   :  10357.1   Mean   :  307.76      Mean   :  8791.8   
##  3rd Qu.:   1147.5   3rd Qu.:   66.46      3rd Qu.:  1248.8   
##  Max.   :1696524.2   Max.   :43163.19      Max.   :788362.0   
##  NA's   :1242        NA's   :12943         NA's   :8003       
##  cumulative_flaring_co2 cumulative_gas_co2  cumulative_oil_co2
##  Min.   :    0.000      Min.   :     0.00   Min.   :     0.0  
##  1st Qu.:    4.071      1st Qu.:     3.24   1st Qu.:     3.9  
##  Median :   45.608      Median :    52.06   Median :    39.2  
##  Mean   :  425.699      Mean   :  2587.10   Mean   :  3296.6  
##  3rd Qu.:  281.485      3rd Qu.:   457.78   3rd Qu.:   372.7  
##  Max.   :17792.749      Max.   :245231.88   Max.   :592621.2  
##  NA's   :20809          NA's   :16346       NA's   :4652      
##  cumulative_other_co2 trade_co2_share   share_global_co2 
##  Min.   :   0.001     Min.   :-96.760   Min.   :  0.000  
##  1st Qu.:   7.709     1st Qu.: -1.758   1st Qu.:  0.010  
##  Median :  35.644     Median : 11.675   Median :  0.060  
##  Mean   : 293.588     Mean   : 22.961   Mean   :  4.984  
##  3rd Qu.: 159.188     3rd Qu.: 36.382   3rd Qu.:  0.600  
##  Max.   :7725.988     Max.   :366.150   Max.   :100.000  
##  NA's   :23192        NA's   :21215     NA's   :1242     
##  share_global_cement_co2 share_global_coal_co2 share_global_flaring_co2
##  Min.   :  0.000         Min.   :  0.000       Min.   :  0.000         
##  1st Qu.:  0.050         1st Qu.:  0.010       1st Qu.:  0.090         
##  Median :  0.200         Median :  0.115       Median :  0.700         
##  Mean   :  4.419         Mean   :  6.990       Mean   :  5.862         
##  3rd Qu.:  1.000         3rd Qu.:  1.250       3rd Qu.:  4.440         
##  Max.   :100.000         Max.   :100.000       Max.   :100.000         
##  NA's   :12943           NA's   :8003          NA's   :20809           
##  share_global_gas_co2 share_global_oil_co2 share_global_other_co2
##  Min.   :  0.000      Min.   :  0.000      Min.   :  0.00        
##  1st Qu.:  0.030      1st Qu.:  0.010      1st Qu.:  0.30        
##  Median :  0.200      Median :  0.080      Median :  1.34        
##  Mean   :  5.406      Mean   :  2.993      Mean   : 14.29        
##  3rd Qu.:  1.250      3rd Qu.:  0.550      3rd Qu.: 10.09        
##  Max.   :100.000      Max.   :100.000      Max.   :100.00        
##  NA's   :16346        NA's   :4652         NA's   :23192         
##  share_global_cumulative_co2 share_global_cumulative_cement_co2
##  Min.   :  0.000             Min.   :  0.000                   
##  1st Qu.:  0.000             1st Qu.:  0.040                   
##  Median :  0.030             Median :  0.200                   
##  Mean   :  5.127             Mean   :  4.462                   
##  3rd Qu.:  0.410             3rd Qu.:  0.930                   
##  Max.   :100.000             Max.   :100.000                   
##  NA's   :1242                NA's   :12943                     
##  share_global_cumulative_coal_co2 share_global_cumulative_flaring_co2
##  Min.   :  0.000                  Min.   :  0.000                    
##  1st Qu.:  0.000                  1st Qu.:  0.060                    
##  Median :  0.070                  Median :  0.540                    
##  Mean   :  7.212                  Mean   :  5.622                    
##  3rd Qu.:  0.910                  3rd Qu.:  3.567                    
##  Max.   :100.000                  Max.   :100.000                    
##  NA's   :8003                     NA's   :20809                      
##  share_global_cumulative_gas_co2 share_global_cumulative_oil_co2
##  Min.   :  0.000                 Min.   :  0.000                
##  1st Qu.:  0.010                 1st Qu.:  0.010                
##  Median :  0.110                 Median :  0.070                
##  Mean   :  5.242                 Mean   :  3.002                
##  3rd Qu.:  0.820                 3rd Qu.:  0.530                
##  Max.   :100.000                 Max.   :100.000                
##  NA's   :16346                   NA's   :4652                   
##  share_global_cumulative_other_co2   total_ghg        ghg_per_capita   
##  Min.   :  0.000                   Min.   : -178.71   Min.   :-31.485  
##  1st Qu.:  0.190                   1st Qu.:    8.03   1st Qu.:  2.656  
##  Median :  0.840                   Median :   33.90   Median :  5.386  
##  Mean   : 13.404                   Mean   :  420.52   Mean   :  7.608  
##  3rd Qu.:  7.985                   3rd Qu.:  115.03   3rd Qu.:  9.604  
##  Max.   :100.000                   Max.   :48939.71   Max.   : 74.729  
##  NA's   :23192                     NA's   :19540      NA's   :19540    
##  total_ghg_excluding_lucf ghg_excluding_lucf_per_capita    methane        
##  Min.   :    0.01         Min.   : 0.101                Min.   :   0.000  
##  1st Qu.:    6.85         1st Qu.: 2.095                1st Qu.:   2.005  
##  Median :   28.08         Median : 4.442                Median :   8.530  
##  Mean   :  406.51         Mean   : 6.871                Mean   :  79.072  
##  3rd Qu.:   92.60         3rd Qu.: 8.975                3rd Qu.:  30.025  
##  Max.   :47552.14         Max.   :53.650                Max.   :8298.270  
##  NA's   :19540            NA's   :19540                 NA's   :19536     
##  methane_per_capita nitrous_oxide     nitrous_oxide_per_capita
##  Min.   : 0.000     Min.   :   0.00   Min.   : 0.000          
##  1st Qu.: 0.691     1st Qu.:   0.51   1st Qu.: 0.221          
##  Median : 1.077     Median :   3.46   Median : 0.377          
##  Mean   : 1.902     Mean   :  29.09   Mean   : 0.602          
##  3rd Qu.: 1.619     3rd Qu.:  11.20   3rd Qu.: 0.589          
##  Max.   :39.795     Max.   :3078.27   Max.   :10.056          
##  NA's   :19536      NA's   :19536     NA's   :19536           
##    population             gdp            primary_energy_consumption
##  Min.   :1.490e+03   Min.   :5.543e+07   Min.   :     0.0          
##  1st Qu.:1.287e+06   1st Qu.:9.829e+09   1st Qu.:     7.0          
##  Median :4.870e+06   Median :3.037e+10   Median :    61.4          
##  Mean   :7.068e+07   Mean   :2.877e+11   Mean   :  1569.1          
##  3rd Qu.:1.758e+07   3rd Qu.:1.269e+11   3rd Qu.:   352.9          
##  Max.   :7.795e+09   Max.   :1.136e+14   Max.   :162194.3          
##  NA's   :2299        NA's   :11653       NA's   :16501             
##  energy_per_capita energy_per_gdp  
##  Min.   :     0    Min.   : 0.050  
##  1st Qu.:  3270    1st Qu.: 0.856  
##  Median : 13701    Median : 1.407  
##  Mean   : 25569    Mean   : 1.850  
##  3rd Qu.: 35494    3rd Qu.: 2.351  
##  Max.   :317583    Max.   :13.493  
##  NA's   :16510     NA's   :18388

head(Data) #print the first 6 rows of the csv file

##   iso_code     country year   co2 co2_per_capita trade_co2 cement_co2
## 1      AFG Afghanistan 1949 0.015          0.002        NA         NA
## 2      AFG Afghanistan 1950 0.084          0.011        NA         NA
## 3      AFG Afghanistan 1951 0.092          0.012        NA         NA
## 4      AFG Afghanistan 1952 0.092          0.012        NA         NA
## 5      AFG Afghanistan 1953 0.106          0.013        NA         NA
## 6      AFG Afghanistan 1954 0.106          0.013        NA         NA
##   cement_co2_per_capita coal_co2 coal_co2_per_capita flaring_co2
## 1                    NA    0.015               0.002          NA
## 2                    NA    0.021               0.003          NA
## 3                    NA    0.026               0.003          NA
## 4                    NA    0.032               0.004          NA
## 5                    NA    0.038               0.005          NA
## 6                    NA    0.043               0.005          NA
##   flaring_co2_per_capita gas_co2 gas_co2_per_capita oil_co2 oil_co2_per_capita
## 1                     NA      NA                 NA      NA                 NA
## 2                     NA      NA                 NA   0.063              0.008
## 3                     NA      NA                 NA   0.066              0.008
## 4                     NA      NA                 NA   0.060              0.008
## 5                     NA      NA                 NA   0.068              0.008
## 6                     NA      NA                 NA   0.064              0.008
##   other_industry_co2 other_co2_per_capita co2_growth_prct co2_growth_abs
## 1                 NA                   NA              NA             NA
## 2                 NA                   NA           475.0          0.070
## 3                 NA                   NA             8.7          0.007
## 4                 NA                   NA             0.0          0.000
## 5                 NA                   NA            16.0          0.015
## 6                 NA                   NA             0.0          0.000
##   co2_per_gdp co2_per_unit_energy consumption_co2 consumption_co2_per_capita
## 1          NA                  NA              NA                         NA
## 2       0.009                  NA              NA                         NA
## 3       0.010                  NA              NA                         NA
## 4       0.009                  NA              NA                         NA
## 5       0.010                  NA              NA                         NA
## 6       0.010                  NA              NA                         NA
##   consumption_co2_per_gdp cumulative_co2 cumulative_cement_co2
## 1                      NA          0.015                    NA
## 2                      NA          0.099                    NA
## 3                      NA          0.191                    NA
## 4                      NA          0.282                    NA
## 5                      NA          0.388                    NA
## 6                      NA          0.495                    NA
##   cumulative_coal_co2 cumulative_flaring_co2 cumulative_gas_co2
## 1               0.015                     NA                 NA
## 2               0.036                     NA                 NA
## 3               0.061                     NA                 NA
## 4               0.093                     NA                 NA
## 5               0.131                     NA                 NA
## 6               0.174                     NA                 NA
##   cumulative_oil_co2 cumulative_other_co2 trade_co2_share share_global_co2
## 1                 NA                   NA              NA                0
## 2              0.063                   NA              NA                0
## 3              0.129                   NA              NA                0
## 4              0.189                   NA              NA                0
## 5              0.257                   NA              NA                0
## 6              0.321                   NA              NA                0
##   share_global_cement_co2 share_global_coal_co2 share_global_flaring_co2
## 1                      NA                     0                       NA
## 2                      NA                     0                       NA
## 3                      NA                     0                       NA
## 4                      NA                     0                       NA
## 5                      NA                     0                       NA
## 6                      NA                     0                       NA
##   share_global_gas_co2 share_global_oil_co2 share_global_other_co2
## 1                   NA                   NA                     NA
## 2                   NA                    0                     NA
## 3                   NA                    0                     NA
## 4                   NA                    0                     NA
## 5                   NA                    0                     NA
## 6                   NA                    0                     NA
##   share_global_cumulative_co2 share_global_cumulative_cement_co2
## 1                           0                                 NA
## 2                           0                                 NA
## 3                           0                                 NA
## 4                           0                                 NA
## 5                           0                                 NA
## 6                           0                                 NA
##   share_global_cumulative_coal_co2 share_global_cumulative_flaring_co2
## 1                                0                                  NA
## 2                                0                                  NA
## 3                                0                                  NA
## 4                                0                                  NA
## 5                                0                                  NA
## 6                                0                                  NA
##   share_global_cumulative_gas_co2 share_global_cumulative_oil_co2
## 1                              NA                              NA
## 2                              NA                               0
## 3                              NA                               0
## 4                              NA                               0
## 5                              NA                               0
## 6                              NA                               0
##   share_global_cumulative_other_co2 total_ghg ghg_per_capita
## 1                                NA        NA             NA
## 2                                NA        NA             NA
## 3                                NA        NA             NA
## 4                                NA        NA             NA
## 5                                NA        NA             NA
## 6                                NA        NA             NA
##   total_ghg_excluding_lucf ghg_excluding_lucf_per_capita methane
## 1                       NA                            NA      NA
## 2                       NA                            NA      NA
## 3                       NA                            NA      NA
## 4                       NA                            NA      NA
## 5                       NA                            NA      NA
## 6                       NA                            NA      NA
##   methane_per_capita nitrous_oxide nitrous_oxide_per_capita population
## 1                 NA            NA                       NA    7624058
## 2                 NA            NA                       NA    7752117
## 3                 NA            NA                       NA    7840151
## 4                 NA            NA                       NA    7935996
## 5                 NA            NA                       NA    8039684
## 6                 NA            NA                       NA    8151316
##           gdp primary_energy_consumption energy_per_capita energy_per_gdp
## 1          NA                         NA                NA             NA
## 2  9421400000                         NA                NA             NA
## 3  9692280000                         NA                NA             NA
## 4 10017325000                         NA                NA             NA
## 5 10630520000                         NA                NA             NA
## 6 10866360000                         NA                NA             NA

library(tidyverse) #import the tidyverse library

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'purrr' was built under R version 4.1.3

## Warning: package 'dplyr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

colnames(Data) #check the names of each column

##  [1] "iso_code"                            "country"                            
##  [3] "year"                                "co2"                                
##  [5] "co2_per_capita"                      "trade_co2"                          
##  [7] "cement_co2"                          "cement_co2_per_capita"              
##  [9] "coal_co2"                            "coal_co2_per_capita"                
## [11] "flaring_co2"                         "flaring_co2_per_capita"             
## [13] "gas_co2"                             "gas_co2_per_capita"                 
## [15] "oil_co2"                             "oil_co2_per_capita"                 
## [17] "other_industry_co2"                  "other_co2_per_capita"               
## [19] "co2_growth_prct"                     "co2_growth_abs"                     
## [21] "co2_per_gdp"                         "co2_per_unit_energy"                
## [23] "consumption_co2"                     "consumption_co2_per_capita"         
## [25] "consumption_co2_per_gdp"             "cumulative_co2"                     
## [27] "cumulative_cement_co2"               "cumulative_coal_co2"                
## [29] "cumulative_flaring_co2"              "cumulative_gas_co2"                 
## [31] "cumulative_oil_co2"                  "cumulative_other_co2"               
## [33] "trade_co2_share"                     "share_global_co2"                   
## [35] "share_global_cement_co2"             "share_global_coal_co2"              
## [37] "share_global_flaring_co2"            "share_global_gas_co2"               
## [39] "share_global_oil_co2"                "share_global_other_co2"             
## [41] "share_global_cumulative_co2"         "share_global_cumulative_cement_co2" 
## [43] "share_global_cumulative_coal_co2"    "share_global_cumulative_flaring_co2"
## [45] "share_global_cumulative_gas_co2"     "share_global_cumulative_oil_co2"    
## [47] "share_global_cumulative_other_co2"   "total_ghg"                          
## [49] "ghg_per_capita"                      "total_ghg_excluding_lucf"           
## [51] "ghg_excluding_lucf_per_capita"       "methane"                            
## [53] "methane_per_capita"                  "nitrous_oxide"                      
## [55] "nitrous_oxide_per_capita"            "population"                         
## [57] "gdp"                                 "primary_energy_consumption"         
## [59] "energy_per_capita"                   "energy_per_gdp"

library(dplyr)

rev_Data <- read.csv("co2_data_revised.csv", header = TRUE) #start off by loading the CSV file

colnames(rev_Data) #display the column names

##  [1] "iso_code"                 "country"                 
##  [3] "year"                     "co2"                     
##  [5] "co2_per_capita"           "coal_co2"                
##  [7] "coal_co2_per_capita"      "flaring_co2"             
##  [9] "flaring_co2_per_capita"   "gas_co2"                 
## [11] "gas_co2_per_capita"       "oil_co2"                 
## [13] "oil_co2_per_capita"       "total_ghg"               
## [15] "ghg_per_capita"           "methane"                 
## [17] "methane_per_capita"       "nitrous_oxide"           
## [19] "nitrous_oxide_per_capita" "population"              
## [21] "gdp"

#Extract all info for Australia
Aus_data <- subset(rev_Data, subset = (country == "Australia"))
Aus_co2_p_year <- data.frame(Aus_data$year, Aus_data$co2_per_capita)
Aus_data = subset(Aus_data, select = -c(iso_code, country))
Aus_data[is.na(Aus_data)] <- 0



#Extract all info for China
China_data <- subset(rev_Data, subset = (country == "China"))
China_co2_p_year <- data.frame(China_data$year, China_data$co2_per_capita)
China_data = subset(China_data, select = -c(iso_code, country))
China_data[is.na(China_data)] <- 0


#Extract all info for USA
USA_data <- subset(rev_Data, subset = (country == "United States"))
USA_co2_p_year <- data.frame(USA_data$year, USA_data$co2_per_capita)
USA_data = subset(USA_data, select = -c(iso_code, country))
USA_data[is.na(USA_data)] <- 0
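
The three extraction blocks above repeat the same steps for each country. A minimal sketch of how they could be wrapped in one function is shown below. The helper name `extract_country` and the toy data frame are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: subset one country, drop the identifier columns,
# and replace missing values with 0, mirroring the blocks above.
extract_country <- function(data, country_name) {
  out <- data[data$country == country_name, , drop = FALSE]
  out <- out[, !(names(out) %in% c("iso_code", "country")), drop = FALSE]
  out[is.na(out)] <- 0
  out
}

# Toy data standing in for rev_Data (values are made up for illustration only)
toy <- data.frame(
  iso_code = c("AUS", "AUS", "CHN"),
  country = c("Australia", "Australia", "China"),
  year = c(2000, 2001, 2000),
  co2_per_capita = c(18.1, NA, 2.7)
)

toy_aus <- extract_country(toy, "Australia")
print(toy_aus)
```

With the real data the call would look like `extract_country(rev_Data, "Australia")`. Note that replacing `NA` with 0 treats missing measurements as true zeros, which is the same choice the blocks above make.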

#Australian Correlation Graph
Aus_Corr <- Aus_data[, c("co2_per_capita", "population", "gdp")]

library(PerformanceAnalytics)

## Warning: package 'PerformanceAnalytics' was built under R version 4.1.3

## Loading required package: xts

## Warning: package 'xts' was built under R version 4.1.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.1.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

chart.Correlation(Aus_Corr)

#China Correlation Graph
China_Corr <- China_data[, c("co2_per_capita", "population", "gdp")]

library(PerformanceAnalytics)
chart.Correlation(China_Corr)

## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow

#USA Correlation Graph
USA_Corr <- USA_data[, c("co2_per_capita", "population", "gdp")]

library(PerformanceAnalytics)
chart.Correlation(USA_Corr)

#plot Australian CO2 per year

plot(co2_per_capita ~ year, data = Aus_data, xlab = "Year", main = "CO2 emissions per capita in Australia from 1800 to 2020", ylab = "CO2 emissions per capita (tonnes)",
    pch = 16, type = "b", col = "royalblue3")

#plot China CO2 per year

plot(co2_per_capita ~ year, data = China_data, xlab = "Year", main = "CO2 emissions per capita in China from 1900 to 2020", ylab = "CO2 emissions per capita (tonnes)",
    pch = 16, type = "b", col = "royalblue3")

#plot USA CO2 per year

plot(co2_per_capita ~ year, data = USA_data, xlab = "Year", main = "CO2 emissions per capita in USA from 1800 to 2020", ylab = "CO2 emissions per capita (tonnes)",
    pch = 16, type = "b", col = "royalblue3")

# Running a linear model for the Australian Dataset

Aus_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = Aus_data)

plot(Aus_lm_co2_p_yr)

# Running a linear model for the china dataset

China_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = China_data)

plot(China_lm_co2_p_yr)

# Running a linear model for the USA dataset

USA_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = USA_data)

plot(USA_lm_co2_p_yr)

#Making a Training and Testing dataset for Australia
Index <- sample(nrow(Aus_data), floor(0.25 * nrow(Aus_data)))

Train.Aus_data<- Aus_data[-Index, ]
Test.Aus_data <- Aus_data[Index, ]

co2_for_aus.lm <- lm(population ~ co2_per_capita, data = Train.Aus_data)
summary(co2_for_aus.lm)

## 
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.Aus_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2997483 -1048170  -232934   486717  7941198 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1441581     257439    5.60  1.4e-07 ***
## co2_per_capita  1048744      24895   42.13  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1808000 on 119 degrees of freedom
## Multiple R-squared:  0.9372, Adjusted R-squared:  0.9366 
## F-statistic:  1775 on 1 and 119 DF,  p-value: < 2.2e-16
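
The 75/25 split above relies on `sample()`, so the rows that land in the training and test sets change on every run. A minimal sketch of a reproducible split follows. The seed value and the toy data frame are assumptions for illustration, and the original analysis does not set a seed.

```r
set.seed(2022)  # assumed seed; any fixed integer gives a repeatable split

# Toy data frame standing in for Aus_data
toy <- data.frame(x = 1:20, y = (1:20) * 2)

# Hold out 25% of the rows for testing, as in the split above
test_idx <- sample(nrow(toy), floor(0.25 * nrow(toy)))
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

nrow(train)
nrow(test)
```

Because the series are yearly, a random split mixes early and late years in both sets. Holding out the most recent years instead would be closer to the forecasting goal, but that is a design choice outside what the code above does.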

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(co2_for_aus.lm)

#forward step model for co2 and population

lm0 <- lm(population ~ 1, data = Train.Aus_data)
lmall <- lm(population ~ ., data = Train.Aus_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)

## 
## Call:
## lm(formula = population ~ co2 + year + coal_co2 + coal_co2_per_capita + 
##     ghg_per_capita + flaring_co2 + methane + nitrous_oxide + 
##     gas_co2_per_capita, data = Train.Aus_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403887  -92651    8432   78759  379135 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -139325218    3350477 -41.584  < 2e-16 ***
## co2                      29837       1673  17.837  < 2e-16 ***
## year                     75459       1789  42.173  < 2e-16 ***
## coal_co2                 11554       4480   2.579   0.0112 *  
## coal_co2_per_capita    -292018      29086 -10.040  < 2e-16 ***
## ghg_per_capita         -291303      24579 -11.852  < 2e-16 ***
## flaring_co2             120403      11785  10.217  < 2e-16 ***
## methane                  80897       7737  10.455  < 2e-16 ***
## nitrous_oxide           -53650       6253  -8.580 6.46e-14 ***
## gas_co2_per_capita     -680484     105664  -6.440 3.16e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145500 on 111 degrees of freedom
## Multiple R-squared:  0.9996, Adjusted R-squared:  0.9996 
## F-statistic: 3.248e+04 on 9 and 111 DF,  p-value: < 2.2e-16
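
Forward selection with `step()` starts from the intercept-only model and adds, one at a time, the predictor that lowers the AIC the most, stopping when no addition improves it. The sketch below shows the same pattern on the built-in `mtcars` data, which is used here only as a stand-in and has nothing to do with the emissions dataset.

```r
# Intercept-only model: the starting point for forward selection
null_fit <- lm(mpg ~ 1, data = mtcars)

# Full model: defines the largest set of predictors step() may add
full_fit <- lm(mpg ~ ., data = mtcars)

# Add predictors one at a time while the AIC keeps decreasing
fwd_fit <- step(null_fit, scope = formula(full_fit),
                direction = "forward", trace = 0)

formula(fwd_fit)  # predictors chosen by the procedure
AIC(null_fit)
AIC(fwd_fit)
```

Setting `trace = 1` prints each step, which helps when checking why a particular variable was added.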

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(lmfwd)

#Making a Training and Testing dataset for China
Index <- sample(nrow(China_data), floor(0.25 * nrow(China_data)))

Train.China_data<- China_data[-Index, ]
Test.China_data <- China_data[Index, ]

co2_for_China.lm <- lm(population ~ co2_per_capita, data = Train.China_data)
summary(co2_for_China.lm)

## 
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.China_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -226197981 -121337073  -66480485  151780419  328476373 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    576516138   21400593   26.94   <2e-16 ***
## co2_per_capita 145653080    8068823   18.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 167200000 on 90 degrees of freedom
## Multiple R-squared:  0.7836, Adjusted R-squared:  0.7812 
## F-statistic: 325.9 on 1 and 90 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(co2_for_China.lm)

#forward step model for co2 and population

lm0 <- lm(population ~ 1, data = Train.China_data)
lmall <- lm(population ~ ., data = Train.China_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)

## 
## Call:
## lm(formula = population ~ year + oil_co2_per_capita + gas_co2 + 
##     co2 + coal_co2 + flaring_co2_per_capita + methane_per_capita + 
##     ghg_per_capita + gdp + oil_co2 + methane + gas_co2_per_capita + 
##     flaring_co2, data = Train.China_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -29053174  -9980064   2173502   9878843  36921053 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.682e+09  2.758e+08 -20.602  < 2e-16 ***
## year                    3.196e+06  1.432e+05  22.316  < 2e-16 ***
## oil_co2_per_capita      2.882e+09  3.496e+08   8.244 3.17e-12 ***
## gas_co2                 6.173e+06  2.413e+06   2.558 0.012469 *  
## co2                     6.675e+05  1.690e+05   3.950 0.000170 ***
## coal_co2               -6.715e+05  1.801e+05  -3.729 0.000363 ***
## flaring_co2_per_capita -3.558e+10  1.844e+10  -1.929 0.057340 .  
## methane_per_capita      3.166e+09  5.862e+08   5.401 6.94e-07 ***
## ghg_per_capita         -1.072e+08  1.557e+07  -6.886 1.30e-09 ***
## gdp                     4.847e-05  5.006e-06   9.681 5.18e-15 ***
## oil_co2                -2.393e+06  3.981e+05  -6.012 5.54e-08 ***
## methane                -2.216e+06  5.210e+05  -4.254 5.79e-05 ***
## gas_co2_per_capita     -1.088e+10  3.553e+09  -3.063 0.003004 ** 
## flaring_co2             3.679e+07  1.829e+07   2.012 0.047714 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15360000 on 78 degrees of freedom
## Multiple R-squared:  0.9984, Adjusted R-squared:  0.9982 
## F-statistic:  3784 on 13 and 78 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(lmfwd)

#Making a Training and Testing dataset for USA
Index <- sample(nrow(USA_data), floor(0.25 * nrow(USA_data)))

Train.USA_data<- USA_data[-Index, ]
Test.USA_data <- USA_data[Index, ]

co2_for_USA.lm <- lm(population ~ co2_per_capita, data = Train.USA_data)
summary(co2_for_USA.lm)

## 
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.USA_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -81791053 -25703298   2549724   9666070 163133553 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4639341    5206956   0.891    0.374    
## co2_per_capita 11464374     407373  28.142   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41410000 on 164 degrees of freedom
## Multiple R-squared:  0.8284, Adjusted R-squared:  0.8274 
## F-statistic:   792 on 1 and 164 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(co2_for_USA.lm)

#forward step model for co2 and population

lm0 <- lm(population ~ 1, data = Train.USA_data)
lmall <- lm(population ~ ., data = Train.USA_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)

## 
## Call:
## lm(formula = population ~ co2 + year + flaring_co2 + flaring_co2_per_capita + 
##     methane_per_capita + gdp + oil_co2 + coal_co2_per_capita + 
##     ghg_per_capita + oil_co2_per_capita, data = Train.USA_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -11925634  -3282922   -461093   2727157  13630415 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.262e+09  4.914e+07 -25.675  < 2e-16 ***
## co2                     4.391e+04  4.029e+03  10.899  < 2e-16 ***
## year                    6.985e+05  2.681e+04  26.059  < 2e-16 ***
## flaring_co2             1.667e+06  1.774e+05   9.397  < 2e-16 ***
## flaring_co2_per_capita -2.601e+08  3.353e+07  -7.757 1.09e-12 ***
## methane_per_capita      3.583e+06  4.635e+06   0.773  0.44072    
## gdp                     2.428e-06  3.303e-07   7.350 1.07e-11 ***
## oil_co2                -5.502e+04  7.579e+03  -7.259 1.77e-11 ***
## coal_co2_per_capita    -2.571e+06  4.729e+05  -5.437 2.07e-07 ***
## ghg_per_capita         -1.919e+06  6.375e+05  -3.011  0.00304 ** 
## oil_co2_per_capita      1.962e+06  1.081e+06   1.815  0.07139 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4722000 on 155 degrees of freedom
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9978 
## F-statistic:  7336 on 10 and 155 DF,  p-value: < 2.2e-16
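
The objects `lm0`, `lmall` and `lmfwd` are reassigned for each country, so after this block they hold only the USA fits. One way to keep every country's model available for later prediction is to store the fits in a named list. The sketch below uses simulated data with hypothetical countries "A" and "B". The names `toy` and `fits` are assumptions for illustration.

```r
set.seed(1)

# Simulated data: two hypothetical countries with very different slopes
toy <- data.frame(
  country = rep(c("A", "B"), each = 30),
  x = rep(1:30, 2)
)
toy$y <- ifelse(toy$country == "A", 2, 50) * toy$x + rnorm(60)

# One fitted model per country, stored under the country's name
fits <- lapply(split(toy, toy$country), function(d) lm(y ~ x, data = d))

# Each model is retrieved by name, so predictions use the matching fit
coef(fits[["A"]])["x"]
coef(fits[["B"]])["x"]
```

Predicting for country "A" with `fits[["A"]]` rather than a shared variable avoids scoring one country's data with another country's model.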

par(mfrow=c(2,2)) ## create 2x2 frames for the diagnostics plots

plot(lmfwd)

For this one, we check whether the model fits the test data:

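# Predictions from the model with all variables for AUS
# (lmall was last assigned by the USA fit above, so this applies the USA model to the Australian test data)
Aus_pred <- predict(lmall, newdata = Test.Aus_data)

# Predictions from the model selected by forward selection for Aus
# (lmfwd likewise holds the USA forward-selection fit at this point)
Aus_fwdpred <- predict(lmfwd, newdata = Test.Aus_data)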

Actual <- Test.Aus_data$population
RMSEP.all <- sqrt(sum((Actual - Aus_pred)^2)/length(Actual))
RMSEP.all

## [1] 785064146

RMSEP.fwd <- sqrt(sum((Actual - Aus_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 68448122
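
The RMSEP computed above is the square root of the mean squared difference between actual and predicted values, so it is expressed in the units of the response (here, people). A minimal sketch of the same calculation as a reusable function follows. The name `rmsep` and the length check are assumptions added for illustration.

```r
# Root mean squared error of prediction
rmsep <- function(actual, predicted) {
  # Guard against silent recycling when the two vectors differ in length
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Small worked example: squared errors are 0, 0 and 4, so the result is sqrt(4/3)
rmsep(c(1, 2, 3), c(1, 2, 5))
```

Since the value scales with the size of the response, RMSEP figures are only comparable between models fitted to the same country.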

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, Aus_pred, Aus_fwdpred))
plot(Actual ~ Aus_pred, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ Aus_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Test.Aus_data$population)

For this one, we check whether the model fits the training data:

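# Predictions from the model with all variables for AUS
# (lmall still holds the USA fit, the last model assigned to this name)
Aus_pred1 <- predict(lmall, newdata = Train.Aus_data)

# Predictions from the model selected by forward selection for Aus
# (lmfwd still holds the USA forward-selection fit)
Aus_fwdpred <- predict(lmfwd, newdata = Train.Aus_data)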

Actual <- Train.Aus_data$population
RMSEP.all <- sqrt(sum((Actual - Aus_pred1)^2)/length(Actual))
RMSEP.all

## [1] 906265878

RMSEP.fwd <- sqrt(sum((Actual - Aus_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 65150763

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, Aus_pred1, Aus_fwdpred))
plot(Actual ~ Aus_pred1, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ Aus_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Train.Aus_data$population)

For the China Dataset:

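# Predictions from the model with all variables for China
# (lmall was last assigned by the USA fit above, so this applies the USA model to the Chinese test data)
China_pred <- predict(lmall, newdata = Test.China_data)

# Predictions from the model selected by forward selection for China
# (lmfwd likewise holds the USA forward-selection fit at this point)
China_fwdpred <- predict(lmfwd, newdata = Test.China_data)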

Actual <- Test.China_data$population
RMSEP.all <- sqrt(sum((Actual - China_pred)^2)/length(Actual))
RMSEP.all

## [1] 466377654

RMSEP.fwd <- sqrt(sum((Actual - China_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 689291246

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, China_pred, China_fwdpred))
plot(Actual ~ China_pred, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ China_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Test.China_data$population)

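# Predictions from the model with all variables for China, on the training data
# (lmall still holds the USA fit, the last model assigned to this name)
China_pred1 <- predict(lmall, newdata = Train.China_data)

# Predictions from the model selected by forward selection for China
# (lmfwd still holds the USA forward-selection fit)
China_fwdpred <- predict(lmfwd, newdata = Train.China_data)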

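Actual <- Train.China_data$population
# China_pred holds the test-set predictions; the training-set predictions are in China_pred1,
# so the vectors differ in length and R recycles the shorter one (hence the warning below)
RMSEP.all <- sqrt(sum((Actual - China_pred)^2)/length(Actual))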

## Warning in Actual - China_pred: longer object length is not a multiple of
## shorter object length

RMSEP.all

## [1] 724784011

RMSEP.fwd <- sqrt(sum((Actual - China_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 667331933

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, China_pred1, China_fwdpred))
plot(Actual ~ China_pred1, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ China_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Train.China_data$population)

For the USA Dataset:

# Predictions from the model with all variables FOR USA
USA_pred <- predict(lmall, newdata = Test.USA_data)

# Predictions from the model selected by forward selection for USA
USA_fwdpred <- predict(lmfwd, newdata = Test.USA_data)

Actual <- Test.USA_data$population
RMSEP.all <- sqrt(sum((Actual - USA_pred)^2)/length(Actual))
RMSEP.all

## [1] 5528257

RMSEP.fwd <- sqrt(sum((Actual - USA_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 5208647

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, USA_pred, USA_fwdpred))
plot(Actual ~ USA_pred, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ USA_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Test.USA_data$population)

# Predictions from the model with all variables FOR USA
USA_pred1 <- predict(lmall, newdata = Train.USA_data)

# Predictions from the model selected by forward selection for USA
USA_fwdpred <- predict(lmfwd, newdata = Train.USA_data)

Actual <- Train.USA_data$population
RMSEP.all <- sqrt(sum((Actual - USA_pred1)^2)/length(Actual))
RMSEP.all

## [1] 4424327

RMSEP.fwd <- sqrt(sum((Actual - USA_fwdpred)^2)/length(Actual))
RMSEP.fwd

## [1] 4563012

par(mfrow = c(1, 2))  # side-by-side plots
par(pty = "s")  # square plots
Range <- range(c(Actual, USA_pred1, USA_fwdpred))
plot(Actual ~ USA_pred1, xlab = "predictions from full model", ylab = "actual population",
    xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
    adj = 0)
plot(Actual ~ USA_fwdpred, xlab = "predictions from forward stepwise model", ylab = "actual population",
    xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
    adj = 0)
abline(0, 1)

plot(Actual, Train.USA_data$population)

STAT1003 - Final Assignment for Group SJM

27th May 2022