With industrialization and rapid population growth, global CO2 emissions have risen steadily and quickly. This has been a critical international issue for many years and must be managed accordingly, because CO2 emissions are one of the major drivers of climate change and of other pressing environmental issues, including air pollution, ocean acidification and global warming. Rising CO2 emissions affect not only the environment but also the economy, agricultural systems, and human health, as diseases and natural disasters become more frequent on a warming planet. The scientific community widely agrees that the rise is caused by increased human activity, which releases huge amounts of greenhouse gases. As fossil fuel use grows to meet energy demand, countries should be aware of where emission levels are heading and must regulate them through research, policies, and new laws that restrict industrial emissions, since unchecked emissions could harm every part of the globe. With frequent and careful analysis of CO2 emissions data, the research community can better understand and monitor the global carbon cycle, while the Global Carbon Project report provides a highly valuable resource within a climate policy framework (GCP, 2010).
The focus of this project is to predict global levels of CO2 and greenhouse gas emissions using statistical modeling, visual analysis, and exploratory data analysis. Doing so gives us an understanding of how the world can progress towards a more sustainable society by reducing production-based annual emissions on a global scale. We will attempt to predict CO2 levels 2 years into the future using the ‘best’ predictive model we have selected. This means the model will produce values outside the time range covered by our dataset.
Our chosen dataset is sourced from the Global Carbon Project, whose report is produced by community researchers to announce a global carbon budget quantifying CO2 emissions for the prior year (ICOS, 2016). The data ranges from 1750 to 2020 and is updated annually with global carbon dioxide emissions, specifically production-based annual emissions from the burning of fossil fuels for energy (e.g., gas, oil, coal) and industrial production (e.g., steel, cement), balanced by carbon stored in land and ocean reservoirs. It does not include emissions from land use change, because of uncertainties in the data and the difficulty of monitoring them accurately enough for annual updates. The data has been analysed before because of the increasing significance of managing and reducing each country's greenhouse emissions. In the data, CO2 levels are continuous numerical values plotted against discrete numerical points in time (years). Global CO2 levels are separated by country, which is categorical, nominal data.
Since our goal is to predict global CO2 values, we decided it would be most valuable to model the data using a linear regression algorithm, which yields specific numeric values. Because our response variable (global CO2 emissions) is continuous, the problem is suited to predictive regression rather than classification of discrete data. With a regression model, the parameters can be interpreted, while supervised machine learning focuses on forecasting future values of the response variable as a function of explanatory variables. Predictive modeling is then implemented as a cross between the two disciplines. As the dataset contains many years of historic data, supervised machine learning is the best choice for predicting future values: the algorithm learns patterns from past observations with known outcomes, rather than identifying groupings in data without known outcomes.
Located on the “Our World in Data” website, the dataset was obtained as a .csv file and consisted of 25,191 rows and 60 columns. Figure 1 shows the names of all columns in the original dataset, along with their index numbers, as extracted in RStudio. Before the dataset could be analyzed, it had to be cleaned by removing gaps, duplicates and missing values, as these would reduce prediction accuracy significantly. Here, we substituted missing (NA) values with dummy 0 values. The dataset originally contained many variables that did not need to be analyzed, so these were removed by deleting certain columns and rows. This makes the analysis more meaningful, as a smaller number of variables can be examined more closely.
Figure 1: Extract of column names of all variables within original dataset
Unnecessary variables were removed by deleting certain columns and rows, producing a revised dataset with a clearer focus on our selected problem. The dataset was then decomposed into smaller subsets for the USA, Australia, and China. Choosing three significant countries allows their CO2 emissions to be compared and specific relationships to be identified.
Figure 2: Extract of column names of all variables within new dataset
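The cleaning steps described above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the object `toy` and all of its values are invented, and the `na.omit()` alternative is shown purely for comparison and is not what the report used.

```r
# Toy data standing in for the OWID file; all values are invented.
toy <- data.frame(
  country        = c("Australia", "Australia", "China", "China", "United States"),
  year           = c(2019, 2020, 2019, 2020, 2020),
  co2_per_capita = c(16.3, NA, 7.1, 7.4, 14.2),
  trade_co2      = c(NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest, then take one country's rows.
keep <- c("country", "year", "co2_per_capita")
aus  <- subset(toy[, keep], country == "Australia")

# Option 1 (as in the report): replace missing values with 0.
aus_zero <- aus
aus_zero[is.na(aus_zero)] <- 0

# Option 2 (alternative, for comparison): drop rows with missing values.
aus_drop <- na.omit(aus)

mean(aus_zero$co2_per_capita)  # pulled towards 0 by the dummy value
mean(aus_drop$co2_per_capita)  # based on observed values only
```

Comparing the two means shows how a dummy 0 can shift summary statistics, which is worth keeping in mind when interpreting models fitted to the zero-filled data.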
Correlation plots were produced for each subset to understand the relationships between the chosen variables (i.e., CO2 per capita, population and GDP), and these proved significant in helping us find the best explanatory variables to focus on. The three variables appeared to be interlinked, as more people contribute to more CO2 emissions and GDP influences where different populations live.
We used a scatterplot matrix to visualize multiple variables and determine which had the strongest correlation for prediction analysis. Looking at the correlation coefficients, we found that every relationship was positive and strong; in every subset the lowest correlation coefficient was between CO2 per capita and GDP, and the highest was between CO2 per capita and population. Based on this, we used CO2 per capita and population as our two main variables for all three countries. We also used line graphs to display the trend of the response variable over time for each subset. These were useful for identifying the generally positive trend of CO2 emissions per capita (tonnes per person) visually and for determining whether we needed to apply a transformation (e.g., logarithms) to reduce the effect of large differences between points. After plotting each country’s CO2 per capita by year, it was safe to assume that the data did not contain many significant outliers that needed to be removed and that no rescaling was needed. With CO2 per capita and population consistently having the strongest correlation in each of the three subsets, and with a positive trend over time visible in the line graphs, it was safe to select these two variables as the basis for our predictions.
Figure 3: Correlation graph for Australia
Figure 4: Correlation graph for China
Figure 5: Correlation graph for USA
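The correlation coefficients shown in such plots can also be computed directly. The following is a minimal sketch using simulated series, not the report's data; the data frame `toy` and its trend parameters are assumptions chosen only so that the example runs on its own, and base R's `pairs()` is used as a simpler stand-in for `chart.Correlation()`.

```r
# Simulated yearly series (not the report's data) sharing an upward trend.
set.seed(1)
year <- 1950:2019
toy <- data.frame(
  co2_per_capita = 5 + 0.15 * (year - 1950) + rnorm(length(year), sd = 0.5),
  population     = 8e6 + 2.4e5 * (year - 1950) + rnorm(length(year), sd = 2e5),
  gdp            = 5e10 * exp(0.035 * (year - 1950))
)

# Pairwise Pearson correlation coefficients between all three variables.
corr <- cor(toy, method = "pearson")
round(corr, 2)

# Base-R scatterplot matrix of the same variables.
pairs(toy, main = "Scatterplot matrix (simulated data)")
```

Note that variables which all trend upwards over time will tend to show strong pairwise correlations, so a high coefficient on its own does not establish that one variable drives another.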
We then fitted a linear model for each subset and plotted it to produce the following diagnostic graphs:
Residuals vs Fitted: The residuals were equally spread around a horizontal line with no distinct patterns, so the graphs did not show any non-linear patterns between the response and explanatory variables. This indicated that a linear fit was adequate and that no non-linear relationship was left unexplained by the model.
Normal Q-Q: We analyzed the Q-Q plots for each subset, and the residuals appeared to be normally distributed, with only minor deviations from the straight diagonal line at the tail of the plot, where the values level out. Based on these plots, we assumed the residuals are approximately normally distributed, although the values have leveled out in recent years.
Scale-Location: The residuals appeared to be spread equally along the range of predictors, with a relatively horizontal line, which supports the assumption of equal variance.
Residuals vs Leverage: These plots helped us identify potentially influential outliers that could alter the results if not removed from the analysis. For Australia and the USA, all values sat well inside the Cook’s distance lines; for China, however, one value lay well outside the Cook’s distance line and was therefore identified as an influential case. Excluding this observation from the analysis changed the slope coefficient from 2.14 to 2.68 and R² from 0.7757 to 0.851.
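The effect of excluding an influential case can be illustrated numerically. The sketch below uses simulated data with one deliberately extreme point; the data, the variable names and the 4/n cut-off are assumptions for illustration only (the report identified its influential case visually from the Cook's distance lines on the plot, and does not state a numeric threshold).

```r
# Simulated data with one deliberately extreme point (illustrative only).
set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 2)
x[30] <- 60   # high leverage: far from the other x values
y[30] <- 20   # and far below the trend, so it pulls the fitted line down

fit_all <- lm(y ~ x)

# Cook's distance for every observation; flag values above the 4/n rule of thumb.
cd <- cooks.distance(fit_all)
influential <- which(cd > 4 / length(cd))

# Refit without the flagged observations and compare slope and R-squared.
keep <- setdiff(seq_along(x), influential)
fit_trim <- lm(y ~ x, subset = keep)

c(slope_all = unname(coef(fit_all)["x"]), slope_trim = unname(coef(fit_trim)["x"]))
c(r2_all = summary(fit_all)$r.squared, r2_trim = summary(fit_trim)$r.squared)
```

As in the China subset, removing the influential observation changes both the slope and the R-squared, which is why such points should be reported rather than silently dropped.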
In the project, simple linear regression was used to build a predictive model of global CO2 emissions. Supervised learning on training data drawn from the dataset was used to produce the predictive models. Using machine learning in R, we were able to analyze the data and predict future outcomes based on the selected variables. First, the data was split into two non-overlapping sets: a training set containing 75% of the data points and a testing set containing the remaining 25%. The training set was used to fit the models and the testing set to validate them. Using the training dataset, the two variables (global CO2 emissions and time) were plotted against each other and, despite some residual scatter, a linear regression line could be drawn. The model's accuracy was then evaluated by applying it to the test dataset. With this, we were able to tweak the model's hyperparameters, which are values set before the algorithm begins learning. The test points were more spread out than in the plot for the training dataset; however, a linear relationship was still clearly visible, and performance seemed satisfactory. To evaluate the model further, its prediction interval was determined and the values were checked against it at a set level of certainty. The ‘best model’ was then selected to forecast the trajectory of global CO2 emissions 2 years into the future.
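The workflow described above (split, fit, validate, forecast) can be sketched as follows. This is a hedged illustration on simulated data: the data frame `toy`, its trend and noise level, the seed, and the 95% level are all assumptions, not values taken from the report.

```r
# Simulated yearly series standing in for the real data (illustrative only).
set.seed(123)
toy <- data.frame(year = 1950:2020)
toy$co2 <- 5000 + 400 * (toy$year - 1950) + rnorm(nrow(toy), sd = 800)

# Non-overlapping 75% / 25% split by random row index.
test_idx <- sample(nrow(toy), floor(0.25 * nrow(toy)))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit the simple linear regression on the training rows only.
fit <- lm(co2 ~ year, data = train)

# Share of held-out points that fall inside their 95% prediction interval.
pi_test  <- predict(fit, newdata = test, interval = "prediction", level = 0.95)
coverage <- mean(test$co2 >= pi_test[, "lwr"] & test$co2 <= pi_test[, "upr"])

# Extrapolate beyond the observed range, with prediction intervals.
future   <- data.frame(year = 2021:2023)
forecast <- cbind(future, predict(fit, newdata = future, interval = "prediction"))
forecast
```

Setting a seed before `sample()` makes the split reproducible, so the same training and testing rows are selected each time the script is run.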
Looking at the results for the 3 subsets, we can see that CO2 levels for China are significantly lower than for Australia and the USA, peaking at around 8 tonnes per person by 2020, whereas Australia and the USA have a much higher peak of about 20 tonnes per person around the 2000s. From what we have gathered, the USA has produced the highest levels of CO2 of the three countries, reaching 15 tonnes per person between 1900 and 1950. The USA’s CO2 levels are also much higher because they began rising much earlier, between 1800 and 1850, whereas China’s did not start rising dramatically until around 1960. From these graphs, we can predict that Australia will start lowering its CO2 levels, as it seems to have hit its peak in the year 2000 and has been dropping since. For China, we can predict continued growth in CO2 levels. This could be because China’s CO2 levels did not rise until much later than those of Australia and the USA. China also does not seem to have hit its peak in the data received, as its levels were still growing in 2020. From this data we can predict that once China hits 20 tonnes per person, we would see a steady decrease in CO2 levels, similar to Australia and the USA. In contrast to China, the USA shows a major drop after peaking at 20 tonnes per person between 1950 and 2000. While the USA has the least consistent CO2 levels, rising and dropping many times, we can see that by the end of the 2000s they are still decreasing by a significant amount. Figures 6, 7 and 8 show the line graphs of CO2 versus year for each country.
Figure 6: Australian Line Graph
Figure 7: China Line Graph
Figure 8: USA Line Graph
For Australia, the RMSEP values based on the predictions from the actual (full) model and our forward step model are quite similar. The forward step model's RMSEP differs from the full model's by 12%, which is acceptable because it shows that the forward step model fits about as well as the model created before. We can safely assume that the predictions of our full model match those of the forward step model, and both indicate that CO2 levels will remain high and keep increasing. Figure 9 shows the test data RMSEP.
Figure 9: RMSEP values for the Australian Dataset
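For readers unfamiliar with the forward step procedure, the following minimal sketch shows how `step()` builds a model from an intercept-only starting point towards the full model. The simulated data and the names `toy`, `x1`, `x2` and `x3` are hypothetical; only the `step()` call pattern mirrors the one used in the report's code.

```r
# Simulated data: y depends on x1 and x2, while x3 is pure noise.
set.seed(7)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 0.5)

# Smallest candidate model (intercept only) and largest (all predictors).
null_fit <- lm(y ~ 1, data = toy)
full_fit <- lm(y ~ ., data = toy)

# Forward selection: add one term at a time while AIC keeps improving.
fwd_fit <- step(null_fit, scope = formula(full_fit),
                direction = "forward", trace = 0)
formula(fwd_fit)
```

Because selection is driven by AIC on the training data, the chosen terms can change with a different random split, so comparing test-set RMSEP against the full model is a sensible check.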
For China, the predictions from the full model and the forward step model differed more than they did for the Australian dataset. The RMSEP difference between the forward step model and the full model is 33%, which is higher than for Australia. This means that although the model fits, it predicts a higher rate of CO2 production than for Australia in the future. This makes sense, as China is currently one of the few countries producing CO2 on such a massive scale, and the more CO2 it produces, the more damaging it is to Earth’s environment. This is concerning because if China does not start reducing its CO2 levels, the impacts on the climate are likely to become worse than they are now. Figure 10 shows the test data RMSEP.
Figure 10: RMSEP values for the China Dataset
For the USA, the predictions from the full model and the forward step model are very similar, as in the output for the Australian dataset. The RMSEP difference between the forward step model and the full model is only 7%, which shows that the model works very well. It also shows that the amount of CO2 produced by the USA is very steady. This is not good for the climate, but it is good for the analysis of our data. Based on the predictive models produced, a forecast of total global CO2 levels was produced for 2021, 2022 and 2023. Figure 11 shows the test data RMSEP.
Figure 11: RMSEP values for the USA Dataset
Our RMSEP values were in the millions because the response being measured was population, and our analysis is based on the test set, which contains 25% of the data points.
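RMSEP (root mean squared error of prediction) is the square root of the mean squared difference between observed test values and model predictions. The sketch below shows one way to compute it and to express the gap between two models as a percentage. The helper `rmsep()`, the example numbers, and the percentage definition are assumptions for illustration; the report does not state exactly how its percentages were derived.

```r
# Root mean squared error of prediction on held-out data.
rmsep <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Tiny made-up example: observed test values and two models' predictions.
observed  <- c(10, 12, 15, 19)
pred_full <- c(11, 12, 14, 18)
pred_fwd  <- c(12, 13, 13, 17)

rmsep_full <- rmsep(observed, pred_full)
rmsep_fwd  <- rmsep(observed, pred_fwd)

# Relative difference between the two models' RMSEP, in percent.
pct_diff <- 100 * (rmsep_fwd - rmsep_full) / rmsep_full
pct_diff
```

Because RMSEP is expressed in the units of the response, values in the millions are expected when the response is a population count.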
In this project, we used a range of statistical modeling and data analysis methods to train and test predictive models that forecast global CO2 emissions, using data extracted from a large dataset found on the Global Carbon Project website. The tests conducted lead us to believe that CO2 emission levels are now far greater than before, even with each country's efforts to manage and reduce greenhouse emissions. With a linear regression algorithm, specific numeric values and model parameters were obtained to forecast future values based on historic data collected over many years. Because we split the data into smaller sets for three countries of our choice and based global CO2 emissions on them, our predictions may be inaccurate. Since the subsets are much smaller than the full population (global CO2 emissions), there is some uncertainty regarding the predictive model’s accuracy. This could have been reduced by analyzing global CO2 emissions directly, including all countries, rather than decomposing the data into the 3 subsets for Australia, China and the USA. In future analyses, we could also examine the effect of GDP on CO2 levels, given its strong correlation coefficient as seen in the correlation plots, or analyze global CO2 emissions based solely on their change over time.
Ritchie, H. and Roser, M. (2017). CO2 and other Greenhouse Gas Emissions. [online] Our World in Data. Available at: https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions.
careerfoundry.com. (2010). What Is Data Wrangling? A Complete Introductory Guide. [online] Available at: https://careerfoundry.com/en/blog/data-analytics/data-wrangling/#what-is-the-data-wrangling-process.
IEA. (2022). Global Energy Review: CO2 Emissions in 2021. [online] IEA, Paris. Available at: https://www.iea.org/reports/global-energy-review-co2-emissions-in-2021-2
ICOS. (2018). Global Carbon Budget. [online] Available at: https://www.icos-cp.eu/science-and-impact/global-carbon-budget
The following lines of code briefly explain what was done to achieve our results. Each piece of code has a comment associated with it to make it easier to understand what is being done.
Data <- read.csv("owid-co2-data.csv", header = TRUE) #start off by loading the CSV file
dim(Data) #check the number of rows and columns
## [1] 25191 60
summary(Data) #get a summary for each column
## iso_code country year co2
## Length:25191 Length:25191 Min. :1750 Min. : 0.00
## Class :character Class :character 1st Qu.:1924 1st Qu.: 0.53
## Mode :character Mode :character Median :1967 Median : 4.86
## Mean :1953 Mean : 267.86
## 3rd Qu.:1995 3rd Qu.: 42.82
## Max. :2020 Max. :36702.50
## NA's :1242
## co2_per_capita trade_co2 cement_co2 cement_co2_per_capita
## Min. : 0.000 Min. :-1657.998 Min. : 0.000 Min. :0.000
## 1st Qu.: 0.253 1st Qu.: -0.892 1st Qu.: 0.129 1st Qu.:0.020
## Median : 1.250 Median : 1.953 Median : 0.557 Median :0.070
## Mean : 4.171 Mean : -2.416 Mean : 12.889 Mean :0.113
## 3rd Qu.: 4.657 3rd Qu.: 9.700 3rd Qu.: 2.897 3rd Qu.:0.156
## Max. :748.639 Max. : 1028.487 Max. :1626.371 Max. :2.738
## NA's :1884 NA's :21215 NA's :12943 NA's :12973
## coal_co2 coal_co2_per_capita flaring_co2
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.322 1st Qu.: 0.054 1st Qu.: 0.253
## Median : 3.981 Median : 0.442 Median : 2.071
## Mean : 175.358 Mean : 1.552 Mean : 15.000
## 3rd Qu.: 35.533 3rd Qu.: 2.149 3rd Qu.: 12.604
## Max. :15062.902 Max. :34.184 Max. :435.034
## NA's :8003 NA's :8331 NA's :20809
## flaring_co2_per_capita gas_co2 gas_co2_per_capita
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.021 1st Qu.: 0.385 1st Qu.: 0.031
## Median : 0.068 Median : 4.199 Median : 0.282
## Mean : 0.875 Mean : 108.751 Mean : 1.413
## 3rd Qu.: 0.203 3rd Qu.: 30.830 3rd Qu.: 1.436
## Max. :94.711 Max. :7553.394 Max. :52.484
## NA's :20810 NA's :16346 NA's :16356
## oil_co2 oil_co2_per_capita other_industry_co2 other_co2_per_capita
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.000
## 1st Qu.: 0.311 1st Qu.: 0.121 1st Qu.: 0.748 1st Qu.:0.036
## Median : 2.100 Median : 0.630 Median : 2.861 Median :0.071
## Mean : 106.254 Mean : 2.635 Mean : 15.754 Mean :0.080
## 3rd Qu.: 17.369 3rd Qu.: 2.474 3rd Qu.: 9.902 3rd Qu.:0.108
## Max. :12229.642 Max. :748.639 Max. :303.858 Max. :0.357
## NA's :4652 NA's :5010 NA's :23192 NA's :23192
## co2_growth_prct co2_growth_abs co2_per_gdp co2_per_unit_energy
## Min. : -99.64 Min. :-1895.244 Min. :0.000 Min. :0.005
## 1st Qu.: -0.45 1st Qu.: -0.011 1st Qu.:0.140 1st Qu.:0.178
## Median : 3.35 Median : 0.059 Median :0.276 Median :0.218
## Mean : 21.10 Mean : 5.147 Mean :0.422 Mean :0.239
## 3rd Qu.: 10.46 3rd Qu.: 1.103 3rd Qu.:0.534 3rd Qu.:0.256
## Max. :102318.51 Max. : 1736.258 Max. :7.776 Max. :4.644
## NA's :260 NA's :1606 NA's :9802 NA's :16050
## consumption_co2 consumption_co2_per_capita consumption_co2_per_gdp
## Min. : 0.20 Min. : 0.055 Min. :0.006
## 1st Qu.: 10.32 1st Qu.: 1.240 1st Qu.:0.216
## Median : 57.09 Median : 4.359 Median :0.315
## Mean : 916.76 Mean : 6.568 Mean :0.370
## 3rd Qu.: 276.38 3rd Qu.: 9.848 3rd Qu.:0.447
## Max. :36702.50 Max. :57.792 Max. :3.543
## NA's :21215 NA's :21215 NA's :21430
## cumulative_co2 cumulative_cement_co2 cumulative_coal_co2
## Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 7.0 1st Qu.: 1.61 1st Qu.: 5.5
## Median : 91.3 Median : 10.45 Median : 98.2
## Mean : 10357.1 Mean : 307.76 Mean : 8791.8
## 3rd Qu.: 1147.5 3rd Qu.: 66.46 3rd Qu.: 1248.8
## Max. :1696524.2 Max. :43163.19 Max. :788362.0
## NA's :1242 NA's :12943 NA's :8003
## cumulative_flaring_co2 cumulative_gas_co2 cumulative_oil_co2
## Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 4.071 1st Qu.: 3.24 1st Qu.: 3.9
## Median : 45.608 Median : 52.06 Median : 39.2
## Mean : 425.699 Mean : 2587.10 Mean : 3296.6
## 3rd Qu.: 281.485 3rd Qu.: 457.78 3rd Qu.: 372.7
## Max. :17792.749 Max. :245231.88 Max. :592621.2
## NA's :20809 NA's :16346 NA's :4652
## cumulative_other_co2 trade_co2_share share_global_co2
## Min. : 0.001 Min. :-96.760 Min. : 0.000
## 1st Qu.: 7.709 1st Qu.: -1.758 1st Qu.: 0.010
## Median : 35.644 Median : 11.675 Median : 0.060
## Mean : 293.588 Mean : 22.961 Mean : 4.984
## 3rd Qu.: 159.188 3rd Qu.: 36.382 3rd Qu.: 0.600
## Max. :7725.988 Max. :366.150 Max. :100.000
## NA's :23192 NA's :21215 NA's :1242
## share_global_cement_co2 share_global_coal_co2 share_global_flaring_co2
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.050 1st Qu.: 0.010 1st Qu.: 0.090
## Median : 0.200 Median : 0.115 Median : 0.700
## Mean : 4.419 Mean : 6.990 Mean : 5.862
## 3rd Qu.: 1.000 3rd Qu.: 1.250 3rd Qu.: 4.440
## Max. :100.000 Max. :100.000 Max. :100.000
## NA's :12943 NA's :8003 NA's :20809
## share_global_gas_co2 share_global_oil_co2 share_global_other_co2
## Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.030 1st Qu.: 0.010 1st Qu.: 0.30
## Median : 0.200 Median : 0.080 Median : 1.34
## Mean : 5.406 Mean : 2.993 Mean : 14.29
## 3rd Qu.: 1.250 3rd Qu.: 0.550 3rd Qu.: 10.09
## Max. :100.000 Max. :100.000 Max. :100.00
## NA's :16346 NA's :4652 NA's :23192
## share_global_cumulative_co2 share_global_cumulative_cement_co2
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.040
## Median : 0.030 Median : 0.200
## Mean : 5.127 Mean : 4.462
## 3rd Qu.: 0.410 3rd Qu.: 0.930
## Max. :100.000 Max. :100.000
## NA's :1242 NA's :12943
## share_global_cumulative_coal_co2 share_global_cumulative_flaring_co2
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.060
## Median : 0.070 Median : 0.540
## Mean : 7.212 Mean : 5.622
## 3rd Qu.: 0.910 3rd Qu.: 3.567
## Max. :100.000 Max. :100.000
## NA's :8003 NA's :20809
## share_global_cumulative_gas_co2 share_global_cumulative_oil_co2
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.010 1st Qu.: 0.010
## Median : 0.110 Median : 0.070
## Mean : 5.242 Mean : 3.002
## 3rd Qu.: 0.820 3rd Qu.: 0.530
## Max. :100.000 Max. :100.000
## NA's :16346 NA's :4652
## share_global_cumulative_other_co2 total_ghg ghg_per_capita
## Min. : 0.000 Min. : -178.71 Min. :-31.485
## 1st Qu.: 0.190 1st Qu.: 8.03 1st Qu.: 2.656
## Median : 0.840 Median : 33.90 Median : 5.386
## Mean : 13.404 Mean : 420.52 Mean : 7.608
## 3rd Qu.: 7.985 3rd Qu.: 115.03 3rd Qu.: 9.604
## Max. :100.000 Max. :48939.71 Max. : 74.729
## NA's :23192 NA's :19540 NA's :19540
## total_ghg_excluding_lucf ghg_excluding_lucf_per_capita methane
## Min. : 0.01 Min. : 0.101 Min. : 0.000
## 1st Qu.: 6.85 1st Qu.: 2.095 1st Qu.: 2.005
## Median : 28.08 Median : 4.442 Median : 8.530
## Mean : 406.51 Mean : 6.871 Mean : 79.072
## 3rd Qu.: 92.60 3rd Qu.: 8.975 3rd Qu.: 30.025
## Max. :47552.14 Max. :53.650 Max. :8298.270
## NA's :19540 NA's :19540 NA's :19536
## methane_per_capita nitrous_oxide nitrous_oxide_per_capita
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.691 1st Qu.: 0.51 1st Qu.: 0.221
## Median : 1.077 Median : 3.46 Median : 0.377
## Mean : 1.902 Mean : 29.09 Mean : 0.602
## 3rd Qu.: 1.619 3rd Qu.: 11.20 3rd Qu.: 0.589
## Max. :39.795 Max. :3078.27 Max. :10.056
## NA's :19536 NA's :19536 NA's :19536
## population gdp primary_energy_consumption
## Min. :1.490e+03 Min. :5.543e+07 Min. : 0.0
## 1st Qu.:1.287e+06 1st Qu.:9.829e+09 1st Qu.: 7.0
## Median :4.870e+06 Median :3.037e+10 Median : 61.4
## Mean :7.068e+07 Mean :2.877e+11 Mean : 1569.1
## 3rd Qu.:1.758e+07 3rd Qu.:1.269e+11 3rd Qu.: 352.9
## Max. :7.795e+09 Max. :1.136e+14 Max. :162194.3
## NA's :2299 NA's :11653 NA's :16501
## energy_per_capita energy_per_gdp
## Min. : 0 Min. : 0.050
## 1st Qu.: 3270 1st Qu.: 0.856
## Median : 13701 Median : 1.407
## Mean : 25569 Mean : 1.850
## 3rd Qu.: 35494 3rd Qu.: 2.351
## Max. :317583 Max. :13.493
## NA's :16510 NA's :18388
head(Data) #print the first 6 rows of the csv file
## iso_code country year co2 co2_per_capita trade_co2 cement_co2
## 1 AFG Afghanistan 1949 0.015 0.002 NA NA
## 2 AFG Afghanistan 1950 0.084 0.011 NA NA
## 3 AFG Afghanistan 1951 0.092 0.012 NA NA
## 4 AFG Afghanistan 1952 0.092 0.012 NA NA
## 5 AFG Afghanistan 1953 0.106 0.013 NA NA
## 6 AFG Afghanistan 1954 0.106 0.013 NA NA
## cement_co2_per_capita coal_co2 coal_co2_per_capita flaring_co2
## 1 NA 0.015 0.002 NA
## 2 NA 0.021 0.003 NA
## 3 NA 0.026 0.003 NA
## 4 NA 0.032 0.004 NA
## 5 NA 0.038 0.005 NA
## 6 NA 0.043 0.005 NA
## flaring_co2_per_capita gas_co2 gas_co2_per_capita oil_co2 oil_co2_per_capita
## 1 NA NA NA NA NA
## 2 NA NA NA 0.063 0.008
## 3 NA NA NA 0.066 0.008
## 4 NA NA NA 0.060 0.008
## 5 NA NA NA 0.068 0.008
## 6 NA NA NA 0.064 0.008
## other_industry_co2 other_co2_per_capita co2_growth_prct co2_growth_abs
## 1 NA NA NA NA
## 2 NA NA 475.0 0.070
## 3 NA NA 8.7 0.007
## 4 NA NA 0.0 0.000
## 5 NA NA 16.0 0.015
## 6 NA NA 0.0 0.000
## co2_per_gdp co2_per_unit_energy consumption_co2 consumption_co2_per_capita
## 1 NA NA NA NA
## 2 0.009 NA NA NA
## 3 0.010 NA NA NA
## 4 0.009 NA NA NA
## 5 0.010 NA NA NA
## 6 0.010 NA NA NA
## consumption_co2_per_gdp cumulative_co2 cumulative_cement_co2
## 1 NA 0.015 NA
## 2 NA 0.099 NA
## 3 NA 0.191 NA
## 4 NA 0.282 NA
## 5 NA 0.388 NA
## 6 NA 0.495 NA
## cumulative_coal_co2 cumulative_flaring_co2 cumulative_gas_co2
## 1 0.015 NA NA
## 2 0.036 NA NA
## 3 0.061 NA NA
## 4 0.093 NA NA
## 5 0.131 NA NA
## 6 0.174 NA NA
## cumulative_oil_co2 cumulative_other_co2 trade_co2_share share_global_co2
## 1 NA NA NA 0
## 2 0.063 NA NA 0
## 3 0.129 NA NA 0
## 4 0.189 NA NA 0
## 5 0.257 NA NA 0
## 6 0.321 NA NA 0
## share_global_cement_co2 share_global_coal_co2 share_global_flaring_co2
## 1 NA 0 NA
## 2 NA 0 NA
## 3 NA 0 NA
## 4 NA 0 NA
## 5 NA 0 NA
## 6 NA 0 NA
## share_global_gas_co2 share_global_oil_co2 share_global_other_co2
## 1 NA NA NA
## 2 NA 0 NA
## 3 NA 0 NA
## 4 NA 0 NA
## 5 NA 0 NA
## 6 NA 0 NA
## share_global_cumulative_co2 share_global_cumulative_cement_co2
## 1 0 NA
## 2 0 NA
## 3 0 NA
## 4 0 NA
## 5 0 NA
## 6 0 NA
## share_global_cumulative_coal_co2 share_global_cumulative_flaring_co2
## 1 0 NA
## 2 0 NA
## 3 0 NA
## 4 0 NA
## 5 0 NA
## 6 0 NA
## share_global_cumulative_gas_co2 share_global_cumulative_oil_co2
## 1 NA NA
## 2 NA 0
## 3 NA 0
## 4 NA 0
## 5 NA 0
## 6 NA 0
## share_global_cumulative_other_co2 total_ghg ghg_per_capita
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## total_ghg_excluding_lucf ghg_excluding_lucf_per_capita methane
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## methane_per_capita nitrous_oxide nitrous_oxide_per_capita population
## 1 NA NA NA 7624058
## 2 NA NA NA 7752117
## 3 NA NA NA 7840151
## 4 NA NA NA 7935996
## 5 NA NA NA 8039684
## 6 NA NA NA 8151316
## gdp primary_energy_consumption energy_per_capita energy_per_gdp
## 1 NA NA NA NA
## 2 9421400000 NA NA NA
## 3 9692280000 NA NA NA
## 4 10017325000 NA NA NA
## 5 10630520000 NA NA NA
## 6 10866360000 NA NA NA
library(tidyverse) #import the tidyverse library
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
colnames(Data) #checkout the names of each column
## [1] "iso_code" "country"
## [3] "year" "co2"
## [5] "co2_per_capita" "trade_co2"
## [7] "cement_co2" "cement_co2_per_capita"
## [9] "coal_co2" "coal_co2_per_capita"
## [11] "flaring_co2" "flaring_co2_per_capita"
## [13] "gas_co2" "gas_co2_per_capita"
## [15] "oil_co2" "oil_co2_per_capita"
## [17] "other_industry_co2" "other_co2_per_capita"
## [19] "co2_growth_prct" "co2_growth_abs"
## [21] "co2_per_gdp" "co2_per_unit_energy"
## [23] "consumption_co2" "consumption_co2_per_capita"
## [25] "consumption_co2_per_gdp" "cumulative_co2"
## [27] "cumulative_cement_co2" "cumulative_coal_co2"
## [29] "cumulative_flaring_co2" "cumulative_gas_co2"
## [31] "cumulative_oil_co2" "cumulative_other_co2"
## [33] "trade_co2_share" "share_global_co2"
## [35] "share_global_cement_co2" "share_global_coal_co2"
## [37] "share_global_flaring_co2" "share_global_gas_co2"
## [39] "share_global_oil_co2" "share_global_other_co2"
## [41] "share_global_cumulative_co2" "share_global_cumulative_cement_co2"
## [43] "share_global_cumulative_coal_co2" "share_global_cumulative_flaring_co2"
## [45] "share_global_cumulative_gas_co2" "share_global_cumulative_oil_co2"
## [47] "share_global_cumulative_other_co2" "total_ghg"
## [49] "ghg_per_capita" "total_ghg_excluding_lucf"
## [51] "ghg_excluding_lucf_per_capita" "methane"
## [53] "methane_per_capita" "nitrous_oxide"
## [55] "nitrous_oxide_per_capita" "population"
## [57] "gdp" "primary_energy_consumption"
## [59] "energy_per_capita" "energy_per_gdp"
library(dplyr)
rev_Data <- read.csv("co2_data_revised.csv", header = TRUE) #load the revised (cleaned) CSV file
colnames(rev_Data) #display the column names
## [1] "iso_code" "country"
## [3] "year" "co2"
## [5] "co2_per_capita" "coal_co2"
## [7] "coal_co2_per_capita" "flaring_co2"
## [9] "flaring_co2_per_capita" "gas_co2"
## [11] "gas_co2_per_capita" "oil_co2"
## [13] "oil_co2_per_capita" "total_ghg"
## [15] "ghg_per_capita" "methane"
## [17] "methane_per_capita" "nitrous_oxide"
## [19] "nitrous_oxide_per_capita" "population"
## [21] "gdp"
#Extract all info for Australia
Aus_data <- subset(rev_Data, subset = (country == "Australia"))
Aus_co2_p_year <- data.frame(Aus_data$year, Aus_data$co2_per_capita)
Aus_data = subset(Aus_data, select = -c(iso_code, country))
Aus_data[is.na(Aus_data)] <- 0
#Extract all info for China
China_data <- subset(rev_Data, subset = (country == "China"))
China_co2_p_year <- data.frame(China_data$year, China_data$co2_per_capita)
China_data = subset(China_data, select = -c(iso_code, country))
China_data[is.na(China_data)] <- 0
#Extract all info for USA
USA_data <- subset(rev_Data, subset = (country == "United States"))
USA_co2_p_year <- data.frame(USA_data$year, USA_data$co2_per_capita)
USA_data = subset(USA_data, select = -c(iso_code, country))
USA_data[is.na(USA_data)] <- 0
#Australian Correlation Graph
Aus_Corr <- Aus_data[, c("co2_per_capita", "population", "gdp")]
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.1.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.1.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(Aus_Corr)
#China Correlation Graph
China_Corr <- China_data[, c("co2_per_capita", "population", "gdp")]
library(PerformanceAnalytics)
chart.Correlation(China_Corr)
## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow
#USA Correlation Graph
USA_Corr <- USA_data[, c("co2_per_capita", "population", "gdp")]
library(PerformanceAnalytics)
chart.Correlation(USA_Corr)
#plot Australian CO2 per year
plot(co2_per_capita ~ year, data = Aus_data, xlab = "Year", main = "CO2 levels in Australia from 1800 to 2020", ylab = "CO2 per capita (tonnes per person)",
pch = 16, type = "b", col = "royalblue3")
#plot China CO2 per year
plot(co2_per_capita ~ year, data = China_data, xlab = "Year", main = "CO2 levels in China from 1900 to 2020", ylab = "CO2 per capita (tonnes per person)",
pch = 16, type = "b", col = "royalblue3")
#plot USA CO2 per year
plot(co2_per_capita ~ year, data = USA_data, xlab = "Year", main = "CO2 levels in USA from 1800 to 2020", ylab = "CO2 per capita (tonnes per person)",
pch = 16, type = "b", col = "royalblue3")
# Linear model of population on CO2 per capita for the Australian dataset, with diagnostic plots
Aus_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = Aus_data)
plot(Aus_lm_co2_p_yr)
# The same linear model for the China dataset
China_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = China_data)
plot(China_lm_co2_p_yr)
# The same linear model for the USA dataset
USA_lm_co2_p_yr <- lm(population ~ co2_per_capita, data = USA_data)
plot(USA_lm_co2_p_yr)
#Making a Training and Testing dataset for Australia
Index <- sample(nrow(Aus_data), floor(0.25 * nrow(Aus_data)))
Train.Aus_data<- Aus_data[-Index, ]
Test.Aus_data <- Aus_data[Index, ]
co2_for_aus.lm <- lm(population ~ co2_per_capita, data = Train.Aus_data)
summary(co2_for_aus.lm)
##
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.Aus_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2997483 -1048170 -232934 486717 7941198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1441581 257439 5.60 1.4e-07 ***
## co2_per_capita 1048744 24895 42.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1808000 on 119 degrees of freedom
## Multiple R-squared: 0.9372, Adjusted R-squared: 0.9366
## F-statistic: 1775 on 1 and 119 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(co2_for_aus.lm)
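The 75/25 split above relies on `sample()`, so the rows chosen (and every number derived from them) change on each run unless the random seed is fixed. A minimal sketch of a reproducible split follows. The seed value 42 and the toy data frame are assumptions for illustration; the original script does not set a seed.

```r
set.seed(42)  # arbitrary seed, an assumption for this sketch

# Toy stand-in for a country data frame (made-up values)
toy <- data.frame(
  year = 1901:2000,
  population = seq(1e6, by = 5e4, length.out = 100)
)

# Hold out 25% of the rows for testing, as in the script above
test_idx  <- sample(nrow(toy), floor(0.25 * nrow(toy)))
train_toy <- toy[-test_idx, ]
test_toy  <- toy[test_idx, ]

c(train = nrow(train_toy), test = nrow(test_toy))
```

Because the observations are yearly, a random split puts test years in between training years. Holding out the most recent years instead would be a stricter check for a forecasting task.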
#forward step model for co2 and population
lm0 <- lm(population ~ 1, data = Train.Aus_data)
lmall <- lm(population ~ ., data = Train.Aus_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)
##
## Call:
## lm(formula = population ~ co2 + year + coal_co2 + coal_co2_per_capita +
## ghg_per_capita + flaring_co2 + methane + nitrous_oxide +
## gas_co2_per_capita, data = Train.Aus_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403887 -92651 8432 78759 379135
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -139325218 3350477 -41.584 < 2e-16 ***
## co2 29837 1673 17.837 < 2e-16 ***
## year 75459 1789 42.173 < 2e-16 ***
## coal_co2 11554 4480 2.579 0.0112 *
## coal_co2_per_capita -292018 29086 -10.040 < 2e-16 ***
## ghg_per_capita -291303 24579 -11.852 < 2e-16 ***
## flaring_co2 120403 11785 10.217 < 2e-16 ***
## methane 80897 7737 10.455 < 2e-16 ***
## nitrous_oxide -53650 6253 -8.580 6.46e-14 ***
## gas_co2_per_capita -680484 105664 -6.440 3.16e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145500 on 111 degrees of freedom
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 3.248e+04 on 9 and 111 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(lmfwd)
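To make the forward selection step easier to follow, here is a small self-contained sketch on simulated data. The variables `x1`, `x2`, `x3` and the simulated response are hypothetical. Only the `step()` call pattern (intercept-only start, full model as scope, `direction = "forward"`) mirrors the script above.

```r
set.seed(1)  # arbitrary seed for this sketch
n <- 60
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Simulated response: depends on x1 and x2 only; x3 is pure noise
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 0.5)

null_fit <- lm(y ~ 1, data = toy)   # intercept-only starting model
full_fit <- lm(y ~ ., data = toy)   # largest model step() may reach

# Forward selection adds, at each step, the term that lowers AIC the most,
# and stops when no remaining term improves AIC
fwd_fit <- step(null_fit, scope = formula(full_fit),
                direction = "forward", trace = 0)
print(formula(fwd_fit))
```

In the report's models the candidate predictors include totals and per-capita versions of the same quantities, which are closely tied to population. A very high R-squared from the selected model should be read with that in mind.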
#Making a Training and Testing dataset for China
Index <- sample(nrow(China_data), floor(0.25 * nrow(China_data)))
Train.China_data<- China_data[-Index, ]
Test.China_data <- China_data[Index, ]
co2_for_China.lm <- lm(population ~ co2_per_capita, data = Train.China_data)
summary(co2_for_China.lm)
##
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.China_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226197981 -121337073 -66480485 151780419 328476373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 576516138 21400593 26.94 <2e-16 ***
## co2_per_capita 145653080 8068823 18.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 167200000 on 90 degrees of freedom
## Multiple R-squared: 0.7836, Adjusted R-squared: 0.7812
## F-statistic: 325.9 on 1 and 90 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(co2_for_China.lm)
#forward step model for co2 and population
lm0 <- lm(population ~ 1, data = Train.China_data)
lmall <- lm(population ~ ., data = Train.China_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)
##
## Call:
## lm(formula = population ~ year + oil_co2_per_capita + gas_co2 +
## co2 + coal_co2 + flaring_co2_per_capita + methane_per_capita +
## ghg_per_capita + gdp + oil_co2 + methane + gas_co2_per_capita +
## flaring_co2, data = Train.China_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29053174 -9980064 2173502 9878843 36921053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.682e+09 2.758e+08 -20.602 < 2e-16 ***
## year 3.196e+06 1.432e+05 22.316 < 2e-16 ***
## oil_co2_per_capita 2.882e+09 3.496e+08 8.244 3.17e-12 ***
## gas_co2 6.173e+06 2.413e+06 2.558 0.012469 *
## co2 6.675e+05 1.690e+05 3.950 0.000170 ***
## coal_co2 -6.715e+05 1.801e+05 -3.729 0.000363 ***
## flaring_co2_per_capita -3.558e+10 1.844e+10 -1.929 0.057340 .
## methane_per_capita 3.166e+09 5.862e+08 5.401 6.94e-07 ***
## ghg_per_capita -1.072e+08 1.557e+07 -6.886 1.30e-09 ***
## gdp 4.847e-05 5.006e-06 9.681 5.18e-15 ***
## oil_co2 -2.393e+06 3.981e+05 -6.012 5.54e-08 ***
## methane -2.216e+06 5.210e+05 -4.254 5.79e-05 ***
## gas_co2_per_capita -1.088e+10 3.553e+09 -3.063 0.003004 **
## flaring_co2 3.679e+07 1.829e+07 2.012 0.047714 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15360000 on 78 degrees of freedom
## Multiple R-squared: 0.9984, Adjusted R-squared: 0.9982
## F-statistic: 3784 on 13 and 78 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(lmfwd)
#Making a Training and Testing dataset for USA
Index <- sample(nrow(USA_data), floor(0.25 * nrow(USA_data)))
Train.USA_data<- USA_data[-Index, ]
Test.USA_data <- USA_data[Index, ]
co2_for_USA.lm <- lm(population ~ co2_per_capita, data = Train.USA_data)
summary(co2_for_USA.lm)
##
## Call:
## lm(formula = population ~ co2_per_capita, data = Train.USA_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81791053 -25703298 2549724 9666070 163133553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4639341 5206956 0.891 0.374
## co2_per_capita 11464374 407373 28.142 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41410000 on 164 degrees of freedom
## Multiple R-squared: 0.8284, Adjusted R-squared: 0.8274
## F-statistic: 792 on 1 and 164 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(co2_for_USA.lm)
#forward step model for co2 and population
lm0 <- lm(population ~ 1, data = Train.USA_data)
lmall <- lm(population ~ ., data = Train.USA_data)
lmfwd <- step(lm0, scope = formula(lmall), direction = "forward", trace = 0)
summary(lmfwd)
##
## Call:
## lm(formula = population ~ co2 + year + flaring_co2 + flaring_co2_per_capita +
## methane_per_capita + gdp + oil_co2 + coal_co2_per_capita +
## ghg_per_capita + oil_co2_per_capita, data = Train.USA_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11925634 -3282922 -461093 2727157 13630415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.262e+09 4.914e+07 -25.675 < 2e-16 ***
## co2 4.391e+04 4.029e+03 10.899 < 2e-16 ***
## year 6.985e+05 2.681e+04 26.059 < 2e-16 ***
## flaring_co2 1.667e+06 1.774e+05 9.397 < 2e-16 ***
## flaring_co2_per_capita -2.601e+08 3.353e+07 -7.757 1.09e-12 ***
## methane_per_capita 3.583e+06 4.635e+06 0.773 0.44072
## gdp 2.428e-06 3.303e-07 7.350 1.07e-11 ***
## oil_co2 -5.502e+04 7.579e+03 -7.259 1.77e-11 ***
## coal_co2_per_capita -2.571e+06 4.729e+05 -5.437 2.07e-07 ***
## ghg_per_capita -1.919e+06 6.375e+05 -3.011 0.00304 **
## oil_co2_per_capita 1.962e+06 1.081e+06 1.815 0.07139 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4722000 on 155 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9978
## F-statistic: 7336 on 10 and 155 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) ## create 2x2 frames before drawing the diagnostics plots
plot(lmfwd)
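For the Australian data, we first check how well the models predict the held-out test set:
# Predictions from the model with all variables for AUS.
# Caution: lmall and lmfwd were last reassigned in the USA block above, so they
# hold the USA fits here unless they are refit on Train.Aus_data first.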
Aus_pred <- predict(lmall, newdata = Test.Aus_data)
# Predictions from the model selected by forward selection for Aus
Aus_fwdpred <- predict(lmfwd, newdata = Test.Aus_data)
Actual <- Test.Aus_data$population
RMSEP.all <- sqrt(sum((Actual - Aus_pred)^2)/length(Actual))
RMSEP.all
## [1] 785064146
RMSEP.fwd <- sqrt(sum((Actual - Aus_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 68448122
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, Aus_pred, Aus_fwdpred))
plot(Actual ~ Aus_pred, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ Aus_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Test.Aus_data$population)
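The RMSEP expression is typed out by hand for every model and dataset. A minimal sketch of a reusable helper follows. The function name `rmsep` is hypothetical and not part of the original script. It computes the same quantity, the square root of the mean squared prediction error, and stops if the two vectors differ in length instead of letting R recycle the shorter one silently.

```r
# Hypothetical helper: root mean squared error of prediction
rmsep <- function(actual, predicted) {
  # Guard against silent vector recycling when lengths differ
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Small check with made-up numbers: squared errors are 0, 0 and 4
example_rmsep <- rmsep(c(1, 2, 3), c(1, 2, 5))
print(example_rmsep)
```

With such a helper, a call like `rmsep(Test.Aus_data$population, Aus_fwdpred)` would replace the longer expression.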
We then check how well the same models fit the training data:
# Predictions from the model with all variables FOR AUS
Aus_pred1 <- predict(lmall, newdata = Train.Aus_data)
# Predictions from the model selected by forward selection for Aus
Aus_fwdpred <- predict(lmfwd, newdata = Train.Aus_data)
Actual <- Train.Aus_data$population
RMSEP.all <- sqrt(sum((Actual - Aus_pred1)^2)/length(Actual))
RMSEP.all
## [1] 906265878
RMSEP.fwd <- sqrt(sum((Actual - Aus_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 65150763
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, Aus_pred1, Aus_fwdpred))
plot(Actual ~ Aus_pred1, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ Aus_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Train.Aus_data$population)
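For the China dataset:
# Predictions from the model with all variables for China.
# Caution: lmall and lmfwd were last reassigned in the USA block, so they hold
# the USA fits here unless they are refit on Train.China_data first.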
China_pred <- predict(lmall, newdata = Test.China_data)
# Predictions from the model selected by forward selection for China
China_fwdpred <- predict(lmfwd, newdata = Test.China_data)
Actual <- Test.China_data$population
RMSEP.all <- sqrt(sum((Actual - China_pred)^2)/length(Actual))
RMSEP.all
## [1] 466377654
RMSEP.fwd <- sqrt(sum((Actual - China_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 689291246
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, China_pred, China_fwdpred))
plot(Actual ~ China_pred, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ China_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Test.China_data$population)
# Predictions from the model with all variables FOR China
China_pred1 <- predict(lmall, newdata = Train.China_data)
# Predictions from the model selected by forward selection for China
China_fwdpred <- predict(lmfwd, newdata = Train.China_data)
Actual <- Train.China_data$population
# Compare against China_pred1 (training-set predictions); China_pred holds the
# test-set predictions and has a different length from Actual.
RMSEP.all <- sqrt(sum((Actual - China_pred1)^2)/length(Actual))
RMSEP.all
RMSEP.fwd <- sqrt(sum((Actual - China_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 667331933
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, China_pred1, China_fwdpred))
plot(Actual ~ China_pred1, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ China_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Train.China_data$population)
For the USA dataset:
# Predictions from the model with all variables FOR USA
USA_pred <- predict(lmall, newdata = Test.USA_data)
# Predictions from the model selected by forward selection for USA
USA_fwdpred <- predict(lmfwd, newdata = Test.USA_data)
Actual <- Test.USA_data$population
RMSEP.all <- sqrt(sum((Actual - USA_pred)^2)/length(Actual))
RMSEP.all
## [1] 5528257
RMSEP.fwd <- sqrt(sum((Actual - USA_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 5208647
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, USA_pred, USA_fwdpred))
plot(Actual ~ USA_pred, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ USA_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Test.USA_data$population)
# Predictions from the model with all variables FOR USA
USA_pred1 <- predict(lmall, newdata = Train.USA_data)
# Predictions from the model selected by forward selection for USA
USA_fwdpred <- predict(lmfwd, newdata = Train.USA_data)
Actual <- Train.USA_data$population
RMSEP.all <- sqrt(sum((Actual - USA_pred1)^2)/length(Actual))
RMSEP.all
## [1] 4424327
RMSEP.fwd <- sqrt(sum((Actual - USA_fwdpred)^2)/length(Actual))
RMSEP.fwd
## [1] 4563012
par(mfrow = c(1, 2)) # side-by-side plots
par(pty = "s") # square plots
Range <- range(c(Actual, USA_pred1, USA_fwdpred))
plot(Actual ~ USA_pred1, xlab = "predictions from full model", ylab = "Actual population",
  xlim = Range, ylim = Range)
abline(0, 1)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.all, 2)),
  adj = 0)
plot(Actual ~ USA_fwdpred, xlab = "predictions from forward stepwise model", ylab = "Actual population",
  xlim = Range, ylim = Range)
text(Range[1] + 1, Range[2] - 2, labels = paste("RMSEP: ", round(RMSEP.fwd, 2)),
  adj = 0)
abline(0, 1)
plot(Actual, Train.USA_data$population)
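Because `lm0`, `lmall` and `lmfwd` are reassigned in each country block, only the most recent fit is available when the predictions are made. One way to avoid this, sketched below under stated assumptions, is to keep one model per country in a named list. The toy data, the country labels "A" and "B", and the simple `y ~ x` formula are all hypothetical; only the idea of storing each country's fit under its own name applies to the script above.

```r
set.seed(1)  # arbitrary seed for this sketch

# Toy data: two "countries" with different slopes (made-up values)
toy <- data.frame(
  country = rep(c("A", "B"), each = 30),
  x = rep(1:30, 2)
)
toy$y <- ifelse(toy$country == "A", 2, -3) * toy$x + rnorm(60, sd = 0.1)

# Fit one model per country; each fit stays available under its own name
models <- lapply(split(toy, toy$country), function(d) lm(y ~ x, data = d))

# Each country's slope is read from its own model, so nothing is overwritten
slopes <- sapply(models, function(m) coef(m)[["x"]])
print(round(slopes, 2))

# Predictions for country "A" use the model fitted on country "A" data
pred_A <- predict(models[["A"]], newdata = data.frame(x = 31))
```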