The dataset contains historical average air temperatures, by season, for different regions of Alaska. It covers thirteen Alaskan regions, with temperatures recorded from 1900 to 2015, four times a year, once per season. The dataset is composed of the following variables: Region, RegionID, Temperature, Year, Month and Season.
The following figure gives an overview of temperatures across the Alaskan territory:
Alaskan Air Temperature.
These data were derived from the “Historical Monthly Temperature - 1 km CRU TS” dataset, available at http://ckan.snap.uaf.edu/dataset/historical-monthly-temperature-1-km-cru-ts.
With the command install.packages("dataone"), we can install the package for accessing the datasets available on the DataONE platform. The dataone R package enables R scripts to search, download and upload science data and metadata from/to the DataONE Federation. The website describes DataONE as "a community driven project providing access to data across multiple member repositories, supporting enhanced search and discovery of Earth and environmental data". We then load this package together with the other libraries we need.
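If the packages are not already installed, they can be installed beforehand; the following one-time setup call is a minimal sketch covering all the packages loaded below:
# Optional one-time setup: install the packages used in this part (skip if already installed)
install.packages(c("dataone", "ggplot2", "tidyverse", "rmarkdown", "memisc"))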
library(dataone)
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(memisc)
In our analysis, we concentrate on the time series evolution of air temperatures for the Yukon region. Yukon is one of the largest regions in the dataset, and its territory is divided between Alaska and Canada.
Map of Yukon Region.
Our variables of interest are the Region, the Year, the Month, the Season and the air temperature. More details about the dataset and the variables can be found on the web page http://ckan.snap.uaf.edu/dataset/historical-monthly-temperature-1-km-cru-ts.
First of all, we use the following R script to search for the ID of our data package of interest. To make the search easier, we look for the keywords "Alaskan air temperatures" in the abstract. R then returns a data frame listing all DataONE datasets that contain the specified keywords in their abstracts.
d1c <- D1Client("PROD", "urn:node:KNB")
# Ask for the id, title and abstract
queryParams <- list(q="abstract:\"Alaskan air temperatures\"", fq="id:doi*",
fq="formatType:METADATA", fl="id,title")
result <- query(d1c@mn, solrQuery=queryParams, as="data.frame", parse=FALSE)
pid <- result[1,'id']
dataObj <- getDataObject(d1c, pid)
bytes <- getData(dataObj)
metadataXML <- rawToChar(bytes)
paged_table(result)
DataONE data packages usually contain more than one dataset; thus, using the right ID, we can download a zip archive that contains all the datasets of our data package of interest in CSV format. The following R commands display the path of the file to which R has downloaded the archive:
cn <- CNode()
mn <- getMNode(cn, "urn:node:KNB")
bagitFileName <- getPackage(mn, id="doi:10.5063/F1MK6B60")
bagitFileName
## [1] "C:\\Users\\loren\\AppData\\Local\\Temp\\Rtmp6XkNsj\\0972e0b8-0608-11ea-b4d2-d3b44de60042-5788733826e3.zip"
Then, using the UnZip() function provided by the memisc package, we can unzip the downloaded file.
UnZip(bagitFileName, "air_temp_seasonal_long.csv")
Finally, after setting the working directory to the unzipped folder, we can read the selected CSV file:
ATSLfile <- "air_temp_seasonal_long.csv"
ATSL <- read.csv(ATSLfile, sep=";", skip=2, dec=",", stringsAsFactors=FALSE)
We rename the variables using the following commands:
names(ATSL)[names(ATSL) == "X"] <- "Region"
names(ATSL)[names(ATSL) == "X.1"] <- "RegionID"
names(ATSL)[names(ATSL) == "X.2"] <- "Temperature"
names(ATSL)[names(ATSL) == "X.3"] <- "Year"
names(ATSL)[names(ATSL) == "X.4"] <- "Month"
names(ATSL)[names(ATSL) == "X.5"] <- "Season"
paged_table(ATSL)
The following R script computes the mean temperature for each region. As we can see from the table, Yukon, our region of interest, is the third coldest region in Alaska, with an average temperature of -4.54 degrees Celsius. The coldest region is the Arctic, while the warmest is Kodiak.
MeanRegion <- aggregate(x = ATSL$Temperature, by = list(ATSL$Region), FUN = mean)
names(MeanRegion)[names(MeanRegion) == "Group.1"] <- "Region"
names(MeanRegion)[names(MeanRegion) == "x"] <- "MeanTemperature"
paged_table(MeanRegion)
The following bar chart shows the average temperature of the Alaskan regions graphically:
ggplot(MeanRegion, aes(Region, MeanTemperature, fill = Region)) +
geom_bar(stat="identity") + theme_minimal() + xlab("Region") + ylab("Temperature") + ggtitle("Region Average Air Temperature") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
With the following command, we create a subset of the initial dataset; more precisely, we generate the subset for the time series analysis of the Yukon region, which is the one we are interested in:
Yukon <- ATSL[which(ATSL$Region == "Yukon"),names(ATSL) %in% c("Region","RegionID","Temperature","Year","Month","Season")]
paged_table(Yukon)
With the following command we can show general information about the structure of the Yukon data frame:
str(Yukon)
## 'data.frame': 461 obs. of 6 variables:
## $ Region : chr "Yukon" "Yukon" "Yukon" "Yukon" ...
## $ RegionID : int 13 13 13 13 13 13 13 13 13 13 ...
## $ Temperature: num -23.44 -4.99 11.14 -4.95 -18.56 ...
## $ Year : int 1900 1901 1901 1901 1901 1902 1902 1902 1902 1903 ...
## $ Month : int 12 3 6 9 12 3 6 9 12 3 ...
## $ Season : chr "winter" "spring" "summer" "fall" ...
Now we create a subset of the Yukon data frame, displaying only the season and the temperature for each observation:
YukonTempSeas <- Yukon[,c("Season", "Temperature")]
paged_table(YukonTempSeas)
Here we display another subset, containing only the observations with negative temperatures:
YukonFilter <- filter(Yukon, Temperature < 0)
paged_table(YukonFilter)
We can sort the observations in ascending order of temperature:
YukonOrderTemperature <- Yukon[order(Yukon$Temperature),]
paged_table(YukonOrderTemperature)
Alternatively, we can sort the observations in descending order of temperature:
YukonUnOrderTemperature <- Yukon[order(- Yukon$Temperature),]
paged_table(YukonUnOrderTemperature)
Here we compute the mean of the temperature for the region of Yukon:
MeanTemperature <- mean(Yukon$Temperature)
print(MeanTemperature)
## [1] -4.539925
We are also interested in the minimum and maximum temperatures reached in Yukon between 1900 and 2015:
MinTemperature <- min(Yukon$Temperature)
print(MinTemperature)
## [1] -25.95014
MaxTemperature <- max(Yukon$Temperature)
print(MaxTemperature)
## [1] 14.19767
Next, we compute the variance and the standard deviation of the temperature in Yukon:
VarTemperature <- var(Yukon$Temperature)
print(VarTemperature)
## [1] 130.3119
SdTemperature <- sd(Yukon$Temperature)
print(SdTemperature)
## [1] 11.41542
Finally, using the mean and the standard deviation of the temperatures, we can also compute a 95% confidence interval for the mean by hand:
n <- length(Yukon$Temperature)
error <- qnorm(0.975)* SdTemperature /sqrt(n)
Left <- MeanTemperature - error
Right <- MeanTemperature + error
print(Left)
## [1] -5.581977
print(Right)
## [1] -3.497872
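As a cross-check (a small sketch, not part of the original computation), a very similar 95% interval can be obtained with t.test(), which uses the t distribution instead of the normal quantile and is therefore slightly wider:
# Cross-check: t-based 95% confidence interval for the mean Yukon temperature
t.test(Yukon$Temperature)$conf.int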
We plot the full set of temperatures for Yukon, from 1900 to 2015:
ggplot(Yukon, aes(Year, Temperature)) + geom_point(col = "red") + xlab("Year") + ylab("Temperature") + ggtitle("Yukon Air Temperature")
Now we concentrate on the seasonal variation, plotting each season separately and analysing the related time series. As we can see, every season shows an increase in temperature from 1900 to 2015:
YukonW <- Yukon[which(Yukon$Season == "winter"), names(Yukon) %in% c("Year","Month","Season","Temperature")]
ggplot(YukonW, aes(Year, Temperature)) + stat_smooth(colour = "blue", span = 0.2) + xlab("Year") + ylab("Temperature") + ggtitle("Winter Yukon Air Temperature")
YukonSpr <- Yukon[which(Yukon$Season == "spring"), names(Yukon) %in% c("Year","Month","Season","Temperature")]
ggplot(YukonSpr, aes(Year, Temperature)) + stat_smooth(colour = "green", span = 0.2) + xlab("Year") + ylab("Temperature") + ggtitle("Spring Yukon Air Temperature")
YukonS <- Yukon[which(Yukon$Season == "summer"), names(Yukon) %in% c("Year","Month","Season","Temperature")]
ggplot(YukonS, aes(Year, Temperature)) + stat_smooth(colour = "red", span = 0.2) + xlab("Year") + ylab("Temperature") + ggtitle("Summer Yukon Air Temperature")
YukonA <- Yukon[which(Yukon$Season == "fall"), names(Yukon) %in% c("Year","Month","Season","Temperature")]
ggplot(YukonA, aes(Year, Temperature)) + stat_smooth(colour = "orange", span = 0.2) + xlab("Year") + ylab("Temperature") + ggtitle("Autumn Yukon Air Temperature")
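As an alternative to the four separate plots above (a minimal sketch using the same Yukon data frame), the seasonal smooths can also be drawn in a single faceted figure with facet_wrap():
# Alternative: all four seasonal smooths in one faceted plot
ggplot(Yukon, aes(Year, Temperature)) + stat_smooth(span = 0.2) + facet_wrap(~ Season) + xlab("Year") + ylab("Temperature") + ggtitle("Yukon Air Temperature by Season")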
The native format of the dataset is "long": each row is a single observation, which includes the region, the year, the month, the season and the temperature.
With the following procedure we can transform our long dataset into a wide (spread) version. Here we select a subset of the dataset, focusing on the observations between 1910 and 1920 (exclusive):
Yukon1920 <- Yukon[which(Yukon$Year < 1920 & Yukon$Year > 1910), names(Yukon) %in% c("Year","Season","Temperature")]
paged_table(Yukon1920)
Yukon1920Wide <- spread(Yukon1920, key = Season, value = Temperature)
paged_table(Yukon1920Wide)
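For reference, a minimal sketch of the same reshaping with the newer tidyr verbs pivot_wider() and pivot_longer(), which have replaced spread() and gather() in current tidyr releases (the object names Yukon1920Wide2 and Yukon1920Long are only illustrative):
# Equivalent reshaping with the newer tidyr verbs
Yukon1920Wide2 <- pivot_wider(Yukon1920, names_from = Season, values_from = Temperature)
Yukon1920Long <- pivot_longer(Yukon1920Wide2, cols = -Year, names_to = "Season", values_to = "Temperature")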
Our dataset describes the composition of health spending in a set of OECD countries. Health spending measures the final consumption of health care goods and services, including personal health care (curative care, rehabilitative care, long-term care, ancillary services and medical goods) and collective services (prevention and public health services as well as health administration), but excluding spending on investments. Health care is financed through a mix of financing arrangements, including compulsory schemes (government spending and compulsory health insurance), voluntary health insurance and out-of-pocket payments by households.
Our dataset contains data on the country, the year, the type of expenditure (compulsory, voluntary or out-of-pocket) and the corresponding value of health spending.
More information can be found on the OECD web site: https://data.oecd.org/healthres/health-spending.htm
With the command install.packages("OECD"), we can install the package for accessing the datasets available from the OECD. The OECD R package enables R scripts to search for and download data and metadata from the OECD (Organisation for Economic Co-operation and Development). We then load this package together with the other libraries we need.
library(OECD)
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(dplyr)
library(corrplot)
Our analysis is related to the differences in the composition of health spending among OECD countries. We will examine which countries rely more on compulsory rather than voluntary expenditure, and vice versa. The OECD countries included in our analysis are highlighted in the following map:
OECD Countries.
The variables of interest are the country, the expenditure type (voluntary, compulsory or out-of-pocket) and the value of health spending. More details can be found at https://www.oecd-ilibrary.org/social-issues-migration-health/health-at-a-glance_19991312
First of all, we search for the ID of our dataset of interest, using the following command; R gives us a list of all datasets related to the theme "Health":
search_dataset("Health", data = get_datasets(), ignore.case = TRUE)
## # A tibble: 11 x 2
## id title
## <fct> <fct>
## 1 HEALTH_STAT Health Status
## 2 HEALTH_REAC Health Care Resources
## 3 HEALTH_PROC Health Care Utilisation
## 4 HEALTH_HCQI Health Care Quality Indicators
## 5 HEALTH_LVNG Non-Medical Determinants of Health
## 6 SHA Health expenditure and financing
## 7 EBDAG Expenditure by disease, age and gender under the System of ~
## 8 HEALTH_WFMI Health Workforce Migration
## 9 SHA_FP Input costs for health care provision
## 10 SHA_FS Revenues of health care financing schemes
## 11 SHA_HK Gross fixed capital formation in the health care system
Using the right ID and adding the filters needed to obtain our data of interest, R downloads the data frame directly into its environment; then we rename the variables and recode the values of "ExpenditureType". The filter codes can be found on the OECD Statistics web site (https://stats.oecd.org/):
dataOrigin <- get_dataset("SHA", filter = "HF1+HF2+HF3.HCTOT.HPTOT.PPPPER.AUS+AUT+BEL+CAN+CHL+CZE+DNK+EST+FIN+FRA+DEU+GRC+HUN+ISL+IRL+ISR+ITA+KOR+LVA+LTU+LUX+MEX+NLD+NZL+NOR+POL+PRT+SVK+SVN+ESP+SWE+CHE+TUR+GBR+USA+NMEC+BRA+CHN+COL+CRI+IND+IDN+RUS+ZAF", start_time = 2018, end_time = 2018)
OECD_Health_Exp <- select(dataOrigin, -HC, -HP, -MEASURE, -TIME_FORMAT, -POWERCODE, -OBS_STATUS)
names(OECD_Health_Exp)[names(OECD_Health_Exp) == "HF"] <- "ExpenditureType"
names(OECD_Health_Exp)[names(OECD_Health_Exp) == "LOCATION"] <- "Country"
names(OECD_Health_Exp)[names(OECD_Health_Exp) == "obsTime"] <- "Year"
names(OECD_Health_Exp)[names(OECD_Health_Exp) == "obsValue"] <- "Value"
OECD_Health_Exp$ExpenditureType[OECD_Health_Exp$ExpenditureType == "HF2"] <- "Voluntary"
OECD_Health_Exp$ExpenditureType[OECD_Health_Exp$ExpenditureType == "HF1"] <- "Compulsory"
OECD_Health_Exp$ExpenditureType[OECD_Health_Exp$ExpenditureType == "HF3"] <- "Out-Of-Pocket"
paged_table(OECD_Health_Exp)
With the following command we can show general information about the structure of the OECD_Health_Exp data frame:
str(OECD_Health_Exp)
## 'data.frame': 68 obs. of 5 variables:
## $ ExpenditureType: chr "Compulsory" "Compulsory" "Voluntary" "Voluntary" ...
## $ Country : chr "GRC" "NOR" "NOR" "CAN" ...
## $ Year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ UNIT : chr "USD" "USD" "USD" "USD" ...
## $ Value : num 1348.8 5288.8 21.4 759.1 749.3 ...
Now we can create a subset, limiting our analysis to the country and the value of health expenditure:
Country_Value <- select(OECD_Health_Exp, Country, Value)
paged_table(Country_Value)
We display the entire dataframe without the year:
No_Year <- select(OECD_Health_Exp, -Year)
paged_table(No_Year)
We order the observations in descending order of health spending value:
Order_Value <- OECD_Health_Exp[order(OECD_Health_Exp$Value , decreasing = TRUE),]
paged_table(Order_Value)
Here we plot pie charts representing the percentage composition of health expenditure for a subset of countries, namely Canada, Italy and Germany:
Canada <- OECD_Health_Exp[which(OECD_Health_Exp$Country == "CAN"), names(OECD_Health_Exp) %in% c("ExpenditureType", "Value")]
paged_table(Canada)
CanadaPercentage <- Canada$Value / sum(Canada$Value)*100
expenditureType <- c("Voluntary", "Out-of-Pocket", "Compulsory")
CanadaPerc <- data.frame(ExpenditureType=expenditureType, Value=CanadaPercentage)
paged_table(CanadaPerc)
ggplot(CanadaPerc, aes(x = "", y = Value, fill = ExpenditureType)) + coord_polar("y", start=0) + geom_bar(width = 1, stat = "identity", color = "white") + theme_minimal() + theme_void() + ggtitle("Health Spending Composition in Canada")
Italy <- OECD_Health_Exp[which(OECD_Health_Exp$Country == "ITA"), names(OECD_Health_Exp) %in% c("ExpenditureType", "Value")]
paged_table(Italy)
ItalyPercentage <- Italy$Value / sum(Italy$Value)*100
ItaPerc <- data.frame(ExpenditureType=expenditureType, Value=ItalyPercentage)
paged_table(ItaPerc)
ggplot(ItaPerc, aes(x = "", y = Value, fill = ExpenditureType)) + coord_polar("y", start=0) + geom_bar(width = 1, stat = "identity", color = "white") + theme_minimal() + theme_void() + ggtitle("Health Spending Composition in Italy")
Germany <- OECD_Health_Exp[which(OECD_Health_Exp$Country == "DEU"), names(OECD_Health_Exp) %in% c("ExpenditureType", "Value")]
paged_table(Germany)
GermanyPercentage <- Germany$Value / sum(Germany$Value)*100
expenditureType2 <- c("Voluntary", "Compulsory","Out-of-Pocket" )
GermanyPerc <- data.frame(ExpenditureType=expenditureType2, Value=GermanyPercentage)
paged_table(GermanyPerc)
ggplot(GermanyPerc, aes(x = "", y = Value, fill = ExpenditureType)) + coord_polar("y", start=0) + geom_bar(width = 1, stat = "identity", color = "white") + theme_minimal() + theme_void() + ggtitle("Health Spending Composition in Germany")
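Since the hand-typed label vectors above must match the row order of each country's subset, a safer alternative (a minimal sketch using dplyr, which is already loaded; the OECD_Shares name is only illustrative) computes the percentage shares for every country at once while keeping the original ExpenditureType labels:
# Percentage composition of health spending by country
OECD_Shares <- OECD_Health_Exp %>% group_by(Country) %>% mutate(Share = Value / sum(Value) * 100) %>% ungroup()
paged_table(OECD_Shares)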
Finally, we create a grouped bar chart for the health spending composition of a subset of countries: Korea, Italy, Germany, Japan, Canada, the Netherlands, New Zealand and Sweden (note that Japan was not included in the download filter above, so it does not appear in the chart):
OECD_Subset <- OECD_Health_Exp[OECD_Health_Exp$Country %in% c("ITA", "KOR", "JPN", "CAN", "NLD", "NZL", "DEU", "SWE"), names(OECD_Health_Exp) %in% c("Country", "ExpenditureType", "Value")]
paged_table(OECD_Subset)
ggplot(OECD_Subset,aes(x=Country, y=Value, fill=ExpenditureType)) + geom_bar(stat="identity", position=position_dodge()) + ggtitle("Health Spending Composition in OECD Countries")
The native format of the dataset is "long": each row is a single observation, which includes the country, the year, the type of expenditure (compulsory, voluntary or out-of-pocket) and the corresponding value.
With the following procedure we can transform our long dataset into a wide (spread) version:
OECDHEWide <- spread(OECD_Health_Exp, key = ExpenditureType, value = Value)
paged_table(OECDHEWide)
To compare it with the original version, we also display the long one:
paged_table(OECD_Health_Exp)
We present the statistics at this stage because they are computed on the wide version of the dataset.
We create a version of the dataset without missing values, to avoid computation errors:
OECDWide_NoNA <- na.omit(OECDHEWide)
We compute the covariance between Voluntary and Compulsory expenditure:
Covariance <- cov(OECDWide_NoNA$Voluntary, OECDWide_NoNA$Compulsory)
print(Covariance)
## [1] 35680.47
Here we compute the Spearman correlations between the different types of health expenditure:
OECDCorr <- select(OECDWide_NoNA, - Year, - Country, -UNIT)
OECDWide_cor = cor(OECDCorr, method = c("spearman"))
print(OECDWide_cor)
## Compulsory Out-Of-Pocket Voluntary
## Compulsory 1.0000000 0.3176471 0.1970588
## Out-Of-Pocket 0.3176471 1.0000000 -0.2941176
## Voluntary 0.1970588 -0.2941176 1.0000000
corrplot(OECDWide_cor)
Finally, we calculate the mean compulsory health expenditure for our set of OECD countries:
OECDMean <- mean(OECDWide_NoNA$Compulsory)
print(OECDMean)
## [1] 2807.921
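For completeness, a minimal sketch extending the same computation to all three expenditure types at once, using the column names created by the spread above:
# Mean of each expenditure type across the countries with complete data
colMeans(OECDWide_NoNA[, c("Compulsory", "Voluntary", "Out-Of-Pocket")])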