Background on Studies of Pharmaceutical Spending Per Capita and Its Relationship to GDP.
Datasets
(Potential) Statistics to be Used:
Variables Included in Dataset:
Background Research
Overarching Questions about the Dataset
library(tidyverse) # Loads library.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/Owner/Desktop/Data Science Certificate/Statistics for Scientists/Final Project")
# Sets the working directory.
ui <- read_csv("flat-ui_data.csv") # Imports and renames dataset.
## Rows: 1036 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): LOCATION, FLAG_CODES
## dbl (5): TIME, PC_HEALTHXP, PC_GDP, USD_CAP, TOTAL_SPEND
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdp <- read_csv("gdp.csv")
## Rows: 11507 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country Name, Country Code
## dbl (2): Year, Value
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(ui) # Returns a few rows at the head of the dataset.
head(gdp)
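The join that follows refers to a table named `ui1` and to lower-case key columns (`country_code`, `year`), while the imported files use `LOCATION`, `TIME`, `Country Code`, and `Year`. The renaming step is not shown in this report. A minimal base-R sketch of how such a step could look is given here, on made-up toy tables; the `*_toy` names and all values are hypothetical and are not the author's code or data.

```r
# Toy stand-ins for the two imported tables (all values are made up).
ui_toy <- data.frame(
  LOCATION = c("AUS", "AUS", "CAN"),
  TIME     = c(2000, 2001, 2000),
  USD_CAP  = c(250.1, 262.4, 310.7)
)
gdp_toy <- data.frame(
  `Country Name` = c("Australia", "Australia", "Canada"),
  `Country Code` = c("AUS", "AUS", "CAN"),
  Year           = c(2000, 2001, 2000),
  Value          = c(4.2e11, 3.8e11, 7.4e11),
  check.names    = FALSE  # keep the spaces in the column names, as read_csv does
)

# Hypothetical renaming: give both tables the same lower-case key names.
ui1_toy <- ui_toy
names(ui1_toy)[names(ui1_toy) == "LOCATION"] <- "country_code"
names(ui1_toy)[names(ui1_toy) == "TIME"]     <- "year"
names(ui1_toy) <- tolower(names(ui1_toy))

gdp1_toy <- gdp_toy
names(gdp1_toy) <- gsub(" ", "_", tolower(names(gdp1_toy)))

# Base-R equivalent of a left join: keep every row of the left table.
joined_toy <- merge(ui1_toy, gdp1_toy,
                    by = c("country_code", "year"), all.x = TRUE)
names(joined_toy)
```

With key names aligned this way, `left_join()` can match rows on `country_code` and `year` without further arguments.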
joined <- left_join(ui1, gdp, by = c("country_code", "year")) # Left-joins "ui1" to "gdp" on the shared key columns "country_code" and "year".
head(joined)
All of the recorded annual GDP values in the “joined” dataset are summarized in the histogram below. The values show an extreme right skew: the majority of recorded values are concentrated at the lower end of the GDP range.
hist(joined$value)
The recorded annual per-capita pharmaceutical spending values, like those for annual GDP, exhibit a strong right skew. The wide difference in scale between the x-axes of the two histograms, however, makes it appear as if the data in the second histogram are more widely spread than those in the first. While the x-axis of the per-capita pharmaceutical spending histogram ranges from $0.00 to $1,200.00, the x-axis of the GDP histogram ranges from $0.00 to ~$2.5 trillion. If the pharmaceutical spending data were graphed on the same x-axis as the GDP data, they would appear much more compressed than they do in their own histogram.
hist(joined$usd_cap)
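The visual impression of right skew can also be checked numerically. The sketch below uses simulated log-normal values rather than the report's data; the helper `skewness()` is defined here for illustration and is not part of base R. It compares the moment-based sample skewness before and after a log10 transform, and checks that the mean exceeds the median, which is typical of right-skewed data.

```r
set.seed(1)
# Simulated right-skewed values (not the report's data).
x <- rlnorm(500, meanlog = 5, sdlog = 1)

# Moment-based sample skewness: third central moment over sd cubed.
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^(3 / 2)
}

skew_raw <- skewness(x)         # clearly positive for right-skewed data
skew_log <- skewness(log10(x))  # near zero once the long tail is compressed

mean_above_median <- mean(x) > median(x)

c(raw = skew_raw, log10 = skew_log)
```

On the real data, the same function could be applied to `joined$value` and `joined$usd_cap`, and `log10()` of either column could be passed to `hist()` to view the distribution with the long tail compressed.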
All of the recorded annual GDP values in the “joined” dataset are summarized in the box plot below. The box is colored light pink for contrast. The values are almost completely concentrated at the bottom edge of the y-axis, in this case mostly under half a billion dollars in GDP. The outliers, however, range as high as ~1.5 trillion dollars.
Although the data are plotted in a different format, the box plot repeats and confirms the distribution shown in the histogram of the GDP values.
boxplot(joined$value, col = "lightpink")
Like the histogram of per-capita pharmaceutical spending, the box plot of those same values appears to show a wider spread than the annual GDP box plot. The scales on the y-axes of the two box plots, however, are very different: while the y-axis of the per-capita pharmaceutical spending plot ranges from $0.00 to $1,200.00, the y-axis of the GDP box plot ranges from $0.00 to over $1.5 trillion. If the pharmaceutical spending data were graphed on the same y-axis as the GDP data, they would appear much more compressed than they do in their own box plot.
Although the data are plotted in a different format, the box plot repeats and confirms the distribution shown in the histogram of the per-capita pharmaceutical spending values.
boxplot(joined$usd_cap, col = "lightpink")
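Because the two variables sit on such different scales, one option is to draw both on a shared logarithmic axis and to compare a unit-free measure of spread. The sketch below is illustrative only: both vectors are simulated with arbitrary parameters, so which variable shows more relative spread here says nothing about the real data. `rel_spread()` is a helper defined for this example (interquartile range divided by the median).

```r
set.seed(2)
# Simulated stand-ins with arbitrary parameters (not the report's data).
gdp_sim   <- rlnorm(300, meanlog = 26,  sdlog = 1.5)  # GDP-like values, USD
spend_sim <- rlnorm(300, meanlog = 5.5, sdlog = 0.6)  # per-capita spending, USD

# Unit-free spread: interquartile range relative to the median.
rel_spread <- function(v) IQR(v, na.rm = TRUE) / median(v, na.rm = TRUE)

spread_gdp   <- rel_spread(gdp_sim)
spread_spend <- rel_spread(spend_sim)

# Both variables on one log-scaled y-axis, so neither is visually flattened.
boxplot(list(GDP = gdp_sim, `Spending per capita` = spend_sim),
        log = "y", col = "lightpink", ylab = "USD (log scale)")
```

A log axis requires strictly positive values, so any zeros or missing values in the real columns would need to be handled before plotting.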
The changes in per-capita pharmaceutical spending over the years for each country in the “joined” dataset are visualized in the combined line graph below. Though most of the lines rise slowly and gradually from the x-axis, an extreme outlier is visible in the upper right-hand corner of the graph. Looking at the dataset alongside this graph, it appears that this outlier may represent per-capita pharmaceutical spending in the United States.
# Columns are referenced by bare name inside aes(); using joined$... there is discouraged.
ggplot(data = joined, aes(x = year, y = usd_cap, group = country_code)) +
  geom_line() +
  labs(x = "Year", y = "Pharmaceutical Spending per Capita, USD", title = "Change in Pharmaceutical Spending Per Capita Over Time")
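Rather than reading the outlier off the graph, the country behind the highest line can be looked up directly. The following is a minimal sketch on a made-up table; `joined_toy` and the country codes `AAA`, `BBB`, and `CCC` are hypothetical placeholders, and only the column names mirror the “joined” dataset.

```r
# Toy table with the same column names as "joined" (values are made up).
joined_toy <- data.frame(
  country_code = rep(c("AAA", "BBB", "CCC"), each = 3),
  year         = rep(2015:2017, times = 3),
  usd_cap      = c(300, 320, 340, 410, 430, 450, 900, 1000, 1100)
)

# Highest recorded per-capita spending for each country.
max_by_country <- aggregate(usd_cap ~ country_code, data = joined_toy, FUN = max)

# Sort so the country with the largest maximum comes first.
max_by_country <- max_by_country[order(-max_by_country$usd_cap), ]
top_country <- as.character(max_by_country$country_code[1])

top_country
```

Running the same steps on `joined` would confirm or rule out the United States as the source of the outlying line.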
linear_reg <- lm(data = joined, usd_cap ~ value)
summary(linear_reg)
##
## Call:
## lm(formula = usd_cap ~ value, data = joined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -286.2 -165.3 -30.6 142.5 765.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.538e+02 6.089e+00 41.67 <2e-16 ***
## value 5.353e-11 2.933e-12 18.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 182 on 1034 degrees of freedom
## Multiple R-squared: 0.2437, Adjusted R-squared: 0.243
## F-statistic: 333.2 on 1 and 1034 DF, p-value: < 2.2e-16
The p-value for the GDP coefficient is effectively zero (< 2e-16), which means that a slope this large would be very unlikely to arise if there were no linear association between GDP and per-capita pharmaceutical spending. The adjusted R-squared of the model, 0.243, means however that only about 24.3% of the variation in per-capita spending is explained by it. These results are not contradictory: a relationship can be statistically significant and still account for little of the variance. Together they suggest that a more effective linear model of the relationship may exist after the addition of more variables.
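To make the R-squared figures concrete, the sketch below recomputes them by hand for a one-predictor model fitted to simulated data (the `*_sim` vectors are invented for illustration and are not the report's data). It shows that R-squared is one minus the ratio of residual to total sum of squares, that the adjusted value applies a degrees-of-freedom correction, and that with a single predictor R-squared equals the squared correlation.

```r
set.seed(3)
n <- 200
# Simulated predictor and response (not the report's data).
value_sim   <- rlnorm(n, meanlog = 26, sdlog = 1)
usd_cap_sim <- 250 + 5e-11 * value_sim + rnorm(n, sd = 150)

fit <- lm(usd_cap_sim ~ value_sim)

# R-squared = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((usd_cap_sim - mean(usd_cap_sim))^2)
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalizes for the number of predictors p.
p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

# With one predictor, R-squared is the squared Pearson correlation.
r2_cor <- cor(value_sim, usd_cap_sim)^2

c(manual = r2_manual, adjusted = adj_r2_manual, cor_squared = r2_cor)
```

Applying the same correction to the reported output, with 1036 observations and one predictor, `1 - (1 - 0.2437) * 1035 / 1034` gives approximately 0.243, matching the adjusted value in the summary.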
Parameters to Estimate:
Statistical Analysis
ggplot(joined, aes(x=value, y=usd_cap))+
geom_point(aes(color = year), alpha = 0.5)+ # alpha is a fixed setting, so it belongs outside aes()
scale_color_gradient(low = "blue",high = "orange")+
geom_smooth(method = "lm")+
theme_bw()+
labs(x="GDP, USD",
y="Pharmaceutical Spending Per Capita, USD",
title = "Scatterplot of Gross Domestic Product to Average Annual Pharmaceutical Spending Per Capita (USD)")
## `geom_smooth()` using formula = 'y ~ x'
As the initial regression summary suggested, the linear model with only one explanatory variable does not fit the data strongly. The color gradient applied to the data points to indicate the year, however, suggests that per-capita pharmaceutical spending generally increased over time.
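One way to follow up on this observation would be to add year as a second explanatory variable and compare adjusted R-squared values. The sketch below does this on simulated data that was deliberately constructed so that year has a strong effect; it illustrates the mechanics of the comparison only and is not evidence about the real dataset. All `*_sim` names and coefficients are assumptions made for the example.

```r
set.seed(4)
n <- 300
# Simulated data in which spending rises with year by construction.
year_sim    <- sample(1990:2019, n, replace = TRUE)
value_sim   <- rlnorm(n, meanlog = 26, sdlog = 1)
usd_cap_sim <- 100 + 5e-11 * value_sim + 10 * (year_sim - 1990) + rnorm(n, sd = 50)
sim <- data.frame(usd_cap = usd_cap_sim, value = value_sim, year = year_sim)

# Model with GDP only, then with GDP and year.
fit_one <- lm(usd_cap ~ value, data = sim)
fit_two <- lm(usd_cap ~ value + year, data = sim)

adj_one <- summary(fit_one)$adj.r.squared
adj_two <- summary(fit_two)$adj.r.squared

c(gdp_only = adj_one, gdp_and_year = adj_two)
```

On the real data the analogous call would be `lm(usd_cap ~ value + year, data = joined)`; whether the adjusted R-squared improves there can only be determined by fitting it.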