Our analysis comes from the blog Analytics for Fun, and can be found here. The post goes through some basic tables and charts analyzing breast cancer’s spread in 173 countries, based on new cases per 100,000 women, gathered from a 2002 World Health Organization study. The post concludes that there is significant variation in breast cancer among continents, with North America and Western Europe showing much higher median values than regions such as Africa and Asia, but stops short of making a direct claim of a positive correlation between a country’s economic development and its breast cancer rate.
First, we import our data into the workspace.
load(file="breastcancer.RData")
Now that it’s all ready to go, let’s check out the range of cases:
range(breastcancer$cases)
## [1] 3.9 101.1
For more detail on the general spread of the data, a histogram would be helpful:
hist(breastcancer$cases, breaks=10, main="Breast Cancer Global Distribution", xlab = "New Cases Per 100,000 Women", col = "blue", border = "red")
In table form:
summary(breastcancer$cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.9 20.6 30.0 37.4 50.3 101.1
Let’s try another form of graph - the box plot:
boxplot(breastcancer$cases, col="blue", horizontal=TRUE, border="red", main="Global Breast Cancer Distribution", xlab="New Cases Per 100,000 Women")
Compared to the histogram, we can confirm that the data is skewed right, although the box plot makes it easier to see the one outlier which wasn’t as obvious in the histogram. Just to make sure we’ve got all our bases covered, let’s see a density plot as well:
(Note: our blog post used freq() to make a frequency plot of the data, but we were not able to find any documentation for this function so we approximated it with a density plot to get more or less the same result.)
d <- density(breastcancer$cases)
plot(d, main="New Breast Cancer Cases per 100,000 Women")
polygon(d, col="blue", border="red")
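The single outlier seen in the box plot can be cross-checked by hand with the usual 1.5 x IQR rule. The sketch below plugs in the quartiles from the summary() output above; the variable names are our own, and boxplot() works from hinges rather than these exact quantiles, so the fence it draws may differ slightly.

```r
# Values copied from the summary() output above
q1 <- 20.6
q3 <- 50.3
median_cases <- 30.0
mean_cases <- 37.4
max_cases <- 101.1

# Tukey's rule: points more than 1.5 * IQR above the third quartile are flagged
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

upper_fence                # about 94.85
max_cases > upper_fence    # the maximum lies beyond the fence
mean_cases > median_cases  # mean above median is consistent with right skew
```

The next-highest value in the ordered table further down, 92.0, falls inside this fence, which agrees with the box plot showing only one outlier.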
Finally, a scatterplot of all countries together:
sample <- mosaic::sample(breastcancer, size =15)
plot(breastcancer$country, breastcancer$cases, xaxt="n", xlab="Countries",ylab="New Breast Cancer Cases Per 100,000 Women")
text(x=sample$country,y=sample$cases,labels = (sample$country), cex =.75)
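Because the 15 labelled countries are drawn at random, the plot changes on every run. A minimal sketch of a repeatable draw in base R follows, assuming we only want to pick rows by index; the small data frame `bc` is a stand-in built from rows of the ordered table printed later in this report, and the seed value is arbitrary.

```r
# A few rows copied from the ordered table later in this report
bc <- data.frame(
  country = c("Mozambique", "Haiti", "Gambia", "Mongolia", "Rwanda",
              "Iceland", "Israel", "France", "Belgium", "United States"),
  cases = c(3.9, 4.4, 6.4, 6.6, 8.8, 90.0, 90.8, 91.9, 92.0, 101.1),
  stringsAsFactors = FALSE
)

# Fixing the seed makes the random draw repeatable
set.seed(42)
labelled <- bc[sample(nrow(bc), size = 3), ]

# Re-running with the same seed returns the same rows
set.seed(42)
labelled_again <- bc[sample(nrow(bc), size = 3), ]
identical(labelled, labelled_again)
```

Sampling row indices with `sample(nrow(bc), ...)` is the base R counterpart of what mosaic::sample() does for a data frame.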
That’s enough for general plots. Let’s get a bit more specific and order the countries from least to most cases, then print out the bottom/top 10:
ordered<-breastcancer[order(breastcancer$cases),-4]
ordered[c(1:10,164:173),]
## country cases continent
## 110 Mozambique 3.9 AF
## 70 Haiti 4.4 NAM
## 60 Gambia 6.4 AF
## 108 Mongolia 6.6 AS
## 133 Rwanda 8.8 AF
## 37 Congo, Dem. Rep. 10.3 AF
## 100 Malawi 10.5 AF
## 90 Laos 10.9 AS
## 149 Swaziland 12.3 AF
## 172 Zambia 13.0 AF
## 114 Netherlands 86.7 WE
## 164 United Kingdom 87.2 WE
## 150 Sweden 87.8 WE
## 45 Denmark 88.7 WE
## 73 Iceland 90.0 WE
## 79 Israel 90.8 AS
## 57 France 91.9 WE
## 115 New Zealand 91.9 OC
## 15 Belgium 92.0 WE
## 165 United States 101.1 NAM
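The hard-coded index 164:173 only works while the data set has exactly 173 rows. As a sketch of an alternative that does not depend on the row count, head() and tail() can pull the extremes; the data frame `bc` below is a small stand-in using six rows from the table above, and `n` is our own name for the number of rows to keep at each end.

```r
# Six rows copied from the printed table above
bc <- data.frame(
  country = c("Mozambique", "Haiti", "Gambia", "Belgium", "United States", "France"),
  cases = c(3.9, 4.4, 6.4, 92.0, 101.1, 91.9),
  continent = c("AF", "NAM", "AF", "WE", "NAM", "WE"),
  stringsAsFactors = FALSE
)

# Sort ascending by cases, then keep the n lowest and n highest rows
ordered_bc <- bc[order(bc$cases), ]
n <- 2
extremes <- rbind(head(ordered_bc, n), tail(ordered_bc, n))
extremes
```

With the full data set, setting `n <- 10` should reproduce the bottom/top 10 listing.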
Now is a good time to shift from country analysis to continent, in order to see if there are any broader trends represented within each continent. The continent divisions are as follows:
AF - Africa
AS - Asia
EE - Eastern Europe
NAM - North America
OC - Oceania
SA - South America
WE - Western Europe
For some information on their respective distributions:
table(breastcancer$continent)
##
## AF AS EE NAM OC SA WE
## 0 51 45 20 18 9 12 18
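The first column of this table has no label and a count of 0. That is what table() prints when a factor carries a level that no row uses; our guess is that an empty string was left over as a level during data entry. A small sketch with a toy factor (not the real data) shows the effect and how droplevels() removes it.

```r
# Toy factor with an unused empty-string level, mimicking the output above
continent <- factor(c("AF", "AS", "WE", "AF"),
                    levels = c("", "AF", "AS", "WE"))

table(continent)              # includes a blank column with count 0
table(droplevels(continent))  # the unused level is gone
```

On the real data the equivalent step would be `breastcancer$continent <- droplevels(breastcancer$continent)`.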
Let’s try plotting breast cancer by continent to get a better idea of global spread by geographic area:
boxplot(breastcancer$cases ~ breastcancer$continent, main="Breast Cancer by Continent", xlab="Continents", ylab="New Cases Per 100,000 Women", col=breastcancer$continent)
We can see that there is indeed significant variation between the continents regarding new cases of breast cancer in women. Our boxplot looks different from the one in our blog post because the continents are classified differently: for example, our blog post used “Latin America” as a broad group including Central America and South America, rather than the typical North and South America division that we utilized. This significantly skewed the North America plot with its inclusion of the United States, the country with the highest number of cases by a large margin. Some uncertainty may also have come up in classifying certain countries which lie in both Europe and Asia, such as Kazakhstan, Azerbaijan, and Turkey, as well as in the somewhat arbitrary division of Eastern and Western Europe that is ultimately at the data collector’s discretion. Regardless, we are happy with how our plots and data set turned out and feel that they provide a good base for further testing.
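The variation between continents can also be summarised numerically with a grouped median. The sketch below uses tapply() on a stand-in data frame containing only the African and Western European rows from the bottom/top 10 listing, so the medians it returns describe those extreme rows and are not the true continent medians.

```r
# Only the AF and WE rows from the bottom/top 10 listing (a biased subset)
bc <- data.frame(
  cases = c(3.9, 6.4, 8.8, 10.3, 10.5, 12.3, 13.0,
            86.7, 87.2, 87.8, 88.7, 90.0, 91.9, 92.0),
  continent = rep(c("AF", "WE"), each = 7),
  stringsAsFactors = FALSE
)

# Median of cases within each continent code
medians <- tapply(bc$cases, bc$continent, median)
medians
```

Run on the full data set as `tapply(breastcancer$cases, breastcancer$continent, median)`, this would give the numbers behind the middle line of each box.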
Although our blog post stopped here, we have devised some new figures to gain further insight into the data set and analysis provided by the blog post. Instead of the vanilla R graphics used in the blog post, we will be utilizing the ggplot2 package for its superior customization options.
library("ggplot2")
First, we wanted to create more continent-specific charts: while our blog post did include the continent divisions as a variable, it hardly used them apart from one boxplot at the end. We thought it would be interesting and helpful to see a separate plot for each continent. This gives us a better sense of the individual values for each country, especially since the world dot plot created in the blog post could only name a few points before the country names overlapped too much. (Pay close attention to the varying y-axis scale of each graph: these figures are useful for comparing countries within the context of their continent, but not for overall analysis, since each continent has different average values and each plot contains fewer cases than the overall plot.)
AS <- subset(breastcancer, continent == "AS", select=c(country, cases, alcohol))
AF <- subset(breastcancer, continent == "AF", select=c(country, cases, alcohol))
EE <- subset(breastcancer, continent == "EE", select=c(country, cases, alcohol))
NAM <- subset(breastcancer, continent == "NAM", select=c(country, cases, alcohol))
OC <- subset(breastcancer, continent == "OC", select=c(country, cases, alcohol))
SA <- subset(breastcancer, continent == "SA", select=c(country, cases, alcohol))
WE <- subset(breastcancer, continent == "WE", select=c(country, cases, alcohol))
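The seven subset() calls above can also be written as a single split(), which returns a named list with one data frame per continent. This is only a sketch of that alternative: `bc` is a stand-in built from rows of the ordered table, and it leaves out the alcohol column because those values are not printed anywhere in this report.

```r
# Stand-in rows copied from the ordered table earlier in this report
bc <- data.frame(
  country = c("Mozambique", "Haiti", "Mongolia", "Israel",
              "France", "New Zealand", "United States"),
  cases = c(3.9, 4.4, 6.6, 90.8, 91.9, 91.9, 101.1),
  continent = c("AF", "NAM", "AS", "AS", "WE", "OC", "NAM"),
  stringsAsFactors = FALSE
)

# One data frame per continent code, keeping only the columns we plot
by_continent <- split(bc[, c("country", "cases")], bc$continent)

names(by_continent)  # the continent codes present in the data
by_continent$NAM     # same rows as subset(bc, continent == "NAM", ...)
```

Individual continents are then reached as `by_continent$AS`, `by_continent$AF`, and so on, instead of through seven separate objects.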
ggplot(AS, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 2.5, nudge_x = 1.1, color = "blue") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "dodgerblue3"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Asia",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(AF, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 2.5, nudge_x= .5, color = "darkslategray") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "tomato1"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Africa",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(NAM, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 2.5, nudge_x= -.08, color = "black") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "mediumspringgreen"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "North America",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(SA, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 4, color = "grey19") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "mediumorchid3"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "South America",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(OC, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 5, color = "gray15") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "lightpink2"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Oceania",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(EE, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 3.5, color = "firebrick") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "tan2"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Eastern Europe",x="Country", y="New Breast Cancer Cases per 100,000 Females")
ggplot(WE, aes(country,cases, label = country)) + geom_text(check_overlap = TRUE, size = 2.5, color = "grey3", nudge_x = -.25) + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), plot.background = element_rect(fill = "palegreen"), panel.background = element_rect(fill = "gainsboro"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Western Europe",x="Country", y="New Breast Cancer Cases per 100,000 Females")
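The seven calls above differ only in the data, title, colours, text size, and nudge. One way to cut the repetition is a small wrapper function; `continent_plot` below is a hypothetical helper of our own, not something from the blog post or from ggplot2 itself, and its argument names are assumptions chosen for readability.

```r
# Hypothetical helper (our own) wrapping the ggplot call repeated above
continent_plot <- function(data, title, fill, text_colour,
                           text_size = 2.5, nudge_x = 0) {
  ggplot2::ggplot(data, ggplot2::aes(country, cases, label = country)) +
    ggplot2::geom_text(check_overlap = TRUE, size = text_size,
                       nudge_x = nudge_x, colour = text_colour) +
    ggplot2::theme(
      axis.text.x = ggplot2::element_blank(),
      axis.ticks.x = ggplot2::element_blank(),
      plot.background = ggplot2::element_rect(fill = fill),
      panel.background = ggplot2::element_rect(fill = "gainsboro"),
      panel.grid.major = ggplot2::element_blank(),
      panel.grid.minor = ggplot2::element_blank()
    ) +
    ggplot2::labs(title = title, x = "Country",
                  y = "New Breast Cancer Cases per 100,000 Females")
}

# Example calls, assuming the subsets AS and AF created earlier:
# continent_plot(AS, "Asia", fill = "dodgerblue3", text_colour = "blue", nudge_x = 1.1)
# continent_plot(AF, "Africa", fill = "tomato1", text_colour = "darkslategray", nudge_x = 0.5)
```

Keeping the styling in one place means a change to the axis label or panel colour only has to be made once.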
Something we found out after doing some supplementary research is that increased alcohol consumption is believed to increase the risk of developing breast cancer. To help test this theory, we added data from a 2010 World Health Organization study on pure alcohol consumption in liters per capita for each country on the list. Unfortunately, the data is from a different year than our blog post’s (2002 vs. 2010), but this was the closest time period we could find for alcohol consumption data, and for our purposes it should work fine. First, let’s check out an overall graph of new breast cancer cases and alcohol consumption for every country:
ggplot(data = breastcancer,aes(x=alcohol, y=cases)) + geom_point() + geom_smooth(method=lm) + labs(title = "Breast Cancer vs. Alcohol Consumption: Global Distribution", x = "Pure Alcohol Consumption in Liters Per Capita", y = "New Cases of Breast Cancer per 100,000 Women") + theme(plot.background = element_rect(fill = "blueviolet"), panel.background = element_rect(fill = "azure3"))
Judging by this plot, it does seem that there is a positive correlation between higher alcohol consumption in a country and development of breast cancer in females. However, we should do some more analysis to confirm. In particular, we will focus on the two continents with the lowest and highest breast cancer rates, Africa and Western Europe respectively, in order to home in further on this relationship. Let’s do Western Europe first.
ggplot(WE, aes(alcohol,cases, label = country)) + geom_text(check_overlap = TRUE, size = 3, color = "grey3", nudge_x = -.25) + theme(plot.background = element_rect(fill = "palegreen"), panel.background = element_rect(fill = "lightcyan3"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Western Europe New Cases vs. Alcohol Consumption",x="Pure Alcohol Consumption in Liters Per Capita", y="New Breast Cancer Cases per 100,000 Females") + geom_smooth(method=lm, se=F, size=3)
Interestingly enough, the regression line seems to indicate that within Western Europe higher alcohol consumption goes with a lower, rather than higher, rate of breast cancer, even if only by a small amount. Does this contradict the previous plot? Let’s try this graph again, only without the outlier of Portugal, and see what effect that has.
NewWE <- subset(breastcancer, continent == "WE" & country != "Portugal", select=c(country, cases, alcohol))
ggplot(NewWE, aes(alcohol,cases, label = country)) + geom_text(check_overlap = TRUE, size = 3, color = "black", nudge_x = -.25) + theme(plot.background = element_rect(fill = "darkgoldenrod2"), panel.background = element_rect(fill = "lightcyan3"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Western Europe (Minus Portugal) New Cases vs. Alcohol Consumption",x="Pure Alcohol Consumption in Liters Per Capita", y="New Breast Cancer Cases per 100,000 Females") + geom_smooth(method=lm, se=F, size=3)
That looks more like we were expecting! Now for the plot of Africa:
ggplot(AF, aes(alcohol,cases, label = country)) + geom_text(check_overlap = TRUE, size = 3, color = "orangered4", nudge_x = -.25) + theme(plot.background = element_rect(fill = "navajowhite4"), panel.background = element_rect(fill = "lightcyan3"), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + labs(title = "Africa New Cases vs. Alcohol Consumption",x="Pure Alcohol Consumption in Liters Per Capita", y="New Breast Cancer Cases per 100,000 Females") + geom_smooth(method=lm, se=F, size=1.5)
This plot seems to confirm the original hypothesis as well. In lieu of graphing out all the rest of the continents in a similar manner, we will instead present the original global distribution of new breast cancer cases vs. alcohol consumption again, this time factoring in the individual continents.
ggplot(data = breastcancer,aes(x=alcohol, y=cases, color=factor(continent))) + geom_point() + geom_smooth(method=lm, se=F, size=1.5) + theme(plot.background = element_rect(fill = "darkorchid1"), panel.background = element_rect(fill = "honeydew3")) + labs(title = "Global Distribution of Breast Cancer vs. Alcohol Consumption (By Continent)",x="Pure Alcohol Consumption in Liters Per Capita", y="New Breast Cancer Cases per 100,000 Females")
While Eastern Europe seems to be the exception, in general the regression lines for each continent trend towards the positive (especially if we also include the outlier adjustment for Western Europe). After this additional analysis, we think it is safe to say that, based on this data, alcohol consumption has a positive correlation with new breast cancer cases in women across countries.
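The plots show the direction of the relationship, but a correlation coefficient and per-continent slopes would put numbers on it. The sketch below shows how that could be done with cor.test() and lm(). It runs on synthetic data that we invented purely for illustration (the data frame `toy` and every number in it are made up), so its output says nothing about the real data set.

```r
# Synthetic stand-in data: all values here are invented for illustration only
set.seed(1)
toy <- data.frame(
  continent = rep(c("AF", "WE"), each = 20),
  alcohol = c(runif(20, 0, 8), runif(20, 6, 14))
)
toy$cases <- ifelse(toy$continent == "AF", 10, 60) +
  2 * toy$alcohol + rnorm(40, sd = 5)

# Overall Pearson correlation, with a confidence interval and p-value
overall <- cor.test(toy$alcohol, toy$cases)
overall$estimate

# Slope of cases on alcohol, fitted separately within each continent
slopes <- sapply(split(toy, toy$continent), function(d) {
  coef(lm(cases ~ alcohol, data = d))[["alcohol"]]
})
slopes
```

Replacing `toy` with `breastcancer` would give the corresponding figures for our data; a negative entry in `slopes` would mark a continent such as Eastern Europe whose line runs against the overall trend.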
Expanding on the blog post’s original findings, we wanted to see if there was another factor that could be influencing global breast cancer rates besides higher economic development, the claim that was put forth by our blog post. After all, shouldn’t more economically developed countries have better means of preventing and treating disease, implying a decrease, rather than the increase shown by the data? When factoring in alcohol consumption, however, a factor that is known to have a link with increased breast cancer rates, we noticed a positive correlation that seemingly makes more sense. More economically developed countries will have more leisure money available, therefore having more money to spend on things like alcohol.
All in all, we had a fun and informative experience writing this lab. We became much more comfortable plotting our own graphs in R, especially using the incredibly powerful ggplot2 package, and gained experience in some of the basic steps of data analysis: import, visualize, and communicate. The exposure to RMarkdown was no doubt invaluable, as it is a great tool for turning an analysis in R into a deliverable product. We also gained some useful, if tedious, experience creating our own datasets to be used in R, having to personally enter the values of new breast cancer cases and per capita alcohol consumption for each country. If we had to do something differently next time, we would probably choose a different blog post, as on further inspection ours contained inconsistencies in its code that we could not exactly replicate, and showed a lack of polish in simple things like spelling. Despite that, we’re pleased with the direction our analysis went after we recreated the blog post, and we ended up making some interesting visualizations as a result.