Dengue fever is a mosquito-borne illness typically found in tropical and subtropical regions. The dengue fever data set contains humidity, temperature, and tree cover data for 2000 administrative regions, along with a flag (NoYes) recording whether each region had dengue fever cases between 1961 and 1990. For this project, I explore the geographic prevalence of dengue fever.
As a note, the interactive maps would not publish to HTML. To view them, run the code in RStudio and uncomment both of the mapview calls.
#Imports
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(mapview)
#Data
data <- read.table("C:/Users/Carlisle Ferguson/Downloads/dengue.csv", header=TRUE, sep=',')
summary(data)
##        X              humid            humid90            temp
##  Min.   :   1.0   Min.   : 0.6714   Min.   : 1.066   Min.   :-18.68
##  1st Qu.: 500.8   1st Qu.:10.0088   1st Qu.:10.307   1st Qu.: 11.10
##  Median :1000.5   Median :16.1433   Median :16.870   Median : 20.99
##  Mean   :1000.5   Mean   :16.7013   Mean   :17.244   Mean   : 18.41
##  3rd Qu.:1500.2   3rd Qu.:23.6184   3rd Qu.:24.131   3rd Qu.: 25.47
##  Max.   :2000.0   Max.   :30.2665   Max.   :30.539   Max.   : 29.45
##                   NA's   :2         NA's   :2        NA's   :2
##      temp90           h10pix          h10pix90          trees
##  Min.   :-10.07   Min.   : 4.317   Min.   : 5.848   Min.   : 0.0
##  1st Qu.: 12.76   1st Qu.:14.584   1st Qu.:14.918   1st Qu.: 1.0
##  Median : 22.03   Median :23.115   Median :24.130   Median :15.0
##  Mean   : 19.41   Mean   :21.199   Mean   :21.557   Mean   :22.7
##  3rd Qu.: 25.98   3rd Qu.:28.509   3rd Qu.:28.627   3rd Qu.:37.0
##  Max.   : 29.66   Max.   :31.134   Max.   :31.134   Max.   :85.0
##  NA's   :2                                          NA's   :12
##     trees90          NoYes             Xmin              Xmax
##  Min.   : 0.00   Min.   :0.0000   Min.   :-179.50   Min.   :-172.00
##  1st Qu.: 6.00   1st Qu.:0.0000   1st Qu.: -12.00   1st Qu.: -10.00
##  Median :30.60   Median :0.0000   Median :  16.00   Median :  17.75
##  Mean   :35.21   Mean   :0.4155   Mean   :  13.31   Mean   :  15.63
##  3rd Qu.:63.62   3rd Qu.:1.0000   3rd Qu.:  42.62   3rd Qu.:  44.50
##  Max.   :97.10   Max.   :1.0000   Max.   : 178.00   Max.   : 180.00
##  NA's   :12
##       Ymin             Ymax
##  Min.   :-54.50   Min.   :-55.50
##  1st Qu.:  6.00   1st Qu.:  5.00
##  Median : 18.00   Median : 17.00
##  Mean   : 19.78   Mean   : 18.16
##  3rd Qu.: 39.00   3rd Qu.: 37.00
##  Max.   : 82.50   Max.   : 68.50
##
#Plots
ggplot(data, aes(x=temp, y=humid, color=factor(NoYes))) + geom_point() + labs(color="NoYes") + ggtitle("Temperature vs Humidity")
## Warning: Removed 2 rows containing missing values (geom_point).
pltdata <- subset(data, select=c(temp,humid, NoYes))
pltdata$NoYes <- as.factor(pltdata$NoYes)
ggplot(pltdata,aes(x=NoYes, y=humid, fill=NoYes)) + geom_boxplot() + ggtitle("Humidity Box Plot")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
ggplot(pltdata,aes(x=NoYes, y=temp, fill=NoYes)) + geom_boxplot() + ggtitle("Temperature Box Plot")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
histdata <- subset(data, select=c(Ymax, Ymin, NoYes))
histdata$NoYes <- as.character(histdata$NoYes)
ggplot(histdata, aes(x=Ymax, fill=NoYes, color=NoYes)) + geom_histogram() + ggtitle("Maximum Latitude Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(histdata, aes(x=Ymin, fill=NoYes, color=NoYes)) + geom_histogram() + ggtitle("Minimum Latitude Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
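The box plots above compare the two groups visually; the same comparison can be put into numbers. The following is a minimal sketch in base R, run on a small made-up data frame (toy) that only mimics the NoYes, temp, and humid columns. It is not the dengue data, and the values are assumptions chosen for illustration.

```r
# Toy stand-in for the dengue data; these values are made up for illustration.
toy <- data.frame(
  NoYes = c(0, 0, 0, 1, 1, 1),
  temp  = c(5, 10, 15, 24, 26, NA),
  humid = c(4, 8, 12, 20, 24, 28)
)

# Median temperature and humidity per NoYes group.
# The formula interface of aggregate() uses na.omit by default, so a row
# with a missing value in either column is dropped before the medians
# are computed.
group_medians <- aggregate(cbind(temp, humid) ~ NoYes, data = toy, FUN = median)
group_medians
```

Swapping toy for the real data frame should give the medians that the box plots draw as the center lines, assuming the column names match.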
The scatter plot shows that the majority of the “Yes” cases occurred in hot, humid environments. The box plots support this: the median temperature and humidity are both higher in “Yes” regions. The histograms show that the minimum and maximum latitudes of the “Yes” regions cluster around the equator. But where exactly are these “Yes” regions? That is difficult to judge without a map. While the data set has longitude and latitude data, it lacks the names of cities and regions, and it provides only the minimum and maximum longitude and latitude of each region. To visualize the data set, I created new columns holding the midpoint (the mean of the minimum and maximum) longitude and latitude for each row. Then, I used the mapview library to plot all the administrative regions in the data set on a map.
#Mapping
df <- data.frame(data)
long_lat <- subset(data, select = c(Xmin, Xmax, Ymin, Ymax, NoYes))
long_lat$mean_long <- rowMeans(long_lat[,c('Xmin','Xmax')], na.rm=TRUE)
long_lat$mean_lat <- rowMeans(long_lat[,c('Ymin','Ymax')], na.rm=TRUE)
locations <- st_as_sf(long_lat, coords = c("mean_long", "mean_lat"), crs = 4326)
#mapview(locations, zcol="NoYes")
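Since the interactive maps did not publish to HTML, a static plot is one possible fallback. The sketch below uses only ggplot2 and a made-up data frame (toy_locations) that mimics the mean_long, mean_lat, and NoYes columns. The values are assumptions for illustration, and the plot draws points only, with no coastlines or borders.

```r
library(ggplot2)

# Toy stand-in for long_lat; the coordinates are made up for illustration.
toy_locations <- data.frame(
  mean_long = c(-60, 20, 100, 10),
  mean_lat  = c(-5, 0, 15, 55),
  NoYes     = c(1, 1, 1, 0)
)

# Plot the region midpoints on a fixed longitude/latitude grid,
# colored by dengue status (NoYes treated as a category).
static_map <- ggplot(toy_locations,
                     aes(x = mean_long, y = mean_lat, color = factor(NoYes))) +
  geom_point() +
  coord_fixed(xlim = c(-180, 180), ylim = c(-90, 90)) +
  labs(x = "Longitude", y = "Latitude", color = "NoYes",
       title = "Region midpoints by dengue status")

static_map
```

Because this is an ordinary ggplot object, it should render in knitted HTML like the other plots in this report.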
To examine the prevalence of dengue fever more closely, I filtered the data set and remapped only the regions that had dengue fever cases.
#Mapping - Dengue Only
dengue_yes <- subset(long_lat, NoYes == 1, select = c(Xmin, Xmax, Ymin, Ymax, mean_long, mean_lat, NoYes))
locs_yes <- st_as_sf(dengue_yes, coords = c("mean_long", "mean_lat"), crs = 4326)
#mapview(locs_yes)
Some final overall stats for the “Yes” data:
#Final Stats
yes <-subset(data, NoYes == 1, select = c(temp, humid, Ymin, Ymax))
summary(yes)
##       temp             humid              Ymin              Ymax
##  Min.   : 0.5083   Min.   : 0.8962   Min.   :-27.000   Min.   :-29.000
##  1st Qu.:23.0083   1st Qu.:20.2602   1st Qu.:  2.000   1st Qu.: -0.250
##  Median :25.4083   Median :24.0117   Median : 10.000   Median :  8.500
##  Mean   :24.3447   Mean   :23.1505   Mean   :  7.539   Mean   :  6.178
##  3rd Qu.:26.5917   3rd Qu.:26.7706   3rd Qu.: 15.000   3rd Qu.: 14.500
##  Max.   :29.3583   Max.   :29.7924   Max.   : 36.500   Max.   : 34.500
##  NA's   :2         NA's   :2
The plots and maps show that dengue fever is most commonly found in hot, humid environments. In the summary above, the latitude bounds of the “Yes” regions run from -29 to 36.5 degrees, so the affected regions sit in a band around the equator. This dengue fever belt extends around the globe, including North America, South America, Africa, Asia, and Australia/Oceania.
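The latitude band quoted above can also be computed directly rather than read off the summary. This is a minimal sketch on a made-up data frame (toy_yes) with Ymin and Ymax columns. The values are assumptions for illustration. It uses pmin() and pmax() so that the result does not depend on which of the two columns holds the southern bound.

```r
# Toy stand-in for the "Yes" subset; these latitudes are made up.
toy_yes <- data.frame(
  Ymin = c(12, -3, 30),
  Ymax = c(10, -6, 28)
)

# Southernmost and northernmost bounds across all rows,
# regardless of the order of the two columns within a row.
south <- min(pmin(toy_yes$Ymin, toy_yes$Ymax), na.rm = TRUE)
north <- max(pmax(toy_yes$Ymin, toy_yes$Ymax), na.rm = TRUE)

lat_band <- c(south = south, north = north)
lat_band
```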
#Bonus
library(readr)
urlfile = "https://raw.githubusercontent.com/carlisleferguson/RBridgeFinalProject/main/dengue.csv"
github_data <- read_csv(url(urlfile))
## Warning: Missing column names filled in: 'X1' [1]
##
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   humid = col_double(),
##   humid90 = col_double(),
##   temp = col_double(),
##   temp90 = col_double(),
##   h10pix = col_double(),
##   h10pix90 = col_double(),
##   trees = col_double(),
##   trees90 = col_double(),
##   NoYes = col_double(),
##   Xmin = col_double(),
##   Xmax = col_double(),
##   Ymin = col_double(),
##   Ymax = col_double()
## )
head(github_data)
## # A tibble: 6 x 14
##      X1 humid humid90  temp temp90 h10pix h10pix90 trees trees90 NoYes  Xmin
##   <dbl> <dbl>   <dbl> <dbl>  <dbl>  <dbl>    <dbl> <dbl>   <dbl> <dbl> <dbl>
## 1     1 0.671    4.42  2.04   8.47   17.4     17.8     0   1.5       0  70.5
## 2     2 7.65     8.17 12.3   14.9    11.0     11.7     0   1         0  62.5
## 3     3 6.98     9.56  6.93  14.6    17.5     17.6     0   1.2       0  68.5
## 4     4 1.11     1.83  4.64   6.05   17.4     17.5     0   0.6       0  67
## 5     5 9.03     9.74 18.2   19.7    13.8     13.8     0   0         0  61
## 6     6 8.91     9.52 11.9   16.6    11.7     11.7     0   0.200     0  64.5
## # ... with 3 more variables: Xmax <dbl>, Ymin <dbl>, Ymax <dbl>
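The "Missing column names" warning above appears because the first column of the CSV, a row index, has an empty header. The sketch below reproduces the situation in base R with a small inline CSV (csv_text). The two data rows are made up for illustration. It shows one possible way to handle the index column, which is to read it as row names.

```r
# Inline toy CSV whose first header field is empty, like the dengue file.
csv_text <- ",humid,temp\n1,0.67,2.04\n2,7.65,12.3\n"

# Default behavior: base R fills the empty header in as "X".
with_index <- read.csv(text = csv_text)
names(with_index)

# Alternative: treat the first column as row names, so it is not a variable.
clean <- read.csv(text = csv_text, row.names = 1)
names(clean)
```

Whether to keep or drop the index is a judgment call; for this analysis it carries no information beyond the row order.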