I selected the “Medical Expenses in Vietnam (household Level)” data set from http://vincentarelbundock.github.io/Rdatasets/ and uploaded the original .csv to my GitHub account.
#read the data from my GitHub
vnm <- read.csv("https://raw.githubusercontent.com/pkofy/Bridge_Winter2022/main/VietNamH.csv", stringsAsFactors = FALSE)
#transpose and display the first few records
t(vnm[1:5, ]) #I could use t(head(vnm)) but it takes the first six rows and the sixth wraps around in the .html
## 1 2 3 4 5
## X "1" "2" "3" "4" "5"
## sex "female" "female" "male" "female" "female"
## age "68" "57" "42" "72" "73"
## educyr " 4" " 8" "14" " 9" " 1"
## farm "no" "no" "no" "no" "no"
## urban "yes" "yes" "yes" "yes" "yes"
## hhsize "6" "6" "6" "6" "8"
## lntotal "10.13649" "10.25206" "10.93231" "10.26749" "10.48811"
## lnmed "11.233210" " 8.505120" " 8.713418" " 9.291736" " 7.555382"
## lnrlfood " 8.639339" " 9.345752" "10.226330" " 9.263722" " 9.592890"
## lnexp12m "11.233210" " 8.505120" " 8.713418" " 9.291736" " 7.555382"
## commune "1" "1" "1" "1" "1"
X is the record number
sex, age and educyr describe the head of household with educyr being the years of education obtained
farm and urban describe the household
hhsize is the number of people in the household
lntotal is the natural log of the household's total expenditures (used below as the miscellaneous category)
lnmed is the natural log of the total medical expenditures in the household
lnrlfood is the natural log of the total food expenditures in the household
lnexp12m is the natural log of the household's total health care expenditures over a 12-month period
commune is an organizational division for health and education in Vietnam. In 2008 there were over 9,000 communes. In this survey households from 150 different communes were surveyed.
summary(vnm)
## X sex age educyr
## Min. : 1 Length:5999 Min. :16.00 Min. : 0.000
## 1st Qu.:1500 Class :character 1st Qu.:37.00 1st Qu.: 4.000
## Median :3000 Mode :character Median :46.00 Median : 7.000
## Mean :3000 Mean :48.01 Mean : 7.094
## 3rd Qu.:4500 3rd Qu.:58.00 3rd Qu.:10.000
## Max. :5999 Max. :95.00 Max. :22.000
##
## farm urban hhsize lntotal
## Length:5999 Length:5999 Min. : 1.000 Min. : 6.543
## Class :character Class :character 1st Qu.: 4.000 1st Qu.: 8.920
## Mode :character Mode :character Median : 5.000 Median : 9.311
## Mean : 4.752 Mean : 9.342
## 3rd Qu.: 6.000 3rd Qu.: 9.759
## Max. :19.000 Max. :12.202
##
## lnmed lnrlfood lnexp12m commune
## Min. : 0.000 Min. : 6.356 Min. : 0.000 Min. : 1.00
## 1st Qu.: 4.174 1st Qu.: 8.376 1st Qu.: 5.273 1st Qu.: 51.00
## Median : 5.966 Median : 8.691 Median : 6.372 Median : 99.00
## Mean : 5.266 Mean : 8.680 Mean : 6.311 Mean : 98.27
## 3rd Qu.: 7.180 3rd Qu.: 9.002 3rd Qu.: 7.392 3rd Qu.:146.50
## Max. :12.363 Max. :11.384 Max. :12.363 Max. :194.00
## NA's :993
We can remove X, the record number, since that information is already contained in the order of the records.
vnm <- vnm[,c(2:12)]
We can change the values of urban and farm to be more descriptive, clarify the column names and create a new column combining the two.
#install.packages("dplyr",repos = "http://cran.us.r-project.org")
suppressPackageStartupMessages(library(dplyr)) #library() stops with an error if dplyr is missing; require() only warns
vnm$urban <- replace(vnm$urban, vnm$urban == "yes", "Urban")
vnm$urban <- replace(vnm$urban, vnm$urban == "no", "Rural")
vnm$farm <- replace(vnm$farm, vnm$farm == "yes", "Farm")
vnm$farm <- replace(vnm$farm, vnm$farm == "no", "Home")
vnm <- rename(vnm, isurban = urban, isfarm = farm) #new name on the left, old name on the right
#Not done here, but dplyr::mutate() can place a new column before or after another one with .before or .after
vnm$type <- paste(vnm$isurban, vnm$isfarm)
#show changes
t(vnm[1:3,])
## 1 2 3
## sex "female" "female" "male"
## age "68" "57" "42"
## educyr " 4" " 8" "14"
## isfarm "Home" "Home" "Home"
## isurban "Urban" "Urban" "Urban"
## hhsize "6" "6" "6"
## lntotal "10.13649" "10.25206" "10.93231"
## lnmed "11.233210" " 8.505120" " 8.713418"
## lnrlfood " 8.639339" " 9.345752" "10.226330"
## lnexp12m "11.233210" " 8.505120" " 8.713418"
## commune "1" "1" "1"
## type "Urban Home" "Urban Home" "Urban Home"
For each expenditure column we can add a new column that removes the log, inflates the value to 2022 VND (Vietnamese dong) and converts it to 2022 USD (US dollars).
#cumulative inflation multiplier from 1997 to 2022 in VND
VNDinflation <- 4.2
#exchange rate: VND per 1 USD
VNDtoUSD <- 22730
#add new columns: exp() removes the log, then inflate, convert and round to cents
vnm$misc <- round(exp(vnm$lntotal)*VNDinflation/VNDtoUSD, 2)
vnm$med <- round(exp(vnm$lnmed)*VNDinflation/VNDtoUSD, 2)
vnm$food <- round(exp(vnm$lnrlfood)*VNDinflation/VNDtoUSD, 2)
vnm$expmed <- round(exp(vnm$lnexp12m)*VNDinflation/VNDtoUSD, 2)
#show changes
t(vnm[1:3,])
## 1 2 3
## sex "female" "female" "male"
## age "68" "57" "42"
## educyr " 4" " 8" "14"
## isfarm "Home" "Home" "Home"
## isurban "Urban" "Urban" "Urban"
## hhsize "6" "6" "6"
## lntotal "10.13649" "10.25206" "10.93231"
## lnmed "11.233210" " 8.505120" " 8.713418"
## lnrlfood " 8.639339" " 9.345752" "10.226330"
## lnexp12m "11.233210" " 8.505120" " 8.713418"
## commune "1" "1" "1"
## type "Urban Home" "Urban Home" "Urban Home"
## misc " 4.67" " 5.24" "10.34"
## med "13.97" " 0.91" " 1.12"
## food "1.04" "2.12" "5.10"
## expmed "13.97" " 0.91" " 1.12"
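The back-transformation behind the new columns can be checked by hand on a single value. The following is a minimal sketch, assuming the same two constants as in the script; `to_usd` is a hypothetical helper name that does not appear in the original code.

```r
# Constants copied from the script above
VNDinflation <- 4.2   # cumulative VND inflation multiplier, 1997 to 2022
VNDtoUSD <- 22730     # VND per 1 USD

# to_usd() is a hypothetical helper: undo the natural log, inflate, convert to USD
to_usd <- function(ln_value) {
  round(exp(ln_value) * VNDinflation / VNDtoUSD, 2)
}

# lntotal and lnmed of the first household, as printed in the output above
to_usd(10.13649)   # 4.67, the misc value shown for household 1
to_usd(11.233210)  # 13.97, the med value shown for household 1
```

Wrapping the formula in one function keeps the four column assignments consistent if either constant changes.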
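Type of household by sex of the head of household. Women are more likely to be the head of household in an urban commune than in a rural commune: from the table below, about 40% of urban households (699 of 1,730) have a female head, versus about 22% of rural households (925 of 4,269).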
#room for improvement-> show percentages
table(vnm$sex, vnm$type)
##
## Rural Farm Rural Home Urban Farm Urban Home
## female 661 264 75 624
## male 2587 757 115 916
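The percentages mentioned in the code comment can be obtained with `prop.table()`. The sketch below rebuilds the table from the counts printed above so that it runs on its own; on the real data the equivalent call would be `prop.table(table(vnm$sex, vnm$type), margin = 2)`. The object names are illustrative and not part of the original script.

```r
# Counts copied from the table(vnm$sex, vnm$type) output above
counts <- matrix(
  c(661, 264, 75, 624,
    2587, 757, 115, 916),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    sex = c("female", "male"),
    type = c("Rural Farm", "Rural Home", "Urban Farm", "Urban Home")
  )
)

# Column percentages: share of female and male heads within each household type
pct_by_type <- round(100 * prop.table(counts, margin = 2), 1)
print(pct_by_type)

# Collapse to rural vs urban and compute the share of female heads
rural <- rowSums(counts[, c("Rural Farm", "Rural Home")])
urban <- rowSums(counts[, c("Urban Farm", "Urban Home")])
female_share <- c(rural = rural[["female"]] / sum(rural),
                  urban = urban[["female"]] / sum(urban))
round(100 * female_share, 1)
```

Using `margin = 2` normalizes within each household type, which is the comparison the text makes; `margin = 1` would instead show how female and male heads are spread across the types.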
The boxplot compares the miscellaneous, medical and food expenses on a log scale (the natural log of their values in 1997 VND). On this scale medical spending looks like a significant share of household expenses, but the log compresses large differences, so the pie chart that follows uses the back-transformed values instead.
boxplot(vnm$lntotal, vnm$lnmed, vnm$lnrlfood, names=c("Misc","Medical","Food"))
#mean of each back-transformed expense column; na.rm=TRUE guards against missing values
meanmisc <- mean(vnm$misc, na.rm=TRUE)
meanmedical <- mean(vnm$med, na.rm=TRUE)
meanfood <- mean(vnm$food, na.rm=TRUE)
#pie chart with percentages
slices <- c(meanmisc, meanmedical, meanfood)
lbls <- c("Misc", "Medical", "Food")
pct <- round(slices/sum(slices)*100)
lbls <- paste0(lbls, " ", pct, "%")
pie(slices, labels=lbls, main="Pie Chart of Expenses (natural log removed)")
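To see why the log scale makes medical spending look closer to the other categories than it is, it helps to compare one ratio on both scales. This is a small sketch that uses only the medians of lntotal and lnmed from the summary(vnm) output; the variable names are illustrative.

```r
# Medians of the log columns, copied from the summary(vnm) output above
median_lntotal <- 9.311
median_lnmed <- 5.966

# Ratio on the log scale: what the boxplot visually suggests
ratio_log <- median_lnmed / median_lntotal

# Ratio after removing the log: exp(a) / exp(b) equals exp(a - b)
ratio_raw <- exp(median_lnmed - median_lntotal)

round(c(log_scale = ratio_log, raw_scale = ratio_raw), 3)
```

On the log scale the medical median is roughly two thirds of the lntotal median, while after back-transformation it is only a few percent of it, which is the gap between the boxplot and the pie chart.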
As of the 1997 survey, a 42-year-old would have been born at the start of the war in 1955 and a 22-year-old at the end of the war in 1975. It looks like a chunk has been scooped out of this curve for heads of household aged 40 to 55. One explanation could be that the war caused the death, or the emigration as refugees, of a significant percentage of the people who were aged 18 to 33 when it ended.
hist(vnm$age)
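The age arithmetic behind this reading can be checked directly. This is a minimal sketch, assuming a survey year of 1997 and war dates of 1955 to 1975; the variable names are illustrative.

```r
survey_year <- 1997   # survey year used in the text
war_start <- 1955     # assumed start of the war
war_end <- 1975       # assumed end of the war

# Ages at the survey mentioned in the text, plus the edges of the apparent gap
ages_1997 <- c(22, 40, 42, 55)
birth_year <- survey_year - ages_1997
age_at_war_end <- war_end - birth_year

data.frame(ages_1997, birth_year, age_at_war_end)
```

Heads of household aged 40 to 55 at the survey were born between 1942 and 1957, so they were 18 to 33 years old when the war ended.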
We can see this gouge in the demographics in more detail using the ggplot2 package.
#install.packages("ggplot2")
library(ggplot2)
qplot(age, data=vnm, binwidth=1)
You can see the size of the household increasing when the head of household is between 20 and 40, holding steady between 40 and 60 and decreasing thereafter.
plot(vnm$age, vnm$hhsize)
Here is a scatterplot of the natural log of the expenditure of food versus the size of the household.
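#this chart does not differentiate between farm and non farm households: 'cyl' is not a column of vnm, so the quoted string is treated as a constant and every point gets the same colour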
qplot(lnrlfood, hhsize, data=vnm, colour='cyl')
Here we use ggplot() to reproduce the chart while coloring the points by farm status. Non-farm households seem to spend more on food per household member than farm households.
ggplot(vnm, aes(x=lnrlfood, y=hhsize)) + geom_point(aes(color=isfarm))
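The per-person comparison read off the scatterplot can also be computed. The sketch below is one possible approach, not part of the original analysis: `food_per_person` is a hypothetical helper, and the toy rows are made up purely so that the block runs on its own.

```r
# food_per_person() is a hypothetical helper; it expects the columns
# lnrlfood, hhsize and isfarm, as in vnm after the renaming step
food_per_person <- function(df) {
  df$food_pp <- exp(df$lnrlfood) / df$hhsize   # undo the log, divide by members
  aggregate(food_pp ~ isfarm, data = df, FUN = median)
}

# Made-up toy rows for illustration only; these are NOT survey values
toy <- data.frame(
  isfarm = c("Farm", "Farm", "Home", "Home"),
  hhsize = c(4, 6, 4, 6),
  lnrlfood = log(c(4000, 6000, 8000, 6000))
)

food_per_person(toy)
```

On the survey data the call would be `food_per_person(vnm)`. The median is used rather than the mean because back-transformed expenditures tend to be right-skewed.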
It looks like women are more likely to be the head of household in an urban commune than in a rural commune. This could be because of more conservative values in rural areas, increased opportunities for women in urban settings, or maybe country-wide infrastructure projects taking men away from the cities for work.
Medical expenses look to be a significant portion of total expenses; however, this is an artifact of comparing the natural logs of the expenses. Once the log is removed, the mean medical expenses are a much lower percentage of the total expenses.
It looks like the war decimated a generation who were of fighting age or young enough to seek a home in another country. From Wikipedia, an estimated 2 million were killed, 3 million were wounded and 12 million became refugees during the Vietnam War.
We show that households tend to increase in size while the head of household is between 20 and 40 years of age, stay the same size while the head of household is between 40 and 60, and decrease after the head of household is 60.
Households spend more on food the more household members they have. By switching from qplot() to ggplot() and coloring the points by isfarm, we were able to show that non-farm households seem to spend more per member on food than farm households.
Follow up questions:
- Are there differences in education between types of households?
- Do older heads of households have higher medical expenses?