Task assigned: Exploratory data analysis; specifically, using R graphics to describe the data set and look for potential relationships in the data. With respect to the Ecdat OFP data set, I’ll be looking at age, family income and number of physician office visits.
install.packages("Ecdat", repos='https://mirrors.nics.utk.edu/cran/')
library(Ecdat)
Visually inspect data:
head(OFP)
## ofp ofnp opp opnp emr hosp numchron adldiff age black sex maried
## 1 5 0 0 0 0 1 2 0 6.9 yes male yes
## 2 1 0 2 0 2 0 2 0 7.4 no female yes
## 3 13 0 0 0 3 3 4 1 6.6 yes female no
## 4 16 0 5 0 1 1 2 1 7.6 no male yes
## 5 3 0 0 0 0 0 2 1 7.9 no female yes
## 6 17 0 0 0 0 0 5 1 6.6 no female no
## school faminc employed privins medicaid region hlth
## 1 6 2.8810 yes yes no other other
## 2 10 2.7478 no yes no other other
## 3 10 0.6532 no no yes other poor
## 4 3 0.6588 no yes no other poor
## 5 6 0.6588 no yes no other other
## 6 7 0.3301 no no yes other poor
The age column is in mutiples of 10 and the family income column is in multiples of 10,000. I subsetted the OFP data frame into a new data frame called ‘grphdata’ containing the three columns which interest me, renamed the column headers, and then multiplied the age by 10 and the family income by 10,000.
# Subset data into new data frame
c <- c("ofp","age","faminc")
grphdata <- OFP[c]
# Rename columns
n <- c("No_Office_Visits","Age","Family_Income")
colnames(grphdata) <- n
# Multiply Age by 10 and Family_Income by 10,000
grphdata$Age <- grphdata$Age * 10
grphdata$Family_Income <- grphdata$Family_Income * 10000
# Visually inspect data
head(grphdata)
## No_Office_Visits Age Family_Income
## 1 5 69 28810
## 2 1 74 27478
## 3 13 66 6532
## 4 16 76 6588
## 5 3 79 6588
## 6 17 66 3301
Much of the code in this section was taken from, or inspired by, Robert I. Kabacoff, Ph.D.’s web site “Quick-R” (http://www.statmethods.net/advgraphs/parameters.html) and DataCamp’s R-Blogger post “How to Make a Histogram with Basic R”: https://www.r-bloggers.com/how-to-make-a-histogram-with-basic-r/
A Histogram showing the distribution of ages for the OFP data set:
hist(grphdata$Age, breaks=10, col="red", border = "blue", xlab="Age",main="Age Distribution for OFP Data Set")
A Histogram showing the distribution of family incomes for the OFP data set:
hist(grphdata$Family_Income/1000, breaks=50, col="green", border = "purple", xlab="Family Income (in Thousands)",main="Family Income Distribution for OFP Data Set", xlim = c(1,175))
Based on these histograms, we can conclude that the data set consists of an elderly population, the overwhelming majority of which are living on low fixed incomes. Over half the observations have family incomes under $25,000, and nearly 90% of the observatins have family incomes under $50,000.
First, I created an additional column in the data frame showing age ranges with the ultimate objective of creating box plots by age range.
# Function to group ages into age ranges
r <- function(x) {
if (x < 70) {"69 or Less"}
else if (x >= 70 & x < 80) {"70 - 79"}
else if (x >= 80 & x < 90) {"80 - 89"}
else if (x >= 90 & x < 100) {"90 - 99"}
else {"Over 99"}
}
# Add and populate Age_Range column in data frame based on Age
grphdata$Age_Range <- mapply(r,grphdata$Age)
# Visually inspect data
head(grphdata)
## No_Office_Visits Age Family_Income Age_Range
## 1 5 69 28810 69 or Less
## 2 1 74 27478 70 - 79
## 3 13 66 6532 69 or Less
## 4 16 76 6588 70 - 79
## 5 3 79 6588 70 - 79
## 6 17 66 3301 69 or Less
Next, I created boxplots of number of visits by Age_Range:
boxplot(No_Office_Visits~Age_Range,data=grphdata, main="Physician Office Visits by Age Range", xlab="Age", ylab="Office Visits", ylim = c(0,40))
A few observations regarding the boxplots:
Next, a scatterplot was created showing number of visits against family income ranges. Family income ranges were put into a new column called “Fam_Inc_Ranges”:
grphdata$Fam_Inc_Range <- round(grphdata$Family_Income, digits = -3)
Then, a new data frame showing average number of visits per Fam_Inc_Range was created using the sqldf package.
Citiation information for sqldf: G. Grothendieck (2014). sqldf: Perform SQL Selects on R Data Frames. R package version 0.4-10. https://CRAN.R-project.org/package=sqldf
# Load sqldf
install.packages("sqldf", repos='https://mirrors.nics.utk.edu/cran/')
library(sqldf)
# Create new data set using income ranges and average visits
avinc <- sqldf('select grphdata.Fam_Inc_Range, avg(grphdata.No_Office_Visits) as AvgVisits from grphdata group by grphdata.Fam_Inc_Range')
## Loading required package: tcltk
Lastly, a scatterplot was created using the new data frame showing Family Income Range against Average # of Office Visits.
plot(avinc$Fam_Inc_Range/1000, avinc$AvgVisits, xlab = "Family Income Range (in Thousands)", ylab = "Ave. # of Office Visits", xlim = c(0,250), ylim = c(0,25))
It does not appear that Family Income is related to the average number of physician office visits. Perhaps one could hypothesize that we see greater variability in office visits for family income ranges over $50K because those individuals can choose how often to go, while we see a tighter range of average number of visits for family incomes under $50K because number of visits might be constrained by medicare/medicaid coverage limits, but this is pure speculation and much more research would be needed before drawing any conclusions.