This is a final project to show off what you have learned. Select
your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on
the csv index for a list).
Another good source is https://archive.ics.uci.edu/ml/datasets.html
The presentation approach is up to you, but it should contain the
following:
data_in_doctorvisits <- read.csv('https://raw.githubusercontent.com/WeFixer/r_bridge_wk3/main/DoctorVisits.csv',header=TRUE)
summary(data_in_doctorvisits)
## X visits gender age
## Min. : 1 Min. :0.0000 Length:5190 Min. :0.1900
## 1st Qu.:1298 1st Qu.:0.0000 Class :character 1st Qu.:0.2200
## Median :2596 Median :0.0000 Mode :character Median :0.3200
## Mean :2596 Mean :0.3017 Mean :0.4064
## 3rd Qu.:3893 3rd Qu.:0.0000 3rd Qu.:0.6200
## Max. :5190 Max. :9.0000 Max. :0.7200
## income illness reduced health
## Min. :0.0000 Min. :0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.:0.2500 1st Qu.:0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median :0.5500 Median :1.000 Median : 0.0000 Median : 0.000
## Mean :0.5832 Mean :1.432 Mean : 0.8619 Mean : 1.218
## 3rd Qu.:0.9000 3rd Qu.:2.000 3rd Qu.: 0.0000 3rd Qu.: 2.000
## Max. :1.5000 Max. :5.000 Max. :14.0000 Max. :12.000
## private freepoor freerepat nchronic
## Length:5190 Length:5190 Length:5190 Length:5190
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## lchronic
## Length:5190
## Class :character
## Mode :character
##
##
##
In conclusion, this sample data set contains 5,190 observations. Categorical variables such as private, freepoor, freerepat, nchronic, and lchronic can serve as slicers for exploring how age and income relate to free insurance granted on the basis of income or age, and for comparing the frequency of visits between patients with free insurance and those with private insurance. According to the summary, within the 14-day time frame at least one patient visited a doctor 9 times, and at least one patient could not carry out normal activities on any of the 14 days. The oldest patient is 72 years old, and the highest income is $15,000.
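One way to start on the insurance comparison described above is to average the visit counts within each insurance group. The sketch below is illustrative only: `toy` is a small made-up data frame that mimics the `visits`, `private`, and `freepoor` columns, so its numbers say nothing about the real data. With the full data set, `data_in_doctorvisits` would presumably take its place.

```r
# Hypothetical toy data mimicking three columns of the DoctorVisits data.
# These values are made up for illustration and are not real observations.
toy <- data.frame(
  visits   = c(0, 1, 0, 2, 0, 3),
  private  = c("yes", "yes", "no", "no", "no", "yes"),
  freepoor = c("no", "no", "yes", "no", "yes", "no")
)

# Mean number of visits for each level of private insurance
visits_by_private <- aggregate(visits ~ private, data = toy, FUN = mean)
print(visits_by_private)

# Number of observations in each private/freepoor combination
print(table(private = toy$private, freepoor = toy$freepoor))
```

The same `aggregate()` call can be repeated with `freepoor` or `freerepat` on the right-hand side of the formula to compare the other insurance groups.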
d <- data_in_doctorvisits
# Recode gender to single-letter labels
d$gender <- with(d, replace(gender, gender == "male", "M"))
d$gender <- with(d, replace(gender, gender == "female", "F"))
# Keep only the columns used in this analysis
sub_d <- subset(d, select = c("X", "visits", "gender", "age", "income", "illness", "reduced", "private"))
# Rename columns for readability
names(sub_d)[names(sub_d) == "illness"] <- "sick_frq"
names(sub_d)[names(sub_d) == "reduced"] <- "in_act"
# Rescale age to years and income to dollars
sub_d["age"] <- sub_d["age"] * 100
sub_d["income"] <- sub_d["income"] * 10000
head(sub_d, 10)
## X visits gender age income sick_frq in_act private
## 1 1 1 F 19 5500 1 4 yes
## 2 2 1 F 19 4500 1 2 yes
## 3 3 1 M 19 9000 3 0 no
## 4 4 1 M 19 1500 1 0 no
## 5 5 1 M 19 4500 2 5 no
## 6 6 1 F 19 3500 5 1 no
## 7 7 1 F 19 5500 4 0 no
## 8 8 1 F 19 1500 3 0 no
## 9 9 1 F 19 6500 2 0 yes
## 10 10 1 M 19 1500 1 0 yes
summary(sub_d)
## X visits gender age
## Min. : 1 Min. :0.0000 Length:5190 Min. :19.00
## 1st Qu.:1298 1st Qu.:0.0000 Class :character 1st Qu.:22.00
## Median :2596 Median :0.0000 Mode :character Median :32.00
## Mean :2596 Mean :0.3017 Mean :40.64
## 3rd Qu.:3893 3rd Qu.:0.0000 3rd Qu.:62.00
## Max. :5190 Max. :9.0000 Max. :72.00
## income sick_frq in_act private
## Min. : 0 Min. :0.000 Min. : 0.0000 Length:5190
## 1st Qu.: 2500 1st Qu.:0.000 1st Qu.: 0.0000 Class :character
## Median : 5500 Median :1.000 Median : 0.0000 Mode :character
## Mean : 5832 Mean :1.432 Mean : 0.8619
## 3rd Qu.: 9000 3rd Qu.:2.000 3rd Qu.: 0.0000
## Max. :15000 Max. :5.000 Max. :14.0000
In this example I took a subset of the data as a new data set and renamed some columns to make them easier to understand. I also converted age back to years by multiplying by 100 (for example, 0.19 becomes 19) and converted income back to dollars by multiplying by 10,000.
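As a possible alternative, the rescaling and renaming can be written more compactly with `transform()` and a lookup vector of names. This is only a sketch on a hypothetical two-row data frame `toy`; the `round()` calls are an added assumption to guard against floating-point residue and are not part of the original steps.

```r
# Hypothetical toy rows in the original units (age / 100, income / 10000).
# The values are made up for illustration.
toy <- data.frame(
  age     = c(0.19, 0.72),
  income  = c(0.55, 1.50),
  illness = c(1, 3),
  reduced = c(4, 0)
)

# Rescale both columns in one call; round() removes any
# floating-point residue left by the multiplication.
toy <- transform(toy,
                 age    = round(age * 100),
                 income = round(income * 10000))

# Rename columns via a lookup vector (old name -> new name)
new_names <- c(illness = "sick_frq", reduced = "in_act")
hit <- names(toy) %in% names(new_names)
names(toy)[hit] <- unname(new_names[names(toy)[hit]])

print(toy)
```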
library(ggplot2)
#hist(sub_d$age, main = "Patient Male Histogram", xlab = "Age", ylab = "number of patient")
# Histogram of age with a dashed line at the mean age
ggplot(sub_d, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Patient's Age", y = "Number of Patients", title = "Histogram") +
  geom_vline(aes(xintercept = mean(age)), color = "blue", linetype = "dashed", linewidth = 2)
boxplot(sub_d$income)
# Scatter plots with fitted regression lines
plot(income ~ visits, data = sub_d, main = "Income and Visits")
abline(lm(income ~ visits, data = sub_d))
plot(age ~ visits, data = sub_d, main = "Age and Visits")
abline(lm(age ~ visits, data = sub_d))
Question: What is the correlation between age, income, and
visits?
The plots show a positive correlation between age and visits: the
number of visits increases as age increases. They also show a negative
correlation between income and visits: lower income goes with more visits.
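The visual impression from the plots can be backed up with numbers using `cor()` and `cor.test()` from base R. The sketch below runs on a hypothetical data frame `toy` whose values were made up so that the signs match the description above; it demonstrates the calls only and says nothing about the strength of the relationships in the real data. With the real data, `sub_d` would presumably replace `toy`.

```r
# Hypothetical toy data; made-up values, not taken from the real data set.
toy <- data.frame(
  age    = c(19, 22, 32, 47, 62, 72),
  income = c(9000, 6500, 5500, 4500, 2500, 1500),
  visits = c(0, 0, 1, 1, 2, 3)
)

# Pairwise Pearson correlation matrix for the three variables
cor_matrix <- cor(toy[, c("age", "income", "visits")])
print(round(cor_matrix, 2))

# Test for a single pair, reporting a p-value and confidence interval
print(cor.test(toy$age, toy$visits))
```

Because `visits` is a count with many zeros, passing `method = "spearman"` to `cor()` may be a reasonable alternative to the default Pearson coefficient.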
The data file has been uploaded to GitHub and is read directly from its link.