R Bridge Course Final Project

This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list).
Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html
The presentation approach is up to you, but it should contain the following:

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

data_in_doctorvisits <- read.csv('https://raw.githubusercontent.com/WeFixer/r_bridge_wk3/main/DoctorVisits.csv',header=TRUE)
summary(data_in_doctorvisits)
##        X            visits          gender               age        
##  Min.   :   1   Min.   :0.0000   Length:5190        Min.   :0.1900  
##  1st Qu.:1298   1st Qu.:0.0000   Class :character   1st Qu.:0.2200  
##  Median :2596   Median :0.0000   Mode  :character   Median :0.3200  
##  Mean   :2596   Mean   :0.3017                      Mean   :0.4064  
##  3rd Qu.:3893   3rd Qu.:0.0000                      3rd Qu.:0.6200  
##  Max.   :5190   Max.   :9.0000                      Max.   :0.7200  
##      income          illness         reduced            health      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.:0.2500   1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median :0.5500   Median :1.000   Median : 0.0000   Median : 0.000  
##  Mean   :0.5832   Mean   :1.432   Mean   : 0.8619   Mean   : 1.218  
##  3rd Qu.:0.9000   3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000  
##  Max.   :1.5000   Max.   :5.000   Max.   :14.0000   Max.   :12.000  
##    private            freepoor          freerepat           nchronic        
##  Length:5190        Length:5190        Length:5190        Length:5190       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    lchronic        
##  Length:5190       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

In conclusion, this simple data set contains 5,190 observations. The categorical variables (private, freepoor, freerepat, nchronic, lchronic) can serve as slicers to explore how age and income relate to free insurance granted on the basis of income or age, and to compare the frequency of visits between patients with free insurance and patients with private insurance. According to the summary, within the 14-day time frame the maximum number of visits was 9, and at least one patient had reduced activity on all 14 days. The oldest patient is 72 years old, and the highest income is $15,000.
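The character columns appear in `summary()` only as `Length`/`Class`/`Mode`, which hides how the groups are distributed. A minimal sketch of one way to see the counts follows. It assumes a small made-up data frame, `toy`, standing in for `data_in_doctorvisits`; the values are illustrative and are not taken from the data set.

```r
# Toy stand-in for data_in_doctorvisits; the values are made up for illustration.
toy <- data.frame(
  visits   = c(0, 1, 0, 2, 0, 1),
  private  = c("yes", "no", "no", "yes", "no", "yes"),
  freepoor = c("no", "no", "yes", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor so summary() reports counts per level.
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], factor)
summary(toy)

# Count observations per level, then compare mean visits across insurance groups.
private_counts <- table(toy$private)
mean_visits_by_private <- tapply(toy$visits, toy$private, mean)
print(private_counts)
print(mean_visits_by_private)
```

Applied to the real data, the same pattern would presumably make the comparison between free and private insurance groups visible directly in the summary.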

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

# Work on a copy and shorten the gender labels
d <- data_in_doctorvisits
d$gender <- with(d, replace(gender, gender == "male", "M"))
d$gender <- with(d, replace(gender, gender == "female", "F"))

# Keep only the columns used below and give two of them clearer names
sub_d <- subset(d, select = c("X", "visits", "gender", "age", "income", "illness", "reduced", "private"))
names(sub_d)[names(sub_d) == "illness"] <- "sick_frq"
names(sub_d)[names(sub_d) == "reduced"] <- "in_act"
# Rescale age to years and income to dollars
sub_d$age <- sub_d$age * 100
sub_d$income <- sub_d$income * 10000
head(sub_d,10)
##     X visits gender age income sick_frq in_act private
## 1   1      1      F  19   5500        1      4     yes
## 2   2      1      F  19   4500        1      2     yes
## 3   3      1      M  19   9000        3      0      no
## 4   4      1      M  19   1500        1      0      no
## 5   5      1      M  19   4500        2      5      no
## 6   6      1      F  19   3500        5      1      no
## 7   7      1      F  19   5500        4      0      no
## 8   8      1      F  19   1500        3      0      no
## 9   9      1      F  19   6500        2      0     yes
## 10 10      1      M  19   1500        1      0     yes
summary(sub_d)
##        X            visits          gender               age       
##  Min.   :   1   Min.   :0.0000   Length:5190        Min.   :19.00  
##  1st Qu.:1298   1st Qu.:0.0000   Class :character   1st Qu.:22.00  
##  Median :2596   Median :0.0000   Mode  :character   Median :32.00  
##  Mean   :2596   Mean   :0.3017                      Mean   :40.64  
##  3rd Qu.:3893   3rd Qu.:0.0000                      3rd Qu.:62.00  
##  Max.   :5190   Max.   :9.0000                      Max.   :72.00  
##      income         sick_frq         in_act          private         
##  Min.   :    0   Min.   :0.000   Min.   : 0.0000   Length:5190       
##  1st Qu.: 2500   1st Qu.:0.000   1st Qu.: 0.0000   Class :character  
##  Median : 5500   Median :1.000   Median : 0.0000   Mode  :character  
##  Mean   : 5832   Mean   :1.432   Mean   : 0.8619                     
##  3rd Qu.: 9000   3rd Qu.:2.000   3rd Qu.: 0.0000                     
##  Max.   :15000   Max.   :5.000   Max.   :14.0000

In this example I created a subset of the data and renamed some columns to make them easier to understand. I also converted age back to years by multiplying by 100 (for example, 0.19 becomes 19) and converted income to dollars by multiplying by 10,000.
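The same kind of recoding and rescaling can be sketched more compactly. The example below is only an illustration on a made-up data frame, `toy`; the lookup vector `gender_codes` and the derived column `age_60_plus` are hypothetical names that do not appear in the project code.

```r
# Toy stand-in with the original units (age / 100, income / 10000); values are made up.
toy <- data.frame(
  gender = c("male", "female", "female"),
  age    = c(0.19, 0.32, 0.72),
  income = c(0.55, 0.90, 1.50),
  stringsAsFactors = FALSE
)

# Recode gender in one step with a named lookup vector instead of two replace() calls.
gender_codes <- c(male = "M", female = "F")
toy$gender <- unname(gender_codes[toy$gender])

# Rescale to years and dollars; round() guards against floating-point noise
# that the multiplication can introduce.
toy$age    <- round(toy$age * 100)
toy$income <- round(toy$income * 10000)

# Derived column (hypothetical): flag patients aged 60 or over.
toy$age_60_plus <- toy$age >= 60
print(toy)
```

A lookup vector keeps all label mappings in one place, which may be easier to extend than a chain of `replace()` calls if more categories are added.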

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

library(ggplot2)

#hist(sub_d$age, main = "Patient Age Histogram", xlab = "Age", ylab = "Number of patients")
ggplot(sub_d, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Patient's Age", y = "Number of Patients", title = "Histogram of Patient Age") +
  geom_vline(aes(xintercept = mean(age)), color = "blue", linetype = "dashed", linewidth = 2)

boxplot(sub_d$income)

plot(income~visits,data = sub_d,main="Income and Visits")
abline(lm(sub_d$income~sub_d$visits))

plot(age~visits,data = sub_d,main="Age and Visits")

abline(lm(sub_d$age~sub_d$visits))
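Because visits takes only a few integer values, many points in the scatter plots above fall on top of each other. The sketch below shows two possible ways to make the pattern easier to read, using a made-up data frame `toy` rather than `sub_d`; the values are illustrative only.

```r
# Toy stand-in for sub_d; the values are made up for illustration.
toy <- data.frame(
  visits = c(0, 0, 0, 1, 1, 2, 2, 3),
  age    = c(19, 22, 32, 27, 62, 52, 72, 67)
)

# Fit the linear model once and reuse it for the regression line.
fit <- lm(age ~ visits, data = toy)

# jitter() adds small random horizontal noise so overlapping points become visible.
plot(jitter(toy$visits), toy$age,
     xlab = "Visits", ylab = "Age", main = "Age and Visits (toy data)")
abline(fit)

# One box per visit count shows how the age distribution shifts across groups.
boxplot(age ~ visits, data = toy,
        xlab = "Visits", ylab = "Age", main = "Age by Number of Visits (toy data)")
```

The grouped box plot treats each visit count as a category, which may suit a discrete variable better than a plain scatter plot.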

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Question: What is the correlation between age, income, and the number of visits?
The plots suggest a positive correlation between age and the number of visits: as age increases, the number of visits tends to increase. They also suggest a negative correlation between income and the number of visits: patients with lower incomes tend to make more visits.
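A conclusion read from plots can be checked numerically with `cor()`. The sketch below runs on a made-up data frame `toy`, so the signs and sizes of its results say nothing about the real data; it only illustrates the calls that could be applied to `sub_d`.

```r
# Toy stand-in for sub_d; the values are made up for illustration.
toy <- data.frame(
  visits = c(0, 0, 1, 1, 2, 3),
  age    = c(19, 22, 32, 52, 62, 72),
  income = c(15000, 9000, 5500, 4500, 2500, 1500)
)

# Pearson correlation matrix of the three variables.
cor_matrix <- cor(toy[, c("visits", "age", "income")])
print(round(cor_matrix, 2))

# Spearman rank correlation may be more appropriate for skewed count data such as visits.
rho_age    <- cor(toy$visits, toy$age, method = "spearman")
rho_income <- cor(toy$visits, toy$income, method = "spearman")

# cor.test() adds a p-value and confidence interval to a single pairwise correlation.
age_test <- cor.test(toy$visits, toy$age)
print(age_test$estimate)
```

Reporting the coefficients alongside the plots would show how strong each relationship is, not just its direction.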

5. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

The data file has been uploaded to GitHub, and R reads it from that link.