## Load the packages and the data used in the analysis

## Warning: package 'dplyr' was built under R version 4.0.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'ipumsr' was built under R version 4.0.2
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'psych' was built under R version 4.0.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## Rows: 67
## Columns: 13
## $ cofips   <dbl> 42001, 42003, 42005, 42007, 42009, 42011, 42013, 42015, 4201…
## $ name     <chr> "Adams", "Allegheny", "Armstrong", "Beaver", "Bedford", "Ber…
## $ avemort  <dbl> 8.2360, 8.7939, 8.7597, 8.6994, 7.9789, 8.1985, 9.4295, 8.21…
## $ gini     <dbl> 0.384, 0.481, 0.403, 0.414, 0.413, 0.414, 0.434, 0.420, 0.42…
## $ depriv   <dbl> -1.94417870, 1.47773492, -0.89595270, -1.14512801, -1.862772…
## $ povrate  <dbl> 0.07045058, 0.12600280, 0.11556575, 0.10570022, 0.14187619, …
## $ pubassis <dbl> 0.01511776, 0.03061283, 0.02976539, 0.02623129, 0.02626996, …
## $ fmlhhd   <dbl> 0.06256630, 0.07292254, 0.06263135, 0.06731629, 0.04860938, …
## $ nhispwht <dbl> 0.9078318, 0.8218734, 0.9769206, 0.9131680, 0.9766557, 0.799…
## $ nhispblk <dbl> 0.016018057, 0.124774940, 0.008865506, 0.057382111, 0.003706…
## $ hispanic <dbl> 0.055993903, 0.013569216, 0.005973316, 0.010220629, 0.007533…
## $ ski05pcm <dbl> -0.297545135, 0.529664993, 0.721325159, 0.474343121, 0.08500…
## $ metro    <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, …

Question 6. R application

The mortality file (PA_mortality) has several variables. “Avemort” is the average mortality rate at the county level. “Gini” is the Gini coefficient, a measure of inequality. “Depriv” is the relative deprivation score. “Povrate” is the poverty rate. “Metro” is a dummy variable where 1 indicates metro and 0 indicates non-metro areas. Using the mortality file available on Blackboard, do the following:

Question 6a.

Generate a boxplot of the poverty rate at the county level (2 points). Based on the boxplot, what are the median poverty rate and the interquartile range (IQR) of the poverty rate? (2 points) What are the minimum and maximum values of the poverty rate? (4 points) Note: the function to generate a boxplot in R is boxplot(data$var, main="title of boxplot")

boxplot(Stat_Exam$povrate, main='poverty rate at the county level')

# Values read off the boxplot (approximate):
# Median = 2nd quartile (Q2)
# Median = 0.12
# 3rd Qu. = 0.15
# 1st Qu. = 0.10
# IQR = 3rd Qu. - 1st Qu.
#     = 0.15 - 0.10
#     = 0.05
# Minimum value = 0.05
# Maximum value = 0.20
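
Quartiles read off a boxplot are only approximate, so it can help to compute the same statistics directly. The sketch below is a minimal illustration on a hypothetical five-value vector (the first five `povrate` values from the data preview), not the full `Stat_Exam` data, so its results will differ from the answers above. The names `povrate_sample`, `five_num`, `pov_median`, `pov_iqr` and `pov_range` are made up for this example.

```r
# Illustrative subset: the first five povrate values from the data preview.
povrate_sample <- c(0.07045058, 0.12600280, 0.11556575, 0.10570022, 0.14187619)

# Minimum, quartiles and maximum in one call.
five_num <- quantile(povrate_sample, probs = c(0, 0.25, 0.5, 0.75, 1))

pov_median <- median(povrate_sample)  # 2nd quartile
pov_iqr    <- IQR(povrate_sample)     # 3rd quartile minus 1st quartile
pov_range  <- range(povrate_sample)   # c(minimum, maximum)

print(five_num)
print(pov_iqr)
print(pov_range)
```

By default `boxplot()` draws points more than 1.5 × IQR beyond the box as separate outliers, so the whisker ends are not always the minimum and maximum; `range()` gives the exact values.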

Question 6b.

Is the poverty rate normally distributed? Why or why not? Describe how you reached your conclusion. (4 points)

mean(Stat_Exam$povrate,na.rm = TRUE)
## [1] 0.1210957
median(Stat_Exam$povrate, na.rm = TRUE)
## [1] 0.1245455
hist(Stat_Exam$povrate, main="poverty rate at the county level")

sd(Stat_Exam$povrate)
## [1] 0.03403946
# For a normal distribution the mean, median and mode coincide, so a mean and median that are close together indicate a roughly symmetric distribution.
# For the poverty rate, the mean (0.121) and the median (0.125) are approximately equal, and the histogram gives a visual check of the shape.
# On that basis we conclude that the poverty rate is approximately normally distributed.
# Note that symmetry is consistent with normality but does not prove it, and the size of the standard deviation (0.034) says nothing about the shape of the distribution.
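
Comparing the mean and median only checks symmetry. A normal Q-Q plot and the Shapiro-Wilk test are two common base R ways to look at normality more directly. The sketch below runs on simulated data, a hypothetical stand-in with 67 values and a mean and standard deviation similar to those reported above. It does not use the actual `Stat_Exam$povrate` values, so its output says nothing about the real data.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical stand-in for Stat_Exam$povrate (assumption: NOT the real data).
x <- rnorm(67, mean = 0.12, sd = 0.034)

# Points close to the reference line suggest an approximately normal shape.
qqnorm(x, main = "Normal Q-Q plot")
qqline(x)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(x)
print(sw)
```

With the real data, `x` would be replaced by `Stat_Exam$povrate`.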

Question 6c.

Please create two binary variables based on “avemort” and “gini”. For the former, please recode those less than or equal to 8 as “Low Mortality”, otherwise “High Mortality.” For the latter, those less than or equal to 0.4 should be coded as “Equal”, otherwise, “Unequal.” (8 points)

## # A tibble: 67 x 4
##    avemort rec_avemort     gini rec_gini
##      <dbl> <chr>          <dbl> <chr>   
##  1    8.24 High Mortality 0.384 Equal   
##  2    8.79 High Mortality 0.481 Unequal 
##  3    8.76 High Mortality 0.403 Unequal 
##  4    8.70 High Mortality 0.414 Unequal 
##  5    7.98 Low Mortality  0.413 Unequal 
##  6    8.20 High Mortality 0.414 Unequal 
##  7    9.43 High Mortality 0.434 Unequal 
##  8    8.22 High Mortality 0.420 Unequal 
##  9    8.36 High Mortality 0.424 Unequal 
## 10    8.22 High Mortality 0.422 Unequal 
## # … with 57 more rows
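
The code that produced this table is not shown. One way the recoding could be done is sketched below in base R with `ifelse()`. It uses a hypothetical five-row data frame (`Stat_Exam_sample`, built from the first five `avemort` and `gini` values in the data preview) in place of the full data, and the resulting labels match the first five rows of the table above.

```r
# Illustrative subset: first five rows of avemort and gini from the data preview.
Stat_Exam_sample <- data.frame(
  avemort = c(8.2360, 8.7939, 8.7597, 8.6994, 7.9789),
  gini    = c(0.384, 0.481, 0.403, 0.414, 0.413)
)

New_Variables_sample <- Stat_Exam_sample

# avemort <= 8 is "Low Mortality", otherwise "High Mortality".
New_Variables_sample$rec_avemort <- ifelse(
  New_Variables_sample$avemort <= 8, "Low Mortality", "High Mortality"
)

# gini <= 0.4 is "Equal", otherwise "Unequal".
New_Variables_sample$rec_gini <- ifelse(
  New_Variables_sample$gini <= 0.4, "Equal", "Unequal"
)

print(New_Variables_sample)
```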

Question 6d.

How many counties have high mortality? And how many counties have an “unequal” Gini coefficient? (8 points)

# How many counties have high mortality?
High_mortality <- New_Variables %>%
  filter(rec_avemort == 'High Mortality')
nrow(High_mortality) # Total number of counties with high mortality.
## [1] 52
# Low-mortality counties, kept for the confidence interval in Question 6e.
Low_mortality <- New_Variables %>%
  filter(rec_avemort == 'Low Mortality')

# How many counties have an "unequal" Gini coefficient?
unequal_gini <- New_Variables %>%
  filter(rec_gini == 'Unequal')
nrow(unequal_gini) # Total number of counties with an unequal Gini coefficient.
## [1] 56
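
As an alternative to filtering and counting rows, `table()` gives the counts for every category at once. The sketch below is a minimal example on a hypothetical five-row data frame (`rec_sample`, holding the labels of the first five counties), not on the full `New_Variables` data, so its counts differ from 52 and 56.

```r
# Illustrative subset: recoded labels for the first five counties.
rec_sample <- data.frame(
  rec_avemort = c("High Mortality", "High Mortality", "High Mortality",
                  "High Mortality", "Low Mortality"),
  rec_gini    = c("Equal", "Unequal", "Unequal", "Unequal", "Unequal")
)

# One-way counts for each recoded variable.
mort_counts <- table(rec_sample$rec_avemort)
gini_counts <- table(rec_sample$rec_gini)

# Two-way table: mortality level by inequality level.
cross_counts <- table(rec_sample$rec_avemort, rec_sample$rec_gini)

print(mort_counts)
print(gini_counts)
print(cross_counts)
```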

Question 6e.

Show the confidence intervals for gini coefficients when county mortality level is low and high, respectively.

# 95% confidence interval for the Gini coefficient when county mortality is high, using the normal distribution (n = 52).
length(High_mortality$gini)
## [1] 52
mean(High_mortality$gini)
## [1] 0.4200577
sd(High_mortality$gini)
## [1] 0.02342817
error_HM <- qnorm(0.975) * sd(High_mortality$gini) / sqrt(length(High_mortality$gini))
Lower_limit_HM <- mean(High_mortality$gini) - error_HM
Upper_Limit_HM <- mean(High_mortality$gini) + error_HM
print(Lower_limit_HM)
## [1] 0.41369
print(Upper_Limit_HM)
## [1] 0.4264254
# 95% confidence interval for the Gini coefficient when county mortality is low, using the t distribution (small sample, n = 15).
length(Low_mortality$gini)
## [1] 15
mean(Low_mortality$gini)
## [1] 0.4218
sd(Low_mortality$gini)
## [1] 0.02341612
error_LM <- qt(0.975, df = length(Low_mortality$gini) - 1) * sd(Low_mortality$gini) / sqrt(length(Low_mortality$gini))
# The interval is built from the low-mortality mean and its own margin of error.
Lower_limit_LM <- mean(Low_mortality$gini) - error_LM
Upper_Limit_LM <- mean(Low_mortality$gini) + error_LM
print(Lower_limit_LM)
# approximately 0.409, from the mean, sd and n printed above
print(Upper_Limit_LM)
# approximately 0.435, from the mean, sd and n printed above
## i    Do these confidence intervals overlap? (4 points)
# Yes. The high-mortality interval (0.414 to 0.426) lies entirely inside the low-mortality interval (about 0.409 to 0.435).
## ii   Interpret the confidence intervals from e).    (8 points)
# We can be 95% confident that the mean Gini coefficient of high-mortality counties lies between 0.414 and 0.426, and that the mean Gini coefficient of low-mortality counties lies between about 0.409 and 0.435. The low-mortality interval is wider because it is based on only 15 counties.

## iii   What conclusion(s) can you draw with regard to the county’s mortality levels and gini coefficients? (4 points)

# Since the confidence intervals for high-mortality and low-mortality counties overlap, and the two sample means (0.420 and 0.422) are nearly identical, the observed difference in mean Gini coefficient between the two mortality groups is not statistically significant. These data give no evidence that inequality differs between high-mortality and low-mortality counties.
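
The summary statistics printed in 6e (mean, standard deviation and group size) are enough to recompute both intervals and check the overlap. The sketch below does this with a hypothetical helper, `t_ci()`, which is not part of the original analysis. It is written for illustration under the assumption that the t distribution is used for both groups, so the high-mortality interval comes out slightly wider than the `qnorm()` version above.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
t_ci <- function(mean_x, sd_x, n, level = 0.95) {
  margin <- qt(1 - (1 - level) / 2, df = n - 1) * sd_x / sqrt(n)
  c(lower = mean_x - margin, upper = mean_x + margin)
}

# Summary statistics taken from the output in 6e.
ci_high <- t_ci(mean_x = 0.4200577, sd_x = 0.02342817, n = 52)
ci_low  <- t_ci(mean_x = 0.4218,    sd_x = 0.02341612, n = 15)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap <- ci_high[["lower"]] <= ci_low[["upper"]] &&
  ci_low[["lower"]] <= ci_high[["upper"]]

print(ci_high)
print(ci_low)
print(intervals_overlap)
```

With the raw data available, `t.test(Low_mortality$gini)$conf.int` would give the same kind of interval in a single call.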