## Load the packages used in the analysis (dplyr, ipumsr, ggplot2, psych) and the data
## Rows: 67
## Columns: 13
## $ cofips <dbl> 42001, 42003, 42005, 42007, 42009, 42011, 42013, 42015, 4201…
## $ name <chr> "Adams", "Allegheny", "Armstrong", "Beaver", "Bedford", "Ber…
## $ avemort <dbl> 8.2360, 8.7939, 8.7597, 8.6994, 7.9789, 8.1985, 9.4295, 8.21…
## $ gini <dbl> 0.384, 0.481, 0.403, 0.414, 0.413, 0.414, 0.434, 0.420, 0.42…
## $ depriv <dbl> -1.94417870, 1.47773492, -0.89595270, -1.14512801, -1.862772…
## $ povrate <dbl> 0.07045058, 0.12600280, 0.11556575, 0.10570022, 0.14187619, …
## $ pubassis <dbl> 0.01511776, 0.03061283, 0.02976539, 0.02623129, 0.02626996, …
## $ fmlhhd <dbl> 0.06256630, 0.07292254, 0.06263135, 0.06731629, 0.04860938, …
## $ nhispwht <dbl> 0.9078318, 0.8218734, 0.9769206, 0.9131680, 0.9766557, 0.799…
## $ nhispblk <dbl> 0.016018057, 0.124774940, 0.008865506, 0.057382111, 0.003706…
## $ hispanic <dbl> 0.055993903, 0.013569216, 0.005973316, 0.010220629, 0.007533…
## $ ski05pcm <dbl> -0.297545135, 0.529664993, 0.721325159, 0.474343121, 0.08500…
## $ metro <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, …
The mortality file (PA_mortality) has several variables. “avemort” is the average mortality rate at the county level. “gini” is the Gini coefficient, a measure of inequality. “depriv” is the relative deprivation score. “povrate” is the poverty rate. “metro” is a dummy variable where 1 indicates metro and 0 indicates non-metro areas. Using the mortality file available on Blackboard, do the following:
Generate a boxplot of the poverty rate at the county level (2 points). Based on the boxplot, what are the median poverty rate and the interquartile range (IQR) of the poverty rate? (2 points) What are the minimum and maximum values of the poverty rate? (4 points) Note: the function to generate a boxplot in R is boxplot(data$var, main="title of boxplot")
boxplot(Stat_Exam$povrate, main='poverty rate at the county level')
# Values read from the boxplot (approximate):
# Median (2nd quartile) = 0.12
# 3rd quartile = 0.15
# 1st quartile = 0.10
# IQR = 3rd quartile - 1st quartile = 0.15 - 0.10 = 0.05
# Minimum value = 0.05
# Maximum value = 0.20
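The figures above come from inspecting the plot. As a cross-check, the same statistics can be computed directly. The sketch below is only an illustration: it uses the first five povrate values visible in the data preview (the name povrate_sample is hypothetical), so its numbers will not match the full 67-county data. On the real data one would pass Stat_Exam$povrate instead.

```r
# First five poverty rates from the data preview (illustration only,
# not the full data set)
povrate_sample <- c(0.07045058, 0.12600280, 0.11556575, 0.10570022, 0.14187619)

# Quartiles: 25th, 50th (median) and 75th percentiles
quartiles <- quantile(povrate_sample, probs = c(0.25, 0.50, 0.75))

# Interquartile range = 3rd quartile - 1st quartile
iqr_value <- IQR(povrate_sample)

# Smallest and largest observed values
value_range <- range(povrate_sample)

print(quartiles)
print(iqr_value)
print(value_range)
```

Note that boxplot() draws its whiskers only to the most extreme points within 1.5 times the IQR of the box, so range() is the safer way to obtain the true minimum and maximum when outliers are present.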
Is the poverty rate normally distributed? Why or why not? Describe how you reached your conclusion. (4 points)
mean(Stat_Exam$povrate,na.rm = TRUE)
## [1] 0.1210957
median (Stat_Exam$povrate,na.rm = TRUE)
## [1] 0.1245455
hist(Stat_Exam$povrate, main="poverty rate at the county level")
sd(Stat_Exam$povrate)
## [1] 0.03403946
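# In a normal distribution the mean, median and mode coincide and the histogram is symmetric and bell-shaped. For the poverty rate, the mean (0.121) and the median (0.125) are approximately equal, which suggests the distribution is roughly symmetric; the histogram should be checked for a single peak without a long tail. The size of the standard deviation (0.034) does not by itself indicate normality, since a normal distribution can have any standard deviation. On this evidence the poverty rate appears to be approximately normally distributed, although closeness of the mean and median alone does not prove normality; a Q-Q plot or a Shapiro-Wilk test would give a firmer check.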
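Two more direct normality checks in base R are a normal Q-Q plot and the Shapiro-Wilk test. The sketch below is a minimal illustration and does not use the real data: povrate_sim is a simulated stand-in drawn from a normal distribution with the mean and standard deviation reported above. On the real data one would replace it with Stat_Exam$povrate.

```r
# Simulated stand-in for Stat_Exam$povrate (NOT the real data):
# 67 draws from a normal distribution with the reported mean and sd
set.seed(42)
povrate_sim <- rnorm(67, mean = 0.1210957, sd = 0.03403946)

# Q-Q plot: points lying close to the reference line suggest
# approximate normality
qqnorm(povrate_sim, main = "Normal Q-Q plot of poverty rate")
qqline(povrate_sim)

# Shapiro-Wilk test: a small p-value (for example below 0.05)
# is evidence against normality
sw <- shapiro.test(povrate_sim)
print(sw$p.value)
```

A large p-value does not prove normality; it only means the test found no evidence against it at this sample size.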
Please create two binary variables based on “avemort” and “gini”. For the former, please recode those less than or equal to 8 as “Low Mortality”, otherwise “High Mortality.” For the latter, those less than or equal to 0.4 should be coded as “Equal”, otherwise, “Unequal.” (8 points)
## # A tibble: 67 x 4
## avemort rec_avemort gini rec_gini
## <dbl> <chr> <dbl> <chr>
## 1 8.24 High Mortality 0.384 Equal
## 2 8.79 High Mortality 0.481 Unequal
## 3 8.76 High Mortality 0.403 Unequal
## 4 8.70 High Mortality 0.414 Unequal
## 5 7.98 Low Mortality 0.413 Unequal
## 6 8.20 High Mortality 0.414 Unequal
## 7 9.43 High Mortality 0.434 Unequal
## 8 8.22 High Mortality 0.420 Unequal
## 9 8.36 High Mortality 0.424 Unequal
## 10 8.22 High Mortality 0.422 Unequal
## # … with 57 more rows
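The code that produced this table is not shown in the output. One way to build the two binary variables is sketched below with base R's ifelse(); this is an assumption about the approach, and the original may well have used dplyr's mutate() instead. The data frame holds only the ten rows displayed above, rounded as printed.

```r
# First ten rows of avemort and gini, copied from the printed table
New_Variables <- data.frame(
  avemort = c(8.24, 8.79, 8.76, 8.70, 7.98, 8.20, 9.43, 8.22, 8.36, 8.22),
  gini    = c(0.384, 0.481, 0.403, 0.414, 0.413, 0.414, 0.434, 0.420, 0.424, 0.422)
)

# avemort <= 8 is "Low Mortality", anything above is "High Mortality"
New_Variables$rec_avemort <- ifelse(New_Variables$avemort <= 8,
                                    "Low Mortality", "High Mortality")

# gini <= 0.4 is "Equal", anything above is "Unequal"
New_Variables$rec_gini <- ifelse(New_Variables$gini <= 0.4,
                                 "Equal", "Unequal")

print(New_Variables[, c("avemort", "rec_avemort", "gini", "rec_gini")])
```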
How many counties have high mortality? And how many counties have “unequal” gini coefficient? (8 points)
# How many counties have high mortality?
High_mortality <- New_Variables %>%
  filter(rec_avemort == 'High Mortality')
nrow(High_mortality) # Number of counties with high mortality
## [1] 52
# Counties with low mortality, kept for the confidence intervals later on
Low_mortality <- New_Variables %>%
  filter(rec_avemort == 'Low Mortality')
# How many counties have an "unequal" Gini coefficient?
unequal_gini <- New_Variables %>%
  filter(rec_gini == 'Unequal')
nrow(unequal_gini) # Number of counties with an unequal Gini coefficient
## [1] 56
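Filtering and counting rows works, but the category counts can also be obtained in one step. The sketch below shows two base R alternatives; it uses only the labels of the ten counties displayed in the earlier table, so the counts differ from the full-data results of 52 and 56.

```r
# Recoded labels for the first ten counties, copied from the printed table
rec_avemort <- c("High Mortality", "High Mortality", "High Mortality",
                 "High Mortality", "Low Mortality", "High Mortality",
                 "High Mortality", "High Mortality", "High Mortality",
                 "High Mortality")

# table() counts every category in a single call
mortality_counts <- table(rec_avemort)
print(mortality_counts)

# sum() over a logical comparison counts one category
n_high <- sum(rec_avemort == "High Mortality")
print(n_high)
```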
Show the confidence intervals for gini coefficients when county mortality level is low and high, respectively.
# 95% confidence interval for the Gini coefficient when county mortality is high, using the normal distribution (n = 52).
length(High_mortality$gini)
## [1] 52
mean(High_mortality$gini)
## [1] 0.4200577
sd (High_mortality$gini)
## [1] 0.02342817
error_HM <- qnorm(0.975)*sd (High_mortality$gini)/sqrt(length(High_mortality$gini))
Lower_limit_HM <- mean(High_mortality$gini)- error_HM
Upper_Limit_HM <- mean(High_mortality$gini)+ error_HM
print(Lower_limit_HM )
## [1] 0.41369
print(Upper_Limit_HM)
## [1] 0.4264254
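The calculation above follows the usual pattern of mean plus or minus a critical value times the standard error. To avoid repeating it for each group, it can be wrapped in a small function. In the sketch below, mean_ci() is a hypothetical helper (not part of any package), and gini_sample contains only the ten Gini values displayed earlier, so the resulting interval is illustrative.

```r
# Confidence interval for a mean: mean +/- critical value * standard error.
# mean_ci() is a hypothetical helper written for this illustration.
mean_ci <- function(x, conf = 0.95, use_t = TRUE) {
  x <- x[!is.na(x)]                 # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)             # standard error of the mean
  p <- 1 - (1 - conf) / 2           # e.g. 0.975 for a 95% interval
  crit <- if (use_t) qt(p, df = n - 1) else qnorm(p)
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

# Ten Gini values copied from the printed table (illustration only)
gini_sample <- c(0.384, 0.481, 0.403, 0.414, 0.413,
                 0.414, 0.434, 0.420, 0.424, 0.422)

ci_t <- mean_ci(gini_sample)                 # t-based interval
ci_z <- mean_ci(gini_sample, use_t = FALSE)  # normal-based interval
print(ci_t)
print(ci_z)
```

The t-based interval is always somewhat wider than the normal-based one, and the gap shrinks as the sample grows, which is why the normal approximation is commonly accepted for the larger group.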
# 95% confidence interval for the Gini coefficient when county mortality is low, using the t distribution because the group is small (n = 15).
length(Low_mortality$gini)
## [1] 15
mean(Low_mortality$gini)
## [1] 0.4218
sd(Low_mortality$gini)
## [1] 0.02341612
error_LM <- qt(0.975, df = length(Low_mortality$gini) - 1) * sd(Low_mortality$gini) / sqrt(length(Low_mortality$gini))
Lower_limit_LM <- mean(Low_mortality$gini) - error_LM
Upper_Limit_LM <- mean(Low_mortality$gini) + error_LM
print(Lower_limit_LM) # approximately 0.409, from the mean, sd and n printed above
print(Upper_Limit_LM) # approximately 0.435
## i Do these confidence intervals overlap? (4 points)
# Yes. The high-mortality interval (about 0.414 to 0.426) lies entirely inside the low-mortality interval (about 0.409 to 0.435).
## ii Interpret the confidence intervals from e). (8 points)
# We can be 95% confident that the mean Gini coefficient lies between about 0.414 and 0.426 for counties with high mortality, and between about 0.409 and 0.435 for counties with low mortality. The low-mortality interval is wider because that group contains only 15 counties.
## iii What conclusion(s) can you draw with regard to the county’s mortality levels and gini coefficients? (4 points)
# Since the confidence intervals for the Gini coefficient in high-mortality and low-mortality counties overlap, there is no evidence that the mean Gini coefficients of the two mortality groups differ; the observed difference between the group means is not statistically significant.
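Comparing intervals by eye is an informal check. A two-sample test addresses the question directly. The sketch below computes a Welch t-test by hand from the summary statistics printed earlier (group sizes, means and standard deviations), so it needs no raw data; the variable names are hypothetical, and choosing the Welch test is an assumption rather than something the exam specifies.

```r
# Summary statistics copied from the output above
n_high <- 52; mean_high <- 0.4200577; sd_high <- 0.02342817
n_low  <- 15; mean_low  <- 0.4218;    sd_low  <- 0.02341612

# Squared standard error of each group mean
v_high <- sd_high^2 / n_high
v_low  <- sd_low^2 / n_low

# Welch t statistic for the difference in means
t_stat <- (mean_high - mean_low) / sqrt(v_high + v_low)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_high + v_low)^2 /
  (v_high^2 / (n_high - 1) + v_low^2 / (n_low - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(t_stat)
print(df_welch)
print(p_value)
```

With the raw data available, t.test(gini ~ rec_avemort, data = New_Variables) runs the same Welch test in a single call.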