R Markdown
Week 4 Discussion
The “swiss” dataset in library (datasets) contains information about socioeconomic indicators and fertility. Provide descriptive statistics and confidence intervals for appropriate variables. Interpret. Submit your R code.
Loading the Swiss data set and storing it in a variable
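A minimal sketch of this step, assuming the built-in data frame is simply copied unchanged into the name `mydata` used by the calls that follow:

```r
# Assumption: `mydata` is a plain copy of the built-in swiss data frame
library(datasets)
mydata <- swiss

# Number of rows (observations) and columns (variables)
dim(mydata)
```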
Viewing columns in the dataset
names(mydata)
## [1] "Fertility" "Agriculture" "Examination" "Education"
## [5] "Catholic" "Infant.Mortality"
The data cover 47 French-speaking provinces of Switzerland (referred to below as municipalities), measured around 1888.
row.names(mydata)
## [1] "Courtelary" "Delemont" "Franches-Mnt" "Moutier" "Neuveville"
## [6] "Porrentruy" "Broye" "Glane" "Gruyere" "Sarine"
## [11] "Veveyse" "Aigle" "Aubonne" "Avenches" "Cossonay"
## [16] "Echallens" "Grandson" "Lausanne" "La Vallee" "Lavaux"
## [21] "Morges" "Moudon" "Nyone" "Orbe" "Oron"
## [26] "Payerne" "Paysd'enhaut" "Rolle" "Vevey" "Yverdon"
## [31] "Conthey" "Entremont" "Herens" "Martigwy" "Monthey"
## [36] "St Maurice" "Sierre" "Sion" "Boudry" "La Chauxdfnd"
## [41] "Le Locle" "Neuchatel" "Val de Ruz" "ValdeTravers" "V. De Geneve"
## [46] "Rive Droite" "Rive Gauche"
Summary statistics for the Swiss data
summary(mydata)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
The describe() function in the psych package gives a more detailed summary, including standard deviation, skew, kurtosis, and standard error.
install.packages("psych", repos = "http://cran.us.r-project.org")
library(psych)
describe(mydata)
## vars n mean sd median trimmed mad min max range
## Fertility 1 47 70.14 12.49 70.40 70.66 10.23 35.00 92.5 57.50
## Agriculture 2 47 50.66 22.71 54.10 51.16 23.87 1.20 89.7 88.50
## Examination 3 47 16.49 7.98 16.00 16.08 7.41 3.00 37.0 34.00
## Education 4 47 10.98 9.62 8.00 9.38 5.93 1.00 53.0 52.00
## Catholic 5 47 41.14 41.70 15.14 39.12 18.65 2.15 100.0 97.85
## Infant.Mortality 6 47 19.94 2.91 20.00 19.98 2.82 10.80 26.6 15.80
## skew kurtosis se
## Fertility -0.46 0.26 1.82
## Agriculture -0.32 -0.89 3.31
## Examination 0.45 -0.14 1.16
## Education 2.27 6.14 1.40
## Catholic 0.48 -1.67 6.08
## Infant.Mortality -0.33 0.78 0.42
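The `se` column above is the standard error of the mean, which is the quantity the confidence intervals later in this report are built from. As a rough check, it can be reproduced in base R as the standard deviation divided by the square root of the sample size (the name `se_by_hand` is only illustrative):

```r
# Standard error of the mean for each column: sd / sqrt(n)
se_by_hand <- sapply(swiss, function(x) sd(x) / sqrt(length(x)))

# Rounded to two decimals for comparison with the describe() output
round(se_by_hand, 2)
```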
Let's look at a visual representation of the variables in swiss.
boxplot(swiss)
It's clear that Catholic has the widest range of values, Infant.Mortality has the narrowest spread, and Education has the most outliers or extreme values.
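The outlier comparison can also be checked numerically. The sketch below assumes the usual boxplot rule (points beyond 1.5 times the interquartile range from the hinges), which is what `boxplot.stats()` applies by default; the name `outlier_counts` is illustrative:

```r
# Count the points each boxplot would draw as outliers
outlier_counts <- sapply(swiss, function(x) length(boxplot.stats(x)$out))
outlier_counts

# Name of the variable with the most flagged points
names(which.max(outlier_counts))
```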
Next, look at the distribution of each variable to judge which are good candidates for confidence intervals. Fertility is approximately normally distributed for a sample of this size, with the majority of values between 60 and 90.
hist(swiss$Fertility,main="Fertility", xlab = "Fertility")
Education is strongly right-skewed (skew = 2.27): in the majority of municipalities fewer than 20% of draftees were educated beyond primary school, and a few municipalities have much higher values.
hist(swiss$Education,main="Education", xlab = "Education")
The shape of the Agriculture distribution is difficult to determine from the histogram. It's clear there are some extreme values, and these are mostly concentrated in the left tail.
hist(swiss$Agriculture,main="Agriculture", xlab = "Agriculture")
A density plot may be more useful for Agriculture, since the shape is hard to read from the histogram bars. The density plot shows that the distribution is slightly left-skewed (skew = -0.32), with a longer tail toward low values.
plot(density(swiss$Agriculture),main="Agriculture")
The Catholic variable has a bimodal distribution, with most values clustered near the two extremes.
hist(swiss$Catholic,main="Catholic", xlab = "Catholic")
Examination is slightly right-skewed (skew = 0.45), indicating that a small number of municipalities have a high percentage of draftees with the highest exam marks.
hist(swiss$Examination,main="Examination", xlab = "Examination")
Infant.Mortality appears approximately normally distributed, with a few smaller rates in the left tail indicating a small number of municipalities with low infant mortality.
hist(swiss$Infant.Mortality,main="Infant.Mortality", xlab = "Infant.Mortality")
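The skew directions read off the histograms can be cross-checked numerically. This is a minimal sketch using the plain moment-based skewness; the helper name `skewness` is hypothetical, and the values may differ slightly from the `skew` column of `describe()`, which applies a different small-sample scaling. The sign is what matters here: negative means a longer left tail, positive a longer right tail.

```r
# Moment-based sample skewness (third central moment over variance^1.5)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# One value per variable; the sign gives the direction of the longer tail
skew_by_hand <- sapply(swiss, skewness)
round(skew_by_hand, 2)
```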
Confidence Intervals
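I ran a one-sample t-test on each variable to obtain its 95% confidence interval. By default t.test() tests the null hypothesis that the true mean equals 0, which is not of practical interest for these variables; only the interval is used here.
We are 95% confident that the population mean for Fertility is between 66.5 and 73.8. The sample mean of 70.1 is the point estimate at the center of the interval, not the true mean. Because the interval excludes 0, the default null hypothesis is rejected (p < 2.2e-16).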
t.test(swiss$Fertility)
##
## One Sample t-test
##
## data: swiss$Fertility
## t = 38.495, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 66.47485 73.81025
## sample estimates:
## mean of x
## 70.14255
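To show where the interval comes from, the following sketch rebuilds the Fertility interval by hand as the sample mean plus or minus the critical t value times the standard error. The variable names are illustrative only:

```r
x <- swiss$Fertility
n <- length(x)

# Standard error of the mean and the two-sided 95% critical value (df = n - 1)
se <- sd(x) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)

# Lower and upper limits of the 95% confidence interval
ci_by_hand <- mean(x) + c(-1, 1) * t_crit * se
ci_by_hand
```

The result should agree with the interval reported by `t.test(swiss$Fertility)`.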
We are 95% confident that the population mean for Agriculture is between 44.0 and 57.3. The sample mean of 50.7 is the point estimate, not the true mean, and the default null hypothesis of a zero mean is rejected (p < 2.2e-16).
t.test(swiss$Agriculture)
##
## One Sample t-test
##
## data: swiss$Agriculture
## t = 15.292, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 43.99131 57.32784
## sample estimates:
## mean of x
## 50.65957
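If a hypothesis test against a meaningful value were wanted, `t.test()` accepts a `mu` argument for the hypothesized mean. The value 50 below is purely an illustrative assumption, not something taken from the assignment. Changing `mu` affects the t statistic and p-value but leaves the confidence interval unchanged.

```r
# mu = 50 is an illustrative assumption, not a value from the assignment
test_50 <- t.test(swiss$Agriculture, mu = 50)

# p-value for H0: true mean equals 50
test_50$p.value

# The 95% confidence interval does not depend on mu
test_50$conf.int
```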
We are 95% confident that the population mean for Examination is between 14.1 and 18.8. The sample mean of 16.5 is the point estimate, not the true mean, and the default null hypothesis of a zero mean is rejected (p < 2.2e-16).
t.test(swiss$Examination)
##
## One Sample t-test
##
## data: swiss$Examination
## t = 14.17, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 14.14697 18.83176
## sample estimates:
## mean of x
## 16.48936
We are 95% confident that the population mean for Education is between 8.2 and 13.8. The sample mean of 11.0 is the point estimate, not the true mean, and the default null hypothesis of a zero mean is rejected (p = 5.314e-10).
t.test(swiss$Education)
##
## One Sample t-test
##
## data: swiss$Education
## t = 7.8277, df = 46, p-value = 5.314e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 8.155534 13.801913
## sample estimates:
## mean of x
## 10.97872
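We are 95% confident that the population mean for Catholic is between 28.9 and 53.4. The sample mean of 41.1 is the point estimate, not the true mean, and the default null hypothesis of a zero mean is rejected (p = 2.064e-08). Because Catholic is bimodal, the mean does not describe a typical municipality, so this interval should be interpreted with caution.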
t.test(swiss$Catholic)
##
## One Sample t-test
##
## data: swiss$Catholic
## t = 6.7634, df = 46, p-value = 2.064e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 28.89883 53.38883
## sample estimates:
## mean of x
## 41.14383
We are 95% confident that the population mean for Infant.Mortality is between 19.1 and 20.8. The sample mean of 19.9 is the point estimate, not the true mean, and the default null hypothesis of a zero mean is rejected (p < 2.2e-16).
t.test(swiss$Infant.Mortality)
##
## One Sample t-test
##
## data: swiss$Infant.Mortality
## t = 46.939, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 19.08735 20.79775
## sample estimates:
## mean of x
## 19.94255
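The six intervals can also be collected in one table rather than read from separate outputs. This is a sketch only; the name `ci_table` and its column labels are illustrative:

```r
# One row per variable: sample mean and 95% confidence limits from t.test()
ci_table <- t(sapply(swiss, function(x) {
  ci <- t.test(x)$conf.int
  c(mean = mean(x), lower = ci[1], upper = ci[2])
}))

round(ci_table, 2)
```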