R Markdown

Week 4 Discussion

The “swiss” dataset in library (datasets) contains information about socioeconomic indicators and fertility. Provide descriptive statistics and confidence intervals for appropriate variables. Interpret. Submit your R code.

Loading the Swiss data set and storing it in a variable
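The chunk that loads the data is not shown in the rendered output. A minimal sketch, assuming the variable name mydata that the later calls use:

```r
# swiss ships with base R in the datasets package
data(swiss, package = "datasets")

# Store a copy under the name used by the later calls (assumed from names(mydata))
mydata <- swiss

# Basic shape: rows are municipalities, columns are indicators
dim(mydata)
```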

Viewing columns in the dataset

names(mydata)
## [1] "Fertility"        "Agriculture"      "Examination"      "Education"       
## [5] "Catholic"         "Infant.Mortality"

The data covers 47 municipalities (French-speaking provinces of Switzerland, recorded around 1888):

row.names(mydata)
##  [1] "Courtelary"   "Delemont"     "Franches-Mnt" "Moutier"      "Neuveville"  
##  [6] "Porrentruy"   "Broye"        "Glane"        "Gruyere"      "Sarine"      
## [11] "Veveyse"      "Aigle"        "Aubonne"      "Avenches"     "Cossonay"    
## [16] "Echallens"    "Grandson"     "Lausanne"     "La Vallee"    "Lavaux"      
## [21] "Morges"       "Moudon"       "Nyone"        "Orbe"         "Oron"        
## [26] "Payerne"      "Paysd'enhaut" "Rolle"        "Vevey"        "Yverdon"     
## [31] "Conthey"      "Entremont"    "Herens"       "Martigwy"     "Monthey"     
## [36] "St Maurice"   "Sierre"       "Sion"         "Boudry"       "La Chauxdfnd"
## [41] "Le Locle"     "Neuchatel"    "Val de Ruz"   "ValdeTravers" "V. De Geneve"
## [46] "Rive Droite"  "Rive Gauche"

Summary statistics for the swiss data

summary(mydata)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

Another useful way to look at the summary data is the describe() function from the psych package, which adds the standard deviation, skew, kurtosis, and standard error.

# Install psych only if it is not already available
if (!requireNamespace("psych", quietly = TRUE)) {
  install.packages("psych", repos = "http://cran.us.r-project.org")
}
library(psych)
describe(mydata)
##                  vars  n  mean    sd median trimmed   mad   min   max range
## Fertility           1 47 70.14 12.49  70.40   70.66 10.23 35.00  92.5 57.50
## Agriculture         2 47 50.66 22.71  54.10   51.16 23.87  1.20  89.7 88.50
## Examination         3 47 16.49  7.98  16.00   16.08  7.41  3.00  37.0 34.00
## Education           4 47 10.98  9.62   8.00    9.38  5.93  1.00  53.0 52.00
## Catholic            5 47 41.14 41.70  15.14   39.12 18.65  2.15 100.0 97.85
## Infant.Mortality    6 47 19.94  2.91  20.00   19.98  2.82 10.80  26.6 15.80
##                   skew kurtosis   se
## Fertility        -0.46     0.26 1.82
## Agriculture      -0.32    -0.89 3.31
## Examination       0.45    -0.14 1.16
## Education         2.27     6.14 1.40
## Catholic          0.48    -1.67 6.08
## Infant.Mortality -0.33     0.78 0.42

Let's look at a visual representation of the variables in swiss

boxplot(swiss)

It is clear that Catholic has the widest range of values, Infant.Mortality has the narrowest, and Education has the most outliers or extreme values.
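The visual impression from the boxplot can be checked numerically. The following is a minimal sketch using boxplot.stats() from base R, which applies the same 1.5 × IQR whisker rule that boxplot() uses by default. The names out_counts and ranges are introduced here for illustration only.

```r
# Count points beyond the boxplot whiskers (1.5 * IQR rule) for each variable
out_counts <- sapply(swiss, function(x) length(boxplot.stats(x)$out))
out_counts

# Range (max - min) of each variable, to compare spread
ranges <- sapply(swiss, function(x) diff(range(x)))
round(ranges, 2)
```

Counting whisker outliers is only one convention for "extreme values"; a different rule could rank the variables differently.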

Next, look at the distribution of each variable to judge which are good candidates for confidence intervals. Fertility looks approximately normal for this sample size, with the majority of values between 60 and 90.

hist(swiss$Fertility,main="Fertility", xlab = "Fertility")

Education is strongly right-skewed (skew 2.27 in the describe() output): the majority of municipalities have less than 20% of people educated beyond primary school, and a few have much higher values.

hist(swiss$Education,main="Education", xlab = "Education")

The distribution of Agriculture is harder to read from the histogram. There appear to be some extreme values, mostly concentrated in the left tail.

hist(swiss$Agriculture,main="Agriculture", xlab = "Agriculture")

A density plot may be more useful for Agriculture, since the shape is hard to judge from the histogram bars. It suggests that the distribution is slightly left-skewed, which agrees with the negative skew (-0.32) in the describe() output.

plot(density(swiss$Agriculture),main="Agriculture")

The Catholic variable has a bimodal distribution: most municipalities are either almost entirely Catholic or almost entirely non-Catholic, with few in between.

hist(swiss$Catholic,main="Catholic", xlab = "Catholic")

Examination is slightly right-skewed (skew 0.45 in the describe() output), indicating that a small number of municipalities have a high percentage of draftees with the highest exam marks.

hist(swiss$Examination,main="Examination", xlab = "Examination")

Infant mortality appears approximately normally distributed, with a few lower rates in the left tail from a small number of municipalities with low Infant.Mortality.

hist(swiss$Infant.Mortality,main="Infant.Mortality", xlab = "Infant.Mortality")
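The skew directions read off the histograms can be cross-checked numerically. The sketch below is one possible check, not part of the original analysis: sample_skew is a hypothetical helper that computes moment-based skewness (its values may differ slightly from psych's describe(), which uses a different denominator, but the signs agree), and shapiro.test() from base R gives a rough normality check.

```r
# Moment-based sample skewness:
# negative = longer left tail, positive = longer right tail
sample_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# One row per variable: skewness and Shapiro-Wilk p-value
shape_check <- data.frame(
  skew      = sapply(swiss, sample_skew),
  shapiro_p = sapply(swiss, function(x) shapiro.test(x)$p.value)
)
round(shape_check, 3)
```

A small Shapiro-Wilk p-value suggests a departure from normality. With 47 observations the test has limited power, so it is best read alongside the plots and not in place of them.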

Confidence Intervals

I ran a one-sample t-test on each variable with t.test() to calculate a 95% confidence interval for its mean.

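We are 95% confident that the population mean for Fertility is between 66.5 and 73.8. The sample mean of 70.14 sits at the center of this interval by construction; it is an estimate, not the true mean. The null hypothesis that t.test() tests by default is that the true mean equals 0, and it is rejected here (p < 2.2e-16), but that test is not meaningful for this variable, so only the interval is of interest.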

t.test(swiss$Fertility)
## 
##  One Sample t-test
## 
## data:  swiss$Fertility
## t = 38.495, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  66.47485 73.81025
## sample estimates:
## mean of x 
##  70.14255
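The interval reported by t.test() can be reproduced by hand as the sample mean plus or minus the critical t value times the standard error, s / sqrt(n). A minimal sketch for Fertility follows; the variable names are chosen here for illustration.

```r
x <- swiss$Fertility
n <- length(x)

# Standard error of the mean
se <- sd(x) / sqrt(n)

# Critical t value for a two-sided 95% interval with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# Interval: sample mean plus or minus t_crit standard errors
manual_ci <- mean(x) + c(-1, 1) * t_crit * se
manual_ci

# The same interval as extracted from t.test()
as.numeric(t.test(x)$conf.int)
```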

We are 95% confident that the population mean for Agriculture is between 44.0 and 57.3, an interval centered on the sample mean of 50.66. The default null hypothesis of a zero mean is rejected (p < 2.2e-16), though only the interval is of interest here.

t.test(swiss$Agriculture)
## 
##  One Sample t-test
## 
## data:  swiss$Agriculture
## t = 15.292, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  43.99131 57.32784
## sample estimates:
## mean of x 
##  50.65957

We are 95% confident that the population mean for Examination is between 14.1 and 18.8, an interval centered on the sample mean of 16.49. The default null hypothesis of a zero mean is rejected (p < 2.2e-16), though only the interval is of interest here.

t.test(swiss$Examination)
## 
##  One Sample t-test
## 
## data:  swiss$Examination
## t = 14.17, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  14.14697 18.83176
## sample estimates:
## mean of x 
##  16.48936

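We are 95% confident that the population mean for Education is between 8.2 and 13.8, an interval centered on the sample mean of 10.98. The default null hypothesis of a zero mean is rejected (p = 5.314e-10), though only the interval is of interest here. Because Education is strongly skewed, this t-based interval should be read with some caution.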

t.test(swiss$Education)
## 
##  One Sample t-test
## 
## data:  swiss$Education
## t = 7.8277, df = 46, p-value = 5.314e-10
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   8.155534 13.801913
## sample estimates:
## mean of x 
##  10.97872

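We are 95% confident that the population mean for Catholic is between 28.9 and 53.4, an interval centered on the sample mean of 41.14. The default null hypothesis of a zero mean is rejected (p = 2.064e-08), though only the interval is of interest here. Because Catholic is bimodal, few municipalities lie near the mean, so this interval is of limited practical use.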

t.test(swiss$Catholic)
## 
##  One Sample t-test
## 
## data:  swiss$Catholic
## t = 6.7634, df = 46, p-value = 2.064e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  28.89883 53.38883
## sample estimates:
## mean of x 
##  41.14383

We are 95% confident that the population mean for Infant.Mortality is between 19.1 and 20.8, an interval centered on the sample mean of 19.94. The default null hypothesis of a zero mean is rejected (p < 2.2e-16), though only the interval is of interest here.

t.test(swiss$Infant.Mortality)
## 
##  One Sample t-test
## 
## data:  swiss$Infant.Mortality
## t = 46.939, df = 46, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.08735 20.79775
## sample estimates:
## mean of x 
##  19.94255
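The six separate t.test() calls can also be collected into a single table. This is a sketch of one way to do it with sapply(); the name ci_table and the column labels are chosen here for illustration.

```r
# One row per variable: lower bound, sample mean, upper bound of the 95% CI
ci_table <- t(sapply(swiss, function(x) {
  tt <- t.test(x, conf.level = 0.95)
  c(lower = tt$conf.int[1],
    mean  = unname(tt$estimate),
    upper = tt$conf.int[2])
}))
round(ci_table, 2)
```

Putting the intervals side by side makes it easier to compare their widths; Catholic has the widest interval and Infant.Mortality the narrowest, in line with their standard errors in the describe() output.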