1. Data Exploration: This should include summary statistics, means,
medians, quartiles, or any other relevant information about the data
set. Please include some conclusions in the R Markdown text.
Load in packages and dataset
require(ggplot2)
## Loading required package: ggplot2
require(tidyr)
## Loading required package: tidyr
require(plyr)
## Loading required package: plyr
csv <- read.csv("C://Users//natal//Documents//Masters//Cuny SPS MDS//R Bridge Course//R Bridge Week 3 Notebook//GSSvocab.csv")
head(csv)
##        X year gender nativeBorn ageGroup educGroup vocab age educ
## 1 1978.1 1978 female        yes    50-59    12 yrs    10  52   12
## 2 1978.2 1978 female        yes      60+   <12 yrs     6  74    9
## 3 1978.3 1978   male        yes    30-39   <12 yrs     4  35   10
## 4 1978.4 1978 female        yes    50-59    12 yrs     9  50   12
## 5 1978.5 1978 female        yes    40-49    12 yrs     6  41   12
## 6 1978.6 1978   male        yes    18-29    12 yrs     6  19   12
summary(csv)
##        X             year         gender           nativeBorn
##  Min.   :1978   Min.   :1978   Length:28867       Length:28867
##  1st Qu.:1988   1st Qu.:1988   Class :character   Class :character
##  Median :1996   Median :1996   Mode  :character   Mode  :character
##  Mean   :1998   Mean   :1997
##  3rd Qu.:2008   3rd Qu.:2008
##  Max.   :2017   Max.   :2016
##
##    ageGroup          educGroup             vocab             age
##  Length:28867       Length:28867       Min.   : 0.000   Min.   :18.00
##  Class :character   Class :character   1st Qu.: 5.000   1st Qu.:32.00
##  Mode  :character   Mode  :character   Median : 6.000   Median :44.00
##                                        Mean   : 5.998   Mean   :46.18
##                                        3rd Qu.: 7.000   3rd Qu.:59.00
##                                        Max.   :10.000   Max.   :89.00
##                                        NA's   :1348     NA's   :94
##       educ
##  Min.   : 0.00
##  1st Qu.:12.00
##  Median :12.00
##  Mean   :13.04
##  3rd Qu.:15.00
##  Max.   :20.00
##  NA's   :81
unique(csv$age)
## [1] 52 74 35 50 41 19 59 49 21 44 81 46 53 24 43 25 73 58 51 57 76 33 39 40 63
## [26] 22 20 31 26 30 27 71 72 66 48 38 56 23 18 55 45 28 42 70 29 32 82 79 36 77
## [51] 78 62 37 34 60 68 54 87 75 47 89 65 61 67 83 69 64 80 84 86 NA 88 85
The mean of educ is 13.04 years and the third quartile is 15, meaning
that a sizable share of individuals went on to post-secondary education.
The minimum age is 18, indicating that this data set consists only of
adults.
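The figures quoted above can also be pulled out one at a time rather than read off the `summary()` printout. A minimal sketch, assuming a small made-up data frame `toy` with the same `age` and `educ` column names as `csv` (the values are illustrative only and are not taken from GSSvocab):

```r
# Hypothetical toy data standing in for `csv`; values are illustrative only.
toy <- data.frame(
  age  = c(52, 74, 35, 50, 41, 19, NA),
  educ = c(12,  9, 10, 12, 12, 12, 16)
)

# The real columns contain NAs, so na.rm = TRUE is needed: without it
# mean() returns NA and quantile() stops with an error.
educ_mean      <- mean(toy$educ, na.rm = TRUE)
educ_quartiles <- quantile(toy$educ, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
age_min        <- min(toy$age, na.rm = TRUE)

# Share of rows with more than 12 years of schooling: a more direct
# check of the "post-secondary" reading than the mean alone.
share_post_secondary <- mean(toy$educ > 12, na.rm = TRUE)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
```

Swapping `toy` for `csv` would give the corresponding values for the full data set.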
3. Graphics: Please make sure to display at least one scatter plot,
box plot and histogram. Don’t be limited to this. Please explore the
many other options in R packages such as ggplot2.
Scatterplot of age and years of education
attach(csvNative)
plot(csvNative$age, csvNative$yearsOfEducation, main = "Age vs. Education", xlab = "Age", ylab = "Years of Education", pch = 16)

Box plot of age groups and years of education
boxPlot <- ggplot(csvNative, aes(x=ageGroup, y=yearsOfEducation)) +
geom_boxplot() + ggtitle("Years of Education by Age Group") +
xlab("Age Group") + ylab("Years")
boxPlot
## Warning: Removed 53 rows containing non-finite values (`stat_boxplot()`).

Histogram of vocabulary scores and years of education
hist(csvNative$vocabularyScore, main = "Vocabulary Scores", xlab = "Vocabulary Score")

hist(csvNative$yearsOfEducation, main = "Years of Education", xlab = "Years of Education")
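Because vocabulary scores are whole numbers from 0 to 10, the default bins chosen by `hist()` may group neighbouring scores together. A sketch, assuming a hypothetical vector `scores` in place of `csvNative$vocabularyScore` (made-up values), that centres one bin on each integer score:

```r
# Hypothetical scores standing in for csvNative$vocabularyScore.
scores <- c(10, 6, 4, 9, 6, 6, 5, 7, NA, 0)

# Breaks at -0.5, 0.5, ..., 10.5 give exactly one bin per integer score.
breaks <- seq(-0.5, 10.5, by = 1)

# hist() ignores NAs; plot = FALSE returns the bin counts without drawing.
h <- hist(scores, breaks = breaks, plot = FALSE)

# Draw the histogram with a title and axis label.
hist(scores, breaks = breaks,
     main = "Vocabulary Scores", xlab = "Score (0-10)", col = "grey80")
```

The counts in `h$counts` line up with scores 0 through 10 in order, which makes it easy to check the plot against a `table()` of the same column.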

4. Meaningful question for analysis: Please state at the beginning a
meaningful question for analysis. Use the first three steps and anything
else that would be helpful to answer the question you are posing from
the data set you chose. Please write a brief conclusion paragraph in R
markdown at the end.
Is there a correlation between years of education and age among
native-born respondents?
Typically, older generations tend to lack the extensive educational
backgrounds of newer generations due to financial constraints, lack of
opportunities in the past, gender roles, etc. I would like to see if
this trend applies to this demographic as well.
In the scatterplot, we can see a slight negative correlation between
age and years of education: as age increases, more people have fewer
years of education. This trend continues in the boxplot, where once the
age group passes 49, the box shifts lower, meaning the middle 50% of
years of education sits at lower values for those age groups. Another
trend that I found intriguing is that the boxes shift upwards and then
downwards: the middle-aged demographic of 40-49 had the highest average,
while the youngest demographic had the third lowest average and the
second lowest box.
We can conclude that there is some overall negative correlation between
age and years of education among native-born individuals. However,
within this there is a small positive relationship from the youngest
demographic up to the middle-aged group (40-49).
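The conclusion above rests on reading the plots by eye. A minimal sketch of how the relationship could be quantified, assuming a hypothetical data frame `toy` that reuses the column names of `csvNative` (`age`, `ageGroup`, `yearsOfEducation`); the values are made up for illustration, so the numbers it produces say nothing about the real data:

```r
# Hypothetical toy data; not taken from GSSvocab.
toy <- data.frame(
  age              = c(22, 27, 34, 38, 45, 47, 55, 58, 66, 72),
  ageGroup         = c("18-29", "18-29", "30-39", "30-39", "40-49",
                       "40-49", "50-59", "50-59", "60+", "60+"),
  yearsOfEducation = c(12, 13, 14, 14, 16, 15, 12, 12, 10, NA)
)

# Pearson correlation; use = "complete.obs" drops rows with an NA
# in either column.
r <- cor(toy$age, toy$yearsOfEducation, use = "complete.obs")

# Spearman's rank correlation only assumes a monotonic relationship.
# Neither coefficient captures a rise-then-fall pattern well, which is
# why the per-group medians are computed as well.
rho <- cor(toy$age, toy$yearsOfEducation,
           use = "complete.obs", method = "spearman")

# Median years of education per age group (the centre lines of a box plot).
# The formula interface drops rows with NAs by default.
group_medians <- aggregate(yearsOfEducation ~ ageGroup, data = toy, FUN = median)
```

A single coefficient summarises the overall direction, while `group_medians` shows whether the middle age groups sit above the youngest and oldest ones.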
5. BONUS: Place the original .csv in a github file and have R read
from the link. This will be a very useful skill as you progress in your
data science education and career.
native_github <- read.csv('https://raw.githubusercontent.com/nk014914/R-Bridge-Class-HW-CSV/main/GSSvocab.csv')
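`read.csv()` accepts a URL in the same way as a local path. A small sketch, assuming a hypothetical helper `read_csv_checked()` (not part of any package) that stops with a clear message when expected columns are missing, which can be useful when a remote file may change:

```r
# Hypothetical helper: read a CSV from a path or URL and verify its columns.
read_csv_checked <- function(source, required_cols) {
  df <- read.csv(source, stringsAsFactors = FALSE)
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  df
}

# Demonstration on a temporary local file so the sketch runs offline;
# a raw GitHub URL could be passed as `source` in the same way.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(52, 74), educ = c(12, 9)), tmp, row.names = FALSE)
demo <- read_csv_checked(tmp, required_cols = c("age", "educ"))
```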