1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

Load in packages and dataset

require(ggplot2)
## Loading required package: ggplot2
require(tidyr)
## Loading required package: tidyr
require(plyr)
## Loading required package: plyr
csv <- read.csv("C://Users//natal//Documents//Masters//Cuny SPS MDS//R Bridge Course//R Bridge Week 3 Notebook//GSSvocab.csv")

head(csv)
##        X year gender nativeBorn ageGroup educGroup vocab age educ
## 1 1978.1 1978 female        yes    50-59    12 yrs    10  52   12
## 2 1978.2 1978 female        yes      60+   <12 yrs     6  74    9
## 3 1978.3 1978   male        yes    30-39   <12 yrs     4  35   10
## 4 1978.4 1978 female        yes    50-59    12 yrs     9  50   12
## 5 1978.5 1978 female        yes    40-49    12 yrs     6  41   12
## 6 1978.6 1978   male        yes    18-29    12 yrs     6  19   12
summary(csv)
##        X             year         gender           nativeBorn       
##  Min.   :1978   Min.   :1978   Length:28867       Length:28867      
##  1st Qu.:1988   1st Qu.:1988   Class :character   Class :character  
##  Median :1996   Median :1996   Mode  :character   Mode  :character  
##  Mean   :1998   Mean   :1997                                        
##  3rd Qu.:2008   3rd Qu.:2008                                        
##  Max.   :2017   Max.   :2016                                        
##                                                                     
##    ageGroup          educGroup             vocab             age       
##  Length:28867       Length:28867       Min.   : 0.000   Min.   :18.00  
##  Class :character   Class :character   1st Qu.: 5.000   1st Qu.:32.00  
##  Mode  :character   Mode  :character   Median : 6.000   Median :44.00  
##                                        Mean   : 5.998   Mean   :46.18  
##                                        3rd Qu.: 7.000   3rd Qu.:59.00  
##                                        Max.   :10.000   Max.   :89.00  
##                                        NA's   :1348     NA's   :94     
##       educ      
##  Min.   : 0.00  
##  1st Qu.:12.00  
##  Median :12.00  
##  Mean   :13.04  
##  3rd Qu.:15.00  
##  Max.   :20.00  
##  NA's   :81
unique(csv$age)
##  [1] 52 74 35 50 41 19 59 49 21 44 81 46 53 24 43 25 73 58 51 57 76 33 39 40 63
## [26] 22 20 31 26 30 27 71 72 66 48 38 56 23 18 55 45 28 42 70 29 32 82 79 36 77
## [51] 78 62 37 34 60 68 54 87 75 47 89 65 61 67 83 69 64 80 84 86 NA 88 85

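The mean of educ is 13.04 years, just above the 12 years that typically corresponds to finishing high school in the U.S., and the third quartile is 15. This suggests that a substantial share of respondents went on to post-secondary education.

The minimum age is 18, indicating that this data set consists only of adults.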
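
The post-secondary reading above could be checked directly by computing the share of respondents with more than 12 years of education. The sketch below is a minimal illustration: `share_above` is a hypothetical helper that is not part of the original notebook, and the toy vector contains made-up values rather than GSS data.

```r
# Hypothetical helper (not in the original notebook): share of non-missing
# values in `years` that exceed `threshold`.
share_above <- function(years, threshold = 12) {
  mean(years > threshold, na.rm = TRUE)
}

# Toy vector for illustration only; these are not GSS values.
toy_educ <- c(9, 12, 12, 14, 16, NA)
toy_share <- share_above(toy_educ)
toy_share
```

On the real data the call would presumably be `share_above(csv$educ)`, which returns the proportion of respondents who report more than 12 years of schooling.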

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together.)

Renaming some columns for better readability

names(csv)[names(csv) == 'vocab'] <- 'vocabularyScore'
names(csv)[names(csv) == 'educ'] <- 'yearsOfEducation'

colnames(csv)
## [1] "X"                "year"             "gender"           "nativeBorn"      
## [5] "ageGroup"         "educGroup"        "vocabularyScore"  "age"             
## [9] "yearsOfEducation"

Create a subset with only native-born respondents and remove NAs in age group

df_csv <- data.frame(csv)
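# read.csv() stores missing age groups as real NA values, so test with is.na() rather than comparing to the string "NA"
csvNative <- subset(csv, nativeBorn == 'yes' & !is.na(ageGroup))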
unique(csvNative$ageGroup)
## [1] "50-59" "60+"   "30-39" "40-49" "18-29"

Age group categorized numerically

# Look up each row's code by its own ageGroup label; assigning the bare 5-element vector would only recycle 1-5 down the rows
ageGroupCodes <- c("18-29" = 1, "30-39" = 2, "40-49" = 3, "50-59" = 4, "60+" = 5)
csvNative$ageGroupNumeric <- unname(ageGroupCodes[csvNative$ageGroup])
head(csvNative)
##        X year gender nativeBorn ageGroup educGroup vocabularyScore age
## 1 1978.1 1978 female        yes    50-59    12 yrs              10  52
## 2 1978.2 1978 female        yes      60+   <12 yrs               6  74
## 3 1978.3 1978   male        yes    30-39   <12 yrs               4  35
## 4 1978.4 1978 female        yes    50-59    12 yrs               9  50
## 5 1978.5 1978 female        yes    40-49    12 yrs               6  41
## 6 1978.6 1978   male        yes    18-29    12 yrs               6  19
##   yearsOfEducation ageGroupNumeric
## 1               12               4
## 2                9               5
## 3               10               2
## 4               12               4
## 5               12               3
## 6               12               1
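
One way to make sure each row's numeric code follows its own ageGroup label is to go through a factor with explicitly ordered levels. The sketch below is a minimal illustration under that assumption; `age_levels`, `toy_groups`, and `toy_codes` are names introduced here, and the toy labels are the six age groups shown in the `head()` output.

```r
# Age-group labels in ascending order, as they appear in the data set.
age_levels <- c("18-29", "30-39", "40-49", "50-59", "60+")

# Toy labels for illustration: the age groups of the first six rows above.
toy_groups <- c("50-59", "60+", "30-39", "50-59", "40-49", "18-29")

# factor() fixes the level order; as.integer() returns each row's level index.
toy_codes <- as.integer(factor(toy_groups, levels = age_levels))
toy_codes
```

Applied to the subset, this would read `csvNative$ageGroupNumeric <- as.integer(factor(csvNative$ageGroup, levels = age_levels))`, so a row labelled "50-59" always receives the code 4 regardless of its position.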

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Scatterplot of age and years of education

attach(csvNative)
plot(csvNative$age, csvNative$yearsOfEducation, main = "Age vs. Education", xlab = "Age", ylab = "Years of Education", pch = 16)

Box plot of age groups and years of education

boxPlot <- ggplot(csvNative, aes(x=ageGroup, y=yearsOfEducation)) + 
  geom_boxplot() + ggtitle("Years of Education by Age Group") +
  xlab("Age Group") + ylab("Years")

boxPlot
## Warning: Removed 53 rows containing non-finite values (`stat_boxplot()`).

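Histograms of vocabulary scores and years of education

# Pass each column to hist() directly instead of storing it in an unused variable
hist(csvNative$vocabularyScore, main = "Vocabulary Scores", xlab = "Vocabulary Score")

hist(csvNative$yearsOfEducation, main = "Years of Education", xlab = "Years of Education")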

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Is there a correlation between age and years of education among native-born respondents?

Typically, older generations tend to have less formal education than newer generations because of financial constraints, fewer opportunities in the past, gender roles, etc. I would like to see if this trend applies to this demographic as well.

In the scatterplot, we can see a slight negative correlation between age and years of education: as age increases, more people have fewer years of education. This trend continues in the boxplot, where once the age group passes 49, the box shifts lower, meaning the middle 50% of years of education sits lower for those age groups. Another trend that I found intriguing is that the boxes shift upward and then downward: the middle-aged group (40-49) had the highest median, while the youngest group had the third-lowest median and the second-lowest box.

We can conclude that there is an overall negative relationship between age and years of education among native-born individuals. Within it, however, there is a small positive relationship from the youngest group up to the middle-aged group (40-49).
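
The visual impressions above could be backed with numbers: a Pearson correlation between age and years of education, plus the median years of education per age group. The sketch below is a minimal illustration; `summarise_age_educ` is a hypothetical helper that is not part of the original notebook, and the toy data frame holds made-up values, not GSS results.

```r
# Hypothetical helper: Pearson correlation of age vs. years of education,
# plus the median years of education within each age group.
summarise_age_educ <- function(df) {
  list(
    # use = "complete.obs" drops rows where either variable is missing
    correlation = cor(df$age, df$yearsOfEducation, use = "complete.obs"),
    medians = tapply(df$yearsOfEducation, df$ageGroup, median, na.rm = TRUE)
  )
}

# Toy data for illustration only; these are not GSS values.
toy <- data.frame(
  age = c(20, 25, 35, 45, 55, 70),
  ageGroup = c("18-29", "18-29", "30-39", "40-49", "50-59", "60+"),
  yearsOfEducation = c(12, 14, 16, 16, 12, NA)
)
toy_summary <- summarise_age_educ(toy)
toy_summary
```

On the real subset the call would presumably be `summarise_age_educ(csvNative)`. A correlation close to zero but negative would match the "slight negative" reading of the scatterplot, and the per-group medians would show whether the rise up to 40-49 and the later decline hold numerically.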

5. BONUS – Place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

native_github <- read.csv('https://raw.githubusercontent.com/nk014914/R-Bridge-Class-HW-CSV/main/GSSvocab.csv')