Sung Lee Final Project

Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

require(plyr)

## Loading required package: plyr

require(ggplot2)

## Loading required package: ggplot2

z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/GSSvocab.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)

gssData <- read.table(raw_csv, header = TRUE, sep = ",")

Conclusions: Some conclusions from the data: 1) the percentage of women who participated in the study was 56.76 percent and 2) the number of participants who identified as native born was 26,224 in comparison with no’s: 2,556 and NA’s: 87.

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

#Renaming the columns for the data set
gssData <- rename(gssData, c("year" = "Year", "gender" = "Gender", "educGroup" = "EducationGroup", "vocab" = "VocabularyScore", "age" = "Age", "educ" = "Education", "ageGroup" = "AgeGroup", "nativeBorn" = "NativeBorn"))

#Creating a subset for the data for participants between 21 and 50 years of age
subset_gssData <- subset(gssData, Age >= 21 & Age <= 50)
head(subset_gssData, 5)

##         X Year Gender NativeBorn AgeGroup EducationGroup VocabularyScore Age
## 3  1978.3 1978   male        yes    30-39        <12 yrs               4  35
## 4  1978.4 1978 female        yes    50-59         12 yrs               9  50
## 5  1978.5 1978 female        yes    40-49         12 yrs               6  41
## 9  1978.9 1978 female        yes    40-49         16 yrs               8  49
## 10 1978.1 1978   male        yes    18-29         12 yrs               3  21
##    Education
## 3         10
## 4         12
## 5         12
## 9         16
## 10        12

Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

# scatter plot
plot(gssData$Education,gssData$VocabularyScore, xlab='Years of Education',ylab='Vocabulary Score' ,main='Years of Education vs. Vocabulary Score', col='blue')

# box plot
boxplot(Education~VocabularyScore,data=gssData, main="Education and Vocabulary Score", xlab="Education", ylab="Vocabulary Score")

#histogram

ggplot(gssData, aes(x=VocabularyScore, color=NativeBorn)) + geom_histogram(fill="white", bins=10)

## Warning: Removed 1348 rows containing non-finite values (stat_bin).

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

My original question was do natives do better than non-natives in vocabulary for exams? From the analysis, those who identified as non-native faired slightly better on a vocabulary proficiency exam. I do not have access to the study’s vocabulary exam, but the analysis disproved my original presuppositions for my question.

5. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

The following code is used to download a zipped copy of the data file and read it.

z_file <- gzcon(url("https://raw.githubusercontent.com/logicalschema/r_winter_projects/master/GSSvocab.csv.gz"))
raw_csv <- textConnection(readLines(z_file))
close(z_file)

gssData <- read.table(raw_csv, header = TRUE, sep = ",")

Sung Lee Final Project

Sung Lee

1/17/2020