Exploratory Data Analysis of Test Takers, Focusing on Vocabulary

Wilson Chau Summer 2022

I wanted to analyze a data set I was given that focuses on vocabulary and education. According to the source, the data were recorded from 1972-2016; the earliest survey year in this file is 1974. I wanted to see whether vocabulary scores differ between male and female test takers, and which group had the stronger educational background.

source: https://vincentarelbundock.github.io/Rdatasets/doc/carData/Vocab.html

Questions to answer:

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

#install.packages("ggplot2")
require(ggplot2)

## Loading required package: ggplot2

#5. BONUS – place the original .csv in a github file and have R read from the link. 
voc_url <- "https://raw.githubusercontent.com/Wilchau/Rprogrammingweek3Project/main/Vocab.csv"

Voc <- read.csv(voc_url, header = TRUE, sep = ",")
Vocdf <- data.frame(Voc) # read.csv already returns a data frame; Vocdf is the working copy
# This dataset has 30,351 observations of 5 variables: X (row ID), year (survey year), sex, education (in years), vocabulary (score out of 10)
head(Vocdf)

##          X year    sex education vocabulary
## 1 19740001 1974   Male        14          9
## 2 19740002 1974   Male        16          9
## 3 19740003 1974 Female        10          9
## 4 19740004 1974 Female        10          5
## 5 19740005 1974 Female        12          8
## 6 19740006 1974   Male        16          8

summary(Vocdf)

##        X                 year          sex              education    
##  Min.   :19740001   Min.   :1974   Length:30351       Min.   : 0.00  
##  1st Qu.:19870112   1st Qu.:1987   Class :character   1st Qu.:12.00  
##  Median :19942104   Median :1994   Mode  :character   Median :12.00  
##  Mean   :19954597   Mean   :1995                      Mean   :13.03  
##  3rd Qu.:20063676   3rd Qu.:2006                      3rd Qu.:15.00  
##  Max.   :20162866   Max.   :2016                      Max.   :20.00  
##    vocabulary    
##  Min.   : 0.000  
##  1st Qu.: 5.000  
##  Median : 6.000  
##  Mean   : 6.004  
##  3rd Qu.: 7.000  
##  Max.   :10.000

#This answers Question 1 by reporting the means, medians, and quartiles.

I am renaming year and vocabulary to test_year and voc_score so the column names are more descriptive.

# rename year to test_year and vocabulary to voc_score
colnames(Vocdf)[colnames(Vocdf) == "year"] <- "test_year"
colnames(Vocdf)[colnames(Vocdf) == "vocabulary"] <- "voc_score"
colnames(Vocdf) # confirm the new column names

## [1] "X"         "test_year" "sex"       "education" "voc_score"

These three summaries give the mean, median, and quartiles for test_year, education, and voc_score.

summary(Vocdf$test_year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1974    1987    1994    1995    2006    2016

summary(Vocdf$education)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   12.00   13.03   15.00   20.00

summary(Vocdf$voc_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   6.000   6.004   7.000  10.000
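The overall summaries above can be complemented with per-group statistics. The following is a minimal sketch, assuming a small hypothetical data frame `toy` built from the six rows shown in the `head()` output; the numbers it produces describe only those six rows, not the full data set. On the real data, `Vocdf` would take the place of `toy`.

```r
# Hypothetical toy data: the six rows printed by head() earlier in this report
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = c("Male", "Male", "Female", "Female", "Female", "Male"),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Quartiles of a single column (default quantile type, as used by summary())
edu_quartiles <- quantile(toy$education, probs = c(0.25, 0.5, 0.75))

# Mean education and mean vocabulary score within each sex
group_means <- aggregate(cbind(education, voc_score) ~ sex, data = toy, FUN = mean)

print(edu_quartiles)
print(group_means)
```

Grouping by sex at this stage gives a first numeric look at the comparison the report is interested in, before any plots are drawn.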

#2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

This code makes a subset containing the numeric columns test_year, education, and voc_score, and converts sex from a character variable to a factor.

Voc_subset <- Vocdf[, c("test_year", "education", "voc_score", "sex")]
head(Voc_subset)

##   test_year education voc_score    sex
## 1      1974        14         9   Male
## 2      1974        16         9   Male
## 3      1974        10         9 Female
## 4      1974        10         5 Female
## 5      1974        12         8 Female
## 6      1974        16         8   Male

Voc_subset$sex <- as.factor(Voc_subset$sex) # convert sex from character to a factor
summary(Voc_subset)

##    test_year      education       voc_score          sex       
##  Min.   :1974   Min.   : 0.00   Min.   : 0.000   Female:17148  
##  1st Qu.:1987   1st Qu.:12.00   1st Qu.: 5.000   Male  :13203  
##  Median :1994   Median :12.00   Median : 6.000                 
##  Mean   :1995   Mean   :13.03   Mean   : 6.004                 
##  3rd Qu.:2006   3rd Qu.:15.00   3rd Qu.: 7.000                 
##  Max.   :2016   Max.   :20.00   Max.   :10.000
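The sex counts in the summary above can also be expressed as shares of all participants. This is a small sketch that takes the two counts directly from the `summary()` output; the vector name `sex_counts` is introduced here for illustration only.

```r
# Counts copied from the summary(Voc_subset) output above
sex_counts <- c(Female = 17148, Male = 13203)

# Convert the counts to proportions of the total
sex_share <- prop.table(sex_counts)

# Percentages rounded to one decimal place
sex_pct <- round(100 * sex_share, 1)
print(sex_pct)

# On the full data the same shares could be computed with:
# prop.table(table(Voc_subset$sex))
```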

Based on this summary, 17,148 females and 13,203 males participated in this exam. The median test year is 1994, and the median education level is 12 years.

#3 Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Based on the summary of the subset, education and vocabulary score may be related. Comparing test_year with voc_score could show whether some years had an easier or harder exam, which would result in lower or higher scores depending on difficulty.

#This histogram shows that the mean score is about 6, which is very close to the median of 6. That means at least half of the 30,351 participants scored 6 or lower on vocabulary. I also plotted the education level as a box plot, and it’s sitting around a

boxplot(Voc_subset$education, main = "education level", ylab = "education years")

hist(Voc_subset$voc_score, main = "Vocabulary score", xlab = "Vocabulary Score")

summary(Voc_subset$education)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   12.00   13.03   15.00   20.00

The box plots of sex versus education suggest that males had a somewhat higher education level than females. Males and females took the test in the same years, and they scored similarly.

plot(Voc_subset$education ~ Voc_subset$sex)

plot(Voc_subset$test_year ~ Voc_subset$sex)

plot(Voc_subset$voc_score ~ Voc_subset$sex)
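The plots above are box plots and a histogram; a scatter plot of education against vocabulary score would complete the set. Below is a hedged sketch using ggplot2, assuming a hypothetical data frame `toy` that holds only the six rows from the `head()` output. Jitter is added because both variables are whole numbers and points would otherwise overlap; the per-sex linear trend lines are an illustrative choice, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical toy data: the six rows printed by head() earlier in this report
toy <- data.frame(
  test_year = rep(1974, 6),
  sex       = factor(c("Male", "Male", "Female", "Female", "Female", "Male")),
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Jittered scatter plot with one least-squares line per sex
p <- ggplot(toy, aes(x = education, y = voc_score, colour = sex)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Vocabulary score versus education",
    x = "Education (years)",
    y = "Vocabulary score"
  )

print(p)
```

On the full data, passing `Voc_subset` instead of `toy` should give the same kind of plot, though with about 30,000 points a lower `alpha` would likely be needed.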

#4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

In the beginning I wanted to see whether education level has any correlation with a higher voc_score in this data set. In the box plot of sex versus education, there were many more outliers among female participants with less education than the first quartile (25th percentile). This may skew the data somewhat; the males have fewer outliers. Based on observation, males have a higher educational background. In the analysis of sex versus voc_score, both males and females score around a median of 6 at the same educational level. This can indicate that
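The opening question, whether education correlates with voc_score, can also be checked numerically rather than only by reading plots. The sketch below computes Pearson and Spearman correlations on a hypothetical data frame `toy` holding just the six rows from the `head()` output, so its values say nothing about the full data set; on the real data, `Voc_subset` would be used instead.

```r
# Hypothetical toy data: the six rows printed by head() earlier in this report
toy <- data.frame(
  education = c(14, 16, 10, 10, 12, 16),
  voc_score = c(9, 9, 9, 5, 8, 8)
)

# Pearson correlation measures linear association
r_pearson <- cor(toy$education, toy$voc_score)

# Spearman correlation uses ranks, which suits a 0-10 integer score
r_spearman <- cor(toy$education, toy$voc_score, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Spearman is included because the vocabulary score is a bounded whole-number scale, where a rank-based measure is often a reasonable companion to Pearson.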

2022-07-30