#reading the dataset file for further analysis
gssdata <- read.csv(file="gss2years.csv", header=TRUE, sep=",")
head(gssdata)
## X abany abdefect abhlth abrape absingle age aged
## 1 1 <NA> <NA> <NA> <NA> <NA> 31 A GOOD IDEA
## 2 2 <NA> <NA> <NA> <NA> <NA> 23 DEPENDS
## 3 3 <NA> <NA> <NA> <NA> <NA> 82 A GOOD IDEA
## 4 4 <NA> <NA> <NA> <NA> <NA> 40 DEPENDS
## 5 5 YES YES YES YES YES 46 A GOOD IDEA
## 6 6 NO YES YES YES NO 31 DEPENDS
## attend cappun childs coneduc conpress
## 1 ONCE A YR THRU SEV TIMES A YEAR <NA> 0 ONLY SOME ONLY SOME
## 2 LT ONCE A YEAR OPPOSE 0 ONLY SOME HARDLY ANY
## 3 ONCE A MNTH THRU ALMST WEEKLY OPPOSE 5 A GREAT DEAL A GREAT DEAL
## 4 LT ONCE A YEAR OPPOSE 2 ONLY SOME ONLY SOME
## 5 ONCE A YR THRU SEV TIMES A YEAR OPPOSE 1 ONLY SOME ONLY SOME
## 6 ONCE A MNTH THRU ALMST WEEKLY OPPOSE 3 ONLY SOME HARDLY ANY
## degree divlaw eqwlth fechld fefam
## 1 BACHELOR EASIER 3 AGREE <NA>
## 2 BACHELOR STAY SAME 2 AGREE STRONGLY DISAGREE
## 3 LT HIGH SCHOOL MORE DIFFICULT NO GOVT ACTION DISAGREE DISAGREE
## 4 LT HIGH SCHOOL EASIER 4 AGREE AGREE
## 5 JUNIOR COLLEGE <NA> 4 <NA> <NA>
## 6 HIGH SCHOOL <NA> 4 <NA> <NA>
## fepol finalter goodlife grass gunlaw happy
## 1 DISAGREE BETTER AGREE <NA> <NA> PRETTY HAPPY
## 2 DISAGREE STAYED SAME AGREE LEGAL <NA> NOT TOO HAPPY
## 3 DISAGREE WORSE STRONGLY AGREE NOT LEGAL <NA> NOT TOO HAPPY
## 4 <NA> WORSE NEITHER LEGAL <NA> PRETTY HAPPY
## 5 <NA> WORSE AGREE <NA> FAVOR PRETTY HAPPY
## 6 <NA> BETTER STRONGLY AGREE NOT LEGAL FAVOR VERY HAPPY
## health helpful homosex kidssol letdie1
## 1 <NA> LOOKOUT FOR SELF <NA> MUCH BETTER <NA>
## 2 <NA> LOOKOUT FOR SELF <NA> ABOUT THE SAME YES
## 3 <NA> LOOKOUT FOR SELF <NA> MUCH BETTER NO
## 4 <NA> LOOKOUT FOR SELF <NA> SOMEWHAT BETTER NO
## 5 GOOD DEPENDS NOT WRONG AT ALL ABOUT THE SAME <NA>
## 6 EXCELLENT HELPFUL <NA> MUCH BETTER <NA>
## marital natcrime nateduc partyid polviews owngun
## 1 NEVER MARRIED <NA> <NA> DEMOCRAT LIBERAL <NA>
## 2 NEVER MARRIED TOO LITTLE TOO LITTLE DEMOCRAT LIBERAL <NA>
## 3 WIDOWED <NA> <NA> REPUBLICAN LIBERAL <NA>
## 4 NEVER MARRIED TOO LITTLE TOO LITTLE DEMOCRAT LIBERAL <NA>
## 5 DIVORCED SEPARATED <NA> <NA> INDEPENDENT CONSERVATIVE NO
## 6 MARRIED TOO LITTLE TOO LITTLE DEMOCRAT LIBERAL NO
## premarsx race region relig satfin sei
## 1 NOT WRONG AT ALL OTHER MIDDLE ATLANTIC CATHOLIC MORE OR LESS 76.4
## 2 NOT WRONG AT ALL WHITE MIDDLE ATLANTIC NONE MORE OR LESS 85.1
## 3 SOMETIMES WRONG WHITE MIDDLE ATLANTIC CATHOLIC SATISFIED NA
## 4 SOMETIMES WRONG BLACK MIDDLE ATLANTIC NONE MORE OR LESS 32.3
## 5 <NA> BLACK MIDDLE ATLANTIC CATHOLIC MORE OR LESS 63.5
## 6 <NA> BLACK MIDDLE ATLANTIC CATHOLIC NOT AT ALL SAT NA
## sex sexfreq sibs socfrend
## 1 MALE ONCE A YR THRU ONCE A MNTH 2 SEV TIMES A MNTH
## 2 FEMALE WEEKLY 3 SEV TIMES A WEEK OR MORE
## 3 FEMALE <NA> 10 ONCE A YEAR OR LESS
## 4 MALE <NA> 11 SEV TIMES A MNTH
## 5 FEMALE <NA> 2 <NA>
## 6 FEMALE <NA> 1 <NA>
## socommun suicide1 tvhours year age3 SEI3
## 1 ONCE A YEAR OR LESS <NA> 1 2010 18 THRU 36 HIGH
## 2 ONCE A MONTH YES 0 2010 18 THRU 36 HIGH
## 3 ONCE A YEAR OR LESS NO 3 2010 54 THRU 89 <NA>
## 4 ONCE A MONTH NO 4 2010 37 THRU 53 LOW
## 5 <NA> <NA> NA 2010 37 THRU 53 HIGH
## 6 <NA> <NA> NA 2010 18 THRU 36 <NA>
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#using the is.na() function to remove rows where age is NA
gssnew <- subset(gssdata, !is.na(age))
#note: if age was read as a factor, as.numeric() plots its level codes rather than the ages
ggplot(gssnew, mapping = aes(x = as.numeric(age))) + geom_histogram(binwidth = 1)
From this graph, the majority of the sample appears to fall in the 10-30 year age bracket. A histogram groups the data into bins, which makes the shape of the distribution easy to inspect before further analysis.
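One thing worth checking before reading ages off the histogram is how the age column was stored. The sketch below uses made-up values and assumes, without confirmation from the dataset, that age could have been read as a factor (for example because of a text label such as "89 OR OLDER"). It shows that as.numeric() on a factor returns level codes, while converting through as.character() recovers the numbers.

```r
# Toy data (made up): ages stored as a factor, as read.csv() can produce
# when a column mixes numbers with a text label.
age_factor <- factor(c("31", "23", "82", "40", "89 OR OLDER"))

# as.numeric() on a factor returns the internal level codes, not the ages.
codes <- as.numeric(age_factor)

# Going through as.character() recovers the numbers; text labels become NA
# (the coercion warning is suppressed here on purpose).
ages <- suppressWarnings(as.numeric(as.character(age_factor)))

print(codes)
print(ages)
```

If the two vectors differ on the real data, the x-axis of the histogram shows level codes and not years.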
qplot(sex, data = gssnew)
pie(table(gssdata$marital))
pie(table(gssdata$race))
From these pie charts, we can conclude that a large portion of the sample is married and that most respondents are white. Silge and Robinson (2016) underlined that bar charts can be used to identify how three variables are connected.
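Slice sizes in a pie chart are hard to compare by eye, so it can help to print the underlying shares. This is a minimal sketch on made-up marital values (not the GSS data) using table() and prop.table().

```r
# Toy data (made up) standing in for gssdata$marital.
marital <- c("MARRIED", "NEVER MARRIED", "MARRIED", "DIVORCED",
             "MARRIED", "WIDOWED", "NEVER MARRIED", "MARRIED")

# Count each category, then convert the counts to percentages.
counts <- table(marital)
shares <- round(100 * prop.table(counts), 1)

# Largest share first.
print(sort(shares, decreasing = TRUE))
```

The same two calls applied to gssdata$marital and gssdata$race would give the percentages behind the two pie charts.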
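#degree is categorical, so as.numeric() plots its integer level codes (one bar per category)
hist(as.numeric(gssnew$degree), main = "Histogram for graduates", xlab = "", border = "red", breaks = 20, col = "blue")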
ggplot(data = gssnew, aes(relig, fill=sex))+geom_bar()
ggplot(data = gssnew, aes(marital, fill=region))+geom_bar()
ggplot(data = gssnew, aes(degree))+geom_bar()+facet_grid(race~.)
ggplot(data = gssnew, aes(satfin, fill=region))+geom_bar()+coord_flip()
The simple histogram shows a large number of respondents with an associate/junior college degree and a limited number with a bachelor’s degree in the sample.
From the bar charts, we can draw several insights from the GSS survey results about population demographics and social behaviors:
Protestants are the largest religious group, and the male-to-female ratio is lowest in the Catholic and Protestant communities.
The South Atlantic region has the highest proportion of married respondents, and the West South Central region has the highest proportion of unmarried respondents.
White respondents show higher educational attainment than Black respondents.
Respondents in the Mountain region are the most satisfied financially.
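Stacked bars show counts, so a region with more respondents can look "highest" even when its share is not. As a hedged sketch on made-up data (the column names mirror the dataset, the values do not), a two-way table with row proportions compares groups of different sizes on the same footing.

```r
# Toy data (made up) standing in for the region and satfin columns.
toy <- data.frame(
  region = c("MOUNTAIN", "MOUNTAIN", "PACIFIC", "PACIFIC", "PACIFIC", "MOUNTAIN"),
  satfin = c("SATISFIED", "SATISFIED", "MORE OR LESS", "SATISFIED",
             "NOT AT ALL SAT", "MORE OR LESS")
)

# Cross-tabulate, then divide each row by its total (margin = 1),
# so every region's shares sum to 1.
tab <- table(toy$region, toy$satfin)
row_share <- prop.table(tab, margin = 1)

print(round(row_share, 2))
```

In ggplot2, geom_bar(position = "fill") draws the same within-group shares directly.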
gssnew<- subset(gssnew, !is.na(gssnew$goodlife))
ggplot(data = gssnew, aes(satfin, fill=goodlife))+geom_bar()
ggplot(data = gssnew, aes(sex, fill=degree))+geom_bar(position = "dodge")+coord_flip()
ggplot(data = gssnew, aes(age3))+geom_density()
Respondents who are more or less satisfied financially tend to agree that they have a good life, and the reverse holds for those who are not satisfied. In the graphs above, female respondents show higher educational attainment than male respondents. The density plot shows how the frequency of a variable changes across its values. In this case, the density is low and falling for the 18-36 age group, unlike the curve for the 37-53 group.
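A kernel density estimate is defined for numeric variables, whereas age3 is a three-level grouping, so the curve above may be easier to interpret when drawn from numeric ages. The sketch below uses simulated ages (an assumption, not the GSS values) and base R's density() to show what the estimate contains.

```r
# Simulated ages (made up) between 18 and 89.
set.seed(1)
ages <- round(runif(200, min = 18, max = 89))

# density() returns a grid of x values and the estimated density y at each.
d <- density(ages)

# The area under the curve is approximately 1 (simple Riemann sum).
area <- sum(d$y) * diff(d$x[1:2])

# Location of the highest point of the estimated curve.
peak_age <- d$x[which.max(d$y)]

print(area)
print(peak_age)
```

With ggplot2, the equivalent plot would map a numeric age column to x before adding geom_density().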
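#compare the child count itself with 3; is.numeric(childs) < 3 is a single TRUE, which labels every row "good"
gssnew <- mutate(gssnew, famchildpolicy = ifelse(as.numeric(as.character(childs)) < 3, "good", "bad"))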
## Warning: package 'bindrcpp' was built under R version 3.4.4
Horton and Kleinman (2015) highlighted that the mutate() function from the dplyr package can be combined with ifelse() to create new variables. I have created a new variable called “famchildpolicy” that takes the value “good” if the number of children in the family is less than 3, and “bad” otherwise.
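To illustrate how the condition inside ifelse() is evaluated, here is a minimal base R sketch on made-up child counts. It contrasts a row-by-row comparison with a test on the column's type, which yields a single TRUE and would therefore label every row the same way.

```r
# Toy data (made up): number of children per respondent.
childs <- c(0, 5, 2, 1, 3)

# is.numeric() tests the type of the whole vector and returns one value;
# TRUE is treated as 1, so TRUE < 3 is TRUE regardless of the data.
always_true <- is.numeric(childs) < 3

# Comparing the values themselves gives one TRUE/FALSE per row,
# and ifelse() picks a label for each.
famchildpolicy <- ifelse(childs < 3, "good", "bad")

print(always_true)
print(famchildpolicy)
```

The same `ifelse(childs < 3, "good", "bad")` expression can be placed inside mutate() to add the column to a data frame.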
References:
Horton, N. J., & Kleinman, K. (2015). Using R and RStudio for data management, statistical analysis, and graphics. Chapman and Hall/CRC.
Silge, J., & Robinson, D. (2016). tidytext: Text mining and analysis using tidy data principles in R. The Journal of Open Source Software, 1(3), 37.