Contents:
Part 1:
Using Acrobat Reader, open the file hsb.pdf and decide which of its columns are nominal, ordinal, or numerical (integer or decimal). Then open the file hsb.sav, which is in SPSS format. You can use the library rio (function import) or foreign (function read.spss) to open it. Make sure to find out which arguments each function needs.
library(rio)
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.0.1
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5")
hsb.dat <- import("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5/hsb.sav")
str(hsb.dat)
## 'data.frame': 600 obs. of 15 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ SEX : num 2 1 2 2 2 1 1 2 1 2 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ RACE : num 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ SES : num 1 1 1 2 2 2 1 1 2 1 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ SCTYP : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ HSP : num 3 2 2 3 3 2 1 1 1 1 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ LOCUS : num 0.29 -0.42 0.71 0.06 0.22 0.46 0.44 0.68 0.06 0.05 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ CONCPT: num 0.88 0.03 0.03 0.03 -0.28 0.03 -0.47 0.25 0.56 0.15 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ MOT : num 0.67 0.33 0.67 0 0 0 0.33 1 0.33 1 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ CAR : num 10 2 9 15 1 11 10 9 9 11 ...
## ..- attr(*, "format.spss")= chr "F5.0"
## $ RDG : num 33.6 46.9 41.6 38.9 36.3 49.5 62.7 44.2 46.9 44.2 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ WRTG : num 43.7 35.9 59.3 41.1 48.9 46.3 64.5 51.5 41.1 49.5 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ MATH : num 40.2 41.9 41.9 32.7 39.5 46.2 48 36.9 45.3 40.5 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ SCI : num 39 36.3 44.4 41.7 41.7 41.7 63.4 49.8 47.1 39 ...
## ..- attr(*, "format.spss")= chr "F5.2"
## $ CIV : num 40.6 45.6 45.6 40.6 45.6 35.6 55.6 55.6 55.6 50.6 ...
## ..- attr(*, "format.spss")= chr "F5.2"
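The prompt also names foreign::read.spss as an alternative to rio::import. A minimal sketch of that route follows. The helper name read_hsb is hypothetical, and the relative path assumes hsb.sav sits in the working directory. Setting to.data.frame = TRUE returns a data frame instead of the default list, and use.value.labels = FALSE keeps coded columns numeric, which should be roughly comparable to the import() result above.

```r
# Hypothetical helper: read an SPSS file with the foreign package.
read_hsb <- function(path = "hsb.sav") {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  foreign::read.spss(
    path,
    to.data.frame = TRUE,     # the default returns a list, not a data frame
    use.value.labels = FALSE  # keep coded columns (SEX, SES, ...) numeric
  )
}

hsb.alt <- read_hsb("hsb.sav")
if (!is.null(hsb.alt)) str(hsb.alt)
```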
Part 2:
Make two barplots for one nominal variable. The first one should have a title and a source, but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.
hsb.dat$SEX = factor(
  hsb.dat$SEX, levels = 1:2,
  labels = c("Male", "Female")
)
frTable = as.data.frame(table(hsb.dat$SEX))
names(frTable) = c('SEX', 'Count')
baseNom = ggplot(data = frTable,
                 aes(x = SEX, y = Count))
barNom = baseNom + geom_bar(stat = 'identity')
barNom
titletext = "Gender Breakdown of Student Sample"
sourcetext= 'Source: National Center for Education Statistics, 1980'
barNom1 = barNom + labs(title = titletext , caption = sourcetext)
barNom1
# fill is set outside aes() so the bars are drawn in orange without a spurious legend
barNom2 = barNom1 + geom_bar(stat = "identity", fill = "orange")
barNom2
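The same kind of frequency table can be shown as shares of the sample instead of raw counts. The sketch below is only an illustration: toy_sex, toy_tab and toy_bar are hypothetical names, and the five values are made up rather than taken from hsb.sav.

```r
library(ggplot2)

# Made-up example data, not taken from hsb.sav
toy_sex <- factor(c("Male", "Female", "Female", "Male", "Female"),
                  levels = c("Male", "Female"))

# Frequency table converted to percentages
toy_tab <- as.data.frame(round(prop.table(table(toy_sex)) * 100, 1))
names(toy_tab) <- c("SEX", "Percent")

toy_bar <- ggplot(toy_tab, aes(x = SEX, y = Percent)) +
  geom_col(fill = "orange") +                   # constant colour, set outside aes()
  geom_text(aes(label = paste0(Percent, "%")),  # value label above each bar
            vjust = -0.5) +
  labs(title = "Share of each group (toy data)")
toy_bar
```

geom_col() is shorthand for geom_bar(stat = "identity"), so it fits a table that already holds one row per category.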
Part 3:
Make two boxplots for one ordinal variable. The first one should have a title and a source, but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.
titletext2 = 'Socioeconomic Status of Students'
sourcetext = 'Source: National Center for Education Statistics, 1980'
baseOrd= ggplot(hsb.dat, aes(y=as.numeric(SES)))
baseOrdBox= baseOrd + geom_boxplot()+labs(title = titletext2, caption=sourcetext)
baseOrdBox
hsb.dat$SES <- ordered(hsb.dat$SES,
                       levels = c(1, 2, 3),
                       labels = c("Low", "Medium", "High")
)
ordLabels=levels(hsb.dat$SES)
Plot1 = baseOrdBox + scale_y_continuous(labels=ordLabels,breaks=1:3) + geom_boxplot(color = 'red')
Plot1
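A boxplot of an ordinal variable is built from its median and quartiles, which are computed on the level positions (1, 2, 3) rather than on the labels. A small sketch of that idea, using made-up values (toy_ses, pos, med_pos and med_label are hypothetical names, not part of the hsb data):

```r
# Made-up ordinal data using the same three labels as SES
toy_ses <- ordered(c(1, 2, 2, 3, 2, 1, 3, 2),
                   levels = 1:3,
                   labels = c("Low", "Medium", "High"))

# as.integer() returns the level positions (1, 2, 3)
pos <- as.integer(toy_ses)

# Median position, mapped back to its label
med_pos <- median(pos)
med_label <- levels(toy_ses)[med_pos]
med_label
```

This is why the plots above pass as.numeric(SES) to ggplot and then relabel the axis breaks with the level names.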
Part 4:
Make two histograms for one numerical variable. The first one should have a title and a source, but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.
baseInt2= ggplot(hsb.dat,aes(x = WRTG))
baseIntHist= baseInt2 + geom_histogram()
titleText='Distribution of Student Writing Scores'
sourceText='Source: National Center for Education Statistics, 1980'
xaxisText='Writing T Score'
yaxisText='Frequency'
baseDecHist2= baseIntHist + labs(title = titleText,
                                 x = xaxisText,
                                 y = yaxisText,
                                 caption = sourceText)
baseDecHist2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
baseInt3= ggplot(hsb.dat,aes(x = WRTG))
baseIntHist3= baseInt3 + geom_histogram(binwidth = 5, color='black',fill='blue')
titleText='Distribution of Student Writing Scores'
sourceText='Source: National Center for Education Statistics, 1980'
xaxisText='Writing T Score'
yaxisText='Frequency'
baseDecHist3= baseIntHist3 + labs(title = titleText,
                                  x = xaxisText,
                                  y = yaxisText,
                                  caption = sourceText)
baseDecHist3
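The stat_bin() message asks for an explicit bin width. One common rule of thumb is the Freedman-Diaconis rule, width = 2 * IQR / n^(1/3). This is not necessarily how the value 5 was chosen above. The sketch below applies it to simulated scores; toy_scores and fd_width are hypothetical names, and the data are random draws, not the WRTG column.

```r
set.seed(1)

# Simulated scores, not the WRTG column
toy_scores <- rnorm(600, mean = 50, sd = 10)

# Freedman-Diaconis rule: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(toy_scores) / length(toy_scores)^(1/3)
fd_width

# The result could then be passed on, e.g. geom_histogram(binwidth = fd_width)
```

Rounding the result to a convenient number usually makes the axis easier to read.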
Part 5:
Make an alternative plot for the nominal and the numerical variable. Customize it with the elements of your choice.
pie(table(hsb.dat$SEX))
ggplot(data = hsb.dat) +
  geom_point(mapping = aes(x = WRTG, y = SCI), position = "jitter")
Part 6 (For final project):
sleep.dat <- import("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5/sleep data.csv")
str(sleep.dat)
## 'data.frame': 21 obs. of 2 variables:
## $ Date : chr "4/15/2019" "4/16/2019" "4/17/2019" "4/18/2019" ...
## $ Hours Slept: num 6 8.5 7.5 8 8.5 8 6 7 8 7 ...
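# Date is imported as text, so parse it to get a true time axis in calendar order
ggplot(data=sleep.dat, aes(x=as.Date(Date, format = "%m/%d/%Y"), y=`Hours Slept` , group =1)) +
  geom_line()+
  geom_point()+
  labs(x = "Date")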
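The str() output shows Date stored as chr in month/day/year form. Text like this sorts character by character, so "4/9/2019" would land after "4/15/2019". A sketch of parsing such text into Date objects, on made-up rows (toy_sleep is a hypothetical data frame, not the sleep data):

```r
# Made-up rows in the same month/day/year text format as sleep.dat$Date
toy_sleep <- data.frame(
  Date = c("4/9/2019", "4/15/2019", "5/1/2019"),
  Hours = c(7, 6, 8)
)

# Parse the text into Date objects
toy_sleep$Date <- as.Date(toy_sleep$Date, format = "%m/%d/%Y")

class(toy_sleep$Date)
range(toy_sleep$Date)
```

With a Date column, ggplot2 uses a continuous date scale, which can be adjusted with scale_x_date().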