Session 5 LAB Assignment

Part 1:

Using Acrobat Reader, open the file hsb.pdf and decide which of its columns are nominal, ordinal, or numerical (integer or decimal). Then open the file hsb.sav, which is in SPSS format. You can use the library rio (function import) or foreign (function read.spss) to open it. Make sure to find out which arguments each function needs.

library(rio)
library(tidyverse)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

setwd("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5")

hsb.dat <- import("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5/hsb.sav")

str(hsb.dat)

## 'data.frame':    600 obs. of  15 variables:
##  $ ID    : num  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ SEX   : num  2 1 2 2 2 1 1 2 1 2 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ RACE  : num  2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ SES   : num  1 1 1 2 2 2 1 1 2 1 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ SCTYP : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ HSP   : num  3 2 2 3 3 2 1 1 1 1 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ LOCUS : num  0.29 -0.42 0.71 0.06 0.22 0.46 0.44 0.68 0.06 0.05 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ CONCPT: num  0.88 0.03 0.03 0.03 -0.28 0.03 -0.47 0.25 0.56 0.15 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ MOT   : num  0.67 0.33 0.67 0 0 0 0.33 1 0.33 1 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ CAR   : num  10 2 9 15 1 11 10 9 9 11 ...
##   ..- attr(*, "format.spss")= chr "F5.0"
##  $ RDG   : num  33.6 46.9 41.6 38.9 36.3 49.5 62.7 44.2 46.9 44.2 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ WRTG  : num  43.7 35.9 59.3 41.1 48.9 46.3 64.5 51.5 41.1 49.5 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ MATH  : num  40.2 41.9 41.9 32.7 39.5 46.2 48 36.9 45.3 40.5 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ SCI   : num  39 36.3 44.4 41.7 41.7 41.7 63.4 49.8 47.1 39 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
##  $ CIV   : num  40.6 45.6 45.6 40.6 45.6 35.6 55.6 55.6 55.6 50.6 ...
##   ..- attr(*, "format.spss")= chr "F5.2"
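
The prompt also mentions foreign::read.spss as an alternative to rio::import. A minimal sketch of that route follows. The wrapper name read_hsb and the relative path "hsb.sav" are assumptions made for illustration, and the call is guarded so that nothing happens if the file is not found.

```r
library(foreign)  # ships with standard R installations

# Hypothetical helper (not part of the original lab): read an SPSS file
# into a data frame, keeping the numeric codes rather than value labels.
read_hsb <- function(path) {
  read.spss(path,
            to.data.frame = TRUE,      # return a data.frame instead of a list
            use.value.labels = FALSE)  # keep codes such as 1/2 as numbers
}

# Assumed location of the file; adjust to your own working directory.
hsb_path <- "hsb.sav"
if (file.exists(hsb_path)) {
  hsb.alt <- read_hsb(hsb_path)
  str(hsb.alt)
}
```

Keeping the numeric codes mirrors what the str() output above shows, so the factor() and ordered() conversions in the later parts would apply unchanged.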

Part 2:

Make two barplots for one nominal variable. The first one should have a title and source but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.

hsb.dat$SEX = factor(
  hsb.dat$SEX , levels = 1:2, 
  labels = c("Male" , "Female")
)


frTable=as.data.frame(table(hsb.dat$SEX))
names(frTable)=c('SEX','Count')

baseNom= ggplot(data = frTable, 
             aes(x=SEX, y=Count)) 
barNom=baseNom + geom_bar(stat = 'identity')
barNom

titletext = "Gender Breakdown of Student Sample"
sourcetext= 'Source: National Center for Education Statistics, 1980'

barNom1 = barNom + labs(title = titletext , caption = sourcetext)
barNom1

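# Set fill outside aes() so the bars are drawn in orange, and build from
# baseNom so that only one bar layer is plotted
barNom2 = baseNom + geom_bar(stat = "identity", fill = "orange") +
  labs(title = titletext , caption = sourcetext)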
barNom2
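
A note on setting versus mapping a colour in ggplot2: a constant placed inside aes() is treated as a data value, not as a colour name. The sketch below uses made-up counts (not the hsb values), and the object names toy, setFill and mapFill are illustrative only.

```r
library(ggplot2)

# Toy counts for illustration only; these are not the hsb values.
toy <- data.frame(SEX = c("Male", "Female"), Count = c(10, 12))

base <- ggplot(toy, aes(x = SEX, y = Count))

# Constant set outside aes(): bars are drawn in orange, no legend.
setFill <- base + geom_bar(stat = "identity", fill = "orange")

# Constant inside aes(): "orange" is treated as a data value, so ggplot
# picks a colour from its default scale and adds a legend.
mapFill <- base + geom_bar(stat = "identity", aes(fill = "orange"))
```

Printing setFill and mapFill side by side shows the difference.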

Part 3:

Make two boxplots for one ordinal variable. The first one should have a title and source but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.

titletext2 = 'Socioeconomic status of Students'
sourcetext = 'Source: National Center for Education Statistics, 1980'
baseOrd= ggplot(hsb.dat, aes(y=as.numeric(SES)))
baseOrdBox= baseOrd + geom_boxplot()+labs(title = titletext2, caption=sourcetext)
baseOrdBox

hsb.dat$SES <- ordered(hsb.dat$SES,
                   levels=c(1,2,3),
                   labels=c("Low","Medium","High")
                   )


ordLabels=levels(hsb.dat$SES)


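# Build from baseOrd so that only one (red) boxplot layer is drawn
Plot1 = baseOrd + geom_boxplot(color = 'red') +
  scale_y_continuous(labels=ordLabels, breaks=1:3) +
  labs(title = titletext2, caption = sourcetext)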
Plot1
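
The boxplot works on the integer codes behind the ordered factor, which is why as.numeric(SES) still plots correctly after the conversion and why the axis breaks 1:3 can be relabelled with the level names. A small sketch with made-up values (not the hsb data) shows the round trip from labels to codes and back:

```r
# Toy values for illustration; not the hsb data.
ses <- ordered(c(1, 1, 2, 2, 2, 3, 3),
               levels = 1:3,
               labels = c("Low", "Medium", "High"))

codes <- as.numeric(ses)      # integer codes behind the ordered levels
medCode <- median(codes)      # the middle line a boxplot would draw
levels(ses)[medCode]          # translate the code back to its label
```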

Part 4:

Make two histograms for one numerical variable. The first one should have a title and source but no other changes to the ggplot defaults. The second plot should include further customizations of your choice.

baseInt2= ggplot(hsb.dat,aes(x = WRTG))  
baseIntHist= baseInt2 + geom_histogram()


titleText='Distribution of Student Writing Scores'
sourceText='Source: National Center for Education Statistics, 1980'
xaxisText='Writing T Score'
yaxisText='Frequency'

baseDecHist2= baseIntHist + labs(title=titleText,
                               x = xaxisText, 
                               y = yaxisText,
                               caption = sourceText)
baseDecHist2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

baseInt3= ggplot(hsb.dat,aes(x = WRTG))  
baseIntHist3= baseInt3 + geom_histogram(binwidth = 5, color='black',fill='blue')


titleText='Distribution of Student Writing Scores'
sourceText='Source: National Center for Education Statistics, 1980'
xaxisText='Writing T Score'
yaxisText='Frequency'

baseDecHist3= baseIntHist3 + labs(title=titleText,
                               x = xaxisText, 
                               y = yaxisText,
                               caption = sourceText)
baseDecHist3
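
To see what a bin width of 5 means in terms of counts, the bins can be tabulated directly in base R. This is a sketch on simulated scores, not the hsb WRTG column. ggplot2 chooses its own bin alignment, so its bars need not match these counts exactly.

```r
set.seed(1)
# Simulated scores for illustration only; not the hsb WRTG column.
scores <- rnorm(200, mean = 50, sd = 10)

binwidth <- 5
# Breaks on multiples of the bin width that cover the whole range.
breaks <- seq(floor(min(scores) / binwidth) * binwidth,
              ceiling(max(scores) / binwidth) * binwidth,
              by = binwidth)

# Count observations per bin without drawing a plot.
bins <- hist(scores, breaks = breaks, plot = FALSE)
data.frame(lower = head(bins$breaks, -1),
           upper = tail(bins$breaks, -1),
           count = bins$counts)
```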

Part 5:

Make an alternative plot for the nominal and the numerical variable. Customize it with the elements of your choice.

pie(table(hsb.dat$SEX))

ggplot(data = hsb.dat) +
  geom_point(mapping = aes(x=WRTG, y = SCI) , position = "jitter")

Part 6 (For final project):

Download the data you have collected from one source.
Input it to R.
Summarize what you see. Select one variable and create one plot based on what you have.

sleep.dat <- import("C:/Users/leoto/OneDrive/Documents/598/Week 5/session5/sleep data.csv")

str(sleep.dat)

## 'data.frame':    21 obs. of  2 variables:
##  $ Date       : chr  "4/15/2019" "4/16/2019" "4/17/2019" "4/18/2019" ...
##  $ Hours Slept: num  6 8.5 7.5 8 8.5 8 6 7 8 7 ...

ggplot(data=sleep.dat, aes(x=Date, y=`Hours Slept` , group =1)) +
  geom_line()+
  geom_point()
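
Because Date is stored as character, ggplot orders the x axis alphabetically, not chronologically. For the dates shown above the two orders happen to agree, but that is not guaranteed in general. A minimal sketch with made-up dates in the same month/day/year format shows the difference and one way to parse them:

```r
# Toy dates for illustration; the format matches the Date column shown above.
d <- c("4/9/2019", "4/10/2019", "4/15/2019")

sort(d)   # character sort: "4/10/2019" comes before "4/9/2019"

parsed <- as.Date(d, format = "%m/%d/%Y")
sort(parsed)  # chronological order
```

Applying the same as.Date() call to sleep.dat$Date before plotting would give a continuous date axis.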

Spencer Leo
