For this project we made a .csv file from the image posted in Week 5 by Donghwan Kim and saved it to GitHub; the raw file URL appears in the code below. The .csv file was then imported into R and saved as a data frame called RankHousing.
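library(tidyverse) # dplyr, tidyr, ggplot2: glimpse(), fill(), pivot_longer(), ggplot()
library(naniar) # replace_with_na_all()
fileURL <- "https://raw.githubusercontent.com/douglasbarley/DATA607/master/ClassRankAndHousing.csv"
RankHousing <- read.csv(fileURL, header = TRUE)
# assign clean column names; "Remove_column" marks a column that is dropped later
names(RankHousing) <- c("State_of_residence","Remove_column","Class_rank","Off_campus","On_campus","Total")
glimpse(RankHousing)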
## Rows: 10
## Columns: 6
## $ State_of_residence <chr> "State of residence", "In state", "", "", "Out o...
## $ Remove_column <chr> "", "Class Rank", "", "Total", "Class Rank", "",...
## $ Class_rank <chr> "", "Underclassman", "Upperclassman", "", "Under...
## $ Off_campus <chr> "Off-campus", "58", "108", "166", "13", "39", "5...
## $ On_campus <chr> "On-campus", "110", "7", "177", "30", "2", "32",...
## $ Total <chr> "Total", "168", "115", "283", "43", "41", "84", ...
We wanted to remove unnecessary rows and columns from the table, forward-fill the missing state of residence values from the rows above them, and pivot the table longer so that housing choice becomes a single column with two possible values (Off_campus, On_campus). The counts are then stacked in a single column named Num.
RankHousing <- RankHousing[-c(1,4,7,8,9,10),] %>% # remove unnecessary rows
  select(!c(Remove_column)) %>% # remove unnecessary column
  replace_with_na_all(condition = ~.x == "") %>% # replace empty strings with NA values
  fill(State_of_residence) %>% # forward fill values into NA values
  mutate(Off_campus = as.integer(Off_campus),
         On_campus = as.integer(On_campus),
         Total = as.integer(Total)) %>% # convert counts from character to integer
  pivot_longer(Off_campus:On_campus, names_to = "Housing_pref", values_to = "Num", values_drop_na = TRUE) # one row per housing choice
RankHousing$State_of_residence <- gsub(" ", "_", RankHousing$State_of_residence) # replace every space in the values with an underscore
RankHousing <- RankHousing %>%
  select(State_of_residence, Class_rank, Housing_pref, Num, Total) %>% # put columns in reading order
  mutate(Pct = Num / Total) # share of each residency/class-rank group choosing this housing type
RankHousing
## # A tibble: 8 x 6
## State_of_residence Class_rank Housing_pref Num Total Pct
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 In_state Underclassman Off_campus 58 168 0.345
## 2 In_state Underclassman On_campus 110 168 0.655
## 3 In_state Upperclassman Off_campus 108 115 0.939
## 4 In_state Upperclassman On_campus 7 115 0.0609
## 5 Out_of_state Underclassman Off_campus 13 43 0.302
## 6 Out_of_state Underclassman On_campus 30 43 0.698
## 7 Out_of_state Upperclassman Off_campus 39 41 0.951
## 8 Out_of_state Upperclassman On_campus 2 41 0.0488
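As a quick sanity check on the tidy table, the two housing counts within each residency and class-rank group should add up to that group's Total, and the two Pct values should sum to 1. The following is a minimal sketch, assuming dplyr is installed. The data frame `tidy` is retyped by hand from the printed output above so that the block runs on its own, and `tidy` and `check` are illustrative names, not objects from the original analysis.

```r
library(dplyr)

# Standalone copy of the printed tibble above (retyped by hand, not the pipeline output)
tidy <- data.frame(
  State_of_residence = rep(c("In_state", "Out_of_state"), each = 4),
  Class_rank = rep(rep(c("Underclassman", "Upperclassman"), each = 2), times = 2),
  Housing_pref = rep(c("Off_campus", "On_campus"), times = 4),
  Num = c(58L, 110L, 108L, 7L, 13L, 30L, 39L, 2L),
  Total = c(168L, 168L, 115L, 115L, 43L, 43L, 41L, 41L)
) %>%
  mutate(Pct = Num / Total)

# One row per residency/class-rank group with the summed counts and shares
check <- tidy %>%
  group_by(State_of_residence, Class_rank) %>%
  summarise(Num_sum = sum(Num),
            Total = first(Total),
            Pct_sum = sum(Pct),
            .groups = "drop")

# Stop with an error if any group fails to reconcile
stopifnot(all(check$Num_sum == check$Total))
stopifnot(all(abs(check$Pct_sum - 1) < 1e-12))
check
```

If either `stopifnot()` call fails, a row was probably dropped or misaligned during the row removal or the pivot.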
We want to analyze the change in a student’s housing choice by residency status as class rank increases. So we should visualize the changes from underclassman housing preferences to upperclassman housing preferences.
ggplot(RankHousing) + geom_col(aes(x= State_of_residence, y = Num, fill = Housing_pref)) + facet_wrap(~Class_rank)
In terms of raw numbers, more in-state than out-of-state students live off campus at both class ranks (58 vs. 13 underclassmen, 108 vs. 39 upperclassmen), but this largely reflects the larger in-state group, so the Pct column gives the fairer comparison. The data table shows that 34.5% of in-state underclassmen would prefer to live off campus compared to 30.2% of out-of-state underclassmen, a slightly greater in-state preference. At the upperclassman level the preference is roughly equal: 93.9% of in-state upperclassmen and 95.1% of out-of-state upperclassmen would prefer to live off campus.
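To put a number on the change as class rank increases, the off-campus share can be spread into one column per class rank and differenced. This is a minimal sketch, assuming dplyr and tidyr are installed. As before, `tidy` is a hand-typed copy of the printed table, and `change` and `Change_pts` are illustrative names that do not appear in the original analysis.

```r
library(dplyr)
library(tidyr)

# Standalone copy of the tidy table printed earlier (retyped by hand)
tidy <- data.frame(
  State_of_residence = rep(c("In_state", "Out_of_state"), each = 4),
  Class_rank = rep(rep(c("Underclassman", "Upperclassman"), each = 2), times = 2),
  Housing_pref = rep(c("Off_campus", "On_campus"), times = 4),
  Num = c(58L, 110L, 108L, 7L, 13L, 30L, 39L, 2L),
  Total = c(168L, 168L, 115L, 115L, 43L, 43L, 41L, 41L)
) %>%
  mutate(Pct = Num / Total)

change <- tidy %>%
  filter(Housing_pref == "Off_campus") %>%                     # keep the off-campus share only
  select(State_of_residence, Class_rank, Pct) %>%
  pivot_wider(names_from = Class_rank, values_from = Pct) %>%  # one column per class rank
  mutate(Change_pts = 100 * (Upperclassman - Underclassman))   # change in percentage points
change
```

On these values the off-campus share rises by roughly 59 percentage points for in-state students and roughly 65 for out-of-state students, so both groups shift strongly toward off-campus housing and the out-of-state shift is somewhat larger.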