I decided to use the following three datasets - The space distribution dataset for commercial real estate that I had provided to the team, the Student Performance data provided by Jered Ataky and the school diversity data provided by Zhouxin Shi.
This is a typical example of additions and demolitions to stock in commercial real estate that is used to assess fundamentals, too much competitive supply will depress pricing and vice versa. Demolitions can also give a sense of the highest and best use for a parcel of land, too many demolitions and very few additions for a given property-type means that the highest and best use is perhaps an office tower instead of an older warehouse or an apartment building instead of a surface parking lot etc. Furthermore, the information can be aggregated up to determine what share of total stock for a given property-type is functionally obsolete. For example, modern warehouses (Built 2010 and later) have higher ceiling heights and more dock doors for example compared to product built before 1970 with a low dock-door ratio. Such aggregations provide a useful benchmark for investment management and development purposes.
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(RCurl)
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Space%20distribution.csv")
space <- read_csv(url)
## Rows: 18 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Year_Built, Status
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
space
## # A tibble: 18 × 7
## Year_Built Status `10K-49K` `50K-99K` `100K-199K` `200K-399K` `400K+`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> <NA> NA NA NA NA NA
## 2 2010 to present Built 6000 11000 25000 30000 33000
## 3 <NA> Demolish… 100 200 300 400 500
## 4 <NA> <NA> NA NA NA NA NA
## 5 2000 to 2009 Built 36000 31000 41000 27000 30000
## 6 <NA> Demolish… 200 0 0 0 0
## 7 <NA> <NA> NA NA NA NA NA
## 8 1990 to 1999 Built 59000 53000 67000 45000 58000
## 9 <NA> Demolish… 100 200 0 0 0
## 10 <NA> <NA> NA NA NA NA NA
## 11 1980 to 1989 Built 129000 82000 99000 57000 60000
## 12 <NA> Demolish… 50 50 0 0 0
## 13 <NA> <NA> NA NA NA NA NA
## 14 1970 to 1979 Built 145000 99000 106000 61000 79000
## 15 <NA> Demolish… 50 50 0 0 0
## 16 <NA> <NA> NA NA NA NA NA
## 17 Before 1970 Built 313000 240000 250000 222000 40000
## 18 <NA> Demolish… 100000 50000 30000 20000 10000
##Step 1: There is missing data in the first column where Year_Built is not repeated for each row; also an empty row between two rows of data is filled with NA’s when read in from the csv file, both of these issues need to be cleaned up.
## Warning: package 'tidyr' was built under R version 4.0.5
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
## # A tibble: 12 × 7
## Year_Built Status `10K-49K` `50K-99K` `100K-199K` `200K-399K` `400K+`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 to present Built 6000 11000 25000 30000 33000
## 2 2010 to present Demolish… 100 200 300 400 500
## 3 2000 to 2009 Built 36000 31000 41000 27000 30000
## 4 2000 to 2009 Demolish… 200 0 0 0 0
## 5 1990 to 1999 Built 59000 53000 67000 45000 58000
## 6 1990 to 1999 Demolish… 100 200 0 0 0
## 7 1980 to 1989 Built 129000 82000 99000 57000 60000
## 8 1980 to 1989 Demolish… 50 50 0 0 0
## 9 1970 to 1979 Built 145000 99000 106000 61000 79000
## 10 1970 to 1979 Demolish… 50 50 0 0 0
## 11 Before 1970 Built 313000 240000 250000 222000 40000
## 12 Before 1970 Demolish… 100000 50000 30000 20000 10000
##Step 2: The dataset has three variables - Year_Built, Built Status (Built or Demolished), and building size broken up into a few ranges - (10K - 49K), (50K - 99K), (100K - 199K), (200K - 399K) and 400K+. To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows. Melting is parameterised by a list of columns that are already variables, or colvars for short. In this example, the colvar is Year_Built. The individual size range columns are melted into one variable called Size_Range. However, this form is not yet tidy because we have two variables stored in rows: “Built” and “Demolished”. This needs to be cast.
space_clean2 <- gather(space_clean1, "Size_Range", "Value", 3:7, na.rm = TRUE); space_clean2
## # A tibble: 60 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 2010 to present Built 10K-49K 6000
## 2 2010 to present Demolished 10K-49K 100
## 3 2000 to 2009 Built 10K-49K 36000
## 4 2000 to 2009 Demolished 10K-49K 200
## 5 1990 to 1999 Built 10K-49K 59000
## 6 1990 to 1999 Demolished 10K-49K 100
## 7 1980 to 1989 Built 10K-49K 129000
## 8 1980 to 1989 Demolished 10K-49K 50
## 9 1970 to 1979 Built 10K-49K 145000
## 10 1970 to 1979 Demolished 10K-49K 50
## # … with 50 more rows
##Step 3: Casting the X2 column. Steps 2 and 3 make the data tidy. There is one variable in each column and each row represents an observation. We can use the tidy dataset to perform an initial analysis which is comparing the built and demolished space across size ranges by years. This is best accomplished graphically in the next step.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.8
## ✓ tibble 3.1.6 ✓ stringr 1.4.0
## ✓ purrr 0.3.4 ✓ forcats 0.5.1
## Warning: package 'dplyr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
options(scipen = 999)
space_clean3 <- spread(space_clean2, "Status", "Value", convert = TRUE); space_clean3
## # A tibble: 30 × 4
## Year_Built Size_Range Built Demolished
## <chr> <chr> <int> <int>
## 1 1970 to 1979 100K-199K 106000 0
## 2 1970 to 1979 10K-49K 145000 50
## 3 1970 to 1979 200K-399K 61000 0
## 4 1970 to 1979 400K+ 79000 0
## 5 1970 to 1979 50K-99K 99000 50
## 6 1980 to 1989 100K-199K 99000 0
## 7 1980 to 1989 10K-49K 129000 50
## 8 1980 to 1989 200K-399K 57000 0
## 9 1980 to 1989 400K+ 60000 0
## 10 1980 to 1989 50K-99K 82000 50
## # … with 20 more rows
##Step 4: We visualize by plotting the graphs for total built space versus demolitions and then repeat the exercise for each of the years. Since it’s easiest to compare built space next to demolitions, I use the original melted dataset space_clean2 to plot built space and demolitions side by side.
The visualizations below provide a series of observations - Smaller size industrial buildings were in favor before the 1970’s compared to big-box warehouse (> 400,000 SF) which gained popularity post-2010. The second observation is as expected which is that we see a high rate of demolitions for industrial stock built before the 70’s but limited demolitions for any other time period. As buildings reach the end of their useful life, they are demolished to make way for a higher and better use. Had we combined all built space and demolitions, such granularity and color would have been lost which is best assessed by grouping by the respective time periods.
library(ggplot2)
library(tidyverse)
#By year - Before 1970
space_clean4 <- space_clean2 %>% filter(space_clean2$Year_Built == "Before 1970"); space_clean4
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 Before 1970 Built 10K-49K 313000
## 2 Before 1970 Demolished 10K-49K 100000
## 3 Before 1970 Built 50K-99K 240000
## 4 Before 1970 Demolished 50K-99K 50000
## 5 Before 1970 Built 100K-199K 250000
## 6 Before 1970 Demolished 100K-199K 30000
## 7 Before 1970 Built 200K-399K 222000
## 8 Before 1970 Demolished 200K-399K 20000
## 9 Before 1970 Built 400K+ 40000
## 10 Before 1970 Demolished 400K+ 10000
SizeRange <- factor(space_clean4$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean4, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("Before 1970") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
#By year - 1970 to 1979
space_clean5 <- space_clean2 %>% filter(space_clean2$Year_Built == "1970 to 1979"); space_clean5
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 1970 to 1979 Built 10K-49K 145000
## 2 1970 to 1979 Demolished 10K-49K 50
## 3 1970 to 1979 Built 50K-99K 99000
## 4 1970 to 1979 Demolished 50K-99K 50
## 5 1970 to 1979 Built 100K-199K 106000
## 6 1970 to 1979 Demolished 100K-199K 0
## 7 1970 to 1979 Built 200K-399K 61000
## 8 1970 to 1979 Demolished 200K-399K 0
## 9 1970 to 1979 Built 400K+ 79000
## 10 1970 to 1979 Demolished 400K+ 0
SizeRange <- factor(space_clean5$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean5, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1970 to 1979") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
#By year - 1980 to 1989
space_clean6 <- space_clean2 %>% filter(space_clean2$Year_Built == "1980 to 1989"); space_clean6
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 1980 to 1989 Built 10K-49K 129000
## 2 1980 to 1989 Demolished 10K-49K 50
## 3 1980 to 1989 Built 50K-99K 82000
## 4 1980 to 1989 Demolished 50K-99K 50
## 5 1980 to 1989 Built 100K-199K 99000
## 6 1980 to 1989 Demolished 100K-199K 0
## 7 1980 to 1989 Built 200K-399K 57000
## 8 1980 to 1989 Demolished 200K-399K 0
## 9 1980 to 1989 Built 400K+ 60000
## 10 1980 to 1989 Demolished 400K+ 0
SizeRange <- factor(space_clean6$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean6, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1980 to 1989") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
#By year - 1990 to 1999
space_clean7 <- space_clean2 %>% filter(space_clean2$Year_Built == "1990 to 1999"); space_clean7
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 1990 to 1999 Built 10K-49K 59000
## 2 1990 to 1999 Demolished 10K-49K 100
## 3 1990 to 1999 Built 50K-99K 53000
## 4 1990 to 1999 Demolished 50K-99K 200
## 5 1990 to 1999 Built 100K-199K 67000
## 6 1990 to 1999 Demolished 100K-199K 0
## 7 1990 to 1999 Built 200K-399K 45000
## 8 1990 to 1999 Demolished 200K-399K 0
## 9 1990 to 1999 Built 400K+ 58000
## 10 1990 to 1999 Demolished 400K+ 0
SizeRange <- factor(space_clean7$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean7, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1990 to 1999") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
#By year - 2000 to 2009
space_clean8 <- space_clean2 %>% filter(space_clean2$Year_Built == "2000 to 2009"); space_clean8
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 2000 to 2009 Built 10K-49K 36000
## 2 2000 to 2009 Demolished 10K-49K 200
## 3 2000 to 2009 Built 50K-99K 31000
## 4 2000 to 2009 Demolished 50K-99K 0
## 5 2000 to 2009 Built 100K-199K 41000
## 6 2000 to 2009 Demolished 100K-199K 0
## 7 2000 to 2009 Built 200K-399K 27000
## 8 2000 to 2009 Demolished 200K-399K 0
## 9 2000 to 2009 Built 400K+ 30000
## 10 2000 to 2009 Demolished 400K+ 0
SizeRange <- factor(space_clean8$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean8, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("2000 to 2009") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
#By year - 2010 to present
space_clean9 <- space_clean2 %>% filter(space_clean2$Year_Built == "2010 to present"); space_clean9
## # A tibble: 10 × 4
## Year_Built Status Size_Range Value
## <chr> <chr> <chr> <dbl>
## 1 2010 to present Built 10K-49K 6000
## 2 2010 to present Demolished 10K-49K 100
## 3 2010 to present Built 50K-99K 11000
## 4 2010 to present Demolished 50K-99K 200
## 5 2010 to present Built 100K-199K 25000
## 6 2010 to present Demolished 100K-199K 300
## 7 2010 to present Built 200K-399K 30000
## 8 2010 to present Demolished 200K-399K 400
## 9 2010 to present Built 400K+ 33000
## 10 2010 to present Demolished 400K+ 500
SizeRange <- factor(space_clean9$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean9, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("2010 to present") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
The original dataset already appeared to be in melted format, so I changed it to wide by adding two new columns Female and Male that are actually variable values. When the dataset is read in, any rows without values in these columns is filled with NA’s which needs to be taken care of. I also passed variable names such that they are shorter and more meaningful.
library(readr)
library(RCurl)
field_names <- c("Race_Ethnicity", "Parent_education", "Lunch", "Test_Prep", "Math_Score", "Reading_Score", "Writing_Score", "Female", "Male")
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Students_Performance.csv")
performance <- read_csv(url, col_names = field_names, skip = 1, trim_ws = TRUE)
## Rows: 1000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Race_Ethnicity, Parent_education, Lunch, Test_Prep, Female, Male
## dbl (3): Math_Score, Reading_Score, Writing_Score
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
performance
## # A tibble: 1,000 × 9
## Race_Ethnicity Parent_education Lunch Test_Prep Math_Score Reading_Score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 group B bachelor's degree standard none 72 72
## 2 group C some college standard completed 69 90
## 3 group B master's degree standard none 90 95
## 4 group B associate's degree standard none 71 83
## 5 group B some college standard completed 88 95
## 6 group B high school free/re… none 38 60
## 7 group B high school standard none 65 81
## 8 group A master's degree standard none 50 53
## 9 group C some high school standard none 69 75
## 10 group B some high school free/re… none 18 32
## # … with 990 more rows, and 3 more variables: Writing_Score <dbl>,
## # Female <chr>, Male <chr>
##Step 1: The dataset has eight variables - Race_Ethnicity, Parent_Education, Lunch, Test_Prep, Math_Score, Reading_Score, Writing_Score and Gender (Female and Male). To tidy it, we need to melt, or stack it. The individual Female and Male columns are melted into one variable called Gender, we don’t really need the Value column because Gender itself indicates value for each row and I also arranged the columns as in the original dataset by putting Gender first. In addition, I also melt the Math, Reading and Writing Scores into one column called Score so that I can plot Math, Reading and Writing scores side by side for each parent_education grouping.
library(tidyr)
performance_clean1 <- gather(performance, "Gender", "Value", 8:9, na.rm = TRUE); performance_clean1
## # A tibble: 1,000 × 9
## Race_Ethnicity Parent_education Lunch Test_Prep Math_Score Reading_Score
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 group B bachelor's degree standard none 72 72
## 2 group C some college standard completed 69 90
## 3 group B master's degree standard none 90 95
## 4 group B associate's degree standard none 71 83
## 5 group B some college standard completed 88 95
## 6 group B high school free/re… none 38 60
## 7 group B high school standard none 65 81
## 8 group A master's degree standard none 50 53
## 9 group C some high school standard none 69 75
## 10 group B some high school free/re… none 18 32
## # … with 990 more rows, and 3 more variables: Writing_Score <dbl>,
## # Gender <chr>, Value <chr>
performance_clean2 <- performance_clean1[c("Gender", "Race_Ethnicity", "Parent_education", "Lunch", "Test_Prep", "Math_Score", "Reading_Score", "Writing_Score")]; performance_clean2
## # A tibble: 1,000 × 8
## Gender Race_Ethnicity Parent_education Lunch Test_Prep Math_Score
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Female group B bachelor's degree standard none 72
## 2 Female group C some college standard completed 69
## 3 Female group B master's degree standard none 90
## 4 Female group B associate's degree standard none 71
## 5 Female group B some college standard completed 88
## 6 Female group B high school free/reduced none 38
## 7 Female group B high school standard none 65
## 8 Female group A master's degree standard none 50
## 9 Female group C some high school standard none 69
## 10 Female group B some high school free/reduced none 18
## # … with 990 more rows, and 2 more variables: Reading_Score <dbl>,
## # Writing_Score <dbl>
##Step 2: We use the tidy dataset above to perform the analysis asked in the assignment which is analyzing student performance for a given level of parental education. This analysis yields interesting insights as below.
It is no surprise that the mean scores across math, reading and writing increase with the level of parental education as more educated parents in turn stress on the importance of education for their children. However, it is surprising that scores across math, reading and writing for parents with some high school education were higher than for those parents who completed their high school education. The other two observations are that the mean Math score ranges were the lowest across parental education categories compared to reading and writing and also that the range of writing scores across parental education categories was the widest.
library(ggplot2)
library(tidyverse)
#Math/Reading/Writing mean/median/sd breakdowns by parental level of education
vis_perf1 <- performance_clean2 %>%
group_by(Parent_education) %>%
summarise(mean_MS = mean(Math_Score), median_MS = median(Math_Score), sd_MS = sd(Math_Score), mean_RS = mean(Reading_Score), median_RS = median(Reading_Score), sd_rS = sd(Reading_Score), mean_WS = mean(Writing_Score), median_WS = median(Writing_Score), sd_WS = sd(Writing_Score))
vis_perf1
## # A tibble: 6 × 10
## Parent_education mean_MS median_MS sd_MS mean_RS median_RS sd_rS mean_WS
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 associate's degree 67.9 67 15.1 70.9 72.5 13.9 69.9
## 2 bachelor's degree 69.4 68 14.9 73 73 14.3 73.4
## 3 high school 62.1 63 14.5 64.7 66 14.1 62.4
## 4 master's degree 69.7 73 15.2 75.4 76 13.8 75.7
## 5 some college 67.1 67.5 14.3 69.5 70.5 14.1 68.8
## 6 some high school 63.5 65 15.9 66.9 67 15.5 64.9
## # … with 2 more variables: median_WS <dbl>, sd_WS <dbl>
##Only calculate means and plot to compare
vis_perf2 <- performance_clean2 %>%
group_by(Parent_education) %>%
summarise(mean_MS = mean(Math_Score), mean_RS = mean(Reading_Score), mean_WS = mean(Writing_Score))
vis_perf2
## # A tibble: 6 × 4
## Parent_education mean_MS mean_RS mean_WS
## <chr> <dbl> <dbl> <dbl>
## 1 associate's degree 67.9 70.9 69.9
## 2 bachelor's degree 69.4 73 73.4
## 3 high school 62.1 64.7 62.4
## 4 master's degree 69.7 75.4 75.7
## 5 some college 67.1 69.5 68.8
## 6 some high school 63.5 66.9 64.9
##Order parental level of education from least to most highly educated
Parentedu <- factor(vis_perf1$Parent_education, levels = c("some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"))
##Visualize by plotting side-by-side the mean math, reading and writing scores
ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_MS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Math Scores") + ggtitle("Mean Math score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_RS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Reading Scores") + ggtitle("Mean Reading score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_WS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Writing Scores") + ggtitle("Mean Writing score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
I used the file that Zhouxin has already uploaded to Github. The file is in wide format with 6 different races in 6 different columns - AIAN (American Indian and Alaska Native), Asian, Black, Hispanic, White and Multi. These need to be melted together. I start by cleaning up the column names so that they are understandable and consistent.
library(readr)
library(RCurl)
field_names <- c("RowNum", "LEAID", "LEA_Name", "State", "Locale", "School_Year", "AIAN", "Asian", "Black", "Hispanic", "White", "Multi", "Total", "Diverse", "Variance", "Int_Group")
url <- getURLContent("https://raw.githubusercontent.com/szx868/data607/master/school_diversity.csv")
school_diversity <- read_csv(url, col_names = field_names, skip = 1, trim_ws = TRUE)
## Rows: 27944 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): LEA_Name, State, Locale, School_Year, Diverse, Int_Group
## dbl (10): RowNum, LEAID, AIAN, Asian, Black, Hispanic, White, Multi, Total, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
school_diversity
## # A tibble: 27,944 × 16
## RowNum LEAID LEA_Name State Locale School_Year AIAN Asian Black Hispanic
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 100002 alabama … AL <NA> 1994-1995 0 0.589 71.7 0.196
## 2 2 100005 albertvi… AL town-… 1994-1995 0 0.321 1.28 4.52
## 3 3 100005 albertvi… AL town-… 2016-2017 0.294 0.551 3.19 46.7
## 4 4 100006 marshall… AL rural… 1994-1995 0.104 0.134 0.373 0.909
## 5 5 100006 marshall… AL rural… 2016-2017 0.492 0.299 1.07 21.3
## 6 6 100007 hoover c… AL city-… 1994-1995 0.0652 1.60 6.04 0.548
## 7 7 100007 hoover c… AL city-… 2016-2017 0.0933 6.68 24.7 7.76
## 8 8 100008 madison … AL subur… 2016-2017 0.594 8.92 20.4 5.11
## 9 9 100011 leeds ci… AL subur… 2016-2017 0.304 1.06 23.0 12.5
## 10 10 100012 boaz city AL town-… 2016-2017 0.502 1.17 1.97 29.8
## # … with 27,934 more rows, and 6 more variables: White <dbl>, Multi <dbl>,
## # Total <dbl>, Diverse <chr>, Variance <dbl>, Int_Group <chr>
##Step 1: We filter out all the schools with population less than 100, as indicated in the suggested problem which excludes 2,166 rows.
school_diversity1 <- school_diversity %>% filter(school_diversity$Total >= 100); school_diversity1
## # A tibble: 25,778 × 16
## RowNum LEAID LEA_Name State Locale School_Year AIAN Asian Black Hispanic
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 100002 alabama … AL <NA> 1994-1995 0 0.589 71.7 0.196
## 2 2 100005 albertvi… AL town-… 1994-1995 0 0.321 1.28 4.52
## 3 3 100005 albertvi… AL town-… 2016-2017 0.294 0.551 3.19 46.7
## 4 4 100006 marshall… AL rural… 1994-1995 0.104 0.134 0.373 0.909
## 5 5 100006 marshall… AL rural… 2016-2017 0.492 0.299 1.07 21.3
## 6 6 100007 hoover c… AL city-… 1994-1995 0.0652 1.60 6.04 0.548
## 7 7 100007 hoover c… AL city-… 2016-2017 0.0933 6.68 24.7 7.76
## 8 8 100008 madison … AL subur… 2016-2017 0.594 8.92 20.4 5.11
## 9 9 100011 leeds ci… AL subur… 2016-2017 0.304 1.06 23.0 12.5
## 10 10 100012 boaz city AL town-… 2016-2017 0.502 1.17 1.97 29.8
## # … with 25,768 more rows, and 6 more variables: White <dbl>, Multi <dbl>,
## # Total <dbl>, Diverse <chr>, Variance <dbl>, Int_Group <chr>
##Step 2: The dataset has eleven variables - RowNum, LEAID, LEA_Name, State, Locale, School_Year, Total, Diverse, Variance, Int_Group and Race with 6 values that need to be stacked - AIAN (American Indian and Alaska Native), Asian, Black, Hispanic, White and Multi.
library(tidyr)
school_diversity2 <- gather(school_diversity1, "Race", "Value", 7:12, na.rm = TRUE); school_diversity2
## # A tibble: 141,289 × 12
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 100002 alabama youth … AL <NA> 1994-1995 509 Diverse NA
## 2 2 100005 albertville ci… AL town-… 1994-1995 3118 Extrem… NA
## 3 3 100005 albertville ci… AL town-… 2016-2017 5447 Diverse 0.0116
## 4 4 100006 marshall county AL rural… 1994-1995 6707 Extrem… NA
## 5 5 100006 marshall county AL rural… 2016-2017 5687 Undive… NA
## 6 6 100007 hoover city AL city-… 1994-1995 7671 Extrem… NA
## 7 7 100007 hoover city AL city-… 2016-2017 13938 Diverse 0.0316
## 8 8 100008 madison city AL subur… 2016-2017 10440 Diverse 0.00266
## 9 9 100011 leeds city AL subur… 2016-2017 1973 Diverse NA
## 10 10 100012 boaz city AL town-… 2016-2017 2389 Diverse NA
## # … with 141,279 more rows, and 3 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>
##Step 3: The tidied dataset is used below for some initial analysis. We calculate the racial average by eliminating the White population in the dataset and by grouping by school_year, the proportions should add up to 1 (including the white population). The higher the overall proportion minus the whites, the higher the racial diversity in the school population but this only means that other races were represented. On the other end of the spectrum, the school could be considered less diverse if there is a high majority of only one type of race in the school (Could be Asian, African-American, Hispanic or any other race).
library(ggplot2)
library(tidyverse)
distinct_df = school_diversity2 %>% distinct(School_Year); distinct_df
## # A tibble: 2 × 1
## School_Year
## <chr>
## 1 1994-1995
## 2 2016-2017
school_diversity3 <- school_diversity2 %>%
group_by(State) %>%
group_by(LEAID) %>%
filter(Race != "White" & School_Year == "1994-1995") %>%
mutate(Racial_average = sum(Value)); school_diversity3
## # A tibble: 53,516 × 13
## # Groups: LEAID [13,379]
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 100002 alabama youth … AL <NA> 1994-1995 509 Diverse NA
## 2 2 100005 albertville ci… AL town-… 1994-1995 3118 Extrem… NA
## 3 4 100006 marshall county AL rural… 1994-1995 6707 Extrem… NA
## 4 6 100007 hoover city AL city-… 1994-1995 7671 Extrem… NA
## 5 12 100030 alexander city AL town-… 1994-1995 3788 Diverse NA
## 6 14 100060 andalusia city AL town-… 1994-1995 2091 Undive… NA
## 7 16 100090 anniston city AL city-… 1994-1995 3584 Undive… NA
## 8 18 100100 arab city AL rural… 1994-1995 2503 Extrem… NA
## 9 20 100120 athens city AL town-… 1994-1995 2915 Diverse 0.00307
## 10 22 100180 attalla city AL subur… 1994-1995 1967 Undive… NA
## # … with 53,506 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>, Racial_average <dbl>
school_diversity5 <- school_diversity3 %>% filter(Racial_average >= 50); school_diversity5
## # A tibble: 6,340 × 13
## # Groups: LEAID [1,585]
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 100002 alabama youth … AL <NA> 1994-1995 509 Diverse NA
## 2 16 100090 anniston city AL city-… 1994-1995 3584 Undive… NA
## 3 36 100300 barbour county AL rural… 1994-1995 2027 Undive… NA
## 4 38 100330 bessemer city AL subur… 1994-1995 4872 Extrem… NA
## 5 42 100390 birmingham city AL city-… 1994-1995 41801 Extrem… NA
## 6 48 100480 bullock county AL town-… 1994-1995 1972 Extrem… NA
## 7 50 100510 butler county AL town-… 1994-1995 4300 Diverse 0.0242
## 8 54 100600 chambers county AL town-… 1994-1995 4138 Diverse 0.305
## 9 60 100690 choctaw county AL rural… 1994-1995 2929 Diverse 0.154
## 10 62 100720 clarke county AL town-… 1994-1995 4059 Diverse 0.00873
## # … with 6,330 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>, Racial_average <dbl>
school_diversity4 <- school_diversity2 %>%
group_by(State) %>%
group_by(LEAID) %>%
filter(Race != "White" & School_Year == "2016-2017") %>%
mutate(Racial_average = sum(Value)); school_diversity4
## # A tibble: 61,995 × 13
## # Groups: LEAID [12,399]
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3 100005 albertville ci… AL town-… 2016-2017 5447 Diverse 0.0116
## 2 5 100006 marshall county AL rural… 2016-2017 5687 Undive… NA
## 3 7 100007 hoover city AL city-… 2016-2017 13938 Diverse 0.0316
## 4 8 100008 madison city AL subur… 2016-2017 10440 Diverse 0.00266
## 5 9 100011 leeds city AL subur… 2016-2017 1973 Diverse NA
## 6 10 100012 boaz city AL town-… 2016-2017 2389 Diverse NA
## 7 11 100013 trussville city AL subur… 2016-2017 4539 Undive… NA
## 8 13 100030 alexander city AL town-… 2016-2017 3055 Diverse NA
## 9 15 100060 andalusia city AL town-… 2016-2017 1744 Diverse NA
## 10 17 100090 anniston city AL city-… 2016-2017 2069 Extrem… NA
## # … with 61,985 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>, Racial_average <dbl>
school_diversity6 <- school_diversity4 %>% filter(Racial_average >= 50); school_diversity6
## # A tibble: 13,935 × 13
## # Groups: LEAID [2,787]
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3 100005 albertville ci… AL town-… 2016-2017 5447 Diverse 0.0116
## 2 17 100090 anniston city AL city-… 2016-2017 2069 Extrem… NA
## 3 25 100188 chickasaw city AL subur… 2016-2017 1056 Diverse NA
## 4 37 100300 barbour county AL rural… 2016-2017 847 Undive… NA
## 5 39 100330 bessemer city AL subur… 2016-2017 3605 Undive… NA
## 6 43 100390 birmingham city AL city-… 2016-2017 23158 Extrem… NA
## 7 49 100480 bullock county AL town-… 2016-2017 1476 Undive… NA
## 8 51 100510 butler county AL town-… 2016-2017 3131 Diverse 0.142
## 9 55 100600 chambers county AL town-… 2016-2017 3651 Diverse 0.178
## 10 61 100690 choctaw county AL rural… 2016-2017 1417 Diverse NA
## # … with 13,925 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>, Racial_average <dbl>
total <- rbind(school_diversity3, school_diversity4); total[order(total$Racial_average, decreasing = TRUE), ]
## # A tibble: 115,511 × 13
## # Groups: LEAID [13,902]
## RowNum LEAID LEA_Name State Locale School_Year Total Diverse Variance
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 5184 1710950 ford heights … IL subur… 1994-1995 962 Extrem… NA
## 2 13560 3011420 frazer elem MT rural… 1994-1995 124 Extrem… NA
## 3 5184 1710950 ford heights … IL subur… 1994-1995 962 Extrem… NA
## 4 13560 3011420 frazer elem MT rural… 1994-1995 124 Extrem… NA
## 5 5184 1710950 ford heights … IL subur… 1994-1995 962 Extrem… NA
## 6 13560 3011420 frazer elem MT rural… 1994-1995 124 Extrem… NA
## 7 5184 1710950 ford heights … IL subur… 1994-1995 962 Extrem… NA
## 8 13560 3011420 frazer elem MT rural… 1994-1995 124 Extrem… NA
## 9 694 407200 sacaton eleme… AZ rural… 2016-2017 556 Extrem… NA
## 10 14441 3119560 umo n ho n na… NE rural… 2016-2017 548 Extrem… NA
## # … with 115,501 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## # Value <dbl>, Racial_average <dbl>
As we see above, by sorting in descending order we can see that ford heights sd 169 and frazer elem are one of the most undiversified schools because they have a high concentration of American Indian and Black populations. In fact, 6,340 schools or 12% of the schools had such an undiversified student base in the 1994-1995 school year and 13,935 schools or 22% of the schools had such an undiversified student base in the 2016-2017 school year. Hence, the common misconception is incorrect that schools are not racially diverse if they contain a high number of caucasian students, schools can also be considered un-diverse if they have a high concentration of students of any one race including minorities.
##Step 4: Prepare the output file and write out.
output <- data.frame(total$LEAID, total$LEA_Name, total$State, total$School_Year, round(total$Racial_average))
write_csv(output, "/Users/tponnada/Desktop/DATA607/Racial_diversity.csv")