Purpose

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:

Introduction

I decided to use the following three datasets - The space distribution dataset for commercial real estate that I had provided to the team, the Student Performance data provided by Jered Ataky and the school diversity data provided by Zhouxin Shi.

Dataset 1 - Space distribution in commercial real estate

This is a typical example of additions and demolitions to stock in commercial real estate that is used to assess fundamentals, too much competitive supply will depress pricing and vice versa. Demolitions can also give a sense of the highest and best use for a parcel of land, too many demolitions and very few additions for a given property-type means that the highest and best use is perhaps an office tower instead of an older warehouse or an apartment building instead of a surface parking lot etc. Furthermore, the information can be aggregated up to determine what share of total stock for a given property-type is functionally obsolete. For example, modern warehouses (Built 2010 and later) have higher ceiling heights and more dock doors for example compared to product built before 1970 with a low dock-door ratio. Such aggregations provide a useful benchmark for investment management and development purposes.

library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(RCurl)
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Space%20distribution.csv")
space <- read_csv(url)
## Rows: 18 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Year_Built, Status
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
space
## # A tibble: 18 × 7
##    Year_Built      Status    `10K-49K` `50K-99K` `100K-199K` `200K-399K` `400K+`
##    <chr>           <chr>         <dbl>     <dbl>       <dbl>       <dbl>   <dbl>
##  1 <NA>            <NA>             NA        NA          NA          NA      NA
##  2 2010 to present Built          6000     11000       25000       30000   33000
##  3 <NA>            Demolish…       100       200         300         400     500
##  4 <NA>            <NA>             NA        NA          NA          NA      NA
##  5 2000 to 2009    Built         36000     31000       41000       27000   30000
##  6 <NA>            Demolish…       200         0           0           0       0
##  7 <NA>            <NA>             NA        NA          NA          NA      NA
##  8 1990 to 1999    Built         59000     53000       67000       45000   58000
##  9 <NA>            Demolish…       100       200           0           0       0
## 10 <NA>            <NA>             NA        NA          NA          NA      NA
## 11 1980 to 1989    Built        129000     82000       99000       57000   60000
## 12 <NA>            Demolish…        50        50           0           0       0
## 13 <NA>            <NA>             NA        NA          NA          NA      NA
## 14 1970 to 1979    Built        145000     99000      106000       61000   79000
## 15 <NA>            Demolish…        50        50           0           0       0
## 16 <NA>            <NA>             NA        NA          NA          NA      NA
## 17 Before 1970     Built        313000    240000      250000      222000   40000
## 18 <NA>            Demolish…    100000     50000       30000       20000   10000

Data cleansing

##Step 1: There is missing data in the first column where Year_Built is not repeated for each row; also an empty row between two rows of data is filled with NA’s when read in from the csv file, both of these issues need to be cleaned up.

## Warning: package 'tidyr' was built under R version 4.0.5
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
## # A tibble: 12 × 7
##    Year_Built      Status    `10K-49K` `50K-99K` `100K-199K` `200K-399K` `400K+`
##    <chr>           <chr>         <dbl>     <dbl>       <dbl>       <dbl>   <dbl>
##  1 2010 to present Built          6000     11000       25000       30000   33000
##  2 2010 to present Demolish…       100       200         300         400     500
##  3 2000 to 2009    Built         36000     31000       41000       27000   30000
##  4 2000 to 2009    Demolish…       200         0           0           0       0
##  5 1990 to 1999    Built         59000     53000       67000       45000   58000
##  6 1990 to 1999    Demolish…       100       200           0           0       0
##  7 1980 to 1989    Built        129000     82000       99000       57000   60000
##  8 1980 to 1989    Demolish…        50        50           0           0       0
##  9 1970 to 1979    Built        145000     99000      106000       61000   79000
## 10 1970 to 1979    Demolish…        50        50           0           0       0
## 11 Before 1970     Built        313000    240000      250000      222000   40000
## 12 Before 1970     Demolish…    100000     50000       30000       20000   10000

##Step 2: The dataset has three variables - Year_Built, Built Status (Built or Demolished), and building size broken up into a few ranges - (10K - 49K), (50K - 99K), (100K - 199K), (200K - 399K) and 400K+. To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows. Melting is parameterised by a list of columns that are already variables, or colvars for short. In this example, the colvar is Year_Built. The individual size range columns are melted into one variable called Size_Range. However, this form is not yet tidy because we have two variables stored in rows: “Built” and “Demolished”. This needs to be cast.

space_clean2 <- gather(space_clean1, "Size_Range", "Value", 3:7, na.rm = TRUE); space_clean2
## # A tibble: 60 × 4
##    Year_Built      Status     Size_Range  Value
##    <chr>           <chr>      <chr>       <dbl>
##  1 2010 to present Built      10K-49K      6000
##  2 2010 to present Demolished 10K-49K       100
##  3 2000 to 2009    Built      10K-49K     36000
##  4 2000 to 2009    Demolished 10K-49K       200
##  5 1990 to 1999    Built      10K-49K     59000
##  6 1990 to 1999    Demolished 10K-49K       100
##  7 1980 to 1989    Built      10K-49K    129000
##  8 1980 to 1989    Demolished 10K-49K        50
##  9 1970 to 1979    Built      10K-49K    145000
## 10 1970 to 1979    Demolished 10K-49K        50
## # … with 50 more rows

##Step 3: Casting the X2 column. Steps 2 and 3 make the data tidy. There is one variable in each column and each row represents an observation. We can use the tidy dataset to perform an initial analysis which is comparing the built and demolished space across size ranges by years. This is best accomplished graphically in the next step.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ dplyr   1.0.8
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ purrr   0.3.4     ✓ forcats 0.5.1
## Warning: package 'dplyr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
options(scipen = 999)
space_clean3 <- spread(space_clean2, "Status", "Value", convert = TRUE); space_clean3
## # A tibble: 30 × 4
##    Year_Built   Size_Range  Built Demolished
##    <chr>        <chr>       <int>      <int>
##  1 1970 to 1979 100K-199K  106000          0
##  2 1970 to 1979 10K-49K    145000         50
##  3 1970 to 1979 200K-399K   61000          0
##  4 1970 to 1979 400K+       79000          0
##  5 1970 to 1979 50K-99K     99000         50
##  6 1980 to 1989 100K-199K   99000          0
##  7 1980 to 1989 10K-49K    129000         50
##  8 1980 to 1989 200K-399K   57000          0
##  9 1980 to 1989 400K+       60000          0
## 10 1980 to 1989 50K-99K     82000         50
## # … with 20 more rows

##Step 4: We visualize by plotting the graphs for total built space versus demolitions and then repeat the exercise for each of the years. Since it’s easiest to compare built space next to demolitions, I use the original melted dataset space_clean2 to plot built space and demolitions side by side.

Conclusions:

The visualizations below provide a series of observations - Smaller size industrial buildings were in favor before the 1970’s compared to big-box warehouse (> 400,000 SF) which gained popularity post-2010. The second observation is as expected which is that we see a high rate of demolitions for industrial stock built before the 70’s but limited demolitions for any other time period. As buildings reach the end of their useful life, they are demolished to make way for a higher and better use. Had we combined all built space and demolitions, such granularity and color would have been lost which is best assessed by grouping by the respective time periods.

library(ggplot2)
library(tidyverse)

#By year - Before 1970
space_clean4 <- space_clean2 %>% filter(space_clean2$Year_Built == "Before 1970"); space_clean4
## # A tibble: 10 × 4
##    Year_Built  Status     Size_Range  Value
##    <chr>       <chr>      <chr>       <dbl>
##  1 Before 1970 Built      10K-49K    313000
##  2 Before 1970 Demolished 10K-49K    100000
##  3 Before 1970 Built      50K-99K    240000
##  4 Before 1970 Demolished 50K-99K     50000
##  5 Before 1970 Built      100K-199K  250000
##  6 Before 1970 Demolished 100K-199K   30000
##  7 Before 1970 Built      200K-399K  222000
##  8 Before 1970 Demolished 200K-399K   20000
##  9 Before 1970 Built      400K+       40000
## 10 Before 1970 Demolished 400K+       10000
SizeRange <- factor(space_clean4$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean4, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("Before 1970") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

#By year - 1970 to 1979
space_clean5 <- space_clean2 %>% filter(space_clean2$Year_Built == "1970 to 1979"); space_clean5
## # A tibble: 10 × 4
##    Year_Built   Status     Size_Range  Value
##    <chr>        <chr>      <chr>       <dbl>
##  1 1970 to 1979 Built      10K-49K    145000
##  2 1970 to 1979 Demolished 10K-49K        50
##  3 1970 to 1979 Built      50K-99K     99000
##  4 1970 to 1979 Demolished 50K-99K        50
##  5 1970 to 1979 Built      100K-199K  106000
##  6 1970 to 1979 Demolished 100K-199K       0
##  7 1970 to 1979 Built      200K-399K   61000
##  8 1970 to 1979 Demolished 200K-399K       0
##  9 1970 to 1979 Built      400K+       79000
## 10 1970 to 1979 Demolished 400K+           0
SizeRange <- factor(space_clean5$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean5, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1970 to 1979") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

#By year - 1980 to 1989
space_clean6 <- space_clean2 %>% filter(space_clean2$Year_Built == "1980 to 1989"); space_clean6
## # A tibble: 10 × 4
##    Year_Built   Status     Size_Range  Value
##    <chr>        <chr>      <chr>       <dbl>
##  1 1980 to 1989 Built      10K-49K    129000
##  2 1980 to 1989 Demolished 10K-49K        50
##  3 1980 to 1989 Built      50K-99K     82000
##  4 1980 to 1989 Demolished 50K-99K        50
##  5 1980 to 1989 Built      100K-199K   99000
##  6 1980 to 1989 Demolished 100K-199K       0
##  7 1980 to 1989 Built      200K-399K   57000
##  8 1980 to 1989 Demolished 200K-399K       0
##  9 1980 to 1989 Built      400K+       60000
## 10 1980 to 1989 Demolished 400K+           0
SizeRange <- factor(space_clean6$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean6, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1980 to 1989") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

#By year - 1990 to 1999
space_clean7 <- space_clean2 %>% filter(space_clean2$Year_Built == "1990 to 1999"); space_clean7
## # A tibble: 10 × 4
##    Year_Built   Status     Size_Range Value
##    <chr>        <chr>      <chr>      <dbl>
##  1 1990 to 1999 Built      10K-49K    59000
##  2 1990 to 1999 Demolished 10K-49K      100
##  3 1990 to 1999 Built      50K-99K    53000
##  4 1990 to 1999 Demolished 50K-99K      200
##  5 1990 to 1999 Built      100K-199K  67000
##  6 1990 to 1999 Demolished 100K-199K      0
##  7 1990 to 1999 Built      200K-399K  45000
##  8 1990 to 1999 Demolished 200K-399K      0
##  9 1990 to 1999 Built      400K+      58000
## 10 1990 to 1999 Demolished 400K+          0
SizeRange <- factor(space_clean7$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean7, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("1990 to 1999") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

#By year - 2000 to 2009
space_clean8 <- space_clean2 %>% filter(space_clean2$Year_Built == "2000 to 2009"); space_clean8
## # A tibble: 10 × 4
##    Year_Built   Status     Size_Range Value
##    <chr>        <chr>      <chr>      <dbl>
##  1 2000 to 2009 Built      10K-49K    36000
##  2 2000 to 2009 Demolished 10K-49K      200
##  3 2000 to 2009 Built      50K-99K    31000
##  4 2000 to 2009 Demolished 50K-99K        0
##  5 2000 to 2009 Built      100K-199K  41000
##  6 2000 to 2009 Demolished 100K-199K      0
##  7 2000 to 2009 Built      200K-399K  27000
##  8 2000 to 2009 Demolished 200K-399K      0
##  9 2000 to 2009 Built      400K+      30000
## 10 2000 to 2009 Demolished 400K+          0
SizeRange <- factor(space_clean8$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean8, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("2000 to 2009") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

#By year - 2010 to present
space_clean9 <- space_clean2 %>% filter(space_clean2$Year_Built == "2010 to present"); space_clean9
## # A tibble: 10 × 4
##    Year_Built      Status     Size_Range Value
##    <chr>           <chr>      <chr>      <dbl>
##  1 2010 to present Built      10K-49K     6000
##  2 2010 to present Demolished 10K-49K      100
##  3 2010 to present Built      50K-99K    11000
##  4 2010 to present Demolished 50K-99K      200
##  5 2010 to present Built      100K-199K  25000
##  6 2010 to present Demolished 100K-199K    300
##  7 2010 to present Built      200K-399K  30000
##  8 2010 to present Demolished 200K-399K    400
##  9 2010 to present Built      400K+      33000
## 10 2010 to present Demolished 400K+        500
SizeRange <- factor(space_clean9$Size_Range, levels = c("10K-49K", "50K-99K", "100K-199K", "200K-399K", "400K+"))
ggplot(data = space_clean9, aes(x = SizeRange, y = Value, fill = Status)) + geom_bar(stat = "identity", width = 0.9, position = "dodge") + xlab("Size Range") + ylab("Value in SF") + ggtitle("2010 to present") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

Dataset 2 - Student Performance

Purpose - Analyze student performance and parental level of education.

The original dataset already appeared to be in melted format, so I changed it to wide by adding two new columns Female and Male that are actually variable values. When the dataset is read in, any rows without values in these columns is filled with NA’s which needs to be taken care of. I also passed variable names such that they are shorter and more meaningful.

library(readr)
library(RCurl)
field_names <- c("Race_Ethnicity", "Parent_education", "Lunch", "Test_Prep", "Math_Score", "Reading_Score", "Writing_Score", "Female", "Male")
url <- getURLContent("https://raw.githubusercontent.com/tponnada/DATA607/master/Students_Performance.csv")
performance <- read_csv(url, col_names = field_names, skip = 1, trim_ws = TRUE)
## Rows: 1000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Race_Ethnicity, Parent_education, Lunch, Test_Prep, Female, Male
## dbl (3): Math_Score, Reading_Score, Writing_Score
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
performance
## # A tibble: 1,000 × 9
##    Race_Ethnicity Parent_education   Lunch    Test_Prep Math_Score Reading_Score
##    <chr>          <chr>              <chr>    <chr>          <dbl>         <dbl>
##  1 group B        bachelor's degree  standard none              72            72
##  2 group C        some college       standard completed         69            90
##  3 group B        master's degree    standard none              90            95
##  4 group B        associate's degree standard none              71            83
##  5 group B        some college       standard completed         88            95
##  6 group B        high school        free/re… none              38            60
##  7 group B        high school        standard none              65            81
##  8 group A        master's degree    standard none              50            53
##  9 group C        some high school   standard none              69            75
## 10 group B        some high school   free/re… none              18            32
## # … with 990 more rows, and 3 more variables: Writing_Score <dbl>,
## #   Female <chr>, Male <chr>

Data cleansing

##Step 1: The dataset has eight variables - Race_Ethnicity, Parent_Education, Lunch, Test_Prep, Math_Score, Reading_Score, Writing_Score and Gender (Female and Male). To tidy it, we need to melt, or stack it. The individual Female and Male columns are melted into one variable called Gender, we don’t really need the Value column because Gender itself indicates value for each row and I also arranged the columns as in the original dataset by putting Gender first. In addition, I also melt the Math, Reading and Writing Scores into one column called Score so that I can plot Math, Reading and Writing scores side by side for each parent_education grouping.

library(tidyr)
performance_clean1 <- gather(performance, "Gender", "Value", 8:9, na.rm = TRUE); performance_clean1
## # A tibble: 1,000 × 9
##    Race_Ethnicity Parent_education   Lunch    Test_Prep Math_Score Reading_Score
##    <chr>          <chr>              <chr>    <chr>          <dbl>         <dbl>
##  1 group B        bachelor's degree  standard none              72            72
##  2 group C        some college       standard completed         69            90
##  3 group B        master's degree    standard none              90            95
##  4 group B        associate's degree standard none              71            83
##  5 group B        some college       standard completed         88            95
##  6 group B        high school        free/re… none              38            60
##  7 group B        high school        standard none              65            81
##  8 group A        master's degree    standard none              50            53
##  9 group C        some high school   standard none              69            75
## 10 group B        some high school   free/re… none              18            32
## # … with 990 more rows, and 3 more variables: Writing_Score <dbl>,
## #   Gender <chr>, Value <chr>
performance_clean2 <- performance_clean1[c("Gender", "Race_Ethnicity", "Parent_education", "Lunch", "Test_Prep", "Math_Score", "Reading_Score", "Writing_Score")]; performance_clean2
## # A tibble: 1,000 × 8
##    Gender Race_Ethnicity Parent_education   Lunch        Test_Prep Math_Score
##    <chr>  <chr>          <chr>              <chr>        <chr>          <dbl>
##  1 Female group B        bachelor's degree  standard     none              72
##  2 Female group C        some college       standard     completed         69
##  3 Female group B        master's degree    standard     none              90
##  4 Female group B        associate's degree standard     none              71
##  5 Female group B        some college       standard     completed         88
##  6 Female group B        high school        free/reduced none              38
##  7 Female group B        high school        standard     none              65
##  8 Female group A        master's degree    standard     none              50
##  9 Female group C        some high school   standard     none              69
## 10 Female group B        some high school   free/reduced none              18
## # … with 990 more rows, and 2 more variables: Reading_Score <dbl>,
## #   Writing_Score <dbl>

##Step 2: We use the tidy dataset above to perform the analysis asked in the assignment which is analyzing student performance for a given level of parental education. This analysis yields interesting insights as below.

Conclusion:

It is no surprise that the mean scores across math, reading and writing increase with the level of parental education as more educated parents in turn stress on the importance of education for their children. However, it is surprising that scores across math, reading and writing for parents with some high school education were higher than for those parents who completed their high school education. The other two observations are that the mean Math score ranges were the lowest across parental education categories compared to reading and writing and also that the range of writing scores across parental education categories was the widest.

library(ggplot2)
library(tidyverse)

#Math/Reading/Writing mean/median/sd breakdowns by parental level of education
vis_perf1 <- performance_clean2 %>%
  group_by(Parent_education) %>%
  summarise(mean_MS = mean(Math_Score), median_MS = median(Math_Score), sd_MS = sd(Math_Score), mean_RS = mean(Reading_Score),  median_RS = median(Reading_Score), sd_rS = sd(Reading_Score), mean_WS = mean(Writing_Score), median_WS = median(Writing_Score), sd_WS = sd(Writing_Score))
vis_perf1
## # A tibble: 6 × 10
##   Parent_education   mean_MS median_MS sd_MS mean_RS median_RS sd_rS mean_WS
##   <chr>                <dbl>     <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>
## 1 associate's degree    67.9      67    15.1    70.9      72.5  13.9    69.9
## 2 bachelor's degree     69.4      68    14.9    73        73    14.3    73.4
## 3 high school           62.1      63    14.5    64.7      66    14.1    62.4
## 4 master's degree       69.7      73    15.2    75.4      76    13.8    75.7
## 5 some college          67.1      67.5  14.3    69.5      70.5  14.1    68.8
## 6 some high school      63.5      65    15.9    66.9      67    15.5    64.9
## # … with 2 more variables: median_WS <dbl>, sd_WS <dbl>
##Only calculate means and plot to compare
vis_perf2 <- performance_clean2 %>%
  group_by(Parent_education) %>%
  summarise(mean_MS = mean(Math_Score), mean_RS = mean(Reading_Score), mean_WS = mean(Writing_Score))
vis_perf2
## # A tibble: 6 × 4
##   Parent_education   mean_MS mean_RS mean_WS
##   <chr>                <dbl>   <dbl>   <dbl>
## 1 associate's degree    67.9    70.9    69.9
## 2 bachelor's degree     69.4    73      73.4
## 3 high school           62.1    64.7    62.4
## 4 master's degree       69.7    75.4    75.7
## 5 some college          67.1    69.5    68.8
## 6 some high school      63.5    66.9    64.9
##Order parental level of education from least to most highly educated
Parentedu <- factor(vis_perf1$Parent_education, levels = c("some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"))

##Visualize by plotting side-by-side the mean math, reading and writing scores
ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_MS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Math Scores") + ggtitle("Mean Math score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_RS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Reading Scores") + ggtitle("Mean Reading score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

ggplot(data = vis_perf2) + geom_bar(mapping = aes(x = Parentedu, y = mean_WS), stat = "identity") + xlab("Parental level of education") + ylab("Mean Writing Scores") + ggtitle("Mean Writing score by parental level of education") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

Dataset 3 - School Diversity

Purpose - Calculate the racial average for each school in each state and write to a csv file.

I used the file that Zhouxin has already uploaded to Github. The file is in wide format with 6 different races in 6 different columns - AIAN (American Indian and Alaska Native), Asian, Black, Hispanic, White and Multi. These need to be melted together. I start by cleaning up the column names so that they are understandable and consistent.

library(readr)
library(RCurl)
field_names <- c("RowNum", "LEAID", "LEA_Name", "State", "Locale", "School_Year", "AIAN", "Asian", "Black", "Hispanic", "White", "Multi", "Total", "Diverse", "Variance", "Int_Group")
url <- getURLContent("https://raw.githubusercontent.com/szx868/data607/master/school_diversity.csv")
school_diversity <- read_csv(url, col_names = field_names, skip = 1, trim_ws = TRUE)
## Rows: 27944 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): LEA_Name, State, Locale, School_Year, Diverse, Int_Group
## dbl (10): RowNum, LEAID, AIAN, Asian, Black, Hispanic, White, Multi, Total, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
school_diversity
## # A tibble: 27,944 × 16
##    RowNum  LEAID LEA_Name  State Locale School_Year   AIAN Asian  Black Hispanic
##     <dbl>  <dbl> <chr>     <chr> <chr>  <chr>        <dbl> <dbl>  <dbl>    <dbl>
##  1      1 100002 alabama … AL    <NA>   1994-1995   0      0.589 71.7      0.196
##  2      2 100005 albertvi… AL    town-… 1994-1995   0      0.321  1.28     4.52 
##  3      3 100005 albertvi… AL    town-… 2016-2017   0.294  0.551  3.19    46.7  
##  4      4 100006 marshall… AL    rural… 1994-1995   0.104  0.134  0.373    0.909
##  5      5 100006 marshall… AL    rural… 2016-2017   0.492  0.299  1.07    21.3  
##  6      6 100007 hoover c… AL    city-… 1994-1995   0.0652 1.60   6.04     0.548
##  7      7 100007 hoover c… AL    city-… 2016-2017   0.0933 6.68  24.7      7.76 
##  8      8 100008 madison … AL    subur… 2016-2017   0.594  8.92  20.4      5.11 
##  9      9 100011 leeds ci… AL    subur… 2016-2017   0.304  1.06  23.0     12.5  
## 10     10 100012 boaz city AL    town-… 2016-2017   0.502  1.17   1.97    29.8  
## # … with 27,934 more rows, and 6 more variables: White <dbl>, Multi <dbl>,
## #   Total <dbl>, Diverse <chr>, Variance <dbl>, Int_Group <chr>

Data cleansing

##Step 1: We filter out all the schools with population less than 100, as indicated in the suggested problem which excludes 2,166 rows.

school_diversity1 <- school_diversity %>% filter(school_diversity$Total >= 100); school_diversity1
## # A tibble: 25,778 × 16
##    RowNum  LEAID LEA_Name  State Locale School_Year   AIAN Asian  Black Hispanic
##     <dbl>  <dbl> <chr>     <chr> <chr>  <chr>        <dbl> <dbl>  <dbl>    <dbl>
##  1      1 100002 alabama … AL    <NA>   1994-1995   0      0.589 71.7      0.196
##  2      2 100005 albertvi… AL    town-… 1994-1995   0      0.321  1.28     4.52 
##  3      3 100005 albertvi… AL    town-… 2016-2017   0.294  0.551  3.19    46.7  
##  4      4 100006 marshall… AL    rural… 1994-1995   0.104  0.134  0.373    0.909
##  5      5 100006 marshall… AL    rural… 2016-2017   0.492  0.299  1.07    21.3  
##  6      6 100007 hoover c… AL    city-… 1994-1995   0.0652 1.60   6.04     0.548
##  7      7 100007 hoover c… AL    city-… 2016-2017   0.0933 6.68  24.7      7.76 
##  8      8 100008 madison … AL    subur… 2016-2017   0.594  8.92  20.4      5.11 
##  9      9 100011 leeds ci… AL    subur… 2016-2017   0.304  1.06  23.0     12.5  
## 10     10 100012 boaz city AL    town-… 2016-2017   0.502  1.17   1.97    29.8  
## # … with 25,768 more rows, and 6 more variables: White <dbl>, Multi <dbl>,
## #   Total <dbl>, Diverse <chr>, Variance <dbl>, Int_Group <chr>

##Step 2: The dataset has eleven variables - RowNum, LEAID, LEA_Name, State, Locale, School_Year, Total, Diverse, Variance, Int_Group and Race with 6 values that need to be stacked - AIAN (American Indian and Alaska Native), Asian, Black, Hispanic, White and Multi.

library(tidyr)
school_diversity2 <- gather(school_diversity1, "Race", "Value", 7:12, na.rm = TRUE); school_diversity2
## # A tibble: 141,289 × 12
##    RowNum  LEAID LEA_Name        State Locale School_Year Total Diverse Variance
##     <dbl>  <dbl> <chr>           <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1      1 100002 alabama youth … AL    <NA>   1994-1995     509 Diverse NA      
##  2      2 100005 albertville ci… AL    town-… 1994-1995    3118 Extrem… NA      
##  3      3 100005 albertville ci… AL    town-… 2016-2017    5447 Diverse  0.0116 
##  4      4 100006 marshall county AL    rural… 1994-1995    6707 Extrem… NA      
##  5      5 100006 marshall county AL    rural… 2016-2017    5687 Undive… NA      
##  6      6 100007 hoover city     AL    city-… 1994-1995    7671 Extrem… NA      
##  7      7 100007 hoover city     AL    city-… 2016-2017   13938 Diverse  0.0316 
##  8      8 100008 madison city    AL    subur… 2016-2017   10440 Diverse  0.00266
##  9      9 100011 leeds city      AL    subur… 2016-2017    1973 Diverse NA      
## 10     10 100012 boaz city       AL    town-… 2016-2017    2389 Diverse NA      
## # … with 141,279 more rows, and 3 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>

##Step 3: The tidied dataset is used below for some initial analysis. We calculate the racial average by eliminating the White population in the dataset and by grouping by school_year, the proportions should add up to 1 (including the white population). The higher the overall proportion minus the whites, the higher the racial diversity in the school population but this only means that other races were represented. On the other end of the spectrum, the school could be considered less diverse if there is a high majority of only one type of race in the school (Could be Asian, African-American, Hispanic or any other race).

library(ggplot2)
library(tidyverse)

distinct_df = school_diversity2 %>% distinct(School_Year); distinct_df
## # A tibble: 2 × 1
##   School_Year
##   <chr>      
## 1 1994-1995  
## 2 2016-2017
school_diversity3 <- school_diversity2 %>%
  group_by(State) %>%
  group_by(LEAID) %>%
  filter(Race != "White" & School_Year == "1994-1995") %>%
  mutate(Racial_average = sum(Value)); school_diversity3
## # A tibble: 53,516 × 13
## # Groups:   LEAID [13,379]
##    RowNum  LEAID LEA_Name        State Locale School_Year Total Diverse Variance
##     <dbl>  <dbl> <chr>           <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1      1 100002 alabama youth … AL    <NA>   1994-1995     509 Diverse NA      
##  2      2 100005 albertville ci… AL    town-… 1994-1995    3118 Extrem… NA      
##  3      4 100006 marshall county AL    rural… 1994-1995    6707 Extrem… NA      
##  4      6 100007 hoover city     AL    city-… 1994-1995    7671 Extrem… NA      
##  5     12 100030 alexander city  AL    town-… 1994-1995    3788 Diverse NA      
##  6     14 100060 andalusia city  AL    town-… 1994-1995    2091 Undive… NA      
##  7     16 100090 anniston city   AL    city-… 1994-1995    3584 Undive… NA      
##  8     18 100100 arab city       AL    rural… 1994-1995    2503 Extrem… NA      
##  9     20 100120 athens city     AL    town-… 1994-1995    2915 Diverse  0.00307
## 10     22 100180 attalla city    AL    subur… 1994-1995    1967 Undive… NA      
## # … with 53,506 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>, Racial_average <dbl>
school_diversity5 <- school_diversity3 %>% filter(Racial_average >= 50); school_diversity5
## # A tibble: 6,340 × 13
## # Groups:   LEAID [1,585]
##    RowNum  LEAID LEA_Name        State Locale School_Year Total Diverse Variance
##     <dbl>  <dbl> <chr>           <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1      1 100002 alabama youth … AL    <NA>   1994-1995     509 Diverse NA      
##  2     16 100090 anniston city   AL    city-… 1994-1995    3584 Undive… NA      
##  3     36 100300 barbour county  AL    rural… 1994-1995    2027 Undive… NA      
##  4     38 100330 bessemer city   AL    subur… 1994-1995    4872 Extrem… NA      
##  5     42 100390 birmingham city AL    city-… 1994-1995   41801 Extrem… NA      
##  6     48 100480 bullock county  AL    town-… 1994-1995    1972 Extrem… NA      
##  7     50 100510 butler county   AL    town-… 1994-1995    4300 Diverse  0.0242 
##  8     54 100600 chambers county AL    town-… 1994-1995    4138 Diverse  0.305  
##  9     60 100690 choctaw county  AL    rural… 1994-1995    2929 Diverse  0.154  
## 10     62 100720 clarke county   AL    town-… 1994-1995    4059 Diverse  0.00873
## # … with 6,330 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>, Racial_average <dbl>
school_diversity4 <- school_diversity2 %>%
  group_by(State) %>%
  group_by(LEAID) %>%
  filter(Race != "White" & School_Year == "2016-2017") %>%
  mutate(Racial_average = sum(Value)); school_diversity4
## # A tibble: 61,995 × 13
## # Groups:   LEAID [12,399]
##    RowNum  LEAID LEA_Name        State Locale School_Year Total Diverse Variance
##     <dbl>  <dbl> <chr>           <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1      3 100005 albertville ci… AL    town-… 2016-2017    5447 Diverse  0.0116 
##  2      5 100006 marshall county AL    rural… 2016-2017    5687 Undive… NA      
##  3      7 100007 hoover city     AL    city-… 2016-2017   13938 Diverse  0.0316 
##  4      8 100008 madison city    AL    subur… 2016-2017   10440 Diverse  0.00266
##  5      9 100011 leeds city      AL    subur… 2016-2017    1973 Diverse NA      
##  6     10 100012 boaz city       AL    town-… 2016-2017    2389 Diverse NA      
##  7     11 100013 trussville city AL    subur… 2016-2017    4539 Undive… NA      
##  8     13 100030 alexander city  AL    town-… 2016-2017    3055 Diverse NA      
##  9     15 100060 andalusia city  AL    town-… 2016-2017    1744 Diverse NA      
## 10     17 100090 anniston city   AL    city-… 2016-2017    2069 Extrem… NA      
## # … with 61,985 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>, Racial_average <dbl>
school_diversity6 <- school_diversity4 %>% filter(Racial_average >= 50); school_diversity6
## # A tibble: 13,935 × 13
## # Groups:   LEAID [2,787]
##    RowNum  LEAID LEA_Name        State Locale School_Year Total Diverse Variance
##     <dbl>  <dbl> <chr>           <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1      3 100005 albertville ci… AL    town-… 2016-2017    5447 Diverse   0.0116
##  2     17 100090 anniston city   AL    city-… 2016-2017    2069 Extrem…  NA     
##  3     25 100188 chickasaw city  AL    subur… 2016-2017    1056 Diverse  NA     
##  4     37 100300 barbour county  AL    rural… 2016-2017     847 Undive…  NA     
##  5     39 100330 bessemer city   AL    subur… 2016-2017    3605 Undive…  NA     
##  6     43 100390 birmingham city AL    city-… 2016-2017   23158 Extrem…  NA     
##  7     49 100480 bullock county  AL    town-… 2016-2017    1476 Undive…  NA     
##  8     51 100510 butler county   AL    town-… 2016-2017    3131 Diverse   0.142 
##  9     55 100600 chambers county AL    town-… 2016-2017    3651 Diverse   0.178 
## 10     61 100690 choctaw county  AL    rural… 2016-2017    1417 Diverse  NA     
## # … with 13,925 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>, Racial_average <dbl>
total <- rbind(school_diversity3, school_diversity4); total[order(total$Racial_average, decreasing = TRUE), ]
## # A tibble: 115,511 × 13
## # Groups:   LEAID [13,902]
##    RowNum   LEAID LEA_Name       State Locale School_Year Total Diverse Variance
##     <dbl>   <dbl> <chr>          <chr> <chr>  <chr>       <dbl> <chr>      <dbl>
##  1   5184 1710950 ford heights … IL    subur… 1994-1995     962 Extrem…       NA
##  2  13560 3011420 frazer elem    MT    rural… 1994-1995     124 Extrem…       NA
##  3   5184 1710950 ford heights … IL    subur… 1994-1995     962 Extrem…       NA
##  4  13560 3011420 frazer elem    MT    rural… 1994-1995     124 Extrem…       NA
##  5   5184 1710950 ford heights … IL    subur… 1994-1995     962 Extrem…       NA
##  6  13560 3011420 frazer elem    MT    rural… 1994-1995     124 Extrem…       NA
##  7   5184 1710950 ford heights … IL    subur… 1994-1995     962 Extrem…       NA
##  8  13560 3011420 frazer elem    MT    rural… 1994-1995     124 Extrem…       NA
##  9    694  407200 sacaton eleme… AZ    rural… 2016-2017     556 Extrem…       NA
## 10  14441 3119560 umo n ho n na… NE    rural… 2016-2017     548 Extrem…       NA
## # … with 115,501 more rows, and 4 more variables: Int_Group <chr>, Race <chr>,
## #   Value <dbl>, Racial_average <dbl>

Conclusion:

As we see above, by sorting in descending order we can see that ford heights sd 169 and frazer elem are one of the most undiversified schools because they have a high concentration of American Indian and Black populations. In fact, 6,340 schools or 12% of the schools had such an undiversified student base in the 1994-1995 school year and 13,935 schools or 22% of the schools had such an undiversified student base in the 2016-2017 school year. Hence, the common misconception is incorrect that schools are not racially diverse if they contain a high number of caucasian students, schools can also be considered un-diverse if they have a high concentration of students of any one race including minorities.

##Step 4: Prepare the output file and write out.

output <- data.frame(total$LEAID, total$LEA_Name, total$State, total$School_Year, round(total$Racial_average))
write_csv(output, "/Users/tponnada/Desktop/DATA607/Racial_diversity.csv")