Explore Missing info

The dataset contains the contaacts which is devices within 5m within a 5min window. So 5m per 5min will be count as one obs, and if the same two devices are within 5m over a 10min window continuously, they appear twice and therefore would be have two records in the dataset.

Our response here:

unique: the count of unique device pairings at a polygon in a day.
intense: reduces the window of what counts as a contact to 2m in 90s
intense_unique: first reduce the intense contact then unique
Not_household takes the unique contacts and removes device pairings that are likely from the same household. (We have distances to common evening location for most devices.)

franklin_recent_df <- read_csv("data/contact/oh_franklin_20210206_20210221.csv")
missing_info()

## # A tibble: 19 x 3
##    attribute          missing_number missing_pct
##    <chr>                       <dbl>       <dbl>
##  1 sqft                       247084      1     
##  2 numstories                 247084      1     
##  3 intense                    115788      0.469 
##  4 intense_unique             115788      0.469 
##  5 not_household               20829      0.0843
##  6 usedesc                      8031      0.0325
##  7 zoning_description           8031      0.0325
##  8 X1                              0      0     
##  9 address                         0      0     
## 10 date                            0      0     
## 11 all                             0      0     
## 12 unique                          0      0     
## 13 city                            0      0     
## 14 szip                            0      0     
## 15 lbcs_activity_desc              0      0     
## 16 lbcs_function_desc              0      0     
## 17 census_blockgroup               0      0     
## 18 lat                             0      0     
## 19 lon                             0      0

sqft and numstories are useless.

franklin_recent_df <- franklin_recent_df %>%
  select(-sqft, -numstories)

Fill in the missing value of intense contact count as 0. Then filter out the obs with missing data.

franklin_recent_df<- 
  franklin_recent_df %>% 
  replace_na(
    list(intense = 0, 
         intense_unique = 0)
  ) %>%
  na.omit()

Explore factors

Location Description

franklin_recent_df %>%
  group_by(lbcs_function_desc, lbcs_activity_desc) %>%
  summarise_at(vars(unique, intense_unique), .funs = list(sum)) %>%
  mutate(intense_pct = scales::percent(intense_unique/unique)) %>%
  arrange(-unique, -intense_unique)

## # A tibble: 7 x 5
## # Groups:   lbcs_function_desc [7]
##   lbcs_function_desc      lbcs_activity_desc   unique intense_unique intense_pct
##   <chr>                   <chr>                 <dbl>          <dbl> <chr>      
## 1 Residence or accommoda… Residential          6.42e5        340610. 53%        
## 2 unknown                 unknown              5.53e5        192951. 35%        
## 3 Education, public admi… Social, institution… 1.07e5         40322. 38%        
## 4 General sales or servi… Shopping, business,… 5.51e4         20951. 38%        
## 5 Private household       Residential          3.74e4         14280. 38%        
## 6 Arts, entertainment, a… Leisure              1.24e4          4998. 40%        
## 7 Transportation, commun… Travel or movement   4.70e1            18  38%

Contact are mostly Residential, either from residence or private household (65%). Apart from the unkowns, assuming the unknowns are uniformly distributed to the rest of the sectors, we should be paying attention to:

General sales or services (Shopping, business, or trade)
Education, public admin., health care (Social, institutional, or infrastructure-related)

Arts, entertainment, and recreation locations have low number of contact.

Map

location_sum <- franklin_recent_df %>%
  group_by(lat, lon, address, szip, city, lbcs_function_desc) %>%
  summarise_at(vars(unique, intense_unique), .funs = list(sum))
location_sum %>% 
  #filter(unique > 50) %>%
  ggplot(aes(log(unique))) +
  geom_density()

Most are below 150, some extreme are above that. Let’s check it out.

location_sum %>%
  filter(unique >= 150) %>%
  ggplot(aes(log(unique))) +
  geom_density()

get_pop_content <- function(address, city, szip, 
                            unique, intense_unique, lbcs_function_desc) {
  content <- 
    paste(
    tags$h4(tags$strong(city)),
    paste(address, szip), br(),
    tags$strong("Location Type: "), tags$u(lbcs_function_desc), br(),
    tags$strong("Unique Contact: "), tags$u(unique), "    ",
    tags$strong("Intense Contact: "), tags$u(intense_unique)
  )
  return(content)
}
location_info <- 
  location_sum %>%
  filter(unique >= 150) %>%
  ungroup() 
popup_content <- 
        location_info %>% 
        select(address, city, szip, unique, intense_unique, lbcs_function_desc) %>%
        pmap_chr(get_pop_content)
location_info <- 
        location_info %>%
        mutate(popup_content = popup_content)

location_info %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lat = ~lat, 
    lng = ~lon, 
    opacity = 0,#~formattable::normalize(unique, min = 0, max = 1),
    #clusterOptions = markerClusterOptions(),
    radius = ~formattable::normalize(intense_unique, min = 10, max = 30),
    fillOpacity = ~formattable::normalize(unique, min = 0, max = 1),
    popup = ~popup_content)

get rid of the extreme value: The Orchard apartment

Now clearly we have the most number of contact on campus.

# location_info <- location_info %>%
#   filter(address != "5353 WILCOX RD") %>%
#   mutate(group = cut_interval(unique, 4))
# locatin_groups <- location_info %>% distinct(group)

leaflet(location_info %>% filter(address != "5353 WILCOX RD")) %>%
  addTiles() %>%
  addCircleMarkers(
    lat = ~lat, 
    lng = ~lon, 
    opacity = 0,#~formattable::normalize(unique, min = 0, max = 1),
    #clusterOptions = markerClusterOptions(),
    radius = ~formattable::normalize(intense_unique, min = 10, max = 30),
    fillOpacity = ~formattable::normalize(unique, min = 0, max = 1),
    popup = ~popup_content)

Another extreme point: 1932 COLLEGE RD 43210, OSU campus

leaflet(location_info %>% filter(address != "5353 WILCOX RD" & 
                                   address != "1932 COLLEGE RD")
        ) %>%
  addTiles() %>%
  addProviderTiles(providers$Stamen.Toner) %>%
  addCircleMarkers(
    lat = ~lat, 
    lng = ~lon, 
    opacity = 0,#~formattable::normalize(unique, min = 0, max = 1),
    #clusterOptions = markerClusterOptions(),
    radius = ~formattable::normalize(intense_unique, min = 10, max = 30),
    fillOpacity = ~formattable::normalize(unique, min = 0, max = 1),
    popup = ~popup_content)

location_info %>%
  filter(address != "5353 WILCOX RD",
         address != "1932 COLLEGE RD") %>%
  slice_max(unique, n = 100) %>%
  select(address, city, type = lbcs_function_desc, 
         unique_contact = unique, unique_intense_contact = intense_unique) %>%
  DT::datatable()

contact EDA

Net Zhang

3/6/2021

Explore Missing info

Explore factors

Location Description

Map