library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

setwd("C:/Users/Az/Downloads/My Class Stuff/Monday Class")

dog_bite<-read_csv("Dog_Bite_Data_20260204.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 76472 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Bite Number, Bite Type, Incident Date, Victim Relationship, Bite L...
## dbl  (2): Victim Age, Treatment Cost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
census<-read_excel("DECENNIALDHC2020.P1-2026-03-23T192017.xlsx", 
    sheet = "Data")
# Following the class example: reshape the census data so there is one row per zip code
zip_long <- census %>%
  filter(Label == "Total") %>%                    
  pivot_longer(cols = -c(Label, `Dallas city, Texas`),
    names_to = "zipcode",
    values_to = "population") %>%
  mutate(zipcode = str_remove(zipcode, "^ZCTA5\\s+"), 
    population = as.numeric(gsub(",", "", population))) %>%
  select(zipcode, population)
#cleaning up dog data
dog_bite_clean <- dog_bite %>%
  select(`Bite Number`, `Bite Type`, `Victim Relationship`, `Bite Circumstance`, `Incident Location`) %>%
  drop_na()

dog_bite_clean <- dog_bite_clean %>% mutate(zip_code=str_sub(`Incident Location`,-5))
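# Taking the last five characters only gives a zip code when the location string ends in one. A rough check, sketched in base R: the example addresses below are hypothetical, and the real `Incident Location` values may be formatted differently.

```r
# Hypothetical location strings, including a ZIP+4 and a non-address value
locations <- c("1234 MAIN ST, DALLAS, TX 75238",
               "5678 ELM ST, DALLAS, TX 75231-1234",
               "UNKNOWN")

# Same idea as str_sub(x, -5): keep the last five characters
last_five <- substr(locations, nchar(locations) - 4, nchar(locations))

# TRUE only when those five characters are all digits
is_zip <- grepl("^[0-9]{5}$", last_five)

# A more forgiving extraction: a 5-digit zip at the end, optionally followed by -NNNN
pattern <- "^.*\\b([0-9]{5})(-[0-9]{4})?$"
zip_fixed <- ifelse(grepl(pattern, locations, perl = TRUE),
                    sub(pattern, "\\1", locations, perl = TRUE),
                    NA_character_)

data.frame(locations, last_five, is_zip, zip_fixed)
```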
#joining dog data and zip code data
dog_bite_clean<-left_join(dog_bite_clean,zip_long,by=join_by(zip_code==zipcode))
# Dropping rows with missing values (here, bites with no census population after the join)
dog_bite_clean_2<-dog_bite_clean %>% drop_na()
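# Before dropping rows, it can help to see how many bites fail to match a census zip code. A small sketch in base R: the two toy tables stand in for dog_bite_clean and zip_long, and their values (including the populations) are made up.

```r
# Hypothetical toy tables
bites <- data.frame(bite_id = 1:4,
                    zip_code = c("75238", "75231", "75238", "99999"),
                    stringsAsFactors = FALSE)
pops <- data.frame(zipcode = c("75238", "75231"),
                   population = c(30000, 38000),
                   stringsAsFactors = FALSE)

# Rows whose zip code has no match in the population table
unmatched <- bites[!(bites$zip_code %in% pops$zipcode), ]

# How many rows would be lost, and what share of the data that is
n_unmatched <- nrow(unmatched)
share_unmatched <- n_unmatched / nrow(bites)

# Which zip codes are responsible
table(unmatched$zip_code)
```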

# Cleaning up the bite type to include only bites
dog_bite_clean_2 <- dog_bite_clean_2 %>% filter(`Bite Type`== "BITE")

# Making a binary measure for the program zip codes (1 = program zip code, 0 = otherwise)
dog_bite_clean_2 <- dog_bite_clean_2 %>%
  mutate(program_zip = ifelse(zip_code %in% c("75238", "75231", "75228"), 1, 0))
# Selecting only the variables that I need
dog_bite_clean_3 <- dog_bite_clean_2 %>% select(`Bite Type`,`Victim Relationship`,`Bite Circumstance`, zip_code, population, program_zip)
# 2) Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable

#variable selected- bite circumstance: petting 
Petting_circumstance <- dog_bite_clean_3 %>% mutate(petting=ifelse(`Bite Circumstance`=="PETTING",1,0))
# 3) create a linear model using the "lm()" command, save it to some object
petting_model<-lm(petting~population+program_zip,data=Petting_circumstance)

# 4) call a "summary()" on your new model
summary(petting_model)
## 
## Call:
## lm(formula = petting ~ population + program_zip, data = Petting_circumstance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10863 -0.09056 -0.08905 -0.08531  0.91846 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.382e-02  2.641e-03  35.528  < 2e-16 ***
## population  -1.439e-07  5.856e-08  -2.457    0.014 *  
## program_zip  1.953e-02  4.483e-03   4.357 1.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2857 on 53005 degrees of freedom
## Multiple R-squared:  0.0004075,  Adjusted R-squared:  0.0003698 
## F-statistic:  10.8 on 2 and 53005 DF,  p-value: 2.037e-05
# 5) interpret the model's r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

# The R-squared value is very low (0.0004075), which means my variables explain only about 0.04% of the variation in whether a bite happened while someone was petting a dog. The overall model is still statistically significant (F-statistic p-value = 2.037e-05), which is not surprising with about 53,000 observations. Both independent variables are significant at the 0.05 level: the p-value for whether a zip code receives the prevention program is very small (1.32e-05), and the p-value for population is larger but still below 0.05 (0.014). There are no insignificant variables in this model.
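# A sketch of how these numbers can be pulled out of a model object instead of being read off the printout. The data here are simulated (not the dog bite data), so the values will differ from the summary above; only the extraction steps carry over.

```r
set.seed(1)

# Simulated stand-ins: a binary outcome and two predictors
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.3)
y  <- rbinom(n, 1, 0.1)

toy_model   <- lm(y ~ x1 + x2)
toy_summary <- summary(toy_model)

# R-squared as a percentage of variation explained
r2_percent <- 100 * toy_summary$r.squared

# p-values live in the "Pr(>|t|)" column of the coefficient table
p_values <- coef(toy_summary)[, "Pr(>|t|)"]

# The same conversion applied to the R-squared reported above
reported_r2_percent <- 100 * 0.0004075

r2_percent
p_values
reported_r2_percent
```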
#6) Choose some significant independent variables. Interpret its Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable? 

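# This one confused me a little, so I am going to try my best to explain it. The program_zip variable indicates whether a zip code received the prevention program. Its estimate is positive but small (0.01953). Because the dependent variable is binary (1 = petting bite, 0 = other bite), this means that, holding population constant, a bite in a program zip code is about 1.95 percentage points more likely to have happened while petting than a bite in a non-program zip code. This does not mean there are more bites overall in program zip codes; it only describes the share of bites that involve petting. The population estimate is negative and tiny (-1.439e-07): each additional 10,000 residents in a zip code is associated with about a 0.14 percentage point lower probability that a bite is a petting bite.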
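# To make the estimates concrete, the coefficients from the summary above can be plugged into the model equation by hand. The population of 30,000 is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from summary(petting_model)
b0     <- 9.382e-02   # intercept
b_pop  <- -1.439e-07  # population
b_prog <- 1.953e-02   # program_zip

# Hypothetical zip code population
pop_example <- 30000

# Predicted probability that a bite is a petting bite
p_no_program <- b0 + b_pop * pop_example
p_program    <- p_no_program + b_prog

round(c(no_program = p_no_program,
        program    = p_program,
        difference = p_program - p_no_program), 4)
```

# The difference between the two predictions is the program_zip estimate, whatever population is plugged in.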
# 7) Does the model you create meet or violate the assumption of linearity? Show your work with "plot(x,which=1)"

plot(petting_model, which = 1)

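# By looking solely at the red line, I would say the relationship is linear. However, I think this may be the case only because my dependent variable is binary: each residual is either 0 minus the fitted value or 1 minus the fitted value, so the points fall into two parallel bands and the plot is hard to interpret. A linear regression may not be the best fit for my data; a logistic regression may be more appropriate.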
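# A minimal sketch of the logistic alternative mentioned above, using glm() with family = binomial. The data are simulated stand-ins (not the dog bite data), population is rescaled to units of 10,000 as an assumption to keep the coefficient readable, and the "true" effect sizes are invented.

```r
set.seed(42)

# Simulated stand-ins for the model variables
n           <- 2000
program_zip <- rbinom(n, 1, 0.2)
pop_10k     <- round(runif(n, 5000, 80000)) / 10000
true_p      <- plogis(-2.3 + 0.2 * program_zip)
petting     <- rbinom(n, 1, true_p)

toy <- data.frame(petting, pop_10k, program_zip)

# Logistic regression: models the log-odds of a petting bite
toy_logit <- glm(petting ~ pop_10k + program_zip,
                 data = toy, family = binomial)

# Fitted values are probabilities, so they stay strictly between 0 and 1
fitted_p <- fitted(toy_logit)
range(fitted_p)

# Exponentiated coefficient = odds ratio for program vs. non-program zip codes
exp(coef(toy_logit)["program_zip"])
```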