Imagine you work as a data analyst for a power company responsible for maintaining the power grid. The company wants to understand the relationship between daily average temperature and the likelihood of power grid failures within a day. You have collected data on the daily average temperature and whether there was a power grid failure on each day. Your goal is to analyze this data and determine if daily average temperature has an impact on power grid failure.
Let’s look at your data. (It is actually simulated data that I made up.)
data_daily_temp <- read_csv("../data/daily_temp_.csv")
head(data_daily_temp, 1)
## # A tibble: 1 × 2
##   temperature success
##         <dbl>   <dbl>
## 1        88.2       0
The temperature variable represents the daily average temperature (in Fahrenheit), and the success variable is a binary outcome (1 = there was not a power grid failure on the day, 0 = there was a power grid failure on the day).
Let’s recode the outcome so that it is expressed in terms of failure (1 = failure, 0 = no failure):
data_daily_temp <- data_daily_temp %>% mutate(failure = 1 - success)
data_daily_temp <- select(data_daily_temp, -success)
head(data_daily_temp, 1)
## # A tibble: 1 × 2
##   temperature failure
##         <dbl>   <dbl>
## 1        88.2       1
In this example, you model the relationship using logistic regression. On the test set, the model classifies about 88% of days correctly at a probability threshold of 0.5, which suggests that daily average temperature alone carries useful information about power grid failures. See below (for full details, see the .Rmd file):
## [1] "Logistic Regression Accuracy using a threshold prob of 0.5 on the test set: 0.883333333333333"
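The fitting code lives in the .Rmd file and is not reproduced here. As a rough sketch of what such an analysis could look like, the block below fits a logistic regression with base R's `glm()` and scores it at a 0.5 threshold on a held-out test set. The stand-in data frame `sim_data`, the coefficients used to simulate it, the 70/30 split, and the seed are all assumptions made for illustration, so the accuracy it prints will not match the number above.

```r
# Sketch only: simulated stand-in data, NOT the author's daily_temp_.csv.
set.seed(42)
n <- 400
temperature <- runif(n, min = 40, max = 100)
# Assumed relationship: failure becomes more likely as temperature rises.
p_fail <- plogis(-12 + 0.15 * temperature)
sim_data <- data.frame(
  temperature = temperature,
  failure = rbinom(n, size = 1, prob = p_fail)
)

# Hold out 30% of the rows as a test set (the split ratio is an assumption).
test_idx <- sample(seq_len(n), size = round(0.3 * n))
train <- sim_data[-test_idx, ]
test <- sim_data[test_idx, ]

# Logistic regression: log-odds of failure as a linear function of temperature.
fit <- glm(failure ~ temperature, data = train, family = binomial)

# Predicted failure probabilities on the test set, classified at 0.5.
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)

accuracy <- mean(test_pred == test$failure)
print(paste("Test accuracy at threshold 0.5:", round(accuracy, 3)))
```

To run the same steps on the real data, `sim_data` would be replaced by `data_daily_temp` after the recoding step.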
The relationship between daily temperature and power grid failure follows an S-shaped curve reasonably well, but the graph suggests that a model other than logistic regression could fit even better. So logistic regression is a workable choice here, but, as in the previous example, another type of classification model may do just as well or better.
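One way to judge how closely an S-shaped curve tracks the data, without relying on a plot alone, is to compare the observed failure rate in temperature bins with the average fitted probability in the same bins. The sketch below does this on simulated stand-in data. The 10-degree bin width, the simulation coefficients, and the seed are assumptions, not values taken from the .Rmd file.

```r
# Sketch only: simulated stand-in data, NOT the author's daily_temp_.csv.
set.seed(1)
temperature <- runif(500, min = 40, max = 100)
failure <- rbinom(500, size = 1, prob = plogis(-12 + 0.15 * temperature))

fit <- glm(failure ~ temperature, family = binomial)

# Group days into 10-degree temperature bins and compare, per bin,
# the observed failure rate with the mean fitted probability.
bins <- cut(temperature, breaks = seq(40, 100, by = 10), include.lowest = TRUE)
comparison <- data.frame(
  bin = levels(bins),
  observed = as.vector(tapply(failure, bins, mean)),
  fitted = as.vector(tapply(fitted(fit), bins, mean))
)
print(comparison, digits = 2)
```

Large gaps between the `observed` and `fitted` columns in some bins would hint that the logistic curve is the wrong shape there and that a more flexible classifier is worth trying.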
