Overview

In this project we look at three different datasets. Each needs formatting to become “tidy” before we can analyze it. The three datasets are as follows:

* HUD Income Limits
  + We look at income limits for all US states and territories and how they range.
* Brain Stroke Features
  + We perform some machine learning to verify whether the variables are good predictors of strokes.
* NBA Player Distance Traveled
  + We look at distance traveled by players and see how it can influence game outcomes.

Dataset 1: HUD Income Limits
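
The data are loaded from GitHub the same way as the other two datasets (a sketch; the exact CSV filename in the URL is an assumption):

#load the data into a dataframe (the filename below is assumed)
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy1/hud_income_limits.csv'
df <- read_csv(url(urlfile))
colnames(df) #prints the column listing shown below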

## Rows: 4765 Columns: 35
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (9): State_Alpha, fips2010, cbsasub, Metro_Area_Name, county_town_name,...
## dbl (26): median2022, l50_1, l50_2, l50_3, l50_4, l50_5, l50_6, l50_7, l50_8...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
##  [1] "State_Alpha"      "fips2010"         "cbsasub"          "median2022"      
##  [5] "Metro_Area_Name"  "county_town_name" "l50_1"            "l50_2"           
##  [9] "l50_3"            "l50_4"            "l50_5"            "l50_6"           
## [13] "l50_7"            "l50_8"            "ELI_1"            "ELI_2"           
## [17] "ELI_3"            "ELI_4"            "ELI_5"            "ELI_6"           
## [21] "ELI_7"            "ELI_8"            "l80_1"            "l80_2"           
## [25] "l80_3"            "l80_4"            "l80_5"            "l80_6"           
## [29] "l80_7"            "l80_8"            "state"            "county"          
## [33] "County_Name"      "state_name"       "metro"

Dataset 1: Data Wrangling

#Give each observation a unique ID - helps keep rows grouped together after a pivot_longer
df$ID <- seq.int(nrow(df))

#Convert the flat file into long format
df_longer <- df |>
  janitor::clean_names() |>
  pivot_longer(cols = l50_1:l80_8, 
               names_to = "income_level", 
               values_to = "income_threshold") |>
  mutate(household_size = case_when(
    endsWith(income_level, "1") ~ "1",
    endsWith(income_level, "2") ~ "2",
    endsWith(income_level, "3") ~ "3",
    endsWith(income_level, "4") ~ "4",
    endsWith(income_level, "5") ~ "5",
    endsWith(income_level, "6") ~ "6",
    endsWith(income_level, "7") ~ "7",
    endsWith(income_level, "8") ~ "8"
    )) |>
  mutate(program = case_when(
    startsWith(income_level, "eli") ~ "extra_low_income",
    startsWith(income_level, "l50") ~ "very_low_income",
    startsWith(income_level, "l80") ~ "low_income"
    )) |>
  mutate(metro = recode(metro, "0" = "non_metro_area", "1" = "metro_area")
           )

#Remove unnecessary columns and clean up column values.
df_clean <- df_longer |>
  select(-c(county,income_level)) |>
  mutate(county_name = str_remove_all(county_name, " County"))

#Flag US territories using a vector of state codes
us_ter <- c("GU", "MP", "AS", "VI", "PR")

df_clean <- df_clean |>
  mutate(territory = case_when(
    state_alpha %in% us_ter ~ "us_territory",
    !(state_alpha %in% us_ter) ~ "non_us_territory"
    ))

Dataset 1: Analysis

* Which states have the lowest average income limits across programs?
  + The top 5 states with the lowest averages are PR, MP, AS, MS, and AR.

* Which states have the highest average income limits across programs?
  + The top 5 states with the highest averages are DC, MA, HI, CT, and NJ.

* Do metro areas tend to have higher income limits?
  + As expected, metropolitan areas have higher income thresholds across all programs.

* Do US territories, on average, have higher income limits than the continental US?
  + No; most US territory limits fall in the bottom half of all states.
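
These rankings can be reproduced with a quick summary of the tidied data (a sketch, not the original ranking code):

#rank states by mean income threshold across all programs (sketch)
df_clean |>
  group_by(state_alpha) |>
  summarise(avg_limit = mean(income_threshold, na.rm = TRUE)) |>
  arrange(avg_limit) |>
  slice_head(n = 5) #lowest five; slice_tail(n = 5) gives the highest five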

#Average income threshold by state across programs, ordered highest to lowest
df_clean |>
  ggplot(aes(x= reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip()

#Same view, split by metro status
df_clean |>
  ggplot(aes(x= reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip() +
  facet_wrap(~metro)

#Same view, split by territory status
df_clean |>
  ggplot(aes(x= reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip() +
  facet_wrap(~territory)
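
A quick numeric check of the metro and territory observations (a sketch):

#mean income threshold by metro status and territory status (sketch)
df_clean |>
  group_by(metro, territory) |>
  summarise(avg_limit = mean(income_threshold, na.rm = TRUE), .groups = "drop")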

Dataset 2: Brain Stroke Features

#load the data into a dataframe
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy2/brain_stroke.csv'
df <- read_csv(url(urlfile))
## Rows: 4981 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (6): age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Dataset 2: Data Wrangling

#Give each observation a unique ID - helps keep rows grouped together after a pivot_longer
df$ID <- seq.int(nrow(df))

#Data wrangling/tidying
#Replace encoded 0/1 values with descriptive labels
df <- df |>
  janitor::clean_names() |>
  mutate(hypertension = recode(hypertension, "0" = "Negative", "1" = "Positive"),
         stroke = recode(stroke, "0" = "Negative", "1" = "Positive"),
         heart_disease = recode(heart_disease, "0" = "Negative", "1" = "Positive")
         )

#Coerce age to numeric and round to whole years
df$age <- round(as.numeric(df$age), 0)

#Check for NA values in each column
df |>
  summarise(across(everything(), ~ sum(is.na(.x))))
## # A tibble: 1 x 12
##   gender   age hypertens~1 heart~2 ever_~3 work_~4 resid~5 avg_g~6   bmi smoki~7
##    <int> <int>       <int>   <int>   <int>   <int>   <int>   <int> <int>   <int>
## 1      0     0           0       0       0       0       0       0     0       0
## # ... with 2 more variables: stroke <int>, id <int>, and abbreviated variable
## #   names 1: hypertension, 2: heart_disease, 3: ever_married, 4: work_type,
## #   5: residence_type, 6: avg_glucose_level, 7: smoking_status
#Consolidate all condition columns
df_longer <- df |>
  pivot_longer(cols = hypertension:heart_disease, 
               names_to = "condition", 
               values_to = "condition_status")

Dataset 2: Analysis

#create dataframe without id
df_class <- select(df, -id)

#convert character and outcome columns to factor type
cols <- c(1, 3, 4, 5, 6, 7, 10, 11) #column indices to convert to factors
df_class[,cols] <- lapply(df_class[,cols], factor)
glimpse(df_class)
## Rows: 4,981
## Columns: 11
## $ gender            <fct> Male, Male, Female, Female, Male, Male, Female, Fema~
## $ age               <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, ~
## $ hypertension      <fct> Negative, Negative, Negative, Positive, Negative, Po~
## $ heart_disease     <fct> Positive, Positive, Negative, Negative, Negative, Po~
## $ ever_married      <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes~
## $ work_type         <fct> Private, Private, Private, Self-employed, Private, P~
## $ residence_type    <fct> Urban, Rural, Urban, Rural, Urban, Rural, Urban, Urb~
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39~
## $ bmi               <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7~
## $ smoking_status    <fct> formerly smoked, never smoked, smokes, never smoked,~
## $ stroke            <fct> Positive, Positive, Positive, Positive, Positive, Po~
#partition and split the data (sample.split() comes from the caTools package;
#a set.seed() call beforehand would make the split reproducible)
spl = sample.split(df_class$stroke, SplitRatio = 0.80)
train = subset(df_class, spl==TRUE)
test = subset(df_class, spl==FALSE)
print(dim(train)); print(dim(test))
## [1] 3984   11
## [1] 997  11
#build, predict and evaluate the model
model_glm = glm(stroke ~ . , family="binomial", data = train)
summary(model_glm)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2781  -0.3234  -0.1651  -0.0850   3.7052  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -7.448670   1.078206  -6.908  4.9e-12 ***
## genderMale                  0.039554   0.160108   0.247  0.80487    
## age                         0.075568   0.006518  11.594  < 2e-16 ***
## hypertensionPositive        0.471244   0.183879   2.563  0.01038 *  
## heart_diseasePositive       0.316881   0.218163   1.452  0.14636    
## ever_marriedYes            -0.382335   0.241269  -1.585  0.11304    
## work_typeGovt_job          -0.425786   1.115293  -0.382  0.70263    
## work_typePrivate           -0.255638   1.099306  -0.233  0.81612    
## work_typeSelf-employed     -0.721647   1.119635  -0.645  0.51923    
## residence_typeUrban         0.065104   0.155597   0.418  0.67564    
## avg_glucose_level           0.003993   0.001363   2.930  0.00339 ** 
## bmi                         0.008442   0.014129   0.597  0.55019    
## smoking_statusnever smoked -0.144990   0.198720  -0.730  0.46562    
## smoking_statussmokes        0.170668   0.243100   0.702  0.48265    
## smoking_statusUnknown       0.005855   0.238041   0.025  0.98038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1574.7  on 3983  degrees of freedom
## Residual deviance: 1243.9  on 3969  degrees of freedom
## AIC: 1273.9
## 
## Number of Fisher Scoring iterations: 8
#Check Baseline Accuracy
prop.table(table(train$stroke))
## 
##  Negative  Positive 
## 0.9503012 0.0496988
# Predictions on the training set (predict.glm expects newdata, not data)
predictTrain = predict(model_glm, newdata = train, type = "response")

# Confusion matrix on training data
table(train$stroke, predictTrain >= 0.5)
##           
##            FALSE TRUE
##   Negative  3785    1
##   Positive   198    0
(3785+0)/nrow(train)
## [1] 0.9500502
#Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")

# Confusion matrix on test set
table(test$stroke, predictTest >= 0.5)
##           
##            FALSE TRUE
##   Negative   947    0
##   Positive    49    1
(947+1)/nrow(test)
## [1] 0.9508526

* It was found that the features in the dataset can be used to build a classification model for stroke risk; the model is about 95% accurate on the test set.
  + Note, however, that the baseline accuracy from always predicting “Negative” is also about 95% because of the strong class imbalance, so accuracy alone overstates the model’s predictive value.
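
Given the class imbalance, sensitivity and specificity are more informative than raw accuracy (a sketch computed from the test confusion matrix above):

#sensitivity and specificity on the test set (sketch)
cm <- table(test$stroke, predictTest >= 0.5)
cm["Positive", "TRUE"] / sum(cm["Positive", ])  #sensitivity: 1/50 = 0.02
cm["Negative", "FALSE"] / sum(cm["Negative", ]) #specificity: 947/947 = 1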

Dataset 3: NBA Player Distance Traveled

#load the data into a dataframe
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy3/nba_player_stats.csv'
df <- read_csv(url(urlfile))
## Rows: 605 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (2): PLAYER, TEAM
## dbl (11): GP, W, L, MIN, DIST_FEET, DIST_MILES, DIST_MILES_OFF, DIST_MILES_D...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Dataset 3: Data Wrangling

#Drop columns that will not be needed in the long format
df2 <- df |>
  select(-c(DIST_FEET,GP,DIST_MILES))

#Convert df2 into long format: one row per game outcome and per speed measure
df_longer <- df2 |>
  pivot_longer(cols = W:L, 
               names_to = "GAME_OUTCOME", 
               values_to = "GAME_CNT") |>
  pivot_longer(cols = AVG_SPEED:AVG_SPEED_DEF,
               names_to = "PLAYER_SPEED",
               values_to = "SPEED")

Dataset 3: Analysis

#Total distance traveled vs average speed
ggplot(df, aes(x=DIST_MILES, y=AVG_SPEED)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE, linetype="dashed",
             color="darkred")
## `geom_smooth()` using formula 'y ~ x'

#Distance traveled vs games won
ggplot(df, aes(x=DIST_MILES, y=W)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE, linetype="dashed",
             color="darkred")
## `geom_smooth()` using formula 'y ~ x'

#Defensive distance traveled vs games lost
ggplot(df, aes(x=DIST_MILES_DEF, y=L)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE, linetype="dashed",
             color="darkred")
## `geom_smooth()` using formula 'y ~ x'

#Distance traveled vs wins, colored by team
df |>
  ggplot(aes(x=DIST_MILES, y = W, colour= factor(TEAM))) +
  stat_summary(fun = "mean", geom = "point") +
  geom_smooth(method=lm, se=FALSE, linetype="dashed",
             color="darkred")
## `geom_smooth()` using formula 'y ~ x'

The original question was whether the distance a player travels correlates with his average speed; a slight negative correlation was found. We then compared games won against distance traveled and found a strong positive correlation. On its own that is not conclusive, so games lost were also plotted against distance traveled, and they likewise showed a positive correlation. To dig deeper, the strength of each regression line was compared. Unexpectedly, traveling a longer distance may come with a higher probability of losing a game: for players who traveled about 1 mile on average, the fitted lines predict roughly 18 wins but about 30 losses.
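
The strength of each regression can be quantified with R-squared (a sketch; the original measurement code is not shown here):

#R-squared for the wins and losses regressions (sketch)
summary(lm(W ~ DIST_MILES, data = df))$r.squared
summary(lm(L ~ DIST_MILES_DEF, data = df))$r.squared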