In this project we look at three different datasets. Each needs reshaping to become “tidy” before we perform any analysis. The three datasets are as follows:
*HUD Income Limits:
+We look at income limits for all US states and territories and how they vary.
*Brain Stroke Features:
+We perform some machine learning to verify whether the variables are good predictors of stroke.
*NBA Player Distance Traveled:
+We look at distance traveled by players and see how it can influence game outcomes.
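The chunk that loads the HUD data does not appear in the original; a minimal sketch, assuming the packages are loaded in an unshown setup chunk and the same repo layout as the other two datasets (the Tidy1 path and file name below are hypothetical):
#Load the data into a dataframe (hypothetical URL; the real file name is not shown)
library(tidyverse)
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy1/hud_income_limits.csv'
df <- read_csv(url(urlfile))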
## Rows: 4765 Columns: 35
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): State_Alpha, fips2010, cbsasub, Metro_Area_Name, county_town_name,...
## dbl (26): median2022, l50_1, l50_2, l50_3, l50_4, l50_5, l50_6, l50_7, l50_8...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
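#(The column listing below was presumably produced by a call such as the following, which is missing from the original.)
names(df)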
## [1] "State_Alpha" "fips2010" "cbsasub" "median2022"
## [5] "Metro_Area_Name" "county_town_name" "l50_1" "l50_2"
## [9] "l50_3" "l50_4" "l50_5" "l50_6"
## [13] "l50_7" "l50_8" "ELI_1" "ELI_2"
## [17] "ELI_3" "ELI_4" "ELI_5" "ELI_6"
## [21] "ELI_7" "ELI_8" "l80_1" "l80_2"
## [25] "l80_3" "l80_4" "l80_5" "l80_6"
## [29] "l80_7" "l80_8" "state" "county"
## [33] "County_Name" "state_name" "metro"
#Give each observation a unique ID - helps keep data grouped after a pivot_longer
df$ID <- seq.int(nrow(df))
#Convert the flat file into long format
df_longer <- df |>
  janitor::clean_names() |>
  pivot_longer(cols = l50_1:l80_8,
               names_to = "income_level",
               values_to = "income_threshold") |>
  mutate(household_size = str_sub(income_level, -1)) |> #trailing digit 1-8 encodes household size
  mutate(program = case_when(
    startsWith(income_level, "eli") ~ "extremely_low_income", #ELI is HUD shorthand for Extremely Low Income
    startsWith(income_level, "l50") ~ "very_low_income",
    startsWith(income_level, "l80") ~ "low_income"
  )) |>
  mutate(metro = recode(metro, "0" = "non_metro_area", "1" = "metro_area"))
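A quick structural check that the pivot behaved as intended; a minimal sketch (this chunk is not in the original):
#Each original row should now appear 24 times (3 programs x 8 household sizes)
df_longer |>
  count(id) |>
  distinct(n)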
#Remove unnecessary columns and clean up column values.
df_clean <- df_longer |>
  select(-c(county, income_level)) |>
  mutate(county_name = str_remove_all(county_name, " County"))
#List the US territory codes
us_ter <- c("GU", "MP", "AS", "VI", "PR")
df_clean <- df_clean |>
  mutate(territory = if_else(state_alpha %in% us_ter,
                             "us_territory", "non_us_territory"))
*Which states have the lowest average income limits across programs? The top 5 states with the lowest average income limits are:
+PR
+MP
+AS
+MS
+AR
*Which states have the highest average income limits across programs? The top 5 states with the highest average income limits are:
+DC
+MA
+HI
+CT
+NJ
*Do metro areas tend to have higher income limits?
+As expected, metropolitan areas have higher income thresholds across all programs.
*Do US territories, on average, have higher income limits compared to the continental US?
+No; most US territory limits fall in the bottom half of all states.
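The rankings above can be reproduced with a quick grouped summary; a minimal sketch (this chunk is not in the original):
#Average income threshold by state across all programs and household sizes
df_clean |>
  group_by(state_alpha) |>
  summarise(avg_income_limit = mean(income_threshold, na.rm = TRUE)) |>
  arrange(avg_income_limit) |>
  slice_head(n = 5) #lowest five; use arrange(desc(avg_income_limit)) for the highest five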
df_clean |>
  ggplot(aes(x = reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip()
df_clean |>
  ggplot(aes(x = reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip() +
  facet_wrap(~metro)
df_clean |>
  ggplot(aes(x = reorder(factor(state_alpha), -income_threshold), y = income_threshold, fill = program)) +
  stat_summary(fun = "mean", geom = "bar") +
  coord_flip() +
  facet_wrap(~territory)
#Load the data into a dataframe
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy2/brain_stroke.csv'
df <- read_csv(url(urlfile))
## Rows: 4981 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): gender, ever_married, work_type, Residence_type, smoking_status
## dbl (6): age, hypertension, heart_disease, avg_glucose_level, bmi, stroke
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Give each observation a unique ID - helps keep data grouped after a pivot_longer
df$ID <- seq.int(nrow(df))
#####Data Wrangling/tidying
#Remove encoded values from columns
df <- df |>
  janitor::clean_names() |>
  mutate(hypertension = recode(hypertension, "0" = "Negative", "1" = "Positive"),
         stroke = recode(stroke, "0" = "Negative", "1" = "Positive"),
         heart_disease = recode(heart_disease, "0" = "Negative", "1" = "Positive"))
#Set column data types
df$age <- round(as.numeric(df$age), 0)
#Check for NA values in each column (summarise_all(funs()) is deprecated; across() is the current idiom)
df |>
  summarise(across(everything(), ~ sum(is.na(.x))))
## # A tibble: 1 x 12
## gender age hypertens~1 heart~2 ever_~3 work_~4 resid~5 avg_g~6 bmi smoki~7
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0 0 0 0
## # ... with 2 more variables: stroke <int>, id <int>, and abbreviated variable
## # names 1: hypertension, 2: heart_disease, 3: ever_married, 4: work_type,
## # 5: residence_type, 6: avg_glucose_level, 7: smoking_status
#Consolidate all condition columns
df_longer <- df |>
  pivot_longer(cols = hypertension:heart_disease,
               names_to = "condition",
               values_to = "condition_status")
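The long format makes side-by-side comparisons of the two conditions straightforward; a minimal sketch of one such use (this chunk is not in the original):
#Stroke rate by condition and condition status
df_longer |>
  group_by(condition, condition_status) |>
  summarise(stroke_rate = mean(stroke == "Positive"), .groups = "drop")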
#Create a dataframe without the id column
df_class <- select(df, -id)
#Convert character columns to factors
factor_cols <- c(1, 3, 4, 5, 6, 7, 10, 11) #column indices to convert to factor type
df_class[, factor_cols] <- lapply(df_class[, factor_cols], factor)
glimpse(df_class)
## Rows: 4,981
## Columns: 11
## $ gender <fct> Male, Male, Female, Female, Male, Male, Female, Fema~
## $ age <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, ~
## $ hypertension <fct> Negative, Negative, Negative, Positive, Negative, Po~
## $ heart_disease <fct> Positive, Positive, Negative, Negative, Negative, Po~
## $ ever_married <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes~
## $ work_type <fct> Private, Private, Private, Self-employed, Private, P~
## $ residence_type <fct> Urban, Rural, Urban, Rural, Urban, Rural, Urban, Urb~
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39~
## $ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7~
## $ smoking_status <fct> formerly smoked, never smoked, smokes, never smoked,~
## $ stroke <fct> Positive, Positive, Positive, Positive, Positive, Po~
#Partition and split the data (sample.split comes from the caTools package)
library(caTools)
spl <- sample.split(df_class$stroke, SplitRatio = 0.80)
train <- subset(df_class, spl == TRUE)
test <- subset(df_class, spl == FALSE)
print(dim(train)); print(dim(test))
## [1] 3984 11
## [1] 997 11
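sample.split stratifies on the outcome label, so both partitions should keep roughly the same class balance; a quick check (this chunk is not in the original):
prop.table(table(test$stroke))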
#Build, predict, and evaluate the model
model_glm <- glm(stroke ~ ., family = "binomial", data = train)
summary(model_glm)
##
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2781 -0.3234 -0.1651 -0.0850 3.7052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.448670 1.078206 -6.908 4.9e-12 ***
## genderMale 0.039554 0.160108 0.247 0.80487
## age 0.075568 0.006518 11.594 < 2e-16 ***
## hypertensionPositive 0.471244 0.183879 2.563 0.01038 *
## heart_diseasePositive 0.316881 0.218163 1.452 0.14636
## ever_marriedYes -0.382335 0.241269 -1.585 0.11304
## work_typeGovt_job -0.425786 1.115293 -0.382 0.70263
## work_typePrivate -0.255638 1.099306 -0.233 0.81612
## work_typeSelf-employed -0.721647 1.119635 -0.645 0.51923
## residence_typeUrban 0.065104 0.155597 0.418 0.67564
## avg_glucose_level 0.003993 0.001363 2.930 0.00339 **
## bmi 0.008442 0.014129 0.597 0.55019
## smoking_statusnever smoked -0.144990 0.198720 -0.730 0.46562
## smoking_statussmokes 0.170668 0.243100 0.702 0.48265
## smoking_statusUnknown 0.005855 0.238041 0.025 0.98038
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1574.7 on 3983 degrees of freedom
## Residual deviance: 1243.9 on 3969 degrees of freedom
## AIC: 1273.9
##
## Number of Fisher Scoring iterations: 8
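Only age, hypertension, and average glucose level come out statistically significant above. Exponentiating the coefficients converts them into odds ratios, which are easier to read; a minimal sketch (this chunk is not in the original):
#Odds ratios: e.g., exp(0.0756) is about 1.08, so each additional year of age raises the odds of stroke by roughly 8%
exp(coef(model_glm))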
#Check Baseline Accuracy
prop.table(table(train$stroke))
##
## Negative Positive
## 0.9503012 0.0496988
# Predictions on the training set (predict() takes `newdata`; the original `data =` argument is silently
# ignored, so the call falls back to the fitted training values either way)
predictTrain <- predict(model_glm, newdata = train, type = "response")
# Confusion matrix on training data
table(train$stroke, predictTrain >= 0.5)
##
## FALSE TRUE
## Negative 3785 1
## Positive 198 0
(3785+0)/nrow(train) #correct predictions are the diagonal of the matrix above
## [1] 0.9500502
#Predictions on the test set
predictTest = predict(model_glm, newdata = test, type = "response")
# Confusion matrix on test set
table(test$stroke, predictTest >= 0.5)
##
## FALSE TRUE
## Negative 947 0
## Positive 49 1
(947+1)/nrow(test) #correct predictions are the diagonal of the matrix above
## [1] 0.9508526
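Accuracy alone says little when about 95% of observations are negative; class-specific rates from the test confusion matrix are more telling. A minimal sketch (this chunk is not in the original):
cm <- table(test$stroke, predictTest >= 0.5)
sum(diag(cm)) / sum(cm) #overall accuracy
cm["Positive", "TRUE"] / sum(cm["Positive", ]) #sensitivity: only 1 of 50 actual strokes is flagged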
*It was found that the logistic regression model is about 95% accurate on the test set, with age, hypertension, and average glucose level as the significant predictors. Note, however, that only about 5% of individuals had a stroke, so always predicting "Negative" is also about 95% accurate; the model barely improves on that baseline and catches very few actual strokes.
#Load the data into a dataframe
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Project2/Tidy3/nba_player_stats.csv'
df <- read_csv(url(urlfile))
## Rows: 605 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): PLAYER, TEAM
## dbl (11): GP, W, L, MIN, DIST_FEET, DIST_MILES, DIST_MILES_OFF, DIST_MILES_D...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Drop Columns
df2 <- df |>
  select(-c(DIST_FEET, GP, DIST_MILES))
#Turn df into long format
#Note: chaining two pivot_longer calls crosses the two column sets, so each player
#appears once per (game outcome x speed measure) combination
df_longer <- df2 |>
  pivot_longer(cols = W:L,
               names_to = "GAME_OUTCOME",
               values_to = "GAME_CNT") |>
  pivot_longer(cols = AVG_SPEED:AVG_SPEED_DEF,
               names_to = "PLAYER_SPEED",
               values_to = "SPEED")
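The long format produced above lends itself to grouped summaries; a minimal sketch of one such use (this chunk is not in the original):
#Average speed by game outcome type and speed measure
df_longer |>
  group_by(GAME_OUTCOME, PLAYER_SPEED) |>
  summarise(avg_speed = mean(SPEED), .groups = "drop")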
ggplot(df, aes(x = DIST_MILES, y = AVG_SPEED)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "darkred")
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = DIST_MILES, y = W)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "darkred")
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = DIST_MILES_DEF, y = L)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "darkred")
## `geom_smooth()` using formula 'y ~ x'
df |>
  ggplot(aes(x = DIST_MILES, y = W, colour = factor(TEAM))) +
  stat_summary(fun = "mean", geom = "point") +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "darkred")
## `geom_smooth()` using formula 'y ~ x'
The original question was whether the distance a player travels correlates with their average speed; only a slight negative correlation was found. The correlation between games won and distance traveled was then examined instead, and a strong positive correlation was found. That alone is not conclusive, however, so games lost was also plotted against distance traveled, and it likewise showed a positive correlation. To dive deeper, the strength of each regression line was compared. Unexpectedly, greater distance traveled may be associated with a higher chance of losing a game: for example, players who traveled 1 mile on average have about 18 wins but about 30 losses.
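The correlations described above can be quantified directly; a minimal sketch (this chunk is not in the original, and the values are not verified against the plots):
cor(df$DIST_MILES, df$AVG_SPEED) #distance vs. average speed
cor(df$DIST_MILES, df$W) #distance vs. wins
cor(df$DIST_MILES_DEF, df$L) #defensive distance vs. losses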