Welcome to Course Engagement 4!


In this course engagement, we will focus on the following topics:

  1. Visualizing correlations between variables.
  2. Calculating the correlation coefficient.
  3. Conducting a correlation test.
  4. Performing ordinary least squares (OLS) regression.
  5. Interpreting coefficients in OLS regression and making data predictions.
  6. Meeting the assumptions of OLS regression.
  7. Calculating Odds and Odds Ratios in data analysis.
  8. Performing Logistic regression.
  9. Interpreting coefficients in Logistic regression and making data predictions.
  10. Performing Ordinal regression.
  11. Interpreting coefficients in Ordinal regression and making data predictions.

First, let’s load in important libraries for the Course Engagement.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggpubr)
library(ggthemes)
library(corrplot)
## corrplot 0.95 loaded
library(ordinal)
## 
## Attaching package: 'ordinal'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(archdata)

Practice Section:


In the following section, we will explore three scenarios. Have fun! :D

For Dem’ Archaeologists, Can You Predict a Tool’s Weight?


Archaeologists are interested in predicting the weights of stone tools excavated at Fort Hood, Texas. Your goal for this practice is to obtain a regression model that can accurately predict weight. We will use the DartPoints dataset from the archdata package.

Let’s load in the dataset:

data("DartPoints")

Now, let’s examine the data:

head(DartPoints)
##   Name Catalog     TARL  Quad Length Width Thickness B.Width J.Width H.Length
## 1 Darl 41-0322 41CV0536 26/59   42.8  15.8       5.8    11.3    10.6     11.6
## 2 Darl 35-2946 41CV0235 21/63   40.5  17.4       5.8      NA    13.7     12.9
## 3 Darl 35-2921 41CV0132 20/63   37.5  16.3       6.1    12.1    11.3      8.2
## 4 Darl 36-3487 41CV0594 10/54   40.3  16.1       6.3    13.5    11.7      8.3
## 5 Darl 36-3321 41CV1023 12/58   30.6  17.1       4.0    12.6    11.2      8.9
## 6 Darl 35-2959 41CV0235 21/63   41.8  16.8       4.1    12.7    11.5     11.0
##   Weight Blade.Sh Base.Sh Should.Sh Should.Or Haft.Sh Haft.Or
## 1    3.6        S       I         S         T       S       E
## 2    4.5        S       I         S         T       S       E
## 3    3.6        S       I         S         T       S       E
## 4    4.0        S       I         S         T       S       E
## 5    2.3        S       I         S         T       S       E
## 6    3.0        S       E         I         T       I       C

We can also learn more about a dataset obtained from a package by using the ? like so:

?DartPoints

This gives us all of the variables and their definitions.

We need to understand the relationships between variables before we can model them.

Let's think about which variables might be the most important in predicting weight. Perhaps thickness is important: if it's thiccc, then it probably has more weight, right?

We can evaluate this relationship by generating a scatterplot between the tool's thickness (independent variable) and weight (dependent variable). Generate a scatterplot that shows this relationship in the code chunk below. Keep in mind that the X-axis usually denotes the independent variable and the Y-axis the dependent variable.
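
Here is one way you might build that scatterplot, a minimal sketch using ggplot2 (already loaded with the tidyverse) and the DartPoints variable names:

``` r
# thickness (independent) on the X-axis, weight (dependent) on the Y-axis
ggplot(DartPoints, aes(x = Thickness, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Thickness", y = "Weight") +
  ggpubr::theme_pubclean()
```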

Let’s also calculate a correlation statistic between these two variables.

cor(DartPoints$Thickness, DartPoints$Weight)
## [1] 0.6001054

Based on this result, the correlation coefficient is positive, but how would you interpret the strength of the association between these two variables?

Conduct a Correlation Test.

We have a correlation coefficient, but we should evaluate if it is statistically significant at an alpha value of 0.05. Let’s set up our hypotheses:

\(H_0\): The correlation is zero.

\(H_1\): The correlation is not zero.

tested.cor = cor.test(DartPoints$Thickness, DartPoints$Weight)

tested.cor
## 
##  Pearson's product-moment correlation
## 
## data:  DartPoints$Thickness and DartPoints$Weight
## t = 7.0774, df = 89, p-value = 3.242e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4497438 0.7173891
## sample estimates:
##       cor 
## 0.6001054

With a p-value of 3.2e-10, far below our alpha of 0.05, we would reject the null hypothesis for sure.

We know there's a strong correlation with thickness, but what about the other variables?

Sometimes, making scatterplots for every single variable is quite tedious. Instead, let’s make what is called a correlogram which can compare the correlations of multiple variables all at once.

We can do this using the corrplot function from the corrplot package. The main input corrplot requires is what is known as a correlation matrix: a pairwise correlation coefficient for every numeric variable, structured with the same variables duplicated as rows and columns.

Maybe it's best if you see the result. Columns 5 through 11 contain the numeric variables of interest.

cor(DartPoints[,5:11])
##              Length     Width Thickness B.Width J.Width  H.Length    Weight
## Length    1.0000000 0.7689932 0.5890989      NA      NA 0.5091807 0.8799530
## Width     0.7689932 1.0000000 0.5459291      NA      NA 0.4964490 0.8263948
## Thickness 0.5890989 0.5459291 1.0000000      NA      NA 0.4670684 0.6001054
## B.Width          NA        NA        NA       1      NA        NA        NA
## J.Width          NA        NA        NA      NA       1        NA        NA
## H.Length  0.5091807 0.4964490 0.4670684      NA      NA 1.0000000 0.4863970
## Weight    0.8799530 0.8263948 0.6001054      NA      NA 0.4863970 1.0000000

As you can see, we basically created a frequency-table-like structure, but filled with correlation coefficients. The same variables appear as both rows and columns because there can only be one correlation coefficient between any two variables, and you read it just like a frequency table. For example, the correlation coefficient of Length with Length (the exact same variable) is 1, while the correlation coefficient between Length and Width is 0.768…

However, B.Width and J.Width (columns 8 and 9) have a bunch of NA values so we should probably remove these before making a correlogram. This time, let’s save the correlation matrix as correlation_matrix.

correlation_matrix = cor(DartPoints[,c(5,6,7,10,11)],method = "pearson")

Now, we can use the corrplot function! :D

corrplot(correlation_matrix)

In a correlation matrix, the correlation coefficient along the diagonal is always 1 because it's comparing each variable with itself (the big dark blue circles).

Additionally, below and above the diagonal are exactly the same just mirrored. For this reason, correlograms in publications usually only contain the upper or lower side of the correlation matrix.

corrplot(correlation_matrix, 
         type = "upper", 
         diag = FALSE, 
         sig.level =  0.05)

That is much better. I did a couple of things. First, I asked the corrplot function for only the upper triangle with the argument type = "upper". Then, I told corrplot to get rid of the diagonal (which is always r = 1) with the argument diag = FALSE. Finally, I set sig.level = 0.05; note that for corrplot to actually drop correlations that are not statistically significant at that alpha level, it also needs a matrix of p-values from pairwise correlation tests, as sketched below.
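
If you want corrplot to actually hide the non-significant correlations, you can hand it a matrix of p-values. A minimal sketch using corrplot's own cor.mtest() helper on the same five columns:

``` r
# pairwise correlation tests for the same columns used in correlation_matrix
p_values = cor.mtest(DartPoints[, c(5, 6, 7, 10, 11)], conf.level = 0.95)

corrplot(correlation_matrix,
         type = "upper",
         diag = FALSE,
         p.mat = p_values$p,   # matrix of pairwise p-values
         sig.level = 0.05,     # alpha level for the significance filter
         insig = "blank")      # blank out non-significant correlations
```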

With this information, we should consider all of these variables in predicting the weight of a lithic (stone) point (type of tool) because each of them had above moderate correlations and were statistically significant. Because we are predicting a numerical continuous variable (weight), we need to run an Ordinary Least Squares regression.

Reviewing the assumptions of the test:

We met independence of observations already based on this dataset because each observation is only inputted once for each variable.

We confirmed that each variable we want to use as predictor variables are linearly correlated with weight through the corrplot which handled all of the correlation tests for us.

The normality of residuals and homoscedasticity are two assumptions we cannot test until after we carry out a regression. So, we should first conduct the test WITHOUT observing the results. The reason is that we do not want to be biased in what the results have to say UNTIL we confirm we met the assumptions.

Conduct the OLS regression and make sure to replace the --- with the correct answer. Remember the formula for a regression is always Y ~ X.

# conduct the linear regression to confirm the assumptions are met

OLS_reg = lm(Weight ~ Length + Width + Thickness + H.Length, data = DartPoints)

Before we jump into testing the assumptions, I have a tip for you: You can use . to have R use a bunch of predictor variables all at once without having to write them all out. The way it works is like this:

OLS_reg = lm(Weight ~ ., data = DartPoints)

Basically you are telling R, from the DartPoints dataset make Weight = the response/dependent variable and make EVERYTHING ELSE as the predictor/independent variables.

Since we have more variables than we are interested within this dataset, we need to filter the dataset within the lm function. Consider this:

str(DartPoints)
## 'data.frame':    91 obs. of  17 variables:
##  $ Name     : Factor w/ 5 levels "Darl","Ensor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Catalog  : chr  "41-0322" "35-2946" "35-2921" "36-3487" ...
##  $ TARL     : chr  "41CV0536" "41CV0235" "41CV0132" "41CV0594" ...
##  $ Quad     : chr  "26/59" "21/63" "20/63" "10/54" ...
##  $ Length   : num  42.8 40.5 37.5 40.3 30.6 41.8 40.3 48.5 47.7 33.6 ...
##  $ Width    : num  15.8 17.4 16.3 16.1 17.1 16.8 20.7 18.7 17.5 15.8 ...
##  $ Thickness: num  5.8 5.8 6.1 6.3 4 4.1 5.9 6.9 7.2 5.1 ...
##  $ B.Width  : num  11.3 NA 12.1 13.5 12.6 12.7 11.7 14.7 14.3 NA ...
##  $ J.Width  : num  10.6 13.7 11.3 11.7 11.2 11.5 11.4 13.4 11.8 12.5 ...
##  $ H.Length : num  11.6 12.9 8.2 8.3 8.9 11 7.6 9.2 8.9 11.5 ...
##  $ Weight   : num  3.6 4.5 3.6 4 2.3 3 3.9 6.2 5.1 2.8 ...
##  $ Blade.Sh : Factor w/ 4 levels "E","I","R","S": 4 4 4 4 4 4 2 1 1 1 ...
##  $ Base.Sh  : Factor w/ 4 levels "E","I","R","S": 2 2 2 2 2 1 2 2 4 2 ...
##  $ Should.Sh: Factor w/ 4 levels "E","I","S","X": 3 3 3 3 3 2 3 2 2 3 ...
##  $ Should.Or: Factor w/ 4 levels "B","H","T","X": 3 3 3 3 3 3 3 3 2 3 ...
##  $ Haft.Sh  : Factor w/ 5 levels "A","E","I","R",..: 5 5 5 5 5 3 3 5 5 5 ...
##  $ Haft.Or  : Factor w/ 5 levels "C","E","P","T",..: 2 2 2 2 2 1 2 2 2 3 ...
# This tells us which columns contain these variables
which(colnames(DartPoints) %in% 
        c("Weight","Length", "Width", "Thickness", "H.Length"))
## [1]  5  6  7 10 11

Now that we identified which column numbers associate with the variables we are interested in, let’s filter the DartPoints dataset to ONLY include those (including Weight).

OLS_reg = lm(Weight ~ ., data = DartPoints[,c(5,6,7,10,11)])

Assumption of Normality of Residuals

We should use a qqPlot for this since it's easier:

qqPlot(OLS_reg$residuals)

## [1] 64 91

Uh oh, we can clearly see some outlying observations (the qqPlot flags rows 64 and 91) :/

This means we violated this assumption and shouldn’t carry out the test. We could remove these observations if we wanted to and examine the regression. However, I want to teach you simply how to conduct it and interpret it so let’s continue.

Assumption of Homoscedasticity:

We will need a scatterplot of the fitted values by the residuals (I showed you this plot for normality in the lecture videos but we can use it for Homoscedasticity).

plot(OLS_reg,1)

If the red line is not roughly flat (it goes wooshy-woosh), it is no good. Therefore, we violated this assumption too :(


The Secret but IMPORTANT Assumption of NO Collinearity of Predictor Variables…

I didn’t mention this one in lecture but I wanted to wait until we reached this engagement to discuss this.

What does this mean? Collinearity between predictors means that one or more predictor variables have a moderate-to-perfect linear relationship with another predictor variable. Because collinear predictors explain much of the same variance, the model's R-squared can OVERESTIMATE the true predictive potential, and the individual coefficient estimates become unstable!

We should use correlograms and, if necessary, chi-square contribution plots (for categorical predictors) to check this. Otherwise, we run into this problem.

In this example, I would remove Length OR Width from the regression model immediately because they are too strongly correlated with each other (r = 0.77). I will show you another example of this with Ordinal Regression!
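
A quick numeric check for collinearity (beyond eyeballing the correlogram) is the variance inflation factor from the car package we already loaded. A small sketch on the four-predictor model from above:

``` r
# rule of thumb: VIF values above roughly 5-10 signal problematic collinearity
car::vif(lm(Weight ~ Length + Width + Thickness + H.Length, data = DartPoints))
```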


F**k it though, let’s see a summary and pretend that we met the assumptions.

summary(OLS_reg)
## 
## Call:
## lm(formula = Weight ~ ., data = DartPoints[, c(5, 6, 7, 10, 11)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1880 -0.6454  0.1430  0.7912  5.8439 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.35701    0.98182  -9.530 4.10e-15 ***
## Length       0.18843    0.02431   7.751 1.68e-11 ***
## Width        0.28760    0.05788   4.969 3.39e-06 ***
## Thickness    0.21731    0.15592   1.394    0.167    
## H.Length    -0.01678    0.05601  -0.300    0.765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.759 on 86 degrees of freedom
## Multiple R-squared:  0.8329, Adjusted R-squared:  0.8252 
## F-statistic: 107.2 on 4 and 86 DF,  p-value: < 2.2e-16

From this, we can already tell that Thickness and H.Length, despite being significantly correlated, were just not as important in making data predictions as Width and Length. Let’s remove Thickness and H.Length and see what happens next.

OLS_reg = lm(Weight ~ ., data = DartPoints[,c(5,6,11)])
summary(OLS_reg)
## 
## Call:
## lm(formula = Weight ~ ., data = DartPoints[, c(5, 6, 11)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9859 -0.7710  0.0744  0.8204  5.1522 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.70453    0.83199 -10.462  < 2e-16 ***
## Length       0.19761    0.02277   8.679 1.88e-13 ***
## Width        0.29892    0.05624   5.316 7.96e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.759 on 88 degrees of freedom
## Multiple R-squared:  0.8292, Adjusted R-squared:  0.8253 
## F-statistic: 213.6 on 2 and 88 DF,  p-value: < 2.2e-16

How would we interpret their coefficients?

Holding all else equal, an increase of 1 mm in length results in a 0.20 unit increase in weight.

Holding all else equal, an increase of 1 mm in width results in an increase of 0.30 units in weight.

Interpret the meaning of the R-squared for this model

Adjusted R-squared = 0.8253. Our model, which accounts for length and width, explains roughly 82.5% of the variation in the outcome, weight.

You are given the measurements for five new points. They have the length and width measurements specified in the following vectors called Length and Width. Use your regression model to predict new values of Weight for these 5 artifacts.

Using the predict function, let’s predict what the weights of these points should be! Replace --- with the appropriate answer.

newdata = data.frame(Length = c(31, 35, 34, 50, 54), 
                     Width = c(15, 18, 20, 26, 28))

Prediction = predict(OLS_reg, newdata)

Prediction
##         1         2         3         4         5 
##  1.905123  3.592319  3.992560  8.947802 10.336074

Now predict estimated weight values for the original data (DartPoints). Include a 95% prediction interval. Bind the predicted data to the original data, and save the resulting dataframe as points.fitted . Replace --- with the appropriate answer.

Prediction = predict(OLS_reg, interval = "prediction")
## Warning in predict.lm(OLS_reg, interval = "prediction"): predictions on current data refer to _future_ responses
points.fitted = data.frame(Original = DartPoints$Weight,
                            fit = Prediction[,"fit"],
                            lwr = Prediction[,"lwr"],
                            upr = Prediction[,"upr"])
points.fitted
##    Original       fit         lwr       upr
## 1       3.6  4.476015  0.92531040  8.026720
## 2       4.5  4.499799  0.97013150  8.029466
## 3       3.6  3.578164  0.03998064  7.116347
## 4       4.0  4.071677  0.53045259  7.612901
## 5       2.3  2.453820 -1.10279729  6.010437
## 6       3.0  4.577333  1.04155240  8.113113
## 7       3.9  5.446726  1.91956841  8.973883
## 8       6.2  6.469250  2.93811282 10.000387
## 9       5.1  5.952456  2.40902053  9.495892
## 10      2.8  2.658038 -0.88846517  6.204541
## 11      2.5  2.032309 -1.52481144  5.589430
## 12      4.8  3.998743  0.43018829  7.567298
## 13      3.2  2.877416 -0.66663712  6.421469
## 14      3.8  4.547440  1.01060675  8.084274
## 15      4.5  3.736752  0.20026212  7.273242
## 16      4.4  3.182951 -0.35738116  6.723284
## 17      2.5  2.123998 -1.42995086  5.677947
## 18      2.3  2.865776 -0.67841498  6.409967
## 19      4.2  3.037513 -0.50777376  6.582799
## 20      3.3  2.401652 -1.14855417  5.951859
## 21      3.6  4.742961  1.21153052  8.274391
## 22      7.4  6.769682  3.25278573 10.286578
## 23      5.6  5.393553  1.87301684  8.914089
## 24      4.8  4.961836  1.42810343  8.495569
## 25      7.8  7.362501  3.84011802 10.884883
## 26      9.2  8.283130  4.75457372 11.811686
## 27      6.2  6.994592  3.46354625 10.525637
## 28      4.3  4.640135  1.11429575  8.165974
## 29      4.6  5.899712  2.38180364  9.417619
## 30      5.4  5.832309  2.31059085  9.354028
## 31      5.9  7.117681  3.54863807 10.686724
## 32      5.1  5.790777  2.27230819  9.309245
## 33      4.7  5.222244  1.67134078  8.773148
## 34      7.2  8.929122  5.40770817 12.450535
## 35      2.5  4.035676  0.50466142  7.566690
## 36      3.9  5.253720  1.73111945  8.776320
## 37      4.1  4.588899  1.02477611  8.153021
## 38      7.2  9.039993  5.47309287 12.606894
## 39     10.7 10.177486  6.62699296 13.727980
## 40     12.5 12.468210  8.92146407 16.014957
## 41     13.4 12.560402  9.01420472 16.106599
## 42     11.1  9.769702  6.23551802 13.303887
## 43      7.2 11.491819  7.95341866 15.030220
## 44     28.8 27.670296 23.63596688 31.704625
## 45     13.9 12.588208  8.99038026 16.186036
## 46      9.4  9.291853  5.76845492 12.815251
## 47      5.3  6.208264  2.68986371  9.726665
## 48      7.9  6.737201  3.20368956 10.270712
## 49      7.3  8.271918  4.74764354 11.796192
## 50     12.2 10.383716  6.84562223 13.921810
## 51      9.3 14.231581 10.35170165 18.111460
## 52     11.1  8.821621  5.25918151 12.384060
## 53     14.8 15.167799 11.53998196 18.795615
## 54     10.7 10.639598  7.10844434 14.170752
## 55     11.1 10.419140  6.82578708 14.012492
## 56     12.3 10.902092  7.37130792 14.432877
## 57     13.1 14.294234 10.72252477 17.865943
## 58      6.1  7.047839  3.53272266 10.562956
## 59      9.2  7.728249  4.19598612 11.260512
## 60      9.4  9.156620  5.63256303 12.680677
## 61      6.7  6.850233  3.33541034 10.365056
## 62     15.3 12.675446  9.12330953 16.227582
## 63     15.1 12.092759  8.55258533 15.632932
## 64      4.6 13.585868 10.00512268 17.166614
## 65      4.3  3.820823  0.28296041  7.358686
## 66     11.6 11.808492  8.26834666 15.348638
## 67     10.5  8.740138  5.21648010 12.263797
## 68      6.8  7.042736  3.51972652 10.565746
## 69      9.1  8.159388  4.57693839 11.741838
## 70      9.4  8.710172  5.17084726 12.249496
## 71      9.5  8.767517  5.22936580 12.305667
## 72     10.4  8.780665  5.26046500 12.300866
## 73      7.5  6.811717  3.29675802 10.326677
## 74      8.7  8.327251  4.74713415 11.907368
## 75      6.9  7.394907  3.88043357 10.909381
## 76     15.0 10.487697  6.89647801 14.078915
## 77     11.4 11.177810  7.52593242 14.829688
## 78      6.3  4.719680  1.19459227  8.244768
## 79      7.5  5.235468  1.71401728  8.756918
## 80      5.9  5.795377  2.26194204  9.328812
## 81      5.4  5.547540  2.01183042  9.083250
## 82      9.5 10.636507  7.05009795 14.222915
## 83      5.4  5.325648  1.80392786  8.847367
## 84      7.1  6.943430  3.42116706 10.465693
## 85      9.7  9.777748  6.25083611 13.304660
## 86     12.6 11.721907  8.17058683 15.273227
## 87     10.5 10.228220  6.70275685 13.753683
## 88      5.6  8.012087  4.45544696 11.568727
## 89      4.9  5.995498  2.47253701  9.518458
## 90      5.2  7.305230  3.78995157 10.820509
## 91     16.3 11.147843  7.60664537 14.689041

Use mutate to create a new column called correct.est. Code this new column as follows: it should be coded as "Correct" if the true value of Weight falls in the 95% prediction interval, and "Not Correct" if the true value of Weight does not fall in the 95% prediction interval. Make sure to save the new dataframe.

Hint: use an ifelse() statement and two conditional statements!

points.fitted = points.fitted %>% 
  mutate(correct.est = ifelse(Original >= lwr & Original <= upr, "Correct", "Not Correct"))

Let’s create a plot that shows the relationship between width and weight. Add a smoothed linear line. Add points onto the scatter plot, and color them by your new correct.est variable in order to see which original datapoints were correctly/incorrectly estimated by the model.

points.fitted %>% 
  mutate(Width = DartPoints$Width) %>%
  ggplot(., aes(x = Width, y = Original)) + 
    geom_point(aes(col = correct.est)) +   
    geom_smooth(method = "lm") +
    ggpubr::theme_pubclean()
## `geom_smooth()` using formula = 'y ~ x'


Those Who Lived Inside The Wall

Earlier in this course, we examined the Snodgrass dataset without realizing it. Now it is time to analyze this dataset on the Snodgrass site, which contains Mississippian (meta-cultural complex) house pits (imprinted changes in soil color that indicate a home once existed there) from around 1325-1420 CE, along with archaeological artifacts.

Our goal is simple: we want to know whether families that lived during this time had higher odds of living on the inside of the walls if they had larger homes and more access to ceramic pottery. Living on the inside of the walls in several Mississippian cities likely indicated higher familial prestige and trust. It was also safer from attacks, thanks to the towering, wide wooden palisades that surrounded them.

To test this, we can perform a logistic regression. But first, let's do some exploratory data analysis. Let's make a boxplot examining the differences in Area between houses on the inside and outside of the palisade wall.

To create our boxplot, we need data, yes? Let’s load it innnnnnn.

data("Snodgrass")
ggplot(Snodgrass, aes(Inside, Area, fill = Inside)) + 
  geom_boxplot() + 
  labs(x = "House Position to the Palisade Wall") + 
  ggpubr::theme_pubclean()

We see a pretty clear difference. But this doesn’t tell us about odds at all. It can inform a bit, but we need to perform a regression to fully understand. Replace the --- with the appropriate answer

logit_reg = glm(Inside ~ Area, data = Snodgrass, family = "binomial")

summary(logit_reg)
## 
## Call:
## glm(formula = Inside ~ Area, family = "binomial", data = Snodgrass)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.663071   1.818444   4.764 1.90e-06 ***
## Area        -0.034760   0.007515  -4.626 3.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 123.669  on 90  degrees of freedom
## Residual deviance:  57.728  on 89  degrees of freedom
## AIC: 61.728
## 
## Number of Fisher Scoring iterations: 6

From our summary, the coefficient for Area is negative. Why is it negative? Let's exponentiate the coefficients to understand the odds.

exp(logit_reg$coefficients)
##  (Intercept)         Area 
## 5785.2744390    0.9658375

The exponentiated coefficient suggests that, holding all else equal, each one-unit increase in house area multiplies the odds of the house being outside the wall by about 0.97, roughly a 3% decrease in those odds!

Given this is a statistically significant result, we can clearly see that having a smaller house area is associated with being outside of the wall.

However, this doesn't match how we framed our hypothesis. We need to reorganize the levels so that Outside is the baseline category and Inside the walls is the modeled outcome.

# first check what are the levels
levels(Snodgrass$Inside)
## [1] "Inside"  "Outside"

AHA, it's reversed. Let's fix it!

Snodgrass = Snodgrass %>% 
  mutate(Inside = factor(Inside, levels = c("Outside", "Inside")))

Now, let’s redo it…

logit_reg = glm(Inside ~ Area, data = Snodgrass, family = "binomial")

summary(logit_reg)
## 
## Call:
## glm(formula = Inside ~ Area, family = "binomial", data = Snodgrass)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.663071   1.818444  -4.764 1.90e-06 ***
## Area         0.034760   0.007515   4.626 3.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 123.669  on 90  degrees of freedom
## Residual deviance:  57.728  on 89  degrees of freedom
## AIC: 61.728
## 
## Number of Fisher Scoring iterations: 6

and let’s exponentiate the coefficients again.

exp(logit_reg$coefficients)
##  (Intercept)         Area 
## 0.0001728526 1.0353708193

Much better!
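
If you prefer probabilities to odds, predict() can return them directly. A minimal sketch, assuming the releveled logit_reg from above and a purely hypothetical house area of 300:

``` r
# predicted probability that a house with Area = 300 sits inside the wall
predict(logit_reg, newdata = data.frame(Area = 300), type = "response")
```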

Making an Odds Plot!

Would you like to know how I made that odds plot in the lecture videos, and how to apply it to this data? It's comically easy!

# Start with your logistic regression output and make a prediction

log_odds = predict(logit_reg, Snodgrass, type = "link")

# now we have the log odds for each observation, but we need to exponentiate!

odds <- exp(log_odds)

# add this variable to the Snodgrass dataset!

Snodgrass %>% mutate(Odds = odds) %>%
ggplot(., aes(x = Area, y = Odds)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "loess", se = FALSE, color = "blue", formula = y ~ x) +
  labs(title = "Logistic Regression Plot",
       x = "House Area",
       y = "Odds of Living Inside the Wall") +
  ggpubr::theme_pubclean()

As we can see, the odds of living inside the wall only start to climb meaningfully once house area passes about 350. This is an important finding.


Rihanna Ordained “Shine Bright Like a Diamond”!


You heard her and so we shall by looking at the shiniest of the shiny:

Diamonds! :D

We will work with the diamonds.csv file, which contains interesting information on shinies!

This is a subset of the diamonds data that comes pre-loaded with the tidyverse. It contains information about the carat, cut, color, clarity, and sale price of roughly 54,000 diamonds.

Maybe I am internally basic, but I shine brighter when the quality of a diamond's cut is brilliant! If I'm gonna shine bright, it's gotta be superb! So let's try to predict a diamond's cut quality based on the variables we have.

That’s right, we will use ordinal regression!

Start by loading the data, assigning it to the name diamonds, and taking a look at the first few rows of the dataset

diamonds <- read.csv("diamonds.csv")
head(diamonds)
##   X ...1 carat       cut color clarity depth table price    x    y    z
## 1 1    1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2    2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3    3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4    4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5    5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6    6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

We don’t need to guess. Let’s make a correlogram with diamond cut quality and its price, carat size, and depth.

You might be saying… but that's a categorical ordinal variable. Yeah, and? We can make it numeric by converting it to a numeric factor, since it's ranked. We just need to be certain that the levels for the diamond cut rating are ordered from lowest to highest.

Normally this wouldn't be appropriate, and you should treat results obtained this way with extreme caution, but it gives us a general idea.

levels(diamonds$cut)
## NULL
# the result is null, so we need to transform it as a factor....

levels(as.factor(diamonds$cut))
## [1] "Fair"      "Good"      "Ideal"     "Premium"   "Very Good"

Ehh, what are these rating levels!? Let’s change this up so Ideal is the lowest of the categories!

diamonds = diamonds %>% 
  mutate(cut = factor(cut, levels = c("Ideal", "Fair",
                      "Good", "Very Good", "Premium")))

Okay, let’s construct the plot..

diamonds %>% 
  mutate(cut = as.numeric(cut)) %>% 
  select(cut, price, carat, depth) %>%
  cor(.) %>%
  corrplot(., type = "upper", diag = FALSE)


The IMPORTANT Assumption of Regression is No Collinearity of Variables!

Remember that this means that no predictor variable should have a strong linear relationship with another predictor variable. Clearly, price and carat have a very strong relationship. Therefore, we need to remove one of them before fitting our model. Since carat has the stronger correlation with cut, we should keep carat and remove price instead.


For fun-sies, let’s add color into the regression tooooo.

ord_reg = clm(cut ~ carat + color + depth, data = diamonds)

summary(ord_reg)
## formula: cut ~ carat + color + depth
## data:    diamonds
## 
##  link  threshold nobs  logLik    AIC       niter max.grad cond.H 
##  logit flexible  53900 -72994.31 146012.63 6(2)  9.09e-10 2.3e+07
## 
## Coefficients:
##         Estimate Std. Error z value Pr(>|z|)    
## carat   0.628570   0.017842  35.229   <2e-16 ***
## colorE  0.060098   0.028821   2.085   0.0370 *  
## colorF  0.007115   0.029027   0.245   0.8064    
## colorG -0.048895   0.028357  -1.724   0.0847 .  
## colorH  0.068076   0.030274   2.249   0.0245 *  
## colorI -0.078330   0.033956  -2.307   0.0211 *  
## colorJ  0.049322   0.041402   1.191   0.2335    
## depth  -0.126412   0.004949 -25.544   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                   Estimate Std. Error z value
## Ideal|Fair         -7.7302     0.3067  -25.20
## Fair|Good          -7.6055     0.3068  -24.79
## Good|Very Good     -7.2299     0.3066  -23.58
## Very Good|Premium  -6.2099     0.3059  -20.30

Interpret the coefficients

Let’s exponentiate them first of course.

coefs = exp(ord_reg$coefficients)
as.matrix(coefs) # better than using data.frame in this case
##                           [,1]
## Ideal|Fair        0.0004393551
## Fair|Good         0.0004976838
## Good|Very Good    0.0007245718
## Very Good|Premium 0.0020094243
## carat             1.8749280424
## colorE            1.0619404305
## colorF            1.0071406692
## colorG            0.9522813299
## colorH            1.0704463714
## colorI            0.9246588329
## colorJ            1.0505587480
## depth             0.8812520338

How can we interpret an ordinal regression coefficient? Like so:

Holding all else equal, for each one-unit increase in carat, the odds of a diamond being in a higher cut-quality category are 1.87 times the odds of it being in a lower one. In other words, there is an 87% increase in the odds of a higher cut-quality rating for each additional carat!

That makes sense! What about color? Color "D" is the baseline because it is the first level alphabetically, so we would interpret this as:

Holding all else equal, the odds of a color-E diamond being in a higher cut-quality category are 1.06 times the corresponding odds for a color-D diamond.
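
If you also want uncertainty around these odds ratios, one option is to exponentiate profile confidence intervals from the ordinal package's confint method for clm objects. A quick sketch (profiling may take a moment on roughly 54,000 rows):

``` r
# 95% confidence intervals for the odds ratios of the regression coefficients
exp(confint(ord_reg))
```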

Let’s Make a Prediction!

Using our final ordinal model, let's predict what the cut quality would be for a diamond that has a color rating of E, a depth of 62, and a weight of 0.23 carats.

First, create a new prediction dataset with the single observation, then use the predict function to acquire the new prediction. Replace the --- with the appropriate answer.

newdata = data.frame(color = "E", depth = 62, carat = 0.23)
predict(ord_reg, newdata)
## $fit
##       Ideal       Fair       Good Very Good   Premium
## 1 0.4756814 0.03114458 0.09256341 0.2064102 0.1942004

Which category of cut quality has the highest probability for this new diamond?

------------------------------------------------------------------------

Very Excellent! Now, give this all a try in the graded section!

Graded Section:

library(tidyverse)
library(ggpubr)
library(ggthemes)
library(corrplot)
library(ordinal)
library(car)
library(archdata)

The following graded sections will target real anthropological datasets. We will explore three different scenarios, each on a separate question. Unlike other course engagements, this one is a little different.

For two of the three scenarios, I want you to practice reading about what variables mean from datasets. Most of the time, when we search for anthropological data, we need to look up what the variables in our dataset mean. In the Files panel, I added two texts: Bone CS data information.txt, which describes the study and variables the dataset is based on, and BRFSS Variables.html, which provides the variables and values for the BRFSS dataset.


Can We Use Bone Biomarkers to Predict Age?


Have you ever heard of forensic anthropologists informing law enforcement about a victim's stature even when the only human remains uncovered were bones? That's because biological anthropologists have found strong predictive relationships between the dimensions of bones and how a person ages over time (well, as long as they had a good diet and weren't super stressed out during childhood, which is why I'm short).

For this Graded Section, we will create a regression model that can best predict an individual's age from skeletal measurements, using the Bone Cross_Sections dataset, which derives from the study Bone Structural Data for the Denver Longitudinal Growth Study. It is a longitudinal study, meaning the dataset contains repeated measurements of the same individuals at different points in their lives. The dataset contains a total of 20 individuals. We will use this data to predict age.

Let’s load in the data and get an idea of the data table’s structure:

```{r}
library(readxl)  # read_xlsx() comes from readxl, which library(tidyverse) does not load

Bones = read_xlsx("Bone Cross_Sections.xlsx")
str(Bones)
```

As you can see, some of the numeric variables have a dot (.) as a value. These are measurements that could not be taken because the individuals were too young, so we should change them to NA.

Because there are a lot of variables, I will use the mutate_all function in R to do this automatically for every variable in the dataset.

```{r}
Bones <- Bones %>% mutate_all(~ ifelse(. == ".", NA, .))
```

Most of these variables are numeric, so we could throw them all into one massive correlogram to find out which are most important in predicting age. Instead, let's take a more controlled approach.

Because corrplots can be very big, let’s break up our corrplot into 2 different ones. The first one contains our age variables and the first 20 numerical variables and the second one contains the age variables and the last 20 numerical variables.

Step 1: create a new dataset called Bones_num with only numeric variables. For age, we will use the rounded ages of individuals from the Examagernd variable.

```{r}
Bones_num = Bones[,4:42] %>% mutate_all(as.numeric)

cor_1 = cor(Bones_num[,4:23])

cor_2 = cor(Bones_num[,c(4,24:39)])

corrplot(cor_1)
```

Problematically, the ? means that there are NA values in the variables. Watch what happens if we attempt to remove them.

```{r}
na.omit(Bones_num)
```

We get nothing left. This means our dataset is complicated and we cannot just massively generate correlations to predict age. We have a few options:

Option 1: do every comparison manually via a correlation. This would take a while with 42 variables, but it is doable. It would be pointless with over 100 variables, but with fewer than about 40 it is only a few minutes of work.

```{r}
corrs = matrix(NA, nrow = 39, ncol = 2)
colnames(corrs) <- c("Comparison", "r")

for(variable in 2:39) {
  corrs[variable, 1] = paste("Age", "~", colnames(Bones_num)[variable])
  # use = "complete.obs" drops any rows with an NA in either column
  number = Bones_num %>% 
    select(1, variable) %>% 
    cor(., use = "complete.obs", method = "pearson")
  corrs[variable, 2] = number[1,2]
}

corrs[1,1] = paste("Age", "~", "Age")
corrs[1,2] <- 1.00

corrs
```

[**Doing something like this is a bit insane**]{.underline}, but it can be done if you're an expert like me. Basically, from this output we don't get any information about collinearity between predictor variables, so that is hard to tell. Based on these results, as long as we have non-NA values, we can use almost all of these variables for prediction, as they all have strong to very strong correlations.

------------------------------------------------------------------------

**Option 2:** Think critically about which variables would return the best results.

**For a long time, longbone measurements** (such as our femur (thigh bone), tibia (shin bone), humerus (arm bone) and clavicle (collarbone) **are great bones for predicting age** based off of length, width, epiphyseal cap development, and even their bony cross-section. **Bone cross-sections are argued to get bigger with age** as individuals stress their bones more and more. However, **the actual relationship between age and cross-section thickness is not 1:1** and is externally influenced by the environment, cultural and habitual practices.

Why not use measurements from a classic bone: the femur (thigh bone)?

------------------------------------------------------------------------

**Option 3:** Find which variables have the least amount of NAs and use those.

I use a **for loop (`for(variable in 1:39)`)** here as well. Basically, I had R count up all of the NA values (NA = TRUE) using the `table` function and `is.na`. The `[2]` grabs the TRUE count (how many NAs) from the output of `table` (which gives you the frequency of FALSE and TRUE).

```{r}
NAs = numeric(39)
for(variable in 1:39) {
  NAs[variable] <- table(is.na(Bones_num[,variable]))[2]
}
NAs[1] = 0

na_data = data.frame(Variable = colnames(Bones_num), Num_NAs = NAs) %>%
  arrange(Num_NAs)
na_data
```

From these results, it is very clear that the femur measurements have the fewest NAs and the highest likelihood of being usable in a regression.

Let’s predict age based on measurements of the femur with fewer than 25 NA values.

Question 1 Which variables have fewer than 25 NA values and are thus usable for this prediction? Select the best options on Brightspace. Use the code chunk below to do some testing:

```{r}
na_data %>% filter(Num_NAs < 25)
```


Let’s make a corrplot with all of the eligible variables. Don’t forget to omit all NA values.

I created a code script that you can modify and replace the ---. If it is too advanced for you, try making your own corrplot.

```{r}
index = na_data$Variable[which(na_data$Num_NAs < 25)]

corrs = cor(na.omit(Bones_num[, index]))

corrplot(corrs, type = "upper", diag = FALSE, sig.level = 0.05)
```

`Question2` Based on the corrplot you developed, what is the overall direction of correlations of the variables with age?

You should also create a new dataset called `Bones_trim` with the final selection of variables. Although some variables are clearly collinear, we will not remove them just yet.

```{r}
Bones_trim = Bones_num[, index]
```

Question3 How many variables does Bones_trim have? Input the exact number in Brightspace.


Backwards Variable Elimination Approach.

We are going to perform our regression, but we need to remove any predictor that is not statistically significant. There are several ways to do this, and there is a statistical procedure to follow: Backwards Variable Elimination. We remove variables one at a time: drop the variable with the highest p-value (the larger number, e.g. 0.5 > 0.05), re-run the regression with the remaining variables, and then remove the next non-significant one. Repeat until every variable left has a p-value below 0.05 (an automated version is sketched after the next code chunk).

```{r}
# conduct the full regression as the starting point for elimination
reg = lm(Examagernd ~ ., data = Bones_trim)
summary(reg)
```
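
One way to automate the procedure is a small loop that repeatedly drops the predictor with the largest p-value. This is only a sketch of the idea, assuming `reg` is the full model fit above and using the same 0.05 cutoff:

```{r}
# backwards variable elimination, automated:
# repeatedly refit, dropping the least significant predictor,
# until everything left has p < 0.05
reg_step = reg
repeat {
  coefs  = summary(reg_step)$coefficients
  p_vals = coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)", drop = FALSE]
  if (max(p_vals) < 0.05) break                 # all remaining predictors are significant
  worst = rownames(p_vals)[which.max(p_vals)]   # predictor with the largest p-value
  reg_step = update(reg_step, as.formula(paste(". ~ . -", worst)))
}
summary(reg_step)
```

Doing it by hand (refit, check, remove, repeat) gets you to the same place; the loop just saves typing. If you are happy with the reduced model, you can assign it back to `reg` before moving on.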

Question4 Which variables will be kept in the final regression model? Select all that apply on Brightspace.


Question5 Which of the following ranges best describes your final “Adjusted R-squared” value for your regression? Select the best option on Brightspace.


Question6 How would you interpret the coefficient for F50TA (The cortical bone area in the middle of the femur bone)? Select the best option on Brightspace.

Evaluate the prediction accuracy.

Let’s take the original ages from the variable Examagernd (rounded ages) and then calculate the 95% prediction interval using the predict function in R.

I will start the code for you, then you should replace the ---.

```{r}
pred_int = predict(reg, newdata = Bones_trim, interval = "prediction")

pred_data = data.frame(Age = Bones_trim$Examagernd, pred_int)
```

Now, let's calculate whether the 95% prediction interval for ages accurately contains the true, original ages. This is just like what was done in lecture.

```{r}
pred_data = pred_data %>% 
  mutate(correct.est = ifelse(Age >= lwr & Age <= upr, "Correct", "Not Correct"))

table(pred_data$correct.est)
```

Question7 How many observations were correctly estimated in the 95% prediction interval? Write the exact number in Brightspace.



Predicting Titanic Survival (Pre-2023)


Have you ever wondered what your chances of surviving the Titanic would have been based on your ship class? One could argue that if you were in an upper class (and thus closer to the deck), you had a better shot at getting onto a lifeboat than those in the lower classes (further down in the ship, where the water would flow in). We can calculate the odds of survival for a passenger in an upper ship class compared to the lower ship classes.

```{r}
data("TitanicSurvival")
Titanic = as.data.frame(TitanicSurvival)
```

Use the following code block to examine the dataset to understand it better. You can also use ?TitanicSurvival to get a description.

Additionally, I want you to rename the variable passengerClass to Class to make it simpler, and rename survived to Survived.

```{r}
Titanic = rename(Titanic, Survived = survived, Class = passengerClass)
```

Exploratory Data Analysis


Let’s first try to understand the distribution of this categorical data. Let’s make a plot that can show us how many individuals were in each ship class and compare each ship class separately whether they survived or not.

I will start out a code for you, but feel free to make your own. Be sure to replace the ---. If you do make your own, delete this code and change it to how you like.

```{r}
ggplot(Titanic, aes(x = Class, y = after_stat(count), fill = Survived)) + 
  geom_bar(position = "dodge") + 
  scale_fill_manual(values = c("blue", "orange")) + 
  labs(x = "Ship Class by Survival", y = "Number of Observations") + 
  ggpubr::theme_pubclean()
```

Now we can see better who ended up making it out alive and who did not. Let's also make a frequency table where our outcome variable is Survived (columns) and our exposure variable is Class (rows).

```{r}
freq_titanic = table(Titanic$Class, Titanic$Survived)
freq_titanic
```

Is there a statistically significant association (alpha = 0.05) between ship class and survival?

Question8 What kind of hypothesis test should we run here to answer this question? Select the best option on brightspace.


Hypothesis:

\(H_0\):There is no association between ship class and survival

\(H_1\):There is an association between ship class and survival.

In the code chunk below, conduct the test and then conduct a posthoc test that will help you understand which groups contribute the most to survival.

```{r}
titanic.test = chisq.test(table(Titanic$Class, Titanic$Survived))
titanic.test

# posthoc: how much does each cell contribute to the chi-square statistic?
contrib = titanic.test$residuals^2 / titanic.test$statistic
corrplot(contrib, is.corr = FALSE)
```

`Question9` Which value in the variable Class appears to be driving the greatest difference in the value "yes" for the variable Survived? Select the best answer on Brightspace.

------------------------------------------------------------------------

Now that we can see there is a difference, it is very likely that the odds of survival for this class will be much higher than the rest. We can test this using regression.

`Question10` What statistic should we conduct if we wanted to describe the odds of survival for a particular ship class? Select the best choice on Brightspace.

In the code chunk below, conduct this statistic. Save the output as `titanic.test`

```{r}
titanic.test = glm(Survived ~ Class, data = Titanic, family = "binomial")
```

Now, view a summary of this test:

```{r}
summary(titanic.test)
```

Of course, we want to see whether the odds of surviving in a higher class were much higher than in a lower class. Based on this output, answer this question:

Question11 Which of the following is TRUE regarding this summary? Select the best option on Brightspace.

  1. Holding all else equal, the odds of not surviving the Titanic are 0.4861 times the odds of surviving the Titanic.
  2. Holding all else equal, the odds of surviving the Titanic are 0.4861 times the odds of not surviving the Titanic.
  3. The baseline category for Class is 3rd class.
  4. The baseline category for Class is 2nd class.
  5. The baseline category for Class is 1st class.
  6. The R-square for this test is 0.48.

Now, we can clearly see based on our question and the summary output, we need to do something different. In the following code chunk:

  1. make this change. :D
  2. rerun the statistic. :P
  3. create a new summary. :]

```{r}
Titanic = Titanic %>% 
  mutate(Class = factor(Class, levels = c("3rd", "2nd", "1st")))

titanic.test = glm(Survived ~ Class, data = Titanic, family = "binomial")
summary(titanic.test)
```

`Question12` Interpret the coefficient for 1st Class. Select the best option on Brightspace.

1.  Holding all else equal, the log odds of surviving the Titanic in 1st Class are 1.56 times the odds of surviving the Titanic in comparison to 3rd Class.
2.  Holding all else equal, the log odds of surviving the Titanic in 1st Class are 0.78 times the odds of surviving the Titanic in comparison to 3rd Class.
3.  Holding all else equal, the log odds of surviving the Titanic in 1st Class are 1.56 times the log odds of surviving the Titanic in comparison to 3rd Class.
4.  Holding all else equal, the log odds of surviving the Titanic in 1st Class are 1.56 times the odds of surviving the Titanic in comparison to 3rd Class.
5.  Holding all else equal, the log odds of surviving the Titanic in 3rd Class are 1.56 times the log odds of surviving the Titanic in comparison to 1st Class.
6.  Holding all else equal, the odds of surviving the Titanic in 3rd Class are 0.78 times the odds of surviving the Titanic in comparison to 1st Class.

------------------------------------------------------------------------

`Question13` To better interpret the coefficients, what should we do next?

1.  The coefficients provided in the summary function are already interpretable in terms of raw units.
2.  Nothing unless we want to make a prediction using new data.
3.  I want to click a different option to *seem* unpredictable and cool.
4.  Take the log of the coefficients.
5.  Take the squared residuals of the coefficients and divide it by the test statistic.
6.  Take the exponent of the coefficients.
7.  Toss that bad boi in a corrplot and hope for the best.
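
For reference, here is a quick sketch of what that next step looks like, assuming `titanic.test` is the releveled logistic regression fit above:

```{r}
# odds (ratios) instead of log odds
exp(titanic.test$coefficients)
```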

------------------------------------------------------------------------

#### Building a fuller model including variables: sex and age.

While your ship class did matter in terms of your survival, what about reported sex and age?

Let's investigate this as well. We will need to revise our statistic and include the new variables. Save the new test under a different name.

```{r}
new.reg <- glm(Survived ~ Class + sex + age, data = Titanic, family = "binomial")
summary(new.reg)
```

Now, let’s make the results more interpretable so we can talk about the odds of surviving based on age.

```{r}
exp(new.reg$coefficients)
```

Did you notice that when we added more context to the regression, we now have new coefficients for 2nd Class and 1st Class?

Question14 What is the new coefficient for 1st Class? Input the answer in Brightspace with only 2 decimal places. So if the coefficient is 0.08122431, write 0.08 in Brightspace.


Question15 Interpret the coefficient for “sexmale”. Select the best option on Brightspace.

  1. Holding all else equal, the log odds of males surviving the Titanic are 0.08 times the odds of females surviving the Titanic.
  2. Holding all else equal, the log odds of females surviving the Titanic in 1st Class are 0.08 times the odds of males surviving the Titanic.
  3. Holding all else equal, the odds of males surviving the Titanic are 0.08 times, or are reduced by 91.77% (1 - 0.08 = 0.9177), the odds of females surviving the Titanic.
  4. Holding all else equal, the odds of females surviving the Titanic are 0.08 times, or are reduced by 91.77% (1 - 0.08 = 0.9177), the odds of males surviving the Titanic.
  5. Holding all else equal, the log odds of surviving the Titanic in 1st Class are 9.78 times the log odds of surviving the Titanic in comparison to 1st Class.
  6. Holding all else equal, the odds of surviving the Titanic in 3rd Class are 0.78 times the odds of surviving the Titanic in comparison to 1st Class.

In conclusion, with those odds, Jack was uhm cooked…or well…frozen.


The CDC Behavioral Risk Factor Surveillance Survey

From the CDC Website,

“BRFSS’s objective is to collect uniform state-specific data on health risk behaviors, chronic diseases and conditions, access to health care, and use of preventive health services related to the leading causes of death and disability in the United States. Factors assessed by the BRFSS in 2022 included health status and healthy days, exercise, screenings of different types of cancer, disability, oral health, chronic health conditions, tobacco use, alcohol consumption, and health-care access (core section). Optional Module topics for 2022 included social determinants of health and health equity, reactions to race, prediabetes and diabetes, cognitive decline, caregiver, cancer survivorship (type, treatment, pain management) and sexual orientation/gender identity (SOGI).”

Here is a list of the variables included in the dataset along with their definitions and values. BRFSS Variables (Use Search to Find Variables).

BRFSS = read_csv("BRFSS.csv")
## New names:
## Rows: 348170 Columns: 18
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (18): ...1, SEXVAR, GENHLTH, PHYSHLTH, MENTHLTH, PERSDOC3, SLEPTIM1, EDU...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Anthropologists using this data might want to predict self-reported general health rating based on the variables collected on this survey.

However, it's kind of hard to figure out which variables to use. Fortunately for us, many of the variables are either numeric or categorical ordinal. When we have practically only categorical ordinal or numerical variables, we can calculate a correlation matrix using the Spearman RANK correlation coefficient (ρ). This is why I covered it in lecture. Let's make a correlation matrix with the Spearman correlations and then create a CORRPLOT :D

```{r}
BRFSS_cors = cor(BRFSS, use = "complete.obs", method = "spearman")

corrplot(BRFSS_cors, type = "upper", diag = FALSE, sig.level = 0.05)
```

`Question16` Examine the corrplot output and Focus on the GENHLTH (General Health Rating) row. Of all of the variables, which variable has the strongest correlation with GENHLTH? Write the name of the variable as it is shown in the corrplot on Brightspace, keeping it case sensitive.

------------------------------------------------------------------------

This variable is very clear, but what about the others? Some are very faded. Luckily for us, we can trim the correlation matrix to only keep variables whose correlation is meaningfully different from no association, and examine those in our further analysis.

So let's trim our correlation matrix: keep a value if its absolute correlation is greater than 0.2 and set the rest to 0.

```{r}
BRFSS_cors[abs(BRFSS_cors) <= 0.2] <- 0

corrplot(BRFSS_cors, type = "upper", diag = FALSE, sig.level = 0.05)
```

Question17 Examine the corrplot output and Focus on the GENHLTH (General Health Rating) row. Of all of the variables, which variables should we consider for future analysis? Select all that apply on Brightspace.


Using these variables, we should build a regression model that can predict GENHLTH from the variables we found to be correlated with it. You need to create a regression and then investigate which of these predictor variables are statistically significant for the model. We will use the Backwards Variable Elimination approach, as I discussed earlier in this Course Engagement.

Now I know you want to immediately create the model…However, you need to consider the variables we are dealing with. For categorical ordinal variables, we haven’t even confirmed that all of the levels for each variable are in order and make logical sense. This can bias our interpretations. We need to evaluate each of the five predictor variables AND our response variable GENHLTH.

If a variable is not ranked from lowest to highest, we need to make it that way. The reason is because this will be our hypothesis:

\(H_0\): There is no difference in the odds of individuals who have higher general health and the predictor variables.

\(H_1\): Individuals who report low number of days of poor physical and mental health, are highly educated, are employed, and have a higher income have higher odds of having better self-reported general health.

There are two other variables we should consider here: SEXVAR and PERSDOC3. These are not categorical ordinal, rather they are categorical nominal variables. Therefore, sometimes the correlation plot might say there is no relationship even though there is. We should conduct a Chi Square for each variable and see if there is a statistical difference or not.

Chi Square for SEXVAR and GENHLTH

Here is our opportunity to relevel GENHLTH first, in case it needs it. Let's examine the levels for GENHLTH.

```{r}
levels(as.factor(BRFSS$GENHLTH))
```

As we can see, it needs some releveling so that the levels run from 1 to 5. Keep in mind that ordinal regression orders the outcome by the factor levels, and here 1 means the best general health!

```{r}
BRFSS = BRFSS %>% 
  mutate(GENHLTH = factor(GENHLTH, levels = c("1", "2", "3", "4", "5")))
```

Now let's conduct the Chi Square. I know you can do this part!

```{r}
sex.test = chisq.test(BRFSS$SEXVAR, BRFSS$GENHLTH)
sex.test
```

As we can see, we actually should include this variable. But let’s examine where the greatest differences lie with a contribution plot!

```{r}
contrib = sex.test$residuals^2 / sex.test$statistic

corrplot(contrib, is.corr = FALSE)
```

This tells us the largest differences are at general health ratings of 1 (amazing) and 4 (meh). Both males and females deviate from the expected count frequencies for these levels.

What about if you have access to a personal medical doctor? Because of the way the variable is coded, I think we should recode it to this:

1 (only 1 Doctor) -> One Doctor

2 (more than one Doctor) -> More Doctors

3 (No Doctor) -> No Doctor

7 (Not Sure) -> Impersonal Doctor

9 (Refused) -> NA

And we will relevel it as a factor in this order:

No Doctor, Impersonal Doctor, One Doctor, More Doctors.

In the following code chunk, make these adjustments using the `mutate`, `ifelse`, and `factor` functions. I setup the code for you. All you gotta do is replace the `---`.


``` r
BRFSS = BRFSS %>% mutate(PERSDOC3 = 
                           ifelse(PERSDOC3 == 1, "One Doctor", 
                           ifelse(PERSDOC3 == 2, "More Doctors",
                           ifelse(PERSDOC3 == 3, "No Doctor",
                           ifelse(PERSDOC3 == 7, "Impersonal Doctor",
                           ifelse(PERSDOC3 == 9, NA, PERSDOC3)))))) %>%
                  mutate(PERSDOC3 = factor(PERSDOC3, 
                                           levels = c("No Doctor", 
                                                      "Impersonal Doctor", 
                                                      "One Doctor", 
                                                      "More Doctors")))

With this, let’s conduct the Chi Square test!

doc.test = chisq.test(BRFSS$PERSDOC3, BRFSS$GENHLTH)

contrib = doc.test$residuals^2 / doc.test$statistic

corrplot(contrib, is.corr = FALSE)

In sum, we should also include this one as well.


Before continuing onwards, we need to clean up the unique values of a few variables that have some nonsensical codes: INCOME3, EDUCA, and EMPLOY1.

I will kindly do this for you :) but take note of what I did.

BRFSS = BRFSS %>% mutate(INCOME3 = ifelse(INCOME3 == 77 | INCOME3 == 99, NA,
                                          INCOME3),
                         EDUCA = ifelse(EDUCA == 9, NA, EDUCA),
                         EMPLOY1 = ifelse(EMPLOY1 == 9, NA, EMPLOY1))

Question18 What regression should we conduct for this? Select the best choice on Brightspace.

Conduct the test then! Don’t forget to…select the variables of interest prior to making your regression. If it doesn’t work, you did something wrong. You can do it yourself or use this code outline, but replace the ---

```{r}
reg.ord <- BRFSS %>% 
  select(GENHLTH, SEXVAR, PERSDOC3, PHYSHLTH, MENTHLTH, EDUCA, EMPLOY1, INCOME3) %>% 
  clm(GENHLTH ~ ., data = .)

summary(reg.ord)
```

Now, let’s examine our coefficients, but not as the log odds but in just odds form.
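A short chunk along the lines of what we did for the diamonds model will do it (assuming `reg.ord` is the model fit above):

```{r}
# exponentiate the log-odds coefficients to get odds (odds ratios)
as.matrix(exp(reg.ord$coefficients))
```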

These are some exciting (interesting, but not happy…) results indeed when we consider the implications. For example, let’s examine the coefficient for Education level.

Question19 Interpret the coefficient for EDUCA (education level) in relation to having a higher general health score.

  1. Holding all else equal, individuals with a higher education level tend to have roughly a 20% decrease in the odds of having a higher self-reported general health.
  2. Holding all else equal, individuals with a higher education level tend to also have roughly a 79.98% increase in the odds of having a higher self-reported general health.
  3. Holding all else equal, individuals with a higher education level have -0.22 times the odds of having a higher self-reported general health than those with a lower education.
  4. Holding all else equal, individuals with a lower education level have -0.22 times the odds of having a higher self-reported general health than those with a higher education.
  5. EDUCA was not a variable that was selected in our regression.

Question20 Out of all the coefficients, which variable (or unique value) is associated with having the highest odds of having a higher self-reported general health rating? Select the best option on Brightspace

  1. SEXVAR
  2. Impersonal Doctor
  3. One Doctor
  4. More Doctors
  5. PHYSHLTH
  6. MENTHLTH
  7. EDUCA
  8. EMPLOY1
  9. INCOME3

Question Extra Credit Knit this document as either an html or pdf and submit it to Brightspace for 2 extra points :)