Loading Required Libraries

For any analysis, we first need to load libraries; they allow us to carry out a variety of tasks. They are like keys to the different rooms of a house: without the key to the master bedroom, for example, nobody can go in and rest there.

a. tidyverse: The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The core tidyverse includes the packages that you’re likely to use in everyday data analyses.

b. ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.

c. modelr: The modelr package provides functions that help you create elegant pipelines when modeling. It is designed primarily to support teaching the basics of modeling within the tidyverse, particularly in R for Data Science.

d. forcats: The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values.

e. splines: Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).

Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between the knots. In other words, splines are a series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).

The R package splines includes the function bs() for creating a b-spline term in a regression model. To run splines smoothly, we need to specify two parameters: the degree of the polynomial and the locations of the knots (a short sketch appears after this list).

f. car: The package contains mostly functions for applied regression, linear models, and generalized linear models, with an emphasis on regression diagnostics, particularly graphical diagnostic methods. There are also some utility functions.

g. lmtest: A collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.

h. broom: The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles. The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation and analysis.
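The splines item above mentions bs(); here is a minimal sketch of how a b-spline term could enter a model, using the schooldata set introduced below. The degree and knot locations are illustrative choices, not tuned values:

 #Cubic b-spline of totteacher; the degree and knots here are illustrative assumptions
 spline_mod <- lm(fsalary ~ bs(totteacher, degree = 3, knots = c(95, 110)), data = schooldata)
 summary(spline_mod)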

library(tidyverse)
library(ggplot2)
library(modelr)
library(forcats)
library(splines)
library(car)
library(lmtest)
library(broom)
library(stargazer)
library(dplyr)
library(AER)

For this demonstration I created a simulated data set about all the high schools in a school district. Let's assume that this school district has 500 high schools (a really big school district!). It has five variables, as follows:

fnteacher = Number of female teachers per high school in the district

fsalary = Average salary of the female teachers by school

femaleperc = Percentage of female teachers by school

totteacher = Total teachers by school

asalary = Average salary of the teachers in the school

If you want to know more about this data set please visit this link: https://rpubs.com/nirmal/758804
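For reference, here is a minimal sketch of how a data set with this structure could be simulated. Every distribution and parameter below is an illustrative assumption; the actual generation code is at the link above.

 #Illustrative simulation only; the real generation code lives at the RPubs link above
 set.seed(1)
 n <- 500
 totteacher <- rpois(n, lambda = 108) #total teachers per school (assumed distribution)
 femaleperc <- sample(60:80, n, replace = TRUE) #percentage of female teachers (assumed range)
 fnteacher <- round(totteacher * femaleperc / 100) #implied number of female teachers
 asalary <- round(rnorm(n, mean = 60000, sd = 6000)) #average teacher salary (assumed parameters)
 fsalary <- round(rnorm(n, mean = 52000, sd = 6700)) #average female teacher salary (assumed parameters)
 schooldata_sim <- data.frame(totteacher, femaleperc, asalary, fsalary, fnteacher)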

file.exists("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/14.310x-Data Analysis for Social Scientists/Programs/simulated_school_data.RData")
[1] TRUE
load(file = "simulated_school_data.RData")
head(schooldata)
  totteacher femaleperc asalary fsalary fnteacher
1        120         74   69301   42765        89
2        107         70   55837   44540        75
3        115         64   65690   52415        74
4        114         68   54404   53446        78
5         97         66   53841   52278        64
6         93         73   61138   53962        68

Checking the Correlation Between Variables

ggstatsplot::ggcorrmat(data = schooldata, matrix.type = "lower", type = "robust")

Looks like there isn't much correlation going on. I am going to change the percentage of female teachers per school into a factor variable.

 schooldata$femaleperc <- factor(schooldata$femaleperc) #changing the integer variable into a factor variable
str(schooldata)
'data.frame':   500 obs. of  5 variables:
 $ totteacher: int  120 107 115 114 97 93 110 98 117 116 ...
 $ femaleperc: Factor w/ 21 levels "60","61","62",..: 15 11 5 9 7 14 12 9 21 6 ...
 $ asalary   : num  69301 55837 65690 54404 53841 ...
 $ fsalary   : num  42765 44540 52415 53446 52278 ...
 $ fnteacher : num  89 75 74 78 64 68 78 67 94 75 ...

As seen, femaleperc is now a factor variable with 21 distinct levels. I want to check how the percentage of female teachers by school relates to the average female teacher salary in that school.

 ggplot(schooldata)+
 geom_point(aes(femaleperc, fsalary))

I have a hard time tracing any pattern. I will use a box plot for each percentage level of female teachers by school to try to understand what's going on between these two variables.

 ggplot(schooldata, aes(x = femaleperc, y = fsalary))+
 geom_boxplot()+ #femaleperc is a factor, so each level gets its own box
 geom_smooth(aes(group = 1), method = "lm", se = FALSE)

This looks much better and is easier to understand. The boxplots summarize the median salary and the spread within each percentage group. The medians zigzag across groups, however, so the plot does not reveal a clear trend.

Attaching the schooldata data set for ease of programming: the attach() command in R attaches a data set so that we do not need to keep specifying which data set a variable comes from. We can call a variable without the data set prefix and R will look it up in the attached data set. (Remember to run detach(schooldata) when finished.)

 attach(schooldata)
 
 #Running Regression
 mod1 <- lm(fsalary ~ femaleperc)
 summary(mod1)

Call:
lm(formula = fsalary ~ femaleperc)

Residuals:
     Min       1Q   Median       3Q      Max 
-13639.5  -6024.4    -41.1   5784.5  12889.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   52077.2     1472.2  35.374   <2e-16 ***
femaleperc61   1532.3     2058.2   0.744    0.457    
femaleperc62   -573.2     1997.0  -0.287    0.774    
femaleperc63  -3516.8     2136.1  -1.646    0.100    
femaleperc64  -1040.7     1997.0  -0.521    0.603    
femaleperc65   1908.4     2082.0   0.917    0.360    
femaleperc66  -1368.3     2015.9  -0.679    0.498    
femaleperc67    793.4     1919.5   0.413    0.680    
femaleperc68  -1531.6     2107.9  -0.727    0.468    
femaleperc69    357.8     2058.2   0.174    0.862    
femaleperc70  -2194.2     1962.9  -1.118    0.264    
femaleperc71   1079.7     1997.0   0.541    0.589    
femaleperc72  -2147.1     2107.9  -1.019    0.309    
femaleperc73   -814.4     1894.7  -0.430    0.668    
femaleperc74  -2493.0     1997.0  -1.248    0.213    
femaleperc75   -202.8     2082.0  -0.097    0.922    
femaleperc76  -2755.0     1906.7  -1.445    0.149    
femaleperc77  -2643.0     2107.9  -1.254    0.210    
femaleperc78   2659.1     2082.0   1.277    0.202    
femaleperc79    871.7     2058.2   0.424    0.672    
femaleperc80   -987.2     1962.9  -0.503    0.615    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6746 on 479 degrees of freedom
Multiple R-squared:  0.05622,   Adjusted R-squared:  0.01682 
F-statistic: 1.427 on 20 and 479 DF,  p-value: 0.104

The average salary for a female teacher in the baseline schools (60% female teachers) was $52,077.20, and that intercept is statistically significant. However, none of the femaleperc coefficients is significant, and the overall F-test (p = 0.104) shows the model does no better than the intercept alone. There was no difference in the average salary of female teachers based on the percentage of female teachers per school. Meaning, schools with 60% female high school teachers and schools with 80% have a similar average salary for their female teachers.
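The model-level statistics quoted above can also be pulled out programmatically. One option is broom's glance(), which returns them as a one-row tibble:

 glance(mod1) #one row containing r.squared, adj.r.squared, the F statistic, its p.value, and more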

Getting Confidence Intervals

The command lm() stores a lot of information about the model that we can use later. For example:

  1. coef() returns the estimated values of our coefficients,
  2. coefci() (from the lmtest package) gives their confidence intervals, and
  3. both the R-squared and the p-values can be read from the summary() output of the linear model.
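For example, coef() returns the estimates as a named vector:

 coef(mod1) #named vector: (Intercept), femaleperc61, ..., femaleperc80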
coefci(mod1)
                 2.5 %     97.5 %
(Intercept)  49184.423 54969.9583
femaleperc61 -2511.926  5576.5453
femaleperc62 -4497.172  3350.7107
femaleperc63 -7714.091   680.4467
femaleperc64 -4964.612  2883.2707
femaleperc65 -2182.611  5999.3725
femaleperc66 -5329.401  2592.7700
femaleperc67 -2978.332  4565.0840
femaleperc68 -5673.454  2610.1727
femaleperc69 -3686.472  4401.9998
femaleperc70 -6051.177  1662.8703
femaleperc71 -2844.252  5003.6307
femaleperc72 -6288.904  1994.7227
femaleperc73 -4537.268  2908.4492
femaleperc74 -6416.932  1430.9507
femaleperc75 -4293.753  3888.2296
femaleperc76 -6501.604   991.5456
femaleperc77 -6784.854  1498.7727
femaleperc78 -1431.896  6750.0867
femaleperc79 -3172.563  4915.9089
femaleperc80 -4844.214  2869.8333

Here we go. We have the 95% (two-tailed) confidence intervals for the intercept and for each category's difference from the baseline. For example, a rough salary range for the schools that have 61% female teachers combines the intercept and femaleperc61 bounds:

  • Lower Bound: 49184.423 + (-2511.926) = $46,672.497
  • Upper Bound: 54969.9583 + 5576.5453 = $60,546.5036

Adding the bounds this way ignores the covariance between the two estimates, however, so the range is wider than a proper confidence interval for that group mean.
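For an exact interval that accounts for that covariance, base R's predict() computes the confidence interval for a group mean directly; a minimal sketch:

 #Exact 95% CI for the mean female salary at schools with 61% female teachers
 newgroup <- data.frame(femaleperc = factor("61", levels = levels(schooldata$femaleperc)))
 predict(mod1, newdata = newgroup, interval = "confidence")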

Exporting the required information in a nicer way

 tidy(coeftest(mod1))
# A tibble: 21 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)    52077.     1472.    35.4   1.10e-135
 2 femaleperc61    1532.     2058.     0.744 4.57e-  1
 3 femaleperc62    -573.     1997.    -0.287 7.74e-  1
 4 femaleperc63   -3517.     2136.    -1.65  1.00e-  1
 5 femaleperc64   -1041.     1997.    -0.521 6.03e-  1
 6 femaleperc65    1908.     2082.     0.917 3.60e-  1
 7 femaleperc66   -1368.     2016.    -0.679 4.98e-  1
 8 femaleperc67     793.     1920.     0.413 6.80e-  1
 9 femaleperc68   -1532.     2108.    -0.727 4.68e-  1
10 femaleperc69     358.     2058.     0.174 8.62e-  1
# ... with 11 more rows

This table looks a little cleaner. Behind the scenes, R expands the factor into a set of dummy (indicator) variables, building a model matrix before it fits the regression.
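To see that expansion directly, modelr's model_matrix() returns the design matrix R builds as a tibble:

 model_matrix(schooldata, fsalary ~ femaleperc) #one indicator column per level after the baseline "60"

Since stargazer is already loaded, it offers another way to export the fitted model as a formatted table; a minimal call:

 stargazer(mod1, type = "text") #prints a publication-style regression table to the console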

Getting and Plotting Prediction

library(modelr)
#Adding prediction and residuals in the existing data table
schooldata1 <- schooldata %>%
 add_predictions(mod1) %>%
 add_residuals(mod1) 
head(schooldata1)
  totteacher femaleperc asalary fsalary fnteacher     pred     resid
1        120         74   69301   42765        89 49584.20 -6819.200
2        107         70   55837   44540        75 49883.04 -5343.037
3        115         64   65690   52415        74 51036.52  1378.480
4        114         68   54404   53446        78 50545.55  2900.450
5         97         66   53841   52278        64 50708.87  1569.125
6         93         73   61138   53962        68 51262.78  2699.219
#Plotting the prediction
 ggplot(schooldata1, aes(femaleperc))+
 geom_point(aes(y = fsalary))+
 geom_line(aes(y = pred, group = 1), colour = "red", size = 2)

 #Plotting Residuals
 ggplot(schooldata1,aes(femaleperc, resid))+
 geom_point()+
 geom_ref_line(h = 0, colour = "red", size = 2.5)

Testing Hypotheses

The linearHypothesis() function from the car package tests linear restrictions on the fitted coefficients. A few examples on mod1:

 linearHypothesis(mod1, c("femaleperc61=0")) #comparing the restricted vs. the unrestricted model
 linearHypothesis(mod1, c("femaleperc61=(Intercept)")) #not saying the intercept is zero; testing whether the coefficient equals the intercept
 linearHypothesis(mod1, c("(Intercept)=0", "femaleperc61=1")) #testing a complex hypothesis: two constraints together

Dummy Variables

Because femaleperc is a factor, lm() expands it into dummy variables automatically; any model that includes it, for example one with a different outcome, gets the same expansion:

 mod3 <- lm(asalary ~ femaleperc)
 summary(mod3)