For any analysis, we first need to load libraries; they allow us to conduct a variety of tasks. They are like keys to the different rooms within a house: without the key to the master bedroom, for example, nobody can go in and rest there.
a. tidyverse: The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The core tidyverse includes the packages that you’re likely to use in everyday data analyses.
b. ggplot2 : ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
c. modelr: The modelr package provides functions that help you create elegant pipelines when modeling. It is designed primarily to support teaching the basics of modeling within the tidyverse, particularly in R for Data Science.
d. forcats: The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values.
e. splines: Polynomial regression only captures a certain amount of curvature in a nonlinear relationship. An alternative, and often superior, approach to modeling nonlinear relationships is to use splines (P. Bruce and Bruce 2017).
Splines provide a way to smoothly interpolate between fixed points, called knots. Polynomial regression is computed between knots. In other words, splines are series of polynomial segments strung together, joining at knots (P. Bruce and Bruce 2017).
The R package splines includes the function bs() for creating a b-spline term in a regression model. To fit a spline we need to specify two parameters: the degree of the polynomial and the location of the knots (a short illustration follows the library calls below).
f. car: The package contains mostly functions for applied regression, linear models, and generalized linear models, with an emphasis on regression diagnostics, particularly graphical diagnostic methods. There are also some utility functions.
g. lmtest: A collection of tests, data sets, and examples for diagnostic checking in linear regression models. Furthermore, some generic tools for inference in parametric models are provided.
h. broom: The broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy tibbles. The concept of “tidy data”, as introduced by Hadley Wickham, offers a powerful framework for data manipulation and analysis.
library(tidyverse)
library(ggplot2)
library(modelr)
library(forcats)
library(splines)
library(car)
library(lmtest)
library(broom)
library(stargazer)
library("dplyr")
library("AER")
For this demonstration I created a simulated (fake) data set. The data set is about all the high schools in a school district. Let's assume that this school district has 500 high schools (a really big school district!). It has the following five variables:
fnteacher = Number of female teachers per high school in the district
fsalary = Average salary of the female teachers by school
femaleperc = Percentage of female teachers by school
totteacher = Total teachers by school
asalary = Average salary of the teachers in the school
If you want to know more about this data set please visit this link: https://rpubs.com/nirmal/758804
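The exact simulation code is at that link; below is a minimal sketch (my own rough assumption, not the author's code) of how a similar data set could be generated:
set.seed(123)
n <- 500
sim <- data.frame(totteacher = round(rnorm(n, mean = 105, sd = 10)),   # total teachers per school
                  femaleperc = sample(60:80, n, replace = TRUE))       # percentage of female teachers
sim$fnteacher <- round(sim$totteacher * sim$femaleperc / 100)          # number of female teachers
sim$asalary <- round(rnorm(n, mean = 60000, sd = 6000))                # average salary, all teachers
sim$fsalary <- round(rnorm(n, mean = 52000, sd = 6500))                # average salary, female teachers
head(sim)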
file.exists("C:/Users/nirma/Documents/EDX courses/MicroMaster MIT/14.310x-Data Analysis for Social Scientists/Programs/simulated_school_data.RData")
[1] TRUE
load(file = "simulated_school_data.RData")
head(schooldata)
totteacher femaleperc asalary fsalary fnteacher
1 120 74 69301 42765 89
2 107 70 55837 44540 75
3 115 64 65690 52415 74
4 114 68 54404 53446 78
5 97 66 53841 52278 64
6 93 73 61138 53962 68
ggstatsplot::ggcorrmat(data = schooldata, matrix.type = "lower", type = "robust")
Looks like there isn't much correlation going on. I am going to change the percentage of female teachers per school into a factor variable.
schooldata$femaleperc <- factor(schooldata$femaleperc) # changing the integer variable into a factor variable
str(schooldata)
'data.frame': 500 obs. of 5 variables:
$ totteacher: int 120 107 115 114 97 93 110 98 117 116 ...
$ femaleperc: Factor w/ 21 levels "60","61","62",..: 15 11 5 9 7 14 12 9 21 6 ...
$ asalary : num 69301 55837 65690 54404 53841 ...
$ fsalary : num 42765 44540 52415 53446 52278 ...
$ fnteacher : num 89 75 74 78 64 68 78 67 94 75 ...
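As an aside, this is where forcats comes in handy. If we ever wanted a different reference level for the regression below, fct_relevel() can reorder the levels. A sketch (not assigned back, so the baseline stays "60"):
levels(fct_relevel(schooldata$femaleperc, "70")) # "70" would become the reference level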
As seen, femaleperc is now a factor variable with 21 distinct categories. I want to check how the percentage of female teachers by school relates to the average female teacher salary in that school.
ggplot(schooldata)+
geom_point(aes(femaleperc, fsalary))
I have a hard time tracing any pattern. I will use a boxplot for each percentage level of female teachers by school and try to understand what's going on between these two variables.
ggplot(schooldata, aes(x = femaleperc, y = fsalary))+
  geom_boxplot()+ # femaleperc is a factor, so each level gets its own box
  geom_smooth(aes(group = 1), method = "lm", se = FALSE) # group = 1 draws one line across the categories
This looks much better and is easier to understand. The boxplots summarize the median salary and the range within each percentage group. The median lines zigzag, though, so this still isn't very helpful.
attach(schooldata)
# Running the regression
mod1 <- lm(fsalary ~ femaleperc)
summary(mod1)
Call:
lm(formula = fsalary ~ femaleperc)
Residuals:
Min 1Q Median 3Q Max
-13639.5 -6024.4 -41.1 5784.5 12889.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52077.2 1472.2 35.374 <2e-16 ***
femaleperc61 1532.3 2058.2 0.744 0.457
femaleperc62 -573.2 1997.0 -0.287 0.774
femaleperc63 -3516.8 2136.1 -1.646 0.100
femaleperc64 -1040.7 1997.0 -0.521 0.603
femaleperc65 1908.4 2082.0 0.917 0.360
femaleperc66 -1368.3 2015.9 -0.679 0.498
femaleperc67 793.4 1919.5 0.413 0.680
femaleperc68 -1531.6 2107.9 -0.727 0.468
femaleperc69 357.8 2058.2 0.174 0.862
femaleperc70 -2194.2 1962.9 -1.118 0.264
femaleperc71 1079.7 1997.0 0.541 0.589
femaleperc72 -2147.1 2107.9 -1.019 0.309
femaleperc73 -814.4 1894.7 -0.430 0.668
femaleperc74 -2493.0 1997.0 -1.248 0.213
femaleperc75 -202.8 2082.0 -0.097 0.922
femaleperc76 -2755.0 1906.7 -1.445 0.149
femaleperc77 -2643.0 2107.9 -1.254 0.210
femaleperc78 2659.1 2082.0 1.277 0.202
femaleperc79 871.7 2058.2 0.424 0.672
femaleperc80 -987.2 1962.9 -0.503 0.615
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6746 on 479 degrees of freedom
Multiple R-squared: 0.05622, Adjusted R-squared: 0.01682
F-statistic: 1.427 on 20 and 479 DF, p-value: 0.104
The estimated average salary for female teachers in the reference category (schools where 60% of teachers are female) was $52,077.2, which is statistically significantly different from zero (t = 35.37). The overall F-test, however, is not significant (p = 0.104): there was no detectable difference in the average salary of female teachers across the percentage categories. Meaning, schools with 60% female high school teachers and schools with 80% have similar average salaries for their female teachers.
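Because the model contains a single factor, the intercept is simply the mean fsalary in the reference group, and each coefficient is the gap between a group's mean and that reference. A quick sanity check (my own sketch, relying on the attach() above):
mean(fsalary[femaleperc == "60"]) # should match the intercept, ~52077
mean(fsalary[femaleperc == "61"]) - mean(fsalary[femaleperc == "60"]) # should match femaleperc61, ~1532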
The lm() command stores a lot of information about the model that we can use later. For example:
coefci(mod1)
2.5 % 97.5 %
(Intercept) 49184.423 54969.9583
femaleperc61 -2511.926 5576.5453
femaleperc62 -4497.172 3350.7107
femaleperc63 -7714.091 680.4467
femaleperc64 -4964.612 2883.2707
femaleperc65 -2182.611 5999.3725
femaleperc66 -5329.401 2592.7700
femaleperc67 -2978.332 4565.0840
femaleperc68 -5673.454 2610.1727
femaleperc69 -3686.472 4401.9998
femaleperc70 -6051.177 1662.8703
femaleperc71 -2844.252 5003.6307
femaleperc72 -6288.904 1994.7227
femaleperc73 -4537.268 2908.4492
femaleperc74 -6416.932 1430.9507
femaleperc75 -4293.753 3888.2296
femaleperc76 -6501.604 991.5456
femaleperc77 -6784.854 1498.7727
femaleperc78 -1431.896 6750.0867
femaleperc79 -3172.563 4915.9089
femaleperc80 -4844.214 2869.8333
Here we go. We have the 95% confidence intervals (two-tailed) for the intercept and for each of the categories. Keep in mind that each category's interval is for the difference in salary from the reference group. For example, for the schools with 61% female teachers, the interval for that difference runs from -2,511.9 to 5,576.5.
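To get an interval for the expected salary level itself in that group (intercept plus the femaleperc61 coefficient), predict() can do the arithmetic for us. A minimal sketch:
predict(mod1,
        newdata = data.frame(femaleperc = factor("61", levels = levels(schooldata$femaleperc))),
        interval = "confidence") # fitted mean salary and 95% CI for the 61% group
We can also get a tidier version of the coefficient table with broom: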
tidy(coeftest(mod1))
# A tibble: 21 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 52077. 1472. 35.4 1.10e-135
2 femaleperc61 1532. 2058. 0.744 4.57e- 1
3 femaleperc62 -573. 1997. -0.287 7.74e- 1
4 femaleperc63 -3517. 2136. -1.65 1.00e- 1
5 femaleperc64 -1041. 1997. -0.521 6.03e- 1
6 femaleperc65 1908. 2082. 0.917 3.60e- 1
7 femaleperc66 -1368. 2016. -0.679 4.98e- 1
8 femaleperc67 793. 1920. 0.413 6.80e- 1
9 femaleperc68 -1532. 2108. -0.727 4.68e- 1
10 femaleperc69 358. 2058. 0.174 8.62e- 1
# ... with 11 more rows
This table looks a little cleaner. These examples also show how R builds the regression behind the scenes: the factor is expanded into a set of 0/1 dummy columns. We can inspect that design matrix with modelr's model_matrix():
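A minimal sketch on the data set at hand:
model_matrix(schooldata, fsalary ~ femaleperc) # one 0/1 column per level except the reference level "60", plus the intercept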
library(modelr)
# Adding predictions and residuals to the existing data table
schooldata1 <- schooldata %>%
add_predictions(mod1) %>%
add_residuals(mod1)
head(schooldata1)
totteacher femaleperc asalary fsalary fnteacher pred resid
1 120 74 69301 42765 89 49584.20 -6819.200
2 107 70 55837 44540 75 49883.04 -5343.037
3 115 64 65690 52415 74 51036.52 1378.480
4 114 68 54404 53446 78 50545.55 2900.450
5 97 66 53841 52278 64 50708.87 1569.125
6 93 73 61138 53962 68 51262.78 2699.219
# Plotting the predictions
ggplot(schooldata1, aes(femaleperc))+
geom_point(aes(y = fsalary))+
geom_line(aes(y=pred), colour = "red", size = 2)
# Plotting the residuals
ggplot(schooldata1,aes(femaleperc, resid))+
geom_point()+
geom_ref_line(h = 0, colour = "red", size = 2.5)
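Since lmtest is loaded, we can complement the residual plot with a formal diagnostic. As a sketch (my addition, not part of the original analysis), the Breusch-Pagan test checks for heteroskedasticity:
bptest(mod1) # H0: the residuals are homoskedastic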
# Testing hypotheses with car::linearHypothesis()
linearHypothesis(mod1, c("femaleperc61=0")) # comparing restricted vs. unrestricted models
linearHypothesis(mod1, c("femaleperc61=(Intercept)")) # not testing Intercept=0; testing whether the intercept equals the femaleperc61 coefficient
linearHypothesis(mod1, c("(Intercept)=0", "femaleperc61=1")) # testing a complex hypothesis: two constraints tested together
# Dummy variables: a single 0/1 indicator works the same way
highfem <- ifelse(as.numeric(as.character(femaleperc)) >= 70, 1, 0) # 1 if at least 70% of the school's teachers are female
mod3 <- lm(fsalary ~ highfem)
summary(mod3)