Rows: 1,000
Columns: 13
$ fage <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ mature <chr> "younger mom", "younger mom", "mature mom", "younger mo…
$ weeks <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie <chr> "full term", "full term", "full term", "full term", "pr…
$ visits <dbl> 14, 12, 10, NA, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15,…
$ gained <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex <chr> "male", "female", "female", "male", "female", "female",…
$ habit <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ marital <chr> "married", "married", "married", "not married", "marrie…
$ whitemom <chr> "white", "white", "not white", "white", "white", "white…
Weekly Discussion: Gauss-Markov Assumptions and Residual Analysis
I. Gauss-Markov Assumptions
Given the linear equation \(Y_{i} = \beta_{0} + \beta_{1}X_{i} + u_{i}\), the Gauss-Markov Assumptions are:
- The conditional distribution of the error term \(u_{i}\) given \(X_{i}\) has mean zero: \(E(u_{i} | X_{i}) = 0\). This implies that the error and the X value are not correlated.
- \((X_{i}, Y_{i}),\;i = 1, ..., n\) are independent and identically distributed. This refers to the observations being drawn by simple random sampling.
- Large outliers are unlikely: outliers, which often come from data errors, can skew the regression results. Formally, we assume that X and Y have finite kurtosis.
- Errors are homoskedastic: the variance of the error term \(u_{i}\) conditional on \(X_{i}\) is constant, i.e. \(var(u_{i} | X_{i} = x)\) does not depend on \(x\).
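To make the homoskedasticity assumption concrete, here is a small simulation sketch in base R. Everything in it (the sample size, the coefficients, the error standard deviations, and the helper `spread`) is made up for illustration and has nothing to do with the births data. It compares the spread of the residuals in the lower and upper halves of x when the error variance is constant and when it grows with x.

```r
set.seed(1)

# Hypothetical data: 500 draws of a single regressor
n <- 500
x <- runif(n, 1, 10)

# Homoskedastic errors: the standard deviation is the same for every x
y_homo <- 1 + 0.5 * x + rnorm(n, sd = 2)

# Heteroskedastic errors: the standard deviation grows with x
y_hetero <- 1 + 0.5 * x + rnorm(n, sd = 0.4 * x)

fit_homo <- lm(y_homo ~ x)
fit_hetero <- lm(y_hetero ~ x)

# Residual standard deviation below (FALSE) and above (TRUE) the median of x
spread <- function(fit) tapply(resid(fit), x > median(x), sd)

spread(fit_homo)    # the two halves should be similar
spread(fit_hetero)  # the upper half should be clearly larger
```

Under homoskedasticity the two numbers should be close; when the variance depends on x, the residuals fan out and the upper half shows a larger spread.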
II. Data Selection and Linear Regression
I selected the “births14” dataset (2014) from the OpenIntro package, which contains information on 1,000 births recorded in the U.S. and is used to examine the relationship between the habits and practices of expectant mothers and the births of their children.
Because several of the variables are categorical and would have to enter the model as dummy variables, for the purpose of this exercise I’ll restrict the linear regression to the numeric variables.
weight: weight of baby at birth (lbs). This will be the dependent variable.
The independent variables I will use are:
mage: mother’s age
weeks: length of pregnancy
visits: number of hospital visits
gained: weight gained by mother
Therefore the full regression equation is:
\[weight_{i} = \beta_{0} + \beta_{1}mage_{i} + \beta_{2}weeks_{i} + \beta_{3}visits_{i} + \beta_{4}gained_{i} + u_{i}\]
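As a minimal sketch, the equation above could be estimated in R along these lines. It assumes the `openintro` package is installed and that its `births14` data frame matches the preview at the top of this post; the object name `model` is just my choice.

```r
# Assumption: the openintro package is installed and provides births14
library(openintro)

# Regress birth weight on the four numeric variables.
# lm() drops rows with a missing value in any model variable by default.
model <- lm(weight ~ mage + weeks + visits + gained, data = births14)

# Coefficients, standard errors, R-squared and F statistic
summary(model)

# Number of observations actually used after incomplete rows were dropped
nobs(model)
```

Because incomplete rows are dropped, the number of observations used in the fit can be smaller than the 1,000 rows in the dataset.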
Results
===============================================
                        Dependent variable:
                    ---------------------------
                              weight
-----------------------------------------------
mage                         0.017***
                              (0.006)
weeks                        0.264***
                              (0.015)
visits                        0.019**
                              (0.009)
gained                       0.010***
                              (0.002)
Constant                     -4.005***
                              (0.586)
-----------------------------------------------
Observations                    910
R2                             0.304
Adjusted R2                    0.301
Residual Std. Error      1.073 (df = 905)
F Statistic           98.782*** (df = 4; 905)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
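We find that all four variables are statistically significant. The number of hospital visits is the weakest of the four: it is significant at the 5% level, while the others are significant at the 1% level. The constant (intercept), however, has no practical interpretation: it is negative, and it corresponds to the predicted weight when every regressor equals zero (for example, a pregnancy of zero weeks), which is far outside the range of the data.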
III. Linear Regression Plots
Given the linear model from above, here is a graphic of the four residual plots:
- Residuals vs Fitted: The residuals should be scattered evenly around the zero line with no systematic pattern. Here there doesn’t seem to be an obvious pattern, although the points tend to be concentrated in the same region.
- Q-Q plot: To assess normality, we look for the points to lie close to the straight line without any systematic departure from it. A clear pattern would indicate that the distribution is not normal. Here the points in the graph are close to the line.
- Scale-location: This chart is used to assess homoskedasticity. In my model there is no clear pattern per se, but the data points again seem to be concentrated in the same region.
- Residuals vs. Leverage plot: This chart helps identify outliers that may be influencing the model because they are either substantially large or small. Here the points are not particularly far from zero or far from the center (i.e. top right and bottom left), although there is slight funneling as we move along the x-axis.
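For reference, the four plots described above are what base R draws by default for a fitted `lm` object. The sketch below assumes the `openintro` package is installed and refits the same model; the names `model` and `top_influence` are my own, and the Cook's distance line is an extra numeric check to go with the leverage plot.

```r
# Assumption: the openintro package is installed and provides births14
library(openintro)

model <- lm(weight ~ mage + weeks + visits + gained, data = births14)

# Arrange the four default diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, Q-Q, scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(model)
par(op)  # restore the previous plotting layout

# Observations with the largest Cook's distance, i.e. the most influential ones
top_influence <- head(sort(cooks.distance(model), decreasing = TRUE), 3)
top_influence
```

Cook's distance combines residual size and leverage, so it gives a number to go with the visual read of the last plot.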
Looking at the chart, overall it seems that the model adheres to the Gauss-Markov conditions, although the residuals vs. leverage plot does seem to suggest heteroskedasticity, perhaps due to outliers. I’m curious whether or not the categorical variables in the dataset would influence these models. I would hypothesize that the smoking status of the mother would be associated with a low birth weight for the baby, although just glancing at the data, low birth weight is not exclusively associated with a mother who smokes.
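Since the plots only give a visual impression, one way to check the heteroskedasticity question more formally is a Breusch-Pagan style test. The sketch below is a hand-rolled version (the studentized variant) in base R rather than a call to a dedicated testing package, and it assumes the `openintro` package is installed. The names `d`, `u2`, `aux`, `lm_stat` and `p_value` are my own.

```r
# Assumption: the openintro package is installed and provides births14
library(openintro)

model <- lm(weight ~ mage + weeks + visits + gained, data = births14)

# Keep only the rows used in the fit and attach the squared residuals
d <- model.frame(model)
d$u2 <- resid(model)^2

# Auxiliary regression of the squared residuals on the same regressors
aux <- lm(u2 ~ mage + weeks + visits + gained, data = d)

# Test statistic n * R-squared; under homoskedasticity it is approximately
# chi-square distributed with 4 degrees of freedom (one per regressor)
lm_stat <- nrow(d) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 4, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value would be evidence against constant error variance, while a large one would be consistent with the homoskedasticity assumption.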
Final question - would anyone have advice for incorporating dummy variables into the model? I tried a couple of different packages to no avail.