# install any missing packages
if (!require("stringr")) install.packages("stringr")
if (!require("AER")) install.packages("AER")
if (!require("data.table")) install.packages("data.table")
if (!require("dplyr")) install.packages("dplyr")
if (!require("ggplot2")) install.packages("ggplot2")
# load packages
library(dplyr)
library(data.table)
library(stringr)
library(ggplot2)
library(AER)
# load data
data("CollegeDistance")
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Although the data set contains 14 variables, the models use only three: score, the achievement test score obtained during the student's senior year of high school; urban, whether the student's high school is located in an urban area; and education, the number of years of education attained six years after high school graduation. Of these, score and education are quantitative and urban is categorical.
Is a student's test score related to the number of years of education attained and to whether the student's high school is located in an urban area?
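The variable types stated above can be checked directly in R. This is a minimal sketch, assuming the AER package is installed; `vars` is a hypothetical name used only for illustration.

```r
library(AER)
data("CollegeDistance")

# Keep only the three variables named in the research question
vars <- CollegeDistance[, c("score", "education", "urban")]

# Storage class of each variable: numeric suggests quantitative, factor categorical
sapply(vars, class)

# Categories of the qualitative variable
levels(vars$urban)
```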
What are the cases, and how many are there?
Each case is a student. The data are cross-sectional, from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986.
The data frame contains 4,739 observations on 14 variables.
nrow(CollegeDistance)
## [1] 4739
Describe the method of data collection.
The data are available in the AER package on CRAN. With more than 4,000 rows, the data set is large enough to fit multiple regression models, and it loads directly into a data frame for tidy processing.
What type of study is this (observational/experiment)?
This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
Cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1,100 high schools.
Source:
Online complements to Stock and Watson (2007).
References
Rouse, C.E. (1995). Democratization or Diversion? The Effect of Community Colleges on Educational Attainment. Journal of Business & Economic Statistics, 12, 217-224.
Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.
What is the response variable? Is it quantitative or qualitative?
score (numerical): base-year composite test score. These are achievement tests given to high school seniors in the sample.
You should have two independent variables, one quantitative and one qualitative.
education (numerical): number of years of education.
urban (categorical): factor indicating whether the school is in an urban area.
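With the response and the two independent variables identified, the research question could be examined with a multiple regression. The following is only a sketch under that assumption, not a model taken from the analysis itself; `fit` is a hypothetical name.

```r
library(AER)
data("CollegeDistance")

# Regress score on years of education and urban status.
# urban is a factor with levels "no" and "yes", so R creates one
# indicator term (urbanyes) with "no" as the reference level.
fit <- lm(score ~ education + urban, data = CollegeDistance)

# Coefficient estimates, standard errors, and R-squared
summary(fit)
```

The coefficient on education would be read as the expected difference in score per additional year of education, holding urban status fixed. Because the study is observational, such estimates describe association rather than causation.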
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
The first rows of the full data frame, followed by summary statistics:
head(CollegeDistance)
##   gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other
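# Create a data frame with a subset of columns, selected by name
# (score, urban, and distance are columns 3, 7, and 10; education is column 12 and is not included)
cd <- CollegeDistance %>% select(score, urban, distance)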
names(cd)
## [1] "score" "urban" "distance"
summary(cd)
##      score       urban         distance
##  Min.   :28.95   no :3635   Min.   : 0.000
##  1st Qu.:43.92   yes:1104   1st Qu.: 0.400
##  Median :51.19              Median : 1.000
##  Mean   :50.89              Mean   : 1.803
##  3rd Qu.:57.77              3rd Qu.: 2.500
##  Max.   :72.81              Max.   :20.000
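The subset above keeps distance, while the research question names education as the quantitative predictor. A minimal sketch of the same summary for score, urban, and education, assuming that is the intended set of variables; `cd_model` is a hypothetical name that does not appear in the original analysis.

```r
library(AER)
library(dplyr)
data("CollegeDistance")

# Select by column name instead of position so the choice is explicit
cd_model <- CollegeDistance %>% dplyr::select(score, urban, education)

names(cd_model)

# Five-number summaries for the numeric columns, counts for the factor
summary(cd_model)
```

Selecting by name avoids picking the wrong column if the column order ever differs from what the positions assume.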
Distributions of the numerical variables; score appears approximately normally distributed.
hist(CollegeDistance$score)
hist(CollegeDistance$education)
boxplot(CollegeDistance$score)
boxplot(CollegeDistance$distance)
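The plots above show each variable on its own. To relate them to the research question, score can be plotted against each predictor. The following ggplot2 sketch is one possible way to do this; the object names `p_urban` and `p_edu` and the axis labels are illustrative assumptions.

```r
library(AER)
library(ggplot2)
data("CollegeDistance")

# Score by urban status: side-by-side boxplots compare the two groups
p_urban <- ggplot(CollegeDistance, aes(x = urban, y = score)) +
  geom_boxplot() +
  labs(x = "High school in urban area", y = "Test score")

# Score against years of education: points are jittered horizontally
# because education takes only a few distinct values; the line is a
# simple least-squares fit of score on education
p_edu <- ggplot(CollegeDistance, aes(x = education, y = score)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Years of education", y = "Test score")

print(p_urban)
print(p_edu)
```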