Data Preparation

Introduction

# install any missing packages
if (!require("stringr")) install.packages('stringr')
if (!require("AER")) install.packages('AER')
if (!require("data.table")) install.packages('data.table')
if (!require("dplyr")) install.packages('dplyr')
if (!require("ggplot2")) install.packages('ggplot2')

library(dplyr)
library(data.table)
library(stringr)
library(ggplot2)
library(AER)

# load data
data("CollegeDistance")

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Though the data set contains 14 variables, the models we create will use only three: score, the achievement test score obtained during the student's senior year of high school; urban, whether the student's high school is located in an urban area; and education, the number of years of education attained 6 years after high school graduation. Of these, score and education are quantitative and urban is categorical.

The Question

Is a student's test score related to the number of years of education attained and to whether the student's high school is in an urban area?
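One way to put this question to the data is a multiple linear regression of score on education and urban. The sketch below assumes a purely additive linear model, which the proposal does not specify, and the object name `fit` is only illustrative.

```r
# Load the data set straight from the AER package (AER must be installed)
data("CollegeDistance", package = "AER")

# Assumed model form: score as a linear function of education and urban.
# urban is a factor with levels "no" and "yes", so R creates one dummy
# coefficient, "urbanyes", measured relative to non-urban schools.
fit <- lm(score ~ education + urban, data = CollegeDistance)

# Coefficient estimates, standard errors and R-squared
summary(fit)
```

Because this is observational data, the coefficients describe association only, not a causal effect of education or location on score.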

Cases

What are the cases, and how many are there?

Each case is one student in the High School and Beyond survey, a cross-section survey conducted by the Department of Education in 1980, with a follow-up in 1986.

The data frame contains 4,739 observations (cases) on 14 variables.
nrow(CollegeDistance)
## [1] 4739

Data collection

Describe the method of data collection.

The data set is available in the AER package on CRAN and can be loaded directly into a data frame for tidying. Its more than 4,000 rows are enough to fit multiple regression models.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1,100 high schools. 

Source:
Online complements to Stock and Watson (2007). 

References
Rouse, C.E. (1995). Democratization or Diversion? The Effect of Community Colleges on Educational Attainment. Journal of Business & Economic Statistics, 12, 217-224.
Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

score (quantitative): base-year composite test score, from achievement tests given to high school seniors in the sample.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

education (quantitative): number of years of education.
urban (qualitative): factor indicating whether the school is in an urban area.
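The variable types stated above can be checked directly in R. This is a minimal sketch; the name `vars` is a hypothetical helper, not something used elsewhere in the report.

```r
# Load the data set from the AER package (AER must be installed)
data("CollegeDistance", package = "AER")

# Keep the response and the two predictors, selected by name
vars <- CollegeDistance[, c("score", "education", "urban")]

# Storage class of each column: numeric columns are quantitative,
# factor columns are qualitative
sapply(vars, class)

# Categories of the qualitative predictor
levels(vars$urban)
```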

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

First rows of the full data frame, followed by summary statistics for the selected columns
head(CollegeDistance)
##   gender ethnicity score fcollege mcollege home urban unemp wage distance
## 1   male     other 39.15      yes       no  yes   yes   6.2 8.09      0.2
## 2 female     other 48.87       no       no  yes   yes   6.2 8.09      0.2
## 3   male     other 48.74       no       no  yes   yes   6.2 8.09      0.2
## 4   male      afam 40.40       no       no  yes   yes   6.2 8.09      0.2
## 5 female     other 40.48       no       no   no   yes   5.6 8.09      0.4
## 6   male     other 54.71       no       no  yes   yes   5.6 8.09      0.4
##   tuition education income region
## 1 0.88915        12   high  other
## 2 0.88915        12    low  other
## 3 0.88915        12    low  other
## 4 0.88915        12    low  other
## 5 0.88915        13    low  other
## 6 0.88915        12    low  other
# Create a data frame with only the needed columns (selected by name, not position)
cd <- CollegeDistance %>% select(score, urban, distance)
names(cd)
## [1] "score"    "urban"    "distance"
summary(cd)
##      score       urban         distance     
##  Min.   :28.95   no :3635   Min.   : 0.000  
##  1st Qu.:43.92   yes:1104   1st Qu.: 0.400  
##  Median :51.19              Median : 1.000  
##  Mean   :50.89              Mean   : 1.803  
##  3rd Qu.:57.77              3rd Qu.: 2.500  
##  Max.   :72.81              Max.   :20.000
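The subset summarized above holds score, urban and distance, whereas the research question names education. A minimal sketch of the same step with education selected by name, plus a per-group summary by urban, might look like the following. The names `cd_q` and `by_urban` are hypothetical, and no output is shown because it has not been run as part of this report.

```r
library(dplyr)

# Load the data set from the AER package (AER must be installed)
data("CollegeDistance", package = "AER")

# Select the three research-question variables by name
cd_q <- CollegeDistance %>% select(score, urban, education)
summary(cd_q)

# Group-wise summary: how score and education differ by urban status
by_urban <- cd_q %>%
  group_by(urban) %>%
  summarise(
    n              = n(),
    mean_score     = mean(score),
    sd_score       = sd(score),
    mean_education = mean(education)
  )
by_urban
```

Selecting by name avoids picking the wrong column if the column order ever changes.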
Distributions of the numerical variables; the scores look approximately normal
hist(CollegeDistance$score)

hist(CollegeDistance$education)

boxplot(CollegeDistance$score)

boxplot(CollegeDistance$distance)
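The plots above look at one variable at a time. To relate them to the research question, a sketch with ggplot2 (already loaded at the top of the report) could compare score across urban and against education. The plot object names, labels and the jitter and transparency settings are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Load the data set from the AER package (AER must be installed)
data("CollegeDistance", package = "AER")

# Box plot: distribution of score for urban vs non-urban schools
p_box <- ggplot(CollegeDistance, aes(x = urban, y = score)) +
  geom_boxplot() +
  labs(x = "School in urban area", y = "Test score")

# Scatter plot: score against years of education, coloured by urban.
# Points are jittered horizontally because many students share the
# same education value; a least-squares line is added per group.
p_scatter <- ggplot(CollegeDistance,
                    aes(x = education, y = score, colour = urban)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Years of education", y = "Test score", colour = "Urban")

print(p_box)
print(p_scatter)
```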