This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
We will begin by reading the CSV file containing our Chicago Public Schools training data and identifying the variables needed for Question 5.
mydata = read.csv(file="Chicago_Public_Schools_Train.csv")
class(mydata)
## [1] "data.frame"
head(mydata)
safetyscore = mydata$Safety.Score
familyinvolve = mydata$Family.Involvement.Score
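Before computing the correlation, a quick sanity check for missing values in the two columns may be helpful; this is only a sketch and assumes the columns were read in as numeric vectors.
#Sanity check (sketch): count missing values in the Question 5 variables
sum(is.na(safetyscore))
sum(is.na(familyinvolve))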
Now I will add a chunk of code for Question 5, where we must calculate the correlation coefficient between Safety Score and Family Involvement Score.
#Question 5
cor(safetyscore, familyinvolve)
## [1] 0.7144638
From this output, we can see a fairly strong positive correlation (r ≈ 0.71) between our two variables. While this might hint at an underlying causal relationship, correlation alone cannot establish one; only the fairly strong linear association can be confirmed from this calculation.
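If we also wanted to gauge whether this correlation is statistically distinguishable from zero, base R's cor.test() can be used; this is just a sketch of an optional check, not a required part of the question.
#Optional check (sketch): significance test for the Question 5 correlation
cor.test(safetyscore, familyinvolve)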
Now I will insert the necessary code for calculating the signal-to-noise ratio for the variable Environment Score.
#Question 6
environscore = mydata$Environment.Score
envmean = mean(environscore)
envsd = sd(environscore)
signoiseenv = envmean/envsd
signoiseenv
## [1] 2.953136
This signal-to-noise ratio for Environment Score means that the mean of the variable is roughly three times as large as its standard deviation, so the typical value (the signal) is large relative to the spread of the data (the noise). This is a good signal-to-noise ratio, although it could be higher. The worst case would have been a ratio of less than 1, where the noise exceeds the signal.
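The same calculation can be wrapped in a small helper and applied to the other score variables already pulled out above; this is purely an illustrative sketch for comparison.
#Sketch: signal-to-noise ratio for the score variables used so far
snr = function(x) mean(x)/sd(x)
sapply(list(Safety = safetyscore, Family = familyinvolve, Environment = environscore), snr)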
Now for Question 7 I will identify the necessary variables for finding outliers in the variable ISAT Exceeding Math. Then I will calculate the upper and lower limits for the variable and use the min and max functions to see whether our minimum and maximum fall outside those limits.
#Question 7
exmath = mydata$ISAT.Exceeding.Math
avgexmath = mean(exmath)
sdexmath = sd(exmath)
upperexmath = avgexmath+(3*sdexmath)
lowerexmath = avgexmath-(3*sdexmath)
upperexmath
## [1] 73.77911
lowerexmath
## [1] -30.12311
max(exmath)
## [1] 92.8
min(exmath)
## [1] 3.2
After conducting our calculations, we can see that the maximum (92.8) exceeds the upper limit while the minimum stays above the lower limit, so there are outliers in the upper end of our data. To identify them, we can put the variable into a data frame and inspect it.
#Question 7 Continued
exmath_df = data.frame(exmath)
exmath_df
After manually checking the data frame, we can see that 75.1 and 92.8 were the only outliers in our data.
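As a cross-check on the manual inspection, a logical filter can pull the outliers out directly using the limits computed above; this is just a sketch of an alternative approach.
#Question 7 cross-check (sketch): extract values beyond the 3-standard-deviation limits
exmath[exmath < lowerexmath | exmath > upperexmath]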
Now for Question 8, we will identify all necessary variables, create the two linear regression models, and then pull the necessary values for comparison.
#Question 8
#Reuse exmath from Question 7 and safetyscore from Question 5
penvscore = mydata$Parent.Environment.Score
rmisconducts = mydata$Rate.of.Misconducts
teacherscore = mydata$Teachers.Score
regression1 = lm(exmath ~ safetyscore+penvscore)
regression2 = lm(exmath ~ rmisconducts+teacherscore)
summary(regression1)
##
## Call:
## lm(formula = exmath ~ safetyscore + penvscore)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -23.032  -7.241  -1.196   5.276  47.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.40039   14.27842   1.289   0.2006
## safetyscore  0.60722    0.05855  10.371   <2e-16 ***
## penvscore   -0.55923    0.28658  -1.951   0.0539 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.04 on 97 degrees of freedom
## Multiple R-squared: 0.5264, Adjusted R-squared: 0.5166
## F-statistic: 53.9 on 2 and 97 DF, p-value: < 2.2e-16
summary(regression2)
##
## Call:
## lm(formula = exmath ~ rmisconducts + teacherscore)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -22.319  -8.709  -2.169   3.127  58.882
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  11.83466    5.33293   2.219  0.02881 *
## rmisconducts -0.25621    0.07865  -3.258  0.00155 **
## teacherscore  0.28462    0.08754   3.251  0.00158 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 97 degrees of freedom
## Multiple R-squared: 0.2476, Adjusted R-squared: 0.2321
## F-statistic: 15.96 on 2 and 97 DF, p-value: 1.018e-06
From the above data, we can now create our regression equations:
Regression Equation 1: y= 18.40039+0.60722(xsafetyscore)-0.55923(xpenvscore) R-Squared: 0.5264, Adjusted R-Squared: 0.5166
Regression Equation 2: y= 11.83466-0.25621(xrmisconducts)+0.28462(xteacherscore) R-Squared: 0.2476, Adjusted R-Squared: 0.2321
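The R-squared values used in these equations can also be pulled straight from the model summaries rather than copied by hand; a small sketch of that comparison:
#Sketch: pull the R-squared values from each model for a side-by-side comparison
summary(regression1)$r.squared
summary(regression1)$adj.r.squared
summary(regression2)$r.squared
summary(regression2)$adj.r.squared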
#This code will be used for question 8.c.
exmath_predict = coef(regression1)[1] + coef(regression1)[2]*(42) + coef(regression1)[3]*(55)
exmath_predict
## (Intercept)
## 13.14597
#Answer in this case is 13.14597
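An equivalent way to obtain this prediction is to use predict() with the same input values; this is a sketch assuming, as in the manual calculation above, a Safety Score of 42 and a Parent Environment Score of 55.
#Sketch: same Question 8.c prediction using predict()
newschool = data.frame(safetyscore = 42, penvscore = 55)
predict(regression1, newdata = newschool)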
Extra Credit a) From the R-squared values and the correlation coefficients, we cannot claim a strong causal relationship between either model and ISAT Exceeding Math. Model 2 explains only about 25% of the variance in the outcome, and Model 1 explains just over half (R-squared of roughly 0.53), which indicates a moderate linear association at best. These are observational data, so even the better-fitting model provides little evidence of a causal relationship.
#Extra Credit plots: Safety Score against ISAT Exceeding Math, the same scatter with a smoothed fit, and ISAT Exceeding Math by observation
plot(exmath, safetyscore)
scatter.smooth(exmath, safetyscore)
plot(exmath)
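Another plot that supports this discussion is a residual check on the stronger model; this is only an illustrative sketch of one common diagnostic.
#Sketch: residuals versus fitted values for regression1
plot(fitted(regression1), residuals(regression1))
abline(h = 0)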