Using the “swiss” dataset, conduct a regression of two variables of interest. Interpret the assumptions and results.
Exploring the swiss dataset
plot(swiss)
From the plot, I would like to look further into Fertility and Agriculture.
Let's observe the distribution of Fertility.
hist(swiss$Fertility)
It appears to be roughly normally distributed, although a histogram of 47 observations gives only an informal impression.
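A histogram alone can be hard to judge, so as a supplementary check (not part of the original analysis) a Q-Q plot and a Shapiro-Wilk test could be used. This is only a sketch, and the name fert_norm is illustrative.

```r
# Supplementary normality check for Fertility (a sketch, not part of the original analysis)
qqnorm(swiss$Fertility)               # points close to the line suggest approximate normality
qqline(swiss$Fertility, col = 'red')

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
fert_norm <- shapiro.test(swiss$Fertility)
fert_norm$p.value                     # a large p-value means no evidence against normality
```

Strictly speaking, the regression assumption concerns the residuals of the model rather than Fertility itself, so this check is only a first look.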
Let's observe the distribution of Agriculture.
hist(swiss$Agriculture)
Let's compute Pearson's correlation between Agriculture and Fertility.
cor(swiss$Agriculture, swiss$Fertility)
## [1] 0.3530792
There is a weak to moderate positive linear relationship between the two variables (r is about 0.35).
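cor() only returns the sample correlation. To see whether it differs reliably from zero, one option is cor.test() from base R, sketched below. The name ct is just an illustrative choice.

```r
# Test whether the Pearson correlation differs from zero (illustrative sketch)
ct <- cor.test(swiss$Agriculture, swiss$Fertility, method = "pearson")
ct$estimate   # sample correlation, the same value that cor() returns
ct$p.value    # p-value for H0: the true correlation is zero
ct$conf.int   # 95% confidence interval for the correlation
```

With a single predictor, this test is equivalent to the t-test on the slope of a simple linear regression, so its p-value should match the one from summary() of the fitted model.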
Checking the pairwise correlations among all the variables
cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic Infant.Mortality
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847       0.41655603
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951      -0.06085861
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418      -0.11402160
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589      -0.09932185
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000       0.17549591
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959       1.00000000
From the results we can see a positive relationship between Agriculture and Fertility, so let's plot those variables.
plot(swiss$Agriculture, swiss$Fertility, main="Scatterplot", col='red')
We can see plenty of variation between the observations.
Let's run a regression model to see how well Agriculture predicts Fertility.
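mylm <- lm(swiss$Fertility ~ swiss$Agriculture)
# Redraw the scatterplot, then overlay the fitted regression line
plot(swiss$Agriculture, swiss$Fertility)
abline(mylm, col='red')
# Diagnostic plots for checking the model assumptions
plot(mylm)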
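To interpret the assumptions, it can help to view the four default diagnostic plots on one page. The sketch below refits the same model under the illustrative name fit, using the data argument so the labels are shorter. This is one possible way to lay it out, not the only one.

```r
# Refit with the data argument; same model as mylm, with shorter coefficient labels
fit <- lm(Fertility ~ Agriculture, data = swiss)

op <- par(mfrow = c(2, 2))  # show the four diagnostic plots on one page
plot(fit)                   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(op)                     # restore the previous plotting layout
```

Roughly, the residuals vs fitted plot is used to judge linearity, the normal Q-Q plot to judge normality of the residuals, the scale-location plot to judge constant variance, and the residuals vs leverage plot to spot influential observations.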
Let's observe the summary statistics.
summary(mylm)
##
## Call:
## lm(formula = swiss$Fertility ~ swiss$Agriculture)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.5374  -7.8685  -0.6362   9.0464  24.4858
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       60.30438    4.25126  14.185   <2e-16 ***
## swiss$Agriculture  0.19420    0.07671   2.532   0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
The R-squared statistic is 0.1247, which means that our independent variable, Agriculture, explains about 12.47% of the variation in Fertility around its mean. Most of the variation (about 87.5%) is left unexplained by this model.
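In a regression with a single predictor, R-squared equals the square of the Pearson correlation, which links this result to the correlation computed earlier (0.3530792 squared is about 0.1247). A small sketch of that check, with fit and r as illustrative names:

```r
# R-squared of a simple linear regression equals the squared Pearson correlation
fit <- lm(Fertility ~ Agriculture, data = swiss)
r <- cor(swiss$Agriculture, swiss$Fertility)

r^2                      # squared Pearson correlation
summary(fit)$r.squared   # R-squared reported by summary(); should match r^2
```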
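The residual standard error, which measures the typical distance of the observations from the regression line, is 11.82 on 45 degrees of freedom. This is large: a typical observation lies about 12 Fertility units away from the fitted line.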
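One way to judge whether 11.82 is large is to compare it with the standard deviation of Fertility, which is the spread we would see without using Agriculture at all. The sketch below does this, again using the illustrative name fit.

```r
# Compare the residual standard error with the overall spread of Fertility
fit <- lm(Fertility ~ Agriculture, data = swiss)

sigma(fit)            # residual standard error of the model
sd(swiss$Fertility)   # standard deviation of Fertility, ignoring Agriculture
```

If the two numbers are close, the regression line reduces the scatter only slightly, which is consistent with the low R-squared.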
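There is a slightly positive slope for Agriculture (0.19420): each one-percentage-point increase in Agriculture is associated with an average increase of about 0.19 in the Fertility measure. The p-value for the hypothesis test that the slope is equal to zero is 0.0149, which is below 0.05, so we reject the null hypothesis at the 5% significance level.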
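A confidence interval for the slope gives the same conclusion as the hypothesis test, and predict() shows what the fitted line implies for a given value of Agriculture. In the sketch below, fit and new_prov are illustrative names, and the value 50 is a made-up example, not a province from the data.

```r
fit <- lm(Fertility ~ Agriculture, data = swiss)

# 95% confidence interval for the slope; it should exclude zero because p < 0.05
confint(fit, "Agriculture", level = 0.95)

# Predicted Fertility for a hypothetical province with 50% of males in agriculture
new_prov <- data.frame(Agriculture = 50)
predict(fit, newdata = new_prov, interval = "confidence")
```

Using the rounded coefficients from the summary, the point prediction is about 60.30 + 0.194 * 50, or roughly 70.0.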
Lastly, I will run the anova function to cross-check the summary results.
anova(mylm)
## Analysis of Variance Table
##
## Response: swiss$Fertility
##                   Df Sum Sq Mean Sq F value  Pr(>F)
## swiss$Agriculture  1  894.8  894.84  6.4089 0.01492 *
## Residuals         45 6283.1  139.62
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
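The ANOVA table agrees with the summary: the F value (6.4089) and its p-value (0.01492) match the F-statistic line, and with one predictor the F value is the square of the slope's t value (2.532 squared is about 6.41). The regression sum of squares divided by the total, 894.8 / (894.8 + 6283.1), gives about 0.1247, which is the R-squared. A sketch of these checks, with fit, tab, t_slope and ss as illustrative names:

```r
fit <- lm(Fertility ~ Agriculture, data = swiss)
tab <- anova(fit)

# With one predictor, the F statistic equals the squared t statistic of the slope
t_slope <- summary(fit)$coefficients["Agriculture", "t value"]
t_slope^2
tab[["F value"]][1]

# Regression sum of squares over total sum of squares reproduces R-squared
ss <- tab[["Sum Sq"]]
ss[1] / sum(ss)
```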