Using the “swiss” dataset, conduct a regression of two variables of interest. Interpret the assumptions and results.

Exploring the swiss dataset

plot(swiss)

From the plot, I would like to look further into Fertility and Agriculture.

Let’s observe the distribution of Fertility.

hist(swiss$Fertility)

It appears to be approximately normally distributed.

Let’s observe the distribution of Agriculture.
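A histogram alone can be hard to judge, so a rough way to back up the visual impression is a Q-Q plot together with a Shapiro-Wilk test. This is only a sketch of one possible check, not part of the original analysis; the name `sw` is introduced here for illustration. Note that for the regression itself, it is the normality of the residuals rather than of Fertility that matters.

```r
# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(swiss$Fertility)
qqline(swiss$Fertility, col = "red")

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
sw <- shapiro.test(swiss$Fertility)
print(sw)
```

A large p-value here would mean there is no strong evidence against normality; it would not prove that the variable is normal.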

hist(swiss$Agriculture)

Let’s compute Pearson’s correlation between Agriculture and Fertility.

cor(swiss$Agriculture, swiss$Fertility)
## [1] 0.3530792

There is a weak positive linear relationship between the two variables (r ≈ 0.35).

Checking the correlations among all the variables
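`cor()` returns only the point estimate. As a sketch of how one might also test whether the correlation differs from zero, base R's `cor.test()` can be used; the object name `ct` is an illustrative choice.

```r
# Pearson correlation with a significance test and confidence interval
ct <- cor.test(swiss$Agriculture, swiss$Fertility, method = "pearson")

ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for H0: the true correlation is zero
ct$conf.int   # 95% confidence interval for the correlation
```

With a single predictor, this test is equivalent to the t-test on the slope of a simple linear regression of Fertility on Agriculture, so the two p-values should agree.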

cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000

From the results we can see a relationship between Agriculture and Fertility, so let’s plot those variables.

plot(swiss$Agriculture, swiss$Fertility, main="Scatterplot", col='red')

We can see plenty of variation between the observations.

Let’s run a regression model to see how well Agriculture predicts Fertility.

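mylm <- lm(swiss$Fertility ~ swiss$Agriculture)
# Diagnostic plots for checking the regression assumptions
plot(mylm)

# Redraw the scatterplot and overlay the fitted regression line
plot(swiss$Agriculture, swiss$Fertility, main="Scatterplot")
abline(mylm, col='red')

Let’s look at the model summary.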
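To interpret the assumptions, it can help to view the four standard diagnostic plots together. The following is a minimal sketch using only base R; the model is refit with the same formula so the snippet runs on its own, and `op` and `res_sw` are names introduced here for illustration.

```r
# Refit the same model so this snippet is self-contained
mylm <- lm(swiss$Fertility ~ swiss$Agriculture)

# Show the four diagnostic plots in a 2 x 2 grid, then restore the layout
op <- par(mfrow = c(2, 2))
plot(mylm)
par(op)

# Optional formal check of residual normality (H0: residuals are normal)
res_sw <- shapiro.test(residuals(mylm))
print(res_sw)
```

Roughly speaking, the Residuals vs Fitted panel is used to judge linearity, the Normal Q-Q panel to judge normality of the residuals, the Scale-Location panel to judge constant variance, and the Residuals vs Leverage panel to spot influential observations.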

summary(mylm)
## 
## Call:
## lm(formula = swiss$Fertility ~ swiss$Agriculture)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5374  -7.8685  -0.6362   9.0464  24.4858 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       60.30438    4.25126  14.185   <2e-16 ***
## swiss$Agriculture  0.19420    0.07671   2.532   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared:  0.1247, Adjusted R-squared:  0.1052 
## F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492

The R-squared statistic is 0.1247, which means that our independent variable, Agriculture, explains about 12.47% of the variation in Fertility around its mean.

The residual standard error of 11.82, which measures the typical distance of the observations from the regression line (in the units of Fertility), is large.

There is a slightly positive slope for Agriculture (0.19420): each one-point increase in Agriculture is associated with an estimated increase of about 0.19 in Fertility. The p-value for the hypothesis test that the slope is equal to zero is 0.0149, which is below 0.05, so we reject the null hypothesis at the 5% significance level.

Lastly, I will run the anova function to compare its output with my summary results.
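In a regression with a single predictor, R-squared is the square of the Pearson correlation between the two variables. A quick sketch to confirm this for the model fitted here (the model is refit so the snippet runs on its own; `r` and `r_squared` are illustrative names):

```r
# Refit the same model so this snippet is self-contained
mylm <- lm(swiss$Fertility ~ swiss$Agriculture)

r <- cor(swiss$Agriculture, swiss$Fertility)   # Pearson correlation
r_squared <- summary(mylm)$r.squared           # R-squared reported by lm

# The two values should match
c(r_squared = r_squared, cor_squared = r^2)
```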
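A confidence interval gives a sense of how precisely the slope is estimated, which the p-value alone does not. As a sketch, base R's `confint()` can be applied to the fitted model (refit here so the snippet runs on its own; `ci` is an illustrative name):

```r
# Refit the same model so this snippet is self-contained
mylm <- lm(swiss$Fertility ~ swiss$Agriculture)

# 95% confidence intervals for the intercept (row 1) and the slope (row 2)
ci <- confint(mylm, level = 0.95)
ci
```

Because the slope is significant at the 5% level, the 95% interval for the slope should not contain zero.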

anova(mylm)
## Analysis of Variance Table
## 
## Response: swiss$Fertility
##                   Df Sum Sq Mean Sq F value  Pr(>F)  
## swiss$Agriculture  1  894.8  894.84  6.4089 0.01492 *
## Residuals         45 6283.1  139.62                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
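The ANOVA table agrees with the summary output: with one predictor, the F statistic is the square of the slope's t statistic, and R-squared is the regression sum of squares divided by the total sum of squares. A sketch of that comparison follows (the model is refit so the snippet runs on its own; `aov_tab`, `t_slope`, `f_stat`, and `ss` are illustrative names):

```r
# Refit the same model so this snippet is self-contained
mylm <- lm(swiss$Fertility ~ swiss$Agriculture)

aov_tab <- anova(mylm)
t_slope <- summary(mylm)$coefficients[2, "t value"]  # t statistic for the slope
f_stat  <- aov_tab[1, "F value"]                     # F statistic from the ANOVA table
ss      <- aov_tab[["Sum Sq"]]                       # regression and residual sums of squares

# t^2 should equal F, and SS_regression / SS_total should equal R-squared
c(t_squared = t_slope^2, F = f_stat, r_squared = ss[1] / sum(ss))
```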