First thing, the schools data in SPSS format are saved on the desktop. Desktop is set as the working directory, so the code to retrieve the data can be simple. We won’t have to specify any path.
Before loading importing the data from SPSS, we need to load the library “haven” so we can use the function read_sav()
library(haven)
school <- read_sav("school.sav")
school <- as.data.frame(school)
The data are now imported. So, let’s check the first 6 rows
head(school)
## f1sch_id bytxcomp tchinfl tchhappy teachmor g10ctrl1
## 1 393 56.97488 10.79412 3.7153530 5 2
## 2 831 61.25633 12.80000 1.4093887 4 2
## 3 847 57.55003 12.31818 3.1898999 4 2
## 4 877 55.71822 12.36364 0.6113877 4 2
## 5 1784 66.89900 13.50000 6.8838068 5 3
## 6 1995 56.14330 10.71429 4.0169053 5 2
We can also look at the last 6 rows by using the function tail()
tail(school)
## f1sch_id bytxcomp tchinfl tchhappy teachmor g10ctrl1
## 961 99553 50.51440 12.533333 1.346583 3 1
## 962 99666 49.40917 10.826087 4.408847 4 1
## 963 99688 52.17325 8.272727 -3.863894 4 1
## 964 99739 31.85600 11.500000 3.810915 3 1
## 965 99765 60.65325 13.375000 1.404974 4 1
## 966 99848 49.78333 8.850000 -3.001611 3 1
The functions head() and tail() show 6 obserbvations, by default. To look at a specific number of rows, we can spcify that by another argument. for example, read(school, 10) to look at the fiorst ten rows.
How may rows and column do we have in the data? we can use the dim() function
dim(school) # (this means 966 rows and 6 colums. the first is always the number of row)
## [1] 966 6
Suppose we want to look at some descriptions of all the variables in the data set, we can use the summary() on the whole data
summary(school)
## f1sch_id bytxcomp tchinfl tchhappy
## Min. : 81 Min. :30.95 Min. : 4.500 Min. :-19.2685
## 1st Qu.:26407 1st Qu.:47.60 1st Qu.: 9.855 1st Qu.: -2.3468
## Median :51530 Median :51.91 Median :11.441 Median : 0.3438
## Mean :50605 Mean :51.88 Mean :11.468 Mean : 0.1523
## 3rd Qu.:74479 3rd Qu.:55.91 3rd Qu.:13.000 3rd Qu.: 2.9278
## Max. :99848 Max. :70.50 Max. :20.000 Max. : 13.4438
## NA's :17 NA's :3
## teachmor g10ctrl1
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:1.000
## Median :4.000 Median :1.000
## Mean :3.866 Mean :1.328
## 3rd Qu.:4.000 3rd Qu.:1.000
## Max. :5.000 Max. :5.000
## NA's :6
we may also choose a particular variable. Say we want to look at the summary statistics of the variable ‘teacher happiness’, we can use the dollar sign. Let’s see how this works.
summary(school$tchhappy)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -19.2685 -2.3468 0.3438 0.1523 2.9278 13.4438 3
The function hist() shall return the histogram of a specified variable. Let’s look at ‘teacher hapiness’. we can add arguments like ‘main’ for title, ‘xlab’ for labeling the x axis.
hist(school$tchhappy, xlab = "teacher Hapiness", main = "My Histogram")
Now let’s see the scatteplot of ‘teacher hapiness’ and ‘teacher inlfuence’. For this, we use the function plot().
plot(school$tchhappy, school$tchinfl, main = "Scatter Plot", xlab = "Happiness", ylab = "Influence")
To get a fancier plot, we can use ggplot. Let’s load and plot the graph with it
library(ggplot2)
Now the plot. We will devote another entire module for ggplot later. This one is just to show a preview
ggplot(school, aes(tchhappy, tchinfl)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("My Plot") +
xlab("Hapiness") +
ylab("Influence") +
theme_bw()
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
The plot shows a positive relationship between teacher happiness and teacher influence. This suggests that the happier a teacher is, the more likely is he to have some influence on students. Buts let’s test the correlation between these two variables with the cor.test() function, to see if the association is statistically significant or not.
cor.test(school$tchhappy, school$tchinfl)
##
## Pearson's product-moment correlation
##
## data: school$tchhappy and school$tchinfl
## t = 14.715, df = 961, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3758271 0.4790161
## sample estimates:
## cor
## 0.4288193
the results shows a significant relationship. Let’s proceed to view the bocxplot of ‘teacher happyness’
boxplot(school$tchhappy, main = "hapiness")
Sometimes we transform variables in the process of anallyzing data. Transformations come in different forms based on what we want to achive. For example we may want to center a variable to its mean for statistical result interpretation purposes. One way to do this would be to perform the operaion directly in the specificatio of the analysis. Another way would be to create a new variable, say (teach influence - (mean of teach influence), that we attach to the data. Let’s do this. We shall call this new variable c.tchinfl (teacher influence centered on its mean)
let’s first load the package dplyr. this package is one of the best for data wranggling, and is widely used by data scientists. we will use the mutate() function to attacht the new transformed variable to our data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
school <- school %>% mutate(c.tchinfl = tchinfl - mean(tchinfl))
Now let’s look at the first few rows to check the data
head(school, 4)
## f1sch_id bytxcomp tchinfl tchhappy teachmor g10ctrl1 c.tchinfl
## 1 393 56.97488 10.79412 3.7153530 5 2 -0.6733904
## 2 831 61.25633 12.80000 1.4093887 4 2 1.3324920
## 3 847 57.55003 12.31818 3.1898999 4 2 0.8506738
## 4 877 55.71822 12.36364 0.6113877 4 2 0.8961284
There will be an entire module for the demonstration of how to use dplyr for data manipulation later.
Some useful simple functions. Let’s call them the variable ‘teacher hapiness’. We use ‘na.rm = TRUE’ to remove missing values while performing the operation
mean(school$tchhappy, na.rm = TRUE) # compute the mean of the variable
## [1] 0.1522909
var(school$tchhappy, na.rm = TRUE) # the variance. use sd()to get the standard dev.
## [1] 17.08527
range(school$tchhappy, na.rm = TRUE) # the range
## [1] -19.26853 13.44379
What about a simple linear regression? Let’s predict teacher influence with teacher hapiness with the function lm(), and take a quick look at the output. We store the analysis in an object, let’s call it ‘my.reg’
my.reg <- lm(tchinfl ~ 1 + tchhappy, data = school)
coef(my.reg) # get the regression coefficients
## (Intercept) tchhappy
## 11.4222178 0.2455881
Or to view the overall summary, we can use the summary() function
summary(my.reg)
##
## Call:
## lm(formula = tchinfl ~ 1 + tchhappy, data = school)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4147 -1.3895 -0.0404 1.2904 7.7647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.42222 0.06900 165.55 <2e-16 ***
## tchhappy 0.24559 0.01669 14.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 961 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.1839, Adjusted R-squared: 0.183
## F-statistic: 216.5 on 1 and 961 DF, p-value: < 2.2e-16
sometimes, we may just want to see the R squared value. For this we can use the library called ‘rsq’
library(rsq)
rsq(my.reg)
## [1] 0.183886
The package ‘rsq’ can be useful in the context of multiple regression, to obatin partail and semi partial R squared values, and to perform other tasks effficiently. This topic will be developped later more in depth.