Looking at data in R

First thing, the schools data in SPSS format are saved on the desktop. Desktop is set as the working directory, so the code to retrieve the data can be simple. We won’t have to specify any path.

Before loading importing the data from SPSS, we need to load the library “haven” so we can use the function read_sav()

library(haven)
school <- read_sav("school.sav")
school <- as.data.frame(school)

The data are now imported. So, let’s check the first 6 rows

head(school)

##   f1sch_id bytxcomp  tchinfl  tchhappy teachmor g10ctrl1
## 1      393 56.97488 10.79412 3.7153530        5        2
## 2      831 61.25633 12.80000 1.4093887        4        2
## 3      847 57.55003 12.31818 3.1898999        4        2
## 4      877 55.71822 12.36364 0.6113877        4        2
## 5     1784 66.89900 13.50000 6.8838068        5        3
## 6     1995 56.14330 10.71429 4.0169053        5        2

We can also look at the last 6 rows by using the function tail()

tail(school)

##     f1sch_id bytxcomp   tchinfl  tchhappy teachmor g10ctrl1
## 961    99553 50.51440 12.533333  1.346583        3        1
## 962    99666 49.40917 10.826087  4.408847        4        1
## 963    99688 52.17325  8.272727 -3.863894        4        1
## 964    99739 31.85600 11.500000  3.810915        3        1
## 965    99765 60.65325 13.375000  1.404974        4        1
## 966    99848 49.78333  8.850000 -3.001611        3        1

The functions head() and tail() show 6 obserbvations, by default. To look at a specific number of rows, we can spcify that by another argument. for example, read(school, 10) to look at the fiorst ten rows.

How may rows and column do we have in the data? we can use the dim() function

dim(school) # (this means 966 rows and 6 colums. the first is always the number of row)

## [1] 966   6

Suppose we want to look at some descriptions of all the variables in the data set, we can use the summary() on the whole data

summary(school)

##     f1sch_id        bytxcomp        tchinfl          tchhappy       
##  Min.   :   81   Min.   :30.95   Min.   : 4.500   Min.   :-19.2685  
##  1st Qu.:26407   1st Qu.:47.60   1st Qu.: 9.855   1st Qu.: -2.3468  
##  Median :51530   Median :51.91   Median :11.441   Median :  0.3438  
##  Mean   :50605   Mean   :51.88   Mean   :11.468   Mean   :  0.1523  
##  3rd Qu.:74479   3rd Qu.:55.91   3rd Qu.:13.000   3rd Qu.:  2.9278  
##  Max.   :99848   Max.   :70.50   Max.   :20.000   Max.   : 13.4438  
##                  NA's   :17                       NA's   :3         
##     teachmor        g10ctrl1    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:1.000  
##  Median :4.000   Median :1.000  
##  Mean   :3.866   Mean   :1.328  
##  3rd Qu.:4.000   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :5.000  
##  NA's   :6

we may also choose a particular variable. Say we want to look at the summary statistics of the variable ‘teacher happiness’, we can use the dollar sign. Let’s see how this works.

summary(school$tchhappy)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -19.2685  -2.3468   0.3438   0.1523   2.9278  13.4438        3

The function hist() shall return the histogram of a specified variable. Let’s look at ‘teacher hapiness’. we can add arguments like ‘main’ for title, ‘xlab’ for labeling the x axis.

hist(school$tchhappy, xlab = "teacher Hapiness", main = "My Histogram")

Now let’s see the scatteplot of ‘teacher hapiness’ and ‘teacher inlfuence’. For this, we use the function plot().

plot(school$tchhappy, school$tchinfl, main = "Scatter Plot", xlab = "Happiness", ylab = "Influence")

To get a fancier plot, we can use ggplot. Let’s load and plot the graph with it

library(ggplot2)

Now the plot. We will devote another entire module for ggplot later. This one is just to show a preview

ggplot(school, aes(tchhappy, tchinfl)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("My Plot") +
  xlab("Hapiness") +
  ylab("Influence") +
  theme_bw()

## Warning: Removed 3 rows containing non-finite values (stat_smooth).

## Warning: Removed 3 rows containing missing values (geom_point).

The plot shows a positive relationship between teacher happiness and teacher influence. This suggests that the happier a teacher is, the more likely is he to have some influence on students. Buts let’s test the correlation between these two variables with the cor.test() function, to see if the association is statistically significant or not.

cor.test(school$tchhappy, school$tchinfl)

## 
##  Pearson's product-moment correlation
## 
## data:  school$tchhappy and school$tchinfl
## t = 14.715, df = 961, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3758271 0.4790161
## sample estimates:
##       cor 
## 0.4288193

the results shows a significant relationship. Let’s proceed to view the bocxplot of ‘teacher happyness’

boxplot(school$tchhappy, main = "hapiness")

Sometimes we transform variables in the process of anallyzing data. Transformations come in different forms based on what we want to achive. For example we may want to center a variable to its mean for statistical result interpretation purposes. One way to do this would be to perform the operaion directly in the specificatio of the analysis. Another way would be to create a new variable, say (teach influence - (mean of teach influence), that we attach to the data. Let’s do this. We shall call this new variable c.tchinfl (teacher influence centered on its mean)

let’s first load the package dplyr. this package is one of the best for data wranggling, and is widely used by data scientists. we will use the mutate() function to attacht the new transformed variable to our data.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

school <- school %>% mutate(c.tchinfl = tchinfl - mean(tchinfl))

Now let’s look at the first few rows to check the data

head(school, 4)

##   f1sch_id bytxcomp  tchinfl  tchhappy teachmor g10ctrl1  c.tchinfl
## 1      393 56.97488 10.79412 3.7153530        5        2 -0.6733904
## 2      831 61.25633 12.80000 1.4093887        4        2  1.3324920
## 3      847 57.55003 12.31818 3.1898999        4        2  0.8506738
## 4      877 55.71822 12.36364 0.6113877        4        2  0.8961284

There will be an entire module for the demonstration of how to use dplyr for data manipulation later.

Some useful simple functions. Let’s call them the variable ‘teacher hapiness’. We use ‘na.rm = TRUE’ to remove missing values while performing the operation

mean(school$tchhappy, na.rm = TRUE) # compute the mean of the variable

## [1] 0.1522909

var(school$tchhappy, na.rm = TRUE) # the variance. use sd()to get the standard dev.

## [1] 17.08527

range(school$tchhappy, na.rm = TRUE) # the range

## [1] -19.26853  13.44379

What about a simple linear regression? Let’s predict teacher influence with teacher hapiness with the function lm(), and take a quick look at the output. We store the analysis in an object, let’s call it ‘my.reg’

my.reg <- lm(tchinfl ~ 1 + tchhappy, data = school)

coef(my.reg) # get the regression coefficients

## (Intercept)    tchhappy 
##  11.4222178   0.2455881

Or to view the overall summary, we can use the summary() function

summary(my.reg)

## 
## Call:
## lm(formula = tchinfl ~ 1 + tchhappy, data = school)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4147 -1.3895 -0.0404  1.2904  7.7647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.42222    0.06900  165.55   <2e-16 ***
## tchhappy     0.24559    0.01669   14.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 961 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.1839, Adjusted R-squared:  0.183 
## F-statistic: 216.5 on 1 and 961 DF,  p-value: < 2.2e-16

sometimes, we may just want to see the R squared value. For this we can use the library called ‘rsq’

library(rsq)

rsq(my.reg)

## [1] 0.183886

The package ‘rsq’ can be useful in the context of multiple regression, to obatin partail and semi partial R squared values, and to perform other tasks effficiently. This topic will be developped later more in depth.

Looking at data in R

J mess

November 1, 2018