This recitation is adapted from Evelyne Brie’s notes

Plan for the day

  1. Re-cap of merging
  2. Quantiles and two-way tables
  3. Descriptive Stats
  4. Discuss the problem set
  5. Exercise

Before getting started

Make sure you load the dplyr package in your R environment and download lfp1.csv, lfp2.csv, lfp.og.csv, and pokemon.csv from Canvas

# Data to download:
# lfp1.csv, lfp2.csv, lfp.og.csv, pokemon.csv
setwd("~/Google Drive/Penn/TA/Intro to DS/Rk_Recitation/Data")

# Load the dplyr package


lfp1 <- read.csv("lfp1.csv")
lfp2 <- read.csv("lfp2.csv")
lfp <- read.csv("lfpog.csv")
pokedex <- read.csv("pokemon.csv")

Re-cap of merging data

Rename the id column in lfp2, since the column we merge on must have the same name in both data frames:

lfp2$id <- lfp2$id.no # copy id.no into a column named id so it matches lfp1

Types of merges

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
no.match <- anti_join(lfp1,lfp2, by= "id")
no.match
## [1] id   lfp  k5   k618 age 
## <0 rows> (or 0-length row.names)

Why are there 0 observations in this merged data?

anti_join() looks at the first data frame in the argument and checks whether each id has a match in the second data frame; it returns only the rows of the first data frame that have no match in the second.

no.match <- anti_join(lfp2,lfp1, by= "id")
no.match
##    id.no  wc  hc       lwg    inc id
## 1     28  no yes 1.5532677 17.000 28
## 2     29 yes yes 1.9818200 15.100 29
## 3     30  no  no 1.7693523 15.700 30
## 4     31  no  no 0.4307832  5.119 31
## 5     32 yes yes 0.8997613 16.750 32
## 6     33 yes yes 1.7666236 13.600 33
## 7     34  no  no 1.2729658 17.100 34
## 8     35 yes yes 1.3367952 16.734 35
## 9     36  no  no 0.9016919 14.197 36
## 10    37  no yes 0.8651225 10.320 37

Now, why does it return 10 observations?

The order of the data is of extreme importance in these cases.

Difference between right_join and left_join?

lfp.merge1 <- left_join(lfp1,lfp2, by= "id")
lfp.merge2 <- right_join(lfp1,lfp2, by= "id")
dim(lfp.merge1)
## [1] 743  10
dim(lfp.merge2)
## [1] 753  10

Repeat after me: the order in which the data frames are entered into the arguments matters. left_join() takes each unique id in lfp1 and finds its corresponding match in lfp2, joining them in the merged data. In addition, if it cannot find a match for a given id in lfp2, it does not drop that row; instead, it fills the columns coming from lfp2 with NA.

right_join() starts with lfp2, i.e. the data frame on the right (the second one in the argument), and tries to find its match in lfp1 (the first data frame in the argument), merging accordingly.
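A quick way to see this behaviour in the right-joined data is to look at the rows whose lfp1 columns came back as NA (a small sketch; the column name lfp is taken from the anti_join output above):

# Rows of lfp.merge2 whose lfp1 columns are NA --
# these are the ids that exist only in lfp2
lfp.merge2[is.na(lfp.merge2$lfp), ]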

Difference between full_join and inner_join?
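One way to see the difference is to run both and compare the dimensions (a sketch; the exact row counts depend on the data you loaded):

lfp.full <- full_join(lfp1, lfp2, by = "id")   # keeps all ids from either data frame
lfp.inner <- inner_join(lfp1, lfp2, by = "id") # keeps only ids present in both data frames
dim(lfp.full)
dim(lfp.inner)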

You can use the setdiff() function to identify values that appear in x but not in y. Let’s try it on the id columns:

setdiff(lfp1$id,lfp2$id)
## integer(0)
setdiff(lfp2$id,lfp1$id)
##  [1] 28 29 30 31 32 33 34 35 36 37

Why are the two answers different?

Quantiles and two-way tables

lfp <- read.csv("lfpog.csv")

Identifying the quintiles in the dataset (i.e. breaking the df into 5 equal-sized groups)

qt <- quantile(lfp$age,c(0,.2,.4,.6,.8,1))
qt
##   0%  20%  40%  60%  80% 100% 
##   30   34   40   45   50   60

This can also be done as follows:

qt<- quantile(lfp$age, prob = seq(0, 1, length = 6)) 
qt
##   0%  20%  40%  60%  80% 100% 
##   30   34   40   45   50   60
# length will always be 1 more than the no. of groups you want
# In this case, I want 5 groups, so I specify length as 6.

The output is the same for both.

How do you read the output? E.g., it says that 20% of the individuals in the df are 34 years old or younger, and 80% of the individuals are 50 or younger.
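You can check this directly by computing the proportion of rows at or below a given cut-off (a quick sketch; because age takes repeated integer values, the proportion will be close to, but not exactly, 0.2):

# proportion of individuals aged 34 or below -- should be roughly 0.2
mean(lfp$age <= 34)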

Now, create a new variable that cuts age into these 5 buckets defined in qt:

lfp$ageQT <- cut(lfp$age, breaks = qt, include.lowest = TRUE) # include.lowest means the minimum value (the 0th percentile) is counted in the first group
table(lfp$ageQT)
## 
## [30,34] (34,40] (40,45] (45,50] (50,60] 
##     162     156     144     144     147

Now, repeat this to get terciles of income (and create a new column indicating them):

qt<- quantile(lfp$inc, prob = seq(0, 1, length = 4)) 
qt
##        0% 33.33333% 66.66667%      100% 
##    -0.029    14.540    21.660    96.000
lfp$incQT <- cut(lfp$inc,breaks=qt,include.lowest = TRUE)
table(lfp$incQT)
## 
## [-0.029,14.5]   (14.5,21.7]     (21.7,96] 
##           251           251           251

Now, create a two-way table with the number of individuals falling in each income tercile, split by age quintile:

table(lfp$ageQT,lfp$incQT)
##          
##           [-0.029,14.5] (14.5,21.7] (21.7,96]
##   [30,34]            63          57        42
##   (34,40]            52          49        55
##   (40,45]            45          46        53
##   (45,50]            40          50        54
##   (50,60]            51          49        47

Descriptive stats:

Functions we will be using: mean(), median(), var(), sd(), cor(), cov()

Before we get started, load the Pokemon df from last week (ignore this if you already loaded it at the start)

There are NAs in the df, so for the purposes of this analysis we can remove them from the whole dataset.

pokedex <- na.omit(pokedex)
dim(pokedex)
## [1] 684  41

1. Mean

The mean is the sum of a set of values divided by the total number of observations.

Let’s find the mean height of all the Pokemon in the dataset. Done manually, it looks as follows:

#Sum of height_m of each Pokemon divided by total number of Pokemons

sum(pokedex$height_m)/length(pokedex$height_m)
## [1] 1.063596

Or we could directly use the mean() function:

mean(pokedex$height_m)
## [1] 1.063596

2. Median

The median is the midpoint of a given set of values: it divides the data into a lower half and an upper half.

median(pokedex$height_m)
## [1] 0.9

How do you interpret this result?

3. Variance

Variance is the average squared deviation of the values from the mean. It is a measure of how spread out the data is. Manually, you could calculate the variance as follows:

sum((pokedex$height_m -mean(pokedex$height_m))^2)/ (length(pokedex$height_m) -1)
## [1] 0.8552029
# using n-1 rather than n (Bessel's correction) since we are working with a sample
# or you could just use the var() function

var(pokedex$height_m)
## [1] 0.8552029

4. Standard deviation

SD is the square root of the variance. The standard deviation is expressed in the same units as the mean is, whereas the variance is expressed in squared units.

 sqrt(sum((pokedex$height_m -mean(pokedex$height_m))^2)/ (length(pokedex$height_m) -1))
## [1] 0.9247718
 # Or 
 sd(pokedex$height_m) 
## [1] 0.9247718

5. Covariance

Covariance is a measure that tells us the extent to which one variable changes in tandem with another. Its value can lie between -∞ and ∞. A large covariance can indicate a strong relationship between variables. If larger values of one variable correspond to lower values of the other, the covariance will be negative; if larger values go with larger values (and lower with lower), it will be positive.
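As with the variance, you can compute it by hand as the average product of the two variables' deviations from their means (a sketch, using the same n-1 correction):

# average product of deviations from the two means, with the n-1 correction
sum((pokedex$sp_defense - mean(pokedex$sp_defense)) *
    (pokedex$sp_attack - mean(pokedex$sp_attack))) / (nrow(pokedex) - 1)

Or use the cov() function directly: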

 cov(pokedex$sp_defense, pokedex$sp_attack)
## [1] 371.7915

The covariance doesn’t tell you much about the strength of the relationship, only its direction, because its magnitude depends on the units of the variables. That is where correlation comes into play.

6. Correlation

Correlation is a statistical measure that indicates how strongly two variables are related. It is a scaled version of covariance, and its value always lies between -1 and +1.
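You can see the scaling explicitly: dividing the covariance by the product of the two standard deviations gives the correlation (a quick sketch):

# covariance rescaled by both standard deviations equals the correlation
cov(pokedex$sp_defense, pokedex$sp_attack) /
  (sd(pokedex$sp_defense) * sd(pokedex$sp_attack))

Or use cor() directly: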

 cor(pokedex$sp_defense, pokedex$sp_attack)
## [1] 0.4790966

Since we see a moderately strong positive relationship, let’s see what the scatter plot looks like:

plot(pokedex$sp_defense, pokedex$sp_attack)

Exercise

  1. Using the pokedex data, get the terciles of base_happiness. Then create a two-way table of the number of Pokemon that lie in each tercile, split by the quartiles of height_m.

Hint: Use quantile() and cut()

  2. Merge the lfp1 and lfp2 data and find the correlation and covariance between inc and age in the merged data.

Hint: Use cov() and cor()

  3. Create a scatter plot of age and inc.

Hint: Use plot()