setwd("~/Desktop/R Materials/mih140/Lecture 17 - Data Processing I")
docs = read.table("ED_doctors.txt", sep = "|", header = T)
visits = read.table("ED_visits.txt", sep = "|", header = T)
shifts = read.table("ED_shifts.txt", sep = "|", header = T)
# Note the sep argument: in these files the data is separated by a "|" marker, as opposed to the "\t" or "," separators we're used to.
In this notebook we will go over some fundamentals of data processing. These are tasks you’ll have to do over and over throughout your data analysis career, so it’ll pay big dividends in the future to master the skills showcased here.
Motivation: Very often we need to sort data by a particular column. This can be easily done in R using the order function!
# Task 1: Suppose we'd like to sort the visits from least to most expensive (or vice versa!)
# We can do this using the order function
visits = visits[order(visits$PS_Payments),]
visits$PS_Payments[1:10] # We see two payments in the dataset are negative!
## [1] -108.08 -50.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## [10] 0.00
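Aside: sorting is one way to surface oddities like those negative payments; plain logical subsetting pulls out the same rows directly. A quick sketch:
visits[visits$PS_Payments < 0,] # just the rows with negative payments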
# To sort them the other way, use the decreasing param in order()
visits = visits[order(visits$PS_Payments, decreasing = T),]
visits$PS_Payments[1:10] # We see the most expensive visits are 4655, 4366 etc.
## [1] 4655.00 4366.71 4214.00 3518.12 3238.00 3163.30 3021.51 2990.00 2982.00
## [10] 2789.41
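order() also accepts multiple keys for tie-breaking: earlier arguments take priority, and negating a numeric column reverses its direction. A minimal sketch using the Patient_Age column that appears later in these notes:
# Sort by age (ascending), breaking ties by payment (descending)
visits_by_age = visits[order(visits$Patient_Age, -visits$PS_Payments),]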
Motivation: Like sorting, many times we just want a small, randomly chosen sample of the data. For instance, to get a sense of a large dataset we can work with a representative sample instead.
# This dataset has a lot of visits! 125000+
# To make things more manageable, let's sample ten percent of the visits, uniformly at random. This sample should still be representative but will be much smaller.
# SYNTAX: sample(num_rows, desired_sample_size)
visits_sampled = visits[sample(nrow(visits), round(nrow(visits)/10)),] # Note size is now 12595 observations, randomly chosen.
# Most aggregate statistics remain roughly the same for the full dataset and the sample, e.g.
print(c(mean(visits$Patient_Age), mean(visits_sampled$Patient_Age)))
## [1] 5.577527 5.593569
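One caveat: sample() draws a different subset every time it runs. For reproducible results, fix the random seed first; a quick sketch (the seed value 140 is arbitrary):
set.seed(140) # any fixed integer gives a repeatable draw
visits_sampled = visits[sample(nrow(visits), round(nrow(visits)/10)),]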
Motivation: Once we have our data we often want to get a sense of what's in it by looking at some representative observations, seeing summary statistics about the features, and quickly visualizing the numeric features' distributions.
# The head() and tail() functions will give you the first six and last six observations respectively. Good way of making sure nothing is wrong with the tail end of your dataset too!
head(visits)
tail(visits)
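For the summary statistics and quick distribution checks mentioned above, base R's summary() and hist() cover most needs; a minimal sketch:
summary(visits$PS_Payments) # min, quartiles, mean, and max in one line
hist(visits$Patient_Age, main = "Patient Age", xlab = "Age") # quick look at a numeric feature's distribution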
Motivation: After examining our data we may find spurious outliers that can interfere with our methods (especially those involving means!). There are a number of common rules for deciding whether a point is an outlier; below we highlight two.
# Suppose we want to remove anomalous observations in terms of payment.
boxplot(visits$PS_Payments) # seems like a few big ones are anomalies
quants = quantile(visits$PS_Payments) # the 0%, 25%, 50%, 75%, 100% quantiles
L = quants[2] # 25th percentile (Q1)
U = quants[4] # 75th percentile (Q3)
diff = U - L # the interquartile range (IQR)
L = L - 1.5*diff # lower fence: Q1 - 1.5*IQR
U = U + 1.5*diff # upper fence: Q3 + 1.5*IQR
print(c(L,U))
## 25% 75%
## 5.13 160.97
# Only take observations in the range [L, U]
visits_pruned = visits[visits$PS_Payments >= L & visits$PS_Payments <= U,] # 25 thousand obs removed
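Since the IQR rule is a recipe we may reuse, it can be packaged as a small helper function. A sketch (iqr_bounds is our own name, not a built-in):
# Hypothetical helper: returns the IQR-rule bounds for a numeric vector
iqr_bounds = function(x) {
  q = unname(quantile(x, c(0.25, 0.75))) # Q1 and Q3
  spread = q[2] - q[1] # the IQR
  c(lower = q[1] - 1.5*spread, upper = q[2] + 1.5*spread)
}
iqr_bounds(visits$PS_Payments) # same L and U as computed above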
# The 3-standard-deviation method is very similar: only keep points within 3 sample standard deviations of the sample mean.
mean_pay = mean(visits$PS_Payments)
std_pay = sd(visits$PS_Payments)
U = mean_pay + 3*std_pay
L = mean_pay - 3*std_pay
print(c(L,U))
## [1] -263.8002 453.2708
# Only take observations in the range [\mu - 3\sigma, \mu + 3\sigma]
visits_pruned_v2 = visits[visits$PS_Payments >= L & visits$PS_Payments <= U,] # Less aggressive than the IQR rule here, since these bounds are wider
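Equivalently, the 3-SD rule can be stated with z-scores: keep a point only if its standardized value is at most 3 in absolute value. A sketch that reproduces visits_pruned_v2:
z = (visits$PS_Payments - mean_pay) / std_pay # z-score of each payment
visits_pruned_v3 = visits[abs(z) <= 3,] # identical result to visits_pruned_v2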
Motivation: Oftentimes we want the numeric features of our data to all be on the same "scale". We accomplish this by normalizing (standardizing) the data so it has sample mean 0 and sample standard deviation 1. Note this sacrifices something in terms of interpretability.
# Let's normalize patient age using the scale() function.
norm_age = scale(visits$Patient_Age) # scale() standardizes a numeric vector (returned as a one-column matrix)
print(c(mean(norm_age), sd(norm_age))) # Mean 0, sd 1
## [1] 8.216374e-17 1.000000e+00
visits = data.frame(visits, norm_age) # We can add our normalized features to our dataframe using the data.frame() function.
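A useful detail: scale() records the original mean and standard deviation as attributes on its output, so the standardization is reversible. A quick sketch:
center = attr(norm_age, "scaled:center") # the sample mean scale() subtracted
spread = attr(norm_age, "scaled:scale") # the sample sd scale() divided by
orig_age = norm_age * spread + center # recovers visits$Patient_Age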
Motivation: Oftentimes our data does not conform to the assumptions of our tests or methods. In these cases we can sometimes save ourselves by transforming the data. One common transformation is to run the data through the log() function.
# Example: Suppose our data appears to have multiplicative error, as shown by the residuals having a funnel shape. In this case we can perform a log transform to make the data fit the regression assumptions.
model = lm(data = visits_sampled, PS_Charges ~ Patient_Age) # Funnel-shaped residuals: the constant-variance assumption fails
model_1 = lm(data = visits_sampled, log(PS_Charges) ~ Patient_Age) # The log transform helps, but constant variance still looks bad
# What if we combine pruning and logs?
visits_sampled_pruned = visits_sampled[visits_sampled$PS_Charges >= 500 & visits_sampled$PS_Charges <= 5000,]
model_2 = lm(data = visits_sampled_pruned, log1p(PS_Charges) ~ Patient_Age) # log1p(x) = log(1 + x), which is safe even at 0
# Better fit of assumptions!
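To see the improvement rather than take it on faith, plot residuals against fitted values for each model; a roughly even vertical spread (no funnel) is what the constant-variance assumption requires. A minimal sketch:
plot(fitted(model), resid(model), main = "Raw charges") # funnel shape expected
plot(fitted(model_2), resid(model_2), main = "Pruned + log1p") # spread should look more even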