Julius Schmid
This report is an applied project for the data visualization class. We analyze a given data set, including visualizations, a time series analysis, and a regression analysis.
Data Preparation
First, we import the necessary libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(reshape2)
Let us start by reading the input CSV file.
df <- read.csv("XYZ LLC 2021 P&L.CSV")
To see what the data set looks like, we print the first 6 rows, using the head() function.
head(df)
## X X01.01.2022 X02.01.2022 X03.01.2022 X04.01.2022
## 1 Ordinary Income/Expense NA NA NA NA
## 2 Income NA NA NA NA
## 3 Rental Income NA NA NA NA
## 4 Unit 1115 16662.87 0 16662.87 16662.87
## 5 Unit 1126 15500.00 15500 15500.00 15500.00
## 6 Unit 1215 16300.00 16300 16300.00 16300.00
## X05.01.2022 X06.01.2022 X07.01.2022 X08.01.2022 X09.01.2022 X10.01.2022
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 16662.87 16662.87 16662.87 16662.87 0 16662.87
## 5 15500.00 15500.00 15500.00 15500.00 15500 15500.00
## 6 16300.00 16300.00 16300.00 16300.00 16300 16300.00
## X11.01.2022 X12.01.2022 X01.01.2023 X01.02.2023 X03.01.2023 X04.01.2023
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 NA NA NA NA NA NA
## 4 16662.87 0 16662.87 16662.87 16662.87 0
## 5 15500.00 15500 15500.00 15500.00 15500.00 15500
## 6 16300.00 16300 0.00 16300.00 16300.00 16300
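As an aside, the X prefixes and the dots in the column names come from read.csv()'s default check.names = TRUE, which coerces the raw headers into syntactically valid R names. A minimal sketch of how to keep the original headers instead, using a hypothetical df_raw:
# Re-read the file without mangling the headers into valid R names
df_raw <- read.csv("XYZ LLC 2021 P&L.CSV", check.names = FALSE)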
We immediately remark that, for our analysis, it would be better to have the dates as rows and the units as columns. Hence, we transpose the data frame from above.
# Make changes to the original dataset (especially to the headers)
# => Facilitates the transposing process
df_pre_transpose <- read.csv("XYZ LLC 2021 P&L.CSV", header = FALSE)
colnames(df_pre_transpose) <- c(1:17)
df_pre_transpose[1,1] <- 'Date'
# Transpose the data frame
df_transposed <- data.frame(t(df_pre_transpose[-1]))
colnames(df_transposed) <- df_pre_transpose[, 1]
Again, let us apply the head() function to see what the first 6 rows look like.
head(df_transposed)
## Date Ordinary Income/Expense Income Rental Income Unit 1115 Unit 1126
## 2 01-01-2022 16662.87 15500
## 3 02-01-2022 0 15500
## 4 03-01-2022 16662.87 15500
## 5 04-01-2022 16662.87 15500
## 6 05-01-2022 16662.87 15500
## 7 06-01-2022 16662.87 15500
## Unit 1215 Unit 1223 Unit 1313 Unit 1314 Unit 1315 Total Rental Income Expense
## 2 16300 3000 14761 17232 16225 99680.87
## 3 16300 3000 14761 17232 16225 83018
## 4 16300 3000 14761 0 16225 82448.87
## 5 16300 3000 0 17232 16225 84919.87
## 6 16300 3000 14761 17232 16225 99680.87
## 7 16300 3000 14761 17232 16225 99680.87
## Bank Charges Bright Star Credit Union Bank Charges - Other Total Bank Charges
## 2 63.78 218.97 282.75
## 3 63.78 218.97 282.75
## 4 63.78 218.97 282.75
## 5 63.78 218.97 282.75
## 6 63.78 218.97 282.75
## 7 63.78 218.97 282.75
## Commission Unit 1215 Unit 1223 Unit 1315 Total Commission
## 2 1475 1475 2950
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 1500 0 1500
## 6 0 0 0 0
## 7 0 0 0 0
## Depreciation Expense Filing Fees HOA Dues Total Electricity Total Utilities
## 2 21802 138.75 307.01 307.01
## 3 21802 138.75 307.01 307.01
## 4 21802 138.75 307.01 307.01
## 5 21802 138.75 307.01 307.01
## 6 21802 138.75 307.01 307.01
## 7 21802 138.75 307.01 307.01
## Total Expense Net Income
## 2 25787.52 73893.35
## 3 22837.52 60180.48
## 4 22837.52 59611.35
## 5 24337.52 60582.35
## 6 22837.52 76843.35
## 7 22837.52 76843.35
This is exactly what we would expect the data to look like. For further use, we rename this data frame to ‘data’.
data <- df_transposed
After transposing, note that the data frame contains entire columns of NA values. Our next objective is to cleanse the data, which we do by eliminating the NA values and looking for outliers.
Let us start by eliminating the columns that contain NA values. Note that na.omit() would remove rows rather than columns, so we instead keep only the columns whose count of NA values is zero.
# Replace all empty values with NA
data[data == ''] <- NA
# Remove all columns with NA values
data <- data[ , colSums(is.na(data))==0]
# Print first 6 rows of the new data frame
head(data)
## Date Unit 1115 Unit 1126 Unit 1215 Unit 1223 Unit 1313 Unit 1314
## 2 01-01-2022 16662.87 15500 16300 3000 14761 17232
## 3 02-01-2022 0 15500 16300 3000 14761 17232
## 4 03-01-2022 16662.87 15500 16300 3000 14761 0
## 5 04-01-2022 16662.87 15500 16300 3000 0 17232
## 6 05-01-2022 16662.87 15500 16300 3000 14761 17232
## 7 06-01-2022 16662.87 15500 16300 3000 14761 17232
## Unit 1315 Total Rental Income Bright Star Credit Union Bank Charges - Other
## 2 16225 99680.87 63.78 218.97
## 3 16225 83018 63.78 218.97
## 4 16225 82448.87 63.78 218.97
## 5 16225 84919.87 63.78 218.97
## 6 16225 99680.87 63.78 218.97
## 7 16225 99680.87 63.78 218.97
## Total Bank Charges Unit 1215.1 Unit 1315.1 Total Commission
## 2 282.75 1475 1475 2950
## 3 282.75 0 0 0
## 4 282.75 0 0 0
## 5 282.75 0 0 1500
## 6 282.75 0 0 0
## 7 282.75 0 0 0
## Depreciation Expense Filing Fees Total Electricity Total Utilities
## 2 21802 138.75 307.01 307.01
## 3 21802 138.75 307.01 307.01
## 4 21802 138.75 307.01 307.01
## 5 21802 138.75 307.01 307.01
## 6 21802 138.75 307.01 307.01
## 7 21802 138.75 307.01 307.01
## Total Expense Net Income
## 2 25787.52 73893.35
## 3 22837.52 60180.48
## 4 22837.52 59611.35
## 5 24337.52 60582.35
## 6 22837.52 76843.35
## 7 22837.52 76843.35
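For reference, the same column filter can also be written with dplyr, which is already loaded; a sketch, assuming dplyr >= 1.0.0 for where():
# Equivalent dplyr version: keep only columns that contain no NA values
data <- select(data, where(~ !any(is.na(.x))))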
Next, we remark that most of the columns are interpreted as character.
sapply(data, class)
## Date Unit 1115 Unit 1126
## "character" "character" "character"
## Unit 1215 Unit 1223 Unit 1313
## "character" "character" "character"
## Unit 1314 Unit 1315 Total Rental Income
## "character" "character" "character"
## Bright Star Credit Union Bank Charges - Other Total Bank Charges
## "character" "character" "character"
## Unit 1215.1 Unit 1315.1 Total Commission
## "character" "character" "character"
## Depreciation Expense Filing Fees Total Electricity
## "character" "character" "character"
## Total Utilities Total Expense Net Income
## "character" "character" "character"
However, to analyze the data, we need to have numeric values in numeric columns. Except for ‘Date’, we transform all character columns into numerical ones.
data[c(2:21)] <- as.data.frame(apply(data[c(2:21)], 2, as.numeric))
sapply(data, class)
## Date Unit 1115 Unit 1126
## "character" "numeric" "numeric"
## Unit 1215 Unit 1223 Unit 1313
## "numeric" "numeric" "numeric"
## Unit 1314 Unit 1315 Total Rental Income
## "numeric" "numeric" "numeric"
## Bright Star Credit Union Bank Charges - Other Total Bank Charges
## "numeric" "numeric" "numeric"
## Unit 1215.1 Unit 1315.1 Total Commission
## "numeric" "numeric" "numeric"
## Depreciation Expense Filing Fees Total Electricity
## "numeric" "numeric" "numeric"
## Total Utilities Total Expense Net Income
## "numeric" "numeric" "numeric"
Now we can perform calculations on the data.
At this point, we notice that there are two columns representing Unit 1215 and two representing Unit 1315, with differing values between them (so they are not mere duplicates). We can assume that they belong together because the original (untransposed) data frame contains two rows named Unit 1215 and two rows named Unit 1315. The respective second rows could refer to later adjustments for the same date, which would also explain why most of the entries in the second Unit 1215 and the second Unit 1315 columns are 0.
Since we want a single column per unit, we add the two Unit 1215 columns together and do the same for Unit 1315. After that, we drop the now-redundant second column for each unit.
# Aggregate the two columns for Unit 1215 and Unit 1315
data['Unit 1215'] = data['Unit 1215'] + data['Unit 1215.1']
data['Unit 1315'] = data['Unit 1315'] + data['Unit 1315.1']
# Drop Unit 1215.1 and Unit 1315.1
data <- select(data, -c('Unit 1215.1', 'Unit 1315.1'))
# Print the first rows of the data
head(data)
## Date Unit 1115 Unit 1126 Unit 1215 Unit 1223 Unit 1313 Unit 1314
## 2 01-01-2022 16662.87 15500 17775 3000 14761 17232
## 3 02-01-2022 0.00 15500 16300 3000 14761 17232
## 4 03-01-2022 16662.87 15500 16300 3000 14761 0
## 5 04-01-2022 16662.87 15500 16300 3000 0 17232
## 6 05-01-2022 16662.87 15500 16300 3000 14761 17232
## 7 06-01-2022 16662.87 15500 16300 3000 14761 17232
## Unit 1315 Total Rental Income Bright Star Credit Union Bank Charges - Other
## 2 17700 99680.87 63.78 218.97
## 3 16225 83018.00 63.78 218.97
## 4 16225 82448.87 63.78 218.97
## 5 16225 84919.87 63.78 218.97
## 6 16225 99680.87 63.78 218.97
## 7 16225 99680.87 63.78 218.97
## Total Bank Charges Total Commission Depreciation Expense Filing Fees
## 2 282.75 2950 21802 138.75
## 3 282.75 0 21802 138.75
## 4 282.75 0 21802 138.75
## 5 282.75 1500 21802 138.75
## 6 282.75 0 21802 138.75
## 7 282.75 0 21802 138.75
## Total Electricity Total Utilities Total Expense Net Income
## 2 307.01 307.01 25787.52 73893.35
## 3 307.01 307.01 22837.52 60180.48
## 4 307.01 307.01 22837.52 59611.35
## 5 307.01 307.01 24337.52 60582.35
## 6 307.01 307.01 22837.52 76843.35
## 7 307.01 307.01 22837.52 76843.35
As a last step in data preprocessing, we rename the columns, since column names containing spaces are awkward to reference in R. Further, we add an additional column “year”, which will be helpful when investigating whether there is a difference between the 2022 and the 2023 values.
colnames(data) <- c("Date", "Unit1115", "Unit1126", "Unit1215",
"Unit1223", "Unit1313", "Unit1314",
"Unit1315", "TotalRentalIncome",
"BrightStarCreditUnion", "BankChargesOther",
"TotalBankCharges", "TotalCommission",
"DepreciationExpense", "FillingFees",
"TotalElectricity", "TotalUtilities",
"TotalExpense", "NetIncome")
year <- c(rep(2022, 12), rep(2023, 4))
data$year <- year
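Alternatively, the year could be derived from the Date column itself instead of being hard-coded; a sketch, assuming the dates are consistently formatted as “MM-DD-YYYY”:
# Parse the year out of the Date strings
data$year <- as.numeric(format(as.Date(data$Date, format = "%m-%d-%Y"), "%Y"))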
Data Visualization
In order to display meaningful graphs, we need to examine the correlations between the variables. To do that, we use the pairs.panels() function from the psych library, considering only attributes that have a nonzero standard deviation. (Applying sd() to the character column Date triggers the coercion warning below and yields NA, so Date is excluded automatically.)
# Select all the columns with a positive standard deviation
column_std <- sapply(data, sd)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
stdgt0 <- which(column_std > 0)
data_stdgt0 <- data[, stdgt0]
# Print pairwise correlation for all these columns
pairs.panels(data_stdgt0)
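For reference, the underlying coefficients can also be inspected directly as a plain correlation matrix:
# Numeric correlation matrix for the same columns, rounded to two digits
round(cor(data_stdgt0), 2)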
The highest correlation is found between Total Rental Income and Net Income. Let us create a scatter plot of these two columns using ggplot() with geom_point(), and add a regression line with geom_smooth(). Since many points have exactly the same numerical values and sit on top of each other, we additionally draw a jittered layer of the same points with geom_jitter(), which makes the relative distribution of the points easier to see.
ggplot(data = data,
mapping = aes(x = TotalRentalIncome,
y = NetIncome)) +
geom_point(colour = "#f05053") +
geom_smooth(method="loess", colour = "#093028") +
geom_jitter(width = 500, height = 500, colour = "#f05053") +
labs(x = "Total Rental Income",
y = "Net Income",
title = "Net Income in terms of Total Rent Income",
subtitle = "Data points are different dates",
caption = "Source: XYZ LLC 2021 P&L.CSV")
## `geom_smooth()` using formula = 'y ~ x'
For a second plot, we compare the different units to each other with several line plots displayed in the same graph. We use the geom_line() function and restrict ourselves to the 2022 values, since those are evenly spaced in time.
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Select all the Unit columns together with the Date
data_2022 <- data_2022[,c(1:8)]
data_2022 <- melt(data_2022 , id.vars = 'Date', variable.name = 'Unit')
# Create line plot for each Unit
ggplot(data = data_2022,
mapping = aes(x = Date,
y = value,
group=Unit)) +
geom_line(aes(colour = Unit)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Date",
y = "Amount",
title = "Amount for each Unit against Date",
subtitle = "Each Line represents one Unit",
caption = "Source: XYZ LLC 2021 P&L.CSV")
We remark that the units attain constant values most of the time, with isolated outliers. Further, this graph shows that the amount for Unit 1223 is considerably lower than the amounts for all other units, which are all on a similar level.
Linear Regression
To start, we will use all variables with a standard deviation greater than 0 as independent variables, assuming that the constant variables do not make a significant contribution to the model.
By trying to predict several different variables, we notice that there are direct dependencies between them. Quantities like Total Rental Income, Net Income, or Total Commission can be calculated deterministically through a formula, without any need for prediction. Consider the example below, where we “predict” the value of Total Commission from a sum of several other parameters. The R-squared value is 1, which means we theoretically have a “perfect” model.
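One such dependency can be checked directly: the rows shown earlier satisfy Net Income = Total Rental Income - Total Expense, and the following sketch verifies this identity across all rows:
# Sanity check of the accounting identity; returns TRUE if it holds
# up to numeric tolerance
all.equal(data$NetIncome, data$TotalRentalIncome - data$TotalExpense)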
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Perform the linear regression
linear_regression <- lm(data_2022$TotalCommission ~ data_2022$Unit1115 +
                          data_2022$Unit1215 + data_2022$Unit1313 +
                          data_2022$Unit1314 + data_2022$Unit1315 +
                          data_2022$NetIncome + data_2022$TotalExpense +
                          data_2022$TotalRentalIncome)
print(summary(linear_regression))
##
## Call:
## lm(formula = data_2022$TotalCommission ~ data_2022$Unit1115 +
## data_2022$Unit1215 + data_2022$Unit1313 + data_2022$Unit1314 +
## data_2022$Unit1315 + data_2022$NetIncome + data_2022$TotalExpense +
## data_2022$TotalRentalIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.701e-12 0.000e+00 1.327e-13 6.318e-13 6.968e-13
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.819e+04 7.294e-11 3.864e+14 < 2e-16 ***
## data_2022$Unit1115 1.000e+00 1.190e-15 8.400e+14 < 2e-16 ***
## data_2022$Unit1215 1.373e-14 2.556e-15 5.373e+00 0.00171 **
## data_2022$Unit1313 1.000e+00 1.252e-15 7.990e+14 < 2e-16 ***
## data_2022$Unit1314 1.000e+00 1.192e-15 8.390e+14 < 2e-16 ***
## data_2022$Unit1315 NA NA NA NA
## data_2022$NetIncome -1.000e+00 1.189e-15 -8.409e+14 < 2e-16 ***
## data_2022$TotalExpense NA NA NA NA
## data_2022$TotalRentalIncome NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261e-12 on 6 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.169e+30 on 5 and 6 DF, p-value: < 2.2e-16
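The note “(3 not defined because of singularities)” means that some predictors are exact linear combinations of the others; alias() lists these dependencies explicitly:
# Report the exact linear dependencies among the predictors
alias(linear_regression)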
The relevant predictor variables seem to be Unit 1115, Unit 1215, Unit 1313, Unit 1314, and Net Income. Eliminating all other variables gives us:
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Perform the linear regression
linear_regression <- lm(data_2022$TotalCommission ~ data_2022$Unit1115 +
                          data_2022$Unit1215 + data_2022$Unit1313 +
                          data_2022$Unit1314 + data_2022$NetIncome)
print(summary(linear_regression))
##
## Call:
## lm(formula = data_2022$TotalCommission ~ data_2022$Unit1115 +
## data_2022$Unit1215 + data_2022$Unit1313 + data_2022$Unit1314 +
## data_2022$NetIncome)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.701e-12 0.000e+00 1.327e-13 6.318e-13 6.968e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.819e+04 7.294e-11 3.864e+14 < 2e-16 ***
## data_2022$Unit1115 1.000e+00 1.190e-15 8.400e+14 < 2e-16 ***
## data_2022$Unit1215 1.373e-14 2.556e-15 5.373e+00 0.00171 **
## data_2022$Unit1313 1.000e+00 1.252e-15 7.990e+14 < 2e-16 ***
## data_2022$Unit1314 1.000e+00 1.192e-15 8.390e+14 < 2e-16 ***
## data_2022$NetIncome -1.000e+00 1.189e-15 -8.409e+14 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261e-12 on 6 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.169e+30 on 5 and 6 DF, p-value: < 2.2e-16
This seems to be an exact model for the calculation of Total Commission; all p-values are below an assumed alpha of 0.05.
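As a quick confirmation that the fit is exact up to floating-point noise, we can inspect the largest absolute residual (on the order of 1e-12 here):
# Largest absolute residual of the commission model
max(abs(residuals(linear_regression)))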
Consider a more interesting use case: predicting the value of a unit, in this case Unit 1215. We could again determine its value deterministically, but this time we want to predict it with the aid of the other units only.
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Perform the linear regression
linear_regression <- lm(data_2022$Unit1215 ~ data_2022$Unit1115 +
                          data_2022$Unit1313 + data_2022$Unit1314 +
                          data_2022$Unit1315)
print(summary(linear_regression))
## Warning in summary.lm(linear_regression): essentially perfect fit: summary may
## be unreliable
##
## Call:
## lm(formula = data_2022$Unit1215 ~ data_2022$Unit1115 + data_2022$Unit1313 +
## data_2022$Unit1314 + data_2022$Unit1315)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.091e-14 -5.822e-15 -2.768e-15 2.768e-15 4.091e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.500e+01 3.103e-13 2.417e+14 <2e-16 ***
## data_2022$Unit1115 -3.408e-18 1.153e-18 -2.956e+00 0.0212 *
## data_2022$Unit1313 -3.682e-18 1.491e-18 -2.469e+00 0.0429 *
## data_2022$Unit1314 -3.563e-18 1.673e-18 -2.130e+00 0.0707 .
## data_2022$Unit1315 1.000e+00 1.954e-17 5.117e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.631e-14 on 7 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.201e+32 on 4 and 7 DF, p-value: < 2.2e-16
Even in this case, our model seems to be almost perfect. Assuming that our threshold is a p-value of 0.05, we would need to eliminate Unit 1314 next.
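Instead of refitting by hand after every elimination, drop1() reports the effect of removing each single term from the current model; an optional sketch:
# F-tests for dropping each individual term from the current model
drop1(linear_regression, test = "F")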
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Perform the linear regression
linear_regression <- lm(data_2022$Unit1215 ~ data_2022$Unit1115 + data_2022$Unit1313 + data_2022$Unit1315)
print(summary(linear_regression))
## Warning in summary.lm(linear_regression): essentially perfect fit: summary may
## be unreliable
##
## Call:
## lm(formula = data_2022$Unit1215 ~ data_2022$Unit1115 + data_2022$Unit1313 +
## data_2022$Unit1315)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.631e-14 -9.560e-16 -3.350e-16 4.190e-16 3.262e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.500e+01 1.672e-13 4.485e+14 <2e-16 ***
## data_2022$Unit1115 0.000e+00 6.031e-19 0.000e+00 1
## data_2022$Unit1313 0.000e+00 7.861e-19 0.000e+00 1
## data_2022$Unit1315 1.000e+00 1.041e-17 9.609e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.421e-14 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.292e+33 on 3 and 8 DF, p-value: < 2.2e-16
In the last step, the p-values for Unit 1115 and Unit 1313 have risen above the threshold and hence will be discarded, too.
# Select only values out of 2022
index_2022 <- which(year == 2022)
data_2022 <- data[index_2022,]
# Perform the linear regression
linear_regression <- lm(data_2022$Unit1215 ~ data_2022$Unit1315)
print(summary(linear_regression))
## Warning in summary.lm(linear_regression): essentially perfect fit: summary may
## be unreliable
##
## Call:
## lm(formula = data_2022$Unit1215 ~ data_2022$Unit1315)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75 0 Inf <2e-16 ***
## data_2022$Unit1315 1 0 Inf <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0 on 10 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: Inf on 1 and 10 DF, p-value: < 2.2e-16
We have reached our final linear regression model: Unit 1215 can be “approximated” as 75 + 1 * Unit 1315.
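Given the zero residuals above, this relationship is exact in the 2022 data; a direct check, which should return TRUE:
# Verify the recovered identity Unit 1215 = 75 + Unit 1315
all(data_2022$Unit1215 == data_2022$Unit1315 + 75)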
Time Series Analysis
Lastly, we create a time series plot of the different units.
# Plot all seven unit columns (2 through 8) over time
plot.ts(data_2022[, c(2:8)], main = "Time Series Analysis of Units in 2022")
Again, we can see that the unit values are mostly constant, with isolated outliers.
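For completeness, the same plot can also be built from a proper ts object with monthly frequency; a sketch assuming the twelve 2022 rows are consecutive months starting in January, with units_ts as a hypothetical name:
# Monthly multivariate time series of the seven unit columns
units_ts <- ts(data_2022[, c(2:8)], start = c(2022, 1), frequency = 12)
plot(units_ts, main = "Time Series Analysis of Units in 2022")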