library(tidyverse)
## ── Attaching packages ───────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.3.4 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr, warn.conflicts = FALSE)
dths <- read_csv("households.csv")
## Parsed with column specification:
## cols(
## Household = col_integer(),
## `Family Size` = col_integer(),
## Location = col_integer(),
## Ownership = col_integer(),
## `First Income` = col_integer(),
## `Second Income` = col_integer(),
## `Monthly Payment` = col_integer(),
## Utilities = col_integer(),
## Debt = col_integer()
## )
sort(names(dths))
## [1] "Debt" "Family Size" "First Income" "Household"
## [5] "Location" "Monthly Payment" "Ownership" "Second Income"
## [9] "Utilities"
In general, variables (and data) either represent measurements on some continuous scale, or they represent information about some categorical or discrete characteristics.
Continuous: “Family Size”, “First Income”, “Second Income”, “Monthly Payment”, “Utilities”, “Debt”
Categorical: “Household”, “Location”, “Ownership”
Nominal variables have levels that are labels or descriptions and cannot be ordered. Ordinal variables can be ordered, or ranked in a logical order, but the intervals between their levels are not necessarily known.
Nominal: “Household”, “Location”, “Ownership”
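The column specification above shows that the nominal variables were read as integers (`col_integer()`), so R will treat them as numbers unless told otherwise. A minimal sketch of converting them to factors follows. The `toy` data frame and its values are made up for illustration and stand in for `households.csv`; what the integer codes mean is not stated in the source.

```r
# Hypothetical toy rows standing in for households.csv (values are made up)
toy <- data.frame(
  Household = 1:4,
  Location  = c(2L, 5L, 1L, 2L),
  Ownership = c(1L, 0L, 1L, 1L),
  Debt      = c(1200, 3400, 560, 4100)
)

# Store the nominal codes as unordered factors so they are not treated as numbers
nominal_cols <- c("Household", "Location", "Ownership")
toy[nominal_cols] <- lapply(toy[nominal_cols], factor)

# Separate the columns that are still numeric from the factor columns
is_num <- vapply(toy, is.numeric, logical(1))
numeric_vars <- names(toy)[is_num]
nominal_vars <- names(toy)[!is_num]
```

After the conversion, summaries such as `table(toy$Location)` count households per level instead of averaging the codes.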
hist(dths$`Family Size`,xlab = "Family Size", ylab = "Frequency", main = "Histogram of Family Size")
hist(dths$`First Income`,xlab = "First Income", ylab = "Frequency", main = "Histogram of First Income")
hist(dths$`Second Income`,xlab = "Second Income", ylab = "Frequency", main = "Histogram of Second Income")
hist(dths$`Monthly Payment`,xlab = "Monthly Payment", ylab = "Frequency", main = "Histogram of Monthly Payment")
hist(dths$Utilities,xlab = "Utilities", ylab = "Frequency", main = "Histogram of Utilities")
hist(dths$Debt,xlab = "Debt", ylab = "Frequency", main = "Histogram of Debt")
The histogram of Debt spans roughly 0 to 10,000 on the horizontal axis. Households with a debt between 3,000 and 4,000 are the most common. Debt is a continuous variable, and its distribution is roughly symmetric.
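The six `hist()` calls differ only in the column name, which makes it easy for axis labels to drift out of sync with the data. One possible way to avoid that is a small loop that builds the labels from the column name itself. `plot_histograms` is a hypothetical helper, not part of the original analysis, and the `toy` data frame contains made-up values so the sketch runs on its own.

```r
# Hypothetical helper: draws one histogram per named column of `df`,
# deriving the axis label and title from the column name
plot_histograms <- function(df, cols) {
  for (col in cols) {
    hist(df[[col]],
         xlab = col,
         ylab = "Frequency",
         main = paste("Histogram of", col))
  }
  invisible(cols)
}

# Made-up toy data; check.names = FALSE keeps the space in "Family Size"
toy <- data.frame(`Family Size` = c(2, 3, 4, 2, 5),
                  Debt = c(1200, 3400, 560, 4100, 2800),
                  check.names = FALSE)

drawn <- plot_histograms(toy, c("Family Size", "Debt"))
```

With the real data, the same call would take `dths` and the six continuous column names.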
max(dths$Debt)
## [1] 9104
min(dths$Debt)
## [1] 227
quantile(dths$Debt, c(0.25,0.5,0.75))
## 25% 50% 75%
## 2948.5 4267.5 5675.5
IQR(dths$Debt)
## [1] 2727
The interquartile range equals the difference between the 75th and 25th percentiles: 5675.5 - 2948.5 = 2727.
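As a quick check of that definition, the sketch below compares `IQR()` with the difference of the two quartiles returned by `quantile()`. The vector `x` holds made-up values, not the household data; both functions use their default quantile type here.

```r
# Made-up values for illustration only
x <- c(300, 1500, 2900, 4300, 5700, 7200, 9000)

# 25th and 75th percentiles
q <- quantile(x, c(0.25, 0.75))

# Difference of the quartiles, with the percentile names dropped
iqr_manual <- unname(q[2] - q[1])

# TRUE when the manual difference agrees with IQR()
agrees <- isTRUE(all.equal(iqr_manual, IQR(x)))
agrees
```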