Creating work directory, Calling libraries, Reading the file and importing it into a data frame
#Creating work directory
setwd("C:/Users/keiva/Dropbox (Personal)/GW/06- Fall 2017/01- Programming in business analytics/03- Week 03 ( 14 Sep 2017)/Assignment")
#Calling libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.1
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.4.1
## Warning: package 'tibble' was built under R version 3.4.1
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'readr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Warning: package 'dplyr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(dplyr)
#Reading the file and importing it into a data frame
mydf <- read_csv("households.csv")
## Parsed with column specification:
## cols(
## Household = col_integer(),
## `Family Size` = col_integer(),
## Location = col_integer(),
## Ownership = col_integer(),
## `First Income` = col_integer(),
## `Second Income` = col_integer(),
## `Monthly Payment` = col_integer(),
## Utilities = col_integer(),
## Debt = col_integer()
## )
A.a – Indicating the type of data for each of the variables included in the survey.
names(mydf) <- c("Household", "Family Size", "Location", "Ownership",
                 "First Income", "Second Income", "Monthly Payment",
                 "Utilities", "Debt")
type_of_data <- lapply(mydf, class)
print("the types of data in the survey are:")
## [1] "the types of data in the survey are:"
print(type_of_data)
## $Household
## [1] "integer"
##
## $`Family Size`
## [1] "integer"
##
## $Location
## [1] "integer"
##
## $Ownership
## [1] "integer"
##
## $`First Income`
## [1] "integer"
##
## $`Second Income`
## [1] "integer"
##
## $`Monthly Payment`
## [1] "integer"
##
## $Utilities
## [1] "integer"
##
## $Debt
## [1] "integer"
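All nine variables are stored as integers in R. The storage class alone does not show whether a variable is categorical or numerical.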
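Because `class()` reports only how R stores a column, coded categorical variables can be marked explicitly as factors. The sketch below is a minimal illustration on a small made-up data frame (`toy` and its values are hypothetical, not taken from households.csv). It assumes that Location and Ownership are category codes.

```r
# Hypothetical toy data; the values are illustrative, not from households.csv
toy <- data.frame(
  Household = 1:4,
  Location  = c(1L, 3L, 2L, 4L),
  Ownership = c(0L, 1L, 1L, 0L),
  Debt      = c(227L, 4100L, 5300L, 9104L)
)

# Storage class of each column: every column is "integer"
storage <- sapply(toy, class)

# Mark the coded columns as categorical so summaries treat them as labels
toy$Location  <- factor(toy$Location)
toy$Ownership <- factor(toy$Ownership, levels = c(0, 1))

# Classes after conversion: the two coded columns are now "factor"
measurement <- sapply(toy, class)
print(storage)
print(measurement)
```

After the conversion, functions such as `summary()` and `table()` report counts per level for the factor columns instead of numeric summaries.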
A.b – For each of the categorical variables in the survey, indicate whether the variable is nominal or ordinal. Explain your reasoning in each case.
values_Household <- table(mydf$Household)
head(values_Household)
##
## 1 2 3 4 5 6
## 1 1 1 1 1 1
Household is nominal: it is an identifier that takes a different value for each of the 500 records, so it only labels the households and carries no quantity or rank.
values_Family_Size <- table(mydf$`Family Size`)
head(values_Family_Size)
##
## 1 2 3 4 5 6
## 90 114 133 85 52 15
Family Size is numerical rather than categorical: it is a count with 10 distinct values, so the nominal/ordinal distinction does not apply.
values_Location <- table(mydf$Location)
head(values_Location)
##
## 1 2 3 4
## 129 123 120 128
Location is nominal (categorical) because it takes only 4 distinct values, which serve as labels rather than quantities.
values_Ownership <- table(mydf$Ownership)
head(values_Ownership)
##
## 0 1
## 218 282
Ownership is nominal (categorical) because it takes only 2 distinct values, 0 and 1, which act as labels.
values_First_Income <- table(mydf$`First Income`)
head(values_First_Income)
##
## 16252 16971 17881 18119 18276 18706
## 1 1 1 1 1 1
First Income is numerical rather than categorical: it can be any positive integer, so the nominal/ordinal distinction does not apply.
values_Second_Income <- table(mydf$`Second Income`)
head(values_Second_Income)
##
## 9549 10509 10680 11539 11802 11948
## 1 1 1 1 1 1
Second Income is numerical rather than categorical: it can be NA or any positive integer.
values_Monthly_Payment <- table(mydf$`Monthly Payment`)
head(values_Monthly_Payment)
##
## 334 349 355 369 383 389
## 1 1 1 1 1 1
Monthly Payment is numerical rather than categorical: it can be any positive integer.
values_Utilities <- table(mydf$Utilities)
head(values_Utilities)
##
## 190 191 192 194 195 196
## 1 3 5 3 2 4
Utilities is numerical rather than categorical: each record holds an integer amount between 190 and 278.
values_Debt <- table(mydf$Debt)
head(values_Debt)
##
## 227 555 818 854 879 983
## 1 1 1 1 1 1
Debt is numerical rather than categorical: it can be any positive integer.
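One rough way to screen variables is to count the distinct values in each column. A column with as many distinct values as rows behaves like an identifier or a measurement, while a column with only a few repeated codes is a candidate categorical variable. The following is a minimal sketch on hypothetical data (`toy` is made up for illustration). The count alone does not decide the measurement scale, which still depends on what the codes mean.

```r
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  Household = 1:6,
  Location  = c(1L, 2L, 3L, 4L, 1L, 2L),
  Ownership = c(0L, 1L, 1L, 0L, 1L, 1L),
  Debt      = c(227L, 555L, 818L, 854L, 879L, 983L)
)

# Number of distinct values in each column
n_distinct_values <- sapply(toy, function(x) length(unique(x)))
print(n_distinct_values)

# Columns whose values repeat across rows: candidates for categorical variables
few_levels <- names(n_distinct_values)[n_distinct_values < nrow(toy)]
print(few_levels)
```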
A.c – Create a histogram for each of the numerical variables in this data set. Indicate whether each of these distributions is approximately symmetric or skewed. Which, if any, of these distributions are skewed to the right? Which, if any, are skewed to the left?
par(mfrow=c(2,3))
hist(mydf$`Family Size`,main = "Family Size",col="blue")
hist(mydf$`First Income`,main = "First Income",col="orange")
hist(mydf$`Second Income`,main = "Second Income",col="purple")
hist(mydf$`Monthly Payment`,main = "Monthly Payment",col="blue")
hist(mydf$Utilities,main = "Utilities", col="red")
hist(mydf$Debt, main = "Debt", col="green")
Family size: skewed to the right
First income: skewed to the right
Second income: skewed to the right
Monthly payment: skewed to the right
Utilities: approximately symmetric
Debt: approximately symmetric
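A visual reading of skew can be cross-checked numerically. The sketch below computes a moment-based skewness in base R. `sample_skewness` is a helper defined here for illustration (it is not a base R function), and the two vectors are made-up examples. Positive values indicate right skew, negative values left skew, and values near zero approximate symmetry.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
sample_skewness <- function(x) {
  x <- x[!is.na(x)]   # drop missing values, e.g. absent second incomes
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up vectors for illustration
right_skewed <- c(1, 1, 2, 2, 3, 10)
symmetric    <- c(1, 2, 3, 4, 5)

print(sample_skewness(right_skewed))  # positive: long right tail
print(sample_skewness(symmetric))     # zero: perfectly symmetric
```

Applied to the survey, the same helper could be called on each numerical column, for example `sample_skewness(mydf$Debt)`.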
A.d – Find the maximum and minimum debt levels for the households in this sample
debt_quantiles <- quantile(mydf$Debt)
print(debt_quantiles)
## 0% 25% 50% 75% 100%
## 227.0 2948.5 4267.5 5675.5 9104.0
The minimum debt level is 227 (the 0% quantile) and the maximum is 9,104 (the 100% quantile).
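The extremes can also be requested directly rather than read off the quantile table. The following is a minimal sketch using base R's `min()`, `max()`, and `range()` on a made-up vector (`debt` here is illustrative, not the survey column):

```r
debt <- c(227, 2948, 4267, 5675, 9104)  # illustrative values only

debt_min <- min(debt)
debt_max <- max(debt)
debt_range <- range(debt)   # c(minimum, maximum)

# The 0% and 100% entries of quantile() are the same two numbers
q <- quantile(debt)

print(debt_range)
print(q[c("0%", "100%")])
```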
A.f – Find and interpret the interquartile range for the indebtedness levels of these selected households
inter_quartile <- IQR(mydf$Debt)
sprintf("IQR=%d",inter_quartile)
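## [1] "IQR=2727"
The interquartile range is 2,727: the middle 50% of households have debt between 2,948.5 (first quartile) and 5,675.5 (third quartile).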
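A common rule of thumb (Tukey's fences, which the assignment does not prescribe) flags values more than 1.5 × IQR beyond the quartiles as potential outliers. The sketch below applies it to the quartiles and extremes printed in A.d. The variable names are chosen here for illustration.

```r
# Quartiles and extremes copied from the quantile() output in A.d
q1 <- 2948.5
q3 <- 5675.5
debt_min <- 227
debt_max <- 9104

iqr <- q3 - q1                  # same value that IQR(mydf$Debt) reports
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# TRUE if the smallest or largest observation falls outside the fences
has_outliers <- debt_min < lower_fence || debt_max > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(has_outliers)
```

With these inputs both extremes lie inside the fences, so the rule flags no outliers.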
A.g – Write a report that is less than 250 words that summarizes your analysis:
The quartiles of household debt are spread fairly evenly around the median: the median (4,267.5) lies roughly midway between the first quartile (2,948.5) and the third quartile (5,675.5). The distribution of debt is therefore approximately symmetric, which agrees with its histogram. Both the minimum (227) and the maximum (9,104) lie within 1.5 × IQR of the quartiles, so there are no notable outliers in indebtedness. Family size is skewed to the right, which means that most families are small. First income is skewed to the right with a mean of around 40,000: above 40,000, the number of families decreases as first income increases. Second income follows the same pattern, with a mean of about 20,000; it also contains some NA values, which shows that some families have no second income. Most families have a monthly payment of around 500, and the number of families decreases as the payment increases. Most households have debt of around 4,000, and most spend about 220 or 250 on utilities.