1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
Questions to ask: 1. What is the median income in Chile? 2. Does income differ by age, i.e. do older people earn more? 3. Is there a gap between the median and the highest income in the population?
Chile <- read.csv(file="/data/Chile.csv", header=TRUE, sep=",")
str(Chile)
## 'data.frame': 2700 obs. of 9 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "N" "N" "N" "N" ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : chr "M" "M" "F" "F" ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : chr "P" "PS" "P" "P" ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : chr "Y" "N" "Y" "N" ...
summary(Chile)
## X region population sex
## Min. : 1.0 Length:2700 Min. : 3750 Length:2700
## 1st Qu.: 675.8 Class :character 1st Qu.: 25000 Class :character
## Median :1350.5 Mode :character Median :175000 Mode :character
## Mean :1350.5 Mean :152222
## 3rd Qu.:2025.2 3rd Qu.:250000
## Max. :2700.0 Max. :250000
##
## age education income statusquo
## Min. :18.00 Length:2700 Min. : 2500 Min. :-1.80301
## 1st Qu.:26.00 Class :character 1st Qu.: 7500 1st Qu.:-1.00223
## Median :36.00 Mode :character Median : 15000 Median :-0.04558
## Mean :38.55 Mean : 33876 Mean : 0.00000
## 3rd Qu.:49.00 3rd Qu.: 35000 3rd Qu.: 0.96857
## Max. :70.00 Max. :200000 Max. : 2.04859
## NA's :1 NA's :98 NA's :17
## vote
## Length:2700
## Class :character
## Mode :character
##
##
##
##
mean(Chile$income, na.rm=TRUE)
## [1] 33875.86
median(Chile$income,na.rm=TRUE)
## [1] 15000
table(Chile$region)
##
## C M N S SA
## 600 100 322 718 960
table(Chile$vote)
##
## A N U Y
## 187 889 588 868
table(Chile$region,Chile$income,Chile$sex)
## , , = F
##
##
## 2500 7500 15000 35000 75000 125000 200000
## C 23 70 87 71 23 4 6
## M 5 10 14 15 4 1 0
## N 9 31 51 50 14 6 1
## S 44 100 92 80 24 6 7
## SA 19 76 144 145 43 28 26
##
## , , = M
##
##
## 2500 7500 15000 35000 75000 125000 200000
## C 18 51 86 82 42 8 8
## M 0 7 19 13 4 1 0
## N 2 24 49 52 21 3 1
## S 30 74 104 85 39 8 4
## SA 10 51 122 154 55 23 23
From the above analysis we can conclude that region 'SA' has the most respondents (960) and region 'M' the fewest (100). The mean income is about 33,876 and the median income is 15,000, so the distribution is pulled upward by a small number of high earners. There are 40 females with an income of 200,000 compared with 36 males at the same income. In the vote column, the counts for Y (868) and N (889) are nearly equal; these codes appear to record the intended vote (yes or no) rather than whether a person voted, alongside A (187) and U (588).
2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).
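To put a number on the third question (the gap between the median and the highest income), a small helper can report both values together with their difference and ratio. This is only a sketch: `income_gap` is a hypothetical helper name, and the `income` vector below is made-up toy data standing in for `Chile$income`.

```r
# Toy incomes (hypothetical values, including one missing entry)
income <- c(2500, 7500, 15000, 15000, 35000, 200000, NA)

# Hypothetical helper: median, maximum, and the gap between them
income_gap <- function(x) {
  med <- median(x, na.rm = TRUE)
  top <- max(x, na.rm = TRUE)
  c(median = med, max = top, gap = top - med, ratio = top / med)
}

gap <- income_gap(income)
print(gap)
```

Applied to the values reported by `summary(Chile)` above (median 15,000 and maximum 200,000), the same arithmetic gives a gap of 185,000.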
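# Drop rows with a missing value in any column
Chile <- Chile[complete.cases(Chile), ]
# Keep population, income and age (all regions, despite the "region_SA" name)
region_SA <- Chile[, c("population", "income", "age")]
# Note: the minimum income is 2500, so this filter keeps every row
region_SA1 <- region_SA[region_SA$income >= 1500, ]
colnames(region_SA1) <- c("Population", "Income", "AGE")
region_SA2 <- Chile[, c("income", "age")]
colnames(region_SA2) <- c("INCOME", "AGE")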
summary(region_SA2)
## INCOME AGE
## Min. : 2500 Min. :18.00
## 1st Qu.: 7500 1st Qu.:25.00
## Median : 15000 Median :36.00
## Mean : 34020 Mean :38.29
## 3rd Qu.: 35000 3rd Qu.:49.00
## Max. :200000 Max. :70.00
mean(region_SA2$INCOME)
## [1] 34019.95
median(region_SA2$INCOME)
## [1] 15000
n.sub <- subset(Chile, age > 40 & income > 50000 & population == 175000)
n.sub
## X region population sex age education income statusquo vote
## 14 14 N 175000 F 46 S 75000 1.50684 Y
## 19 19 N 175000 M 67 P 75000 1.32279 Y
## 613 613 C 175000 F 64 S 75000 -1.27876 N
## 615 615 C 175000 M 48 S 75000 0.73356 U
## 775 775 C 175000 M 49 PS 200000 -1.21834 N
## 809 809 C 175000 F 41 PS 75000 1.41116 Y
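The data frames named `region_SA...` above hold rows from every region. If a subset restricted to the 'SA' region is wanted, one possible approach is sketched below. The `toy` data frame and the derived `IncomeThousands` column are assumptions made for illustration and are not part of the original analysis.

```r
# Toy data with the same column names as the Chile data (hypothetical values)
toy <- data.frame(
  region     = c("SA", "SA", "N", "S", "SA"),
  population = c(250000, 250000, 175000, 25000, 250000),
  income     = c(15000, 35000, 7500, NA, 200000),
  age        = c(25, 40, 31, 58, 63)
)

# Keep only region "SA" rows with a known income, and only three columns
sa <- subset(toy, region == "SA" & !is.na(income),
             select = c(population, income, age))

# Rename the columns and derive a new one (income in thousands)
names(sa) <- c("Population", "Income", "Age")
sa$IncomeThousands <- sa$Income / 1000

print(sa)
```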
plot(region_SA2$INCOME, region_SA2$AGE, xlab = "Income", ylab = "Age", main = "Income vs. Age", col = "blue")
plot(region_SA1$Income, region_SA1$AGE, xlab = "Income", ylab = "Age", main = "Income vs. Age", col = "red")
# Highlight incomes above 5000; the threshold must be numeric, not the string '5000'
high <- region_SA1$Income > 5000
points(region_SA1$Income[high], region_SA1$AGE[high], pch = 10, col = "green")
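When filtering on a numeric column, the threshold should itself be numeric. If a number is compared with a quoted value, R converts the number to text and compares the two alphabetically, which can silently select the wrong rows. A small sketch with made-up income values shows the difference:

```r
# Hypothetical income values for illustration
income <- c(2500, 7500, 15000, 35000)

# Quoted threshold: numbers are coerced to text, so "15000" sorts before "5000"
as_text <- income > "5000"

# Numeric threshold: the comparison behaves as intended
as_number <- income > 5000

print(data.frame(income, as_text, as_number))
```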
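# Use the density scale (freq = FALSE) so the normal curve shares the histogram's y-axis
histp <- hist(Chile$income, freq = FALSE, xlab = "Income", ylab = "Density", main = "Distribution of income", col = "yellow")
curve(dnorm(x, mean = mean(Chile$income), sd = sd(Chile$income)), add = TRUE, col = "green", lwd = 2)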
boxplot(income ~ age, data = Chile, main = toupper("Income vs. Age"), font.main = 4, cex.main = 1.5, xlab = "Age", ylab = "Income", font.lab = 4, col = "orangered")
library(ggplot2)
ggplot(Chile, aes(x = income, y = age)) +
  geom_jitter(size = 1, color = "green")
ggplot(Chile, aes(x = income, y = population)) +
  geom_jitter(size = 1, color = "red")
From the above analysis we can see that incomes below 50,000 are the most common at every age from 18 upward. An income of 200,000 appears mostly among respondents up to about 60 years old, and far fewer people report it. There is also a wide gap between 125,000 and 200,000: no respondent reports an income between these two values. Note that income takes only seven distinct values in this data set (2,500 to 200,000), so the plots show bands rather than a continuous spread.
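The scatter plots make it hard to judge whether income rises with age. One way to check this more directly is to group ages into bands and compare the median income per band. The sketch below uses made-up toy data, and the age-band boundaries are an arbitrary assumption rather than something taken from the data set.

```r
# Toy data (hypothetical values) with the same column names as the Chile data
toy <- data.frame(
  age    = c(19, 24, 33, 38, 45, 52, 61, 68),
  income = c(7500, 15000, 15000, 35000, 35000, 75000, 15000, 7500)
)

# Assumed age bands: 18-29, 30-44, 45-59, 60-70
toy$age_group <- cut(toy$age,
                     breaks = c(17, 29, 44, 59, 70),
                     labels = c("18-29", "30-44", "45-59", "60-70"))

# Median income within each age band
med_by_group <- tapply(toy$income, toy$age_group, median, na.rm = TRUE)
print(med_by_group)
```

The same two steps (`cut` followed by `tapply`) could be run on `Chile$age` and `Chile$income` after the missing values have been removed.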