1.Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

Questions to ask: 1. What is the median income in Chile? 2. Does income differ with age, i.e. do older people earn more? 3. How large is the gap between the median and the highest income in the population?

  Chile <- read.csv(file="/data/Chile.csv", header=TRUE, sep=",")
   str(Chile)
## 'data.frame':    2700 obs. of  9 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region    : chr  "N" "N" "N" "N" ...
##  $ population: int  175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex       : chr  "M" "M" "F" "F" ...
##  $ age       : int  65 29 38 49 23 28 26 24 41 41 ...
##  $ education : chr  "P" "PS" "P" "P" ...
##  $ income    : int  35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ statusquo : num  1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote      : chr  "Y" "N" "Y" "N" ...
  summary(Chile)
##        X             region            population         sex           
##  Min.   :   1.0   Length:2700        Min.   :  3750   Length:2700       
##  1st Qu.: 675.8   Class :character   1st Qu.: 25000   Class :character  
##  Median :1350.5   Mode  :character   Median :175000   Mode  :character  
##  Mean   :1350.5                      Mean   :152222                     
##  3rd Qu.:2025.2                      3rd Qu.:250000                     
##  Max.   :2700.0                      Max.   :250000                     
##                                                                         
##       age         education             income         statusquo       
##  Min.   :18.00   Length:2700        Min.   :  2500   Min.   :-1.80301  
##  1st Qu.:26.00   Class :character   1st Qu.:  7500   1st Qu.:-1.00223  
##  Median :36.00   Mode  :character   Median : 15000   Median :-0.04558  
##  Mean   :38.55                      Mean   : 33876   Mean   : 0.00000  
##  3rd Qu.:49.00                      3rd Qu.: 35000   3rd Qu.: 0.96857  
##  Max.   :70.00                      Max.   :200000   Max.   : 2.04859  
##  NA's   :1                          NA's   :98       NA's   :17        
##      vote          
##  Length:2700       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
mean(Chile$income, na.rm=TRUE)
## [1] 33875.86
median(Chile$income,na.rm=TRUE)
## [1] 15000
table(Chile$region)
## 
##   C   M   N   S  SA 
## 600 100 322 718 960
table(Chile$vote)
## 
##   A   N   U   Y 
## 187 889 588 868
table(Chile$region,Chile$income,Chile$sex)
## , ,  = F
## 
##     
##      2500 7500 15000 35000 75000 125000 200000
##   C    23   70    87    71    23      4      6
##   M     5   10    14    15     4      1      0
##   N     9   31    51    50    14      6      1
##   S    44  100    92    80    24      6      7
##   SA   19   76   144   145    43     28     26
## 
## , ,  = M
## 
##     
##      2500 7500 15000 35000 75000 125000 200000
##   C    18   51    86    82    42      8      8
##   M     0    7    19    13     4      1      0
##   N     2   24    49    52    21      3      1
##   S    30   74   104    85    39      8      4
##   SA   10   51   122   154    55     23     23

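From the above analysis we can conclude that region 'SA' has the most respondents (960) and region 'M' the fewest (100). The mean income is 33,876 and the median income is 15,000, so the income distribution is skewed to the right. There are 40 females with an income of 200,000 compared to 36 males with the same income. The vote column records voting intention: there is only a slight difference between respondents who intended to vote yes (868) and those who intended to vote no (889), while the rest planned to abstain (187) or were undecided (588).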
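The figures in this conclusion can be rechecked without the CSV file. The sketch below retypes the counts from the `table(Chile$region, Chile$income, Chile$sex)` output above into two matrices; the object names (`female`, `male`, `by_sex`, `income`) are illustrative and not part of the original analysis.

```r
# Counts copied from the region x income x sex table printed above.
income_levels <- c("2500", "7500", "15000", "35000", "75000", "125000", "200000")

female <- rbind(
  C  = c(23,  70,  87,  71, 23,  4,  6),
  M  = c( 5,  10,  14,  15,  4,  1,  0),
  N  = c( 9,  31,  51,  50, 14,  6,  1),
  S  = c(44, 100,  92,  80, 24,  6,  7),
  SA = c(19,  76, 144, 145, 43, 28, 26)
)
male <- rbind(
  C  = c(18, 51,  86,  82, 42,  8,  8),
  M  = c( 0,  7,  19,  13,  4,  1,  0),
  N  = c( 2, 24,  49,  52, 21,  3,  1),
  S  = c(30, 74, 104,  85, 39,  8,  4),
  SA = c(10, 51, 122, 154, 55, 23, 23)
)
colnames(female) <- income_levels
colnames(male)   <- income_levels

# Collapse the regions to get one row of counts per sex.
by_sex <- rbind(F = colSums(female), M = colSums(male))
by_sex[, "200000"]  # respondents at the top income value, by sex

# Expand the counts back into one income value per respondent
# (rows with a missing income are not in the table, so they are excluded).
income <- rep(as.numeric(income_levels), times = colSums(by_sex))
mean_income   <- mean(income)
median_income <- median(income)
```

Because the table only counts respondents with a recorded income, `mean_income` and `median_income` should agree with the `na.rm=TRUE` results printed earlier.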

2.Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

Chile <- Chile[complete.cases(Chile),]

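# Keep only population, income and age (all regions, despite the object name)
region_SA <- data.frame(Chile$population, Chile$income, Chile$age)

# The smallest recorded income is 2500, so this filter keeps every row
region_SA1 <- region_SA[Chile$income >= 1500, ]
colnames(region_SA1) <- c("Population", "Income", "AGE")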

region_SA2 <- data.frame(Chile$income, Chile$age)


colnames(region_SA2)<-c("INCOME","AGE")

summary(region_SA2)
##      INCOME            AGE       
##  Min.   :  2500   Min.   :18.00  
##  1st Qu.:  7500   1st Qu.:25.00  
##  Median : 15000   Median :36.00  
##  Mean   : 34020   Mean   :38.29  
##  3rd Qu.: 35000   3rd Qu.:49.00  
##  Max.   :200000   Max.   :70.00
mean(region_SA2$INCOME)
## [1] 34019.95
median(region_SA2$INCOME)
## [1] 15000
n.sub <- subset(Chile, age > 40 & income > 50000 & population == 175000)
n.sub
##       X region population sex age education income statusquo vote
## 14   14      N     175000   F  46         S  75000   1.50684    Y
## 19   19      N     175000   M  67         P  75000   1.32279    Y
## 613 613      C     175000   F  64         S  75000  -1.27876    N
## 615 615      C     175000   M  48         S  75000   0.73356    U
## 775 775      C     175000   M  49        PS 200000  -1.21834    N
## 809 809      C     175000   F  41        PS  75000   1.41116    Y
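One derived column that would bear directly on the age question is an age group. The sketch below is only an illustration: it uses the first ten `age` and `income` values shown in the `str(Chile)` output, and the data frame `toy`, the column `age_group`, and the cut points are all assumptions chosen for the example.

```r
# First ten age/income values as printed by str(Chile) above.
toy <- data.frame(
  age    = c(65, 29, 38, 49, 23, 28, 26, 24, 41, 41),
  income = c(35000, 7500, 15000, 35000, 35000, 7500, 35000, 15000, 15000, 15000)
)

# Derive an age-group column; the break points are arbitrary illustrative choices.
toy$age_group <- cut(
  toy$age,
  breaks = c(17, 29, 44, 70),
  labels = c("18-29", "30-44", "45-70")
)

# Median income within each age group.
by_age <- aggregate(income ~ age_group, data = toy, FUN = median)
by_age
```

Applied to the full `Chile` data frame, the same two steps would give a median income per age group. Ten rows are far too few to support a conclusion on their own.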
3.Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
plot(region_SA2$INCOME, region_SA2$AGE, xlab='Income', ylab='Age', main='Income vs. Age', col='blue')

plot(region_SA1$Income, region_SA1$AGE, xlab='Income', ylab='Age', main='Income vs. Age', col='red')

# Highlight incomes above 5000; compare against a number, not the string '5000'
points(region_SA1$Income[region_SA1$Income > 5000], region_SA1$AGE[region_SA1$Income > 5000], pch=10, col='green')

# Draw the histogram on the density scale so the normal curve added next is comparable
histp <- hist(Chile$income, freq=FALSE, xlab="Income", ylab="Density", main="Distribution of Income", col="yellow")

curve(dnorm(x, mean=mean(Chile$income), sd=sd(Chile$income)), add=TRUE, col="green", lwd=2)
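A normal density curve integrates to 1, so it only lines up with histogram bars that are also on the density scale. The sketch below illustrates this with incomes rebuilt from the counts in the exploration table (summed over region and sex, so they predate the `complete.cases` step); `counts`, `bar_area` and `expected_counts` are illustrative names.

```r
# Respondents per income value, summed from the table in the exploration step.
income_levels <- c(2500, 7500, 15000, 35000, 75000, 125000, 200000)
counts        <- c(160, 494, 768, 747, 269, 88, 76)
income        <- rep(income_levels, times = counts)

# Compute the histogram without drawing it.
h <- hist(income, plot = FALSE)

# On the density scale the bars have a total area of 1, like dnorm().
bar_area <- sum(h$density * diff(h$breaks))

# To compare against a frequency histogram instead, rescale the normal
# density by the number of observations and the bin width.
expected_counts <- dnorm(h$mids, mean = mean(income), sd = sd(income)) *
  length(income) * diff(h$breaks)
```

With `freq = FALSE` the curve can be added as is; with `freq = TRUE` it would need the rescaling shown in `expected_counts`.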

boxplot(income ~ age, data=Chile, main=toupper("Income vs. Age"), font.main=4, cex.main=1.5, xlab="Age", ylab="Income", font.lab=4, col="orangered")

4.Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
library(ggplot2)
ggplot(Chile, aes(x = income, y = age)) +
geom_jitter(size = 1, color = "green")

ggplot(Chile, aes(x =income, y = population)) +
geom_jitter(size = 1, color = "red")


From the above analysis we can see that in Chile incomes below 50,000 are the most common among people of every age from 18 upward. An income of 200,000 occurs mostly among people up to about 60 years old, and in a much smaller part of the population. There also appears to be a wide gap between 125,000 and 200,000 with no incomes in between. Since the data set contains only seven distinct income values, this gap likely reflects how income was recorded rather than a true absence of such incomes.
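The statements about common incomes and the gap to the top can be put into numbers. The sketch below is a rough check that uses the per-value counts summed from the exploration table, before incomplete rows were dropped; `share_below_50k` and `gap_ratio` are illustrative names.

```r
# Respondents per income value, summed over region and sex from the table above.
income_levels <- c(2500, 7500, 15000, 35000, 75000, 125000, 200000)
counts        <- c(160, 494, 768, 747, 269, 88, 76)

# Share of respondents whose recorded income is below 50,000.
share_below_50k <- sum(counts[income_levels < 50000]) / sum(counts)

# How many times larger the highest income is than the median income.
income    <- rep(income_levels, times = counts)
gap_ratio <- max(income) / median(income)

round(100 * share_below_50k, 1)  # percentage below 50,000
round(gap_ratio, 1)              # highest income relative to the median
```

On these counts, roughly five in six respondents report an income below 50,000, and the highest recorded income is about thirteen times the median.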