1 Read the SPSS file “inc_real.sav” (contains real income data) into R and then check the data by doing the following:

  • Print the first 6 or 10 rows of the data frame
  • Use dim() to determine how many rows and columns the data frame has
  • Get the variable names (use names())
  • Determine the type of variables (numerical, factor, …), i.e., the structure of the data frame (str()).
library(foreign)
inc_real <- read.spss('/Users/johnhope/Desktop/DS3003/Data/inc_real.sav')

inc_real <- as.data.frame(inc_real) #converting to a data frame

head(inc_real) #printing first 6 rows
##   age    sex whours                             educat income   hwage edu
## 1  24   male     40 non-tertiary post-secondary degree  18000 112.500  15
## 2  43 female     40               academic high school  14500  90.625  12
## 3  27   male     40                     apprenticeship  18000 112.500  10
## 4  37   male     40                  compulsory school  15700  98.125   9
## 5  50   male     42               academic high school  38000 237.500  12
## 6  50   male     39                     apprenticeship  22000 137.500  10
##   potexp
## 1      3
## 2     25
## 3     11
## 4     22
## 5     32
## 6     34

The output above shows the first 6 rows of the data frame.

dim(inc_real)
## [1] 1271    8

The data frame has 1271 rows and 8 columns.

names(inc_real)
## [1] "age"    "sex"    "whours" "educat" "income" "hwage"  "edu"    "potexp"
str(inc_real)
## 'data.frame':    1271 obs. of  8 variables:
##  $ age   : num  24 43 27 37 50 50 30 60 45 26 ...
##  $ sex   : Factor w/ 2 levels "male","female": 1 2 1 1 1 1 2 1 1 1 ...
##  $ whours: num  40 40 40 40 42 39 40 39 40 39 ...
##  $ educat: Factor w/ 9 levels "no degree","compulsory school",..: 8 5 3 2 5 3 7 3 3 3 ...
##  $ income: num  18000 14500 18000 15700 38000 22000 5200 12000 15000 13000 ...
##  $ hwage : num  112.5 90.6 112.5 98.1 237.5 ...
##  $ edu   : num  15 12 10 9 12 10 13 10 10 10 ...
##  $ potexp: num  3 25 11 22 32 34 11 44 29 10 ...

The output above lists the variable names and their types: sex and educat are factors, and the remaining six variables are numeric.
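
The same checks can also be made one dimension or one column at a time. Below is a minimal sketch on a small stand-in data frame, built from three columns of the first three rows printed by head() above; the name df is only illustrative and is not part of the assignment.

```r
# Stand-in data frame: three columns from the first three rows shown by head().
df <- data.frame(
  age    = c(24, 43, 27),
  sex    = factor(c("male", "female", "male"), levels = c("male", "female")),
  income = c(18000, 14500, 18000)
)

nrow(df)            # number of rows, same as dim(df)[1]
ncol(df)            # number of columns, same as dim(df)[2]
sapply(df, class)   # class of each column, a compact alternative to str()
```

Running the same three calls on inc_real should agree with the dim() and str() output above.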

2 Get the summary statistics for the variables in the data frame.

summary(inc_real)
##       age            sex          whours     
##  Min.   :16.00   male  :839   Min.   :36.00  
##  1st Qu.:28.00   female:432   1st Qu.:38.00  
##  Median :36.00                Median :40.00  
##  Mean   :36.78                Mean   :39.87  
##  3rd Qu.:45.00                3rd Qu.:40.00  
##  Max.   :64.00                Max.   :80.00  
##                                              
##                               educat        income          hwage       
##  apprenticeship                  :599   Min.   : 5000   Min.   : 31.25  
##  compulsory school               :220   1st Qu.:13000   1st Qu.: 81.25  
##  vocational school               :127   Median :15000   Median : 93.75  
##  vocational high school          :101   Mean   :16822   Mean   :105.14  
##  tertiary education (BA, MA, PhD): 87   3rd Qu.:20000   3rd Qu.:125.00  
##  academic high school            : 66   Max.   :80819   Max.   :505.12  
##  (Other)                         : 71                                   
##       edu            potexp     
##  Min.   : 9.00   Min.   : 0.00  
##  1st Qu.:10.00   1st Qu.:11.00  
##  Median :10.00   Median :19.00  
##  Mean   :10.95   Mean   :19.84  
##  3rd Qu.:12.00   3rd Qu.:28.00  
##  Max.   :17.00   Max.   :46.00  
## 

3 Generate the following sequences using seq() and rep().

  • Each sequence should have a length of 20 (i.e., 20 numbers); only the first 12 numbers are shown below.

    • 1 0 1 0 1 0 1 0 1 0 1 0 ….
    • 1 1 0 0 1 1 0 0 1 1 0 0 ….
    • 0 3 6 9 0 3 6 9 0 3 6 9 ….
rep(c(1,0),10)
##  [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
rep(rep(1:0, each = 2), 5)
##  [1] 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0
rep(seq(0, 9, by=3), 5)
##  [1] 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9
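
As a possible alternative, the length.out argument of rep() fixes the total length directly, so the number of repetitions does not have to be worked out by hand. This is only a sketch; the names s1, s2 and s3 are illustrative.

```r
# Let length.out set the total length to 20 instead of counting repetitions.
s1 <- rep(c(1, 0), length.out = 20)
s2 <- rep(c(1, 0), each = 2, length.out = 20)    # each is applied before length.out
s3 <- rep(seq(0, 9, by = 3), length.out = 20)

# Confirm that every sequence has exactly 20 elements.
sapply(list(s1, s2, s3), length)
```

Using length.out also works when the required length is not a multiple of the pattern length, in which case the last repetition is cut short.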

4 Plot a histogram of the income variable from the data.

hist(inc_real$income)
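
The default plot can be made easier to read with a title, an axis label, and a suggested number of bins. The sketch below is self-contained, so it uses a stand-in vector holding the first ten income values shown in the str() output above; with the real data, inc_real$income would take its place. The title and label text are assumptions, since the source does not state the unit of income.

```r
# Stand-in data: the first ten income values from the str() output.
income <- c(18000, 14500, 18000, 15700, 38000, 22000, 5200, 12000, 15000, 13000)

# Same hist() call with a title, an x-axis label, and a suggested bin count.
h <- hist(income,
          breaks = 10,                      # a suggestion; R may pick nearby "pretty" breaks
          main = "Distribution of income",
          xlab = "Income",
          col = "grey80")

# hist() invisibly returns the breaks and counts it used.
sum(h$counts)   # equals the number of observations plotted
```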