#Here is a list of data sets:
#http://vincentarelbundock.github.io/Rdatasets/
#(click on the csv index for a list)
#Please select one, download it and perform the following tasks:

Let's look at "Fair's Extramarital Affairs Data". In particular, we focus on age and rating (self-rated happiness in marriage; higher is happier).

Bringing the dataset in from GitHub...

#1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

#affair2<-read.table(file=theUrl,header=TRUE,quote="#",sep=",")
affair2<-read.csv("https://raw.githubusercontent.com/lszydziak/LS_CUNY/main/Affairs.csv")
#affair2<-"C:/Users/Lisa/Documents/CUNY/Bridge/rsconnect/documents/HW2/affairs.csv"
#Affairs<-read.table(file=affair2,header=TRUE, #sep=",")
Affairs<-affair2
#here are the first few records of the file

head(Affairs)
##    X affairs gender age yearsmarried children religiousness education
## 1  4       0   male  37        10.00       no             3        18
## 2  5       0 female  27         4.00       no             4        14
## 3 11       0 female  32        15.00      yes             1        12
## 4 16       0   male  57        15.00      yes             5        18
## 5 23       0   male  22         0.75       no             2        17
## 6 29       0 female  32         1.50       no             2        17
##   occupation rating
## 1          7      4
## 2          6      4
## 3          1      4
## 4          6      5
## 5          6      3
## 6          5      5
#this is the class of the data
class(Affairs)
## [1] "data.frame"
#how many rows and columns are in the Affairs dataset
dim(Affairs)
## [1] 601  10
#summary function on the data set result......
summary(Affairs)
##        X           affairs          gender               age       
##  Min.   :   4   Min.   : 0.000   Length:601         Min.   :17.50  
##  1st Qu.: 528   1st Qu.: 0.000   Class :character   1st Qu.:27.00  
##  Median :1009   Median : 0.000   Mode  :character   Median :32.00  
##  Mean   :1060   Mean   : 1.456                      Mean   :32.49  
##  3rd Qu.:1453   3rd Qu.: 0.000                      3rd Qu.:37.00  
##  Max.   :9029   Max.   :12.000                      Max.   :57.00  
##   yearsmarried      children         religiousness     education    
##  Min.   : 0.125   Length:601         Min.   :1.000   Min.   : 9.00  
##  1st Qu.: 4.000   Class :character   1st Qu.:2.000   1st Qu.:14.00  
##  Median : 7.000   Mode  :character   Median :3.000   Median :16.00  
##  Mean   : 8.178                      Mean   :3.116   Mean   :16.17  
##  3rd Qu.:15.000                      3rd Qu.:4.000   3rd Qu.:18.00  
##  Max.   :15.000                      Max.   :5.000   Max.   :20.00  
##    occupation        rating     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000  
##  Median :5.000   Median :4.000  
##  Mean   :4.195   Mean   :3.932  
##  3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :7.000   Max.   :5.000
#here is the mean and median for age

mean(Affairs$age)
## [1] 32.48752
median(Affairs$age)
## [1] 32
#here is the mean and median for rating

mean(Affairs$rating)
## [1] 3.93178
median(Affairs$rating)
## [1] 4
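
The separate mean() and median() calls above can also be collapsed into one step. A minimal sketch, assuming a small stand-in data frame `toy` (a hypothetical name) built from the six rows printed by head(Affairs), so its numbers describe only those six rows and not the full 601-row data set:

```r
# Stand-in data: age and rating from the six rows printed by head(Affairs).
toy <- data.frame(
  age    = c(37, 27, 32, 57, 22, 32),
  rating = c(4, 4, 4, 5, 3, 5)
)

# Apply mean and median to each selected column. sapply() simplifies the
# result to a matrix: one row per statistic, one column per attribute.
stats <- sapply(toy[c("age", "rating")],
                function(x) c(mean = mean(x), median = median(x)))
stats
```

On the real data the same call would be written against `Affairs[c("age", "rating")]`.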

#2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Let's grab the records of those who had affairs and reduce the number of variables we look at...

#create subset of records with affairs ONLY, keeping the variables affairs, gender, age, children and rating....
Affairs_only<-subset(Affairs, Affairs$affairs>0)
Affairs_only<-data.frame(Affairs_only$affairs, Affairs_only$gender,
                         Affairs_only$age,Affairs_only$children,Affairs_only$rating)
#display first few records of the new dataset...
head(Affairs_only)
##   Affairs_only.affairs Affairs_only.gender Affairs_only.age
## 1                    3                male               27
## 2                    3              female               27
## 3                    7                male               37
## 4                   12              female               32
## 5                    1                male               22
## 6                    1              female               22
##   Affairs_only.children Affairs_only.rating
## 1                    no                   4
## 2                   yes                   5
## 3                   yes                   2
## 4                   yes                   2
## 5                    no                   5
## 6                   yes                   5
#how many rows and columns are in the affairs-only dataset
dim(Affairs_only)
## [1] 150   5
#Note that 150 of the 601 records in the whole dataset report at least one affair.
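
The row filter and the column selection can also be done in a single subset() call through its `select` argument. This is a sketch of an alternative, not the approach used above. It assumes a hypothetical stand-in data frame `toy`, assembled from the first three rows of each head() output shown earlier, and it keeps only three columns to keep the example short:

```r
# Stand-in data: three records without affairs and three with affairs,
# copied from the head() outputs above (not the full data set).
toy <- data.frame(
  affairs  = c(0, 0, 0, 3, 3, 7),
  gender   = c("male", "female", "female", "male", "female", "male"),
  age      = c(37, 27, 32, 27, 27, 37),
  children = c("no", "no", "yes", "no", "yes", "yes"),
  rating   = c(4, 4, 4, 4, 5, 2)
)

# Filter rows and pick columns in one call; column names are given unquoted.
had_affair <- subset(toy, affairs > 0, select = c(affairs, age, rating))
had_affair
```

Unlike rebuilding the data frame with data.frame(), this keeps the original column names and the original row names, so the kept rows stay traceable to the source data.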
#3. Create new column names for the new data frame. Here are the changes of column names (old->new): affairs->AFFAIRS, gender->SEX, age->AGE, children->KIDS, rating->HAPPY

First, let’s take a peek at the first few records of the dataset before changing the names…

head(Affairs_only)
##   Affairs_only.affairs Affairs_only.gender Affairs_only.age
## 1                    3                male               27
## 2                    3              female               27
## 3                    7                male               37
## 4                   12              female               32
## 5                    1                male               22
## 6                    1              female               22
##   Affairs_only.children Affairs_only.rating
## 1                    no                   4
## 2                   yes                   5
## 3                   yes                   2
## 4                   yes                   2
## 5                    no                   5
## 6                   yes                   5
colnames(Affairs_only)<-c("AFFAIRS","SEX","AGE","KIDS","HAPPY")
#display first few rows of new dataset after the name change

head(Affairs_only)
##   AFFAIRS    SEX AGE KIDS HAPPY
## 1       3   male  27   no     4
## 2       3 female  27  yes     5
## 3       7   male  37  yes     2
## 4      12 female  32  yes     2
## 5       1   male  22   no     5
## 6       1 female  22  yes     5
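
Assigning a full vector to colnames() depends on the column order. A hedged alternative is to map old names to new ones through a named lookup vector. The sketch below assumes a hypothetical data frame `toy` whose columns still carry the plain original names (affairs, gender, age, children, rating); the names `toy` and `new_names` are illustrative only:

```r
# Stand-in data with the original column names (first three affair records).
toy <- data.frame(
  affairs  = c(3, 3, 7),
  gender   = c("male", "female", "male"),
  age      = c(27, 27, 37),
  children = c("no", "yes", "yes"),
  rating   = c(4, 5, 2)
)

# Lookup vector: element names are the old column names, values are the new ones.
new_names <- c(affairs = "AFFAIRS", gender = "SEX", age = "AGE",
               children = "KIDS", rating = "HAPPY")

# Index the lookup by the current names, then drop the leftover element names.
names(toy) <- unname(new_names[names(toy)])
names(toy)
```

Because each column is matched by name, the rename stays correct even if the columns are reordered.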
#4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(Affairs_only)
##     AFFAIRS           SEX                 AGE            KIDS          
##  Min.   : 1.000   Length:150         Min.   :17.50   Length:150        
##  1st Qu.: 2.000   Class :character   1st Qu.:27.00   Class :character  
##  Median : 7.000   Mode  :character   Median :32.00   Mode  :character  
##  Mean   : 5.833                      Mean   :33.41                     
##  3rd Qu.:10.750                      3rd Qu.:37.00                     
##  Max.   :12.000                      Max.   :57.00                     
##      HAPPY      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :4.000  
##  Mean   :3.447  
##  3rd Qu.:4.000  
##  Max.   :5.000
mean(Affairs_only$AGE)
## [1] 33.41
median(Affairs_only$AGE)
## [1] 32
mean(Affairs_only$HAPPY)
## [1] 3.446667
median(Affairs_only$HAPPY)
## [1] 4

Let's compare the mean age in the original full data set with the mean age of people who had affairs.

# mean Age full Data
mean(Affairs$age)
## [1] 32.48752
#mean Age on Affairs 
mean(Affairs_only$AGE)
## [1] 33.41
#The mean age is higher in the affairs subset than in the whole dataset, which implies that people with affairs are on average older than those without.

Let's compare the median age in the original full data set with the median age of people who had affairs.

# median Age full Data
median(Affairs$age)
## [1] 32
#median Age on Affairs 
median(Affairs_only$AGE)
## [1] 32
#the median is the same in the full dataset and the affairs only dataset.

Let's compare the mean rating in the original full data set with the mean rating of people who had affairs.

# mean rating full Data
mean(Affairs$rating)
## [1] 3.93178
#mean rating on Affairs 
mean(Affairs_only$HAPPY)
## [1] 3.446667
#Interestingly, the full dataset has a higher mean rating than the affairs subset, implying that those having affairs are on average not as happy in their marriage.

Let's compare the median rating in the original full data set with the median rating of people who had affairs.

# median rating full Data
median(Affairs$rating)
## [1] 4
#median Rating on Affairs 
median(Affairs_only$HAPPY)
## [1] 4
#The median is the same in the full Dataset and the affairs only subset.
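
The comparisons above set the affairs subset against the whole dataset, which itself contains the subset. A more direct contrast is affairs versus no affairs, which aggregate() can compute in one call. This is only a sketch on a hypothetical 12-row sample `toy` assembled from the two head() outputs shown earlier; the numbers it prints describe that sample and will not match the full-data results above:

```r
# Stand-in data: six records without affairs (from head(Affairs)) and
# six with affairs (from head(Affairs_only)). Not the full data set.
toy <- data.frame(
  affairs = c(0, 0, 0, 0, 0, 0, 3, 3, 7, 12, 1, 1),
  age     = c(37, 27, 32, 57, 22, 32, 27, 27, 37, 32, 22, 22),
  rating  = c(4, 4, 4, 5, 3, 5, 4, 5, 2, 2, 5, 5)
)

# Flag each record, then compute the group means of age and rating.
toy$had_affair <- toy$affairs > 0
comparison <- aggregate(cbind(age, rating) ~ had_affair, data = toy, FUN = mean)
comparison
```

Replacing `FUN = mean` with `FUN = median` gives the corresponding medians.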
#5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter "e" in one column. Rename those values so that all 20 would show as "excellent".

Note that the affairs-only dataset has only two categorical variables, each taking two values. To adapt, we rename the values of both variables, for a total of four changes.

Specifically, change SEX: male->M, female->F; KIDS: yes->children, no->no children

Affairs_only$SEX[Affairs_only$SEX == 'male'] <- 'M'
Affairs_only$SEX[Affairs_only$SEX == 'female'] <- 'F'
Affairs_only$KIDS[Affairs_only$KIDS == 'yes'] <- 'children'
Affairs_only$KIDS[Affairs_only$KIDS == 'no'] <- 'no children'
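
The four assignments above can also be written as lookups in named vectors, which keeps each old->new mapping in one place. A minimal sketch, assuming plain character vectors `sex` and `kids` (hypothetical names) that hold the six values shown by head(Affairs_only):

```r
# Stand-in data: SEX and KIDS values from the six rows of head(Affairs_only).
sex  <- c("male", "female", "male", "female", "male", "female")
kids <- c("no", "yes", "yes", "yes", "no", "yes")

# Lookup vectors: element names are the old values, elements are the new labels.
sex_labels  <- c(male = "M", female = "F")
kids_labels <- c(yes = "children", no = "no children")

# Index each lookup by the old values; unname() drops the leftover names.
sex_new  <- unname(sex_labels[sex])
kids_new <- unname(kids_labels[kids])
data.frame(SEX = sex_new, KIDS = kids_new)
```

A value missing from the lookup vector comes back as NA, which makes unexpected categories easy to spot.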
#6. Display enough rows to see examples of all of steps 1-5 above.
head(Affairs_only)
##   AFFAIRS SEX AGE        KIDS HAPPY
## 1       3   M  27 no children     4
## 2       3   F  27    children     5
## 3       7   M  37    children     2
## 4      12   F  32    children     2
## 5       1   M  22 no children     5
## 6       1   F  22    children     5
#7. BONUS: place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career. Completed using the raw GitHub URL.
