#Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Please select one, download it and perform the following tasks:
Let’s look at “Fair’s Extramarital Affairs Data”, focusing in particular on age and rating (self-rated happiness in marriage; higher is happier).
Bringing the dataset in from GitHub:
#1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
#read the CSV directly from GitHub into a data frame
Affairs<-read.csv("https://raw.githubusercontent.com/lszydziak/LS_CUNY/main/Affairs.csv")
#here are the first few records of the file
head(Affairs)
## X affairs gender age yearsmarried children religiousness education
## 1 4 0 male 37 10.00 no 3 18
## 2 5 0 female 27 4.00 no 4 14
## 3 11 0 female 32 15.00 yes 1 12
## 4 16 0 male 57 15.00 yes 5 18
## 5 23 0 male 22 0.75 no 2 17
## 6 29 0 female 32 1.50 no 2 17
## occupation rating
## 1 7 4
## 2 6 4
## 3 1 4
## 4 6 5
## 5 6 3
## 6 5 5
#this is the class of the data
class(Affairs)
## [1] "data.frame"
#how many rows and columns are in the Affairs dataset
dim(Affairs)
## [1] 601 10
#summary function on the data set result......
summary(Affairs)
## X affairs gender age
## Min. : 4 Min. : 0.000 Length:601 Min. :17.50
## 1st Qu.: 528 1st Qu.: 0.000 Class :character 1st Qu.:27.00
## Median :1009 Median : 0.000 Mode :character Median :32.00
## Mean :1060 Mean : 1.456 Mean :32.49
## 3rd Qu.:1453 3rd Qu.: 0.000 3rd Qu.:37.00
## Max. :9029 Max. :12.000 Max. :57.00
## yearsmarried children religiousness education
## Min. : 0.125 Length:601 Min. :1.000 Min. : 9.00
## 1st Qu.: 4.000 Class :character 1st Qu.:2.000 1st Qu.:14.00
## Median : 7.000 Mode :character Median :3.000 Median :16.00
## Mean : 8.178 Mean :3.116 Mean :16.17
## 3rd Qu.:15.000 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :15.000 Max. :5.000 Max. :20.00
## occupation rating
## Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:3.000
## Median :5.000 Median :4.000
## Mean :4.195 Mean :3.932
## 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :7.000 Max. :5.000
#here is the mean and median for age
mean(Affairs$age)
## [1] 32.48752
median(Affairs$age)
## [1] 32
#here is the mean and median for rating
mean(Affairs$rating)
## [1] 3.93178
median(Affairs$rating)
## [1] 4
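The four separate calls above can also be collapsed into one. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for `Affairs` (the values are illustrative, not taken from the dataset); it applies `mean` and `median` to every column with `sapply`.

```r
# Toy stand-in for the Affairs data; these values are made up for illustration.
toy <- data.frame(
  age    = c(22, 27, 32, 37, 57),
  rating = c(5, 4, 4, 3, 2)
)

# Apply mean and median to each column. The result is a matrix with
# one row per statistic and one column per variable.
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x)))
print(stats)
```

On the real data the same call would presumably be written as `sapply(Affairs[c("age", "rating")], ...)`, since the other columns include character variables for which a mean is not defined.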
#2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
Let’s grab the records of those who have had affairs, and reduce the number of variables we look at:
#create a subset of records with affairs ONLY, keeping the variables affairs, gender, age, children and rating
Affairs_only<-subset(Affairs, Affairs$affairs>0)
Affairs_only<-data.frame(Affairs_only$affairs, Affairs_only$gender,
Affairs_only$age,Affairs_only$children,Affairs_only$rating)
#display first few records of the new dataset...
head(Affairs_only)
## Affairs_only.affairs Affairs_only.gender Affairs_only.age
## 1 3 male 27
## 2 3 female 27
## 3 7 male 37
## 4 12 female 32
## 5 1 male 22
## 6 1 female 22
## Affairs_only.children Affairs_only.rating
## 1 no 4
## 2 yes 5
## 3 yes 2
## 4 yes 2
## 5 no 5
## 6 yes 5
#how many rows and columns are in the Affairs only ds
dim(Affairs_only)
## [1] 150 5
#Note that 150 of the 601 records in the whole dataset (about 25%) report at least one affair.
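The row filter and the column selection can also be done in a single `subset` call. This is a sketch on a made-up data frame `toy` (hypothetical values, with the same column names as `Affairs`), not the code used above. One difference worth noting: `select` keeps the original column names instead of producing the `Affairs_only.` prefixes that `data.frame()` generated.

```r
# Toy stand-in for the Affairs data; these values are made up for illustration.
toy <- data.frame(
  affairs   = c(0, 3, 0, 7, 1),
  gender    = c("male", "female", "female", "male", "female"),
  age       = c(37, 27, 32, 37, 22),
  children  = c("no", "yes", "yes", "yes", "no"),
  rating    = c(4, 5, 4, 2, 5),
  education = c(18, 14, 12, 18, 17),
  stringsAsFactors = FALSE
)

# Keep only rows with at least one affair and only the five columns of interest.
toy_only <- subset(toy, affairs > 0,
                   select = c(affairs, gender, age, children, rating))

# Renumber the rows 1..n; subset() otherwise keeps the original row labels.
rownames(toy_only) <- NULL
print(toy_only)
```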
First, let’s take a peek at the first few records of the dataset before changing the names:
head(Affairs_only)
## Affairs_only.affairs Affairs_only.gender Affairs_only.age
## 1 3 male 27
## 2 3 female 27
## 3 7 male 37
## 4 12 female 32
## 5 1 male 22
## 6 1 female 22
## Affairs_only.children Affairs_only.rating
## 1 no 4
## 2 yes 5
## 3 yes 2
## 4 yes 2
## 5 no 5
## 6 yes 5
colnames(Affairs_only)<-c("AFFAIRS","SEX","AGE","KIDS","HAPPY")
#display first few rows of new dataset after the name change
head(Affairs_only)
## AFFAIRS SEX AGE KIDS HAPPY
## 1 3 male 27 no 4
## 2 3 female 27 yes 5
## 3 7 male 37 yes 2
## 4 12 female 32 yes 2
## 5 1 male 22 no 5
## 6 1 female 22 yes 5
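Assigning a vector to `colnames` renames by position, so it depends on the column order staying the same. A sketch of renaming by name instead, assuming a made-up data frame `toy_only` that still carries the original column names (`affairs`, `gender`, and so on) and a hypothetical lookup vector `new_names`:

```r
# Toy stand-in with the original column names; values are made up.
toy_only <- data.frame(
  affairs  = c(3, 7),
  gender   = c("female", "male"),
  age      = c(27, 37),
  children = c("yes", "yes"),
  rating   = c(5, 2),
  stringsAsFactors = FALSE
)

# Lookup vector: old name on the left, new name on the right.
new_names <- c(affairs = "AFFAIRS", gender = "SEX", age = "AGE",
               children = "KIDS", rating = "HAPPY")

# Look up each current name; the result does not depend on column order.
names(toy_only) <- unname(new_names[names(toy_only)])
print(names(toy_only))
```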
summary(Affairs_only)
## AFFAIRS SEX AGE KIDS
## Min. : 1.000 Length:150 Min. :17.50 Length:150
## 1st Qu.: 2.000 Class :character 1st Qu.:27.00 Class :character
## Median : 7.000 Mode :character Median :32.00 Mode :character
## Mean : 5.833 Mean :33.41
## 3rd Qu.:10.750 3rd Qu.:37.00
## Max. :12.000 Max. :57.00
## HAPPY
## Min. :1.000
## 1st Qu.:2.000
## Median :4.000
## Mean :3.447
## 3rd Qu.:4.000
## Max. :5.000
mean(Affairs_only$AGE)
## [1] 33.41
median(Affairs_only$AGE)
## [1] 32
mean(Affairs_only$HAPPY)
## [1] 3.446667
median(Affairs_only$HAPPY)
## [1] 4
Let’s compare the mean age in the original full dataset with the mean age of people who had affairs.
# mean Age full Data
mean(Affairs$age)
## [1] 32.48752
#mean Age on Affairs
mean(Affairs_only$AGE)
## [1] 33.41
#The mean age of those who had affairs (33.41) is higher than for the whole dataset (32.49), which implies they are on average older than those who did not have affairs.
Let’s compare the median age in the original full dataset with the median age of people who had affairs.
# median Age full Data
median(Affairs$age)
## [1] 32
#median Age on Affairs
median(Affairs_only$AGE)
## [1] 32
#The median age is the same (32) in the full dataset and the affairs-only subset.
Let’s compare the mean rating in the original full dataset with the mean rating of people who had affairs.
# mean rating full Data
mean(Affairs$rating)
## [1] 3.93178
#mean rating on Affairs
mean(Affairs_only$HAPPY)
## [1] 3.446667
#Interestingly, the full dataset has a higher mean rating (3.93) than the affairs-only subset (3.45), suggesting that those having affairs rate their marriages as less happy on average.
Let’s compare the median rating in the original full dataset with the median rating of people who had affairs.
# median rating full Data
median(Affairs$rating)
## [1] 4
#median Rating on Affairs
median(Affairs_only$HAPPY)
## [1] 4
#The median rating is the same (4) in the full dataset and the affairs-only subset.
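The comparisons above set the affairs-only subset against the full dataset, which itself contains the subset. A more direct comparison puts the two groups side by side. The following is a minimal sketch on made-up data, using a hypothetical grouping column `had_affair` and base R's `aggregate`; it is an alternative view, not the method used above.

```r
# Toy stand-in for the Affairs data; these values are made up for illustration.
toy <- data.frame(
  affairs = c(0, 0, 0, 3, 7, 1),
  age     = c(22, 27, 32, 27, 37, 42),
  rating  = c(5, 4, 5, 4, 2, 3)
)

# Hypothetical grouping column: did the respondent report any affair?
toy$had_affair <- ifelse(toy$affairs > 0, "affair", "no affair")

# Mean and median of age and rating within each group.
group_means   <- aggregate(cbind(age, rating) ~ had_affair, data = toy, FUN = mean)
group_medians <- aggregate(cbind(age, rating) ~ had_affair, data = toy, FUN = median)

print(group_means)
print(group_medians)
```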
Note that the Affairs dataset has only two categorical variables, each of which takes two values. To accommodate this, we rename the values of both categorical variables, for a total of four changes.
Specifically, SEX: male -> M, female -> F; KIDS: yes -> children, no -> no children.
Affairs_only$SEX[Affairs_only$SEX == 'male'] <- 'M'
Affairs_only$SEX[Affairs_only$SEX == 'female'] <- 'F'
Affairs_only$KIDS[Affairs_only$KIDS == 'yes'] <- 'children'
Affairs_only$KIDS[Affairs_only$KIDS == 'no'] <- 'no children'
head(Affairs_only)
## AFFAIRS SEX AGE KIDS HAPPY
## 1 3 M 27 no children 4
## 2 3 F 27 children 5
## 3 7 M 37 children 2
## 4 12 F 32 children 2
## 5 1 M 22 no children 5
## 6 1 F 22 children 5
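The four assignments above can also be expressed as two lookups. This is a sketch on a made-up data frame `toy`, with hypothetical lookup vectors `sex_map` and `kids_map`; it assumes the columns are character vectors rather than factors. Any value missing from a lookup vector would become `NA`, which makes unexpected values easy to spot.

```r
# Toy stand-in for the renamed subset; values are made up for illustration.
toy <- data.frame(
  SEX  = c("male", "female", "male"),
  KIDS = c("no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Lookup vectors: old value on the left, new value on the right.
sex_map  <- c(male = "M", female = "F")
kids_map <- c(yes = "children", no = "no children")

# Index each lookup vector by the current values to recode the whole column.
toy$SEX  <- unname(sex_map[toy$SEX])
toy$KIDS <- unname(kids_map[toy$KIDS])
print(toy)
```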