Select a data set, download it, and perform the following tasks.

I’ve selected the “Treatment of Migraine Headaches” data from KosteckiDillon.

1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

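# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0);
# the summary() counts and the levels() calls below rely on that
migraineData <- read.table("KosteckiDillon.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)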
head(migraineData)
##   X id time dos hatype age airq medication headache    sex
## 1 1  1  -11 753   Aura  30    9 continuing      yes female
## 2 2  1  -10 754   Aura  30    7 continuing      yes female
## 3 3  1   -9 755   Aura  30   10 continuing      yes female
## 4 4  1   -8 756   Aura  30   13 continuing      yes female
## 5 5  1   -7 757   Aura  30   18 continuing      yes female
## 6 6  1   -6 758   Aura  30   19 continuing      yes female
summary(migraineData)
##        X              id              time             dos        
##  Min.   :   1   Min.   :  1.00   Min.   :-29.00   Min.   :  98.0  
##  1st Qu.:1039   1st Qu.: 33.00   1st Qu.:  3.00   1st Qu.: 384.0  
##  Median :2076   Median : 67.00   Median : 12.00   Median : 623.0  
##  Mean   :2076   Mean   : 66.39   Mean   : 15.46   Mean   : 646.7  
##  3rd Qu.:3114   3rd Qu.:100.00   3rd Qu.: 24.00   3rd Qu.: 950.0  
##  Max.   :4152   Max.   :133.00   Max.   : 99.00   Max.   :1239.0  
##      hatype          age             airq            medication  
##  Aura   :1710   Min.   :18.00   Min.   : 3.00   continuing:2386  
##  Mixed  : 457   1st Qu.:33.00   1st Qu.:18.00   none      : 785  
##  No Aura:1985   Median :44.00   Median :24.00   reduced   : 981  
##                 Mean   :42.36   Mean   :24.83                    
##                 3rd Qu.:50.00   3rd Qu.:29.00                    
##                 Max.   :66.00   Max.   :73.00                    
##  headache       sex      
##  no :1486   female:3545  
##  yes:2666   male  : 607  
##                          
##                          
##                          
## 
#Get mean and median airq
airMean1 <- mean(migraineData$airq)
airMed1 <- median(migraineData$airq)
airMean1
## [1] 24.82601
airMed1
## [1] 24
#Get mean and median age
ageMean1 <- mean(migraineData$age)
ageMed1 <- median(migraineData$age)
ageMean1
## [1] 42.36392
ageMed1
## [1] 44
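
The same mean and median lookup can be done for several columns in one call. The following is a minimal sketch, assuming a small made-up data frame named `toy` that stands in for `migraineData` (its values are illustrative only and are not taken from the real data):

```r
# Hypothetical stand-in for migraineData; values are invented for illustration
toy <- data.frame(
  age  = c(30, 30, 49, 49, 52),
  airq = c(9, 7, 45, 54, 20),
  sex  = c("female", "female", "male", "male", "female")
)

# Compute mean and median for each chosen numeric column at once;
# sapply() returns a matrix with one column per attribute
stats <- sapply(toy[, c("age", "airq")],
                function(x) c(mean = mean(x), median = median(x)))
print(stats)
```

Swapping `toy` for `migraineData` should give the same figures as the individual `mean()` and `median()` calls above, in one table.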

2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.

#Create data frame with only male patients
maleMigraineData <- migraineData[migraineData$sex == "male", ]
head(maleMigraineData)
##       X id time dos hatype age airq medication headache  sex
## 593 593 18    1 184   Aura  49   45       none      yes male
## 594 594 18    2 185   Aura  49   54       none      yes male
## 595 595 18    3 186   Aura  49   44       none       no male
## 596 596 18    4 187   Aura  49   61       none      yes male
## 597 597 18    5 188   Aura  49   39       none      yes male
## 598 598 18    6 189   Aura  49   48       none      yes male
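
The subset above filters rows only and keeps every column. If the task is read as requiring fewer columns as well, both can be restricted in one step. This is a sketch under that assumption, using a made-up data frame `toy` rather than the real data; the choice of columns is arbitrary:

```r
# Hypothetical stand-in for migraineData; values are invented for illustration
toy <- data.frame(
  id       = c(1, 1, 18, 18),
  age      = c(30, 30, 49, 49),
  airq     = c(9, 7, 45, 54),
  headache = c("yes", "yes", "yes", "no"),
  sex      = c("female", "female", "male", "male")
)

# Row filter before the comma, column selection after it
maleSubset <- toy[toy$sex == "male", c("id", "age", "airq")]

# The same selection written with subset()
maleSubset2 <- subset(toy, sex == "male", select = c(id, age, airq))

print(maleSubset)
```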

3. Create new column names for the new data frame.

names(maleMigraineData) <- c("entryNum", "patientID", "days", "timeFromStart", "typeOfMigraine",
                             "howOld", "airQual", "med", "ache", "gender")
head(maleMigraineData)
##     entryNum patientID days timeFromStart typeOfMigraine howOld airQual
## 593      593        18    1           184           Aura     49      45
## 594      594        18    2           185           Aura     49      54
## 595      595        18    3           186           Aura     49      44
## 596      596        18    4           187           Aura     49      61
## 597      597        18    5           188           Aura     49      39
## 598      598        18    6           189           Aura     49      48
##      med ache gender
## 593 none  yes   male
## 594 none  yes   male
## 595 none   no   male
## 596 none  yes   male
## 597 none  yes   male
## 598 none  yes   male
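
Assigning a full vector to `names()` depends on the column order staying fixed. An alternative is to rename selected columns by their old names. The sketch below assumes a made-up data frame `toy`; the lookup vector `renameMap` is a hypothetical helper, not part of the original work:

```r
# Hypothetical stand-in data; values are invented for illustration
toy <- data.frame(id = c(1, 18), age = c(30, 49), airq = c(9, 45))

# Old name on the left, new name on the right
renameMap <- c(age = "howOld", airq = "airQual")

# Find where each old name sits, then overwrite only those positions
idx <- match(names(renameMap), names(toy))
names(toy)[idx] <- unname(renameMap)

print(names(toy))
```

Columns not listed in `renameMap` (here `id`) keep their original names.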

4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(maleMigraineData)
##     entryNum      patientID           days        timeFromStart   
##  Min.   : 593   Min.   : 18.00   Min.   :-21.00   Min.   : 184.0  
##  1st Qu.:1654   1st Qu.: 55.00   1st Qu.:  5.00   1st Qu.: 524.0  
##  Median :2884   Median : 92.00   Median : 16.00   Median : 863.0  
##  Mean   :2583   Mean   : 83.09   Mean   : 24.38   Mean   : 783.6  
##  3rd Qu.:3602   3rd Qu.:117.00   3rd Qu.: 37.00   3rd Qu.:1110.0  
##  Max.   :4152   Max.   :133.00   Max.   : 99.00   Max.   :1239.0  
##  typeOfMigraine     howOld         airQual              med       ache    
##  Aura   :117    Min.   :18.00   Min.   : 6.00   continuing:230   no :220  
##  Mixed  :166    1st Qu.:41.00   1st Qu.:18.00   none      :193   yes:387  
##  No Aura:324    Median :45.00   Median :23.00   reduced   :184            
##                 Mean   :44.69   Mean   :24.48                             
##                 3rd Qu.:53.00   3rd Qu.:29.00                             
##                 Max.   :63.00   Max.   :73.00                             
##     gender   
##  female:  0  
##  male  :607  
##              
##              
##              
## 
#Get mean and median airq
airMean2 <- mean(maleMigraineData$airQual)
airMed2 <- median(maleMigraineData$airQual)
airMean2
## [1] 24.47644
airMed2
## [1] 23
#Get mean and median age
ageMean2 <- mean(maleMigraineData$howOld)
ageMed2 <- median(maleMigraineData$howOld)
ageMean2
## [1] 44.68863
ageMed2
## [1] 45
sprintf("The air quality mean of the whole migraine data is %.2f versus just the males, which is %.2f. The median is %.2f and %.2f, respectively.", airMean1, airMean2, airMed1, airMed2)
## [1] "The air quality mean of the whole migraine data is 24.83 versus just the males, which is 24.48. The median is 24.00 and 23.00, respectively."
sprintf("The age mean of the whole migraine data is %.0f versus just the males, which is %.0f. The median is %.0f and %.0f, respectively.", ageMean1, ageMean2, ageMed1, ageMed2)
## [1] "The age mean of the whole migraine data is 42 versus just the males, which is 45. The median is 44 and 45, respectively."
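
A related comparison is between the groups themselves rather than between the whole data set and one subset. The sketch below uses `aggregate()` on a made-up data frame `toy` (illustrative values only), so no separate subset is needed:

```r
# Hypothetical stand-in for migraineData; values are invented for illustration
toy <- data.frame(
  age  = c(30, 30, 49, 49, 52),
  airq = c(9, 7, 45, 54, 20),
  sex  = c("female", "female", "male", "male", "female")
)

# One row per sex, one column per attribute
meanBySex   <- aggregate(cbind(age, airq) ~ sex, data = toy, FUN = mean)
medianBySex <- aggregate(cbind(age, airq) ~ sex, data = toy, FUN = median)

print(meanBySex)
print(medianBySex)
```

Note that this compares males with females, which is a slightly different question from the whole-versus-male comparison printed above.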

5. For at least 3 values in a column, rename them so that every occurrence of each value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column; rename those values so that all 20 show as “excellent.”

#Add new value to the levels of the factor
levels(maleMigraineData$gender) <- c(levels(maleMigraineData$gender), "man")

levels(maleMigraineData$gender)
## [1] "female" "male"   "man"
#Replace male with man
maleMigraineData$gender[maleMigraineData$gender == "male"] <- "man"

head(maleMigraineData)
##     entryNum patientID days timeFromStart typeOfMigraine howOld airQual
## 593      593        18    1           184           Aura     49      45
## 594      594        18    2           185           Aura     49      54
## 595      595        18    3           186           Aura     49      44
## 596      596        18    4           187           Aura     49      61
## 597      597        18    5           188           Aura     49      39
## 598      598        18    6           189           Aura     49      48
##      med ache gender
## 593 none  yes    man
## 594 none  yes    man
## 595 none   no    man
## 596 none  yes    man
## 597 none  yes    man
## 598 none  yes    man
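
Adding a level and then reassigning values works, but it leaves the old levels behind. When the column is a factor, the levels can also be renamed directly, which relabels every matching row at once. The sketch below uses small made-up factors (`hatype` and `gender` here are standalone vectors, not the real columns) and renames three values:

```r
# Made-up factor with the three headache types seen in the data
hatype <- factor(c("Aura", "Mixed", "No Aura", "Aura", "No Aura"))

# List form of levels<-: new label on the left, old label on the right
levels(hatype) <- list("with aura"    = "Aura",
                       "mixed"        = "Mixed",
                       "without aura" = "No Aura")
print(hatype)

# Renaming a single level in place, then dropping levels that no longer occur
gender <- factor(c("male", "male"), levels = c("female", "male"))
levels(gender)[levels(gender) == "male"] <- "man"
gender <- droplevels(gender)
print(levels(gender))
```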

6. Display enough rows to see examples of all the steps 1-5 above.

head(migraineData)
##   X id time dos hatype age airq medication headache    sex
## 1 1  1  -11 753   Aura  30    9 continuing      yes female
## 2 2  1  -10 754   Aura  30    7 continuing      yes female
## 3 3  1   -9 755   Aura  30   10 continuing      yes female
## 4 4  1   -8 756   Aura  30   13 continuing      yes female
## 5 5  1   -7 757   Aura  30   18 continuing      yes female
## 6 6  1   -6 758   Aura  30   19 continuing      yes female
head(maleMigraineData)
##     entryNum patientID days timeFromStart typeOfMigraine howOld airQual
## 593      593        18    1           184           Aura     49      45
## 594      594        18    2           185           Aura     49      54
## 595      595        18    3           186           Aura     49      44
## 596      596        18    4           187           Aura     49      61
## 597      597        18    5           188           Aura     49      39
## 598      598        18    6           189           Aura     49      48
##      med ache gender
## 593 none  yes    man
## 594 none  yes    man
## 595 none   no    man
## 596 none  yes    man
## 597 none  yes    man
## 598 none  yes    man

7. BONUS: place the original .csv file in a GitHub repository and have R read it from the link. This will be a very useful skill as you progress in your data science education and career.

theURL <- "https://raw.githubusercontent.com/Aysmel/MSDS-SummerBridge19-R-Week2/master/KosteckiDillon.csv"

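# stringsAsFactors = TRUE matches the pre-4.0.0 default used for the local read in step 1
migraineDataFromURL <- read.table(theURL, header = TRUE, sep = ",", stringsAsFactors = TRUE)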
head(migraineDataFromURL)
##   X id time dos hatype age airq medication headache    sex
## 1 1  1  -11 753   Aura  30    9 continuing      yes female
## 2 2  1  -10 754   Aura  30    7 continuing      yes female
## 3 3  1   -9 755   Aura  30   10 continuing      yes female
## 4 4  1   -8 756   Aura  30   13 continuing      yes female
## 5 5  1   -7 757   Aura  30   18 continuing      yes female
## 6 6  1   -6 758   Aura  30   19 continuing      yes female
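
Because `read.table()` and `read.csv()` accept either a file path or a URL as their first argument, the local and remote reads can share one small wrapper. This is a sketch only: `readMigraine` is a hypothetical helper name, and the example reads a temporary file with made-up rows so that it runs without network access:

```r
# Hypothetical helper: src may be a local path or a URL string
readMigraine <- function(src) {
  read.csv(src, header = TRUE, stringsAsFactors = TRUE)
}

# Demonstrate with a temporary CSV containing invented rows
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age,sex", "1,30,female", "18,49,male"), tmp)

localCopy <- readMigraine(tmp)
str(localCopy)
```

Calling `readMigraine(theURL)` with the raw GitHub link defined above would be the remote equivalent, assuming the file is still hosted at that address.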