For this week’s assignment, I used a CSV file from https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/sandwich/PublicSchools.csv
The file was read into R with the following code:
theURL <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/sandwich/PublicSchools.csv"
read.table(file=theURL, header=TRUE, sep=",")
## X Expenditure Income
## 1 Alabama 275 6247
## 2 Alaska 821 10851
## 3 Arizona 339 7374
## 4 Arkansas 275 6183
## 5 California 387 8850
## 6 Colorado 452 8001
## 7 Connecticut 531 8914
## 8 Delaware 424 8604
## 9 Florida 316 7505
## 10 Georgia 265 6700
## 11 Hawaii 403 8380
## 12 Idaho 304 6813
## 13 Illinois 437 8745
## 14 Indiana 345 7696
## 15 Iowa 431 7873
## 16 Kansas 355 8001
## 17 Kentucky 260 6615
## 18 Louisiana 316 6640
## 19 Maine 327 6333
## 20 Maryland 427 8306
## 21 Massachusetts 427 8063
## 22 Michigan 466 8442
## 23 Minnesota 477 7847
## 24 Mississippi 259 5736
## 25 Missouri 274 7342
## 26 Montana 433 7051
## 27 Nebraska 294 7391
## 28 Nevada 359 9032
## 29 New Hampshire 279 7277
## 30 New Jersey 423 8818
## 31 New Mexico 388 6505
## 32 New York 447 8267
## 33 North Carolina 335 6607
## 34 North Dakota 311 7478
## 35 Ohio 322 7812
## 36 Oklahoma 320 6951
## 37 Oregon 397 7839
## 38 Pennsylvania 412 7733
## 39 Rhode Island 342 7526
## 40 South Carolina 315 6242
## 41 South Dakota 321 6841
## 42 Tennessee 268 6489
## 43 Texas 315 7697
## 44 Utah 417 6622
## 45 Vermont 353 6541
## 46 Virginia 356 7624
## 47 Washington 415 8450
## 48 Washington DC 428 10022
## 49 West Virginia 320 6456
## 50 Wisconsin NA 7597
## 51 Wyoming 500 9096
ps <- read.table(file=theURL, header=TRUE, sep=",")
ps
## X Expenditure Income
## 1 Alabama 275 6247
## 2 Alaska 821 10851
## 3 Arizona 339 7374
## 4 Arkansas 275 6183
## 5 California 387 8850
## 6 Colorado 452 8001
## 7 Connecticut 531 8914
## 8 Delaware 424 8604
## 9 Florida 316 7505
## 10 Georgia 265 6700
## 11 Hawaii 403 8380
## 12 Idaho 304 6813
## 13 Illinois 437 8745
## 14 Indiana 345 7696
## 15 Iowa 431 7873
## 16 Kansas 355 8001
## 17 Kentucky 260 6615
## 18 Louisiana 316 6640
## 19 Maine 327 6333
## 20 Maryland 427 8306
## 21 Massachusetts 427 8063
## 22 Michigan 466 8442
## 23 Minnesota 477 7847
## 24 Mississippi 259 5736
## 25 Missouri 274 7342
## 26 Montana 433 7051
## 27 Nebraska 294 7391
## 28 Nevada 359 9032
## 29 New Hampshire 279 7277
## 30 New Jersey 423 8818
## 31 New Mexico 388 6505
## 32 New York 447 8267
## 33 North Carolina 335 6607
## 34 North Dakota 311 7478
## 35 Ohio 322 7812
## 36 Oklahoma 320 6951
## 37 Oregon 397 7839
## 38 Pennsylvania 412 7733
## 39 Rhode Island 342 7526
## 40 South Carolina 315 6242
## 41 South Dakota 321 6841
## 42 Tennessee 268 6489
## 43 Texas 315 7697
## 44 Utah 417 6622
## 45 Vermont 353 6541
## 46 Virginia 356 7624
## 47 Washington 415 8450
## 48 Washington DC 428 10022
## 49 West Virginia 320 6456
## 50 Wisconsin NA 7597
## 51 Wyoming 500 9096
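As a minimal sketch of the same import step: `read.csv()` is a wrapper around `read.table()` that already defaults to `header = TRUE` and `sep = ","`. To run without a network connection, the example below passes three rows copied from the output above through the `text` argument instead of the URL. The names `csv_text` and `ps_inline` are hypothetical and used only for illustration.

```r
# Three rows copied from the printed output above; not the full file.
csv_text <- "X,Expenditure,Income
Alabama,275,6247
Alaska,821,10851
Wisconsin,NA,7597"

# stringsAsFactors = FALSE keeps the state names as character strings.
# This is the default from R 4.0.0 onward; earlier versions defaulted to TRUE.
ps_inline <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the column types and the first values of each column.
str(ps_inline)
```

Replacing `text = csv_text` with `file = theURL` should read the full file in the same way.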
Tasks in the assignment:
Task 1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
summary(ps)
## X Expenditure Income
## Alabama : 1 Min. :259.0 Min. : 5736
## Alaska : 1 1st Qu.:315.2 1st Qu.: 6670
## Arizona : 1 Median :354.0 Median : 7597
## Arkansas : 1 Mean :373.3 Mean : 7608
## California: 1 3rd Qu.:426.2 3rd Qu.: 8286
## Colorado : 1 Max. :821.0 Max. :10851
## (Other) :45 NA's :1
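The task also asks for the mean and median of at least two attributes. One way this might be done is sketched below with `mean()` and `median()`. The data frame `ps_sample` is a hypothetical five-row stand-in for `ps`, built from rows shown in the output above, so its results differ from those of the full data set. Because Wisconsin's expenditure is missing, `na.rm = TRUE` is needed for that column.

```r
# Hypothetical stand-in for `ps`: five rows copied from the output above.
ps_sample <- data.frame(
  X = c("Alabama", "Alaska", "Arizona", "Wisconsin", "Wyoming"),
  Expenditure = c(275, 821, 339, NA, 500),
  Income = c(6247, 10851, 7374, 7597, 9096)
)

# Without na.rm = TRUE, a single NA makes the result NA.
mean_with_na <- mean(ps_sample$Expenditure)

# Drop the missing value before computing the statistics.
mean_exp   <- mean(ps_sample$Expenditure, na.rm = TRUE)
median_exp <- median(ps_sample$Expenditure, na.rm = TRUE)

# Income has no missing values, so na.rm is not required here.
mean_inc   <- mean(ps_sample$Income)
median_inc <- median(ps_sample$Income)

print(c(mean_exp = mean_exp, median_exp = median_exp,
        mean_inc = mean_inc, median_inc = median_inc))
```

Applied to the full data frame, the same calls on `ps$Expenditure` and `ps$Income` should agree with the Mean and Median rows reported by `summary(ps)`.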
Task 2. Create a new data frame with a subset of the columns and rows. Make sure to rename it.
psdf <- data.frame(ps)
psdf
## X Expenditure Income
## 1 Alabama 275 6247
## 2 Alaska 821 10851
## 3 Arizona 339 7374
## 4 Arkansas 275 6183
## 5 California 387 8850
## 6 Colorado 452 8001
## 7 Connecticut 531 8914
## 8 Delaware 424 8604
## 9 Florida 316 7505
## 10 Georgia 265 6700
## 11 Hawaii 403 8380
## 12 Idaho 304 6813
## 13 Illinois 437 8745
## 14 Indiana 345 7696
## 15 Iowa 431 7873
## 16 Kansas 355 8001
## 17 Kentucky 260 6615
## 18 Louisiana 316 6640
## 19 Maine 327 6333
## 20 Maryland 427 8306
## 21 Massachusetts 427 8063
## 22 Michigan 466 8442
## 23 Minnesota 477 7847
## 24 Mississippi 259 5736
## 25 Missouri 274 7342
## 26 Montana 433 7051
## 27 Nebraska 294 7391
## 28 Nevada 359 9032
## 29 New Hampshire 279 7277
## 30 New Jersey 423 8818
## 31 New Mexico 388 6505
## 32 New York 447 8267
## 33 North Carolina 335 6607
## 34 North Dakota 311 7478
## 35 Ohio 322 7812
## 36 Oklahoma 320 6951
## 37 Oregon 397 7839
## 38 Pennsylvania 412 7733
## 39 Rhode Island 342 7526
## 40 South Carolina 315 6242
## 41 South Dakota 321 6841
## 42 Tennessee 268 6489
## 43 Texas 315 7697
## 44 Utah 417 6622
## 45 Vermont 353 6541
## 46 Virginia 356 7624
## 47 Washington 415 8450
## 48 Washington DC 428 10022
## 49 West Virginia 320 6456
## 50 Wisconsin NA 7597
## 51 Wyoming 500 9096
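The call `data.frame(ps)` copies every row and column. A subset of rows and columns, as the task describes, could be taken with bracket indexing or with `subset()`. The sketch below uses a hypothetical five-row sample (`ps_sample`) and an arbitrary income cutoff of 7500, both chosen only for illustration.

```r
# Hypothetical stand-in for `ps`: five rows copied from the output above.
ps_sample <- data.frame(
  X = c("Alabama", "Alaska", "Arizona", "Wisconsin", "Wyoming"),
  Expenditure = c(275, 821, 339, NA, 500),
  Income = c(6247, 10851, 7374, 7597, 9096),
  stringsAsFactors = FALSE
)

# Bracket indexing: the row condition comes first, the column names second.
high_income <- ps_sample[ps_sample$Income > 7500, c("X", "Income")]

# The same selection written with subset().
high_income2 <- subset(ps_sample, Income > 7500, select = c(X, Income))

print(high_income)
```

Bracket indexing keeps rows where the condition evaluates to NA (as all-NA rows), while `subset()` drops them. The two forms give the same result here because `Income` has no missing values.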
Task 3. Create new column names for the new data frame.
names(psdf) <- c("State", "State_Expenditures", "State_Income")
psdf
## State State_Expenditures State_Income
## 1 Alabama 275 6247
## 2 Alaska 821 10851
## 3 Arizona 339 7374
## 4 Arkansas 275 6183
## 5 California 387 8850
## 6 Colorado 452 8001
## 7 Connecticut 531 8914
## 8 Delaware 424 8604
## 9 Florida 316 7505
## 10 Georgia 265 6700
## 11 Hawaii 403 8380
## 12 Idaho 304 6813
## 13 Illinois 437 8745
## 14 Indiana 345 7696
## 15 Iowa 431 7873
## 16 Kansas 355 8001
## 17 Kentucky 260 6615
## 18 Louisiana 316 6640
## 19 Maine 327 6333
## 20 Maryland 427 8306
## 21 Massachusetts 427 8063
## 22 Michigan 466 8442
## 23 Minnesota 477 7847
## 24 Mississippi 259 5736
## 25 Missouri 274 7342
## 26 Montana 433 7051
## 27 Nebraska 294 7391
## 28 Nevada 359 9032
## 29 New Hampshire 279 7277
## 30 New Jersey 423 8818
## 31 New Mexico 388 6505
## 32 New York 447 8267
## 33 North Carolina 335 6607
## 34 North Dakota 311 7478
## 35 Ohio 322 7812
## 36 Oklahoma 320 6951
## 37 Oregon 397 7839
## 38 Pennsylvania 412 7733
## 39 Rhode Island 342 7526
## 40 South Carolina 315 6242
## 41 South Dakota 321 6841
## 42 Tennessee 268 6489
## 43 Texas 315 7697
## 44 Utah 417 6622
## 45 Vermont 353 6541
## 46 Virginia 356 7624
## 47 Washington 415 8450
## 48 Washington DC 428 10022
## 49 West Virginia 320 6456
## 50 Wisconsin NA 7597
## 51 Wyoming 500 9096
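Assigning a full vector to `names()` relies on the columns being in a known order. As an alternative sketch, a single column can be renamed by matching its current name, which leaves the other names untouched. The data frame `df` below is a hypothetical two-row example.

```r
# Hypothetical two-row example using the original column names.
df <- data.frame(
  X = c("Alabama", "Alaska"),
  Expenditure = c(275, 821),
  Income = c(6247, 10851)
)

# Rename only the column currently called "X".
names(df)[names(df) == "X"] <- "State"

print(names(df))
```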
Task 4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
psdf$State <- as.character(psdf$State)
psdf$State_Expenditures <- as.numeric(psdf$State_Expenditures)
psdf$State_Income <- as.numeric(psdf$State_Income)
summary(psdf)
## State State_Expenditures State_Income
## Length:51 Min. :259.0 Min. : 5736
## Class :character 1st Qu.:315.2 1st Qu.: 6670
## Mode :character Median :354.0 Median : 7597
## Mean :373.3 Mean : 7608
## 3rd Qu.:426.2 3rd Qu.: 8286
## Max. :821.0 Max. :10851
## NA's :1
When I compare the summary of the new data frame with the summary of the original data set, the main difference is in the first column: the new summary treats State as character data and reports its length, class, and mode, whereas the original summary listed a count for each state name. The numeric columns are summarized identically, with the same mean and median for expenditure and income, and both summaries report one NA in the expenditure column.
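The comparison can also be checked directly rather than by reading the two printouts. The sketch below uses a hypothetical four-row sample and assumes the original name column was stored as a factor, as the per-state counts in the first summary suggest. The names `original` and `renamed` are illustrative.

```r
# Hypothetical stand-in for `ps`, with the name column stored as a factor.
original <- data.frame(
  X = c("Alabama", "Alaska", "Wisconsin", "Wyoming"),
  Expenditure = c(275, 821, NA, 500),
  Income = c(6247, 10851, 7597, 9096),
  stringsAsFactors = TRUE
)

# Mirror the renaming and type conversion applied to `psdf`.
renamed <- original
names(renamed) <- c("State", "State_Expenditures", "State_Income")
renamed$State <- as.character(renamed$State)

# The class of the first column changes...
class_before <- class(original$X)
class_after  <- class(renamed$State)

# ...but the missing-value count and the statistics do not.
na_before <- sum(is.na(original$Expenditure))
na_after  <- sum(is.na(renamed$State_Expenditures))
same_median <- identical(median(original$Income), median(renamed$State_Income))

print(c(class_before, class_after))
```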
Task 5. For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.
I renamed one character value in the “State” column of my data frame:
psdf$State[psdf$State == "Washington DC"] <- "Washington, District of Columbia"
psdf[48, ]
## State State_Expenditures State_Income
## 48 Washington, District of Columbia 428 10022
Then I replaced the NA listed for Wisconsin in the “State_Expenditures” column:
psdf[50,2]
## [1] NA
psdf[50,2] <- 425
psdf[50,2]
## [1] 425
In this data set, it was hard to find a case where I could rename three or more matching values in a column, because the “State” column, the only character column, contains no repeated values.
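When a column has no repeated values, several distinct values can still be renamed in one step with a named lookup vector. The sketch below replaces three state names with their postal abbreviations. The vector `states` and the lookup table `abbrev` are hypothetical examples, not part of the assignment data frame.

```r
# Hypothetical character column with four state names.
states <- c("Alabama", "Alaska", "Arizona", "Arkansas")

# Named lookup vector: names are the old values, elements are the new ones.
abbrev <- c(Alabama = "AL", Alaska = "AK", Arizona = "AZ")

# Replace only the entries that appear in the lookup; leave the rest as-is.
hit <- states %in% names(abbrev)
states[hit] <- unname(abbrev[states[hit]])

print(states)
```

The same pattern should work on a data frame column, for example by using `psdf$State` in place of `states`.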