In its original form, the data set is 3,142 rows by 35 columns. I can tell this from a standard call to the structure function (str).
I have to change the census ID to a string, because it is not an actual number but rather a unique code. I also change state from a character to a factor, because a state is a category and there are only about 50 of them. The mutate function lets me make these changes, creating a new column with the updated information. The glimpse function then gives a quick look at the data, and we can see that the values seem to be all set.
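A minimal sketch of this type conversion, assuming a toy table in place of the real county data. The name `acs` and the row values are invented for illustration, and here mutate overwrites the two columns in place rather than adding new ones.

```r
library(dplyr)

# Toy stand-in for the county data; names and values are illustrative only.
acs <- tibble(
  census_id = c(1001L, 1003L, 2013L),
  county    = c("Autauga", "Baldwin", "Aleutians East Borough"),
  state     = c("Alabama", "Alabama", "Alaska")
)

acs_typed <- acs %>%
  mutate(
    census_id = as.character(census_id),  # an identifier, not a quantity
    state     = factor(state)             # a category with a fixed set of levels
  )

# Print a compact, one-line-per-column overview of the result.
glimpse(acs_typed)
```

After the conversion, glimpse should report `census_id` as `<chr>` and `state` as `<fct>`.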
Two columns contain an NA value: one row is missing income and another is missing child poverty. This seems to have been just a mistake. To keep using the important data included elsewhere in these rows, I replace each missing value with the mean of its column.
I run the colSums function twice, once before and once after the replacement, to make sure that all the NA values are gone.
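The check-and-fill step can be sketched as follows. This assumes the two affected columns are named `income` and `child_poverty`, and the four rows are made up for illustration.

```r
library(dplyr)

# Toy data with one NA in each of two columns (values are invented).
acs <- tibble(
  county        = c("A", "B", "C", "D"),
  income        = c(40000, NA, 60000, 50000),
  child_poverty = c(20, 10, NA, 30)
)

# First pass: count the NA values in every column.
na_before <- colSums(is.na(acs))

# Replace each NA with the mean of the non-missing values in its column.
acs_filled <- acs %>%
  mutate(
    income        = ifelse(is.na(income), mean(income, na.rm = TRUE), income),
    child_poverty = ifelse(is.na(child_poverty),
                           mean(child_poverty, na.rm = TRUE), child_poverty)
  )

# Second pass: every count should now be zero.
na_after <- colSums(is.na(acs_filled))
```

Comparing `na_before` with `na_after` is what confirms that the replacement worked.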
While there are certainly some “outliers,” I don’t think that any values are particularly unusual. Looking at the output of the summary function, I see no negative values and no columns that are entirely zero for information like transportation. Because we are talking about counties, all these statistics are sure to vary, just as lifestyles vary across the country. For example, while LA County has roughly 10 million people, there are small rural counties in West Texas with just a handful of people. This doesn’t necessarily make any value unusual. Likewise, there are richer counties where the child poverty rate would be zero and some where it would be very high.
1985 counties have more women than men. The filter function is an easy way to find these counties, and the summarise function tells me exactly how many there are.
2420 counties have unemployment rates below 10%. As above, the filter function narrows the data down to the rows we are looking for, and the summarise function then counts how many rows are left.
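Both counts follow the same filter-then-summarise pattern, sketched here on invented data. The column names `men`, `women`, and `unemployment` are assumptions about the data set.

```r
library(dplyr)

# Toy counties (values are invented for illustration).
acs <- tibble(
  county       = c("A", "B", "C", "D"),
  men          = c(100, 250, 80, 500),
  women        = c(120, 240, 90, 500),
  unemployment = c(4.5, 12.0, 9.9, 10.0)
)

# Keep the counties with more women than men, then count them.
more_women <- acs %>%
  filter(women > men) %>%
  summarise(n = n())

# Keep the counties with unemployment strictly below 10%, then count them.
low_unemployment <- acs %>%
  filter(unemployment < 10) %>%
  summarise(n = n())
```

The strict `<` comparison means a county at exactly 10% is not counted.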
The 10 counties with the longest mean commutes are
The select function keeps only the variables I name. From there, the top_n function lets me easily grab the 10 longest mean commutes. Finally, the arrange function sorts the counties in descending order, which I have to specify because the default is ascending.
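A small sketch of this select, top_n, arrange chain. It assumes the commute column is named `mean_commute`, uses invented values, and takes the top 3 rather than the top 10 so that the toy data stays short.

```r
library(dplyr)

# Toy commute data (values are invented for illustration).
acs <- tibble(
  county       = c("A", "B", "C", "D", "E"),
  state        = c("X", "X", "Y", "Y", "Z"),
  mean_commute = c(20.1, 35.4, 41.0, 18.7, 29.3),
  income       = c(50, 60, 70, 80, 90)  # extra column that select drops
)

longest_commutes <- acs %>%
  select(county, state, mean_commute) %>%  # keep only the named variables
  top_n(3, mean_commute) %>%               # rows with the 3 largest values
  arrange(desc(mean_commute))              # largest first; default is ascending
```

Note that top_n keeps every tied row, so on real data it can return more rows than requested.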
The counties with the lowest percentage of women are
1. Forest, Pennsylvania
2. Bent, Colorado
3. Sussex, Virginia
4. Wheeler, Georgia
5. Lassen, California
6. Concho, Texas
7. Chattahoochee, Georgia
8. Aleutians East Borough, Alaska
9. West Feliciana, Louisiana
10. Pershing, Nevada
The mutate function lets me create a new column for the share of women, which I multiply by 100 to match the form of the other data. The select function then keeps only the variables I name. From there, top_n with -10 grabs the 10 counties with the lowest percentage of women. Finally, the arrange function sorts the counties in ascending order.
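This computation can be sketched as below. It assumes columns named `women` and `total_pop`, uses the hypothetical new column name `pct_women`, and takes the bottom 2 of an invented four-row table instead of the bottom 10.

```r
library(dplyr)

# Toy population data (values are invented for illustration).
acs <- tibble(
  county    = c("A", "B", "C", "D"),
  state     = c("X", "X", "Y", "Y"),
  total_pop = c(1000, 2000, 500, 800),
  women     = c(510, 900, 200, 420)
)

lowest_women <- acs %>%
  mutate(pct_women = women / total_pop * 100) %>%  # share of women, in percent
  select(county, state, pct_women) %>%
  top_n(-2, pct_women) %>%   # a negative n selects the smallest values
  arrange(pct_women)         # ascending: lowest percentage first
```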
Step One: I use the mutate function to create a variable that adds all the percentages of each race together.
Step Two: Using the select and top_n functions, I chain together calls that pull the bottom 10 counties by race sum. The arrange function then places them in ascending order.
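The two steps above might look like the following sketch. The six race column names and the new column name `race_sum` are assumptions, the values are invented, and the bottom 2 stand in for the bottom 10.

```r
library(dplyr)

# Toy race percentages (column names and values are illustrative only).
acs <- tibble(
  county   = c("A", "B", "C", "D"),
  state    = c("X", "X", "Y", "Y"),
  hispanic = c(10, 5, 20, 8),
  white    = c(60, 30, 70, 85),
  black    = c(20, 10, 5, 5),
  native   = c(1, 2, 1, 1),
  asian    = c(5, 20, 3, 1),
  pacific  = c(0, 8, 1, 0.5)
)

# Step One: add the race percentages together for each county.
with_sum <- acs %>%
  mutate(race_sum = hispanic + white + black + native + asian + pacific)

# Step Two: keep the counties with the smallest sums, lowest first.
lowest_sums <- with_sum %>%
  select(county, state, race_sum) %>%
  top_n(-2, race_sum) %>%
  arrange(race_sum)
```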
A. The counties with the lowest sum of race percentages are
1. Hawaii, Hawaii
2. Maui, Hawaii
3. Mayes, Oklahoma
4. Honolulu, Hawaii
5. Pontotoc, Oklahoma
6. Grundy, Tennessee
7. Yakutat City and Borough, Alaska
8. Johnston, Oklahoma
9. Kauai, Hawaii
10. Alfalfa, Oklahoma
B. Hawaii has the lowest race sum. I find this by taking the average across all the counties in each state with the summarise function. Because the states are arranged in ascending order, the slice function then pulls out the lowest state average.
C. 11 counties have a race sum greater than 100. We can see this with a simple filter call that keeps the values above 100 and pairs them with the other information we were asked to include.
D. By grouping all counties by state and then filtering on the state average, we can see that no state has a racial makeup of exactly 100 percent from the groups presented.
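Parts B through D can be sketched on a toy table that already carries a per-county sum. The names `county_sums`, `race_sum`, and `avg_sum` are hypothetical and the values are invented. The sketch uses dplyr's near for the exact-100 check, since testing averaged decimals with `==` can fail on rounding error.

```r
library(dplyr)

# Toy per-county sums (values are invented for illustration).
county_sums <- tibble(
  county   = c("A", "B", "C", "D", "E"),
  state    = c("X", "X", "Y", "Y", "Z"),
  race_sum = c(96, 75, 100, 100.5, 100)
)

# B: average the sums within each state, lowest average first.
state_avg <- county_sums %>%
  group_by(state) %>%
  summarise(avg_sum = mean(race_sum)) %>%
  arrange(avg_sum)

lowest_state <- state_avg %>% slice(1)  # first row is the lowest average

# C: counties whose sum is greater than 100.
over_100 <- county_sums %>%
  filter(race_sum > 100) %>%
  select(county, state, race_sum)

# D: states whose average is 100, up to floating-point tolerance.
exactly_100 <- state_avg %>%
  filter(near(avg_sum, 100))
```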
A. I create a new variable called carpool_rank with the mutate function.
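One way to build such a ranking is sketched below, assuming min_rank on the descending carpool values. The report does not say which ranking function was used, but min_rank gives tied counties the same rank, which matches the shared ranks in the output. The toy values are invented.

```r
library(dplyr)

# Toy carpool percentages (values are invented for illustration).
acs <- tibble(
  county  = c("A", "B", "C", "D", "E"),
  carpool = c(12.5, 0, 29.9, 0, 8.1)
)

ranked <- acs %>%
  # Rank 1 goes to the highest carpool share; ties share the lowest rank.
  mutate(carpool_rank = min_rank(desc(carpool)))

# Ten best- and worst-ranked counties would then be, for example:
best  <- ranked %>% arrange(carpool_rank) %>% slice(1:2)
worst <- ranked %>% arrange(desc(carpool_rank)) %>% slice(1:2)
```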
B.
## # A tibble: 10 × 5
##    census_id county   state    carpool carpool_rank
##    <chr>     <chr>    <fct>      <dbl>        <int>
##  1 13061     Clay     Georgia     29.9            1
##  2 18087     LaGrange Indiana     27              2
##  3 13165     Jenkins  Georgia     25.3            3
##  4 5133      Sevier   Arkansas    24.4            4
##  5 20175     Seward   Kansas      23.4            5
##  6 48079     Cochran  Texas       22.8            6
##  7 48247     Jim Hogg Texas       22.6            7
##  8 48393     Roberts  Texas       22.4            8
##  9 39075     Holmes   Ohio        21.8            9
## 10 21197     Powell   Kentucky    21.6           10
C.
## # A tibble: 10 × 5
##    census_id county   state        carpool carpool_rank
##    <chr>     <chr>    <fct>          <dbl>        <int>
##  1 48261     Kenedy   Texas            0           3141
##  2 48269     King     Texas            0           3141
##  3 48235     Irion    Texas            0.9         3140
##  4 31183     Wheeler  Nebraska         1.3         3139
##  5 36061     New York New York         1.9         3138
##  6 13309     Wheeler  Georgia          2.3         3136
##  7 38029     Emmons   North Dakota     2.3         3136
##  8 30019     Daniels  Montana          2.6         3134
##  9 31057     Dundy    Nebraska         2.6         3134
## 10 46069     Hyde     South Dakota     2.8         3132
D. Arizona is best for carpooling.
E.
1. Arizona
2. Utah
3. Arkansas
4. Hawaii
5. Alaska
These four code chunks all work in essentially the same way. Using the group_by function, I make sure that all the counties from one state are grouped together, then use functions like summarise and arrange to aggregate the data the way I want and rank it how I want. Finally, the slice function gives us the top rows for each variable, helping us answer the questions.
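The shared pattern can be sketched as follows. This assumes the state-level figure is the mean of the county carpool percentages, which the report does not state. The column name `avg_carpool` is hypothetical and the values are invented.

```r
library(dplyr)

# Toy county-level carpool data (values are invented for illustration).
acs <- tibble(
  state   = c("X", "X", "Y", "Y", "Z"),
  carpool = c(10, 14, 8, 9, 15)
)

top_states <- acs %>%
  group_by(state) %>%                        # gather each state's counties
  summarise(avg_carpool = mean(carpool)) %>% # one aggregated row per state
  arrange(desc(avg_carpool)) %>%             # best carpooling state first
  slice(1:2)                                 # keep only the top rows
```

Changing the range passed to slice is all it takes to move from the single best state to a top five.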