Project 2 - Data Set 2

Data set provided by Abdellah Ait Elmouden.

Yearly full-text report provides data on the average agricultural land values. Includes, by state and region, average value per acre for farm real estate (farmland and buildings), irrigated and non-irrigated cropland, and pastureland. yearly full-text report provides data on the average agricultural land values. Includes, by state and region, average value per acre for farm real estate (farmland and buildings), irrigated and non-irrigated cropland, and pastureland.

There are numerous tables within the report. I will extract, tidy and explore table Farm Real Estate Average Value per Acre, showed below.

Table

Table

Step 1: Reading file and extracting part that I will work on.

Table is located between rows 64 and 100.

if (!require('tidyverse')) install.packages('tidyverse')
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data_raw<-read.csv('https://downloads.usda.library.cornell.edu/usda-esmis/files/pn89d6567/g732dn07g/th83m920j/land0819.txt')
head(data_raw)
##    United.States.Department.of.Agriculture
## 1 National Agricultural Statistics Service
## 2                          ISSN: 1949-1867
## 3                              Land Values
## 4                             2019 Summary
## 5                              August 2019
## 6                                 Contents
con<-file('https://downloads.usda.library.cornell.edu/usda-esmis/files/pn89d6567/g732dn07g/th83m920j/land0819.txt', open='r')

land_raw<-readLines(con)
leng<-length(land_raw)

rawDF1<-NULL

j<-0
for (i in 1:leng)
{
  if(between(i,64,100))
  {
  j<-j+1
  rawDF1<-rbind(rawDF1,land_raw[i])  
  }
}
close(con)

head(rawDF1)
##      [,1]                                                                              
## [1,] "Northeast ..........:  5,190     5,270     5,380     5,550     5,690      2.5   "
## [2,] "  Connecticut ......: 11,700    11,900    12,100    12,300    12,200     -0.8   "
## [3,] "  Delaware .........:  8,110     8,290     8,250     8,410     8,950      6.4   "
## [4,] "  Maine ............:  2,170     2,210     2,370     2,370     2,410      1.7   "
## [5,] "  Maryland .........:  7,270     7,470     7,620     7,860     8,060      2.5   "
## [6,] "  Massachusetts ....: 10,600    10,700    10,800    10,900    11,100      1.8   "

Next step is to eliminate spaces, dots, semicolon blank lines etc. Data frame will be generated.

rawDF1<-str_trim(rawDF1, side="both")

states <- unlist(str_replace_all(rawDF1, "[[:space:]]", ""))
states<- unlist(str_extract_all(states,"[[:alpha:]]{2,}"))
values <- unlist((str_extract_all(rawDF1, "[[:digit:]]{1,},[[:digit:]]{2,}")))
values<-str_trim(values,side="both")

values <- matrix((unlist(values)), nrow = 33, ncol = 5)
values <- data.matrix(values)
values <- matrix(unlist(values), nrow = 33, ncol = 5)

#y<-transform(values,as.numeric(values))
#x<-as.numeric((unlist(values[[1]])))

rawDF2<-data.frame(states, values, stringsAsFactors = FALSE)

names(rawDF2)<-c("Region/State","2015", "2016", "2017", "2018", "2019")

head(rawDF2)
##    Region/State   2015   2016  2017  2018  2019
## 1     Northeast  5,190  4,900 4,900 7,190 2,100
## 2   Connecticut  5,270  4,980 4,940 3,230 2,070
## 3      Delaware  5,380 13,300 4,960 3,220 2,070
## 4         Maine  5,550 13,300 4,960 3,120 3,810
## 5      Maryland  5,690 13,400 4,760 3,380 3,880
## 6 Massachusetts 11,700 13,500 4,790 3,400 3,970

Now to transform from long to wide and do some EDA. First, I will split first column to display only States

rawDF3<-rawDF2[-c(1,13,17,23,28),] 
names(rawDF3)[1]<-"State"
#$=c("Northeast","LakeStates", "","CornBelts","NorthernPlains","Appalachian"),)
head(rawDF3)
##           State   2015   2016  2017  2018  2019
## 2   Connecticut  5,270  4,980 4,940 3,230 2,070
## 3      Delaware  5,380 13,300 4,960 3,220 2,070
## 4         Maine  5,550 13,300 4,960 3,120 3,810
## 5      Maryland  5,690 13,400 4,760 3,380 3,880
## 6 Massachusetts 11,700 13,500 4,790 3,400 3,970
## 7  NewHampshire 11,900 13,500 4,870 5,930 4,030

Tidying I will create a variable year and transpose the format to wide

rawDF4<- rawDF3 %>%
  gather(2:6,key="year", value="Land_Value")

head(rawDF4)
##           State year Land_Value
## 1   Connecticut 2015      5,270
## 2      Delaware 2015      5,380
## 3         Maine 2015      5,550
## 4      Maryland 2015      5,690
## 5 Massachusetts 2015     11,700
## 6  NewHampshire 2015     11,900
(year_2015 <- rawDF4 %>%
  filter(year==2015) %>%
 ggplot(mapping=aes(x=State,y=Land_Value)) + geom_col()
 
 )

Note

I struggled for hours to convert characters into numeric (column Land_Value) and because of that I was not able to do any meaningful analysis, like aggregation by mean, or a trend chart.