Activity 5 - SQL & R

Installing SQL Library

#install.packages("sqldf")

importing SQL Library

library(sqldf)

## Warning: package 'sqldf' was built under R version 4.1.3

## Loading required package: gsubfn

## Warning: package 'gsubfn' was built under R version 4.1.3

## Loading required package: proto

## Warning: package 'proto' was built under R version 4.1.3

## Loading required package: RSQLite

## Warning: package 'RSQLite' was built under R version 4.1.3

importing dataset

data1<-read.csv("C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti5/data/quiz2.csv",header = TRUE,sep = ",")

data summary:

summary(data1)

##   subregion            region               pop              income     
##  Length:29          Length:29          Min.   :   2311   Min.   :26784  
##  Class :character   Class :character   1st Qu.:  19005   1st Qu.:37970  
##  Mode  :character   Mode  :character   Median :  32122   Median :41595  
##                                        Mean   : 135940   Mean   :43858  
##                                        3rd Qu.: 101482   3rd Qu.:47469  
##                                        Max.   :1554720   Max.   :70821  
##      ipaddr           ufo2010         infections  
##  Min.   :    637   Min.   :  0.00   Min.   :  39  
##  1st Qu.:  12294   1st Qu.:  0.00   1st Qu.: 123  
##  Median :  30418   Median :  2.00   Median : 245  
##  Mean   : 440130   Mean   : 16.66   Mean   :1117  
##  3rd Qu.: 102104   3rd Qu.:  9.00   3rd Qu.: 672  
##  Max.   :5394949   Max.   :169.00   Max.   :6781

SQL Syntax: query that retries everything from the datatable

sqldf("SELECT * FROM data1")

##    subregion         region     pop income  ipaddr ufo2010 infections
## 1  abbeville south carolina   25101  34670   30330       2        245
## 2     acadia      louisiana   61912  37970   38203       6        215
## 3   accomack       virginia   33341  41595   41338       2       2076
## 4        ada          idaho  409061  55304 1035427      59       5023
## 5      adair           iowa    7481  47623    3762       0        189
## 6      adair       kentucky   18675  31775   14303       1        195
## 7      adair       missouri   25581  33157   75347       1        123
## 8      adair       oklahoma   22286  31038    4543       0        116
## 9      adams       colorado  459598  56089 3206226     115       3298
## 10     adams          idaho    3915  36845    1916       0        430
## 11     adams       illinois   67197  45792   50969       0        502
## 12     adams        indiana   34365  46549   30418       0        126
## 13     adams           iowa    3911  44389    2678       0        112
## 14     adams    mississippi   32122  26784   12892       0         67
## 15     adams       nebraska   31459  47469   27319       0         52
## 16     adams   north dakota    2311  40236     637       0         39
## 17     adams           ohio   28350  34232    7657       6         54
## 18     adams   pennsylvania  101482  57097  100026       4       2356
## 19     adams     washington   19005  41219   12294       2       6781
## 20     adams      wisconsin   20679  41152   19004       7        120
## 21   addison        vermont   36745  57203   86406       2       2389
## 22     aiken south carolina  162812  43999  131261       9        279
## 23    aitkin      minnesota   15927  41301    8493       2        257
## 24   alachua        florida  251417  41373  413659      29        290
## 25  alamance north carolina  153920  44430  240090      10        234
## 26   alameda     california 1554720  70821 5394949     169       5689
## 27   alamosa       colorado   16148  38299   15963       1        261
## 28    albany       new york  305455  57715 1655552      40        672
## 29    albany        wyoming   37276  45760  102104      16        205

Now I would like you to create a query that returns every single record (row) for the fields region and population.

sqldf("select region, pop from data1")

##            region     pop
## 1  south carolina   25101
## 2       louisiana   61912
## 3        virginia   33341
## 4           idaho  409061
## 5            iowa    7481
## 6        kentucky   18675
## 7        missouri   25581
## 8        oklahoma   22286
## 9        colorado  459598
## 10          idaho    3915
## 11       illinois   67197
## 12        indiana   34365
## 13           iowa    3911
## 14    mississippi   32122
## 15       nebraska   31459
## 16   north dakota    2311
## 17           ohio   28350
## 18   pennsylvania  101482
## 19     washington   19005
## 20      wisconsin   20679
## 21        vermont   36745
## 22 south carolina  162812
## 23      minnesota   15927
## 24        florida  251417
## 25 north carolina  153920
## 26     california 1554720
## 27       colorado   16148
## 28       new york  305455
## 29        wyoming   37276

Let us now create a query that returns the first 10 records of the data1 table.

sqldf('SELECT * FROM data1 LIMIT 10')

##    subregion         region    pop income  ipaddr ufo2010 infections
## 1  abbeville south carolina  25101  34670   30330       2        245
## 2     acadia      louisiana  61912  37970   38203       6        215
## 3   accomack       virginia  33341  41595   41338       2       2076
## 4        ada          idaho 409061  55304 1035427      59       5023
## 5      adair           iowa   7481  47623    3762       0        189
## 6      adair       kentucky  18675  31775   14303       1        195
## 7      adair       missouri  25581  33157   75347       1        123
## 8      adair       oklahoma  22286  31038    4543       0        116
## 9      adams       colorado 459598  56089 3206226     115       3298
## 10     adams          idaho   3915  36845    1916       0        430

sqldf("SELECT * FROM data1 ORDER BY infections DESC  LIMIT 5")

##   subregion     region     pop income  ipaddr ufo2010 infections
## 1     adams washington   19005  41219   12294       2       6781
## 2   alameda california 1554720  70821 5394949     169       5689
## 3       ada      idaho  409061  55304 1035427      59       5023
## 4     adams   colorado  459598  56089 3206226     115       3298
## 5   addison    vermont   36745  57203   86406       2       2389

Interpret the output of the query ran above and run a similar query (limit 7) including the attributes ipadddr , ufo2010 ,and infections.: The output above is limited to 5 rows of data is is ordered by the infections column: highest to lowest. therefore we can see that washington has the highest infections while vermont has the lowest.

similar query (limit 7) including the attributes ipadddr , ufo2010 ,and infections.

sqldf("select ipaddr, ufo2010, infections from data1 order by infections limit 7")

##   ipaddr ufo2010 infections
## 1    637       0         39
## 2  27319       0         52
## 3   7657       6         54
## 4  12892       0         67
## 5   2678       0        112
## 6   4543       0        116
## 7  19004       7        120

only regions and infections where the number of infections is greater than 1,000:

sqldf('SELECT region,infections FROM data1 WHERE infections > 1000 order by 2 desc')

##         region infections
## 1   washington       6781
## 2   california       5689
## 3        idaho       5023
## 4     colorado       3298
## 5      vermont       2389
## 6 pennsylvania       2356
## 7     virginia       2076

Now I would like you to inspect (create a query) income,region and population when the condition is income>=50,000. Make comments.

sqldf("select income, region, pop from data1 where income >= 50000")

##   income       region     pop
## 1  55304        idaho  409061
## 2  56089     colorado  459598
## 3  57097 pennsylvania  101482
## 4  57203      vermont   36745
## 5  70821   california 1554720
## 6  57715     new york  305455

-> there are only 6 regions with an income greater or equal to $50,000. The population of these regions ranges from 37 thousand to 155 thousand.

sqldf('SELECT * FROM data1 WHERE (income > 50000 AND infections > 2000) OR pop > 100000')

##   subregion         region     pop income  ipaddr ufo2010 infections
## 1       ada          idaho  409061  55304 1035427      59       5023
## 2     adams       colorado  459598  56089 3206226     115       3298
## 3     adams   pennsylvania  101482  57097  100026       4       2356
## 4   addison        vermont   36745  57203   86406       2       2389
## 5     aiken south carolina  162812  43999  131261       9        279
## 6   alachua        florida  251417  41373  413659      29        290
## 7  alamance north carolina  153920  44430  240090      10        234
## 8   alameda     california 1554720  70821 5394949     169       5689
## 9    albany       new york  305455  57715 1655552      40        672

Interpret the query ran above and create a query that recreates a similar case scenario (one that is critical for the success of this mini-project): the query above shows all details of records that have an income above $50,000 and infections greater than 2,000 or a population greater than 100,000.

Let us run a query that returns every region and population with a number of infections between 1000 and 7000.

sqldf('SELECT region,pop FROM data1 WHERE infections between  1000 and 7000')

##         region     pop
## 1     virginia   33341
## 2        idaho  409061
## 3     colorado  459598
## 4 pennsylvania  101482
## 5   washington   19005
## 6      vermont   36745
## 7   california 1554720

the query above shows the region and population of records with infections between 1,000 and 7,000. The between filter could be usefull when analyzing tax data as the US has a progressive tax system and procentual increase occur for incomes between x and y.

Interpret the query below and list a hypothetical case scenario in which this query would be useful.

sqldf('SELECT * FROM data1 WHERE subregion LIKE "A%"')

##    subregion         region     pop income  ipaddr ufo2010 infections
## 1  abbeville south carolina   25101  34670   30330       2        245
## 2     acadia      louisiana   61912  37970   38203       6        215
## 3   accomack       virginia   33341  41595   41338       2       2076
## 4        ada          idaho  409061  55304 1035427      59       5023
## 5      adair           iowa    7481  47623    3762       0        189
## 6      adair       kentucky   18675  31775   14303       1        195
## 7      adair       missouri   25581  33157   75347       1        123
## 8      adair       oklahoma   22286  31038    4543       0        116
## 9      adams       colorado  459598  56089 3206226     115       3298
## 10     adams          idaho    3915  36845    1916       0        430
## 11     adams       illinois   67197  45792   50969       0        502
## 12     adams        indiana   34365  46549   30418       0        126
## 13     adams           iowa    3911  44389    2678       0        112
## 14     adams    mississippi   32122  26784   12892       0         67
## 15     adams       nebraska   31459  47469   27319       0         52
## 16     adams   north dakota    2311  40236     637       0         39
## 17     adams           ohio   28350  34232    7657       6         54
## 18     adams   pennsylvania  101482  57097  100026       4       2356
## 19     adams     washington   19005  41219   12294       2       6781
## 20     adams      wisconsin   20679  41152   19004       7        120
## 21   addison        vermont   36745  57203   86406       2       2389
## 22     aiken south carolina  162812  43999  131261       9        279
## 23    aitkin      minnesota   15927  41301    8493       2        257
## 24   alachua        florida  251417  41373  413659      29        290
## 25  alamance north carolina  153920  44430  240090      10        234
## 26   alameda     california 1554720  70821 5394949     169       5689
## 27   alamosa       colorado   16148  38299   15963       1        261
## 28    albany       new york  305455  57715 1655552      40        672
## 29    albany        wyoming   37276  45760  102104      16        205

the query above shows records in which the subregion starts with the letter A.

Last,let us find the average number of infections in our table.

sqldf("SELECT AVG(infections) FROM data1")

##   AVG(infections)
## 1        1117.069

The query above shows the average of the infection column

Let us inspect the new table defined below data2.

data2<-read.csv("C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti5/data/finalexam.csv",header = TRUE,sep = ",")

Let us apply summary to data2 and examine the output.

summary(data2)

##     region               pop              income          ipaddr       
##  Length:29          Min.   :   2311   Min.   :26784   Min.   :    637  
##  Class :character   1st Qu.:  19005   1st Qu.:37970   1st Qu.:  12294  
##  Mode  :character   Median :  32122   Median :41595   Median :  30418  
##                     Mean   : 135940   Mean   :43858   Mean   : 440130  
##                     3rd Qu.: 101482   3rd Qu.:47469   3rd Qu.: 102104  
##                     Max.   :1554720   Max.   :70821   Max.   :5394949  
##     ufo2010         infections  
##  Min.   :  0.00   Min.   :  39  
##  1st Qu.:  0.00   1st Qu.: 123  
##  Median :  2.00   Median : 245  
##  Mean   : 16.66   Mean   :1117  
##  3rd Qu.:  9.00   3rd Qu.: 672  
##  Max.   :169.00   Max.   :6781

the data2 dataframe has 6 columns of which one (region) is a character column and the remaining 5 (pop, income, ipaddr, ufo2010, infections) columns are numeric. the min, max, mean, quartiles are shown above

Next, I would like you to complete the following steps:

1-Run a query that returns full details for every record from the west region.

sqldf("select * from data2 where region = 'West'")

##   region     pop income  ipaddr ufo2010 infections
## 1   West  409061  55304 1035427      59       5023
## 2   West  459598  56089 3206226     115       3298
## 3   West    3915  36845    1916       0        430
## 4   West   19005  41219   12294       2       6781
## 5   West 1554720  70821 5394949     169       5689
## 6   West   16148  38299   15963       1        261
## 7   West   37276  45760  102104      16        205

-> region = ‘West’ filter applied to show only records in the western region

2-Run a query that returns full details for every record from the east region where the number of infections is greater than 1500.

sqldf("select * from data2 where (region = 'East' and infections > 1500)")

##   region    pop income ipaddr ufo2010 infections
## 1   East  33341  41595  41338       2       2076
## 2   East 101482  57097 100026       4       2356
## 3   East  36745  57203  86406       2       2389

The output above reflects records that are in the East region AND have more than 1500 infections -> done by region and infections filter

3-Run a few queries that can potentially help us to make the point that: a.Region affects the number of infections b.Population affects the number of infections c.Income affects the number of infections d.ufo2010 affects the number of infections e. The different interactions of region and income, and income and population affects the number of infections.

sqldf("select region, sum(infections) from data2 group by region")

##    region sum(infections)
## 1 Central            1637
## 2    East            9071
## 3    West           21687

-> the output shows that the central regions have less infections than the east and west regions. The west region has significantly more infections than the east.

sqldf("select pop, infections from data2 order by infections desc")

##        pop infections
## 1    19005       6781
## 2  1554720       5689
## 3   409061       5023
## 4   459598       3298
## 5    36745       2389
## 6   101482       2356
## 7    33341       2076
## 8   305455        672
## 9    67197        502
## 10    3915        430
## 11  251417        290
## 12  162812        279
## 13   16148        261
## 14   15927        257
## 15   25101        245
## 16  153920        234
## 17   61912        215
## 18   37276        205
## 19   18675        195
## 20    7481        189
## 21   34365        126
## 22   25581        123
## 23   20679        120
## 24   22286        116
## 25    3911        112
## 26   32122         67
## 27   28350         54
## 28   31459         52
## 29    2311         39

Output shows population and infections order by infections highest to lowest. We can see that the region with the highest population does not neccessarely have the highest infections number

sqldf("select income, infections from data2 order by infections desc")

##    income infections
## 1   41219       6781
## 2   70821       5689
## 3   55304       5023
## 4   56089       3298
## 5   57203       2389
## 6   57097       2356
## 7   41595       2076
## 8   57715        672
## 9   45792        502
## 10  36845        430
## 11  41373        290
## 12  43999        279
## 13  38299        261
## 14  41301        257
## 15  34670        245
## 16  44430        234
## 17  37970        215
## 18  45760        205
## 19  31775        195
## 20  47623        189
## 21  46549        126
## 22  33157        123
## 23  41152        120
## 24  31038        116
## 25  44389        112
## 26  26784         67
## 27  34232         54
## 28  47469         52
## 29  40236         39

Unlike B, we could say that there is a postive correlation between income and infections. However, this cannot be clearly seen with the raw data.

sqldf("select ufo2010, infections from data2 order by infections desc")

##    ufo2010 infections
## 1        2       6781
## 2      169       5689
## 3       59       5023
## 4      115       3298
## 5        2       2389
## 6        4       2356
## 7        2       2076
## 8       40        672
## 9        0        502
## 10       0        430
## 11      29        290
## 12       9        279
## 13       1        261
## 14       2        257
## 15       2        245
## 16      10        234
## 17       6        215
## 18      16        205
## 19       1        195
## 20       0        189
## 21       0        126
## 22       1        123
## 23       7        120
## 24       0        116
## 25       0        112
## 26       0         67
## 27       6         54
## 28       0         52
## 29       0         39

Similar to B, we cannot see a trend of correlation between higher ufo2010 numbers and higher infections

region and income vs infections

sqldf("select region, income, infections from data2 order by infections desc")

##     region income infections
## 1     West  41219       6781
## 2     West  70821       5689
## 3     West  55304       5023
## 4     West  56089       3298
## 5     East  57203       2389
## 6     East  57097       2356
## 7     East  41595       2076
## 8     East  57715        672
## 9  Central  45792        502
## 10    West  36845        430
## 11    East  41373        290
## 12    East  43999        279
## 13    West  38299        261
## 14 Central  41301        257
## 15    East  34670        245
## 16    East  44430        234
## 17    East  37970        215
## 18    West  45760        205
## 19    East  31775        195
## 20 Central  47623        189
## 21 Central  46549        126
## 22 Central  33157        123
## 23    East  41152        120
## 24 Central  31038        116
## 25 Central  44389        112
## 26 Central  26784         67
## 27 Central  34232         54
## 28 Central  47469         52
## 29 Central  40236         39

-> order by infections (highest to lowest), we can see that in general, the western region has the highest infections followed by the eastern and central region with one outlier being that the least infections are also found in the west. The income does not seem to have an apparent effect on the infection numbers.

Activity 5 - SQL & R

MM

2022-11-01