Installing SQL Library
#install.packages("sqldf")
importing SQL Library
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.1.3
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.1.3
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.1.3
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.1.3
importing dataset
data1<-read.csv("C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti5/data/quiz2.csv",header = TRUE,sep = ",")
data summary:
summary(data1)
## subregion region pop income
## Length:29 Length:29 Min. : 2311 Min. :26784
## Class :character Class :character 1st Qu.: 19005 1st Qu.:37970
## Mode :character Mode :character Median : 32122 Median :41595
## Mean : 135940 Mean :43858
## 3rd Qu.: 101482 3rd Qu.:47469
## Max. :1554720 Max. :70821
## ipaddr ufo2010 infections
## Min. : 637 Min. : 0.00 Min. : 39
## 1st Qu.: 12294 1st Qu.: 0.00 1st Qu.: 123
## Median : 30418 Median : 2.00 Median : 245
## Mean : 440130 Mean : 16.66 Mean :1117
## 3rd Qu.: 102104 3rd Qu.: 9.00 3rd Qu.: 672
## Max. :5394949 Max. :169.00 Max. :6781
SQL Syntax: query that retries everything from the datatable
sqldf("SELECT * FROM data1")
## subregion region pop income ipaddr ufo2010 infections
## 1 abbeville south carolina 25101 34670 30330 2 245
## 2 acadia louisiana 61912 37970 38203 6 215
## 3 accomack virginia 33341 41595 41338 2 2076
## 4 ada idaho 409061 55304 1035427 59 5023
## 5 adair iowa 7481 47623 3762 0 189
## 6 adair kentucky 18675 31775 14303 1 195
## 7 adair missouri 25581 33157 75347 1 123
## 8 adair oklahoma 22286 31038 4543 0 116
## 9 adams colorado 459598 56089 3206226 115 3298
## 10 adams idaho 3915 36845 1916 0 430
## 11 adams illinois 67197 45792 50969 0 502
## 12 adams indiana 34365 46549 30418 0 126
## 13 adams iowa 3911 44389 2678 0 112
## 14 adams mississippi 32122 26784 12892 0 67
## 15 adams nebraska 31459 47469 27319 0 52
## 16 adams north dakota 2311 40236 637 0 39
## 17 adams ohio 28350 34232 7657 6 54
## 18 adams pennsylvania 101482 57097 100026 4 2356
## 19 adams washington 19005 41219 12294 2 6781
## 20 adams wisconsin 20679 41152 19004 7 120
## 21 addison vermont 36745 57203 86406 2 2389
## 22 aiken south carolina 162812 43999 131261 9 279
## 23 aitkin minnesota 15927 41301 8493 2 257
## 24 alachua florida 251417 41373 413659 29 290
## 25 alamance north carolina 153920 44430 240090 10 234
## 26 alameda california 1554720 70821 5394949 169 5689
## 27 alamosa colorado 16148 38299 15963 1 261
## 28 albany new york 305455 57715 1655552 40 672
## 29 albany wyoming 37276 45760 102104 16 205
Now I would like you to create a query that returns every single record (row) for the fields region and population.
sqldf("select region, pop from data1")
## region pop
## 1 south carolina 25101
## 2 louisiana 61912
## 3 virginia 33341
## 4 idaho 409061
## 5 iowa 7481
## 6 kentucky 18675
## 7 missouri 25581
## 8 oklahoma 22286
## 9 colorado 459598
## 10 idaho 3915
## 11 illinois 67197
## 12 indiana 34365
## 13 iowa 3911
## 14 mississippi 32122
## 15 nebraska 31459
## 16 north dakota 2311
## 17 ohio 28350
## 18 pennsylvania 101482
## 19 washington 19005
## 20 wisconsin 20679
## 21 vermont 36745
## 22 south carolina 162812
## 23 minnesota 15927
## 24 florida 251417
## 25 north carolina 153920
## 26 california 1554720
## 27 colorado 16148
## 28 new york 305455
## 29 wyoming 37276
Let us now create a query that returns the first 10 records of the data1 table.
sqldf('SELECT * FROM data1 LIMIT 10')
## subregion region pop income ipaddr ufo2010 infections
## 1 abbeville south carolina 25101 34670 30330 2 245
## 2 acadia louisiana 61912 37970 38203 6 215
## 3 accomack virginia 33341 41595 41338 2 2076
## 4 ada idaho 409061 55304 1035427 59 5023
## 5 adair iowa 7481 47623 3762 0 189
## 6 adair kentucky 18675 31775 14303 1 195
## 7 adair missouri 25581 33157 75347 1 123
## 8 adair oklahoma 22286 31038 4543 0 116
## 9 adams colorado 459598 56089 3206226 115 3298
## 10 adams idaho 3915 36845 1916 0 430
sqldf("SELECT * FROM data1 ORDER BY infections DESC LIMIT 5")
## subregion region pop income ipaddr ufo2010 infections
## 1 adams washington 19005 41219 12294 2 6781
## 2 alameda california 1554720 70821 5394949 169 5689
## 3 ada idaho 409061 55304 1035427 59 5023
## 4 adams colorado 459598 56089 3206226 115 3298
## 5 addison vermont 36745 57203 86406 2 2389
Interpret the output of the query ran above and run a similar query (limit 7) including the attributes ipadddr , ufo2010 ,and infections.: The output above is limited to 5 rows of data is is ordered by the infections column: highest to lowest. therefore we can see that washington has the highest infections while vermont has the lowest.
similar query (limit 7) including the attributes ipadddr , ufo2010 ,and infections.
sqldf("select ipaddr, ufo2010, infections from data1 order by infections limit 7")
## ipaddr ufo2010 infections
## 1 637 0 39
## 2 27319 0 52
## 3 7657 6 54
## 4 12892 0 67
## 5 2678 0 112
## 6 4543 0 116
## 7 19004 7 120
only regions and infections where the number of infections is greater than 1,000:
sqldf('SELECT region,infections FROM data1 WHERE infections > 1000 order by 2 desc')
## region infections
## 1 washington 6781
## 2 california 5689
## 3 idaho 5023
## 4 colorado 3298
## 5 vermont 2389
## 6 pennsylvania 2356
## 7 virginia 2076
Now I would like you to inspect (create a query) income,region and population when the condition is income>=50,000. Make comments.
sqldf("select income, region, pop from data1 where income >= 50000")
## income region pop
## 1 55304 idaho 409061
## 2 56089 colorado 459598
## 3 57097 pennsylvania 101482
## 4 57203 vermont 36745
## 5 70821 california 1554720
## 6 57715 new york 305455
-> there are only 6 regions with an income greater or equal to $50,000. The population of these regions ranges from 37 thousand to 155 thousand.
sqldf('SELECT * FROM data1 WHERE (income > 50000 AND infections > 2000) OR pop > 100000')
## subregion region pop income ipaddr ufo2010 infections
## 1 ada idaho 409061 55304 1035427 59 5023
## 2 adams colorado 459598 56089 3206226 115 3298
## 3 adams pennsylvania 101482 57097 100026 4 2356
## 4 addison vermont 36745 57203 86406 2 2389
## 5 aiken south carolina 162812 43999 131261 9 279
## 6 alachua florida 251417 41373 413659 29 290
## 7 alamance north carolina 153920 44430 240090 10 234
## 8 alameda california 1554720 70821 5394949 169 5689
## 9 albany new york 305455 57715 1655552 40 672
Interpret the query ran above and create a query that recreates a similar case scenario (one that is critical for the success of this mini-project): the query above shows all details of records that have an income above $50,000 and infections greater than 2,000 or a population greater than 100,000.
Let us run a query that returns every region and population with a number of infections between 1000 and 7000.
sqldf('SELECT region,pop FROM data1 WHERE infections between 1000 and 7000')
## region pop
## 1 virginia 33341
## 2 idaho 409061
## 3 colorado 459598
## 4 pennsylvania 101482
## 5 washington 19005
## 6 vermont 36745
## 7 california 1554720
the query above shows the region and population of records with infections between 1,000 and 7,000. The between filter could be usefull when analyzing tax data as the US has a progressive tax system and procentual increase occur for incomes between x and y.
Interpret the query below and list a hypothetical case scenario in which this query would be useful.
sqldf('SELECT * FROM data1 WHERE subregion LIKE "A%"')
## subregion region pop income ipaddr ufo2010 infections
## 1 abbeville south carolina 25101 34670 30330 2 245
## 2 acadia louisiana 61912 37970 38203 6 215
## 3 accomack virginia 33341 41595 41338 2 2076
## 4 ada idaho 409061 55304 1035427 59 5023
## 5 adair iowa 7481 47623 3762 0 189
## 6 adair kentucky 18675 31775 14303 1 195
## 7 adair missouri 25581 33157 75347 1 123
## 8 adair oklahoma 22286 31038 4543 0 116
## 9 adams colorado 459598 56089 3206226 115 3298
## 10 adams idaho 3915 36845 1916 0 430
## 11 adams illinois 67197 45792 50969 0 502
## 12 adams indiana 34365 46549 30418 0 126
## 13 adams iowa 3911 44389 2678 0 112
## 14 adams mississippi 32122 26784 12892 0 67
## 15 adams nebraska 31459 47469 27319 0 52
## 16 adams north dakota 2311 40236 637 0 39
## 17 adams ohio 28350 34232 7657 6 54
## 18 adams pennsylvania 101482 57097 100026 4 2356
## 19 adams washington 19005 41219 12294 2 6781
## 20 adams wisconsin 20679 41152 19004 7 120
## 21 addison vermont 36745 57203 86406 2 2389
## 22 aiken south carolina 162812 43999 131261 9 279
## 23 aitkin minnesota 15927 41301 8493 2 257
## 24 alachua florida 251417 41373 413659 29 290
## 25 alamance north carolina 153920 44430 240090 10 234
## 26 alameda california 1554720 70821 5394949 169 5689
## 27 alamosa colorado 16148 38299 15963 1 261
## 28 albany new york 305455 57715 1655552 40 672
## 29 albany wyoming 37276 45760 102104 16 205
the query above shows records in which the subregion starts with the letter A.
Last,let us find the average number of infections in our table.
sqldf("SELECT AVG(infections) FROM data1")
## AVG(infections)
## 1 1117.069
The query above shows the average of the infection column
Let us inspect the new table defined below data2.
data2<-read.csv("C:/Users/marxm/OneDrive/Documents/ProgrammingForData_R/acti5/data/finalexam.csv",header = TRUE,sep = ",")
Let us apply summary to data2 and examine the output.
summary(data2)
## region pop income ipaddr
## Length:29 Min. : 2311 Min. :26784 Min. : 637
## Class :character 1st Qu.: 19005 1st Qu.:37970 1st Qu.: 12294
## Mode :character Median : 32122 Median :41595 Median : 30418
## Mean : 135940 Mean :43858 Mean : 440130
## 3rd Qu.: 101482 3rd Qu.:47469 3rd Qu.: 102104
## Max. :1554720 Max. :70821 Max. :5394949
## ufo2010 infections
## Min. : 0.00 Min. : 39
## 1st Qu.: 0.00 1st Qu.: 123
## Median : 2.00 Median : 245
## Mean : 16.66 Mean :1117
## 3rd Qu.: 9.00 3rd Qu.: 672
## Max. :169.00 Max. :6781
the data2 dataframe has 6 columns of which one (region) is a character column and the remaining 5 (pop, income, ipaddr, ufo2010, infections) columns are numeric. the min, max, mean, quartiles are shown above
Next, I would like you to complete the following steps:
1-Run a query that returns full details for every record from the west region.
sqldf("select * from data2 where region = 'West'")
## region pop income ipaddr ufo2010 infections
## 1 West 409061 55304 1035427 59 5023
## 2 West 459598 56089 3206226 115 3298
## 3 West 3915 36845 1916 0 430
## 4 West 19005 41219 12294 2 6781
## 5 West 1554720 70821 5394949 169 5689
## 6 West 16148 38299 15963 1 261
## 7 West 37276 45760 102104 16 205
-> region = ‘West’ filter applied to show only records in the western region
2-Run a query that returns full details for every record from the east region where the number of infections is greater than 1500.
sqldf("select * from data2 where (region = 'East' and infections > 1500)")
## region pop income ipaddr ufo2010 infections
## 1 East 33341 41595 41338 2 2076
## 2 East 101482 57097 100026 4 2356
## 3 East 36745 57203 86406 2 2389
The output above reflects records that are in the East region AND have more than 1500 infections -> done by region and infections filter
3-Run a few queries that can potentially help us to make the point that: a.Region affects the number of infections b.Population affects the number of infections c.Income affects the number of infections d.ufo2010 affects the number of infections e. The different interactions of region and income, and income and population affects the number of infections.
sqldf("select region, sum(infections) from data2 group by region")
## region sum(infections)
## 1 Central 1637
## 2 East 9071
## 3 West 21687
-> the output shows that the central regions have less infections than the east and west regions. The west region has significantly more infections than the east.
sqldf("select pop, infections from data2 order by infections desc")
## pop infections
## 1 19005 6781
## 2 1554720 5689
## 3 409061 5023
## 4 459598 3298
## 5 36745 2389
## 6 101482 2356
## 7 33341 2076
## 8 305455 672
## 9 67197 502
## 10 3915 430
## 11 251417 290
## 12 162812 279
## 13 16148 261
## 14 15927 257
## 15 25101 245
## 16 153920 234
## 17 61912 215
## 18 37276 205
## 19 18675 195
## 20 7481 189
## 21 34365 126
## 22 25581 123
## 23 20679 120
## 24 22286 116
## 25 3911 112
## 26 32122 67
## 27 28350 54
## 28 31459 52
## 29 2311 39
Output shows population and infections order by infections highest to lowest. We can see that the region with the highest population does not neccessarely have the highest infections number
sqldf("select income, infections from data2 order by infections desc")
## income infections
## 1 41219 6781
## 2 70821 5689
## 3 55304 5023
## 4 56089 3298
## 5 57203 2389
## 6 57097 2356
## 7 41595 2076
## 8 57715 672
## 9 45792 502
## 10 36845 430
## 11 41373 290
## 12 43999 279
## 13 38299 261
## 14 41301 257
## 15 34670 245
## 16 44430 234
## 17 37970 215
## 18 45760 205
## 19 31775 195
## 20 47623 189
## 21 46549 126
## 22 33157 123
## 23 41152 120
## 24 31038 116
## 25 44389 112
## 26 26784 67
## 27 34232 54
## 28 47469 52
## 29 40236 39
Unlike B, we could say that there is a postive correlation between income and infections. However, this cannot be clearly seen with the raw data.
sqldf("select ufo2010, infections from data2 order by infections desc")
## ufo2010 infections
## 1 2 6781
## 2 169 5689
## 3 59 5023
## 4 115 3298
## 5 2 2389
## 6 4 2356
## 7 2 2076
## 8 40 672
## 9 0 502
## 10 0 430
## 11 29 290
## 12 9 279
## 13 1 261
## 14 2 257
## 15 2 245
## 16 10 234
## 17 6 215
## 18 16 205
## 19 1 195
## 20 0 189
## 21 0 126
## 22 1 123
## 23 7 120
## 24 0 116
## 25 0 112
## 26 0 67
## 27 6 54
## 28 0 52
## 29 0 39
Similar to B, we cannot see a trend of correlation between higher ufo2010 numbers and higher infections
sqldf("select region, income, infections from data2 order by infections desc")
## region income infections
## 1 West 41219 6781
## 2 West 70821 5689
## 3 West 55304 5023
## 4 West 56089 3298
## 5 East 57203 2389
## 6 East 57097 2356
## 7 East 41595 2076
## 8 East 57715 672
## 9 Central 45792 502
## 10 West 36845 430
## 11 East 41373 290
## 12 East 43999 279
## 13 West 38299 261
## 14 Central 41301 257
## 15 East 34670 245
## 16 East 44430 234
## 17 East 37970 215
## 18 West 45760 205
## 19 East 31775 195
## 20 Central 47623 189
## 21 Central 46549 126
## 22 Central 33157 123
## 23 East 41152 120
## 24 Central 31038 116
## 25 Central 44389 112
## 26 Central 26784 67
## 27 Central 34232 54
## 28 Central 47469 52
## 29 Central 40236 39
-> order by infections (highest to lowest), we can see that in general, the western region has the highest infections followed by the eastern and central region with one outlier being that the least infections are also found in the west. The income does not seem to have an apparent effect on the infection numbers.