R SQL 103122

Good afternoon all! First let us find out what our working directory is.

#getwd()

Today we are going to query the table quiz2 by using Transact-SQL inside the R chunks. First, we are going to need to install the package sqldf.

#install.packages("sqldf")

#E- 'sqldf' Manipulates R Data Frames Using SQL
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## Warning in fun(libname, pkgname): couldn't connect to display ":0"

## Loading required package: RSQLite

#E- imported data into data1
data1 <- read.csv("quiz2.csv", header = TRUE, sep = ",")

summary(data1)

##   subregion            region               pop              income     
##  Length:29          Length:29          Min.   :   2311   Min.   :26784  
##  Class :character   Class :character   1st Qu.:  19005   1st Qu.:37970  
##  Mode  :character   Mode  :character   Median :  32122   Median :41595  
##                                        Mean   : 135940   Mean   :43858  
##                                        3rd Qu.: 101482   3rd Qu.:47469  
##                                        Max.   :1554720   Max.   :70821  
##      ipaddr           ufo2010         infections  
##  Min.   :    637   Min.   :  0.00   Min.   :  39  
##  1st Qu.:  12294   1st Qu.:  0.00   1st Qu.: 123  
##  Median :  30418   Median :  2.00   Median : 245  
##  Mean   : 440130   Mean   : 16.66   Mean   :1117  
##  3rd Qu.: 102104   3rd Qu.:  9.00   3rd Qu.: 672  
##  Max.   :5394949   Max.   :169.00   Max.   :6781

Let us run a query that returns every single record from the table data1.

#E- Selects the columns, and organizes them by that. 
sqldf("SELECT * FROM data1")

Now I would like you to create a query that returns every single record (row) for the fields region and population.

Let us now create a query that returns the first 10 records of the data1 table.

#E- selects all the columns and displays up to 10 rows. (LIMIT 10)
sqldf('SELECT * FROM data1 LIMIT 10')

#E- Selects all the columns from data 1, and displays them in DESCENDING order based on the column 'infections' up to the first 5 rows.
sqldf("SELECT * FROM data1 ORDER BY infections DESC  LIMIT 5")

Interpret the output of the query ran above and run a similar query (limit 7) including the attributes ipaddr , ufo2010 ,and infections.

#E- only using the columns stated, find the top infections.
sqldf("SELECT ipaddr,ufo2010,infections FROM data1 ORDER BY infections DESC  LIMIT 7")

#E- Select only the columns (region, infections) from data1, and filter them, in descending order, based on infections greater than 1000.
sqldf('SELECT region,infections FROM data1 WHERE infections > 1000 order by 2 desc')

Now I would like you to inspect (create a query) income,region and population when the condition is income>=50,000. Make comments.

#E- selected only the 3 columns whom have an income greater than equal to 50,000
sqldf('SELECT income,region,pop FROM data1 WHERE income >= 50000')

#E- Selecting the rows where income is greater than 50k and infections is greater than 2k, with th possible filter of 100k pop.
sqldf('SELECT * FROM data1 WHERE (income > 50000 AND infections > 2000) OR pop > 100000')

Interpret the query ran above and create a query that recreates a similar case scenario (one that is critical for the success of this mini-project).

#E- The factor that most affects infections is 'ufo2010', although I thought this would not be the case, the more infections there are the more ufo sightings there were. 
#E- the reasoning for this is probably that the more people that saw the UFO sightings had a mind that was more prone to infections.
sqldf('SELECT * FROM data1 ORDER BY infections DESC')

Let us run a query that returns every region and population with a number of infections between 1000 and 7000.

sqldf('SELECT region,pop FROM data1 WHERE infections between  1000 and 7000')

Interpret the query below and list a hypothetical case scenario in which this query would be useful.

#E- from the dataset, filter sub region in alphabetical order.
#E- it would be useful if an employer were to need employee names, 
sqldf('SELECT * FROM data1 WHERE subregion LIKE "A%"')

Last,let us find the average number of infections in our table.

#E- AVG() calculates average of a column
sqldf("SELECT AVG(infections) FROM data1")

Let us inspect the new table defined below data2.

#data2<-read.csv("finalexam.csv",header = TRUE,sep = ",")
#E- Dataset was imported into Rstudio and referenced as 'data2'
#data2 <- finalexam
data2<-read.csv("finalexam.csv",header = TRUE,sep = ",")

Let us apply summary to data2 and examine the output.

summary(data2)

##     region               pop              income          ipaddr       
##  Length:29          Min.   :   2311   Min.   :26784   Min.   :    637  
##  Class :character   1st Qu.:  19005   1st Qu.:37970   1st Qu.:  12294  
##  Mode  :character   Median :  32122   Median :41595   Median :  30418  
##                     Mean   : 135940   Mean   :43858   Mean   : 440130  
##                     3rd Qu.: 101482   3rd Qu.:47469   3rd Qu.: 102104  
##                     Max.   :1554720   Max.   :70821   Max.   :5394949  
##     ufo2010         infections  
##  Min.   :  0.00   Min.   :  39  
##  1st Qu.:  0.00   1st Qu.: 123  
##  Median :  2.00   Median : 245  
##  Mean   : 16.66   Mean   :1117  
##  3rd Qu.:  9.00   3rd Qu.: 672  
##  Max.   :169.00   Max.   :6781

Next, I would like you to complete the following steps:

1-Run a query that returns full details for every record from the west region.

sqldf('SELECT * FROM data2 WHERE region = "West"')

2-Run a query that returns full details for every record from the east region where the number of infections is greater than 1500.

sqldf('SELECT * FROM data2 WHERE (region = "East" AND region >1500)')

3-Run a few queries that can potentially help us to make the point that: a.Region affects the number of infections b.Population affects the number of infections c.Income affects the number of infections d.ufo2010 affects the number of infections e. The different interactions of region and income, and income and population affects the number of infections.

#A. Region affects the number of infections
#E- The west Region has the most infections
sqldf('SELECT region,infections FROM data2 ORDER BY infections DESC')

#B Population affects the number of infections
#E- The population of '19005' has the most infections
sqldf('SELECT pop,infections FROM data2 ORDER BY infections DESC')

#C Income affects the number of infections
#E- income of '41219' has the most infections.
sqldf('SELECT income,infections FROM data2 ORDER BY infections DESC')

#D ufo2010 affects the number of infections
#E- ufo2010 of '2' has the most infections.
sqldf('SELECT ufo2010,infections FROM data2 ORDER BY infections DESC')

#E.1 The different interactions of region and income, and income and population affects the number of infections.
#E- the West and East region, seem to have the highest income, and in return seem to be the ones with the largest infections.
sqldf('SELECT region,income,infections FROM data2 ORDER BY infections DESC')

#E.2 The different interactions of region and income, and income and population affects the number of infections.
#E- income and population seem to not be correlated, and not be a direct factor affecting infections.
sqldf('SELECT income,pop,infections FROM data2 ORDER BY infections DESC')

Explain every single query you include in this part of the handout.