library(knitr)
library(ggplot2)
library(rjson)
The datasets used for this project were named “GroupSouth.csv” and “irsCAzip2013.csv”. Both were downloaded from the Statistics 20 class website on the UCLA CCLE site. The scraped web page, “http://www.stat.ucla.edu/~vlew/datasets/LASDAUG.html”, was linked from the same class website. “GroupSouth.csv” originally had 13548 observations and 41 variables, “irsCAzip2013.csv” originally had 1484 observations and 111 variables, and the scraped web page originally had 14191 observations and 19 variables. After modifications and selection of relevant variables, the final merged dataset, named “totalanalysis”, had 2946 observations and 14 variables.
In this project, I studied the relationship between restaurants in Southern California and IRS tax information for the same zip codes. More specifically, I looked at the relationship between the number of tax returns filed by “heads” of household and the number of dependents, and then organized that result by the Yelp rating received by restaurants in the same area, defined by zip code. I attempted to discover a possible connection between restaurants that receive higher Yelp ratings and the likelihood that a tax return is filed, specifically by heads of household (meaning more than one person in a household, the majority of which are families). I also investigated the relationship between adjusted gross income (AGI) in a given zip code and the Yelp ratings for restaurants in the same area. Finally, I conducted two shorter inquiries: one into restaurant prices and their Yelp ratings, and another into whether the mean victim count in an area is correlated with the Yelp ratings of that area’s restaurants.
Both datasets and the web page that I scraped came from the Statistics 20 class website on the UCLA CCLE site. The variables present in the final merged dataset are outlined below:
ZipCode
Restaurant- Name of restaurant
Street.Address
City
State
newRatings- Yelp rating of the restaurant (transformed from the “Ratings” variable from the “GroupSouth.csv” data; newRatings is numerical)
Price- Price rating of the restaurant, given as Yelp $ symbols (from least to most expensive)
The variables above were derived from “GroupSouth.csv”. The variables below were derived from “irsCAzip2013.csv”.
N1- Number of returns in 2013 as reported by the IRS
MARS1- Number of returns for people who filed as single in 2013 as reported by the IRS
MARS2- Number of returns for people who filed jointly as married in 2013 as reported by the IRS
MARS4- Number of returns for people who filed as head of household in 2013 as reported by the IRS
NUMDEP- Number of dependents in 2013 as reported by the IRS
A00100- Adjusted gross income in 2013 as reported by the IRS
sqm1- square root of the number of single returns filed in 2013
I chose to retain these variables because they gave the most information about the number of returns reported by the IRS for several different filing groups, as well as the total. They also gave information on the restaurants: their locations, Yelp ratings, and prices. Together, this gives insight into how the rating of a restaurant relates to the number of returns filed in the same area. The retained data also supports the two smaller investigations brought up in the introduction.
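A minimal sketch of the variable-selection and merge step, assuming irs and south2 are the cleaned IRS and restaurant data frames (the full cleaning code appears in the code appendix):

```r
# Keep the IRS return counts and AGI, then join to the restaurant data by zip code
irs2 <- subset(irs, select = c(N1, MARS1, MARS2, MARS4, NUMDEP, A00100, zipCode))
irssouth <- merge(irs2, south2, by = "zipCode")
```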
The basic summary statistics for the IRS return variables are given below:
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| N1 | 100 | 2170 | 10200 | 22700 | 17500 | 16900000 |
| MARS1 | 40 | 890 | 4720 | 10700 | 8320 | 7940000 |
| MARS2 | 30 | 910 | 3560 | 8240 | 6240 | 6110000 |
| MARS4 | 0 | 230 | 1080 | 3480 | 2480 | 2580000 |
| NUMDEP | 30 | 1360 | 6400 | 17800 | 13300 | 13200000 |
| A00100 | 3.09e+03 | 1.20e+05 | 6.04e+05 | 1.63e+06 | 1.16e+06 | 1.21e+09 |
For the above summary statistics, I used the summary() function, as it provides the five-number summary (plus the mean) with minimal code. In later analysis, I used the lm() function to perform a linear regression of the number of dependents on the number of “head of household” returns filed in an area defined by zip code. (See Graph 2.)
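A minimal sketch of the two calls, assuming the irs data frame and the merged data frame lis built in the code appendix:

```r
# Five-number summary (plus the mean) of total returns reported by the IRS
summary(irs$N1)

# Linear regression of dependents on head-of-household returns
reg <- lm(NUMDEP ~ MARS4, data = lis)
summary(reg)
```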
The regression coefficient table below was produced with the kable() function from the knitr package; a short sketch of the call follows the table.
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 1500.49 | 79.27 | 18.93 | 0 |
| lis$MARS4 | 3.86 | 0.02 | 212.00 | 0 |
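A minimal sketch of that call, assuming the fitted model reg from the code appendix:

```r
# Render the coefficient matrix as a table, rounded to two decimal places
kable(summary(reg)$coefficients, digits = 2)
```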
Graph 1
This graphic shows the relationship between the mean victim count in a zip code and the Yelp ratings of that area’s restaurants. There does not appear to be a correlation suggesting that areas with high victim counts have lower-rated restaurants than areas with low victim counts.
Graph 2
This graphic shows the relationship between the number of head-of-household returns and the number of dependents through a linear regression model. For every additional return filed by a head of household, the model estimates an increase of about 3.86 dependents, on average.
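That 3.86 figure is the slope of the fitted model; a minimal check, assuming the regression object reg from the code appendix:

```r
# The second coefficient is the estimated change in dependents
# per additional head-of-household return (about 3.86)
coef(reg)[2]
```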
Graph 3
This graphic shows the relationship between a restaurant’s price and its Yelp rating. There appears to be a correlation suggesting that more expensive restaurants are often rated highly, whereas less expensive ones are not. Although there are some highly rated inexpensive restaurants, there are many more low-rated inexpensive restaurants than low-rated expensive ones. This is expected.
Graph 4
This graphic shows the relationship between the number of head-of-household returns and the number of dependents. It is essentially the same as Graph 2, except that this one is created with ggplot2 and uses a color spectrum to illustrate the Yelp ratings of restaurants in each zip code as well. Areas with more head-of-household returns and more dependents appear to have higher-rated restaurants.
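A minimal sketch of the ggplot2 call behind this graphic, assuming the merged data frame lis (the full call appears in the code appendix):

```r
# Dependents vs. head-of-household returns, with points colored by Yelp rating
ggplot(lis, aes(x = MARS4, y = NUMDEP, color = newRatings)) +
  geom_point(shape = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Head of Household Returns", y = "Dependents", colour = "Rating")
```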
This class was incredibly influential on me and my aspirations for the future. Before I even took this class, I had tentatively decided to change my major from Computer Science to Statistics. I chose to do this because I believed that statistics, and the R programming language, were more applicable to what I want to do in the future, and this class only confirmed that belief. I learned not only that I thoroughly enjoy performing data analysis with R, but also that I can be quite good at it. When I apply myself properly and am dedicated to a topic, I am confident in my ability to perform a good analysis.
In the future, and after I graduate, I hope to work in the sports industry, ideally performing some sort of analytics for either a team or an independent company. Over my next two and a half years at UCLA, I hope to further my study of statistics, focusing mainly on data analysis using R, along with SQL and Python. I hope to conduct research while here at UCLA and hopefully secure a promising internship as well. Overall, I greatly enjoyed this class, this project, and the study of R as a whole, and I will hopefully continue to use what I have learned long into the future.
south <- read.csv("/Users/sschmall/Downloads/GroupSouth.csv")
irs <- read.csv("/Users/sschmall/Downloads/irsCAzip2013.csv")
#south - 13548 obs. of 41 variables
#irs - 1484 obs. of 111 variables
##A
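# Drop columns that are not needed for the analysis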
south2 <- subset(south, select = -c(Suite.Number, Neighborhood, Cross.Street, Link.to.Map, Website, Phone.Number, Link.to.Menu, Hours, Link.to.Yelp, Link.to.Yahoo..Local, Link.to.Citysearch, Link.to.Zagat))
##B
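# Re-geocode the first 373 rows (ordered by latitude) with ggmap and write the coordinates back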
library(ggmap)
south2 <- south2[order(south2$Latitude),]
south2$Latitude
adds <- paste(south2$Street.Address, south2$City, south2$State, south2$Zip.Code)
mylist <- south2[1:373,]
mylist$Longitude
myadds <- adds[1:373]
geocodeQueryCheck()
loc <- geocode(myadds, output="latlona", messaging=FALSE, source="google" )
mylist$Latitude <- loc$lat
mylist$Longitude <- loc$lon
south2$Latitude[1:373] <- mylist$Latitude
south2$Longitude[1:373] <- mylist$Longitude
#C
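# Recode blank entries (" ") as NA in the categorical restaurant variables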
south2$Email[south2$Email == " "] <- NA
south2$Alcohol[south2$Alcohol == " "] <- NA
south2$Credit.Cards[south2$Credit.Cards == " "] <- NA
south2$Good.for.Kids[south2$Good.for.Kids == " "] <- NA
south2$Childrens.Menu[south2$Childrens.Menu == " "] <- NA
south2$Takeout[south2$Takeout == " "] <- NA
south2$Delivery[south2$Delivery == " "] <- NA
south2$Kosher[south2$Kosher == " "] <- NA
south2$Halal[south2$Halal == " "] <- NA
south2$Vegan.Vegetarian[south2$Vegan.Vegetarian == " "] <- NA
south2$Gluten.Free.Options[south2$Gluten.Free.Options == " "] <- NA
south2$Organic.Options[south2$Organic.Options == " "] <- NA
south2$Wheelchair.Access[south2$Wheelchair.Access == " "] <- NA
south2$Price[south2$Price == " "] <- NA
south2$Chef[south2$Chef == " "] <- NA
south2$Reservations[south2$Reservations == " "] <- NA
##D
south2$newRatings <- as.numeric(as.character(south2$Ratings)) # convert Yelp ratings to numeric (via character in case Ratings is a factor)
##E
names(south2)[names(south2)=="Zip.Code"] <- "zipCode"
south2 <- south2[order(south2$zipCode),]
#3
##A
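# Tabulate the original Ratings values after removing NAs and blanks, then summarize the numeric ratings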
OrigRatings <- na.omit(south2$Ratings)
OrigRatings <- OrigRatings[OrigRatings != "" & OrigRatings != " "]
table(OrigRatings)
summary(south2$newRatings)
##B
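# Convert selected Yes/No variables to upper case, drop missing or blank values, and tabulate them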
GoodForKids <- toupper(as.character(south2$Good.for.Kids))
GoodForKids <- na.omit(GoodForKids)
GoodForKids <- GoodForKids[GoodForKids != ""]
table(GoodForKids)
CredCards <- toupper(as.character(south2$Credit.Cards))
CredCards <- na.omit(CredCards)
CredCards <- CredCards[CredCards != ""]
table(CredCards)
Res <- toupper(as.character(south2$Reservations))
Res <- na.omit(Res)
Res <- Res[Res != ""]
table(Res)
##C
summary(irs$N1)
summary(irs$MARS1)
summary(irs$MARS2)
summary(irs$MARS4)
summary(irs$NUMDEP)
summary(irs$A00100)
#4
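# Rename the IRS zip code column, keep the IRS variables of interest, and merge with the restaurant data by zip code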
names(irs)[names(irs)=="Group.1"] <- "zipCode"
south2$zipCode <- as.numeric(as.character(south2$zipCode))
irs2 <- subset(irs, select = c(N1, MARS1, MARS2, MARS4, NUMDEP, A00100, zipCode))
irssouth <- merge(irs2, south2, by = "zipCode")
head(irssouth)
#5
##A
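# Scrape the LASD table from the class web page and convert it to a data frame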
library(XML)
las <- readHTMLTable("http://www.stat.ucla.edu/~vlew/datasets/LASDAUG.html")
las2 <- as.data.frame(las)
##B
# Clean the scraped column names, compute the mean victim count per zip code with aggregate(), and merge with the IRS/restaurant data
names(las2) <- substr(names(las2), 6, nchar(names(las2)))
names(las2)[names(las2)=="ZIP"] <- "zipCode"
las2$zipCode <- as.numeric(as.character(las2$zipCode))
las3 <- aggregate(as.numeric(as.character(las2$VICTIM_COUNT)), by = list(las2$zipCode), FUN = mean)
names(las3)[names(las3)=="Group.1"] <- "zipCode"
names(las3)[names(las3)=="x"] <- "MeanVictimCount"
las3
lis <- merge(las3, irssouth, by = "zipCode")
lis <- subset(lis, select = (c("zipCode", "MeanVictimCount", "N1", "MARS1", "MARS2", "MARS4", "NUMDEP", "A00100", "Restaurant", "Street.Address", "City", "State", "newRatings", "Price")))
#6
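# Graphics: AGI by Yelp rating (boxplot), rating by price range (barplot), and victim count vs. rating (scatterplot)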
library(ggplot2)
box <- boxplot(lis$A00100~lis$newRatings, ylim = c(0, 6000000), xlab = "Yelp Rating", main = "Adjusted Gross Income by Yelp Rating", col = c("red", "green", "blue", "cyan", "darkred", "gray", "yellow", "pink", "aquamarine", "orange"), ylab = "Adjusted Gross Income")
x <- lis$newRatings
y <- lis$Price
tab2 <- table(x, y)
tab2 <- tab2[,-c(1,2,7)] # drop the 1st, 2nd, and 7th price-level columns from the table
tab2
barplot(tab2, beside = TRUE, ylim = c(0, 100), legend = rownames(tab2), xlab = "Restaurant Price", ylab = "Number of Restaurants", main = "Restaurant Rating by Price Range", args.legend = list(title = "Rating"), col = c("red", "green", "blue", "cyan", "darkred", "gray", "yellow", "pink", "aquamarine", "orange"))
plot(lis$MeanVictimCount, lis$newRatings, xlab = "Mean Victim Count", ylab = "Yelp Rating", main = "Relationship Between Victim Count and Yelp Rating", col = c("green"))
lis$MeanVictimCount
#7
irs$NUMDEP
lis$sqm1 <- sqrt(lis$MARS1)
#8
meanirs <- apply(irs[sapply(irs, is.numeric)], 2, mean, na.rm = TRUE) #Showing aptitude with the apply functions; mean of every numeric irs variable
meanirs <- matrix(meanirs)
#8
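# Regress dependents on head-of-household returns, then plot the relationship in base R and in ggplot2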
reg <- lm(lis$NUMDEP~lis$MARS4)
reg
summary(reg)
bb <- plot(lis$MARS4, lis$NUMDEP, col = "red")
abline(reg, col = "green")
gg <- ggplot(lis, aes(x = MARS4, y = NUMDEP, color = newRatings)) +
  geom_point(shape = 2) +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Relationship Between Number of Returns \nFiled for Heads of Households and Number of Dependents") +
  labs(x = "Head of Household Returns", y = "Dependents", colour = "Rating")
gg