# Enter your name here: Joshua Gaze
# 1. I did this homework by myself, with help from the book and the professor.
Make a data frame: data.frame( )
Row index of max/min: which.max( ), which.min( )
Sort values or order rows: sort( ), order( )
Descriptive statistics: mean( ), sum( ), max( )
Conditional statement: if (condition) “true stuff” else “false stuff”
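As a quick illustration of the functions listed above, here is a minimal sketch on a made-up data frame. The `toy` data frame and its `name` and `rounds` columns are hypothetical and are not part of the homework dataset.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(name = c("a", "b", "c"), rounds = c(2, 5, 1))

# Row index of the largest and smallest values
iMax <- which.max(toy$rounds)
iMin <- which.min(toy$rounds)

# sort() returns the sorted values; order() returns the row indices
# that would sort the rows, so it is used to reorder a data frame
sortedRounds <- sort(toy$rounds)
toySorted <- toy[order(toy$rounds), ]

# Descriptive statistics
avgRounds <- mean(toy$rounds)
totalRounds <- sum(toy$rounds)

# Conditional statement
label <- if (avgRounds > 2) "above two" else "two or fewer"
```

Note the difference between sort( ) and order( ): the first gives back values, the second gives back positions, which is why order( ) is the one to use inside the square brackets of a data frame.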
Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to become more useful. As Chapter 6 of Introduction to Data Science mentions, this is called “data munging.” In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dfComps <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv")
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl (2): funding_rounds, founded_year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
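# keep only the rows that have a homepage URL; read_csv stores missing values as NA, so test with is.na()
urlComps <- dfComps[!is.na(dfComps$homepage_url), ]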
dim(urlComps)
## [1] 44435 18
print('44435 companies have a homepage URL; the other 3323 of the 47758 are missing one')
## [1] "44435 companies have a homepage URL; the other 3323 of the 47758 are missing one"
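The filtering step can be sketched on a small made-up data frame. A missing value is the special value NA rather than the text 'NA', so is.na( ) is the direct way to test for it. The `toy` data frame and its columns are hypothetical.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(id = 1:4,
                  url = c("http://a.example", NA, "http://c.example", NA),
                  stringsAsFactors = FALSE)

# Keep only the rows whose url is not missing
hasUrl <- toy[!is.na(toy$url), ]

nWithUrl <- nrow(hasUrl)          # rows that have a URL
nMissing <- sum(is.na(toy$url))   # rows that lack one
```

The two counts always add up to the number of rows in the original data frame, which is a useful check on whether a count describes the rows kept or the rows dropped.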
str(urlComps)
## tibble [44,435 × 18] (S3: tbl_df/tbl/data.frame)
## $ permalink : chr [1:44435] "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
## $ name : chr [1:44435] "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
## $ homepage_url : chr [1:44435] "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
## $ category_list : chr [1:44435] "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
## $ market : chr [1:44435] "News" "Games" "Publishing" "Electronics" ...
## $ funding_total_usd: chr [1:44435] "1 750 000" "4 000 000" "40 000" "1 500 000" ...
## $ status : chr [1:44435] "acquired" "operating" "operating" "operating" ...
## $ country_code : chr [1:44435] "USA" "USA" "EST" "GBR" ...
## $ state_code : chr [1:44435] "NY" "CA" NA NA ...
## $ region : chr [1:44435] "New York City" "Los Angeles" "Tallinn" "London" ...
## $ city : chr [1:44435] "New York" "Los Angeles" "Tallinn" "London" ...
## $ funding_rounds : num [1:44435] 1 2 1 1 2 1 1 1 1 1 ...
## $ founded_at : chr [1:44435] "1/6/12" NA "26/10/2012" "1/4/11" ...
## $ founded_month : chr [1:44435] "2012-06" NA "2012-10" "2011-04" ...
## $ founded_quarter : chr [1:44435] "2012-Q2" NA "2012-Q4" "2011-Q2" ...
## $ founded_year : num [1:44435] 2012 NA 2012 2011 2012 ...
## $ first_funding_at : chr [1:44435] "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
## $ last_funding_at : chr [1:44435] "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
print("There are 2 numeric variables in urlComps: 'funding_rounds' and 'founded_year'")
## [1] "There are 2 numeric variables in urlComps: 'funding_rounds' and 'founded_year'"
mean(urlComps$funding_rounds)
## [1] 1.725194
print('The average number of funding rounds for the companies in urlComps was 1.725194')
## [1] "The average number of funding rounds for the companies in urlComps was 1.725194"
#mean(urlComps$founded_year, na.rm=TRUE)
min(urlComps$founded_year, na.rm=TRUE)
## [1] 1900
urlComps$company <- str_replace(urlComps$permalink,"/organization/","")
# the variable I believe should be numeric is the char variable funding_total_usd
funding_total_usd_null <- is.na(urlComps$funding_total_usd)
table(funding_total_usd_null) #there are no null values in this edition of the 'funding_total_usd' column
## funding_total_usd_null
## FALSE
## 44435
urlComps$funding_new <- gsub("\\s","",urlComps$funding_total_usd)
urlComps$funding_new <- as.numeric(urlComps$funding_new)
## Warning: NAs introduced by coercion
# entries that are still not numeric after the whitespace is removed are coerced to NA
library(stringi)
urlComps$funding_new <- stri_replace_all_charclass(urlComps$funding_total_usd,"\\p{WHITE_SPACE}", "")
N. You are now ready to convert urlComps$funding_new to numeric using as.numeric().
Calculate the average funding amount for urlComps. If you get “NA,” try using the na.rm=TRUE argument from problem I.
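urlComps$funding_new <- as.numeric(urlComps$funding_new)
## Warning: NAs introduced by coercion
mean(urlComps$funding_new,na.rm=TRUE)
# Converting funding_total_usd directly gave an average of 402.2857, but that figure is misleading: every value containing spaces (e.g. "1 750 000") was coerced to NA, so only the amounts written without spaces were averaged. Converting the whitespace-stripped funding_new column averages every funding amount that can be parsed.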
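The effect of converting before versus after stripping whitespace can be sketched on a toy vector. The strings below are invented, and the sketch assumes the separators are ordinary spaces, which base R's gsub( ) handles. If the real column uses other Unicode whitespace, the stringi call shown earlier is the safer choice.

```r
# Hypothetical funding strings, invented for illustration only
raw <- c("1 750 000", "40 000", "500", "-")

# Converting directly: any string with a space (and the dash) becomes NA
direct <- suppressWarnings(as.numeric(raw))

# Strip whitespace first, then convert: only the dash becomes NA
clean <- suppressWarnings(as.numeric(gsub("\\s", "", raw)))

meanDirect <- mean(direct, na.rm = TRUE)  # averages only the value that parsed
meanClean <- mean(clean, na.rm = TRUE)    # averages every value except the dash
```

Comparing sum(is.na(direct)) with sum(is.na(clean)) is a quick way to see how many values a conversion silently dropped.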
Sample three unique observations from urlComps$funding_rounds, store the results in the vector ‘observations’
observations <- sample(urlComps$funding_rounds, size=3, replace =F)
Take the mean of those observations
mean(observations)
## [1] 4
Do the two steps (sampling and taking the mean) in one line of code
mean(sample(urlComps$funding_rounds, size=3, replace=F))
## [1] 2
Explain why the two means are (or might be) different
# Each call to sample() draws a new random sample from the column, so the two vectors of observations can differ, and their means can differ as well
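A small sketch of why repeated sampling gives different means, using a made-up vector `x` in place of the real column. Fixing the random seed with set.seed( ) makes a draw repeatable; without it, each call produces a fresh draw.

```r
# Hypothetical values, invented for illustration only
x <- c(1, 1, 1, 2, 2, 3, 4, 7)

# Same seed before each call: the same three observations are drawn
set.seed(42)
m1 <- mean(sample(x, size = 3, replace = FALSE))
set.seed(42)
m2 <- mean(sample(x, size = 3, replace = FALSE))

# No reseed: a new draw, so m3 may or may not equal m1
m3 <- mean(sample(x, size = 3, replace = FALSE))
```

Note that replace = FALSE guarantees three distinct positions in the vector, not three distinct values, so a sample can still contain repeated numbers.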
Use the replicate( ) function to repeat your sampling of three observations from urlComps$funding_rounds five times. The first argument to replicate( ) is the number of repeats you want. The second argument is the little chunk of code you want repeated.
replicate(5,sample(urlComps$funding_rounds, size=3, replace=F))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3 1 4 2 1
## [2,] 1 2 2 4 1
## [3,] 1 1 1 4 1
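When the repeated chunk returns a vector, replicate( ) collects the results as the columns of a matrix, as in the output above. A minimal sketch on a made-up vector `x` shows the shape of the result and one way to get a mean per replication.

```r
# Hypothetical values, invented for illustration only
x <- c(1, 1, 1, 2, 2, 3, 4, 7)

set.seed(1)
# Each replication returns 3 values, so the result is a 3 x 5 matrix
draws <- replicate(5, sample(x, size = 3, replace = FALSE))
drawDims <- dim(draws)

# One mean per column, i.e. one mean per replication
sampleMeans <- colMeans(draws)
```

Wrapping the sample in mean( ) inside replicate( ) gives the same kind of result directly as a plain vector of means.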
Rerun your replication, this time doing 20 replications and storing the output of replicate() in a variable called values.
values <- replicate(20, mean(sample(urlComps$funding_rounds, size=3, replace=F)))
Generate a histogram of the means stored in values.
hist(values)
Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values, and then generate a histogram of values.
values <- replicate(1000, mean(sample(urlComps$funding_rounds, size=3, replace=F)))
hist(values)
Repeat the replicated sampling, but this time, raise your sample size from 3 to 22. How does that affect your histogram? Explain in a comment.
values <- replicate(1000, mean(sample(urlComps$funding_rounds, size=22, replace=F)))
hist(values)
Explain in a comment below why the last three histograms look different.
# With 20 replications of samples of size 3, the histogram is coarse and right-skewed: funding_rounds is itself right-skewed, and averaging only three observations does little to smooth that out.
# Raising the replications to 1000 does not change the underlying distribution of the sample means, but it fills the histogram in so its right-skewed shape is clearly visible. Raising the sample size from 3 to 22 is what moves the histogram toward a normal distribution: by the central limit theorem, means of larger samples cluster more tightly around the population mean and are distributed more symmetrically.
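The effect of sample size can be checked numerically with a small simulation. This is a sketch under stated assumptions: `population` is a synthetic right-skewed vector generated with rgeom( ) to stand in for funding_rounds, not the real data.

```r
set.seed(123)

# Hypothetical right-skewed population standing in for funding_rounds
population <- rgeom(10000, prob = 0.5) + 1

# 1000 sample means at each sample size
meansN3 <- replicate(1000, mean(sample(population, size = 3, replace = FALSE)))
meansN22 <- replicate(1000, mean(sample(population, size = 22, replace = FALSE)))

# The spread of the sample means shrinks as the sample size grows
sdN3 <- sd(meansN3)
sdN22 <- sd(meansN22)

# hist(meansN3); hist(meansN22)  # uncomment to compare the two shapes
```

The standard deviation of the sample means falls roughly in proportion to the square root of the sample size, which is why the histogram for samples of 22 is narrower than the one for samples of 3.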