# Enter your name here: Taylor Sain
# 2. I did this homework with help from the book and the professor and these Internet sources:
# https://discuss.analyticsvidhya.com/t/how-to-count-the-missing-value-in-r/2949
# https://stackoverflow.com/questions/1508889/how-to-count-number-of-numeric-values-in-a-column
# https://www.tutorialspoint.com/r/r_lists.htm
# https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_replace
# https://statisticsglobe.com/warning-message-nas-introduced-by-coercion-in-r
# https://www.dummies.com/programming/r/how-to-take-samples-from-data-in-r/
Make a data frame: data.frame()
Row index of max/min: which.max(), which.min()
Sort values or order rows: sort(), order()
Descriptive statistics: mean(), sum(), max()
Conditional statement: if (condition) "true stuff" else "false stuff"
Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to become more useful. As Chapter 6 of Introduction to Data Science mentions, this is called “data munging.” In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ readr 2.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
dfComps <- data.frame(read_csv('https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv'))
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl (2): funding_rounds, founded_year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(dfComps)
companies <- data.frame(dfComps$name,dfComps$homepage_url)
View(companies)
sum(is.na(dfComps$homepage_url))
## [1] 3323
sum(is.na(companies$dfComps.homepage_url))
## [1] 3323
How many numeric variables does the data frame have? You can figure that out by looking at the output of str(urlComps). (The data frame is named dfComps in the code here.)
What is the average number of funding rounds for the companies in urlComps?
str(dfComps)
## 'data.frame': 47758 obs. of 18 variables:
## $ permalink : chr "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
## $ name : chr "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
## $ homepage_url : chr "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
## $ category_list : chr "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
## $ market : chr "News" "Games" "Publishing" "Electronics" ...
## $ funding_total_usd: chr "1 750 000" "4 000 000" "40 000" "1 500 000" ...
## $ status : chr "acquired" "operating" "operating" "operating" ...
## $ country_code : chr "USA" "USA" "EST" "GBR" ...
## $ state_code : chr "NY" "CA" NA NA ...
## $ region : chr "New York City" "Los Angeles" "Tallinn" "London" ...
## $ city : chr "New York" "Los Angeles" "Tallinn" "London" ...
## $ funding_rounds : num 1 2 1 1 2 1 1 1 1 1 ...
## $ founded_at : chr "1/6/12" NA "26/10/2012" "1/4/11" ...
## $ founded_month : chr "2012-06" NA "2012-10" "2011-04" ...
## $ founded_quarter : chr "2012-Q2" NA "2012-Q4" "2011-Q2" ...
## $ founded_year : num 2012 NA 2012 2011 2012 ...
## $ first_funding_at : chr "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
## $ last_funding_at : chr "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
colSums(!is.na(dfComps))
## permalink name homepage_url category_list
## 47758 47758 44435 42464
## market funding_total_usd status country_code
## 42465 47758 46029 42634
## state_code region city funding_rounds
## 29147 42634 41630 47758
## founded_at founded_month founded_quarter founded_year
## 37328 37255 37255 37255
## first_funding_at last_funding_at
## 47758 47758
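The number of numeric variables can also be counted in code rather than read off the str() output. The following is a minimal sketch on a made-up data frame; `toy`, `numericCols`, and `numNumeric` are hypothetical names, not part of the assignment.

```r
# Toy data frame (hypothetical values; stands in for the real dataset)
toy <- data.frame(
  name = c("a", "b", "c"),
  funding_rounds = c(1, 2, 1),
  founded_year = c(2012, NA, 2011),
  stringsAsFactors = FALSE
)

# is.numeric() is applied to each column; sum() counts the TRUE results
numericCols <- sapply(toy, is.numeric)
numNumeric <- sum(numericCols)

# Names of the numeric columns
names(toy)[numericCols]
```

Note that colSums(!is.na(...)) counts non-missing values per column, which is a different question from how many columns are numeric.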
mean(dfComps$funding_rounds)
## [1] 1.688576
mean(dfComps$founded_year, na.rm=TRUE)
## [1] 2007.247
min(dfComps$founded_year, na.rm=TRUE)
## [1] 1900
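# Note: the replacement string is a single space, so each NewPermalink value keeps a leading space; use '' to drop the prefix entirely
dfComps$NewPermalink <- str_replace(dfComps$permalink,'/organization/',' ')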
str(dfComps)
## 'data.frame': 47758 obs. of 19 variables:
## $ permalink : chr "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
## $ name : chr "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
## $ homepage_url : chr "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
## $ category_list : chr "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
## $ market : chr "News" "Games" "Publishing" "Electronics" ...
## $ funding_total_usd: chr "1 750 000" "4 000 000" "40 000" "1 500 000" ...
## $ status : chr "acquired" "operating" "operating" "operating" ...
## $ country_code : chr "USA" "USA" "EST" "GBR" ...
## $ state_code : chr "NY" "CA" NA NA ...
## $ region : chr "New York City" "Los Angeles" "Tallinn" "London" ...
## $ city : chr "New York" "Los Angeles" "Tallinn" "London" ...
## $ funding_rounds : num 1 2 1 1 2 1 1 1 1 1 ...
## $ founded_at : chr "1/6/12" NA "26/10/2012" "1/4/11" ...
## $ founded_month : chr "2012-06" NA "2012-10" "2011-04" ...
## $ founded_quarter : chr "2012-Q2" NA "2012-Q4" "2011-Q2" ...
## $ founded_year : num 2012 NA 2012 2011 2012 ...
## $ first_funding_at : chr "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
## $ last_funding_at : chr "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
## $ NewPermalink : chr " waywire" " tv-communications" " rock-your-paper" " in-touch-network" ...
# funding_total_usd should be numeric and not char
dfComps$Funds <- as.numeric(dfComps$funding_total_usd)
## Warning: NAs introduced by coercion
# Nearly every value contains spaces between digit groups (e.g. "1 750 000"), so as.numeric() cannot parse it and returns NA
sum(is.na(dfComps$Funds))
## [1] 47721
library(stringi)
dfComps$funding_total_usd_new <- stri_replace_all_charclass(dfComps$funding_total_usd,"\\p{WHITE_SPACE}", "")
# Remove the whitespace separating the digit groups so the values can be converted to numeric
N. You are now ready to convert urlComps$funding_new to numeric using as.numeric().
Calculate the average funding amount for urlComps. If you get “NA,” try using the na.rm=TRUE argument from problem I.
dfComps$funding_total_usd_new <- as.numeric(dfComps$funding_total_usd_new)
## Warning: NAs introduced by coercion
mean(dfComps$funding_total_usd_new, na.rm = TRUE)
## [1] 17820092
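The clean-then-convert step can be sketched with base R alone. This is an illustration on made-up strings, assuming ordinary space characters as separators; `raw`, `cleaned`, `funding`, and `avgFunding` are hypothetical names.

```r
# Made-up funding strings; "-" stands in for an entry with no amount
raw <- c("1 750 000", "40 000", "-")

# Strip every whitespace character so the digit groups join up
cleaned <- gsub("[[:space:]]", "", raw)

# Entries that are still not numbers become NA; the coercion warning
# is expected here, so it is suppressed
funding <- suppressWarnings(as.numeric(cleaned))

# na.rm = TRUE leaves the NA entries out of the average
avgFunding <- mean(funding, na.rm = TRUE)
```

The second coercion warning in the write-up is consistent with this: values that are not numbers even after the whitespace is removed remain NA.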
Sample three unique observations from urlComps$funding_rounds, store the results in the vector ‘observations’
v <- sample(dfComps$funding_rounds, 5)
v2 <- sample(dfComps$funding_rounds, 10)
v3 <- sample(dfComps$funding_rounds, 15)
v
## [1] 1 1 1 7 2
v2
## [1] 1 1 1 3 1 1 1 2 1 1
v3
## [1] 1 1 1 2 2 2 1 1 1 1 1 1 3 2 1
Take the mean of those observations
mean(v)
## [1] 2.4
mean(v2)
## [1] 1.3
mean(v3)
## [1] 1.4
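For reference, the prompt's version of this step (three observations stored in a vector called observations) might look like the sketch below. The `rounds` vector and the seed are made up for illustration; set.seed() is only there so the sketch gives the same draw each time it is run.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch repeatable

# Made-up funding-round counts standing in for the real column
rounds <- c(1, 1, 2, 1, 3, 1, 7, 2, 1, 1)

# sample() draws without replacement by default, so three distinct
# positions of the vector are picked
observations <- sample(rounds, 3)

# Mean of the three sampled values
obsMean <- mean(observations)
```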
Do the two steps (sampling and taking the mean) in one line of code
mean(sample(dfComps$funding_rounds, 5))
## [1] 2.2
Explain why the two means are (or might be) different
# sample() draws at random, so each call that samples 5 observations picks a different set of 5 elements, and the resulting means differ
Use the replicate( ) function to repeat your sampling of three observations from urlComps$funding_rounds five times. The first argument to replicate( ) is the number of repeats you want. The second argument is the little chunk of code you want repeated.
replicate(5, sample(dfComps$funding_rounds, 5))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 1 1 1 1
## [2,] 1 6 1 6 2
## [3,] 1 1 1 1 1
## [4,] 1 1 2 2 7
## [5,] 1 1 1 1 1
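When the repeated chunk returns a single number, such as the mean of a sample, replicate() returns a plain vector instead of a matrix. A minimal sketch, assuming made-up data; `rounds` and `sampleMeans` are hypothetical names.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch repeatable

# Made-up funding-round counts standing in for the real column
rounds <- c(1, 1, 2, 1, 3, 1, 7, 2, 1, 1)

# Each repeat draws 3 values and reduces them to one mean,
# so the result is a numeric vector with one entry per repeat
sampleMeans <- replicate(20, mean(sample(rounds, 3)))

length(sampleMeans)
```

A vector of means like this can be passed to hist() directly.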
Rerun your replication, this time doing 20 replications and storing the output of replicate() in a variable called values.
values <- replicate(20, sample(dfComps$funding_rounds, 5))
values
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 1 1 3 3 1 1 1 1 1 4 1 2 3 2
## [2,] 1 1 1 2 1 5 1 1 4 1 1 2 1 1
## [3,] 8 1 2 1 12 1 1 1 1 1 1 5 6 1
## [4,] 1 4 2 3 1 2 1 2 1 1 1 2 1 1
## [5,] 1 1 2 2 1 2 1 1 1 2 2 1 6 1
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 2 1 1 3 1 1
## [2,] 1 1 1 2 2 3
## [3,] 1 1 3 5 1 2
## [4,] 1 5 1 3 2 5
## [5,] 1 1 1 1 1 1
Generate a histogram of the means stored in values.
# values holds one sample per column, so colMeans() gives the mean of each sample
hist(colMeans(values))
Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values, and then generate a histogram of values.
values <- replicate(1000, sample(dfComps$funding_rounds, 5))
# Plot the mean of each of the 1000 samples (one sample per column)
hist(colMeans(values))
Repeat the replicated sampling, but this time, raise your sample size from 3 to 22. How does that affect your histogram? Explain in a comment.
values <- replicate(1000, sample(dfComps$funding_rounds, 22))
# Plot the mean of each of the 1000 samples of size 22 (one sample per column)
hist(colMeans(values))
# My histogram is still skewed; I think a larger sample size would bring it closer to a normal shape
Explain in a comment below, the last three histograms, why do they look different?
# The histograms look different because of the sample size we are using. The smaller the sample size, the harder it is to achieve a normal bell curve, which is what we want when looking at quartiles and comparing the data.
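The effect of sample size on the spread of the sample means can be checked numerically. The sketch below uses a simulated right-skewed population rather than the real funding_rounds column; the geometric distribution, its parameter, and all variable names are assumptions chosen only for illustration.

```r
set.seed(123)  # arbitrary seed, used only to make the sketch repeatable

# Simulated right-skewed counts (mostly 1s and 2s with a long tail)
population <- rgeom(10000, prob = 0.6) + 1

# 1000 sample means for each sample size
meansSmall <- replicate(1000, mean(sample(population, 3)))
meansLarge <- replicate(1000, mean(sample(population, 22)))

# The means of the larger samples cluster more tightly around the
# population mean, so their standard deviation is smaller
sd(meansSmall)
sd(meansLarge)
```

Calling hist() on meansSmall and meansLarge shows the same thing visually: the histogram for the larger samples is narrower and more symmetric.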