Intro to Data Science - HW 3

Attribution statement: (choose only one and delete the rest)

# 1. I did this homework by myself, with help from the book and the professor.

Reminders of things to practice from last week:

Make a data frame: data.frame( )
Row index of max/min: which.max( ), which.min( )
Sort values or order rows: sort( ), order( )
Descriptive statistics: mean( ), sum( ), max( )
Conditional statement: if (condition) “true stuff” else “false stuff”

This Week:

Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset to become more useful. As Chapter 6 of Introduction to Data Science mentions, this is called “data munging.” In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.

Part 1: Use read_csv( ) to read a CSV file from the web into a data frame:

  1. Use R code to read directly from a URL on the web. Store the dataset into a new dataframe, called dfComps.
    The URL is:
    https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv
    Hint: use read_csv( ), not read.csv( ). read_csv( ) is from the tidyverse (readr) package. Check the help to compare them (a short comparison sketch follows the output below).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
dfComps <- read_csv("https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv")
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl  (2): funding_rounds, founded_year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
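As the hint in Part 1 notes, read_csv( ) and read.csv( ) behave differently. A minimal comparison sketch, assuming the URL above is still reachable (it re-downloads the same file):

dfBase <- read.csv("https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv")
class(dfBase)   # base R returns a plain data.frame (and, before R 4.0, converted strings to factors)
class(dfComps)  # read_csv( ) returns a tibble (tbl_df), reports a column specification, and never creates factors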

Part 2: Create a new data frame that only contains companies with a homepage URL:

  1. Use subsetting to create a new dataframe that contains only the companies with homepage URLs (store that dataframe in urlComps).
urlComps <- dfComps[!is.na(dfComps$homepage_url), ]   # keep only rows where homepage_url is not missing
  2. How many companies are missing a homepage URL?
dim(urlComps)
## [1] 44435    18
print('47758 - 44435 = 3323 companies are missing a homepage URL')
## [1] "47758 - 44435 = 3323 companies are missing a homepage URL"
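An equivalent tidyverse approach is sketched below; urlComps2 is just an illustrative name, and filter( ) comes from dplyr, which loads with the tidyverse:

urlComps2 <- filter(dfComps, !is.na(homepage_url))   # keep only rows that have a homepage URL
sum(is.na(dfComps$homepage_url))                     # count the missing homepage URLs directly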

Part 3: Analyze the numeric variables in the dataframe.

  1. How many numeric variables does the dataframe have? You can figure that out by looking at the output of str(urlComps).
str(urlComps)
## tibble [44,435 × 18] (S3: tbl_df/tbl/data.frame)
##  $ permalink        : chr [1:44435] "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
##  $ name             : chr [1:44435] "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
##  $ homepage_url     : chr [1:44435] "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
##  $ category_list    : chr [1:44435] "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
##  $ market           : chr [1:44435] "News" "Games" "Publishing" "Electronics" ...
##  $ funding_total_usd: chr [1:44435] "1 750 000" "4 000 000" "40 000" "1 500 000" ...
##  $ status           : chr [1:44435] "acquired" "operating" "operating" "operating" ...
##  $ country_code     : chr [1:44435] "USA" "USA" "EST" "GBR" ...
##  $ state_code       : chr [1:44435] "NY" "CA" NA NA ...
##  $ region           : chr [1:44435] "New York City" "Los Angeles" "Tallinn" "London" ...
##  $ city             : chr [1:44435] "New York" "Los Angeles" "Tallinn" "London" ...
##  $ funding_rounds   : num [1:44435] 1 2 1 1 2 1 1 1 1 1 ...
##  $ founded_at       : chr [1:44435] "1/6/12" NA "26/10/2012" "1/4/11" ...
##  $ founded_month    : chr [1:44435] "2012-06" NA "2012-10" "2011-04" ...
##  $ founded_quarter  : chr [1:44435] "2012-Q2" NA "2012-Q4" "2011-Q2" ...
##  $ founded_year     : num [1:44435] 2012 NA 2012 2011 2012 ...
##  $ first_funding_at : chr [1:44435] "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
##  $ last_funding_at  : chr [1:44435] "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
print("There are 2 numeric variables in urlComps: 'funding_rounds' and 'founded_year'")
## [1] "There are 2 numeric variables in urlComps: 'funding_rounds' and 'founded_year'"
  2. What is the average number of funding rounds for the companies in urlComps?
mean(urlComps$funding_rounds)
## [1] 1.725194
print('The average number of funding rounds for the companies in urlComps was 1.725194')
## [1] "The average number of funding rounds for the companies in urlComps was 1.725194"
  3. What year was the oldest company in the dataframe founded?
    Hint: If you get a value of “NA,” most likely there are missing values in this variable which preclude R from properly calculating the min & max values. You can ignore NAs in basic math calculations. For example, instead of running mean(urlComps$founded_year), something like the commented line below will work for determining the average (note that this question needs to use a different function than ‘mean’).
#mean(urlComps$founded_year, na.rm=TRUE)

# min( ) with na.rm=TRUE skips the missing founded_year values
min(urlComps$founded_year, na.rm=TRUE)
## [1] 1900
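# The oldest company in the dataframe was founded in 1900.

As a small follow-up sketch (tying back to which.min( ) from the reminders list), the name of that oldest company can be looked up as below; which.min( ) skips NA values automatically:

urlComps$name[which.min(urlComps$founded_year)]   # company with the earliest founded_year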

Part 4: Use string operations to clean the data.

  1. The permalink variable in urlComps contains the name of each company but the names are currently preceded by the prefix “/organization/”. We can use str_replace() in tidyverse or gsub() to clean the values of this variable:
urlComps$company <- str_replace(urlComps$permalink,"/organization/","")
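Since the prompt also mentions gsub( ), an equivalent base-R version is sketched below; fixed = TRUE treats the pattern as a literal string rather than a regular expression:

urlComps$company <- gsub("/organization/", "", urlComps$permalink, fixed = TRUE)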
  2. Can you identify another variable which should be numeric but is currently coded as character? Use the as.numeric() function to add a new variable to urlComps which contains the values from the char variable as numbers. Do you notice anything about the number of NA values in this new column compared to the original “char” one?
# The variable that should be numeric but is currently stored as character is funding_total_usd
funding_total_usd_null <- is.na(urlComps$funding_total_usd)
table(funding_total_usd_null) # there are no NA values in the original character version of 'funding_total_usd'
## funding_total_usd_null
## FALSE 
## 44435
urlComps$funding_new <- as.numeric(urlComps$funding_total_usd)
## Warning: NAs introduced by coercion
# Converting the raw character column directly coerces every value it cannot parse (the
# amounts still contain spaces between the digits) to NA, so funding_new contains many
# NAs even though the original character column had none.
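To make that NA comparison concrete, a short sketch that counts the missing values in both columns:

sum(is.na(urlComps$funding_total_usd))   # original character column: no missing values
sum(is.na(urlComps$funding_new))         # numeric column: every value that could not be parsed became NA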
  3. To ensure the char values are converted correctly, we first need to remove the spaces between the digits in the variable. Check if this works, and explain what it is doing:
library(stringi)
urlComps$funding_new <- stri_replace_all_charclass(urlComps$funding_total_usd,"\\p{WHITE_SPACE}", "")
# stri_replace_all_charclass( ) replaces every character in the Unicode WHITE_SPACE
# character class with "" (nothing), stripping the spaces between the digits so the
# amounts can be converted to numbers in the next step.
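An equivalent without the stringi package is sketched below, using base R's gsub( ) with a POSIX whitespace class; it produces the same space-free character strings:

urlComps$funding_new <- gsub("[[:space:]]", "", urlComps$funding_total_usd)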

  4. You are now ready to convert urlComps$funding_new to numeric using as.numeric().

Calculate the average funding amount for urlComps. If you get “NA,” try using the na.rm=TRUE argument from the hint in Part 3.

urlComps$funding_new <- as.numeric(urlComps$funding_new)
mean(urlComps$funding_new, na.rm=TRUE)
# Note: the original chunk converted funding_total_usd again (the version that still
# contains spaces), which coerced most values to NA and produced a misleadingly small
# mean of 402.2857. Converting the space-stripped funding_new column gives the actual
# average funding amount in dollars.

Sample three unique observations from urlComps$funding_rounds, store the results in the vector ‘observations’

observations <- sample(urlComps$funding_rounds, size=3, replace =F)

Take the mean of those observations

mean(observations)
## [1] 4

Do the two steps (sampling and taking the mean) in one line of code

mean(sample(urlComps$funding_rounds, size=3, replace=F))
## [1] 2

Explain why the two means are (or might be) different

# Because sample( ) draws a new random set of three observations each time it is called, the two samples usually contain different values, so their means usually differ as well.
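A note on reproducibility: because sample( ) is random, re-running the chunk gives different results each time; calling set.seed( ) first (the seed value below is arbitrary) makes a particular run repeatable:

set.seed(123)   # arbitrary seed, chosen only for illustration
mean(sample(urlComps$funding_rounds, size = 3, replace = FALSE))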

Use the replicate( ) function to repeat your sampling of three observations of urlComps$funding_rounds observations five times. The first argument to replicate( ) is the number of repeats you want. The second argument is the little chunk of code you want repeated.

replicate(5,sample(urlComps$funding_rounds, size=3, replace=F))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3    1    4    2    1
## [2,]    1    2    2    4    1
## [3,]    1    1    1    4    1

Rerun your replication, this time doing 20 replications and storing the output of replicate() in a variable called values.

values <- replicate(20, mean(sample(urlComps$funding_rounds, size=3, replace=F)))

Generate a histogram of the means stored in values.

hist(values)

Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values, and then generate a histogram of values.

values <- replicate(1000, mean(sample(urlComps$funding_rounds, size=3, replace=F)))
hist(values)

Repeat the replicated sampling, but this time, raise your sample size from 3 to 22. How does that affect your histogram? Explain in a comment.

values <- replicate(1000, mean(sample(urlComps$funding_rounds, size=22, replace=F)))
hist(values)
# With a sample size of 22, the histogram of sample means is narrower and more
# symmetric (bell-shaped) than it was with a sample size of 3.

Explain in a comment below, the last three histograms, why do they look different?

# The three histograms gradually move from a right-skewed distribution toward a roughly normal (bell-shaped) distribution.
# Increasing the number of replications gives a fuller picture of the sampling distribution of the mean, while increasing the
# sample size makes each sample mean less variable, so the distribution of means becomes narrower and closer to normal.
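To quantify what the histograms show, the sketch below compares the spread of the sample means for the two sample sizes (the object names are illustrative):

meansSize3  <- replicate(1000, mean(sample(urlComps$funding_rounds, size = 3,  replace = FALSE)))
meansSize22 <- replicate(1000, mean(sample(urlComps$funding_rounds, size = 22, replace = FALSE)))
sd(meansSize3)    # means of small samples vary a lot
sd(meansSize22)   # means of larger samples vary less, which is why that histogram is narrower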