Intro to Data Science - HW 3

Attribution statement:

# 2. I did this homework with help from the book and the professor and these Internet sources:
# https://discuss.analyticsvidhya.com/t/how-to-count-the-missing-value-in-r/2949
# https://stackoverflow.com/questions/1508889/how-to-count-number-of-numeric-values-in-a-column
# https://www.tutorialspoint.com/r/r_lists.htm
# https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_replace
# https://statisticsglobe.com/warning-message-nas-introduced-by-coercion-in-r
# https://www.dummies.com/programming/r/how-to-take-samples-from-data-in-r/

Reminders of things to practice from last week:

Make a data frame data.frame( )
Row index of max/min which.max( ) which.min( )
Sort value or order rows sort( ) order( )
Descriptive statistics mean( ) sum( ) max( )
Conditional statement if (condition) “true stuff” else “false stuff”
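
A tiny refresher sketch tying those together (the data values here are made up purely for illustration):

practice <- data.frame(name = c("A", "B", "C"), score = c(3, 7, 5))
which.max(practice$score)                      # row index of the largest score
practice[order(practice$score), ]              # rows sorted by score
mean(practice$score)                           # a descriptive statistic
if (max(practice$score) > 5) "true stuff" else "false stuff"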

This Week:

Often, when you get a dataset, it is not in the format you want. You can (and should) use code to refine the dataset into a more useful form. As Chapter 6 of Introduction to Data Science mentions, this is called “data munging.” In this homework, you will read in a dataset from the web and work on it (in a data frame) to improve its usefulness.

Part 1: Use read_csv( ) to read a CSV file from the web into a data frame:

  1. Use R code to read directly from a URL on the web. Store the dataset into a new dataframe, called dfComps.
    The URL is:
    https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv
    Hint: use read_csv( ), not read.csv( ). This is from the tidyverse package. Check the help to compare them.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ readr   2.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
dfComps <- data.frame(read_csv('https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv'))
## Rows: 47758 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): permalink, name, homepage_url, category_list, market, funding_tota...
## dbl  (2): funding_rounds, founded_year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(dfComps)
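
As the message above notes, the column-type printout can be silenced; an optional variant of the same read (assuming the quieter call is acceptable for this homework):

dfComps <- data.frame(read_csv('https://intro-datascience.s3.us-east-2.amazonaws.com/companies1.csv', show_col_types = FALSE))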

Part 2: Create a new data frame that only contains companies with a homepage URL:

  1. Use subsetting to create a new dataframe that contains only the companies with homepage URLs (store that dataframe in urlComps).
urlComps <- dfComps[!is.na(dfComps$homepage_url), ]
View(urlComps)
  2. How many companies are missing a homepage URL?
sum(is.na(dfComps$homepage_url))
## [1] 3323
nrow(dfComps) - nrow(urlComps)
## [1] 3323
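
An equivalent tidyverse way to build the same urlComps, as a sketch using dplyr::filter():

urlComps <- dfComps %>% filter(!is.na(homepage_url))
nrow(urlComps)   # number of companies that do have a homepage URL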

Part 3: Analyze the numeric variables in the dataframe.

  1. How many numeric variables does the dataframe have? You can figure that out by looking at the output of str(urlComps).

  2. What is the average number of funding rounds for the companies in urlComps?

str(dfComps)
## 'data.frame':    47758 obs. of  18 variables:
##  $ permalink        : chr  "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
##  $ name             : chr  "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
##  $ homepage_url     : chr  "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
##  $ category_list    : chr  "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
##  $ market           : chr  "News" "Games" "Publishing" "Electronics" ...
##  $ funding_total_usd: chr  "1 750 000" "4 000 000" "40 000" "1 500 000" ...
##  $ status           : chr  "acquired" "operating" "operating" "operating" ...
##  $ country_code     : chr  "USA" "USA" "EST" "GBR" ...
##  $ state_code       : chr  "NY" "CA" NA NA ...
##  $ region           : chr  "New York City" "Los Angeles" "Tallinn" "London" ...
##  $ city             : chr  "New York" "Los Angeles" "Tallinn" "London" ...
##  $ funding_rounds   : num  1 2 1 1 2 1 1 1 1 1 ...
##  $ founded_at       : chr  "1/6/12" NA "26/10/2012" "1/4/11" ...
##  $ founded_month    : chr  "2012-06" NA "2012-10" "2011-04" ...
##  $ founded_quarter  : chr  "2012-Q2" NA "2012-Q4" "2011-Q2" ...
##  $ founded_year     : num  2012 NA 2012 2011 2012 ...
##  $ first_funding_at : chr  "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
##  $ last_funding_at  : chr  "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
colSums(!is.na(dfComps))
##         permalink              name      homepage_url     category_list 
##             47758             47758             44435             42464 
##            market funding_total_usd            status      country_code 
##             42465             47758             46029             42634 
##        state_code            region              city    funding_rounds 
##             29147             42634             41630             47758 
##        founded_at     founded_month   founded_quarter      founded_year 
##             37328             37255             37255             37255 
##  first_funding_at   last_funding_at 
##             47758             47758
mean(dfComps$funding_rounds)
## [1] 1.688576
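Question 1 can also be answered directly in code; the read_csv() column specification and the str() output above both show two numeric (dbl) variables, funding_rounds and founded_year. A sketch that counts them:

sum(sapply(dfComps, is.numeric))   # 2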
  3. What year was the oldest company in the dataframe founded?
    Hint: If you get a value of “NA,” most likely there are missing values in this variable, which prevent R from calculating the min and max values. You can tell basic math functions to ignore NAs. For example, instead of running mean(urlComps$founded_year), something like the call below will work for determining the average (note that this question needs a different function than ‘mean’).
mean(dfComps$founded_year, na.rm=TRUE)
## [1] 2007.247
min(dfComps$founded_year, na.rm=TRUE)
## [1] 1900
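
To also see which company that is, which.min() from last week's reminders could be used (a sketch; which.min() skips NA values):

dfComps[which.min(dfComps$founded_year), c("name", "founded_year")]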

Part 4: Use string operations to clean the data.

  1. The permalink variable in urlComps contains the name of each company but the names are currently preceded by the prefix “/organization/”. We can use str_replace() in tidyverse or gsub() to clean the values of this variable:
dfComps$NewPermalink <- str_replace(dfComps$permalink, '/organization/', '')
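The prompt also mentions gsub( ); a base-R sketch of the same clean-up:

dfComps$NewPermalink <- gsub("/organization/", "", dfComps$permalink)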
  2. Can you identify another variable which should be numeric but is currently coded as character? Use the as.numeric() function to add a new variable to urlComps which contains the values from the char variable as numbers. Do you notice anything about the number of NA values in this new column compared to the original “char” one?
str(dfComps)
## 'data.frame':    47758 obs. of  19 variables:
##  $ permalink        : chr  "/organization/waywire" "/organization/tv-communications" "/organization/rock-your-paper" "/organization/in-touch-network" ...
##  $ name             : chr  "#waywire" "&TV Communications" "'Rock' Your Paper" "(In)Touch Network" ...
##  $ homepage_url     : chr  "http://www.waywire.com" "http://enjoyandtv.com" "http://www.rockyourpaper.org" "http://www.InTouchNetwork.com" ...
##  $ category_list    : chr  "|Entertainment|Politics|Social Media|News|" "|Games|" "|Publishing|Education|" "|Electronics|Guides|Coffee|Restaurants|Music|iPhone|Apps|Mobile|iOS|E-Commerce|" ...
##  $ market           : chr  "News" "Games" "Publishing" "Electronics" ...
##  $ funding_total_usd: chr  "1 750 000" "4 000 000" "40 000" "1 500 000" ...
##  $ status           : chr  "acquired" "operating" "operating" "operating" ...
##  $ country_code     : chr  "USA" "USA" "EST" "GBR" ...
##  $ state_code       : chr  "NY" "CA" NA NA ...
##  $ region           : chr  "New York City" "Los Angeles" "Tallinn" "London" ...
##  $ city             : chr  "New York" "Los Angeles" "Tallinn" "London" ...
##  $ funding_rounds   : num  1 2 1 1 2 1 1 1 1 1 ...
##  $ founded_at       : chr  "1/6/12" NA "26/10/2012" "1/4/11" ...
##  $ founded_month    : chr  "2012-06" NA "2012-10" "2011-04" ...
##  $ founded_quarter  : chr  "2012-Q2" NA "2012-Q4" "2011-Q2" ...
##  $ founded_year     : num  2012 NA 2012 2011 2012 ...
##  $ first_funding_at : chr  "30/06/2012" "4/6/10" "9/8/12" "1/4/11" ...
##  $ last_funding_at  : chr  "30/06/2012" "23/09/2010" "9/8/12" "1/4/11" ...
##  $ NewPermalink     : chr  "waywire" "tv-communications" "rock-your-paper" "in-touch-network" ...
# funding_total_usd should be numeric and not char
dfComps$Funds <- as.numeric(dfComps$funding_total_usd)
## Warning: NAs introduced by coercion
# Nearly all of the values become NA because the amounts contain spaces (e.g., "1 750 000"), which as.numeric() cannot parse.
sum(is.na(dfComps$Funds))
## [1] 47721
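To see why only 37 of the 47,758 values converted, one could inspect the original strings that did parse; presumably these are the few amounts written without space separators (output not shown):

head(dfComps$funding_total_usd[!is.na(dfComps$Funds)])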
  3. To ensure the char values are converted correctly, we first need to remove the spaces between the digits in the variable. Check if this works, and explain what it is doing:
library(stringi)
dfComps$funding_total_usd_new <- stri_replace_all_charclass(dfComps$funding_total_usd,"\\p{WHITE_SPACE}", "")
# stri_replace_all_charclass() replaces every character in the Unicode WHITE_SPACE class with the empty string, removing the spaces between the digits so the values can be converted to numeric.
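
A quick check on a single value (the first funding amount shown by str() above):

stri_replace_all_charclass("1 750 000", "\\p{WHITE_SPACE}", "")   # "1750000"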

  4. You are now ready to convert urlComps$funding_new to numeric using as.numeric().

Calculate the average funding amount for urlComps. If you get “NA,” try using the na.rm=TRUE argument, as in Part 3.

dfComps$funding_total_usd_new <- as.numeric(dfComps$funding_total_usd_new)
## Warning: NAs introduced by coercion
mean(dfComps$funding_total_usd_new, na.rm = TRUE)
## [1] 17820092

Sample three unique observations from urlComps$funding_rounds and store the results in the vector ‘observations’

v <- sample(dfComps$funding_rounds, 5)
v2 <- sample(dfComps$funding_rounds, 10)
v3 <- sample(dfComps$funding_rounds, 15)
v
## [1] 1 1 1 7 2
v2 
##  [1] 1 1 1 3 1 1 1 2 1 1
v3
##  [1] 1 1 1 2 2 2 1 1 1 1 1 1 3 2 1
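
The prompt asks for three observations stored in a vector called ‘observations’; a minimal version of that step (output not shown):

observations <- sample(dfComps$funding_rounds, 3)
observations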

Take the mean of those observations

mean(v)
## [1] 2.4
mean(v2)
## [1] 1.3
mean(v3)
## [1] 1.4

Do the two steps (sampling and taking the mean) in one line of code

mean(sample(dfComps$funding_rounds, 5))
## [1] 2.2

Explain why the two means are (or might be) different

# sample() picks elements at random, so each call returns a different set of 5 values, and the mean of each random sample will usually differ.

Use the replicate( ) function to repeat your sampling of three observations of urlComps$funding_rounds five times. The first argument to replicate( ) is the number of repeats you want. The second argument is the little chunk of code you want repeated.

replicate(5, sample(dfComps$funding_rounds, 5))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    1    1    1    1
## [2,]    1    6    1    6    2
## [3,]    1    1    1    1    1
## [4,]    1    1    2    2    7
## [5,]    1    1    1    1    1

Rerun your replication, this time doing 20 replications and storing the output of replicate() in a variable called values.

values <- replicate(20, sample(dfComps$funding_rounds, 5))
values
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    1    3    3    1    1    1    1    1     4     1     2     3     2
## [2,]    1    1    1    2    1    5    1    1    4     1     1     2     1     1
## [3,]    8    1    2    1   12    1    1    1    1     1     1     5     6     1
## [4,]    1    4    2    3    1    2    1    2    1     1     1     2     1     1
## [5,]    1    1    2    2    1    2    1    1    1     2     2     1     6     1
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]     2     1     1     3     1     1
## [2,]     1     1     1     2     2     3
## [3,]     1     1     3     5     1     2
## [4,]     1     5     1     3     2     5
## [5,]     1     1     1     1     1     1

Generate a histogram of the means stored in values.

hist(values)

Rerun your replication, this time doing 1000 replications and storing the output of replicate() in a variable called values, and then generate a histogram of values.

values <- replicate(1000, sample(dfComps$funding_rounds, 5))
hist(values)

Repeat the replicated sampling, but this time, raise your sample size from 3 to 22. How does that affect your histogram? Explain in a comment.

values <- replicate(1000, sample(dfComps$funding_rounds, 22))
hist(values)

# My histogram still looks skewed. hist(values) plots every raw sampled value, so it keeps the skewed shape of funding_rounds no matter how big the sample is; it is the distribution of the sample means that becomes more normal as the sample size grows (see the sketch below).
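
Since values holds the raw sampled values (one column per replication), the histogram of the means that the prompt asks about could be drawn from the column means; a minimal sketch:

sampleMeans <- colMeans(values)   # one mean per replication
hist(sampleMeans)                 # this distribution looks more bell-shaped as the sample size grows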

Explain in a comment below why the last three histograms look different.

# The histograms look different mainly because of the sample size (and number of replications) used in each one. Plotting the raw values keeps the skewed shape of funding_rounds itself; it is the distribution of the sample means (see the sketch above) that gets closer to a normal bell curve as the sample size grows, which is what we want if we want to look at quartiles and compare the data.