Getting and Handling Large Data Sets

GET THE DATA

You will acquire and analyze a real dataset on baby name popularity provided by the Social Security Administration. To warm up, we will ask you a few simple questions that can be answered by inspecting the data.

The data can be downloaded in zip format from: http://www.ssa.gov/oact/babynames/state/namesbystate.zip (~22MB)
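
A minimal sketch of fetching and unpacking the archive from within R, assuming network access is available. The helper name fetch_namesbystate and the default destination folder are hypothetical and not part of the original assignment; download.file, unzip and list.files are base R functions.

```r
# Hypothetical helper: download the SSA archive and unpack it.
# The default dest_dir is an assumption; point it wherever you keep the data.
fetch_namesbystate <- function(
    url = "http://www.ssa.gov/oact/babynames/state/namesbystate.zip",
    dest_dir = "namesbystate") {
  zip_path <- file.path(tempdir(), basename(url))
  # mode = "wb" keeps the binary zip intact on Windows
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
  # Return the paths of the per-state data files
  list.files(dest_dir, pattern = "\\.TXT$", full.names = TRUE)
}

# Not run here because it needs network access:
# state_files <- fetch_namesbystate()
```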

Part 1: Loading the data

The data are separated into multiple files, one per state. I read every file in the folder whose name ends with .TXT and combine them into a single data frame.

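# Folder containing the unzipped state files (adjust to your machine)
setwd("~/Documents/scu/Winter 2017/Machine Learning/Homework/namesbystate")
# Match only names ending in .TXT: the dot is escaped and the pattern is anchored
files <- list.files(pattern = "\\.TXT$")
# Read every file, then bind once instead of growing df inside a loop
df <- do.call(rbind, lapply(files, read.table, sep = ","))
colnames(df) <- c("state", "gender", "year", "name", "occurences")
# row.names = FALSE avoids an extra index column when the file is read back
write.csv(df, "allstate.csv", row.names = FALSE)
df <- read.csv("allstate.csv")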
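
One detail of the file filter is worth illustrating. grepl treats its pattern as a regular expression, so an unescaped dot matches any character and an unanchored pattern can match anywhere in a file name. The sketch below compares a loose and a strict pattern; the file names are made up purely for illustration.

```r
# Hypothetical file names, for illustration only
candidates <- c("CA.TXT", "NOTES_TXT_backup.csv", "allstate.csv")

# Unescaped, unanchored pattern: also matches "_TXT" in the middle of a name
loose <- grepl(".TXT", candidates)

# Escaped dot plus end-of-string anchor: matches only names ending in ".TXT"
strict <- grepl("\\.TXT$", candidates)

print(data.frame(file = candidates, loose = loose, strict = strict))
```

list.files accepts the same kind of regular expression through its pattern argument, so the filtering can happen when the folder is listed.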

What is the most gender-ambiguous name in 2013? In 1945?

I came up with an ambiguity metric (AM) built from two factors. The first factor is the ratio of a name's occurrences between the two genders, computed as the smaller count divided by the larger one, so it lies between 0 and 1. The higher the ratio, the more evenly the name is split and the more ambiguous it is. The second factor captures the popularity of the name, measured as its total occurrences across both genders. AM is the product of the two factors: the higher the AM, the more ambiguous the name. Only names with a ratio of at least 0.75 are considered, and the most gender-ambiguous name is the one with the highest AM among them.
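
To make the arithmetic concrete, here is a small sketch of the metric for a single name. The function name ambiguity_metric and its arguments are hypothetical. The counts 1539 and 1305 are not read from the data; they are back-calculated from the total (2844) and ratio (0.8479532) printed for Charlie in the 2013 results further down, and that output does not say which count belongs to which gender.

```r
# Hypothetical helper: ambiguity metric for one name, given its two gender counts
ambiguity_metric <- function(count_a, count_b) {
  ratio <- min(count_a, count_b) / max(count_a, count_b)  # balance, between 0 and 1
  total <- count_a + count_b                              # popularity
  total * ratio
}

# Counts back-calculated from the 2013 output for Charlie (an assumption, see above)
ambiguity_metric(1539, 1305)   # about 2411.579

# A name with the same total but a lopsided split scores much lower (made-up counts)
ambiguity_metric(2800, 44)
```

Because the ratio uses the smaller count over the larger one, the result does not depend on the order of the two arguments.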

Based on this definition, the most gender-ambiguous name is Charlie in 2013 and Leslie in 1945.

library(data.table)
ambiguous_func <- function(yr) {
  DT <- data.table(df)
  # Total occurrences per name and gender across all states for the given year
  counts <- DT[year == yr, .(n = sum(occurences)), by = .(name, gender)]
  # Keep only names used for both genders; for each, compute the combined
  # total and the ratio of the smaller gender count to the larger one
  temp <- counts[, if (.N > 1) .(total_occurences = sum(n),
                                 min_ratio = min(n) / max(n)),
                 by = name]
  # Ambiguity metric: popularity times balance
  temp[, ambiguity := total_occurences * min_ratio]
  # Require the two gender counts to be reasonably balanced
  temp <- temp[min_ratio >= 0.75]
  # Report the name with the highest ambiguity metric
  result <- temp[which.max(ambiguity)]
  print(result)
}

ambiguous_func(2013)
##       name total_occurences min_ratio ambiguity
## 1: Charlie             2844 0.8479532  2411.579
ambiguous_func(1945)
##      name total_occurences min_ratio ambiguity
## 1: Leslie             3654 0.8491903  3102.941