Final Project Planning Document

Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, like image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Thursday, July 5.


The dataset used will be from a previous project, located here: Indeed Web Scraping. That project scrapes job postings from Indeed and stores them on an AWS PostgreSQL server in order to determine which skills are most important for aspiring data scientists. For the final recommender systems project, I would like to expand this dataset to include jobs outside of data science as well.


My goal is to create a content-based recommender system for both jobs and skills, using the job title, company, and job description of each posting as features. I am considering one or both of the following options:

  • Allow users to upload a resume and provide job recommendations
  • Allow users to input skills and provide additional skills to learn based on job posting requirements
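
As a rough illustration of how the two options above might work, here is a minimal sketch in plain Python. It assumes TF-IDF weighting with cosine similarity for matching a resume to postings, and simple co-occurrence counts for suggesting skills. Neither technique is settled for this project, and the function names, field names (`title`, `company`, `description`), and sample postings are all hypothetical placeholders, not data from the Indeed scrape.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_jobs(resume_text, postings, top_n=3):
    # Option 1: rank postings by similarity between resume text and posting text.
    # Each posting is assumed to be a dict with title, company, and description.
    docs = [" ".join((p["title"], p["company"], p["description"])) for p in postings]
    vectors = tfidf_vectors(docs + [resume_text])  # resume shares the vocabulary
    resume_vec = vectors[-1]
    scored = [(cosine(resume_vec, v), p["title"]) for v, p in zip(vectors, postings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def recommend_skills(known_skills, postings_skills, top_n=3):
    # Option 2: suggest skills that co-occur with the user's skills in postings.
    known = set(known_skills)
    counts = Counter()
    for skills in postings_skills:
        skills = set(skills)
        if known & skills:  # posting asks for at least one known skill
            counts.update(skills - known)
    return [skill for skill, _ in counts.most_common(top_n)]


# Hypothetical sample data, for illustration only.
sample_postings = [
    {"title": "Data Scientist", "company": "Acme",
     "description": "python machine learning statistics sql"},
    {"title": "Nurse", "company": "General Hospital",
     "description": "patient care clinical nursing"},
    {"title": "Data Engineer", "company": "Beta",
     "description": "python sql spark pipelines"},
]
sample_skills = [["python", "sql", "spark"], ["python", "sql"], ["nursing", "cpr"]]

job_matches = recommend_jobs("python machine learning statistics", sample_postings)
skill_matches = recommend_skills(["python"], sample_skills, top_n=2)
```

In a real version the posting text would come from the PostgreSQL tables rather than an in-memory list, and the skill lists would have to be extracted from the job descriptions first. A library vectorizer or Spark could replace the hand-written TF-IDF once the dataset grows beyond data science postings.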