For the last blog entry, I’d like to talk about the contents of DATA 621, its place in the MSDS curriculum, and its applications to a data science career. Looking back at the syllabus, this class aims to make the business case for simple machine learning, specifically supervised learning techniques. In a typical case, a business is examining an important outcome and wants to understand it better. Often that outcome is revenue or cost, but it can also be studied indirectly by modeling customer churn, the number of claims, or some other quality measure. An initial model should offer a good description of the variables that drive the outcome, and it can potentially be extended to generate predictions and inform a business decision.
Going into this class, I already knew about linear regression techniques and their applications to dose-response curves and to measuring clinical trial outcomes with logistic regression and survival analysis. One especially beneficial part of 621 was building from the foundation of simple linear models toward generalized linear modeling techniques. In cases where outcomes are rare, or small and countable, I now understand that count regression methods apply. The class also offered a good review of measuring model performance by interpreting confusion matrices and ROC curves. Finally, some of the homework datasets, along with the final project, gave me practice handling messy dataframes.
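To make those ideas a little more concrete, here is a minimal sketch in Python (using statsmodels and scikit-learn on synthetic, made-up data, purely for illustration) of fitting a Poisson count regression and then reviewing a binary classifier with a confusion matrix and ROC AUC:

```python
# A minimal, illustrative sketch: fit a Poisson regression on synthetic counts,
# then score a made-up binary classifier with a confusion matrix and ROC AUC.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)

# Count regression: a small, countable outcome (think "number of claims")
x = rng.normal(size=200)
counts = rng.poisson(lam=np.exp(0.3 + 0.5 * x))
X = sm.add_constant(x)
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())

# Performance review: confusion matrix and ROC AUC for a binary outcome
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.6 * y_true + rng.normal(0.2, 0.3, size=200), 0, 1)
y_pred = (y_score >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```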
Some topics were briefly mentioned but could certainly be sections of another class. Handling missing values requires a lot of decision-making that affects the model’s outcome. Many of the datasets we considered also included zero-inflated variables or features with clustered values. I’ll definitely continue to work on developing an intuition for using all the information available in a dataset. There’s also something to be said about identifying which variables should be added to the final model. Model selection involves a clear bias-variance tradeoff, but even at a fundamental level there are some characteristics that are difficult, or unethical, to use as explanatory variables. Some things in real life are difficult to measure with any meaningful precision, especially if there’s a financial incentive not to be truthful. And in Homework 3, which used a dataset for predicting crime in Boston neighborhoods, the assignment dropped a variable from the original data indicating the proportion of African Americans living in each community. Even if a characteristic like this improves model performance, what’s the likelihood that it would be misused or misinterpreted? Considering it brings back that old refrain from an ethical angle: ‘correlation doesn’t imply causation.’
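As a small example of how consequential those missing-value decisions are, here is a tiny Python sketch (pandas, with entirely made-up column names and values) contrasting two common choices, each of which changes what the model ultimately sees:

```python
# Two common (and consequential) choices for missing values; data are made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [42_000, np.nan, 58_000, 61_000, np.nan, 39_000],
    "claims": [0, 2, 1, 0, 3, 0],
})

# Option 1: drop incomplete rows -- simple, but shrinks (and can bias) the sample
dropped = df.dropna()

# Option 2: mean imputation -- keeps every row, but understates the variance
imputed = df.assign(income=df["income"].fillna(df["income"].mean()))

print(dropped.shape, imputed.shape)
```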
The next two courses I’m taking through CUNY this fall are Math Modeling Techniques and Knowledge and Visual Analytics. One thing I struggled with in this course was generating meaningful summaries and graphs of models. I found myself depending heavily on the same four or five techniques for detecting bias and measuring performance in a model. Clearly some techniques like generating histograms are invaluable, but a graph that can capture the multidimensionality of a process really elevates my own understanding, and it makes it much easier to communicate the same idea to others. And with respect to modeling techniques, I’m looking forward to learning more about hierarchical models and unsupervised machine learning techniques.
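As a rough illustration of the kind of diagnostic graphs I’d like to lean on more, here is a short Python sketch (matplotlib and pandas on synthetic data, with invented variable names) plotting residuals against fitted values and a scatterplot matrix of the features:

```python
# Illustrative diagnostics on synthetic data: residuals vs. fitted values,
# plus a scatterplot matrix for a rough multidimensional view of the features.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(scale=0.5, size=200)

# Ordinary least squares by hand, just to get fitted values and residuals
features = df[["x1", "x2", "x3"]].assign(const=1.0)
coefs, *_ = np.linalg.lstsq(features, df["y"], rcond=None)
fitted = features @ coefs
residuals = df["y"] - fitted

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(fitted, residuals, s=10)
ax.axhline(0, color="grey", linewidth=1)
ax.set(title="Residuals vs. fitted", xlabel="fitted value", ylabel="residual")

pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()
```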