The syllabus below is for Fall 2022. It is being revamped to incorporate the latest AI, including LLMs such as ChatGPT. Check back in a couple of months.
☛ Introduction
We live in a world where, in every aspect of our daily lives, from the way we work, shop, communicate, and socialize, we are both consuming and creating vast amounts of information. More often than not, these daily activities leave a trail of digitized data that is stored, mined, and analyzed by firms. There is, however, a growing gulf between 'data scientists' (statisticians/computer scientists) and business managers, aggravated further by the plethora of machine learning algorithms and the technical jargon surrounding them. The objectives of this course are to bridge this gap by (1) exposing you to the language of data, (2) providing hands-on experience with cutting-edge tools and techniques used in practice, and (3) covering a wide variety of empirical contexts to instill an intuition for D3M (data-driven decision making), i.e. how to generate insights from volumes of data.
☛ Class Notes
Topic 1: Language of Data
Our first topic covers modern developments in data structures, storage, and ETL operations (querying and using data for BI or AI). This session provides an overview of the current state-of-the-art players in this space, including the Big 3 cloud providers (AWS, GCP, Azure) as well as strong challengers such as Snowflake.
The focus of this session is on understanding data pipelines (storage, retrieval) as well as data structures and variable types. We will go over the linguistics of data analytics, which includes understanding: (1) types of data (e.g. cross-sectional vs. panel, time-series, geo-spatial, etc.); (2) merging, splitting, and reshaping data files (long vs. wide data); (3) types of variables (e.g. numeric, categorical, text, images); (4) common variable transformations (e.g. binning, log); and (5) exploring relationships between variables using intuitive visualizations/dashboards. We will also look at the basic functioning of two simple yet powerful software packages (JMP & Tableau), showing how to achieve complex data retrieval/scraping, summarization, and insightful BI dashboards.
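To make the long vs. wide distinction concrete, here is a minimal sketch in Python using pandas. It is illustrative only (in class we will do this point-and-click in JMP & Tableau), and all data and column names are made up:

```python
import numpy as np
import pandas as pd

# Toy panel data in long format: one row per (store, quarter) observation.
long_df = pd.DataFrame({
    "store":   ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [120.0, 135.0, 98.0, 110.0],
})

# Long -> wide: one row per store, one column per quarter.
wide_df = long_df.pivot(index="store", columns="quarter", values="sales")

# Wide -> long again, recovering the original shape.
back_to_long = wide_df.reset_index().melt(
    id_vars="store", var_name="quarter", value_name="sales"
)

# Two common variable transformations mentioned above:
long_df["log_sales"] = np.log(long_df["sales"])          # log transform
long_df["sales_bin"] = pd.cut(long_df["sales"], bins=3)  # binning into 3 groups
```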
Topic 2: Quantifying Metrics
Any phenomenon we study (e.g. sales of our product, click-through rates for our ad campaign) depends on a large number of factors. For example, sales may depend on price, advertising, competitor pricing, etc. Regression-based models help us isolate the impact of different variables, i.e. provide marginal estimates. They allow us to ask questions (with caveats) such as: how much additional web traffic can I expect if I were to spend $X on a Facebook ad, holding constant other factors that may impact web traffic? This topic looks at methods that help us quantify metrics for decision-making. We will look at two main approaches:
Experiments & Causal Inference: Carefully designed experiments are regarded as the "gold standard" for making causal inferences. We will discuss the design of experiments and work through several hands-on case studies to illustrate the ideas of controlled experiments, A/B testing, and "natural" experiments. We will also look at some newer ideas and algorithms beyond A/B testing (e.g. heterogeneous treatment effects, counterfactual or what-if analysis). In addition, we will discuss how to develop careful regression models from field data, including common transformations (e.g. log-log regressions, as sketched below), interpreting own- and cross-price elasticities, and adding controls for extraneous factors such as trends and seasonality. This module includes a large-scale case study on developing optimal pricing strategies using output from regression models.
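As a preview of the log-log idea, here is a minimal sketch using statsmodels on simulated data. It is not the class case study; the data, the true elasticity of -2.0, and all names are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate weekly sales data with a known own-price elasticity of -2.0.
rng = np.random.default_rng(0)
n = 200
price = rng.uniform(1.0, 5.0, n)
sales = np.exp(8.0 - 2.0 * np.log(price) + rng.normal(0.0, 0.1, n))
df = pd.DataFrame({"sales": sales, "price": price})

# Log-log regression: the slope on log(price) is the own-price elasticity,
# i.e. the % change in sales for a 1% change in price.
model = smf.ols("np.log(sales) ~ np.log(price)", data=df).fit()
print(model.params)  # the log(price) coefficient should be close to -2.0
```

The key takeaway: because both sides are in logs, the slope reads directly as an elasticity rather than a unit change.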
Topic 3: Language of Algorithms
This final topic covers a plethora of ML algorithms that can broadly be classified into 'supervised' and 'unsupervised' learning.
Predictive Models (Supervised Learning): A large component of analytics is focused on developing predictive models. These range from simple forecasting tools to advanced machine learning algorithms that are implemented in real time (e.g. deciding what ad to show to a customer). This topic provides an overview of core ideas in predictive modeling such as training/test data, popular algorithms for variable selection, and modeling techniques such as logistic regression, decision trees, and ensemble meta-algorithms that combine several machine learning algorithms into one predictive model. We will also implement AutoML (automated algorithms that pick the best model for a given context). Finally, we will look at dimension-reduction techniques such as PCA, t-SNE, and UMAP, along with clustering algorithms, which are broadly classified as Unsupervised Learning. The goal of these algorithms is to detect patterns or groupings in our data. Unlike predictive modeling, we don't necessarily have a 'dependent' variable; rather, the primary objective is exploratory: to find hidden patterns or groupings in the data. A sketch of both flavors follows below.
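Here is a minimal sketch of both flavors using scikit-learn on synthetic data. It is illustrative only, not the exact workflow or data used in class:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic data standing in for, say, "will this customer click the ad?"
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Supervised: hold out a test set so the model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Unsupervised: ignore y entirely, reduce to 2 dimensions with PCA,
# then look for groupings with k-means.
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print("cluster sizes:", [(labels == k).sum() for k in range(3)])
```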
Unstructured Data & Deep Learning Algorithms
This last module looks at the latest and greatest in algorithms (mainly deep learning) driving the modern AI revolution. From common tasks such as language translation and generating music or art, to significant breakthroughs in medicine and driverless cars, the combination of data, algorithms, and computing power is disrupting many sectors of the economy. This session provides a comprehensive overview of the latest algorithms and their implementation on various cloud platforms. We will also work with several data sets (customer reviews, sentiment analysis, image mining) to understand the inner workings and practical uses of these algorithms.
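As a small taste of what pre-trained deep learning models can do, here is a minimal sentiment-analysis sketch. It assumes the Hugging Face transformers library, which is one of many ways to run such models and not necessarily the cloud workflow we will use in class; the reviews are made up:

```python
from transformers import pipeline

# A general-purpose pre-trained sentiment model; weights download on first run.
classify = pipeline("sentiment-analysis")

reviews = [
    "The product arrived quickly and works perfectly.",
    "Terrible experience, it broke after two days.",
]
for review, result in zip(reviews, classify(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)
```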
🎯 Maximizing your Learning
This class usually has a diverse student mix in terms of technical background and career goals/interests. The class covers a full spectrum of topics in analytics, from visual BI tools to the algorithms fueling the AI revolution. Although some of the material covered in class is fairly technical, actual implementation of even the most advanced algorithms has become automated and almost trivial. How deep you want to go into a specific topic/method should be based on your goals and interests. Regardless of your current comfort level with data/technology/statistics, it is my objective to make sure that everyone gets a good grasp of the key concepts and ideas, and a practical grounding in working with a variety of large-scale data.
Finally, there are numerous learning resources available online (for free) that we should exploit. Most of the class notes are annotated with web links for further exploration of ideas/concepts/tools. I strongly encourage you to treat Google as your friend in this learning journey.
I look forward to meeting everyone.