class: center, middle, title-slide

.title[
# Introduction
]
.subtitle[
## JSC 370: Data Science II
]
.date[
### January 6, 2025
]

---

# Instructors

- Meredith Franklin: meredith.franklin@utoronto.ca
- Evelyn Pan, Sajal Bhalla

I will do the lectures on Mondays and the TAs will run the labs on Wednesdays.

---

# My Background

- In late 2021 I moved from Los Angeles, where I was an Assistant/Associate Professor of Biostatistics at the University of Southern California
- Originally from Canada: McGill math for BSc, Ottawa/Carleton Institute of Math for MSc, Harvard for PhD, UChicago for postdoc
- At U of T I'm an Associate Professor with tenure in the Department of Statistical Science (51%) and the School of the Environment (49%)
- I am the Data Science Concentration Lead for the MScAC
- I am on the executive committee for the U of T Data Science Institute

---

# My Teaching

- Founded a Master's of Health Data Science program at USC that launched in 2020
- Co-taught the introductory data science course
- Taught graduate-level spatial statistics, inference, linear models
- At U of T I have also taught STA255, STA465/STA2016

---

# My Research

- Spatial statistical methods for environmental data
- Data science techniques for remote sensing data/imagery
- Focus on pollution (air, noise) and climate (GHG, land cover change)
- Machine learning is becoming a big part of environmental research

<img src="data:image/png;base64,#img/research_fig.png" width="80%" style="display: block; margin: auto;" />

---

# Course Goals

Through this course, you will hone the techniques used in Data Science.
You will learn:

- Programming in R (and Python for ML), plus tools such as Markdown and Git
- Exploratory data analysis – generating hypotheses and building intuition
- Data visualization – showing data through interpretable summaries
- Data collection – data scraping, wrangling, cleaning
- Statistical (machine learning) algorithms
- Building a github.io website

---

# Quercus + Git + Piazza

Course website - lecture slides, labs, data

https://jsc370.github.io/JSC370-2025/

Quercus - announcements, homework solutions, lab solutions, guest speaker reflections, grading

https://q.utoronto.ca/courses/383329

Piazza - questions and discussion

https://piazza.com/class/m5lck963uic6at

---

# What is data science?

- Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.

--

<img src="data:image/png;base64,#img/data-science.png" width="90%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#img/data-science-drew-conway.jpg" width="90%" style="display: block; margin: auto;" />

---

# Data science can be really cool

<figure align="center">
<img src="https://imgs.xkcd.com/comics/regular_expressions.png" style="width:450px">
<figcaption>Source: https://xkcd.com/208/</figcaption>
</figure>

---

# With great power comes great responsibility

<figure align="center">
<img src="https://imgs.xkcd.com/comics/extrapolating.png" style="width:500px">
<figcaption>Source: https://xkcd.com/605/</figcaption>
</figure>

---

# Data Scientists in Demand

See [this ASA news item](https://www.amstat.org/news-listing/2021/10/11/new-report-highlights-growing-demand-for-data-science-analytics-talent), [this Forbes article](https://www.forbes.com/sites/gilpress/2021/06/27/salaries-and-job-opportunities-for-data-scientists-continue-to-rise/), and [this Glassdoor research post](https://www.glassdoor.com/research/data-scientists-still-the-talk-of-the-town).

Things look good for DS jobs [in
Toronto](https://weclouddata.com/blog/torontos-data-science-odyssey-a-tale-of-triumphs-tribulations-and-a-dash-of-uncertainty/)

A good [data science subreddit](https://www.reddit.com/r/datascience/) to follow: it offers insights on jobs and academic programs, and hosts AMAs from industry leaders.

Another good resource is [Towards Data Science](https://towardsdatascience.com/).

---

# What is this course?

This course is an introduction to the world of data science, picking up where JSC270 left off.

--

The course will teach language-agnostic skills that are easily transferable, with examples done in R.

--

You can use any language or tool you prefer, but we will focus mainly on R and RStudio, with some Python.

---

# What is R?

<img src="https://www.r-project.org/logo/Rlogo.svg" width="150px" alt="R logo">

> R is a language and environment for statistical computing and graphics.

--

https://r-project.org

Created by statisticians, for statisticians.

CRAN hosts over 20,000 packages.

---

# History of R

R originates from S, which was developed at Bell Labs in the 1970s.

The first versions of R were developed by Robert Gentleman and Ross Ihaka at the University of Auckland in the mid-1990s.

R is intended for statisticians but is used by many (>2M users!).

R is open source and has nice graphics and visualizations.

A lot of help is available online (Stack Overflow, R package vignettes, Journal of Statistical Software).

---

# R Data Science Resources

1) R Programming for Data Science, 2022. Roger Peng.
https://bookdown.org/rdpeng/rprogdatascience/

Supplementary References

2) R for Data Science, 2023. Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
http://r4ds.hadley.nz/

3) Exploratory Data Analysis with R, 2020. Roger Peng.
https://bookdown.org/rdpeng/exdata/

4) Mastering Software Development in R, 2020. Roger Peng, Sean Kross, Brooke Anderson.
https://bookdown.org/rdpeng/RProgDA/

---

# R in the terminal

<img src="data:image/png;base64,#R-terminal.png" width="50%" style="display: block; margin: auto;" />

---

# What is RStudio?

<img src="https://rstudio.com/wp-content/uploads/2018/10/RStudio-Logo.svg" width="400px" alt="RStudio logo">

> RStudio is an integrated development environment (IDE) for R.

https://posit.co/download/rstudio-desktop/

---

# R + RStudio

<img src="data:image/png;base64,#rstudio-now.png" width="50%" style="display: block; margin: auto;" />

---

## GitHub

--

- Version control is essential to the practice of data science and is used throughout industry and academia

--

- Building up a solid GitHub profile will put you in a good position for job hunting

--

- You will build a github.io website as part of this course

<img src="data:image/png;base64,#img/git1.png" width="10%" style="display: block; margin: auto;" />
<img src="data:image/png;base64,#img/git2.png" width="10%" style="display: block; margin: auto;" />

---

# First Week

The lab exercises can be found in the schedule on the course website:

https://github.com/JSC370/JSC370-2025/

Download the Rmd files.

Submit your individually completed lab by the end of day Wednesday.

---

# Next Week

Lecture 1-3 pm Monday Jan 13 (Version control)

Lab 1-3 pm Wednesday Jan 15 (Version control)