Software Engineering and Reproducible Research
40 Hours

“A data scientist knows more about statistics than a software engineer, and more about programming than a statistician.”

Being a data scientist means applying statistics and analysis of data, writing real working code that runs and gets results. You’ve been doing that your entire time at Bloom Institute of Technology, but much of our work has been in the land of Python notebooks, a useful but limited environment intended for exploration, not engineering.

Some place a divide between science and engineering – theory and practice, ideas and application. A skilled data scientist masters both: science informs engineering, and engineering increases the rigor of science by making it reproducible and scalable.

In this unit we will build the core skills needed to communicate and work with software engineers. You may pleasantly surprise colleagues if you not only know the latest and greatest machine learning model but also build and approach it with software development best practices. To do this, we will go beyond Python notebooks, into the world of modules, packages, containers, and more.

SQL and Databases
40 Hours

What does “data” look like? If you try to picture it, you probably see rows and columns in a spreadsheet or CSV that can be conveniently loaded with pandas, then cleaned and analyzed from there. As a data scientist, this will often be the form you want your data to be in, but it’s probably not how your data started.

Most modern data is generated automatically by human interaction with a web-backed application – every app users download, every click they make, all travels over a network and is saved by the server. Though in its rawest form this may be a log file, in most cases it ends up in a database.

So, what is a database? A place for data! If it’s relational, it’s actually still pretty close to that rows and columns picture, though with some important additional functionality.
These databases are commonly accessed using SQL – Structured Query Language – a standard based on relational algebra, and a useful tool known not just by data scientists but by software engineers, MBAs, and more. If it’s so-called “NoSQL,” then it’s most likely a document-oriented database (or document store), which, despite the glamor, is essentially a bunch of key-value pairs. What key-value pair object are you already familiar with? Python dicts!

In this sprint we will learn about both of the above paradigms, and how the separation between them is not as sharp as you may think.

Page 42 of 58 REV 10/31/2022
This catalog applies to all students other than those who reside in CA, CO, GA, TX, and DC, who have their own catalogs.
Bloom Institute of Technology | Course Catalog