Data Warehousing
Overview
Teaching: 45 min
Exercises: 15 minQuestions
What is a datawarehouse?
How can I leverage relational databases for reporting and BI?
How can I use python to populate a datawarehouse for Tableau?
Objectives
Model real world situations with a schema
Identify facts and dimensions within a datawarehouse.
We saw in the previous lesson that a database can be used to store structured, tabular data. The collection of tables, their structures and constraints is sometimes referred to as a database schema.
What is a Data Warehouse?
- Central repository of information coming from one or more sources: transactional systems and other relational databases contained structured, semi-structered or unstructured data.
- Data is processed, transformed and ingested at a constant pace
- Users: data scientists, business decision makers, business analysts
- Tools: BI tools, SQL Clients, Spreasheets
Design Overview: Facts and Dimensions
A datawarehouse is made up of two types of tables. Facts represent a single measurement. A measurement is simply a value without a context. In datawarehousing, these are referred to as “Facts”.
*Facts* refer to individual measurements without a context.
*Dimensions* refer to pieces of context that give meaning to a *Fact*.
ETL : Extract Transform Load
Extract Transform Load refers to the process of taking many sources of data, and loading them to a common database. A single ETL Process will take a source and load it into one or more targets.
Data Integration and Aggregation
After loading data, it may be important to derive or aggregate additional results. This is known as data integration.
Reporting and BI
Tools such as BI are meant to read data from a datawarehouse and create visualization.
Creating an ETL Process
Break out gapminder dataset into two tables: GDP by year, and a country/continent table.
Key Points
Data warehouses are designed to facilitate reporting.
Business intelligence tools are built to use data warehouse.
A data warehouse is built by an ETL process, and a data integration process.