Using Tables, Views, and Materialized Views in the Processing Scripts to optimize performance.
Requirements for Views and Materialized Views
In order to use Views and Materialized Views for Processing Scripts, all Data Sources that are referenced by Processing Scripts will need to be upgraded to the V2 Pipeline Services.
Please check with your Customer Success Lead to confirm if this is enabled in your instance.
What is a Table?
A Table in a database is a static collection of data in a table, similar conceptually to the average Excel Spreadsheet. This means the table contains its own copy of the data, and is only updated via direct commands.
What is a View?
A View in a database is a dynamic collection of data based on a query that is only collated when the view is queried, but can otherwise be used in the same way as a table. This means that every time a view is queried, it can return a different result
What is a Materialized View?
A Materialied View in Redshift is a static collection of data based on a query that can be collated at different points in time:
When the Materialized View is created
Manually at later points in time
Automatically if the underlying tables have changed
This means that the Materialized View does have it's own copy of the data, but is designed to be updated and refreshed.
Which type do I want to use, and when?
Tables are the slowest to build for a set of data, but because they support Named Queries that don't have a FROM clause, and can still be queried by Materialized Views, they are recommended for this purpose.
Views are the fastest to create, because they don't create a copy of the data at run-time. However, they are the slowest to query, because the query underpinning the view has to be run every time the View is built.
This makes Views most useful for two cases, provided they are only referenced by one later query:
The last named query that is then used in a large Fact Table.
This is usually the best time to use a View, because:
The view only needs to be referenced in the Fact Table (and the Fact Table can then be referenced in subsequant queries)
The last table before a Fact Table is often quite complex with many joins
This table is then simply copied again to create the Fact Table - so it's easy to combine the two and skip the work required to copy the data twice
Steps that are shifting a lot of data, but with a small transformation
Running these as a view means in-practice that they are run as part of multiple statements at once, with only one copy of the data getting made. Using it this way is as much of an art as a science, as some cases they can improve performance, and in others they can make it worse.
This is particularly the case because a View cannot be referenced by a Materialized View, only a Table (which is slower than a Materialized View to load). This potentially offsets some of the benefit, depending on where the View is used.
Materialized Views are faster to build than Tables, and faster to query than Views. As such, they are a good default for any named queries that use a FROM clause.
How can I use Tables/Views/Materizlied Views?
Our Processing Scripts are comprised of several named queries, like this:
Each of these can be either a Table, View, or Materialized View. To define them as a table, use the following modifier:
Similarly, use the view modifier to make
Materialized Views will be used by default (if no modifier is given), provided the Use Materialized Views for Processing Script flag is enabled:
If this is disabled, Tables will be used by default.
Fact, Model, and Dimension Named Queries
Fact, Model, and Dimension tables are all created as tables by default, even with the Materialized Flag enabled. This is because tables support Distribution and Sort keys (referenced using the keys() function in Processing Scripts), and are therefore much faster to query for Metrics and Dimensions.