Power BI & Synapse Part 4 — Serverless SQL: How much will it cost me?

How much would you pay if you use Power BI in combination with Synapse Analytics? I’ve put the Power BI & Synapse Analytics Serverless SQL pool combo to the ultimate benchmarking test, and here are the results!


Serverless SQL pool — The next big thing

I’ve already written about the Serverless SQL pool, and I firmly believe that it is the next big thing when it comes to dealing with large volumes of semi-structured or unstructured data.

A quick recap of the pricing model: with the Serverless SQL pool, you pay only for the amount of data processed by your queries. Consequently, you are not charged for the following (a quick illustration follows this list):

  • Databases you create in your endpoint. Those databases contain only metadata (users, roles, schemas, views, inline table-valued functions, stored procedures, external data sources, external file formats, and external tables)
  • DDL statements, except for the CREATE STATISTICS statement, because it processes data from storage based on the specified sample percentage
  • Metadata-only queries
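For instance, here is a minimal illustration of a metadata-only query — it reads system metadata only, never the files in storage, so it incurs no charge:

/*Metadata-only: lists the views in the database without processing any data from storage*/
SELECT name, create_date
FROM sys.views;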

Scenario

Here is the scenario: I have two CSV files related to the NYC taxi dataset, which I’ve already used in one of the previous demos. One contains data about all yellow cab rides from January 2019 (650 MB), while the other holds data from February 2019 (620 MB).

To query the CSV files from the Serverless SQL pool, I created a view over each of them. Here is the view definition for February 2019:
DROP VIEW IF EXISTS taxi201902csv;
GO

CREATE VIEW taxi201902csv AS
SELECT
     VendorID
    ,CAST(tpep_pickup_datetime AS DATE) tpep_pickup_datetime
    ,CAST(tpep_dropoff_datetime AS DATE) tpep_dropoff_datetime
    ,passenger_count
    ,trip_distance
    ,RateCodeID
    ,store_and_fwd_flag
    ,PULocationID
    ,DOLocationID
    ,payment_type
    ,fare_amount
    ,extra
    ,mta_tax
    ,tip_amount
    ,tolls_amount
    ,improvement_surcharge
    ,total_amount
    ,congestion_surcharge
FROM
    OPENROWSET(
        BULK N'https://nikola.dfs.core.windows.net/nikola/Data/yellow_tripdata_2019-02.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    )
    WITH (
        VendorID INT,
        tpep_pickup_datetime DATETIME2,
        tpep_dropoff_datetime DATETIME2,
        passenger_count INT,
        trip_distance DECIMAL(10,2),
        RateCodeID INT,
        store_and_fwd_flag VARCHAR(10),
        PULocationID INT,
        DOLocationID INT,
        payment_type INT,
        fare_amount DECIMAL(10,2),
        extra DECIMAL(10,2),
        mta_tax DECIMAL(10,2),
        tip_amount DECIMAL(10,2),
        tolls_amount DECIMAL(10,2),
        improvement_surcharge DECIMAL(10,2),
        total_amount DECIMAL(10,2),
        congestion_surcharge DECIMAL(10,2)
    ) AS [taxi201902csv]
And here is the corresponding view over the Parquet files:
DROP VIEW IF EXISTS taxi201901parquet;
GO

CREATE VIEW taxi201901parquet AS
SELECT
     VendorID
    ,CAST(TpepPickupDatetime AS DATE) TpepPickupDatetime
    ,CAST(TpepDropoffDatetime AS DATE) TpepDropoffDatetime
    ,PassengerCount
    ,TripDistance
    ,PuLocationId
    ,DoLocationId
    ,StartLon
    ,StartLat
    ,EndLon
    ,EndLat
    ,RateCodeId
    ,StoreAndFwdFlag
    ,PaymentType
    ,FareAmount
    ,Extra
    ,MtaTax
    ,ImprovementSurcharge
    ,TipAmount
    ,TollsAmount
    ,TotalAmount
FROM
    OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT = 'PARQUET'
    ) nyc
WHERE
    nyc.filepath(1) = 2019
    AND nyc.filepath(2) IN (1)
    AND TpepPickupDatetime BETWEEN CAST('1/1/2019' AS datetime) AND CAST('1/31/2019' AS datetime)

CSV vs Parquet — What do I need to know?

Before we proceed with testing, a little more theory… As I plan to compare data processing between CSV and Parquet files, I believe we should understand the key differences between these two file types:

  • Parquet files use a columnar storage format — columns within a Parquet file are physically separated, which means you don’t need to scan the whole file if you only need data from a few columns! By contrast, when you query a CSV file, every query scans the whole file, even if you need data from one single column (see the sketch right after this list)
  • For those coming from the traditional SQL world, you can think of CSV vs. Parquet as analogous to row-store vs. columnar databases
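To make this more tangible, here is a sketch of the difference, reusing the sample paths and the YellowTaxi data source from this article — both queries return the same two columns for January 2019, but only the Parquet query can restrict the scan to those columns:

/*Parquet: only the two referenced column chunks are read from storage*/
SELECT TpepPickupDatetime, TotalAmount
FROM OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT = 'PARQUET'
    ) AS nyc
WHERE nyc.filepath(1) = 2019
    AND nyc.filepath(2) IN (1);

/*CSV: the entire file is scanned, even though only two columns are needed*/
SELECT tpep_pickup_datetime, total_amount
FROM OPENROWSET(
        BULK N'https://nikola.dfs.core.windows.net/nikola/Data/yellow_tripdata_2019-01.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS taxi;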

Use Case #1 — Import CSV data into Power BI

Let’s start with the most obvious and desirable scenario — using Import mode to ingest all the data into Power BI, and performing data refresh to check how much it will cost us.

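You can keep track of the amount of data processed (and, therefore, of the costs) by querying the sys.dm_external_data_processed DMV, which exposes daily, weekly, and monthly aggregates: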
SELECT * FROM sys.dm_external_data_processed
WHERE type = 'daily'
I’ve also created a Date dimension table in Power BI, using the following Power Query (M) script:
let
    // StartYear and EndYear are report parameters; the values below are
    // illustrative defaults covering the dataset used in this article
    StartYear = 2019,
    EndYear = 2019,
    StartDate = #date(StartYear, 1, 1),
    EndDate = #date(EndYear, 12, 31),
    NumberOfDays = Duration.Days(EndDate - StartDate),
    Dates = List.Dates(StartDate, NumberOfDays + 1, #duration(1,0,0,0)),
    #"Converted to Table" = Table.FromList(Dates, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "FullDateAlternateKey"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"FullDateAlternateKey", type date}}),
    #"Inserted Year" = Table.AddColumn(#"Changed Type", "Year", each Date.Year([FullDateAlternateKey]), type number),
    #"Inserted Month" = Table.AddColumn(#"Inserted Year", "Month", each Date.Month([FullDateAlternateKey]), type number),
    #"Inserted Month Name" = Table.AddColumn(#"Inserted Month", "Month Name", each Date.MonthName([FullDateAlternateKey]), type text),
    #"Inserted Quarter" = Table.AddColumn(#"Inserted Month Name", "Quarter", each Date.QuarterOfYear([FullDateAlternateKey]), type number),
    #"Inserted Week of Year" = Table.AddColumn(#"Inserted Quarter", "Week of Year", each Date.WeekOfYear([FullDateAlternateKey]), type number),
    #"Inserted Week of Month" = Table.AddColumn(#"Inserted Week of Year", "Week of Month", each Date.WeekOfMonth([FullDateAlternateKey]), type number),
    #"Inserted Day" = Table.AddColumn(#"Inserted Week of Month", "Day", each Date.Day([FullDateAlternateKey]), type number),
    #"Inserted Day of Week" = Table.AddColumn(#"Inserted Day", "Day of Week", each Date.DayOfWeek([FullDateAlternateKey]), type number),
    #"Inserted Day of Year" = Table.AddColumn(#"Inserted Day of Week", "Day of Year", each Date.DayOfYear([FullDateAlternateKey]), type number),
    #"Inserted Day Name" = Table.AddColumn(#"Inserted Day of Year", "Day Name", each Date.DayOfWeekName([FullDateAlternateKey]), type text)
in
    #"Inserted Day Name"
  • My whole data model is ~92 MB in size, as the data is optimally compressed within Power BI Desktop (we reduced the cardinality of the DateTime columns by casting them to DATE)

Use Case #2 — DirectQuery over CSV files

Let’s now check what happens if we use exactly the same query but, instead of importing the data into Power BI Desktop, we use the DirectQuery option.


Use Case #3 — Use Date slicer in DirectQuery mode

Now, I want to check what happens if I select a specific date range within my slicer, for example, between January 1st and January 13th:

/*Query 1*/
SELECT
TOP (1000001) [semijoin1].[c1],SUM([a0])
AS [a0]
FROM
(
(
SELECT [t1].[tpep_pickup_datetime] AS [c14],[t1].[total_amount] AS [a0]
FROM
(
(SELECT * FROM dbo.taxi201901csv
UNION ALL
SELECT * FROM dbo.taxi201902csv
)
)
AS [t1]
)
AS [basetable0]
INNER JOIN
(
(SELECT 3 AS [c1],CAST( '20190101 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 4 AS [c1],CAST( '20190102 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 5 AS [c1],CAST( '20190103 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 6 AS [c1],CAST( '20190104 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 7 AS [c1],CAST( '20190105 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 8 AS [c1],CAST( '20190106 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 9 AS [c1],CAST( '20190107 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 10 AS [c1],CAST( '20190108 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 11 AS [c1],CAST( '20190109 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 12 AS [c1],CAST( '20190110 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 13 AS [c1],CAST( '20190111 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 14 AS [c1],CAST( '20190112 00:00:00' AS datetime) AS [c14] ) UNION ALL
(SELECT 15 AS [c1],CAST( '20190113 00:00:00' AS datetime) AS [c14] )
)
AS [semijoin1] on
(
([semijoin1].[c14] = [basetable0].[c14])
)
)
GROUP BY [semijoin1].[c1]

/*Query 2*/
SELECT SUM([t1].[total_amount])
AS [a0]
FROM
(
(SELECT * FROM dbo.taxi201901csv
UNION ALL
SELECT * FROM dbo.taxi201902csv)
)
AS [t1]
WHERE
(
([t1].[tpep_pickup_datetime] IN (CAST( '20190112 00:00:00' AS datetime),CAST( '20190113 00:00:00' AS datetime),CAST( '20190101 00:00:00' AS datetime),CAST( '20190102 00:00:00' AS datetime),CAST( '20190103 00:00:00' AS datetime),CAST( '20190104 00:00:00' AS datetime),CAST( '20190105 00:00:00' AS datetime),CAST( '20190106 00:00:00' AS datetime),CAST( '20190107 00:00:00' AS datetime),CAST( '20190108 00:00:00' AS datetime),CAST( '20190109 00:00:00' AS datetime),CAST( '20190110 00:00:00' AS datetime),CAST( '20190111 00:00:00' AS datetime)))
)

Use Case #4 — Aggregated table in DirectQuery mode

The next test will show us what happens if we create an aggregated table and store it in DirectQuery mode within Power BI. It’s quite a simple aggregated table, consisting only of the total amount and pickup date columns.
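The exact table definition isn’t reproduced here, but the aggregation essentially boils down to a query along these lines — a sketch over the two CSV views defined earlier:

/*Sketch: one row per pickup date, with the total amount summed up*/
SELECT t.tpep_pickup_datetime
    ,SUM(t.total_amount) AS total_amount
FROM
(
    SELECT tpep_pickup_datetime, total_amount FROM dbo.taxi201901csv
    UNION ALL
    SELECT tpep_pickup_datetime, total_amount FROM dbo.taxi201902csv
) AS t
GROUP BY t.tpep_pickup_datetime;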


Use Case #5 — Aggregated table in Import mode

After that, I want to check what happens if I import a previously aggregated table into Power BI. I believe that calculations will be faster, but let’s see how it will affect the query costs.


Use Case #6 — Aggregated data in Serverless SQL pool

The last thing I want to check for the CSV files is what happens if I know my analytic workloads and can prepare the most frequent queries in advance, using a view in the Serverless SQL pool.

DROP VIEW IF EXISTS taxi201901_02_agg;
GO

CREATE VIEW taxi201901_02_agg AS
/*With HEADER_ROW = TRUE, the columns are referenced by their header names*/
SELECT
     CAST(tpep_pickup_datetime AS DATE) AS tpep_pickup_datetime
    ,SUM(CAST(total_amount AS DECIMAL(10,2))) AS total_amount
FROM
    OPENROWSET(
        BULK N'https://nikola.dfs.core.windows.net/nikola/Data/yellow_tripdata_2019-01.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS [taxi201901_02_agg]
GROUP BY CAST(tpep_pickup_datetime AS DATE)

Use Case #7 — Import Parquet files

Let’s now evaluate if something changes if we use data from Parquet files, instead of CSV.

Use Case #8 — DirectQuery over Parquet files

And finally, let’s examine the figures if we use DirectQuery mode in Power BI to query the data directly from the Parquet files within the Serverless SQL pool in Synapse.

SELECT 
TOP (1000001) [semijoin1].[c1],SUM([a0])
AS [a0]
FROM
(
(
SELECT [t1].[TpepPickupDatetime] AS [c13],[t1].[TotalAmount] AS [a0]
FROM
(
(SELECT *
FROM taxi201901parquet
UNION ALL
SELECT *
FROM taxi201902parquet)
)
AS [t1]
)
AS [basetable0]
INNER JOIN
(
(SELECT 3 AS [c1],CAST( '20190101 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 4 AS [c1],CAST( '20190102 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 5 AS [c1],CAST( '20190103 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 6 AS [c1],CAST( '20190104 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 7 AS [c1],CAST( '20190105 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 8 AS [c1],CAST( '20190106 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 9 AS [c1],CAST( '20190107 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 10 AS [c1],CAST( '20190108 00:00:00' AS datetime) AS [c13] ) UNION ALL
(SELECT 11 AS [c1],CAST( '20190109 00:00:00' AS datetime) AS [c13] ) UNION ALL
…..

Use Case #9 — Aggregated data in the Serverless SQL pool

That’s our tipping point — here is where the magic happens! Because Parquet stores columns physically separated, it outperforms all the previous use cases in this scenario — that is, whenever you can reduce the number of columns involved to only those you actually need in the Power BI report.
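The aggregated view itself isn’t reproduced here, but it follows the same pattern as the CSV aggregation from Use Case #6 — a sketch, reusing the YellowTaxi data source from earlier (the view name is illustrative):

DROP VIEW IF EXISTS taxi2019parquet_agg;
GO

CREATE VIEW taxi2019parquet_agg AS
/*Only two columns are referenced, so only their column chunks are scanned*/
SELECT
     CAST(TpepPickupDatetime AS DATE) AS TpepPickupDatetime
    ,SUM(TotalAmount) AS TotalAmount
FROM
    OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT = 'PARQUET'
    ) nyc
WHERE
    nyc.filepath(1) = 2019
    AND nyc.filepath(2) IN (1, 2)
GROUP BY CAST(TpepPickupDatetime AS DATE)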

Final Verdict

Here is the table with costs for every single use case I’ve examined:

[Image by author: cost comparison table across all nine use cases]
  • Whenever possible, Import the data into Power BI — that means you will pay only when the data snapshot is being refreshed, not for every single query within the report
  • If you are dealing with Parquet files, whenever possible, create pre-aggregated data (views) in the Serverless SQL pool in Synapse
  • Since the Serverless SQL pool still doesn’t support result-set caching (as far as I know, Microsoft’s team is working on it), keep in mind that every time you run a query — even if it returns exactly the same result set — the data will be processed again, and you will pay for it!
  • If your analytic workloads require a high number of queries over a large dataset (so large that Import mode is not an option), consider storing the data in a Dedicated SQL pool, as you will then pay fixed storage costs instead of data processing costs each time you query the data. To get additional benefit from this scenario, materialize intermediate results using external tables (CETAS) BEFORE importing them into a Dedicated SQL pool — that way, your queries read already prepared data instead of raw data (see the sketch after this list)
  • Stick with the general best practices when using Serverless SQL pool within Synapse Analytics
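For the materialization mentioned in the Dedicated SQL pool bullet, CETAS (CREATE EXTERNAL TABLE AS SELECT) is the mechanism to use in the Serverless SQL pool. Here is a minimal sketch — the MyDataLake data source, the location, and the table name are assumptions for illustration, since CETAS must write to a storage account you own:

/*Hypothetical: MyDataLake must be an external data source over writable storage, with an appropriate credential*/
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO

/*Materialize the daily aggregate as Parquet files, exposed as an external table*/
CREATE EXTERNAL TABLE taxi2019_daily_agg
WITH (
    LOCATION = 'aggregated/taxi2019/',
    DATA_SOURCE = MyDataLake,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT tpep_pickup_datetime, SUM(total_amount) AS total_amount
FROM dbo.taxi201901csv
GROUP BY tpep_pickup_datetime;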

Conclusion

In this article, we took a deep dive into testing different scenarios and multiple use cases of using Power BI in combination with the Serverless SQL pool in Synapse Analytics.

Data Mozart — Don’t let data make noise, create music from it | data-mozart.com | https://www.linkedin.com/in/nikola-ilic-data-mozart/
