I like to use public datasets for experimentation and presentation demos, especially data that people can easily understand and relate to. For some of them, keeping the data up to date was a manual process of downloading files, loading tables, and merging. There are of course many better ways to do this, some more automated than others. I could have simply used PowerShell to call bcp, or even just written an INSERT statement and some loops. Then I found dbatools, whose commands let me do an even better job with far less work, just the way I like it! Here’s how I now keep my datasets current:
Getting The Data
I’ll be using data from the City of Chicago’s Data Portal. They have a tremendous online resource with lots of public datasets available. One that I really like is their listing of towed vehicles. Any time the city tows or impounds a vehicle, a record gets added here and remains for 90 days. It’s very manageable, with only 10 columns and a few thousand rows. (As an added bonus, you can search for license plates you know and then ask your friends about their experience at the impound lot!)
Chicago’s data portal uses Socrata, which is a very well-documented and easy-to-use tool for exposing data. It has a wonderful API for querying and accessing data, but to keep things simple for this post we’re just going to download a CSV file.
If you’re on the page for a dataset, you can download it by clicking on “Export” on the top right and then selecting “CSV”. To avoid all that, the direct link to download a CSV of this dataset is here. Download it and take a look at what we’ve got using your spreadsheet or text editor of choice (mine is Notepad++).
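If you’d rather stay in PowerShell for this step too, here is a minimal sketch of downloading the file and peeking at the first few rows. The function names `Save-Dataset` and `Show-CsvPreview` are made up for illustration, and the URL in the usage comment is a placeholder you would replace with the CSV export link from the dataset page.

```powershell
# Download a file from a URL to a local path and return the saved file.
function Save-Dataset {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        [Parameter(Mandatory)] [string] $OutFile
    )
    # -UseBasicParsing avoids the Internet Explorer dependency in Windows PowerShell.
    Invoke-WebRequest -Uri $Uri -OutFile $OutFile -UseBasicParsing
    Get-Item -Path $OutFile
}

# Return the first few rows of a CSV file as objects, one property per column.
function Show-CsvPreview {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $First = 5
    )
    Import-Csv -Path $Path | Select-Object -First $First
}

# Usage (placeholder URL, replace it with the dataset's CSV export link):
#   $downloadFile = Join-Path ([System.IO.Path]::GetTempPath()) 'TowedVehicles.csv'
#   Save-Dataset -Uri '<CSV export link>' -OutFile $downloadFile
#   Show-CsvPreview -Path $downloadFile | Format-Table
```

`Import-Csv` treats the first line of the file as column names, so the preview gives a quick check that the header row looks the way you expect before loading anything into SQL Server.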
Loading The Data
We’ve got our data, now let’s load it. I like to load the entire downloaded dataset into a stage table, and then copy new rows I haven’t previously seen into my production table that I query from. Here’s the script to create these tables:
-- CREATE STAGE TABLE
CREATE TABLE [dbo].[TowedVehiclesSTG](
    [TowDate] [date] NOT NULL,
    [Make] [nchar](4) NULL,
    [Style] [nchar](2) NULL,
    [Model] [nchar](4) NULL,
    [Color] [nchar](3) NULL,
    [Plate] [nchar](8) NULL,
    [State] [nchar](2) NULL,
    [TowedToFacility] [nvarchar](75) NULL,
    [FacilityPhone] [nchar](14) NULL,
    [ID] [int] NOT NULL
);

-- CREATE FINAL TABLE
CREATE TABLE [dbo].[TowedVehicles](
    [ID] [int] NOT NULL,
    [TowDate] [date] NOT NULL,
    [Make] [nchar](4) NULL,
    [Style] [nchar](2) NULL,
    [Model] [nchar](4) NULL,
    [Color] [nchar](3) NULL,
    [Plate] [nchar](8) NULL,
    [State] [nchar](2) NULL,
    [TowedToFacility] [nvarchar](75) NULL,
    [FacilityPhone] [nchar](14) NULL,
    CONSTRAINT PK_TowedVehicles PRIMARY KEY CLUSTERED (ID)
);
Now for the magic: let’s load some data! The dbatools command that does all the heavy lifting here is called Import-DbaCsvToSql. It loads CSV files into a SQL Server table quickly and easily. As an added bonus, the entire import runs within a transaction, so if an error occurs everything gets rolled back. I like to define my tables and data types ahead of time, but if you want to load into a table that doesn’t exist yet, the command will create the table and do its best to guess the appropriate data types. To use it, simply point it at a CSV file and a SQL Server instance, database, and (optionally) a table. It will take care of the rest.
# Load from CSV into staging table
Import-DbaCsvToSql -Csv $downloadFile -SqlInstance InstanceName -Database TowedVehicles -Table TowedVehiclesSTG `
    -Truncate -FirstRowColumns
The two parameters on the second line tell the command to truncate the table before loading, and that the first line of the CSV file contains column names.
Now the data has been staged, but since this dataset contains all cars towed over the past 90 days, chances are very good that I already have some of these tows in my production table from a previous download. A simple query to insert all rows from staging into production that aren’t already there will do the trick. The left join filters out IDs that already exist in production, and the ROW_NUMBER() filter keeps one row per ID so a duplicate in the staging data can’t violate the primary key. This query is run using another dbatools command, Invoke-Sqlcmd2.
# Move new rows from staging into production table
Invoke-Sqlcmd2 -ServerInstance InstanceName -Database TowedVehicles `
    -Query "INSERT INTO [dbo].[TowedVehicles]
    SELECT [ID], [TowDate], [Make], [Style], [Model], [Color], [Plate], [State], [TowedToFacility], [FacilityPhone]
    FROM (
        SELECT s.*, ROW_NUMBER() OVER (PARTITION BY s.ID ORDER BY s.ID) AS n
        FROM [dbo].[TowedVehiclesSTG] s
        LEFT JOIN [dbo].[TowedVehicles] v ON s.ID = v.ID
        WHERE v.ID IS NULL
    ) a
    WHERE a.n = 1"
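To see what that query does without touching a database, here is a small in-memory sketch of the same idea in PowerShell. The rows and plates are made-up sample values, not real tow records, and the hash set simply stands in for the join against the production table.

```powershell
# Made-up sample rows standing in for the production and staging tables.
$production = @(
    [pscustomobject]@{ ID = 1; Plate = 'AAA111' }
    [pscustomobject]@{ ID = 2; Plate = 'BBB222' }
)
$staging = @(
    [pscustomobject]@{ ID = 2; Plate = 'BBB222' }   # already in production
    [pscustomobject]@{ ID = 3; Plate = 'CCC333' }   # new
    [pscustomobject]@{ ID = 3; Plate = 'CCC333' }   # duplicate within staging
    [pscustomobject]@{ ID = 4; Plate = 'DDD444' }   # new
)

# Collect the IDs that production already holds.
$knownIds = [System.Collections.Generic.HashSet[int]]::new()
foreach ($row in $production) { [void]$knownIds.Add($row.ID) }

# HashSet.Add returns $false for an ID it has already seen, so this keeps
# only rows that are new to production and only the first row per ID.
$newRows = foreach ($row in $staging) {
    if ($knownIds.Add($row.ID)) { $row }
}

$newRows
```

With this sample data, only the rows with IDs 3 and 4 survive: ID 2 is skipped because production already has it, and the second copy of ID 3 is skipped as a duplicate.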
Putting It All Together
I’ve shown you how simple dbatools makes it to load a CSV file into a table and then run a query to move new rows from staging into production, but the beauty of PowerShell is that it’s easy to do way more than that. I actually scripted this entire process, including downloading the data! You can download the full PowerShell script, along with a T-SQL script for creating the tables, from my GitHub here.
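As a rough idea of how the pieces can fit together, here is a simplified sketch (an illustration only, not the full script linked above). The function name `Update-TowedVehicles` and its parameter names are made up for this example, `$CsvUrl` stands for the dataset’s CSV export link, and `$MergeQuery` is expected to hold the INSERT statement from the previous section. It assumes the installed dbatools version provides the two commands used in this post.

```powershell
function Update-TowedVehicles {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $CsvUrl,       # CSV export link for the dataset
        [Parameter(Mandatory)] [string] $SqlInstance,  # target SQL Server instance
        [Parameter(Mandatory)] [string] $MergeQuery,   # INSERT ... SELECT from staging to production
        [string] $Database   = 'TowedVehicles',
        [string] $StageTable = 'TowedVehiclesSTG'
    )

    # Stop at the first failure so a bad download never reaches the load step.
    $ErrorActionPreference = 'Stop'

    # Assumption: dbatools is installed and exposes the commands used below.
    Import-Module dbatools

    # 1. Download the CSV to a temporary file.
    $downloadFile = Join-Path ([System.IO.Path]::GetTempPath()) 'TowedVehicles.csv'
    Invoke-WebRequest -Uri $CsvUrl -OutFile $downloadFile -UseBasicParsing

    # 2. Truncate the staging table and reload it from the CSV.
    Import-DbaCsvToSql -Csv $downloadFile -SqlInstance $SqlInstance -Database $Database `
        -Table $StageTable -Truncate -FirstRowColumns

    # 3. Copy rows not yet in production from staging.
    Invoke-Sqlcmd2 -ServerInstance $SqlInstance -Database $Database -Query $MergeQuery

    # 4. Clean up the downloaded file.
    Remove-Item -Path $downloadFile
}
```

Wrapping the steps in one function means the whole refresh can be run on demand or scheduled, for example as `Update-TowedVehicles -CsvUrl $csvUrl -SqlInstance InstanceName -MergeQuery $query`, with `$csvUrl` and `$query` set beforehand.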
Happy Data Loading!