PowerPivot – CSVs “The SQL Way”

PowerPivot – CSVs “The SQL Way”

Despite all the technology at our disposal today, when it comes to exchanging data between systems – text files are still the king. The venerable CSV file format is still the least common denominator in the import/export of data. The good news is the Microsoft Excel/PowerPivot/PowerBI stack is a very efficient tool for crunching CSV data into state of the art Charts/Tables/KPIs/Dashboards/etc. However, real world challenges (like dirty data) often turn what should be a straight forward process into a series of speed bumps, hurdles and sometimes dead ends. Described here are some of the tricks/techniques I use on a daily basis to process CSV data with the Microsoft BI tools.

There are many ways to process CSV files. If you are looking to import just a few simple files on an ad-hoc basis there are several methods more simple than The SQL Way I share here. However, if you are tasked with building a process that is robust, repeatable and easily supported… then read on…

Why I do it The SQL Way:

  • Dynamic! With SQL in my ETL path I can quickly make changes to meet the dynamic demands of my data consumers
  • Documentation! Sometimes I push out very important data sets (think litigation) and by providing the SQL used in the ETL it goes a long way to explain where the data came from and how it was compiled. (Same for DAX by the way)
  • Performance! By doing some of the ETL work in the SQL on import, I can keep my tables smaller, perhaps summarized or preprocessed. Keeps the load time and memory requirements down for my data consumers.

Assumptions:

  • You have little to no control over the format of the source data. It is given to you in CSV format (which at best is loosely defined standard often misused). It is given to you periodically and your goal is to import the data into PowerPivot so it may be combined with other data and visualized in Pivots and Charts.
  • The project is not well defined, thus you can expect many changes in source data and desired visualizations. Somebody is going to move your cheese. Count on it.
  • Source data files may need to be combined to form a single table in PowerPivot (such as sales from different regions, different reporting periods, etc…)
  • Source data files may need to be joined or filtered to form a single table in PowerPivot (such as importing sales data for only those customers that exist in a separate file)

Approach:

  • Build a Schema.ini file that describes the CSV source data
  • Define PowerPivot Connection to the source CSV files
  • Create the SQL necessary to extract/filter/transform the source data into a PowerPivot data model
  • Create DAX defined columns and measures as needed (not described here)
  • Create additional dimensions as needed (not described here)
  • Create desired visualizations in the way of Pivot Tables and Charts (not described here)

Build a Schema.ini file:

The bad news is the Schema.ini file format is NOT well defined. I found a few gems on Internet forums that helped me clear hurdles, but I never found a comprehensive resource. The good news is Excel provides a decent Wizard that does most of the heavy lifting. Follow along closely as these steps are not all that obvious.

  1. Open a new spreadsheet in Excel (2010 version for this example). On the Data Ribbon select to Get External Data, From Other Sources, From Microsoft Query
  2. From the Choose Data Source panel, select <New Data Source>
  3. From the Create New Data Source pane, give a bogus name (really does not matter), Select the driver “Microsoft Text Driver(*.txt,*.csv)”, click the Connect button
  4. From the ODBC Text Setup pane select the proper directory for your source CSV file, click the Options button
  5. Still from the ODBC Text Setup pane click the Define Format button
  6. From the Define Text Format pane select the source CSV file, and set the Column Name Header /  Format: / Delimiter:  fields to match your source CSV file
  7. Still from the Define Text Format Pane you now get to enjoy the magic… click the Guess button and then the OK button. Ignore any error message you get at this point and back all the way out of the Wizard and even close the spreadsheet – for at this point you have accomplished your goal… you have created the Schema.ini file!!!
  8. Edit the schema.ini file with your favorite text editor and make adjustments per notes below.

Schema.ini Notes:

  • the schema.ini file must reside in the same directory as the source CSV file
  • if you use the Wizard to create the schema.ini file, it will contain entries for all files found in the directory. To keep the schema.ini clean, I typically keep each report type in its own directory. c:\data\report1 has a schema.ini that defines report1.csv; c:\data\report2 has a schema.ini that defines report2.csv; etc…
  • As a general rule, I define all columns in the schema.ini as “Char Width 255” & let PowerPivot convert the fields to the desired data type. All hail PowerPivot and its ability to change data types on the fly!!!
  • Dates in source CSV data can be problematic. Most times you can define in Schema.ini the date fields as “Char Width 255” and in PowerPivot as “Date”. However depending on the format of the source data you may need to specify something like DateTimeFormat=”d-mmm-yy hh:nn:ss” in the schema.ini file in order to help PowerPivot with the transform. One limitation to note – schema.ini assumes all dates in a given source file are of the same format. If that is not true in your case, define in schema.ini the date fields at “Char Width 255” and in PowerPivot as “Text”, then sharpen up your DAX skills to create a defined column of type “Date”.
  • PowerPivot can handle “dirty” column names most of the time. It will often just drop the offending character. However I found on occasion PowerPivot will lock up on the import if the column name is too weird. In such cases I can often edit the schema.ini file and remove the offending character from the column definition.
  • If the source CSV data contains control characters (other than Carriage Return, Line Feed, End of File, etc) the results can be unpredictable. I have had cases where a text file would only partially load (with no error message thrown) because the 026 control character was somehow embedded in a user input text field. Thus I typically scrub the CSV file before import to remove extraneous junk. On an ad-hoc basis I use NotePad++ to search/replace. For more complex or systematic processes I depend on a PowerShell script that cleanses the files before import. I also use PowerShell to perform many other ETL functions, but that is a whole other topic 🙂
  • Depending on how your source CSV file treats text delimiters, you may find this schema.ini setting to be handy: textdelimiter=none. This is another of those under-documented settings that has saved me much grief.

Define the PowerPivot Connection & Import first Report:

  1. From the PowerPivot Window (Excel 2010 64 bit for this example), Home Ribbon, select to Get External Data From Other Sources
  2. From the Table Import Wizard panel select “Others (OLEDB/ODBC)”
  3. From the Connection String panel, define a useful name for the connection (such as “CSV Files via SQL”) and click the Build button
  4. From the Data Link Properties panel, Provider tab, select “Microsoft OLE DB Provider for ODBC Drivers” and click the Next button
  5. Still from the Data Link Properties panel, Connection tab, select “Use Connection String” and click the related Build button
  6. From the Select Data Source panel, give a DSN Name such as “report64” and click the New button
  7. From the Create New Data Source panel select the “Microsoft Access Text Driver (*.txt, *.csv) and click the Next button
  8. From the Create New Data Source panel browse to the desired location such as c:\data and give a suitable file name such as Report64 and click Next and then click Finish
  9. From the ODBC Text Setup panel select OK (assuming you have already created the schema.ini file described above)
  10. From the Data Link Properties panel (again) select the OK button
  11. From the Table Import Wizard panel (again) verify the connection string looks something like the following and click Next:
    Provider=MSDASQL.1;Persist Security Info=False;Extended Properties=”DBQ=C:\Data\Report1;DefaultDir=C:\Data\Report1;Driver={Microsoft Access Text Driver(*.txt,*.csv)}; DriverId=27; Extensions=txt,csv,tab,asc; FIL=text; FILEDSN=C:\data\report64.dsn; MaxBufferSize=2048; MaxScanRows=25; PageTimeout=5; SafeTransactions=0;Threads=3;UID=admin;UserCommitSync=Yes;”
  12. From the How to Choose the Import Data panel, select to Write a Query and click the Next button.
  13. From the Specify a SQL Query panel, define a useful name and the following as a SQL Statement and click the Finish button:
    select * from c:\data\report1\report1.txt
  14. Data should now be imported to a PowerPivot table. Check/set the Data Type for each column & refresh to ensure PowerPivot can handle the desired transforms.

Import subsequent CSV reports:

  1. From the PowerPivot Window, Design Ribbon, select Existing Connections
  2. From the Existing Connections panel, select to Open the “CSV Files Via SQL” connection just created.
  3. As above, specify the desired table name and SQL query for the next report and click Finish
  4. As above check/set the column data types and refresh…

PowerPivot Connection & Import Notes:

  • Only 1 PowerPivot connection needs to be defined. You can import all your CSV tables via this connection assuming your SQL query fully specifies the file path. Note that you can use UNC style file paths in the SQL query (\\<sharename>\data\report1\report1.txt as an example).
  • I have NOT figured out how to use environment variables or the like to make the SQL queries somewhat dynamic. At this point all my SQL queries are static. I am truly open to suggestions here. Large projects can result in *many* static queries that ultimately need to be modified.

SQL Notes:

  • the SQL functionality is limited by the Jet database (aka Microsoft Access). Tis not as good as T-SQL, but not too bad. Syntax is often the challenge, but most is well defined in various Internet forums.
  • When you need to combine multiple CSV source files into a single table use the Union All command. For example:
    select * from c:\data\SalesReports\2011full.csv
    union all
    select * from c:\data\SalesReports\2012full.csv
    union all
    select * from c:\data\SalesReports\2013ytd.csv
  • When you need to limit the records by another file, use a join such as:
    select *
    from c:\data\SalesReports\2013ytd.csv s, c:\data\Customers\CustListA c
    where s.custno=c.custno

Credits:

I beg forgiveness for not properly acknowledging those that I have learned from. It has been a long and winding path to this point and alas I have not taken good notes. The best can do at this point is call out those who I sponge from regularly:

Update 1/18/2014 – For a little more detail on the process of creating the schema.ini file, including screen shots, see Teo Lachev’s blog: http://prologika.com/CS/blogs/blog/archive/2014/01/17/auto-generating-schema-ini.aspx

Update 10/3/2014 – Hit a snag today. Tried to import a CSV file that had a dash “-” in the file name. Import routine did NOT like the dash. I tried various bracketing methods including `'[] to no avail. Ended up changing the file name using underscore “_” instead of dash. Just thought I’d share…

Update 3/25/2015 – Again had a challenge with a – in a file name. I was able to fix the issue using the character ` (found on the tilde ~ key on my keyboard). Trick is to tic the start of the path to the last \, then tic after the last \ to the end of the path. Example:
select * from `c:\somedir\anotherdir`\`some-file.txt`

Update 4/2016 – Again I had a challenge with a file name – but this time with an Excel file on a Windows Share where the share name had dashes and dollar signs. Excel file??? Yes, I use this same approach to read Excel files via CSV. Just pick the “Microsoft Excel Driver” instead of the “Microsoft Access Text Driver” described above. Anyway back to this path issue… the solution: square brackets
Example for an Excel file:
select * from [\\servername\sharename\somedir\some-file.xlsx].[sheetname]

Why do I keep making updates to this old blog? For MY future reference of course…