Category Archives: Development

A method for quickly importing and working with CSV files (part 1)

So many folks have asked me to write more on my coding style. So here it is: I’ve developed a two-parter blog post to present my thinking style and how I ultimately solved the problem.

The Problem

I have a client with several CSV files that must be checked and validated for certain data specifications. Each individual CSV file is essentially an output of a table from a query of a larger system. My client required I verify the contents of the CSV files and ensure the connections between the file are correct.

By connection, I mean that several of these CSV files have foreign and primary keys between them. The CSV files could also contain many errors. My client needs an Excel application that can verify these files link together with good data. With that information they can upload the files into another system.

Below is my thinking on how to solve the problem. In the next blog post, I’ll go through implementing the solution.

PowerPivot Won’t Cut It

Now, PowerPivot is actually a great candidate for this problem. But I couldn’t be sure my client (and the folks using my Excel application) would have the ability to install and use PowerPivot. So while I’d love to use PowerPivot for this in an ideal world, it just wasn’t a feasible solution for my client.

ADO is Out

So another would be to use ADO (here’s an example to see what I mean). The problem with using ADO is that it relies on references outside the normal VBA library. In previous years, that didn’t use to be a problem. But as I try to run many of my older Excle applications that had interfaced with Access and other Excel files using ADO, it’s just not a solution that ages well. The reason is that the libraries keep getting updated or names change. New versions of Excel require new connection strings. Over time, these types of solutions just break.

Iteration: My Own Worst Enemy

The most obvious solution—indeed, the one I really, really wanted to avoid—was to open these files all at the same time and then iterate through each record. This solution however is actually the bane of my existence.

For one, iterative algorithms like these explode in complexity as more records are added to the CSV file. And they take forever to complete.

Second, iterative algorithms for this type of problem really promotes something I don’t like. Specifically, it requires you create nested loops upon nested loops of iterations and compounding conditions.

This type of iterative solution is often the go-to for many organizations. It often results in hundreds of lines of VBA in a module with large sections of comments explaining what’s going on. But those comments never help. Indeed, when a company expresses concern about hiring an Excel developer because they fear what might happen when that person leaves, it’s because they’ve experienced one of these long-form iterative solutions. Let’s be honest: they take forever to complete; they’re a pain in the ass to debug; and when passed along to the next developer, they’re rarely touched because nobody understands them.

Copying/Pasting: Just Say No

So I decided early on I had no interest in opening these CSV files several at a time and creating a gigantic loop that only I could (barely) understand. I reasoned that I could do much of the required validation rather easily (and quickly) with formulas. But to use formulas on the data, I would have to load these CSV files into the current workbook somehow. My goal then was to have corresponding tabs filled in with the contents of each given CSV file.

Because each CSV files would always have the same name, I wouldn’t have to create new tabs on the fly. So that was nice: I could first create a tab for each corresponding CSV file (in this case, the tab’s name was the same as the CSV file).

So another obvious solution would be to open each file programmatically, and then copy its entire contents over using the Copy and Paste commands in VBA. This solution is common, but also fraught with problems. For one, it’s an expensive solution when the datasets are large. A large CSV file would require a lot of data copied to the clipboard. Several of these pasting operations happening one after another is a great way to crash a spreadsheet. Second, sometimes the clipboard just simply acts unpredictably; clearing out for now apparent reason.

So what did I do? Find out in the the next thrilling blog post… but feel free to share your ideas!

Developing a JSON Excel Add-In With VB.NET

Written by Bjoern Stiel

Add-Ins are a great way to increase Excel’s firepower. Add-Ins not only add additional functionality but are also an ideal vehicle for distributing your custom functions across multiple workbooks and users.

If writing a new Add-In is on your to-do list anyway, this is the perfect opportunity to look beyond VBA. VBA has become a bit middle-aged. Integration with modern stuff like NoSQL databases or web services can become quite painful as it involves messing around with DLLs and cumbersome APIs.

Excel-DNA is a free open source tool that lets you expose .NET functions to Excel via the XLL interface. This gives you immediate access to all .NET languages (VB.NET, C#, F# etc.) and libraries. Which makes developing modern looking WPF forms or hooking into a JSON API really a breeze.

In this post, I show you how to get started with Excel-DNA. We hit the ground running busting out our first two functions and have a brief look at what to watch out for when migrating over from VBA. To wrap it all up, we build a slick Excel JSON client in just a few lines of code and show use it to retrieve data from a web server with a simple =OBJECT_GET(…) function call.

All source code is available for download at GitHub:
https://github.com/Pathio/excel-json-client

An example spreadsheet is available at https://spreadgit.com/bjoern/excel-json-client.

1. Getting Started
There are two ways to write a User Defined Function in Excel-DNA. Either as a text file (.dna) or a .NET assembly (.dll). It’s easiest to start with the text file approach which we will do throughout this post. You need three things:
– a text editor such as Notepad
– the .NET Framework 2.0 runtime or later (.NET 4.0 or later if you are on Excel 64-bit), should already be installed with Windows anyway, so you probably do not need to worry about this bit
– the Excel-DNA XLL (https://exceldna.codeplex.com/releases/view/119190)

If you want to get jump-started, clone my git repository https://github.com/Pathio/excel-json-client and you’re good to go, it comes with all batteries. Get the example workbook at https://spreadgit.com/bjoern/excel-json-client and open one of the XLLs (ending with 32 or 64 -depending whether you are on Excel 32- or 64-bit) via drag & drop, file/open or the Add-In manager.

2. Writing Our First Function
I’ve defined the code in the excel-json-client.dna file which is just a text file. As mentioned above, you can Visual Studio studio (which you probably should) but to avoid all the overhead, I’ll keep it simple in this post.

Let’s look at the DISTINCT function defined in the Helpers module. It gives you a first idea of how we pass around data between .NET and Excel. We receive a two-dimensional array of type Object (think Variant), filter out all duplicates and return another two dimensional Object array. I’m pretty sure there’s a more elegant way of removing the duplicates via LINQ (.NET’s Language Integrated Query language) but it does the job.

Public Module Helpers
 Public Function DISTINCT(data As Object(,))
   Dim dict As Dictionary(Of String, Integer) = New Dictionary(Of String, Integer)
   Dim key As String
   Dim i As Integer = 0, j As Integer = 0
 
   For i = 0 To data.GetUpperBound(0)
     key = ""
     For j = 0 To data.GetUpperBound(1)
       key += data(i, j).ToString()
     Next
     If Not dict.ContainsKey(key) Then
       dict.Add(key, i)
     End If
   Next
   Dim out(dict.Count-1, data.GetUpperBound(1)) As Object
   i = 0
   For Each key In dict.Keys
     For j = 0 To data.GetUpperBound(1)
       out(i, j) = data(dict.Item(key), j)
     Next
     i += 1
   Next
   DISTINCT = out
 End Function
End Module

3. Migrating From VBA to VB.NET
Migrating from VBA to VB.NET/Excel-DNA is relatively straight forward. But there are a couple of gotchas and things to watch out for, most importantly:
– Variant is no longer supported: use the Object type instead
– Function parameters of type Object will arrive as one of the following: Double, String, Boolean, ExcelDna.Integration.ExcelError, ExcelDna.Integration.ExcelMissing, ExcelDna.Integration.ExcelEmpty, Object[,] containing an array with a mixture of the above types, ExcelReference (only if AllowReference=true in ExcelArgumentAttribute causing R type instead of P)
– Arrays: In VBA, the default lower bound of an array dimension is 0 but can be set to 1 using “Option Base”. VB.NET does not support “Option Base” and the lower bound of an array dimension must be 0. Be careful though that most Office collections begin with 1. For a more exhaustive list, I definitely recommend checking out Patrick O’Beirne’s excellent guide and Govert’s (the guy who wrote and maintains Excel-DNA) documentation on Excel-DNA’s Data marshalling.

4. A JSON Excel client
JSON is all over the web and has become the number one choice for RESTful APIs. Pulling JSON from a web server onto Excel is definitely a useful thing to have. For example, at spreadgit, we run our entire admin-backend off Excel via a JSON API. Unfortunately, building a JSON client in VBA is quite a messy job and involves a lot of HTTP and deserialization stuff which does really feel like re-inventing the wheel.

With Excel-DNA however, we can use whatever is available in the .NET world. HTTP request classes are built in and for handling the JSON bit we use the excellent Json.NET library, written and maintained by James Newton-King’s. All we need to do is some plumbing work. Before we roll up sleeves, let’s briefly design the data flow.

1. User requests data from a URL (that returns a JSON) by calling the Excel function =OBJ_GET(http://myobject#field)
2. Let’s see if the object http://myobject already exists in our cache from a previous call. If not, do an HTTP request to http://myobject and stick it into our cache.
3. Return the value for “field” on object http://myobject (now definitely in our cache)

As you can see, we make use of the hash mark separator to identify a field within our JSON document. Let’s start with a thin JSON class. We want to deserializes the JSON data we get from the web server as a (nested) dictionary so we can easily gain access to its fields via keys.

Public Class Json
 Public Shared Function Deserialize(json As String) As Object
   Deserialize = ToObject(JToken.Parse(json))
 End Function

 Public Shared Function ToObject(token As Newtonsoft.Json.Linq.JToken) As Object
   If token.Type = JTokenType.Object Then
   Dim dict As Dictionary(Of String, Object) = New Dictionary(Of String, Object)
   Dim prop As Object
   For Each prop In CType(token, JObject).Properties()
     dict.Add(prop.Name, ToObject(prop.Value))
   Next
   ToObject = dict
   ElseIf token.Type = JTokenType.Array Then
     Dim lst As List(Of Object) = New List(Of Object)
     Dim value As Object
     For Each value In token
       lst.Add(ToObject(value))
     Next
     ToObject = lst
   Else
     ToObject = DirectCast(token, JValue).Value
   End If
   End Function
 End Class

Next thing on our list is the cache. It is simply another dictionary, whereby the url is our key (up to the hash mark separator as mentioned above) and the JSON deserialized dictionary our value.

Public Class Cache
 Private _objects As New Dictionary(Of String, Object)

 Public Sub Clear()
   _objects.Clear()
 End Sub

 Public Sub ImportObject(baseUrl As String)
   Dim wrGETURL As WebRequest
   wrGETURL = WebRequest.Create(baseUrl)
   Dim dictionary As New Dictionary(Of String, Object)
   Dim objStream As Stream
   objStream = wrGETURL.GetResponse.GetResponseStream()
   Dim objReader As New StreamReader(objStream)
   _objects.Add(baseUrl, Json.Deserialize(objReader.ReadToEnd))
   objReader.Close()
   objStream.Close()
 End Sub

 Public Function GetObject(url As String) As Object
   Dim baseUrl As String = Nothing
   baseUrl = url.Split("#")(0)
   If Not _objects.ContainsKey(baseUrl) Then
     ImportObject(baseUrl)
   End If
   GetObject = _objects.Item(baseUrl)
 End Function

 Public Function GetObjectPartial(url As String) As Object
   Dim baseUrl As String = Nothing, obj As Object
   Dim key As String
   baseUrl = url.Split("#")(0)
   Try
     obj = GetObject(baseUrl)
     If url.Contains("#") Then
       For Each key In url.Split("#")(1).Split("/")
         obj = obj.Item(key)
       Next
     End If
     GetObjectPartial = obj
   Catch ex as Exception
     GetObjectPartial = ex.Message
   End Try
 End Function`
End Class

That’s pretty much it. Last thing to do is to wrap it all up nicely in a couple of Excel functions. We use a slight trick to be able to work with complex objects: Anything we cannot resolve on a two-dimensional grid, we just return the URL of the object instead of the data. Let’s assume we retrieve a JSON document from http://localhost:8000/test.json that looks like this (which is a nested object):

  {
    'id': 1234,
    'object': {
      'name': 'test', 
      'data': [['date' 'value'], ['2014-04-01', 100], ['2014-04-02', 101]]
    }
  }

In this instance
=OBJECT_GET(“http://localhost:8000/test.json”) returns http://localhost:8000/test.json
=OBJECT_GET(“http://localhost:8000/test.json#id”) returns 1234
=OBJECT_GET(“http://localhost:8000/test.json#object” returns http://localhost:8000/test.json#object
=OBJECT_GET(“http://localhost:8000/test.json#object#data”) returns [[‘date’ ‘value’], [‘2014-04-01’, 100], [‘2014-04-02’, 101]]

 Public Module Objects
 Dim cache As Cache = New Cache()
 
 Public Function OBJECT_GET(url As String)
   Dim path As String() = Nothing
   Dim obj As Object
   Dim i As Long = 0, j As Long
   Dim row As Object, col As Object, item As Object

   obj = cache.GetObjectPartial(url)
   If TypeOf obj Is List(Of Object) Then
   If TypeOf CType(obj, List(Of Object))(0) Is List(Of Object) Then
   Dim o(,) As Object
   ReDim o(CType(obj, List(Of Object)).Count()-1, CType(obj, List(Of Object))(0).Count()-1)
   For Each row In obj
     j = 0
     For Each col In row
       o(i, j) = col
       j += 1
     Next
     i += 1
   Next
   obj = o
   Else
     Dim o() As Object
     ReDim o(CType(obj, List(Of Object)).Count()-1)
     For Each item In obj
       o(i) = item
       i += 1
     Next
     obj = o
   End If
   ElseIf TypeOf obj Is Dictionary(Of String, Object) Then
     obj = url
   End If
   OBJECT_GET = obj
 End Function
End Module

Let’s give it a try and access some Apple stock data via the quandl.com API. Feel free to download the example workbook at https://spreadgit.com/bjoern/excel-json-client. Take it for a spin and try with any other JSON API you can find.

Please note that the code above doesn’t cover all cases, for instance arrays of objects aren’t handled properly. Another nice thing to have is an object explorer UI to explorer the objects in your cache easily. Or adding POST, PUT and DELETE requests, all of which can be easily added. This gives you a powerful, generic Add-In that lets you hook into literally any JSON API.

How do you like using VB.NET via Excel-DNA?

About the Author
Bjoern Stiel
Founder of Pathio (formerly known as spreadgit), a GitHub like version control platform for Excel. As a former quant developer and trader, used to push Excel beyond its limits, now jack of all trades. You can find me online, tweeting about Excel, writing posts or developing .NET Add-Ins. Contact me @path_io or drop me a line bjoern.stiel@pathio.com.

It’s time to say “Goodbye,” to Hungarian Notation

Most of us code using the Hungarian notation style when writing VBA code. Basically, that’s when you prefix an abbreviated description of a variables’ type or class in front of its name. For example, the ‘d’ below is a prefix for a variable of type double. You get the idea.

   1: Dim dDouble   As Double

   2: Dim sString   As String

   3: Dim cmdButton As CommandButton

I’m not really sure why Hungarian Notation has not yet been replaced—or, at the very least, discouraged for VBA coding. But here’s some speculation: 

(1) it’s a leftover from Visual Basic 6.0;

(2) because what has been put forth to replace it (specifically, Pascal or “CamelBack” Notation) looks like C/C++ and many VBA and Visual Basic 6.0 coders fear that C-languages are too overly complicated for them;

and (3) having a bad style guide is better than no style guide.

Let’s be honest: there’s really no value added with VBA and Hungarian Notation. It makes code ugly-looking and hard to follow (despite what we’ve been told). And, not every coder uses the same style abbreviations. Heck, even the list of abbreviations Microsoft recommends is too large to memorize. Think about this: Does cmbDisplayValues refer to a command button to display values or a combo-box of displayed values? It’s hard to tell—and I’m not the only one who thinks so.

Just take a look at the argument names of Excel’s internal object methods. Notice how virtually none of them use Hungarian Notation. They employee variable names like “Target,” and “Index.” How many times have you confused these names as referring to different types? Probably not that often, if it all. Certain keywords in variables names, like “Count” and “No” almost always refer to integers, so why not use them instead? I argue that you should use a descriptive name that imbues the meaning of the variable—nothing more. A variable called ProjectNPV is likely to be a double if it indeed refers to a net present value calculation for a project. You don’t need to put a ‘d’ in front of it.

Not convinced? Think about this Excel’s internal objects. What is the collection called that holds all the information about every series in the Excel chart? It’s called a SeriesCollection  and not cltSeries; ListObjects not lstObjects; Workbooks not wkbObjects. The point is made: if Microsoft doesn’t use Hungarian Notation anymore, then neither should we.

In the forthcoming weeks, I’ll be posting more on this subject. It’s time we changed the way we develop with Excel. Out with the old, in with the new. Let’s make 2014 the year of good code.

It’s time to say “Goodbye,” to Hungarian Notation

Most of us code using the Hungarian notation style when writing VBA code. Basically, that’s when you prefix an abbreviated description of a variables’ type or class in front of its name. For example, the ‘d’ below is a prefix for a variable of type double. You get the idea.

   1: Dim dDouble   As Double

   2: Dim sString   As String

   3: Dim cmdButton As CommandButton

I’m not really sure why Hungarian Notation has not yet been replaced—or, at the very least, discouraged for VBA coding. But here’s some speculation: 

(1) it’s a leftover from Visual Basic 6.0;

(2) because what has been put forth to replace it (specifically, Pascal or “CamelBack” Notation) looks like C/C++ and many VBA and Visual Basic 6.0 coders fear that C-languages are too overly complicated for them;

and (3) having a bad style guide is better than no style guide.

Let’s be honest: there’s really no value added with VBA and Hungarian Notation. It makes code ugly-looking and hard to follow (despite what we’ve been told). And, not every coder uses the same style abbreviations. Heck, even the list of abbreviations Microsoft recommends is too large to memorize. Think about this: Does cmbDisplayValues refer to a command button to display values or a combo-box of displayed values? It’s hard to tell—and I’m not the only one who thinks so.

Just take a look at the argument names of Excel’s internal object methods. Notice how virtually none of them use Hungarian Notation. They employee variable names like “Target,” and “Index.” How many times have you confused these names as referring to different types? Probably not that often, if it all. Certain keywords in variables names, like “Count” and “No” almost always refer to integers, so why not use them instead? I argue that you should use a descriptive name that imbues the meaning of the variable—nothing more. A variable called ProjectNPV is likely to be a double if it indeed refers to a net present value calculation for a project. You don’t need to put a ‘d’ in front of it.

Not convinced? Think about this Excel’s internal objects. What is the collection called that holds all the information about every series in the Excel chart? It’s called a SeriesCollection  and not cltSeries; ListObjects not lstObjects; Workbooks not wkbObjects. The point is made: if Microsoft doesn’t use Hungarian Notation anymore, then neither should we.

In the forthcoming weeks, I’ll be posting more on this subject. It’s time we changed the way we develop with Excel. Out with the old, in with the new. Let’s make 2014 the year of good code.