Category Archives: Advanced Excel

“Everyday” Random Shuffling with Formulas

In this article, we’ll talk about how to complete a random shuffle with only formulas. Shuffling methods via VBA are a dime a dozen, but it’s much harder to perform a formula-based random shuffle since formulas don’t really allow for iteration. Non-VBA methods therefore require we think about these problems differently (what I describe in Dashboards as “Thinking Outside the Cell”). I call this the “everyday” method because it’s technically imperfect: from a statistical standpoint, the results are not uniformly distributed, which means the method isn’t perfectly random at all times. We’ll attend to that later in the article.

Step 1. Lay it all out

For now, let’s take a look at a spreadsheet showing our deck of cards (the Excel file is available for download at the end of this article).

[Image: the full 52-card deck laid out on the worksheet]

Though it’s slightly cut off in the picture, there is a complete 52-card deck.

Step 2. Add Some Randomness

Because this is a shuffle, we need to add some randomness to reorganize the cards in the deck. Perhaps the easiest method to do this would be to use RANDBETWEEN. Here, we’ll use =RANDBETWEEN(1,52) since there are 52 cards in total. In the next image, I’ve dragged this formula down so that it’s used 52 times.

[Image: =RANDBETWEEN(1,52) filled down for all 52 cards, with duplicate results highlighted in peach]

Anyone who has attempted shuffling in Excel before should immediately recognize the problem with using RANDBETWEEN as is. First, there is no guarantee RANDBETWEEN will produce every single number between 1 and 52. Second, there is no guarantee RANDBETWEEN won’t produce a series of repeated numbers. In the image above, the cells highlighted in peach are duplicates! Clearly, our method needs a few more steps!

Step 3. Add Some Noise

The way we solve both issues presented above is by adding noise. Noise is a small amount of “error” we add to each randomly generated number. This small amount of error ensures we produce a unique list of random numbers. In the image below, I’ve added a small amount to each random number. The small amount is essentially the current card index divided by an arbitrarily large number. As you know, a small number divided by a big number produces a very small number. Using the card index ensures the noise added is always different, since no two cards have the same index. Alternatively, I could have used the volatile ROW function instead of the card index.
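As a sketch, if the card index (1 through 52) sits in column B next to each card and we divide by an arbitrary 1,000,000 (both the cell references and the divisor here are my assumptions, not requirements), the noisy formula might look like this:

=RANDBETWEEN(1,52) + B2/1000000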

If you look at rows 11, 18, and 22 in the picture, you’ll see RANDBETWEEN returned 50 for each. But because we’ve added some noise, each item ends up with a different resulting value.

[Image: each random number with its small noise amount added]

Step 4. Re-RANK ‘em

What we need to do now is transform the values produced into numbers between 1 and 52. We can do this using the RANK function. Since each random number is now unique, each number must have a unique rank within the entire range of random numbers produced. For instance, consider the first three numbers produced in the image above: 21.00002, 49.00003, and 28.00004. Ranked in descending (largest-to-smallest) order, these yield ranks of 3 (21.00002 is the lowest number on the list), 1 (49.00003 is the greatest), and 2 (28.00004 is in the middle). We’ll do this RANKing across all cards (shown below).
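Assuming the noisy values from Step 3 sit in C2:C53 (again, an assumed layout), the formula might look like this. RANK’s default order is descending, so the largest value receives rank 1:

=RANK(C2, $C$2:$C$53)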

[Image: the RANK results computed across all 52 noisy random numbers]

Step 5. Use INDEX to Retrieve the Results

If our list of random numbers is unique, then their ranks must also be unique. So we’ve now addressed the two issues described above: we have a resulting list of numbers that fully spans the integers from 1 to 52. The last step is to use this resulting list as an index back onto the original card set.
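Continuing with the same assumed layout, with the original cards in A2:A53 and the ranks from Step 4 in D2:D53, the retrieval formula might be:

=INDEX($A$2:$A$53, D2)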

[Image: INDEX pulling each card from the original deck by its rank]

And that’s it! You now have a shuffled deck.

Discussion

This setup was likely easier to build than you thought. However, it does come with some unfortunate tradeoffs. The biggest tradeoff is that it’s not perfectly random. Here’s what I mean: let’s say two random numbers have a collision (i.e. the same random number is picked for both). In the image above, rows 7 and 14 collide at random number 42. We’ve seemingly resolved this issue by making each cell unique. However, because we’re using RANK, the item in row 7 will always be closer to the top of the deck than the item in row 14. This is because the noise is not randomly generated: collisions in this method will always give sorting preference to the number that collides first. If we ran this random sorter a few thousand times, the resulting frequency of each card would show that the sort order is not completely random.

The easy way around this is to have the random numbers collide less often. We use RANDBETWEEN(1,52) in this example because it makes thinking about the problem more approachable. But since we’re checking the rank of unique numbers in a list, the mathematics do not require the random numbers to be in any specific range. In fact, we could just as easily replace the random-number-generating part of the formula with something like RAND()*10000. Now that the range of possible numbers is bigger, it’s less likely there will be a collision in our shuffle. That said, if we’re thinking broadly about shuffling items, the more items you have to shuffle, the more likely collisions become; the potential for collisions increases with each new random number generated.
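Keeping the same assumed layout and noise term as before, the swap might look like this:

=RAND()*10000 + B2/1000000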

[Image: the shuffle rebuilt with RAND()*10000 in place of RANDBETWEEN]

Alternatively, you could implement a Fisher-Yates shuffle. That method produces a genuinely uniform shuffle, and it is indeed possible to implement with only formulas. However, it’s a bit more complicated and perhaps best saved for an article for another day (but you’re welcome to try it on your own). Until then, this method will suffice for most demonstrations.

Download the worksheet used for this article: Unique Random Sorting


A method for quickly importing and working with CSV files (part 2)

This is part 2, in which I explain how I solved the problem. Make sure to read the previous blog post; taken together, the two posts show how I thought through and ultimately solved the problem.

Query Tables to the Rescue!

To solve this problem, I stumbled on something I rarely use: our good ol’ friend Query Tables. I don’t think they go by Query Tables anymore, but that’s what they’re called in the code, so that’s the terminology I’m going to use here.

You might have used Query Tables as a manual feature before. For instance, to import a CSV file into Excel using Query Tables, you would simply select From Text in the Get External Data group on the Data tab. See the snapshot below. These are Query Tables.

[Image: the From Text button in the Get External Data group on the Data tab]

This functionality is really meant to create a connection between Excel and some other data source, the idea being that you can manipulate and analyze data using Excel as your business logic and presentation layer.

But I didn’t need any of that here. I never expected the data in the underlying CSV file to change, so a connection that could refresh wasn’t necessary. Furthermore, unless you’re using these connections with PowerPivot, Excel has a bad history of linking to other files.

But what Query Tables bring to the table (pun intended) is the ability to dump an entire table’s worth of data into a worksheet tab rather quickly, in one fell swoop. Try it for yourself manually to see how quick it is.

In fact, I was able to automate the entire process using the code below.

   1: Private Sub UploadFiles(FilePath As String, WorksheetTabName As String)
   2:     On Error Resume Next
   3:
   4:     Dim CurrentWorksheet As Excel.Worksheet
   5:     Set CurrentWorksheet = Worksheets(WorksheetTabName)
   6:
   7:     With CurrentWorksheet.QueryTables.Add(Connection:= _
   8:         "TEXT;" & FilePath _
   9:         , Destination:=CurrentWorksheet.Range("$A$1"))
  10:         .CommandType = 0
  11:         .Name = "DeleteMe"
  12:         .FieldNames = True
  13:         .RefreshOnFileOpen = False
  14:         .RefreshStyle = xlOverwriteCells
  15:         .AdjustColumnWidth = True
  16:         .RefreshPeriod = 0
  17:         .TextFilePlatform = 437
  18:         .TextFileStartRow = 1
  19:         .TextFileParseType = xlDelimited
  20:         .TextFileTextQualifier = xlTextQualifierDoubleQuote
  21:         .TextFileConsecutiveDelimiter = False
  22:         .TextFileCommaDelimiter = True
  23:         .TextFileTrailingMinusNumbers = True
  24:         .Refresh BackgroundQuery:=False
  25:     End With
  26:
  27:     CurrentWorksheet.QueryTables(1).Delete
  28:     ThisWorkbook.Names(WorksheetTabName & "!DeleteMe_1").Delete
  29: End Sub

So let’s go through how this works.

First, the way my program works is that the user supplies a folder containing the CSV files. Then, when they hit a button that says “Run,” I have Excel iterate through the contents of the folder. When I’ve verified that a file in the folder will work with my application, I call the UploadFiles method just shown. The FilePath parameter contains the full path to the CSV file whose contents I’d like to upload, and the WorksheetTabName parameter refers to the worksheet where I’d like the contents to be dumped.
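As a rough sketch of that driver loop (the folder-path handling, the IsValidFile check, and the tab-naming rule below are hypothetical stand-ins for my actual logic):

Private Sub RunImport(FolderPath As String)
    Dim FileName As String

    ' Dir with a wildcard returns the first matching file name;
    ' calling Dir again with no arguments returns the next one
    FileName = Dir(FolderPath & "\*.csv")
    Do While Len(FileName) > 0
        ' IsValidFile is a hypothetical placeholder for my validation checks
        If IsValidFile(FolderPath & "\" & FileName) Then
            ' The tab name matches the CSV file name without its extension
            UploadFiles FolderPath & "\" & FileName, _
                        Left(FileName, Len(FileName) - 4)
        End If
        FileName = Dir
    Loop
End Sub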

Back in UploadFiles, you can see that on line 8, FilePath is supplied to the connection string when the query table is created. CurrentWorksheet is then used to tell Excel where the data will be placed (line 9).

On line 11, you see I named this query “DeleteMe.” Whenever Excel creates a new Query Table, it also creates a new named range in the background. There’s a default name Excel uses, but you can overwrite it with your own name. In this case, I use “DeleteMe” in place of the default as a reminder that the range shouldn’t exist (if my code errors out, I will see a DeleteMe named range in the name manager). You can see this happening below. It really helps with debugging.

[Image: the Name Manager showing the locally scoped DeleteMe_1 names]

The reason there are so many repeats is that each name created by the code above is scoped locally to the sheet in which it was created. Excel also adds that “_1” to the end. I imagine if I had multiple connections on the same sheet it would increment that last number. But maybe someone with more experience could confirm that for me?
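As an aside, if an error ever does leave orphaned names behind, one hedge is a small cleanup loop like this hypothetical one, run separately from the import routine:

Dim nm As Name
For Each nm In ThisWorkbook.Names
    ' Locally scoped names look like "Sheet1!DeleteMe_1"
    If InStr(nm.Name, "DeleteMe") > 0 Then nm.Delete
Next nm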

In any event, we want to ensure there are no traces of the connection settings. Setting BackgroundQuery to False on line 24 forces the refresh to run synchronously, so the data is fully in place before the next lines execute. Line 27 deletes the QueryTable we just created; don’t worry, this just removes the connection behind the scenes while keeping the data that’s been dumped. Finally, line 28 removes the named range that was just created. Since it’s local to the worksheet, we can refer to it as “WorksheetName!DeleteMe_1” and then delete it.

So let’s recap. I’ve created a mechanism to quickly import the data from CSV files. I used the functionality incorporated into Query Tables to bypass other solutions that either were not as good or weren’t feasible. In particular, I was able to avoid two approaches I particularly hate: (1) the iterative algorithm; and (2) copying and pasting from other files.

All told, this solution works great! Right now, Excel iterates through a folder of about eight CSV files and places the data from each one rather quickly. In fact, it happens so quickly that I had to add a completion MsgBox just to make sure the code was actually running! Neither the iterative nor the copy/paste solution can boast these results.

Remember, whenever possible, do more with less.

A method for quickly importing and working with CSV files (part 1)

So many folks have asked me to write more about my coding style. So here it is: I’ve put together a two-part blog post to present my thinking style and how I ultimately solved the problem.

The Problem

I have a client with several CSV files that must be checked and validated against certain data specifications. Each individual CSV file is essentially the output of a table from a query of a larger system. My client required that I verify the contents of the CSV files and ensure the connections between the files are correct.

By connections, I mean that several of these CSV files have foreign- and primary-key relationships between them. The CSV files could also contain many errors. My client needed an Excel application that could verify these files link together with good data; with that information, they could upload the files into another system.

Below is my thinking on how to solve the problem. In the next blog post, I’ll go through implementing the solution.

PowerPivot Won’t Cut It

Now, PowerPivot is actually a great candidate for this problem. But I couldn’t be sure my client (and the folks using my Excel application) would have the ability to install and use PowerPivot. So while I’d love to use PowerPivot for this in an ideal world, it just wasn’t a feasible solution for my client.

ADO is Out

So another option would be to use ADO (here’s an example to see what I mean). The problem with using ADO is that it relies on references outside the normal VBA library. In previous years, that didn’t use to be a problem. But as I try to run many of my older Excel applications that interfaced with Access and other Excel files using ADO, it’s just not a solution that ages well. The libraries keep getting updated or their names change, and new versions of Excel require new connection strings. Over time, these types of solutions just break.

Iteration: My Own Worst Enemy

The most obvious solution (indeed, the one I really, really wanted to avoid) was to open these files all at the same time and then iterate through each record. This solution, however, is the bane of my existence.

For one, iterative algorithms like these explode in complexity as more records are added to the CSV file. And they take forever to complete.

Second, iterative algorithms for this type of problem really promote something I don’t like. Specifically, they require you to create nested loops upon nested loops of iterations and compounding conditions.

This type of iterative solution is often the go-to for many organizations. It often results in hundreds of lines of VBA in a module with large sections of comments explaining what’s going on. But those comments never help. Indeed, when a company expresses concern about hiring an Excel developer because they fear what might happen when that person leaves, it’s because they’ve experienced one of these long-form iterative solutions. Let’s be honest: they take forever to complete; they’re a pain in the ass to debug; and when passed along to the next developer, they’re rarely touched because nobody understands them.

Copying/Pasting: Just Say No

So I decided early on I had no interest in opening these CSV files several at a time and creating a gigantic loop that only I could (barely) understand. I reasoned that I could do much of the required validation rather easily (and quickly) with formulas. But to use formulas on the data, I would have to load these CSV files into the current workbook somehow. My goal then was to have corresponding tabs filled in with the contents of each given CSV file.

Because each CSV file would always have the same name, I wouldn’t have to create new tabs on the fly. So that was nice: I could create a tab up front for each corresponding CSV file (in this case, each tab’s name was the same as its CSV file’s name).

So another obvious solution would be to open each file programmatically and then copy its entire contents over using the Copy and Paste commands in VBA. This solution is common but also fraught with problems. For one, it’s an expensive operation when the datasets are large: a large CSV file requires a lot of data to be copied to the clipboard, and several of these pasting operations happening one after another is a great way to crash a spreadsheet. Second, sometimes the clipboard simply acts unpredictably, clearing out for no apparent reason.

So what did I do? Find out in the next thrilling blog post… but feel free to share your ideas!

Developing a JSON Excel Add-In With VB.NET

Written by Bjoern Stiel

Add-Ins are a great way to increase Excel’s firepower. Add-Ins not only add additional functionality but are also an ideal vehicle for distributing your custom functions across multiple workbooks and users.

If writing a new Add-In is on your to-do list anyway, this is the perfect opportunity to look beyond VBA. VBA has become a bit middle-aged. Integration with modern stuff like NoSQL databases or web services can become quite painful as it involves messing around with DLLs and cumbersome APIs.

Excel-DNA is a free, open-source tool that lets you expose .NET functions to Excel via the XLL interface. This gives you immediate access to all .NET languages (VB.NET, C#, F#, etc.) and libraries, which makes developing modern-looking WPF forms or hooking into a JSON API a breeze.

In this post, I show you how to get started with Excel-DNA. We hit the ground running, busting out our first two functions, and take a brief look at what to watch out for when migrating over from VBA. To wrap it all up, we build a slick Excel JSON client in just a few lines of code and use it to retrieve data from a web server with a simple =OBJECT_GET(…) function call.

All source code is available for download at GitHub:
https://github.com/Pathio/excel-json-client

An example spreadsheet is available at https://spreadgit.com/bjoern/excel-json-client.

1. Getting Started
There are two ways to write a User Defined Function in Excel-DNA. Either as a text file (.dna) or a .NET assembly (.dll). It’s easiest to start with the text file approach which we will do throughout this post. You need three things:
– a text editor such as Notepad
– the .NET Framework 2.0 runtime or later (.NET 4.0 or later if you are on Excel 64-bit), should already be installed with Windows anyway, so you probably do not need to worry about this bit
– the Excel-DNA XLL (https://exceldna.codeplex.com/releases/view/119190)

If you want to get jump-started, clone my git repository https://github.com/Pathio/excel-json-client and you’re good to go; it comes with batteries included. Get the example workbook at https://spreadgit.com/bjoern/excel-json-client and open one of the XLLs (ending with 32 or 64, depending on whether you are on Excel 32- or 64-bit) via drag & drop, File/Open, or the Add-In manager.

2. Writing Our First Function
I’ve defined the code in the excel-json-client.dna file, which is just a text file. As mentioned above, you can use Visual Studio (which you probably should), but to avoid all the overhead, I’ll keep it simple in this post.
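For reference, a .dna file is just XML that wraps your code. A minimal sketch (the attribute names here follow Excel-DNA’s conventions as I remember them, so double-check against the documentation) looks something like this:

<DnaLibrary Name="excel-json-client" Language="VB" RuntimeVersion="v4.0">
<![CDATA[
' Your VB.NET modules and classes go here
]]>
</DnaLibrary>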

Let’s look at the DISTINCT function defined in the Helpers module. It gives you a first idea of how we pass data around between .NET and Excel. We receive a two-dimensional array of type Object (think Variant), filter out all duplicates, and return another two-dimensional Object array. I’m pretty sure there’s a more elegant way of removing the duplicates via LINQ (.NET’s Language Integrated Query) but it does the job.

Public Module Helpers
  Public Function DISTINCT(data As Object(,))
    Dim dict As Dictionary(Of String, Integer) = New Dictionary(Of String, Integer)
    Dim key As String
    Dim i As Integer = 0, j As Integer = 0

    ' First pass: build a signature for each row by concatenating its
    ' cells; remember the first row index we saw for each signature
    For i = 0 To data.GetUpperBound(0)
      key = ""
      For j = 0 To data.GetUpperBound(1)
        key += data(i, j).ToString()
      Next
      If Not dict.ContainsKey(key) Then
        dict.Add(key, i)
      End If
    Next

    ' Second pass: copy each unique row into the output array
    Dim out(dict.Count - 1, data.GetUpperBound(1)) As Object
    i = 0
    For Each key In dict.Keys
      For j = 0 To data.GetUpperBound(1)
        out(i, j) = data(dict.Item(key), j)
      Next
      i += 1
    Next
    DISTINCT = out
  End Function
End Module
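To call it from a sheet, you’d select an appropriately sized output range and enter something like =DISTINCT(A1:B10) as an array formula, confirming with Ctrl+Shift+Enter (in Excel versions without dynamic arrays).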

3. Migrating From VBA to VB.NET
Migrating from VBA to VB.NET/Excel-DNA is relatively straightforward, but there are a couple of gotchas and things to watch out for, most importantly:
– Variant is no longer supported: use the Object type instead.
– Function parameters of type Object will arrive as one of the following: Double, String, Boolean, ExcelDna.Integration.ExcelError, ExcelDna.Integration.ExcelMissing, ExcelDna.Integration.ExcelEmpty, an Object[,] array containing a mixture of the above types, or ExcelReference (only if AllowReference=true in ExcelArgumentAttribute, causing an R-type instead of a P-type argument); see the sketch after this list.
– Arrays: In VBA, the default lower bound of an array dimension is 0 but can be set to 1 using “Option Base”. VB.NET does not support “Option Base”, and the lower bound of an array dimension must be 0. Be careful, though: most Office collections begin with 1.
For a more exhaustive list, I definitely recommend checking out Patrick O’Beirne’s excellent guide and the Excel-DNA data marshalling documentation by Govert (the guy who wrote and maintains Excel-DNA).
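To illustrate the second point, here’s a tiny hypothetical UDF (DESCRIBE is my own example name, not something shipped with Excel-DNA) that reports what actually arrived from Excel:

Public Module MigrationDemo
  ' Hypothetical example: report which .NET type Excel-DNA handed us
  Public Function DESCRIBE(value As Object) As String
    If TypeOf value Is ExcelDna.Integration.ExcelMissing Then
      DESCRIBE = "argument omitted"
    ElseIf TypeOf value Is ExcelDna.Integration.ExcelEmpty Then
      DESCRIBE = "empty cell"
    ElseIf TypeOf value Is Double Then
      DESCRIBE = "a number"
    ElseIf TypeOf value Is String Then
      DESCRIBE = "text"
    ElseIf TypeOf value Is Object(,) Then
      DESCRIBE = "an array"
    Else
      DESCRIBE = "something else: " & value.GetType().Name
    End If
  End Function
End Module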

4. A JSON Excel client
JSON is all over the web and has become the number one choice for RESTful APIs. Pulling JSON from a web server into Excel is definitely a useful thing to have. For example, at spreadgit, we run our entire admin backend off Excel via a JSON API. Unfortunately, building a JSON client in VBA is quite a messy job that involves a lot of HTTP and deserialization plumbing and really does feel like re-inventing the wheel.

With Excel-DNA, however, we can use whatever is available in the .NET world. HTTP request classes are built in, and for handling the JSON bit we use the excellent Json.NET library, written and maintained by James Newton-King. All we need to do is some plumbing work. Before we roll up our sleeves, let’s briefly design the data flow.

1. The user requests data from a URL (that returns JSON) by calling the Excel function =OBJECT_GET("http://myobject#field")
2. We check whether the object http://myobject already exists in our cache from a previous call. If not, we make an HTTP request to http://myobject and stick the result in our cache.
3. We return the value for “field” on object http://myobject (now definitely in our cache)

As you can see, we make use of the hash mark separator to identify a field within our JSON document. Let’s start with a thin JSON class. We want to deserialize the JSON data we get from the web server into a (nested) dictionary so we can easily access its fields via keys.

Public Class Json
  Public Shared Function Deserialize(json As String) As Object
    Deserialize = ToObject(JToken.Parse(json))
  End Function

  Public Shared Function ToObject(token As Newtonsoft.Json.Linq.JToken) As Object
    If token.Type = JTokenType.Object Then
      ' JSON object: recurse into each property and build a dictionary
      Dim dict As Dictionary(Of String, Object) = New Dictionary(Of String, Object)
      Dim prop As Object
      For Each prop In CType(token, JObject).Properties()
        dict.Add(prop.Name, ToObject(prop.Value))
      Next
      ToObject = dict
    ElseIf token.Type = JTokenType.Array Then
      ' JSON array: recurse into each element and build a list
      Dim lst As List(Of Object) = New List(Of Object)
      Dim value As Object
      For Each value In token
        lst.Add(ToObject(value))
      Next
      ToObject = lst
    Else
      ' Primitive value: unwrap it
      ToObject = DirectCast(token, JValue).Value
    End If
  End Function
End Class

Next thing on our list is the cache. It is simply another dictionary, where the URL (up to the hash mark separator mentioned above) is our key and the deserialized JSON dictionary is our value.

Public Class Cache
  Private _objects As New Dictionary(Of String, Object)

  Public Sub Clear()
    _objects.Clear()
  End Sub

  Public Sub ImportObject(baseUrl As String)
    ' Fetch the document over HTTP, deserialize it, and cache it by URL
    Dim wrGETURL As WebRequest
    wrGETURL = WebRequest.Create(baseUrl)
    Dim objStream As Stream
    objStream = wrGETURL.GetResponse.GetResponseStream()
    Dim objReader As New StreamReader(objStream)
    _objects.Add(baseUrl, Json.Deserialize(objReader.ReadToEnd))
    objReader.Close()
    objStream.Close()
  End Sub

  Public Function GetObject(url As String) As Object
    ' The cache key is the URL up to the hash mark separator
    Dim baseUrl As String = Nothing
    baseUrl = url.Split("#")(0)
    If Not _objects.ContainsKey(baseUrl) Then
      ImportObject(baseUrl)
    End If
    GetObject = _objects.Item(baseUrl)
  End Function

  Public Function GetObjectPartial(url As String) As Object
    ' Walk down into the document: the part after the hash mark is a
    ' field path, with nested fields separated by "/"
    Dim baseUrl As String = Nothing, obj As Object
    Dim key As String
    baseUrl = url.Split("#")(0)
    Try
      obj = GetObject(baseUrl)
      If url.Contains("#") Then
        For Each key In url.Split("#")(1).Split("/")
          obj = obj.Item(key)
        Next
      End If
      GetObjectPartial = obj
    Catch ex As Exception
      GetObjectPartial = ex.Message
    End Try
  End Function
End Class

That’s pretty much it. The last thing to do is to wrap it all up nicely in a couple of Excel functions. We use a slight trick to be able to work with complex objects: for anything we cannot resolve on a two-dimensional grid, we just return the URL of the object instead of the data. Let’s assume we retrieve a JSON document from http://localhost:8000/test.json that looks like this (a nested object):

  {
    "id": 1234,
    "object": {
      "name": "test",
      "data": [["date", "value"], ["2014-04-01", 100], ["2014-04-02", 101]]
    }
  }

In this instance:
=OBJECT_GET("http://localhost:8000/test.json") returns http://localhost:8000/test.json
=OBJECT_GET("http://localhost:8000/test.json#id") returns 1234
=OBJECT_GET("http://localhost:8000/test.json#object") returns http://localhost:8000/test.json#object
=OBJECT_GET("http://localhost:8000/test.json#object/data") returns [["date", "value"], ["2014-04-01", 100], ["2014-04-02", 101]]

Public Module Objects
  Dim cache As Cache = New Cache()

  Public Function OBJECT_GET(url As String)
    Dim obj As Object
    Dim i As Long = 0, j As Long
    Dim row As Object, col As Object, item As Object

    obj = cache.GetObjectPartial(url)
    If TypeOf obj Is List(Of Object) Then
      If TypeOf CType(obj, List(Of Object))(0) Is List(Of Object) Then
        ' A list of lists maps onto a two-dimensional array for Excel
        Dim o(,) As Object
        ReDim o(CType(obj, List(Of Object)).Count() - 1, CType(obj, List(Of Object))(0).Count() - 1)
        For Each row In obj
          j = 0
          For Each col In row
            o(i, j) = col
            j += 1
          Next
          i += 1
        Next
        obj = o
      Else
        ' A flat list maps onto a one-dimensional array
        Dim o() As Object
        ReDim o(CType(obj, List(Of Object)).Count() - 1)
        For Each item In obj
          o(i) = item
          i += 1
        Next
        obj = o
      End If
    ElseIf TypeOf obj Is Dictionary(Of String, Object) Then
      ' Nested objects cannot be resolved on a grid: return the URL instead
      obj = url
    End If
    OBJECT_GET = obj
  End Function
End Module

Let’s give it a try and access some Apple stock data via the quandl.com API. Feel free to download the example workbook at https://spreadgit.com/bjoern/excel-json-client. Take it for a spin and try it with any other JSON API you can find.

Please note that the code above doesn’t cover all cases; for instance, arrays of objects aren’t handled properly. Another nice thing to have would be an object explorer UI to explore the objects in your cache easily, or POST, PUT, and DELETE requests, all of which can be easily added. This gives you a powerful, generic Add-In that lets you hook into literally any JSON API.

How do you like using VB.NET via Excel-DNA?

About the Author
Bjoern Stiel
Founder of Pathio (formerly known as spreadgit), a GitHub-like version control platform for Excel. A former quant developer and trader who used to push Excel beyond its limits, now a jack of all trades. You can find me online tweeting about Excel, writing posts, or developing .NET Add-Ins. Contact me @path_io or drop me a line at bjoern.stiel@pathio.com.