Power Query: Fuzzy Fixing Part 1
20 November 2024
Welcome to our Power Query blog. This week, I consider how to fix data that is almost correct!
My imaginary salespeople have been working abroad; I have a series of transactions that have been rejected because the country name has not been recognised.
I am going to start by converting the data into an Excel Table by clicking anywhere in my data and using the shortcut CTRL + T.
I take the defaults and using the context-specific ‘Table Design’ tab, I change the ‘Table Name’ to Rejected_Transactions:
I right-click on my Table and choose to ‘Get Data from Table/Range’.
This brings my data into the Power Query Editor:
I am going to be using an M function called Table.AddFuzzyClusterColumn(). Hopefully, it is easier to use than to spell!
The syntax is:
Table.AddFuzzyClusterColumn(table as table, columnname as text, newcolumnname as text, optional options as nullable record) as table
This function will create a new column newcolumnname to table with values which are calculated by fuzzily matching values in columnname for each row.
The options I may use are extensive, and include:
- Culture: allows grouping records based upon culture-specific rules. It can be any valid culture name. For example, a Culture option of "ja-JP" groups records based on the Japanese culture. The default value is "", which groups based on the Invariant English culture
- IgnoreCase: a logical (true / false) value that allows case-insensitive key grouping. For example, when true, "Grapes" is grouped with "grapes". The default value is true
- IgnoreSpace: a logical (true / false) value that allows combining of text parts to find groups. For example, when true, "Gra pes" is grouped with "Grapes". The default value is true
- SimilarityColumnName: a name for the column that shows the similarity between an input value and the representative value for that input. The default value is null, in which case a new column for similarities will not be added
- Threshold: a number between 0.00 and 1.00 that specifies the similarity score at which two values will be grouped. For example, "Grapes" and "Graes" (missing the "p") are grouped together only if this option is set to less than 0.90. A threshold of 1.00 only allows exact matches. (Note that a fuzzy "exact match" might ignore differences like casing, word order and punctuation.) The default value is 0.80.
- TransformationTable: a table that allows grouping records based upon custom value mappings. It should contain "From" and "To" columns. For example, "Grapes" is grouped with "Raisins" if a transformation table is provided with the "From" column containing "Grapes" and the "To" column containing "Raisins". Note that the transformation will be applied to all occurrences of the text in the transformation table. With the above transformation table, "Grapes are sweet" will also be grouped with "Raisins are sweet".
I will begin by accepting the default values.
I have entered the following M code:
= Table.AddFuzzyClusterColumn(#"Changed Type", "Country", "FuzzyCountry")
So far, the results are not impressive. It needs something to go on; I will start by entering the values I am expecting to see. I’ll add these to the original Excel Table. I ‘Close & Load To…’ and make my query ‘Connection Only’ by choosing to ‘Only Create Connection’:
Now, I may add the proper country names:
I go back into my query and refresh:
Now, this is more promising: five [5] of my original rows have been fixed now there are reference values.
Next week, I will continue fixing my first dataset and look at what else I can do with this functionality.
Come back next time for more ways to use Power Query!