Power Query: Fuzzy Fixing Part 2
27 November 2024
Welcome to our Power Query blog. This week, I continue to fix data that is almost correct!
My imaginary salespeople have been working abroad; I have a series of transactions that have been rejected because the country name has not been recognised.
Last time, I converted my data into a Table, and added rows showing what the names should look like:
I used the M function Table.AddFuzzyClusterColumn() in its default form to create a new column.
Five [5] of my original rows have been fixed using the reference values. I would like to see how I need to use Table.AddFuzzyClusterColumn() to fix the rest of the data. As a reminder, the syntax is:
Table.AddFuzzyClusterColumn(table as table, columnname as text, newcolumnname as text, optional options as nullable record) as table
To solve my current challenge, I am interested in the options I may use, which are extensive and include:
- Culture: allows grouping records based upon culture-specific rules. It can be any valid culture name. For example, a Culture option of "ja-JP" groups records based on the Japanese culture. The default value is "", which groups based on the Invariant English culture
- IgnoreCase: a logical (true / false) value that allows case-insensitive key grouping. For example, when true, "Grapes" is grouped with "grapes". The default value is true
- IgnoreSpace: a logical (true / false) value that allows combining of text parts to find groups. For example, when true, "Gra pes" is grouped with "Grapes". The default value is true
- SimilarityColumnName: a name for the column that shows the similarity between an input value and the representative value for that input. The default value is null, in which case a new column for similarities will not be added
- Threshold: a number between 0.00 and 1.00 that specifies the similarity score at which two values will be grouped. For example, "Grapes" and "Graes" (missing the "p") are grouped together only if this option is set to less than 0.90. A threshold of 1.00 only allows exact matches. (Note that a fuzzy "exact match" might ignore differences like casing, word order and punctuation.) The default value is 0.80
- TransformationTable: a table that allows grouping records based upon custom value mappings. It should contain "From" and "To" columns. For example, "Grapes" is grouped with "Raisins" if a transformation table is provided with the "From" column containing "Grapes" and the "To" column containing "Raisins". Note that the transformation will be applied to all occurrences of the text in the transformation table. With the above transformation table, "Grapes are sweet" will also be grouped with "Raisins are sweet".
Let’s take a closer look at how the fixed rows have been changed. There is an option to show the SimilarityColumnName, which I add to the M code for this step. I change the code from:
= Table.AddFuzzyClusterColumn(#"Changed Type", "Country", "FuzzyCountry")
to
= Table.AddFuzzyClusterColumn(#"Changed Type", "Country", "FuzzyCountry", [SimilarityColumnName = “Similarity”])
This will give me another column, Similarity, which tells me how similar Country is to FuzzyCountry:
Where it has corrected the case, the similarity is one [1], as it is essentially the same word. The misspelled words have a similarity of 0.95. I will look at how this varies later in the series, here, I will consider the names that have not been fixed. Power Query has not tried to change ‘La Suisse’ or ‘Holland’. It has not recognised there is a link to the correct name, therefore, for these I need to create a translation table. This needs to have a specific format: two columns where the first column is called ‘From’ and contains the values to be translated, and the second column is called ‘To’ and contains the correct values. I can create this in the Power Query Editor, by selecting ‘Enter Data’ in the ‘New Query’ section of the Home tab:
In the dialog, I enter the two columns and populate two rows:
I call the table CountryTranslation, and click OK:
This gives me a new query, which I will link to in my fuzzy fixing query:
The M code I have used is:
= Table.AddFuzzyClusterColumn(#"Changed Type", "Country", "FuzzyCountry",[SimilarityColumnName = "Similarity", TransformationTable = CountryTranslation])
This has fixed the data is my simple example. Next time, I will look at the options in more detail.
Come back next time for more ways to use Power Query!