Power Query: Big Comparison
23 April 2020
Welcome to our Power Query blog. This week, I look at comparing data in two large CSV files.
This week, I am looking at some vaccination data available from the WHO: I have two versions of the data, and I want to detect the differences.
data:image/s3,"s3://crabby-images/b1df0/b1df0a0546667b8a73150465cff8f4b143d8ec6c" alt=""
I begin by extracting my first CSV file to Power Query by using ‘From CSV’ on the ‘From File’ section of the ‘New Query’ option in the ‘Get & Transform’ section of the ‘Data’ tab (that’s a mouthful!).
data:image/s3,"s3://crabby-images/daa77/daa77146bc272e4b5c4bde05913b1800083338a9" alt=""
I find the CSV file and import it.
data:image/s3,"s3://crabby-images/f1512/f151255be8a151810769ca7a7e23bd7c25a880cf" alt=""
I select ‘Transform Data’ so that I may select choose to load it.
data:image/s3,"s3://crabby-images/93610/93610d826d1c727958a2d556bb7c4f2d510b0b43" alt=""
My data is initially extracted, with the top row automatically promoted to headers. However, I don’t want to use this row as my headers. Thus, I demote this row.
data:image/s3,"s3://crabby-images/ec0c4/ec0c41690b48b5f006d0776864b6b0f77463fefd" alt=""
I can do this from the Home tab on the ‘Use First Row as Headers’ dropdown. ‘Using Headers as First Row’ demotes the headers accordingly.
data:image/s3,"s3://crabby-images/8d460/8d4603659913e4734fdc27ef738cb5a00b65e2ee" alt=""
I remove the first row using the ‘Remove Top Rows’ option on the ‘Remove Rows’ dropdown.
data:image/s3,"s3://crabby-images/5fa4b/5fa4b4b03b3bdf349ce75ba11ad4661eb7c5b39c" alt=""
I promote the new top row into headers using the ‘Use First Row as Headers’ option.
data:image/s3,"s3://crabby-images/17366/173665448ae809557173706fb314d9cee79db81e" alt=""
I close and load this query as ‘connection only’ since I do not need the overhead of loading it to the workbook.
data:image/s3,"s3://crabby-images/887d0/887d0834cffaff38d5bed3b4f65444a8deb6a472" alt=""
I create a similar query for the ‘vaccinations – Copy’ CSV.
data:image/s3,"s3://crabby-images/210f3/210f3f491d0ed4b509ff690172b75368257cda35" alt=""
I merge my queries using a full outer join so that I have all rows in case any exist in one query but not the other. I use Country to merge my queries.
data:image/s3,"s3://crabby-images/e3df8/e3df861f36e966003dd3675c80e681fb3cb37d8f" alt=""
This gives me all rows of ‘vaccinations’ and a column linking to ‘vaccinations – Copy’.
data:image/s3,"s3://crabby-images/f102a/f102a47421caa780c4585d9654696b524c482067" alt=""
I expand vaccinations – Copy, using the column name as a prefix so I can distinguish between my similar columns.
data:image/s3,"s3://crabby-images/d51ac/d51acda0aeb7d4464aa52abaf8301db3434fe135" alt=""
I need a column which will tell me if any of the year’s totals don’t match, so I add a conditional column from the ‘Add Column’ tab.
data:image/s3,"s3://crabby-images/18ce7/18ce7dc4f029f86feb92df41184b93c0c38beb48" alt=""
In order to see which rows have differences, I filter on Differences to select those rows where the value is ‘Yes’.
data:image/s3,"s3://crabby-images/e0bbf/e0bbf84eefde1e0c0f0036dd4c95d6ede2d695d9" alt=""
Once I have my rows, I reorder the columns to make it easier to see which columns have differences:
data:image/s3,"s3://crabby-images/c2e99/c2e994be02c7093218fe2dd6e5c9d1b8b4d875ec" alt=""
I can now see where the differences occur.
Come back next time for more ways to use Power Query!