In the retail world, there isn’t gold standard when it comes to data and its structure or even an exchange that you can go to for data retrieval. When you’re on the retailer side you may have access to raw data in tabular form – the best form of data for manipulation, but, when you’re on the vendor side you’re at the whim of your retail partner. The structure and form in which your retail partner sends you data is typically what you’ve got to work with. This means you need to be ready to tackle data that comes to you in many different forms.
Today we’re going to walk through a particular scenario that may seem a bit of an annoyance but becomes an easy solve once we’ve got our logic laid out. The scenario I’m referring to is when you have an excel file that has multiple sheets that you need to combine into one succinct data set. You’ll come across this scenario when your retail partner sends you data where the periods, be it weeks or months, are broken out into their own sheets. Today we’ll learn how to handle this type of problem using the Knime Analytics Platform.
Let’s get started!
The data we’re working with today is pictured below. While my data is broken out by months the workflow will still work if your data is broken out into sheets by week.
Now, you could easily say, “Hey Nick, you can manually combine the data in all those tabs in under a minute!” While I do agree with that statement, it is totally feasible to combine all these sheets manually in excel, but that’s because we’re working with only a handful of different sheets. When the number of sheets is high, then it doesn’t make sense to manually combine them when we know Knime can get it done for us with just the drag and drop of a few nodes. Let’s keep going!
First, lets lay out all the nodes we will need to get this task done. Go into your node repository and search for the following nodes:
· Read Excel Sheet Names
· Table row to variable loop start
· Excel Reader
· Loop end
These are the only nodes we’ll need. Line them up as shown below:
First, I’ll walk you through the logic a bit more depth and then afterwards we’ll walk through the actual execution of the workflow.
1. Provide Excel file that has the sheets/tabs we want combined
2. Convert the Excel Sheet names to variables
a. These variables will be used to loop through each one
3. Read in the Excel file that has the sheets we want to combine
a. The caveat here is that we’re going delineate the sheet name we want read in
b. We delineate the sheet name using the sheet name variables we created in the previous step
4. We’re going to hold the data from the sheet we just read in
a. The loop ends when all variables are looped through
That’s the logic that will get us to our end goal. Below is the workflow with notes added in – a touch of professionalism with our notetaking here! The nodes are also connected in the below. Pay attention to where the variable connects, it’s on the top left of the node. The red connection line and different connection point really help distinguish variable data vs regular table data.
Now that we’ve got a clear breakdown of the steps involved we can get started on the actual setup and execution of the workflow.
Step 1: Double click into the Read Excel Sheet Names node so we can configure it. All we’ve got to do here is provide the path and file we’re working with. Since we’re only working with one file, we leave the “File” bubble selected for the Mode option. Below is what my configured node looks like:
The output of this node is the following:
This shows the file name along with each sheet name in the file. The important piece here is the sheet name, as this will be converted into a variable in the next step.
Step 2: Next we’ll configure the Table row to variable loop start node. Double click into the node to configure it. You should see the following menu:
The most important settings here are the red and green boxes at the bottom. The red box is where you should put the columns you don’t want converted into a variable while the green box is where you put the columns you want converted into a variable. For our purposes, we need to make sure the sheet name column is in the green include box because we want the sheet name to be the variable over which we loop. In other words, we want to pull the data for each sheet name in the file, so we need the sheet name to be the variable that drives the loop. Now that we’re configured, we can execute this node. The output should look like the below:
Step 3: Now we’re going to configure the Excel Reader node. In this node, we’re going to provide the File that we want to pull the data from(like we did in the first node of our workflow). After we provide that, the next detail we’ll provide is the variable which contains the tab names that we’re trying to combine. To do so, we need to click on the Flow Variable tab. The flow variable tab should look like the below:
Notice the red arrows I added to the image. If you click the plus sign next to “Settings” that the first arrow is pointing to, you’ll seethe drop down with a “Sheet name” show up. Click on the “Sheet name” drop down that the second arrow is pointing to and then select “sheet” as the value. Your Flow Variable tab should now look like this:
Here, we have the sheet name option of the node set to the sheet name variable, the same variable we created in the previous step! We’re essentially grabbing the data from file X for the specific sheet that is currently set in the loop.
Step 4: The last step is to configure the Loop End node. This setup for this node comes down to preference. Most of the selections available here have no significant effect in the core outcome of what we’re trying to achieve – combine the data from all the sheets in our source excel file.
Now we can execute the full workflow. After the loop finishes running, our output should look like the below:
As you can see, we’ve successfully combined all the data from the varying sheets within our excel file!! That’s how easy it is to loop through all the sheets within an excel file.
Like I mentioned previously, you might find yourself in this scenario when you’re not directly connected to a data source – when there’s an intermediary that send over the data in a semi consistent format. Data might also be structured like this when you’re grabbing data that isn’t necessarily a product of the business, for example when grabbing industry data that is compiled by anon-data person. No need to worry though, you’re now prepared to tackle such cases!
If you have any questions or need additional help, don’t hesitate to reach to me through my Twitter DMs @Cest_Nick ! Otherwise, I’ll catch you on the next post!