Hey data friends! Today we’re going to learn how to aggregate data in Knime using the Groupby Node. While the Knime’s Groupby node offers the functionality to perform a wide array of aggregations, today’s focus will be on using the sum aggregation, to sum up, data from a lower level of detail to a higher level of detail. I‘ll write up a separate post in the future on the different aggregation types offered by the Groupby node, so be on the lookout for that!
Let’s get started!
The dataset we’ll be working with is pictured below:
This dataset has the weekly pay detail for different employees by department. Our goal will be to aggregate each employee’s pay from a weekly level view to a monthly level view.
Finding the Groupby Node
Go to the node repository and search for “Groupby”. The Groupby node looks like this:
Drag and drop the node onto your Knime workspace and connect it to the node that has the data you’re working with.
The next step is to configure the Groupby Node, but first, let’s walk through the different tabs the configuration menu offers. Double-click the node to access the configuration menu. The menu should look like this:
The first tab is the Groups tab. This tab is where we’ll delineate the level that we want to aggregate the data by. You’ll notice the red and green squares. The red square shows all the columns available in the dataset while the green square is currently blank. We’ll use the green box to delineate the level that we want our data aggregated at.
So, for our example, we need to move over the Employees Name, Year, Month, and Department columns from the red square to the green square. We can level the Week column in the red square since we want to aggregate up from the weekly level to the monthly level.
We should only include dimensions of the new level we’re aggregating up to in the green square. We are leaving the measures (aka numerical values) in the red box as we will want to use these in the Aggregation tabs. Once a column is included in the groups tab, it will no longer be available for use in the other aggregation tabs.
There are 3 tabs we can use to perform aggregations on our data in Knime. The 3 tabs share the same functionality – we can aggregate/sum our data or find min, max, average, etc. The primary difference between the 3 tabs is how we select the columns that we want to perform our aggregations on.
• The Manual Aggregation tab allows us to manually select the tabs that we want to perform aggregations on
o We only want to aggregate the Weekly Pay column, we can specifically select it in this tab
• The Pattern Based Aggregation tab allows us to dynamically select the columns using Regex syntax
o For example, select all columns that match a particular string pattern in the column name
• The Type Based Aggregation allows us to select all columns that match specific data types
o For example, select all integers and perform the designated calculation
For our example, we will use the Manual Aggregation tab since it’s much easier to select the one column we’re working with than it is to try and use the other two methods.
The last part of the configuration menu is the advanced settings window that appears in the bottom third of the configuration menu. The configurations in this menu are optional, we don’t need to change them for our node to execute correctly. My favorite option in this menu is the Column Naming setting. With this option, we can delineate whether we want our output column names modified to include the aggregation in the column name (2 options for this) or whether we want them to remain the same and untouched (1 option for this).
I love that Knime gives us the flexibility to keep our output column names unchanged when we’re aggregating data. One hard con in Excel and using Pivot tables is that the column names of the pivot tables are modified to include the aggregation in the output column names.
I might be a rookie b/c maybe there is a way to toggle that option in Excel’s pivot tables, but if the option is available it isn’t easy to find because I never found it! In my opinion, the column names with the aggregations just look so tacky and unprofessional and “unclean”. I always use adjust this setting to “keep original name(s)”. The only caveat is when you are doing multiple aggregations to the same column. In those cases you’re no longer able to “keep original name(s)”, and rightfully so, you truly need distinction in the column names in those cases.
The value delimited is used when aggregating strings to list, but that’s another topic for another post. The rest of the options are rather self-explanatory.
That wraps us the walk-through of the Groupby Node’s configuration menu. Let’s knock out the original goal we set, let’s sum the weekly page column to a monthly level!
For our example, we are aggregating from a weekly level to a monthly level, so we need to include the following columns in the group columns window:
Now in the Manual Aggregation tab, we need to double-click on the weekly pay column so that it jumps into the aggregation window like so:
Now, we can click on the word “Mean” that’s currently set under the Aggregation header.
Once we click, we’ll see a list of different aggregations available. The list of aggregations is sorted alphabetically so we’ll need to scroll down until we find the “Sum” option. Clicking on the “Sum” option will let Knime know that we want to sum the Weekly Pay.
Of course, I adjust the column naming option to keep the original column names. Finally, we can click on “Apply”,” Ok”, and then execute the node by hitting F7. Our output should look like this:
We’ve successfully aggregated our data in Knime using the Groupby Node!
As always, if you have any questions or need anything clarified, don’t hesitate to reach out through Twitter DMs @Cest_Nick; otherwise, I’ll catch you on the next post!