Deduplicate
Deduplicate repeated entries from your dataframe.
© Mito
Mito's deduplication feature is a sneakily powerful tool for removing unwanted data. Let's look at how it works, and then look at an example of why it's so powerful.
To use the deduplicate feature:
Select the Data tab in the toolbar.
Click on the Remove Duplicates button.
Select which record of the duplicated data you want to keep: first, last, or none.
Configure which columns to use when looking for duplicated data. Two rows are considered duplicates of each other if they have the same value in all of the columns that you select in the Columns to Deduplicate On section.
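For readers who think in pandas, the three keep options map naturally onto the `keep` argument of `DataFrame.drop_duplicates`, and the selected columns onto its `subset` argument. The sketch below uses a small made-up dataframe (the column names and values are hypothetical), and the code Mito generates for this step may differ in its details.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cara"],
    "city": ["NYC", "NYC", "LA", "SF"],
    "visits": [1, 2, 3, 4],
})

# Columns to deduplicate on: rows count as duplicates only if they
# match in every one of these columns.
subset = ["name", "city"]

# "first": keep the first row of each group of duplicates.
keep_first = df.drop_duplicates(subset=subset, keep="first")

# "last": keep the last row of each group of duplicates.
keep_last = df.drop_duplicates(subset=subset, keep="last")

# "none": drop every row that has a duplicate.
keep_none = df.drop_duplicates(subset=subset, keep=False)
```

Here the two Ann rows are duplicates of each other, so first keeps the row with 1 visit, last keeps the row with 2 visits, and none drops both.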
Deduplicate becomes really powerful when we combine it with sorting. In this example, we will use sorting and deduplicating to find the highest scoring NBA player on each team.
The dataset that we're looking at has 3 columns:
Player -- the name of the player
Tm -- the team the player is on
PTS -- the average number of points the player has scored in the 2021-2022 NBA season
Our strategy for finding the highest scoring player on each team is to sort the data in ascending order of points scored, and then use the deduplicate feature to keep only one player from each team, making sure that we keep the last entry for each team.
The first step is to sort the PTS column in ascending order. To do so, double click on the filter icon in the PTS column header and click the ascending sort button in the taskpane.
Sorting the data is a crucial part of this analysis because it ensures that the highest scoring player of each team will be further down in the dataset than any other player on their team.
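To see why the sort matters, here is a minimal pandas sketch of the same step. The players, team codes, and scores are invented placeholders, not values from the dataset in this guide.

```python
import pandas as pd

# Hypothetical players, teams, and scores for illustration only.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C",
               "Player D", "Player E", "Player F"],
    "Tm": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "PTS": [12.5, 27.1, 8.3, 19.9, 22.4, 5.0],
})

# Sort by points, lowest first, so higher scorers end up further down.
sorted_df = df.sort_values(by="PTS", ascending=True)

# After sorting, the last row for each team is that team's top scorer.
last_per_team = sorted_df.groupby("Tm").tail(1)
```

In this made-up data, the last row for each team after sorting is Player B for AAA, Player D for BBB, and Player E for CCC.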
Since we're trying to find the highest scoring player on each team, our answer should have only one player per team. So we're going to use the toggles to look for duplicated values only in the Tm column.
Let's bring it all together. So far, we've sorted our data in ascending order so that the highest scoring player on each team is at the bottom, and we've told Mito to only look for duplicates in the Tm column. All that is left to do is tell Mito that when it finds duplicates in the Tm column, it should keep the last instance of the duplicated row.
Since the highest scoring player is always going to be lower down in the dataset than any other player on their team, this removal technique will always leave us with the highest scoring player on each team.
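The whole workflow can be sketched in pandas as a sort followed by a deduplication that keeps the last row. The helper name `top_scorer_per_team` and the sample data are hypothetical, and this is only an approximation of what the two Mito steps do.

```python
import pandas as pd

def top_scorer_per_team(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per team: the player with the highest PTS."""
    # Step 1: sort ascending so each team's top scorer comes last.
    sorted_df = df.sort_values(by="PTS", ascending=True)
    # Step 2: deduplicate on Tm only, keeping the last row per team.
    return sorted_df.drop_duplicates(subset=["Tm"], keep="last")

# Hypothetical players, teams, and scores for illustration only.
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C",
               "Player D", "Player E", "Player F"],
    "Tm": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "PTS": [12.5, 27.1, 8.3, 19.9, 22.4, 5.0],
})

top = top_scorer_per_team(players)
```

If two players on the same team are tied for the highest score, only one of them survives the deduplication, so ties are worth checking separately.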
A quick sanity check tells us that our analysis is correct!
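One possible way to run such a check in code, assuming you are working in pandas, is to compute each team's maximum PTS independently with `groupby` and compare it with the deduplicated result. The data below is made up, and this is just one sketch of a sanity check, not necessarily the one used for this guide.

```python
import pandas as pd

# Hypothetical players, teams, and scores for illustration only.
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C",
               "Player D", "Player E", "Player F"],
    "Tm": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "PTS": [12.5, 27.1, 8.3, 19.9, 22.4, 5.0],
})

# Result of the sort-then-deduplicate approach.
top = players.sort_values(by="PTS").drop_duplicates(subset=["Tm"], keep="last")

# Independent check: the highest PTS value per team, computed with groupby.
max_pts = players.groupby("Tm")["PTS"].max()

# Each team should appear exactly once, with its maximum PTS value.
is_correct = (
    top["Tm"].is_unique
    and top.set_index("Tm")["PTS"].to_dict() == max_pts.to_dict()
)
```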