This documentation will teach you how to remove duplicate entries from your dataframe.
Using deduplicate to find the highest scoring NBA player on each team
Mito's deduplication feature is a sneakily powerful tool for removing unwanted data. Let's look at how it works, and then look at an example of why it's so powerful.
To use the deduplicate feature:
- 1. Click on the dedupe button in the Mito toolbar.
- 2. Select which record of the duplicated data you want to keep.
- 3. Configure which columns to use when looking for duplicated data. Two rows are considered duplicates of each other if they have the same value in all of the columns that you select in the Columns to Deduplicate On section.
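For readers who think in pandas, the three steps above map roughly onto a single `drop_duplicates` call. This is only a sketch under the assumption that the pandas behavior mirrors the Mito feature; it is not necessarily the code Mito generates, and the dataframe and its column names (`A`, `B`, `C`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 have the same values in columns A and B.
df = pd.DataFrame({
    "A": [1, 2, 1, 3],
    "B": ["x", "y", "x", "z"],
    "C": [10, 20, 30, 40],
})

# subset = the columns to deduplicate on.
# keep   = which record of each group of duplicates survives.
keep_first = df.drop_duplicates(subset=["A", "B"], keep="first")
keep_last = df.drop_duplicates(subset=["A", "B"], keep="last")
keep_none = df.drop_duplicates(subset=["A", "B"], keep=False)

print(keep_last)
```

Note that column `C` is ignored when deciding whether two rows are duplicates, because it is not in `subset`.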
Deduplicate becomes really powerful when we combine it with sorting. In this example, we will use sorting and deduplicating to find the highest scoring NBA player on each team.
The dataset that we're looking at has 3 columns:
- Player -- the name of the player
- Tm -- the team the player is on
- PTS -- the average number of points the player has scored in the 2021-2022 NBA basketball season
Our strategy for figuring out the highest scoring player on each team is to sort the data in ascending order of points scored, and then use the dedupe feature to keep only one player from each team, making sure that we keep the last entry of each duplicated row.
The first step is to sort the PTS column in ascending order. To do so, double click on the filter icon in the PTS column header and click the ascending sort button in the taskpane.
Sorting the data is a crucial part of this analysis because it ensures that the highest scoring player of each team will be further down in the dataset than any other player on their team.
Sort the data in ascending order
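As a rough pandas equivalent of this sorting step, the sketch below uses `sort_values`. The player names, team codes, and point values are hypothetical placeholders, not rows from the real dataset; only the column names (`Player`, `Tm`, `PTS`) come from the description above.

```python
import pandas as pd

# Hypothetical stand-in for the NBA dataset (made-up names and numbers).
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Tm": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "PTS": [12.5, 27.1, 30.3, 8.4, 19.0],
})

# Sort by points in ascending order, so higher scorers end up further down.
sorted_players = players.sort_values(by="PTS", ascending=True)

print(sorted_players)
```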
Since we're trying to find the highest scoring player on each team, our answer should only have one player on each team. So we're going to use the toggles to only look for duplicated values in the Tm column.
Configure which columns to deduplicate on
Let's bring it all together. So far, we've sorted our data in ascending order so that the highest scoring player on each team is at the bottom, and we've told Mito to only look for duplicates in the Tm column. All that is left to do is tell Mito that when it finds duplicates in the Tm column, it should keep the last instance of the duplicated row.
Since the highest scoring player is always going to be lower down in the dataset than any other player on their team, this removal technique will always leave us with the highest scoring player on each team.
Configure which duplicated entries to keep
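Put together, the sort-then-dedupe strategy can be sketched in pandas as follows. This assumes pandas' `sort_values` and `drop_duplicates` behave like the Mito steps described above; the data is hypothetical and only the column names come from this page.

```python
import pandas as pd

# Hypothetical stand-in for the NBA dataset (made-up names and numbers).
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Tm": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "PTS": [12.5, 27.1, 30.3, 8.4, 19.0],
})

top_scorers = (
    players
    # Step 1: ascending sort puts each team's best scorer last within the team.
    .sort_values(by="PTS", ascending=True)
    # Step 2: treat rows with the same Tm as duplicates and keep the last one.
    .drop_duplicates(subset=["Tm"], keep="last")
)

print(top_scorers)
```

If two players on the same team are tied for the top score, only one of them survives the dedupe step.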
A quick sanity check tells us that our analysis is correct!
The highest scoring player on each NBA team
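One way to run such a sanity check in code, rather than by eye, is to compare the result against a per-team maximum computed independently with `groupby`. This is a sketch on hypothetical data, not the check performed in the screenshots above.

```python
import pandas as pd

# Hypothetical stand-in for the NBA dataset (made-up names and numbers).
players = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Tm": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "PTS": [12.5, 27.1, 30.3, 8.4, 19.0],
})

# Result of the sort-then-dedupe approach.
top_scorers = (
    players
    .sort_values(by="PTS", ascending=True)
    .drop_duplicates(subset=["Tm"], keep="last")
)

# Independent computation: the maximum PTS on each team.
expected_max = players.groupby("Tm")["PTS"].max().to_dict()
actual_max = top_scorers.set_index("Tm")["PTS"].to_dict()

# One row per team, and each row carries that team's maximum PTS.
checks_pass = (
    top_scorers["Tm"].is_unique
    and actual_max == expected_max
)
print("Sanity check passed:", checks_pass)
```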