Mito
Mito for Streamlit
  • Mito Documentation
  • Getting Started
    • Installing Mito
      • Fixing Common Install Errors
      • Installing Mito in a Docker Container
      • Installing Mito for Streamlit
      • Installing Mito for Dash
      • Installing Mito in a Jupyter Notebook Directly
      • Installing Mito in Vertex AI
      • Setting Up a Virtual Environment
  • Data Copilot
    • Data Copilot Core Concepts
    • Agent
    • Chat
    • Autocomplete
    • Smart Debugging
    • Configuration Options
    • Database Connectors
    • AI Data Usage FAQ
  • Apps (Beta)
    • Mito Apps
  • Mito Spreadsheet
    • Core Concepts
    • Creating a Mitosheet
      • Open Existing Virtual Environments
    • Importing Data
      • Importing CSV Files
      • Importing from Excel Files
      • Importing Dataframes
      • Importing from a remote drive
      • Import: Generated UI from any Python Function
      • Importing from other sources
    • Graphing
      • Graph Creation
      • Graph Styling
      • Graph Export
    • Pivoting/Group By
    • Filter
      • Filter By Condition
      • Filter By Value
    • Mito AI
    • Summary Statistics
    • Type Changes
    • Spreadsheet Formulas
      • Custom Spreadsheet Functions
      • Formula Reference
      • Using VLOOKUP
    • Editing Individual Cells
    • Combining Dataframes
      • Merge (horizontal)
      • Concatenate (horizontal)
      • Anti-merge (unique)
    • Sort Data
    • Split Text to Columns
    • Deleting Columns
    • Deleting Rows
    • Column Headers
      • Editing Column Headers
      • Promote Row to Header
    • Deduplicate
    • Fill NaN Values
    • Transpose
    • Reset Index
    • Unpivot a Dataframe (Melt)
    • Formatting
      • Column Formatting
      • Dataframe Colors
      • Conditional Formatting
    • Exporting Data
      • Download as CSV
      • Download as Excel
      • Generate code to create Excel and CSV reports
    • Using the Generated Code
      • Turn generated code into functions
    • Changing Imported Data
    • Code Snippets
    • Custom Editors: Autogenerate UI from Any Function
    • Find and Replace
    • Bulk column header edits
    • Code Options
    • Scheduling your Automation
    • Keyboard Shortcuts
    • Upgrading Mito
    • Enterprise Logging
  • Mito for Streamlit
    • Getting Started with Mito for Streamlit
    • Streamlit Overview
    • Create a Mito for Streamlit App
    • API Reference
      • Understanding import_folder
      • RunnableAnalysis class
      • Column Definitions
    • Streamlit App Gallery
    • Experienced Streamlit Users
    • Common Design Patterns
      • Deploying Mito for Streamlit in a Docker Image
      • Using Mito for Final Mile Data Cleaning
  • Mito for Dash
    • Getting Started
    • Dash Overview
    • Your First Dash App with Mito
    • Mito vs. Other Dash Components
    • API Reference
      • Understanding import_folder
    • Dash App Gallery
    • Common Design Patterns
      • Refresh Sheet Data Periodically
      • Change Sheet Data from a Select
      • Filter Other Elements to Data Selected in Mito
      • Graph New Data after Edits to Mito
      • Set Mito Spreadsheet Theme
  • Tutorials
    • Pass a dataframe into Mito
    • Create a line chart of time series data
    • Delete Columns with Missing Values
    • Split a column on delimiter
    • Rerun analysis on new data
    • Calculate the difference between rows
    • Calculate each cell's percent total of column
    • Import multiple tables from one Excel sheet
    • Share Mito Spreadsheets Across Users
  • Misc
    • Release Notes
      • May 28 - Just a Query Away
      • April 15 - Now Streaming (0.1.18)
      • March 21 - Smarter, Faster, Stronger Agents
      • February 25 - Agent Mode QoL Improvements
      • February 18 - Mito Agents
      • January 2nd - Inline Completions Arrive
      • December 6th - Smarter Workflow
      • November 27th - @ Mentions, Mito AI Server
      • November 4th, 2024 - Hello Mito AI
      • October 8, 2024 - JupyterLab 4
      • Aug 29th, 2024
      • June 12, 2024
      • March 19, 2024
      • March 13th, 2024
      • February 12th, 2024: Graphing Improvements
      • January 25th, 2024
      • January 5th, 2023: Keyboard Shortcuts
      • December 6, 2023: New Context Menu
      • November 28, 2023: Mito's New Toolbar
      • November 7, 2023: Multiplayer Dash
      • October 23, 2023: RunnableAnalysis class
      • October 16, 2023: Mito for Dash, Custom Editors
      • September 29, 2023: VLOOKUP and Find and Replace!
      • September 7, 2023
      • August 2, 2023: Mito for Streamlit!
      • July 10, 2023
      • May 31, 2023: Mito AI Recon
      • May 19, 2023: Mito AI Chat!
      • April 27, 2023: Generate Functions, Performance improvements, bulk column header transformations
      • April 18, 2023: Cell Editor Improvements, BYO Large Language Model, and more
      • April 10, 2023: AI Access, Excel-like Cell Editor, Performance Improvements
      • April 5, 2023: Range formulas, Pandas 2.0, Snowflake Views
      • March 29, 2023: Excel Range Import Improvements
      • March 14, 2023: Mito AI, Public Interface Versioning
      • February 28, 2023: In-place Pivot Errors
      • February 7, 2023: Excel-like Formulas, Snowflake Import
      • January 23, 2023: Excel range importing
      • January 8, 2023: Custom Code snippets
      • December 26, 2022: Code snippets and bug fixes
      • December 12, 2022: Group Dates in Pivot Tables, Reduced Dependencies
      • November 15, 2022: Filter in Pivot
      • November 9, 2022: Import and Enterprise Config
      • October 31, 2022: Replay Analysis Improvements
      • Old Release Notes
      • August 10, 2023: Export Formatting to Excel
    • Mito Enterprise Features
    • FAQ
    • Terms of Service
    • Privacy Policy
  • Mito
Powered by GitBook

© Mito

On this page

Was this helpful?

  1. Mito Spreadsheet

Deduplicate

Deduplicate repeated entries from your dataframe.

PreviousPromote Row to HeaderNextFill NaN Values

Last updated 1 year ago

Was this helpful?

Mito's deduplication feature is a sneakily powerful tool for removing unwanted data. Let's look at how it works, and then look at an example of why its so powerful.

To use the deduplicate feature:

  1. Select the Data tab in the toolbar.

  2. Click on the Remove Duplicates button.

  3. Select which record of the duplicated data you want to keep: first, last or none

  4. Configure which columns to use for looking for duplicated data. Two rows are considered duplicates of eachother if they have the same value in all of the columns that you select in the Columns to Deduplicate On section.

Example: Using deduplicate and sort together

The dataset that we're looking at has 3 columns:

  • Player -- the name of the player

  • Tm -- the team the player is on

  • PTS - the average number of points the player has scored in the 2021-2022 NBA basketball season

Our strategy for figuring out the highest scoring player on each team is to sort the data in ascending order of points scored, and then use the dedupe feature to keep only one player from each team, making sure that we keep the last entry of each duplicated row.

Sorting the data

The first step is to sort the PTS column in ascending order. To do so, double click on the filter icon in the PTS column header and click the ascending sort button in the taskpane.

Sorting the data is a crucial part of this analysis because it ensures that the highest scoring player of each team will be further down in the dataset than any other player on their team.

Select the columns used for finding duplicated values

Since we're trying to find the highest scoring player on each team, our answer should only have one player on each team. So we're going to use the toggles to only look for duplicated values in the Tm column.

Keep the last instance of duplicated data

Let's bring it all together. So far, we've sorted our data in ascending order so that the highest scoring player on each team is at the bottom, and we've told Mito to only look for duplicates in the Tm column. So all that is left to do is tell Mito that when you find duplicates in the Tm column, keep the last instance of the duplicated row.

Since the highest scoring player is always going to be lower down in the data set than any other player on his team, this removal technique will always leave us with the highest scoring player on each team.

Checkout the results!

A quick sanity check tells us that our analysis is correct!

Deduplicate becomes really powerful when we combine it with . In this example, we will use sorting and deduplicating to find the highest scoring NBA player on each team.

sorting
Configure which columns to deduplicate on
Configure which duplicated entries to keep
The highest scoring player on each NBA team