How to Drop Columns in Pandas: A Beginner's Guide

Hey there! As someone diving into the world of data analysis with Python, you may have heard about how Pandas is one of the best libraries around for manipulating tabular data.

However, when first importing datasets into Pandas DataFrames, we often get absolutely massive tables with tons of unnecessary columns. And sifting through all that repetitive data makes it really hard to analyze what's actually important!

So today, I'm going to walk you through different techniques for dropping columns in Pandas. We'll go way beyond just explaining the basic code… I'll share plenty of examples and visuals to really help cement these concepts.

Here's what we'll cover:

A Quick History of Pandas (so you understand why it's so popular!)
Using the drop() Method to Remove Columns
Deleting Columns by Index Number
Leveraging iloc[] and loc[] Slicing Syntax
Dropping Columns with the del Statement
Returning Deleted Columns by "Popping" Them
Trimming Down Imported CSV Data
Visual Examples and Code Samples Galore!

Let's get started!

A Brief History of Pandas

Since we'll be working with the Pandas library throughout this tutorial, I wanted to give you some quick context on where Pandas comes from and why data analysts love using it.

Wes McKinney started building Pandas in 2008 while working at the investment firm AQR Capital Management, because he wanted a fast, flexible tool for data analysis in Python. Wes had a background in finance, so he was mainly interested in working with financial datasets containing time series data.

To scratch his own itch for efficient data wrangling, he built Pandas around two core data structures:

DataFrames: For storing two-dimensional tabular data (like spreadsheets or SQL tables)
Series: For storing a single labeled column of data (including time series) and performing quick math

After Pandas was open sourced in 2009, the project absolutely exploded in popularity for a few reasons:

Developers loved Pandas for its simplicity. DataFrames made working with complex nested data dead simple.
It was more computationally efficient than existing solutions, meaning it could crunch datasets faster.
The timing was perfect as interest in data science and Python was surging globally.

Basically, Pandas combined easy syntax with blazing speed…allowing data analysts to work faster and more productively than ever.

Wes has since moved on to other projects in the data ecosystem, such as Apache Arrow, and Pandas is now maintained by a large community of open source contributors worldwide!

So with this rich history in mind, let's jump in and start slicing unnecessary columns off our DataFrames!

Getting Started: Importing Pandas and Creating a Sample Dataset

For all examples today, we'll work inside JupyterLab. To start, let's import Pandas and create a sample DataFrame:

import pandas as pd

data = {'Apples': [30, 21, 55],
        'Oranges': [24, 40, 28],
        'Bananas': [20, 35, 41],
        'Grapes': [35, 29, 40]}

purchases = pd.DataFrame(data)
print(purchases)

This outputs our starter DataFrame:

	Apples	Oranges	Bananas	Grapes
0	30	24	20	35
1	21	40	35	29
2	55	28	41	40

As you can see, we have a table showing fruit purchases broken down by four fruit categories (columns) and three regions (rows).

Great – now let's start practicing removing columns we don't need!

Method #1: Using .drop() to Remove Columns by Name

The simplest way to drop columns is using Pandas' built-in .drop() method.

.drop() works by specifying the name of the column(s) you want to remove. Let's remove Oranges and Grapes:

# Drop by column name
purchases.drop(['Oranges', 'Grapes'], axis=1, inplace=True)

print(purchases)

	Apples	Bananas
0	30	20
1	21	35
2	55	41

And we're down to just Apples and Bananas! Notice a few things:

We passed in a list ['Oranges', 'Grapes'] containing all column names to remove
We set axis=1 to target columns instead of rows
We used inplace=True to modify the DataFrame directly instead of reassigning the returned result

You can drop any columns easily this way by just changing the list of names.
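If you would rather keep the original DataFrame untouched, a minimal sketch of the non-inplace form might look like the following. It assumes the same starter data as above; the variable name trimmed is just an illustrative choice, and columns=... is used as the keyword equivalent of passing the labels with axis=1.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [30, 21, 55],
    'Oranges': [24, 40, 28],
    'Bananas': [20, 35, 41],
    'Grapes': [35, 29, 40],
})

# Without inplace=True, .drop() returns a new DataFrame
# and leaves the original one alone.
trimmed = purchases.drop(columns=['Oranges', 'Grapes'])

print(list(trimmed.columns))    # ['Apples', 'Bananas']
print(list(purchases.columns))  # ['Apples', 'Oranges', 'Bananas', 'Grapes']
```

Reassigning the result like this makes it easy to compare the before and after versions while you are still exploring the data.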

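Pro Tip: .drop() expects column labels, not integer positions. If you prefer integers over names, look up the labels by position first with purchases.columns[...] and pass the result to .drop()!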
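As a rough sketch of dropping several columns by position at once, one option is to index purchases.columns with a list of integers to get the matching labels, then hand those labels to .drop(). The positions list and the trimmed name below are illustrative assumptions, and the DataFrame is rebuilt so the example runs on its own.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [30, 21, 55],
    'Oranges': [24, 40, 28],
    'Bananas': [20, 35, 41],
    'Grapes': [35, 29, 40],
})

positions = [1, 3]                     # 2nd and 4th columns
labels = purchases.columns[positions]  # look up their labels

trimmed = purchases.drop(columns=labels)

print(list(labels))           # ['Oranges', 'Grapes']
print(list(trimmed.columns))  # ['Apples', 'Bananas']
```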

Method #2: Removing Columns by Index Number

Speaking of indexes, what if we don't know the column names?

Pandas lets us look up a column by its position instead:

# Drop the column at position 1 (Bananas, since only Apples and Bananas remain)
purchases.drop(purchases.columns[1], axis=1, inplace=True)

print(purchases)

	Apples
0	30
1	21
2	55

Here's how it works:

purchases.columns gives us the Index of column labels
We pass [1] to grab just the 2nd label (Bananas)
This returns the label for that column, not a number
.drop removes the column with that label

So if you only know position and not column names, this method is quite handy!
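To see each step separately, here is a small sketch on a fresh copy of the four-column starter data. The names second, last, and trimmed are hypothetical, and the negative position is just ordinary Python-style indexing applied to the column Index.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [30, 21, 55],
    'Oranges': [24, 40, 28],
    'Bananas': [20, 35, 41],
    'Grapes': [35, 29, 40],
})

second = purchases.columns[1]   # label at position 1
last = purchases.columns[-1]    # negative positions count from the end

print(second)  # Oranges
print(last)    # Grapes

# The looked-up labels are plain strings, so .drop() treats them like names
trimmed = purchases.drop(columns=[second, last])
print(list(trimmed.columns))  # ['Apples', 'Bananas']
```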

Method #3: Slicing Ranges of Columns with .iloc[]

Manually looking up indexes can get tedious. Wouldn't it be nice to slice ranges of columns instead?

That's where .iloc[] comes in handy. .iloc allows slicing DataFrames by integer position.

Let's recreate our purchases DataFrame then try out .iloc[]:

import pandas as pd

data = {'Apples': [35, 41, 28],
        'Oranges': [30, 22, 40],
        'Bananas': [24, 20, 55],
        'Grapes': [40, 35, 21]}

purchases = pd.DataFrame(data)
print(purchases)

# Slice the columns at positions 1 and 2 (position 3 is excluded), then drop their labels
purchases.drop(purchases.iloc[:, 1:3].columns, axis=1, inplace=True)
print(purchases)

	Apples	Oranges	Bananas	Grapes
0	35	30	24	40
1	41	22	20	35
2	28	40	55	21

	Apples	Grapes
0	35	40
1	41	35
2	28	21

See how we cleanly removed the Oranges and Bananas columns?

The syntax is:

iloc[:, 1:3] selects every row (the bare :) and the columns at positions 1 up to 3
Position 3 is excluded though, so only positions 1 and 2 (Oranges and Bananas) get removed

Being able to slice like this keeps our code DRY instead of having to manually specify each column!
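Here is a hedged sketch of two related ways to get the same result, assuming the four-column DataFrame from this section. The first slices the column Index directly instead of slicing the whole DataFrame; the second flips the idea around and uses .iloc to keep the positions you want. The names middle, trimmed, and kept are illustrative.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [35, 41, 28],
    'Oranges': [30, 22, 40],
    'Bananas': [24, 20, 55],
    'Grapes': [40, 35, 21],
})

# Positions 1 and 2; the end position (3) is not included
middle = purchases.columns[1:3]
print(list(middle))  # ['Oranges', 'Bananas']

# Option 1: drop the sliced labels
trimmed = purchases.drop(columns=middle)

# Option 2: keep the positions you want instead of dropping the rest
kept = purchases.iloc[:, [0, 3]]

print(list(trimmed.columns))  # ['Apples', 'Grapes']
print(kept.equals(trimmed))   # True
```

Which option reads better usually depends on whether the list of columns to keep or the list to drop is shorter.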

Method #4: Leveraging .loc[] to Drop by Name

Similar to .iloc, .loc[] accomplishes the same type of slicing…but using labels instead of integer positions.

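Starting again from a freshly recreated four-column purchases DataFrame (rerun the setup code from Method #3 first, since Bananas was already dropped there):

purchases.drop(purchases.loc[:, 'Apples':'Bananas'].columns, axis=1, inplace=True)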
print(purchases)

	Grapes
0	40
1	35
2	21

Here's the breakdown:

loc[:, 'Apples':'Bananas'] slices from Apples through Bananas based on their names (unlike .iloc, the end label is included)
We tack on .columns to grab the labels of the sliced columns
.drop removes those columns, leaving us with just Grapes

So .loc[] gives you a bit more flexibility to slice based on labels/names instead of positions.
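One detail worth seeing in code is how the two slicing styles treat the end of the range. The sketch below assumes the four-column DataFrame from Method #3; the names by_label, by_position, and trimmed are illustrative.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [35, 41, 28],
    'Oranges': [30, 22, 40],
    'Bananas': [24, 20, 55],
    'Grapes': [40, 35, 21],
})

# Label slices include BOTH endpoints
by_label = purchases.loc[:, 'Apples':'Bananas']
print(list(by_label.columns))     # ['Apples', 'Oranges', 'Bananas']

# Position slices exclude the end position
by_position = purchases.iloc[:, 0:2]
print(list(by_position.columns))  # ['Apples', 'Oranges']

trimmed = purchases.drop(columns=by_label.columns)
print(list(trimmed.columns))      # ['Grapes']
```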

Method #5: Using del to Remove Columns by Name

Now let me show you a simpler way than .drop() to remove a single column by name in Python…the del statement!

del just takes a column name and deletes that column:

import pandas as pd
from IPython.display import display

data = {'Apples': [35, 41],
        'Oranges': [40, 33],
        'Bananas': [24, 40]}

purchases = pd.DataFrame(data)

print('Original DataFrame:')
display(purchases)

del purchases['Apples']

print('After Deleting Apples:')
display(purchases)

Original DataFrame:

	Apples	Oranges	Bananas
0	35	40	24
1	41	33	40

After Deleting Apples:

	Oranges	Bananas
0	40	24
1	33	40

The major difference versus .drop()?

del always modifies the original DataFrame in place, one column at a time, whereas .drop() returns a modified copy unless you pass inplace=True. So if you definitely want to delete a column from the original DataFrame, del gets the job done easily.

We will miss you Apples column!
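To make that contrast concrete, here is a small sketch using the same three-column data. The name without_apples is a hypothetical choice for the copy that .drop() hands back.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [35, 41],
    'Oranges': [40, 33],
    'Bananas': [24, 40],
})

# .drop() without inplace=True gives back a new DataFrame
without_apples = purchases.drop(columns='Apples')
print('Apples' in purchases.columns)       # True, original is unchanged
print('Apples' in without_apples.columns)  # False

# del changes the original DataFrame and returns nothing
del purchases['Apples']
print(list(purchases.columns))  # ['Oranges', 'Bananas']
```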

Method #6: Using .pop() to Remove and Return Columns

What if we want to remove a column…but think we might need it again later?

No worries, we can leverage .pop() to remove and capture columns.

Check it:

oranges = purchases.pop('Oranges')

print('Oranges Column:')
print(oranges)

print('Remaining DataFrame:')
display(purchases)

Oranges Column:

0    40
1    33 
Name: Oranges, dtype: int64

Remaining DataFrame:

	Bananas
0	24
1	40

Instead of just deleting the Oranges column, .pop() returns it as a Pandas Series.

We can store that Series and add it back later if needed:

purchases['Oranges'] = oranges  # Add column back
print(purchases)

	Bananas	Oranges
0	24	40
1	40	33

So when you want to just "set aside" columns instead of permanently axing them, .pop() is the way to go!
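Notice that assigning the popped Series back puts it at the end of the DataFrame. If the original column order matters, one possible approach (a sketch, not the only way) is to remember the column's position before popping and restore it with DataFrame.insert(). The variable name position is an illustrative assumption.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Apples': [35, 41],
    'Oranges': [40, 33],
    'Bananas': [24, 40],
})

# Remember where the column lives before removing it
position = purchases.columns.get_loc('Oranges')  # 1
oranges = purchases.pop('Oranges')
print(list(purchases.columns))  # ['Apples', 'Bananas']

# Put the column back at its original position
purchases.insert(position, 'Oranges', oranges)
print(list(purchases.columns))  # ['Apples', 'Oranges', 'Bananas']
```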

Method #7: Dropping Columns after Importing CSV Data

All the examples so far started with a Pandas DataFrame already created in Python. But what about dropping columns after loading an external dataset?

Pandas makes it just as easy to narrow down imported CSV data with .drop().

Let me show you with an example retail sales dataset:

import pandas as pd

sales = pd.read_csv('sales_data.csv')

print(sales.head())
print(f'Shape: {sales.shape}')

	Date	Channel	Product	Units Sold	Revenue
0	05/02/2022	Online	Keyboard	203	$19,685
1	05/15/2022	Retail	Mouse	94	$8,500
2	05/19/2022	Online	Monitor	79	$35,245
3	06/01/2022	Retail	Desk	173	$77,865
4	06/12/2022	Online	Chair	456	$92,304

Shape: (2000, 5)

Okay, we imported this sales data containing:

2000 rows
5 columns

Let's say we only care about Units Sold, Revenue, and Channel…so we want to drop Date and Product:

reduced_cols = ['Units Sold', 'Revenue', 'Channel']

sales.drop(columns=sales.columns.difference(reduced_cols), inplace=True)

print(sales.head())
print(f'Shape: {sales.shape}')

	Channel	Units Sold	Revenue
0	Online	203	$19,685
1	Retail	94	$8,500
2	Online	79	$35,245
3	Retail	173	$77,865
4	Online	456	$92,304

Shape: (2000, 3)

By storing the columns we want to keep in reduced_cols, we can reference the remaining columns and drop them in a single line with .difference()!

And now we have a much more focused view on the sales metrics we care about, all without altering the original raw CSV data. Pretty neat right?

This should help give you a good starter approach to narrowing down datasets imported from external sources!
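Since sales_data.csv is not included here, the sketch below stands in a tiny in-memory CSV built from the first two rows shown above, so it can run on its own. It repeats the .difference() approach and then shows an alternative: asking read_csv to load only the wanted columns through its usecols parameter. The names csv_text, to_drop, and slim are illustrative assumptions.

```python
import io
import pandas as pd

# Stand-in for sales_data.csv (first two rows of the table above)
csv_text = '''Date,Channel,Product,Units Sold,Revenue
05/02/2022,Online,Keyboard,203,"$19,685"
05/15/2022,Retail,Mouse,94,"$8,500"
'''

reduced_cols = ['Units Sold', 'Revenue', 'Channel']

# Approach 1: load everything, then drop whatever is not in reduced_cols
sales = pd.read_csv(io.StringIO(csv_text))
to_drop = sales.columns.difference(reduced_cols)
print(list(to_drop))        # ['Date', 'Product']
sales = sales.drop(columns=to_drop)
print(list(sales.columns))  # ['Channel', 'Units Sold', 'Revenue']

# Approach 2: only load the wanted columns in the first place
slim = pd.read_csv(io.StringIO(csv_text), usecols=reduced_cols)
print(list(slim.columns))   # ['Channel', 'Units Sold', 'Revenue']
```

In both cases the surviving columns keep the order they had in the file, not the order of reduced_cols. If you want them in a specific order, selecting with sales[reduced_cols] is one way to do that.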

Key Takeaways

We've covered a ton of techniques here. So before we wrap up, let me offer a high-level summary of the key concepts:

.drop() is the simplest method, just list column names to remove them
Refer to columns by position instead of name with .columns[]
.iloc[] slices columns by integer position
.loc[] works like .iloc[] but uses labels instead of positions (and includes the end label)
del deletes a column from the original DataFrame in place
.pop() removes a column BUT returns it as well
For imported CSV data, read with Pandas then use .drop()

I hope these examples give you a full toolbox to wield on your data journey! Whether you're dealing with messy Excel exports or pristine CSV datasets, judiciously dropping distracting columns will set you up for analysis success.

Now go let Pandas help narrow your views!