Hey there! As someone diving into the world of data analysis with Python, you may have heard about how Pandas is one of the best libraries around for manipulating tabular data.
However, when first importing datasets into Pandas DataFrames, we often get absolutely massive tables with tons of unnecessary columns. And sifting through all that repetitive data makes it really hard to analyze what's actually important!
So today, I'm going to walk you through different techniques for dropping columns in Pandas. We'll go way beyond just explaining the basic code…I'll share plenty of examples and visuals to really help cement these concepts.
Here's what we'll cover:
- A Quick History of Pandas (so you understand why it's so popular!)
- Using the drop() Method to Remove Columns
- Deleting Columns by Index Number
- Leveraging iloc[] and loc[] Slicing Syntax
- Dropping Columns with the del Statement
- Returning Deleted Columns by "Popping" Them
- Trimming Down Imported CSV Data
- Visual Examples and Code Samples Galore!
Let's get started!
A Brief History of Pandas
Since we'll be working with the Pandas library throughout this tutorial, I wanted to give you some quick context on where Pandas comes from and why data analysts love using it.
Pandas was started in 2008 by Wes McKinney, who wanted a fast, flexible tool for data analysis in Python. Wes had a background in finance, so he was mainly interested in working with financial datasets containing time series data.
To scratch his own itch for efficient data wrangling, he built Pandas around two core data structures:
- DataFrame: For storing tabular data (like spreadsheets or SQL tables)
- Series: A one-dimensional labeled array (think of a single column), handy for time series data and quick math
After Pandas was open sourced in 2009, the project absolutely exploded in popularity for a few reasons:
- Developers loved Pandas for its simplicity. DataFrames made working with labeled, tabular data dead simple.
- Its core operations run on NumPy arrays, so it could crunch datasets much faster than plain Python loops.
- The timing was perfect as interest in data science and Python was surging globally.
Basically, Pandas combined easy syntax with blazing speed…allowing data analysts to work faster and more productively than ever.
Wes has since moved on to other data tooling work, and Pandas is now maintained by a large community of contributors worldwide!
So with this rich history in mind, let's jump in and start slicing unnecessary columns off our DataFrames!
Getting Started: Importing Pandas and Creating a Sample Dataset
For all examples today, we'll work inside JupyterLab. To start, let's import Pandas and create a sample DataFrame:
import pandas as pd
data = {'Apples': [30, 21, 55],
        'Oranges': [24, 40, 28],
        'Bananas': [20, 35, 41],
        'Grapes': [35, 29, 40]}
purchases = pd.DataFrame(data)
print(purchases)
This outputs our starter DataFrame:
| | Apples | Oranges | Bananas | Grapes |
|---|---|---|---|---|
| 0 | 30 | 24 | 20 | 35 |
| 1 | 21 | 40 | 35 | 29 |
| 2 | 55 | 28 | 41 | 40 |
As you can see, we have a table showing fruit purchases broken down by four fruit categories (columns) and three regions (rows).
Great – now let's start practicing removing columns we don't need!
Method #1: Using .drop() to Remove Columns by Name
The simplest way to drop columns is using Pandas' built-in .drop() method.
.drop() works by specifying the name of the column(s) you want to remove. Let's remove Oranges and Grapes:
# Drop by column name
purchases.drop(['Oranges', 'Grapes'], axis=1, inplace=True)
print(purchases)
| | Apples | Bananas |
|---|---|---|
| 0 | 30 | 20 |
| 1 | 21 | 35 |
| 2 | 55 | 41 |
And we're down to just Apples and Bananas! Notice a few things:
- We passed in a list ['Oranges', 'Grapes'] containing all column names to remove
- Set axis=1 to target columns instead of rows
- inplace=True modifies the DataFrame directly instead of needing to reassign the returned result
You can drop any columns easily this way by just changing the list of names.
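Pro Tip: .drop() matches column labels, not positions, so integers only work when the column labels themselves are integers. To drop by position, look the labels up through purchases.columns first!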
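Here's a minimal sketch of the same idea without inplace=True. The names fruit, trimmed, and safe are just illustrative, and Kiwis is a made-up column that doesn't exist. It also shows the columns= keyword (an alternative to axis=1) and errors='ignore', which skips names that aren't present:

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [30, 21, 55],
    'Oranges': [24, 40, 28],
    'Bananas': [20, 35, 41],
    'Grapes': [35, 29, 40],
})

# columns= is equivalent to passing the labels with axis=1
trimmed = fruit.drop(columns=['Oranges', 'Grapes'])

# Without inplace=True the original DataFrame is left untouched
print(list(fruit.columns))    # ['Apples', 'Oranges', 'Bananas', 'Grapes']
print(list(trimmed.columns))  # ['Apples', 'Bananas']

# errors='ignore' skips labels that do not exist instead of raising KeyError
safe = fruit.drop(columns=['Grapes', 'Kiwis'], errors='ignore')
print(list(safe.columns))     # ['Apples', 'Oranges', 'Bananas']
```

Reassigning the result like this is handy when you want to keep the full table around for later steps.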
Method #2: Removing Columns by Index Number
Speaking of indexes, what if we don't know the column names?
Pandas lets us look up a column's name from its position instead:
# Drop the column at position 1
purchases.drop(purchases.columns[1], axis=1, inplace=True)
print(purchases)
| | Apples |
|---|---|
| 0 | 30 |
| 1 | 21 |
| 2 | 55 |
Here's how it works:
- purchases.columns gives us the Index of column names
- We pass [1] to grab just the 2nd name (Bananas), since positions start at 0
- This returns the label for that column, not a number
- .drop removes the column with that label
So if you only know position and not column names, this method is quite handy!
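The same lookup works for several positions at once. Here's a small sketch, assuming a fresh four-column table (fruit, to_drop, and remaining are illustrative names):

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [30, 21, 55],
    'Oranges': [24, 40, 28],
    'Bananas': [20, 35, 41],
    'Grapes': [35, 29, 40],
})

# Indexing .columns with a list of positions returns the matching labels
to_drop = fruit.columns[[0, 2]]
print(list(to_drop))            # ['Apples', 'Bananas']

# Those labels can then be handed to drop() as usual
remaining = fruit.drop(columns=to_drop)
print(list(remaining.columns))  # ['Oranges', 'Grapes']
```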
Method #3: Slicing Ranges of Columns with .iloc[]
Manually looking up indexes can get tedious. Wouldn't it be nice to slice ranges of columns instead?
That's where .iloc[] comes in handy. .iloc allows slicing DataFrames by integer position.
Let's recreate our purchases DataFrame then try out .iloc[]:
import pandas as pd
data = {'Apples': [35, 41, 28],
        'Oranges': [30, 22, 40],
        'Bananas': [24, 20, 55],
        'Grapes': [40, 35, 21]}
purchases = pd.DataFrame(data)
print(purchases)
# Slice positions 1 and 2, then pass their column labels to .drop()
purchases.drop(purchases.iloc[:, 1:3].columns, axis=1, inplace=True)
print(purchases)
| | Apples | Oranges | Bananas | Grapes |
|---|---|---|---|---|
| 0 | 35 | 30 | 24 | 40 |
| 1 | 41 | 22 | 20 | 35 |
| 2 | 28 | 40 | 55 | 21 |

| | Apples | Grapes |
|---|---|---|
| 0 | 35 | 40 |
| 1 | 41 | 35 |
| 2 | 28 | 21 |
See how we cleanly removed the Oranges and Bananas columns?
The syntax is:
- iloc[:, 1:3] keeps all rows and slices columns from index 1 up to index 3
- Index 3 is excluded though, so only indexes 1 and 2 get removed
Being able to slice like this keeps our code DRY instead of having to manually specify each column!
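To make the half-open slice concrete, here's a quick sketch on a fresh copy of the table (fruit, middle, and kept are illustrative names). It also shows an alternative I find useful: selecting the positions you want to keep instead of dropping the rest.

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [35, 41, 28],
    'Oranges': [30, 22, 40],
    'Bananas': [24, 20, 55],
    'Grapes': [40, 35, 21],
})

# iloc slices are half-open: 1:3 covers positions 1 and 2 only
middle = fruit.iloc[:, 1:3]
print(list(middle.columns))  # ['Oranges', 'Bananas']

# Alternative: pick the positions to keep rather than dropping the others
kept = fruit.iloc[:, [0, 3]]
print(list(kept.columns))    # ['Apples', 'Grapes']
```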
Method #4: Leveraging .loc[] to Drop by Name
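Similar to .iloc, .loc[] accomplishes the same type of slicing…but using labels instead of integer positions.
Since the last example removed Oranges and Bananas in place, we rebuild purchases from data first. Observe:
purchases = pd.DataFrame(data)  # Restore all four columns
purchases.drop(purchases.loc[:, 'Apples':'Bananas'].columns, axis=1, inplace=True)
print(purchases)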
| | Grapes |
|---|---|
| 0 | 40 |
| 1 | 35 |
| 2 | 21 |
Here's the breakdown:
- loc[:, 'Apples':'Bananas'] slices from Apples through Bananas based on their names (unlike .iloc, the end label is included)
- We tack on .columns to grab the column labels
- .drop removes those columns, leaving us with just Grapes
So .loc[] gives you a bit more flexibility to slice based on labels/names instead of positions.
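The inclusive end label is the detail that tends to trip people up, so here's a small side-by-side sketch on a fresh four-column table (fruit, by_label, and by_position are illustrative names):

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [35, 41, 28],
    'Oranges': [30, 22, 40],
    'Bananas': [24, 20, 55],
    'Grapes': [40, 35, 21],
})

# Label slices include BOTH endpoints
by_label = fruit.loc[:, 'Apples':'Bananas']
print(list(by_label.columns))     # ['Apples', 'Oranges', 'Bananas']

# Position slices exclude the end position
by_position = fruit.iloc[:, 0:2]
print(list(by_position.columns))  # ['Apples', 'Oranges']
```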
Method #5: Using del to Remove Columns by Name
Now let me show you a simpler way than .drop() to remove a single column by name in Python…the del statement!
del just takes a column name and deletes that column:
import pandas as pd
from IPython.display import display
data = {'Apples': [35, 41],
        'Oranges': [40, 33],
        'Bananas': [24, 40]}
purchases = pd.DataFrame(data)
print('Original DataFrame:')
display(purchases)
del purchases['Apples']
print('After Deleting Apples:')
display(purchases)
Original DataFrame:
| | Apples | Oranges | Bananas |
|---|---|---|---|
| 0 | 35 | 40 | 24 |
| 1 | 41 | 33 | 40 |

After Deleting Apples:
| | Oranges | Bananas |
|---|---|---|
| 0 | 40 | 24 |
| 1 | 33 | 40 |
The major difference versus .drop()?
del always modifies the DataFrame in place, while .drop() returns a new DataFrame unless you pass inplace=True. It also handles just one column per statement. So if you definitely want to delete a column from the original DataFrame, del gets the job done easily.
We will miss you, Apples column!
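Here's a short sketch of that in-place difference, using a fresh copy of the small table (fruit and without_apples are illustrative names):

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [35, 41],
    'Oranges': [40, 33],
    'Bananas': [24, 40],
})

# drop() hands back a new DataFrame and leaves the original alone
without_apples = fruit.drop(columns='Apples')
print('Apples' in fruit.columns)  # True

# del changes the original DataFrame directly
del fruit['Apples']
print('Apples' in fruit.columns)  # False
```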
Method #6: Using .pop() to Remove and Return Columns
What if we want to take a column out…but hang on to its data in case removing it turns out to be a mistake?
No worries, we can leverage .pop() to remove and capture columns. (It can't bring back Apples, though, since del already discarded that one.)
Check it:
oranges = purchases.pop('Oranges')
print('Oranges Column:')
print(oranges)
print('Remaining DataFrame:')
display(purchases)
Oranges Column:
0    40
1    33
Name: Oranges, dtype: int64
Remaining DataFrame:
| | Bananas |
|---|---|
| 0 | 24 |
| 1 | 40 |
Instead of just deleting the Oranges column, .pop() returns it as a Pandas Series.
We can store that Series and add it back later if needed:
purchases['Oranges'] = oranges  # Add column back (it lands at the end)
print(purchases)
| | Bananas | Oranges |
|---|---|---|
| 0 | 24 | 40 |
| 1 | 40 | 33 |
So when you may want to just "set aside" columns instead of permanently axing them, .pop() is the way to go!
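Assigning a popped column back puts it at the end of the table. If the original position matters, one option is to remember it first and use insert(). This is just a sketch, with fruit and position as illustrative names:

```python
import pandas as pd

fruit = pd.DataFrame({
    'Apples': [35, 41],
    'Oranges': [40, 33],
    'Bananas': [24, 40],
})

# Remember where the column sits before removing it
position = fruit.columns.get_loc('Oranges')
oranges = fruit.pop('Oranges')
print(list(fruit.columns))  # ['Apples', 'Bananas']

# insert() puts the Series back at the saved position (modifies fruit in place)
fruit.insert(position, 'Oranges', oranges)
print(list(fruit.columns))  # ['Apples', 'Oranges', 'Bananas']
```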
Method #7: Dropping Columns after Importing CSV Data
All the examples so far started with a Pandas DataFrame already created in Python. But what about dropping columns after loading an external dataset?
Pandas makes it just as easy to narrow down imported CSV data with .drop().
Let me show you with an example retail sales dataset:
import pandas as pd
sales = pd.read_csv('sales_data.csv')
print(sales.head())
print(f'Shape: {sales.shape}')
| | Date | Channel | Product | Units Sold | Revenue |
|---|---|---|---|---|---|
| 0 | 05/02/2022 | Online | Keyboard | 203 | $19,685 |
| 1 | 05/15/2022 | Retail | Mouse | 94 | $8,500 |
| 2 | 05/19/2022 | Online | Monitor | 79 | $35,245 |
| 3 | 06/01/2022 | Retail | Desk | 173 | $77,865 |
| 4 | 06/12/2022 | Online | Chair | 456 | $92,304 |
Shape: (2000, 5)
Okay, we imported this sales data containing:
- 2000 rows
- 5 columns (the row index on the left doesn't count)
Let's say we only care about Units Sold, Revenue, and Channel…so we want to drop Date and Product:
reduced_cols = ['Units Sold', 'Revenue', 'Channel']
sales.drop(columns=sales.columns.difference(reduced_cols), inplace=True)
print(sales.head())
print(f'Shape: {sales.shape}')
| | Channel | Units Sold | Revenue |
|---|---|---|---|
| 0 | Online | 203 | $19,685 |
| 1 | Retail | 94 | $8,500 |
| 2 | Online | 79 | $35,245 |
| 3 | Retail | 173 | $77,865 |
| 4 | Online | 456 | $92,304 |
Shape: (2000, 3)
By storing the columns we want to keep in reduced_cols, we can reference the remaining columns and drop them in a single line with .difference()! Note that the surviving columns stay in their original order from the file, not the order listed in reduced_cols.
And now we have a much more focused view on the sales metrics we care about, all without altering the original raw CSV data. Pretty neat right?
This should help give you a good starter approach to narrowing down datasets imported from external sources!
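If you already know which columns you need, another option is to skip the unwanted ones at load time with the usecols parameter of read_csv. Here's a sketch under some assumptions: csv_text is a tiny made-up stand-in for sales_data.csv (so the example runs without the file), and the revenue values are simplified to plain numbers.

```python
import io
import pandas as pd

# Hypothetical stand-in for sales_data.csv, with made-up rows
csv_text = """Date,Channel,Product,Units Sold,Revenue
05/02/2022,Online,Keyboard,203,19685
05/15/2022,Retail,Mouse,94,8500
"""

# usecols loads only the named columns, so there is nothing to drop afterwards
sales = pd.read_csv(io.StringIO(csv_text),
                    usecols=['Units Sold', 'Revenue', 'Channel'])
print(list(sales.columns))  # ['Channel', 'Units Sold', 'Revenue']
print(sales.shape)          # (2, 3)

# Selecting with a list of names gives a specific column order if needed
reordered = sales[['Units Sold', 'Revenue', 'Channel']]
print(list(reordered.columns))  # ['Units Sold', 'Revenue', 'Channel']
```

For wide files this can also save memory, since the skipped columns never make it into the DataFrame.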
Key Takeaways
We've covered a ton of techniques here. So before we wrap up, let me offer a high-level summary of the key concepts:
- .drop() is the simplest method, just list column names to remove them
- Refer to columns by position instead of name with .columns[]
- .iloc[] slices columns by integer position
- .loc[] works like .iloc[] but uses labels instead of positions (and includes the end label)
- del deletes a single column in place
- .pop() removes a column BUT returns it as well
- For imported CSV data, read with Pandas then use .drop()
I hope these examples give you a full toolbox to wield on your data journey! Whether you're dealing with messy Excel exports or pristine CSV datasets, judiciously dropping distracting columns will set you up for analysis success.
Now go let Pandas help narrow your views!