The common mistakes with Pandas

By JoeVu, at: Aug. 11, 2023, 10:31 p.m.

Estimated Reading Time: 11 min read


Pandas is a powerful and versatile Python library that provides data manipulation and analysis capabilities. However, like any tool, it comes with its own set of pitfalls that junior developers often fall into. In this article, we will explore some common mistakes to avoid when working with Pandas to ensure smoother and more efficient data processing.

1. Misunderstanding DataFrame vs. Series

Confusion between DataFrames and Series is quite common. A DataFrame is a two-dimensional data structure with rows and columns, while a Series is a one-dimensional labeled array (each column of a DataFrame is a Series).

# Creating a DataFrame and a Series
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Accessing a column as a Series
column_series = df['A']

# Accessing a row as a Series (incorrect): df[...] selects columns, not rows
# row_series = df[0]  # Raises a KeyError because there is no column named 0


# Instead, we should do 
# Accessing a row as a Series
row_series = df.loc[0] # Using the index label
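
When it is unclear what a selection returns, a quick type check settles it; the short sketch below also shows positional row access with .iloc for comparison:

# Checking what a selection returns
print(type(df['A']))    # a Series (single brackets)
print(type(df[['A']]))  # a DataFrame (double brackets keep a DataFrame)

# Accessing a row by position rather than by label
row_series = df.iloc[0]  # .iloc uses integer positions, .loc uses index labels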

 

2. SettingWithCopyWarning 

Modifying a subset of a DataFrame without explicit assignment can result in a SettingWithCopyWarning. This usually happens when modifying data in a slice of the DataFrame instead of working on an explicit copy or assigning through .loc on the original.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Modifying a subset without proper assignment
subset = df[df['A'] > 1]
subset['B'] = 10  # This might trigger a SettingWithCopyWarning

# Instead, we should work on an explicit copy

subset = df[df['A'] > 1].copy()
subset['B'] = 10

# or, when the goal is to update the original DataFrame,
# filter and assign in a single .loc[] call

df.loc[df['A'] > 1, 'B'] = 10
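
As a side note for newer installations: pandas 2.0 and later offer a copy-on-write mode (planned to become the default behavior in pandas 3.0) under which any derived object acts as an independent copy, so this whole class of ambiguity disappears. A minimal sketch, assuming pandas >= 2.0:

import pandas as pd

pd.set_option("mode.copy_on_write", True)  # opt in; requires pandas >= 2.0

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
subset = df[df['A'] > 1]
subset['B'] = 10  # Modifies only 'subset'; 'df' is left untouched and no warning is raised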

 

3. Chained Indexing

Chaining multiple indexing operations (df['column']['row']) is discouraged as it might lead to unpredictable behavior and bugs.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Chained indexing (discouraged): two separate lookups
value = df['A'][0]  # May operate on a copy and behaves unpredictably when assigning

# Instead, we should do 

value = df.loc[0, 'A']  # A single .loc[] call with the row label and column label

 

4. Not Using .copy() When Necessary 

When creating a new DataFrame or Series from an existing one and modifying it, make sure to use .copy() to avoid unintentional changes to the original data.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Modifying a slice without using .copy()
subset = df[df['A'] > 1]
subset['B'] = 10  # Ambiguous: Pandas warns, and whether 'df' is affected is not guaranteed

# Instead, we should do 
subset = df[df['A'] > 1].copy()
subset['B'] = 10

 

5. Missing Values Handling

Not properly handling missing values (NaN or None) can lead to errors in calculations and analyses. Understanding methods like dropna(), fillna(), and interpolate() is important.

# Creating a DataFrame with missing values
import pandas as pd
import numpy as np

data = {'A': [1, np.nan, 3], 'B': [4, 5, np.nan]}
df = pd.DataFrame(data)

# Dropping rows with missing values
cleaned_df = df.dropna()

# Filling missing values with a specific value
filled_df = df.fillna(0)
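
The paragraph above also mentions interpolate(), which the example does not show; as a brief sketch, it fills gaps from the neighbouring numeric values, and fillna() can also take a per-column mapping:

# Filling gaps by interpolating between neighbouring values
interpolated_df = df.interpolate()

# Filling with a different value per column instead of a single constant
filled_df = df.fillna({'A': 0, 'B': df['B'].mean()})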

 

6. Applying Functions Incorrectly

Applying functions to DataFrames or Series using apply() without understanding its purpose and behavior can lead to unexpected results.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Applying a function to a Series: the function receives one element at a time
square_root = df['A'].apply(lambda x: x ** 0.5)

# Applying a function to a DataFrame: the function receives a whole column
# (or row) as a Series, not a single element
squared_df = df.apply(lambda x: x ** 2)  # Works, because ** is element-wise on a Series

# A function written for scalars fails here, because x is a Series
# df.apply(lambda x: 0 if x < 2 else x)  # Raises: truth value of a Series is ambiguous

# For element-wise functions on a DataFrame, use applymap() instead
clipped_df = df.applymap(lambda x: 0 if x < 2 else x)
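
Since the axis argument trips up many newcomers, here is a small sketch of the difference: by default apply() hands the function one column at a time, and axis=1 switches it to one row at a time (the result names below are just examples):

# axis=0 (default): the function sees one column at a time
col_sums = df.apply(lambda col: col.sum())  # one value per column

# axis=1: the function sees one row at a time
row_sums = df.apply(lambda row: row['A'] + row['B'], axis=1)  # one value per row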

 

7. Using iterrows() and itertuples()

Both methods iterate row by row, which is far slower than vectorized operations (and usually slower than .apply()) on large DataFrames, so they should be a last resort.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using iterrows (less efficient)
for index, row in df.iterrows():
    print(row['A'], row['B'])

# Using itertuples (faster than iterrows, but still row-by-row)
for row in df.itertuples():
    print(row.A, row.B)
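
In most cases the loop is not needed at all: whatever is computed per row can usually be written as a whole-column expression. A minimal sketch reusing the df above (the new 'sum' column is just an illustrative name):

# The same per-row work expressed as one vectorized operation
df['sum'] = df['A'] + df['B']
print(df)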

 

8. Not Utilizing Vectorized Operations 

Pandas is optimized for vectorized operations. Performing element-wise calculations using loops can be slow and inefficient.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using a for loop (less efficient)
squared_list = []
for value in df['A']:
    squared_list.append(value ** 2)

# Using vectorized operation (more efficient)
squared_array = df['A'] ** 2

 

9. Misusing GroupBy

Incorrect usage of the groupby() function can lead to improper aggregation results. Also, forgetting to reset the index after grouping can cause indexing issues.

# Creating a DataFrame
import pandas as pd

data = {'Category': ['A', 'B', 'A'], 'Value': [10, 20, 30]}
df = pd.DataFrame(data)

# Easy to overlook: the aggregated result is indexed by 'Category',
# not by the usual 0, 1, 2..., and every numeric column gets aggregated
grouped = df.groupby('Category')
mean_values = grouped.mean()

# Selecting the column explicitly and resetting the index afterwards
# keeps 'Category' as a regular column for later selection or merging
mean_values = df.groupby('Category')['Value'].mean().reset_index()
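
When several summary statistics are needed at once, agg() keeps the aggregation explicit and the output readable. A short sketch with the same df (the chosen statistics are just examples):

# Several explicit aggregations at once
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'count']).reset_index()
print(summary)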

 

10. Misunderstanding Indexing

Not understanding how to set, reset, or manipulate the index can cause confusion in data selection and merging.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Setting a column as the index
df.set_index('A', inplace=True)

# Resetting the index turns 'A' back into a regular column
df.reset_index(inplace=True)

# Note: calling reset_index() on a DataFrame that still has the default
# 0, 1, 2... index adds an unwanted 'index' column; pass drop=True to avoid that
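
To make the effect on data selection concrete, here is a small sketch of how .loc behaves before and after set_index(), rebuilding the same df:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

print(df.loc[0])       # row with the default index label 0

df = df.set_index('A')
print(df.loc[2])       # rows are now looked up by the values of 'A'

df = df.reset_index()  # 'A' becomes a regular column again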

 

11. Inefficient Data Manipulation

Junior developers might overuse the for loop to modify DataFrame values, which is usually less efficient than using built-in functions or vectorized operations.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Using a for loop for modification (inefficient)
for index, row in df.iterrows():
    df.at[index, 'B'] = row['B'] * 2

# Using vectorized operation (efficient)
df['B'] = df['B'] * 2

 

12. Mixing up & and | vs. and and or

When filtering DataFrames, element-wise boolean conditions must be combined with & and |; Python's and and or work on single truth values and raise an error when given a whole Series.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Mixing up operators (incorrect)
# subset = df[(df['A'] > 1) and (df['B'] > 4)]  # Raises ValueError: truth value of a Series is ambiguous

# Using correct operators
subset = df[(df['A'] > 1) & (df['B'] > 4)]
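
The same rule covers 'or' and 'not': use | and ~, and keep parentheses around each condition because the bitwise operators bind tighter than comparisons. A short sketch with the same df:

# 'or' becomes |, 'not' becomes ~, and each condition keeps its own parentheses
subset_or = df[(df['A'] > 2) | (df['B'] < 5)]
subset_not = df[~(df['A'] > 1)]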

 

13. Memory Usage Ignorance

Large DataFrames can consume a lot of memory. Not being mindful of memory usage can lead to crashes or slowdowns.

# Generating a large DataFrame
import pandas as pd
import numpy as np

data = {'A': np.random.random(1000000)}
df = pd.DataFrame(data)

# Displaying memory usage
print(df.memory_usage(deep=True).sum())
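
When memory becomes a problem, choosing smaller dtypes is usually the first lever to pull. A minimal sketch of the idea; the exact savings depend on the data, and the 'city' column is a hypothetical example:

# Downcasting float64 to float32 roughly halves the memory used by this column
df['A'] = df['A'].astype('float32')
print(df.memory_usage(deep=True).sum())

# For low-cardinality text columns, the 'category' dtype often saves far more
# df['city'] = df['city'].astype('category')  # hypothetical string column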

 

14. Not Reading Documentation

Pandas has rich documentation that provides examples and explanations for all of its functions. Not consulting the documentation can lead to confusion and mistakes.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# A typical documentation-related mistake: most Pandas methods return a new
# object instead of modifying in place, so this result is silently discarded
df.replace(1, 10)

# The documentation makes the fix clear: assign the result
df = df.replace(1, 10)

 

15. Unoptimized Code

Writing code that iterates over large DataFrames without leveraging Pandas' built-in optimizations can result in slow performance.

# Creating a large DataFrame
import pandas as pd

data = {'A': range(100000)}
df = pd.DataFrame(data)

# Inefficient loop-based calculation
result = []
for value in df['A']:
    result.append(value * 2)

# Optimized vectorized calculation
result = df['A'] * 2

 

16. Ignoring Method Chaining

Pandas supports method chaining, where multiple operations are applied in sequence. Ignoring this practice can lead to less readable and less efficient code.

# Creating a DataFrame
import pandas as pd

data = {'A': [1, 2, 3, 4], 'B': [4, 5, None, 7]}
df = pd.DataFrame(data)

# Without method chaining
subset = df[df['A'] > 1].copy()
subset = subset.dropna()
subset['B'] = subset['B'] * 2

# With method chaining
subset = df[df['A'] > 1].dropna().assign(B=lambda x: x['B'] * 2)

 

17. Not Checking Data Types

Pandas infers data types when reading data, but sometimes it might guess incorrectly. Not checking and correcting data types can lead to errors.

# Creating a DataFrame with string data, as often read from a CSV
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', 'three']})

# Check what Pandas inferred before converting
print(df.dtypes)  # 'A' is object (strings), not a numeric type

# Converting blindly fails on the non-numeric value
# df['A'] = df['A'].astype(int)  # Raises ValueError because of 'three'
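
When bad values are expected, pd.to_numeric with errors='coerce' converts what it can and turns anything unparseable into NaN, which can then be handled with the missing-value techniques from earlier:

# Invalid entries become NaN instead of raising an error
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df.dtypes)  # 'A' is now float64 (NaN forces a float dtype)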

 

 

By being aware of these common mistakes and practicing good coding habits, junior developers can enhance their Pandas proficiency and reduce errors when working with data. Familiarity with Pandas' documentation, seeking advice from experienced developers, and consistent practice will pave the way for efficient and effective data manipulation and analysis.