Pandas: Data manipulation in Python 🐼🐍📊💻🔍

December 5, 2022

Pandas is a popular open-source data manipulation library in Python. It is built on top of NumPy and provides easy-to-use data structures and data analysis tools for handling tabular data. Pandas is widely used in data science, data analysis, and machine learning projects. In this article, we will explore the key features of Pandas and learn how to use it for data manipulation.

Pandas Data Structures

Pandas provides two primary data structures: Series and DataFrame.

Series

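A Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, or Python objects. It is similar to a single column in a spreadsheet or a SQL table. A Series pairs two arrays: one for the values and one for the index, which holds a label for each value. The data can be a NumPy array, a Python list, a scalar value, or a dictionary. When you pass a list or array without an index, Pandas assigns default integer labels starting at 0.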

Here’s an example of creating a Series:

import pandas as pd

data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']

s = pd.Series(data, index=index)

print(s)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64
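The other input types mentioned above behave slightly differently, so a short sketch may help. The variable names and values here are made up for illustration; the point is how the index is derived in each case.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a scalar: the value is repeated once per index label
s_scalar = pd.Series(5, index=['x', 'y', 'z'])

# From a NumPy array with no index: default integer labels 0, 1, 2
s_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# The two components of a Series: the index and the values
print(s_dict.index.tolist())   # ['a', 'b', 'c']
print(s_dict.to_numpy())       # [10 20 30]
print(s_scalar.tolist())       # [5, 5, 5]
print(s_array.index.tolist())  # [0, 1, 2]
```

Note that a scalar needs an explicit index for the value to be repeated; without one, Pandas would create a Series with a single element.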

DataFrame

A DataFrame is a two-dimensional tabular data structure with labeled rows and columns. It is similar to a spreadsheet or a SQL table, and each column behaves like a Series. A DataFrame consists of three components: the data, the index (row labels), and the columns (column labels). The data can be a NumPy array, a Python list of lists, a dictionary of lists, a dictionary of dictionaries, or a Pandas Series.

Here’s an example of creating a DataFrame:

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 32, 18, 47, 33],
    'country': ['USA', 'Canada', 'USA', 'Canada', 'USA']
}

df = pd.DataFrame(data)

print(df)

Output:

       name  age country
0     Alice   25     USA
1       Bob   32  Canada
2   Charlie   18     USA
3     David   47  Canada
4     Emily   33     USA
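As a rough sketch of the other input types, the snippet below builds small DataFrames from a list of lists and from a dictionary of dictionaries, then inspects the index and columns. The sample values and the row labels 'alice' and 'bob' are made up for illustration.

```python
import pandas as pd

# From a list of lists: each inner list is one row,
# and the column names are passed separately
rows = [['Alice', 25], ['Bob', 32]]
df_rows = pd.DataFrame(rows, columns=['name', 'age'])

# From a dictionary of dictionaries: outer keys become columns,
# inner keys become row labels
nested = {
    'age': {'alice': 25, 'bob': 32},
    'country': {'alice': 'USA', 'bob': 'Canada'},
}
df_nested = pd.DataFrame(nested)

# Inspect the components of each DataFrame
print(df_rows.columns.tolist())         # ['name', 'age']
print(df_rows.index.tolist())           # [0, 1]
print(df_rows.shape)                    # (2, 2)
print(df_nested.index.tolist())         # ['alice', 'bob']
print(df_nested.loc['bob', 'country'])  # Canada
```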

Data Manipulation with Pandas

Pandas provides powerful tools for data manipulation, including indexing, slicing, filtering, grouping, merging, and pivoting.

Indexing and Slicing

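Pandas provides two primary indexing operators: .loc and .iloc. .loc selects by label, and .iloc selects by integer position. With the default integer index the labels and positions coincide, but slices still behave differently: a .loc slice includes its end label, while an .iloc slice excludes its end position, as ordinary Python slicing does.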
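The difference is easiest to see when the labels do not match the positions. The following is a minimal sketch using two made-up Series; the names s and letters are only for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions
s = pd.Series(['x', 'y', 'z'], index=[2, 1, 0])

print(s.loc[0])   # z  (the value whose label is 0)
print(s.iloc[0])  # x  (the value at position 0)

# Slicing: .loc includes the end label, .iloc excludes the end position
letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(letters.loc['a':'c'].tolist())  # [10, 20, 30]
print(letters.iloc[0:2].tolist())     # [10, 20]
```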

Here’s an example of indexing and slicing a DataFrame using .loc and .iloc:

import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 32, 18, 47, 33],
    'country': ['USA', 'Canada', 'USA', 'Canada', 'USA']
}

df = pd.DataFrame(data)

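# Select rows by label (the default index labels are 0, 1, 2, ...)
print(df.loc[0]) # First row
print(df.loc[0:2]) # First three rows (label slices include the end label)

# Select rows by integer position
print(df.iloc[0]) # First row
print(df.iloc[0:3]) # First three rows (position slices exclude the end)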

# Select columns by label
print(df.loc[:, 'name']) # Name column
print(df.loc[:, ['name', 'age']]) # Name and age columns

# Select columns by integer
print(df.iloc[:, 0]) #