Pandas: Data manipulation in Python
Pandas is a popular open-source data manipulation library in Python. It is built on top of NumPy and provides easy-to-use data structures and data analysis tools for handling tabular data. Pandas is widely used in data science, data analysis, and machine learning projects. In this article, we will explore the key features of Pandas and learn how to use it for data manipulation.
Pandas Data Structures
Pandas provides two primary data structures: Series and DataFrame.
Series
A Series is a one-dimensional array-like object that can hold any data type. It is similar to a column in a spreadsheet or a SQL table. A Series consists of two arrays: one for the data and one for the index. The data can be a NumPy array, a Python list, a scalar value, or a dictionary.
Here’s an example of creating a Series:
import pandas as pd
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
s = pd.Series(data, index=index)
print(s)
Output:
a    10
b    20
c    30
d    40
e    50
dtype: int64
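The other input types mentioned above can be sketched in the same way. This is a minimal illustration; the variable names and values are made up for the example and are not part of the original article.

```python
import numpy as np
import pandas as pd

# From a NumPy array: with no index given, the default index is 0, 1, 2, ...
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a scalar: the value is repeated once for every index label
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(from_array.index.tolist())  # [0, 1, 2]
print(from_dict['b'])             # 20
print(from_scalar.tolist())       # [7, 7, 7]
```

Note that a scalar needs an explicit index to tell Pandas how many times to repeat the value.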
DataFrame
A DataFrame is a two-dimensional tabular data structure with labeled rows and columns. It is similar to a spreadsheet or a SQL table. A DataFrame consists of three components: the data, the index (row labels), and the columns (column labels). The data can be a NumPy array, a Python list of lists, a dictionary of lists or of dictionaries, or a Pandas Series.
Here’s an example of creating a DataFrame:
import pandas as pd
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 32, 18, 47, 33],
    'country': ['USA', 'Canada', 'USA', 'Canada', 'USA']
}
df = pd.DataFrame(data)
print(df)
Output:
      name  age country
0    Alice   25     USA
1      Bob   32  Canada
2  Charlie   18     USA
3    David   47  Canada
4    Emily   33     USA
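The example above builds the DataFrame from a dictionary of lists. As a rough sketch of two of the other input shapes, the snippet below uses a list of lists and a dictionary of dictionaries; the row labels 'alice' and 'bob' are invented for illustration.

```python
import pandas as pd

# From a list of lists: each inner list is one row,
# and the column names are supplied separately
rows = [['Alice', 25], ['Bob', 32]]
from_rows = pd.DataFrame(rows, columns=['name', 'age'])

# From a dictionary of dictionaries: the outer keys become columns,
# and the inner keys become row labels
nested = {
    'age': {'alice': 25, 'bob': 32},
    'country': {'alice': 'USA', 'bob': 'Canada'},
}
from_nested = pd.DataFrame(nested)

print(from_rows.shape)            # (2, 2)
print(list(from_nested.columns))  # ['age', 'country']
print(list(from_nested.index))    # ['alice', 'bob']
```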
Data Manipulation with Pandas
Pandas provides powerful tools for data manipulation, including indexing, slicing, filtering, grouping, merging, and pivoting.
Indexing and Slicing
Pandas provides two primary indexers: .loc and .iloc. .loc selects by label, and .iloc selects by integer position.
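The difference is easiest to see when the labels are not integers. The following is a small sketch using a hypothetical Series with string labels, where a label and a position cannot be confused:

```python
import pandas as pd

# Hypothetical data with string labels, so labels and positions differ
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']          # value whose label is 'b'
by_position = s.iloc[1]        # value at position 1 (the second element)

label_slice = s.loc['a':'c']   # label slices include the end label
position_slice = s.iloc[0:2]   # position slices exclude the end position

print(by_label, by_position)    # 20 20
print(label_slice.tolist())     # [10, 20, 30]
print(position_slice.tolist())  # [10, 20]
```

With the default integer index, labels and positions happen to coincide, so the two indexers can look interchangeable even though their slicing rules differ.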
Here’s an example of indexing and slicing a DataFrame using .loc and .iloc:
import pandas as pd
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 32, 18, 47, 33],
    'country': ['USA', 'Canada', 'USA', 'Canada', 'USA']
}
df = pd.DataFrame(data)
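# Select rows by label (the default index labels are the integers 0 to 4)
print(df.loc[0]) # First row
print(df.loc[0:2]) # First three rows (label slices include the end label)
# Select rows by integer position
print(df.iloc[0]) # First row
print(df.iloc[0:3]) # First three rows (position slices exclude the end position)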
# Select columns by label
print(df.loc[:, 'name']) # Name column
print(df.loc[:, ['name', 'age']]) # Name and age columns
# Select columns by integer
print(df.iloc[:, 0]) #