Why Pandas? A Beginner's Guide

M Shehzen
Published in Geek Culture
7 min read · Jan 2, 2023

Pandas

Pandas is a powerful and popular library for working with data in Python. It provides tools for handling and manipulating large and complex datasets, and is widely used in fields such as finance, economics, statistics, and data science.

Pandas is built on top of NumPy, a library for working with numerical data in Python, and provides a high-level interface for working with structured data. It provides two main data structures:

  1. The Series
  2. The DataFrame

The Series is a one-dimensional labelled array that can hold any data type, similar to a column in a spreadsheet. The DataFrame is a two-dimensional labelled data structure whose columns can have different types, similar to a whole spreadsheet or a SQL table.

Pandas is particularly useful for cleaning, transforming, and manipulating data in preparation for analysis. It provides a wide range of functions and methods for filtering, grouping, and aggregating data and handling missing or incomplete data.
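
As a rough illustration of the filtering, grouping, and aggregating mentioned above, here is a minimal sketch. The DataFrame, its column names (city, product, sales), and the threshold of 100 are all made up for this example.

```python
import pandas as pd

# Hypothetical sales data, invented for this illustration
df = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Pune'],
    'product': ['A', 'B', 'A', 'B', 'A'],
    'sales': [120, 80, 200, 150, 90],
})

# Filtering: keep only the rows where sales are above 100
high_sales = df[df['sales'] > 100]

# Grouping and aggregating: total sales per city
sales_per_city = df.groupby('city')['sales'].sum()

print(high_sales)
print(sales_per_city)
```

The filter uses a boolean condition inside square brackets, while groupby() splits the rows by city before sum() adds up each group.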

Pandas is also often used in conjunction with data visualisation and machine learning libraries, making it a valuable tool for data scientists and analysts.

Data structures in Pandas

Pandas provides two main data structures: the Series and the DataFrame.

Series

A Series is a one-dimensional labelled array that can hold any data type, much like a single column in a spreadsheet. You create one by passing your data (for example, a list) to pd.Series(), optionally together with an index, which is a list of labels for the data. If you supply your own labels, the index must be the same length as the data; otherwise pandas raises a ValueError. If you leave the index out, pandas creates a default integer index starting at 0, which is usually the simpler choice for large datasets. For example:

import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:
a    1
b    2
c    3
d    4
e    5
dtype: int64
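
The two index rules described above (a default index when none is given, and an error when the lengths differ) can be checked with a short sketch. The values and labels here are arbitrary and chosen only for illustration.

```python
import pandas as pd

data = [10, 20, 30]

# No index given: pandas creates the default labels 0, 1, 2
s_default = pd.Series(data)
print(s_default.index.tolist())

# With custom labels, values can be looked up by label
s_labelled = pd.Series(data, index=['x', 'y', 'z'])
print(s_labelled['y'])

# Three values but only two labels: pandas raises a ValueError
try:
    pd.Series(data, index=['x', 'y'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```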

DataFrame

A DataFrame is a two-dimensional labelled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table. A common way to create a DataFrame is to pass a dictionary of Series or lists to pd.DataFrame(). The keys in the dictionary are used as column names, and the values are the data for the corresponding columns. For example:

import pandas as pd

# Option 1: a dictionary of Series
data = {'a': pd.Series([1, 2, 3]),
        'b': pd.Series([4, 5, 6])}

# Option 2: a dictionary of lists (gives the same result)
data = {'a': [1, 2, 3],
        'b': [4, 5, 6]}

df = pd.DataFrame(data)
print(df)

Output:
   a  b
0  1  4
1  2  5
2  3  6
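
To show what "columns of potentially different types" looks like in practice, here is a small sketch. The names, ages, and membership flags are invented for the example.

```python
import pandas as pd

# Hypothetical data: each column holds a different type
df = pd.DataFrame({
    'name': ['Asha', 'Bilal', 'Chen'],
    'age': [31, 25, 40],
    'member': [True, False, True],
})

# Each column keeps its own dtype
print(df.dtypes)

# Selecting a single column returns a Series
ages = df['age']
print(type(ages))

# shape is (number of rows, number of columns)
print(df.shape)
```

This also ties the two data structures together: every column of a DataFrame is itself a Series.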

Importing and Exporting data

Pandas provides several functions for importing and exporting data from various sources. Some of the most common sources of data include CSV files, Excel files, and SQL databases.

You can use the pd.read_csv() function to import data from a CSV file. This function takes the file path or URL as an argument and returns a DataFrame. For example:

import pandas as pd
df = pd.read_csv('data.csv')
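
If you do not have a CSV file at hand, one way to experiment is to read the CSV text from memory. This is only a sketch: the csv_text contents are made up, and io.StringIO simply stands in for a real file.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (made up for this example)
csv_text = """name,score
Asha,90
Bilal,85
Chen,78
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.columns.tolist())
```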

To import data from an Excel file, you can use the pd.read_excel() function. It takes the file path or URL and, optionally, the name of the sheet to read, and returns a DataFrame. If sheet_name is omitted, the first sheet is read. For example:

import pandas as pd
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# sheet_name selects which sheet to read when the file has several sheets

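To import data from a SQL database, you can use the pd.read_sql() function. This function takes a SQL query and a connection object as arguments and returns a DataFrame. Note that pandas officially supports SQLAlchemy connectables and sqlite3 connections; other DBAPI connections such as pyodbc generally work, but recent pandas versions emit a warning for them. For example:

import pandas as pd
import pyodbc

# Replace server_name and database_name with your own values
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=server_name;"
                      "Database=database_name;"
                      "Trusted_Connection=yes;")

query = "SELECT * FROM table_name"
df = pd.read_sql(query, cnxn)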
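
To try pd.read_sql() without a database server, one option is Python's built-in sqlite3 module. The following is a minimal sketch rather than the setup used above: the users table and its rows are invented for the example.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so no server or file is needed
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (id INTEGER, name TEXT)')
conn.executemany('INSERT INTO users VALUES (?, ?)',
                 [(1, 'Asha'), (2, 'Bilal'), (3, 'Chen')])
conn.commit()

# Run a query and load the result into a DataFrame
df = pd.read_sql('SELECT * FROM users WHERE id > 1', conn)
conn.close()

print(df)
```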

In addition to importing data, pandas also provides functions for exporting data. To export a DataFrame to a CSV file, you can use the df.to_csv() method. To export a DataFrame to an Excel file, you can use the df.to_excel() method. For Example:

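import pandas as pd

# A small DataFrame to export
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Export DataFrame to CSV file (index=False leaves out the row labels)
df.to_csv('data.csv', index=False)

# Export DataFrame to Excel file (needs an Excel writer such as openpyxl installed)
df.to_excel('data.xlsx', sheet_name='Sheet1', index=False)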
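
To see what the index argument changes, the sketch below writes the CSV text to a string instead of a file and then reads it back. The DataFrame is a small made-up example.

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# With no path given, to_csv returns the CSV text as a string
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)
print(csv_with_index)
print(csv_without_index)

# Reading the index-free text back gives an equal DataFrame
df_roundtrip = pd.read_csv(io.StringIO(csv_without_index))
print(df_roundtrip.equals(df))
```

With the index included, the first column of the output holds the row labels 0, 1, 2; with index=False only the data columns are written.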

Data Cleaning and Preparation

Pandas is a useful tool for cleaning and preparing data for analysis. It provides several functions and methods for tasks such as removing duplicates, handling missing values and reformatting data.
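
Reformatting is mentioned above but has no section of its own, so here is a minimal sketch of what it can involve. The data is invented, and the specific steps (trimming text, fixing capitalisation, converting types) are only one possible clean-up routine.

```python
import pandas as pd

# Hypothetical raw data where everything arrived as text
df = pd.DataFrame({
    'name': ['  asha ', 'BILAL', 'chen  '],
    'age': ['31', '25', '40'],
    'joined': ['2023-01-02', '2022-11-15', '2021-06-30'],
})

# Tidy up text: remove surrounding spaces and use consistent capitalisation
df['name'] = df['name'].str.strip().str.title()

# Convert text to proper numeric and datetime types
df['age'] = df['age'].astype(int)
df['joined'] = pd.to_datetime(df['joined'])

print(df)
print(df.dtypes)
```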

Dealing with Duplicates

To remove duplicates from a DataFrame, you can use the df.drop_duplicates() method. This method removes rows with duplicate values in all or a subset of the columns. You can specify the columns to consider for duplicate values using the subset argument, or specify to keep the first or last occurrence of duplicates using the keep argument. For Example:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 3, 4, 5],
                   'C': [3, 4, 5, 6, 7]})

# Remove rows that are identical across all columns
# (no rows are dropped here, because column 'C' differs in every row)
df_deduplicated = df.drop_duplicates()

# Remove rows that repeat the same values in columns 'A' and 'B'
# (row 1 is dropped, because it repeats row 0 in those columns)
df_deduplicated = df.drop_duplicates(subset=['A', 'B'])

# Keep the first occurrence of each duplicate (the default)
df_deduplicated = df.drop_duplicates(keep='first')

# Keep the last occurrence of each duplicate
df_deduplicated = df.drop_duplicates(keep='last')
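
Before dropping anything, it can help to see which rows pandas treats as duplicates. The sketch below reuses the same DataFrame and compares the keep options by looking at which row labels survive; the variable names are just for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 3],
                   'B': [2, 2, 3, 4, 5],
                   'C': [3, 4, 5, 6, 7]})

# duplicated() marks rows that repeat an earlier row in the chosen columns
dup_mask = df.duplicated(subset=['A', 'B'])
print(dup_mask.tolist())

# keep='first' keeps row 0, keep='last' keeps row 1
kept_first = df.drop_duplicates(subset=['A', 'B'], keep='first').index.tolist()
kept_last = df.drop_duplicates(subset=['A', 'B'], keep='last').index.tolist()

# keep=False drops every row that has a duplicate
kept_none = df.drop_duplicates(subset=['A', 'B'], keep=False).index.tolist()

print(kept_first, kept_last, kept_none)
```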

Dealing with Missing Values

To handle missing values in a DataFrame, you can use the df.isnull() method to identify missing values, and the df.dropna() method to remove rows or columns with missing values. You can also use the df.fillna() method to fill in missing values with a specified value. For Example:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 3, 4, 5, 6],
                   'C': [3, 4, 5, 6, 7]})

# Introduce some missing values
df.loc[1, 'A'] = None
df.loc[3, 'B'] = None

# Select the rows that contain at least one missing value
df_missing = df[df.isnull().any(axis=1)]

# Remove rows with missing values
df_cleaned = df.dropna()

# Fill missing values with 0
df_cleaned = df.fillna(0)
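
A few common variations on these methods are sketched below: counting missing values per column, dropping columns instead of rows, and filling gaps with each column's mean rather than 0. The data is made up, and filling with the mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0, 5.0],
                   'B': [2.0, 3.0, np.nan, np.nan],
                   'C': [7, 8, 9, 10]})

# Count missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop columns (instead of rows) that contain missing values;
# only column 'C' has no gaps, so only it remains
no_missing_cols = df.dropna(axis=1)
print(no_missing_cols.columns.tolist())

# Fill each column's gaps with that column's mean
filled = df.fillna(df.mean())
print(filled)
```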

Data Visualization

Pandas can be used in conjunction with libraries like Matplotlib and Seaborn to create visually appealing and informative plots and charts. These libraries provide a wide range of plotting functions that can be easily used with Pandas data structures.

To create a simple line chart using pandas, you can use the df.plot() method and specify the kind argument as 'line'. The df.plot() method takes several optional arguments that allow you to customize the appearance of the chart, such as the x and y-axis labels, the title, and the legend. For Example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [3, 4, 5, 6]})

# Plot the DataFrame as a line chart
df.plot(x='x', y='y', kind='line', title='Line Chart')
plt.show()

Figure: Line chart

To create a bar chart using pandas, you can use the same df.plot() method and specify the kind argument as 'bar'. For Example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.DataFrame({'x': ['A', 'B', 'C', 'D'],
                   'y': [3, 4, 5, 6]})

# Plot the DataFrame as a bar chart
df.plot(x='x', y='y', kind='bar', title='Bar Chart')
plt.show()

Figure: Bar chart

In addition to the simple line and bar charts, pandas also provides functions for creating more advanced charts, such as scatter plots, histograms, and box plots. You can use the df.plot() method with different combinations of arguments to create these charts, or you can use the functions provided by Matplotlib and Seaborn directly.

For example, to create a scatter plot using pandas, you can use the df.plot() method and specify the kind argument as 'scatter':

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [3, 4, 5, 6]})

# Plot the DataFrame as a scatter plot
df.plot(x='x', y='y', kind='scatter', title='Scatter Plot')
plt.show()

Figure: Scatter plot

To create a histogram using pandas, you can call the plot() method on a Series (or a DataFrame) and specify the kind argument as 'hist':

import pandas as pd
import matplotlib.pyplot as plt

# Create a Series
s = pd.Series([1, 2, 3, 3, 4, 5, 6, 6, 7, 8])

# Plot the Series as a histogram
s.plot(kind='hist', title='Histogram')
plt.show()

Figure: Histogram
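
Box plots are mentioned above but not shown, so here is a minimal sketch using kind='box'. The two columns of scores are made up, and the Agg backend is selected only so that the script also runs on machines without a display; in an interactive session you would call plt.show() as in the earlier examples.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import pandas as pd

# Made-up scores for two groups
df = pd.DataFrame({'group_a': [1, 2, 3, 4, 5, 6],
                   'group_b': [2, 4, 4, 5, 7, 9]})

# Draw one box per numeric column; plot() returns the Matplotlib Axes
ax = df.plot(kind='box', title='Box Plot')

print(ax.get_title())
```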

Conclusion

Overall, Pandas is a powerful and popular library for working with data in Python. It provides tools for handling and manipulating large and complex datasets, and is widely used in fields such as finance, economics, statistics, and data science. Pandas is built on top of NumPy, a library for working with numerical data in Python, and provides a high-level interface for working with structured data. It provides two main data structures: the Series and the DataFrame.

Pandas is particularly useful for cleaning, transforming, and manipulating data in preparation for analysis. It provides a wide range of functions and methods for filtering, grouping, and aggregating data and handling missing or incomplete data. It is also often used in conjunction with other libraries for data visualization and machine learning. Pandas can be used to import and export data from various sources, such as CSV files, Excel files, and SQL databases, using functions such as pd.read_csv(), pd.read_excel(), and pd.read_sql().

That covers the basics of pandas and should be enough to get you started. If you liked this post, follow me on Twitter @shehzensidiq and on Medium. If you have any questions, please leave a comment. Thanks, and have a nice day.
