Exploratory data analysis using Python Pandas and SQL

Pandas Python library is becoming more and more popular between data scientists and analysts. It allows you quickly to load, process, transform, analyze, and visualize the data.

When you work with Pandas, the most important thing to understand is that there are two main data structures – Series and DataFrame:

Series is a one-dimensional indexed array which can hold any data type (integer, float, etc).
DataFrame is a primary Pandas data structure. It is a two-dimensional data class (rows and columns), where every column may contain different data type. You may also pass an index and additional columns to a given DataFrame.

Below I will use some common commands for exploratory data analysis using Pandas and SQL for a public sample of random Reddit posts.

Importing the packages

We start from importing the necessary packages which will be used for our data analysis:

import numpy as np #linear algebra
import pandas as pd #data processing 
import seaborn as sns #statistical graph package 
import matplotlib.pyplot as plt #plot package for visualisations 
import pandasql as ps #sql package

You will need pandasql package only if you are going to run SQL. If you are working with visuals, you may have to choose the plot style:

#plt.style.use('bmh') #setting up 'bmh' as "Bayesian Methods for Hackers" style sheet 
plt.style.use('ggplot') #R ggplot style

To see all available plot styles, you can run:

print(plt.style.available)

Reading the data

As I mentioned above, I’ll be running the analysis for a publicly released dataset of random Reddit posts published on Kaggle. You can find it by clicking on this Kaggle link.

Reading CSV file in Pandas:

df = pd.read_csv('/kaggle/input/dataisbeautiful/r_dataisbeautiful_posts.csv') 
df.head(5)

We can see that there are 170,674 entries in the given dataset. Caveat, not all columns are completed, and there are some missing/null values. You can clean the data by returning only non-missing or NULL values by running df.notnull() or df.isnull() accordingly.

Running EDA

Exploratory data analysis is often a first brief look at your dataset which helps you understand its structure, form, size, and discover patterns. Below I’ll demonstrate a few common commands for EDA and will show a way how to run SQL statements in Pandas.

We can start with running basic DataFrame exploratory commands:

df.info()

df.describe() #or df.count()

Now we know that the DataFrame we’re working with contains 12 columns with boolean, float, integer, and Python object data classes. We also can see which columns have missing values and learn a basic numerical data distribution for score, num_comments, and total_awards_received columns.

Additionally, we can dig deeper into columns by running built-in statistical commands as mean(), sum(), max(), shape(), or dtypes(), and etc. These can be applied to the whole DataFrame and to each column separately as well:

df.score.describe()

df.describe() #or df.count()

df.score.median()

Running SQL in Pandas

One of the biggest advantages of the Pandas library is that it can work well with SQL and tabular data. One of the ways to run SQL statements is to import pandasql package and call the following commands:

q1 = """SELECT removed_by, count(distinct id)as number_of_removed_posts
FROM df 
where removed_by is not null 
group by removed_by 
order by 2 desc """

grouped_df = ps.sqldf(q1, locals())
grouped_df

Visualizing data using SQL output:

Let’s build a bar chart from the output from the code above to visualize our data:

removed_by = grouped_df['removed_by'].tolist()
number_of_removed_posts = grouped_df['number_of_removed_posts'].tolist()
plt.figure(figsize=(12,8))
plt.ylabel("Number of deleted reddits")
plt.bar(removed_by, number_of_removed_posts)
plt.show()

From the chart above we can see that from all deleted reddits, 68% are removed by moderator and 6% – by users.

There are many types of plots and analysis, from basic statistics to complex visuals and forecasts, that you can build and perform in Pandas. You can see a more complete analysis in Kaggle here.

Like what you see?

Connect

Olga Berezovsky is a Senior Data Analyst. She has extensive experience in the Big Data industry—specifically in data acquisition, transformation, and analysis—and deep expertise in building quantitative and qualitative user profile analysis that reveals user insights and behavior.

Exploratory data analysis using Python Pandas and SQL

Importing the packages

Reading the data

Running EDA

Running SQL in Pandas

Visualizing data using SQL output:

Like what you see?

Author

Contact Us

Services

Insights

Industries

Who We Are

Join Us