E-commerce Using Pandas


Pandas can be useful when you need to create a data frame and access to certain information from a dataset. Pandas can easily read csv and xlsx files. You can even create csv and xlsx files right on Python. Here I am using some transaction data from Amazon to show you how the package works. The dataset is presented below, and it is fictional. 
# Import the Pandas Package
import pandas as pd
# Read in the file as csv
ecom = pd.read_csv('Ecommerce Purchases')

# Find the columns and rows
ecom.info()

This tells us we have a total of 10,000 rows and 14 columns.  To note, in Python, index starts from 0.
# Find the mean, max, and min of purchase price
ecom['Purchase Price'].mean()
ecom['Purchase Price'].max()
ecom['Purchase Price'].min()
Mean: 50.35 Max: 99.99 Min: 0.0
# Find how many people used the English service
ecom[ecom['Language']=='en'].count()
Here we see there are 1098 English users on Amazon in this database. 
# Find out the number of AM and PM purchases 
ecom.groupby('AM or PM')['AM or PM'].count()

There are 4932 customers buying in the AM and 5068 customers buying in the PM. 

I here want to note the difference between count() and value_counts(). They both count up the total for the variable of interest. However, value_counts() sort the data from the highest to the lowest. Here is an example of sorting professions. 
# Using count(), which is not sorted 
ecom.groupby('Job')['Job'].count().head(5)
# Using value_counts(), which is sorted 
ecom['Job'].value_counts().head(5)

Now, based on the sorted 'Job', we found the top five professions above. 

If we want to do something fancy, we can use the apply(). For instance,  we want to find the top 5 email providers in our Amazon database, but before we do that, I want to dissect the solution. 
# First, we want to split the email address at @ by using x.split
ecom['Email'].apply(lambda x: x.split('@')).head(5)
# Only reading the second column to extract all the providers
ecom['Email'].apply(lambda x: x.split('@')[1]).head(5)
# As usual, we count by using value_counts()
ecom['Email'].apply(lambda x: x.split('@')[1]).value_counts().head(5)
Hotmail seems to be the most popular provider in our Amazon database. See how useful is Pandas!

1 comment: