This is a report directly from an Amazon Seller who primarily sells used books. The goal of this analysis is to gain insight into the behavior of the business and then create a web-based dashboard to display important KPIs.
import pandas as pd
import numpy as np
import plotly.express as px
import statsmodels.api as sm
import matplotlib.pyplot as plt
import re
We'll start with the listing data. First we want to get familiar with the data set and begin cleaning the set if needed.
lists = pd.read_excel("listingdata.xlsx")
pd.set_option('display.max_columns',None)
lists.head()
 | SKU | Source | Date Code | ASIN | Cost Per Item | List Price Per Item | Sales Rank | Quantity | Condition | Notes | Date Listed | Total Ship Cost | Ship Cost Per Item | Total Cost | Total List Price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 440-29-10-2021-722K-0.06 | Bulk | NaN | 877842892 | 0.06 | 20.00 | 722010.0 | 25 | UsedGood | Book has signs of use and shelf wear. Front an... | 2021-10-18 | 9.00 | 0.36 | 1.50 | 10200.00 |
1 | 6/11/19-233-21.88 | Online | NaN | 393614891 | 21.88 | 248.00 | 126131.0 | 4 | UsedLikeNew | Little to no signs of use. Front and back cove... | 2019-06-03 | 3.04 | 0.76 | 87.52 | 992.00 |
2 | 197-21-07-2021-1.5M-0.00 | Bulk | NaN | 875870708 | 0.00 | 506.38 | 1468886.0 | 1 | UsedAcceptable | Book may show heavy signs of wear along the fr... | 2021-07-21 | 0.33 | 0.33 | 0.00 | 506.38 |
3 | 65P7-JT1Z | Thrift | NaN | 133918920 | 0.00 | 400.00 | NaN | 1 | Used Very Good | Small amounts of shelf wear on the bottom of t... | 2018-07-09 | NaN | NaN | 0.00 | 400.00 |
4 | 399-17-02-2022-1.6M-0.06 | Bulk | NaN | 1319217788 | 0.06 | 363.56 | 1562458.0 | 1 | UsedGood | Book has signs of use and shelf wear. Front an... | 2021-12-16 | 0.33 | 0.33 | 0.06 | 363.56 |
print(lists.shape)
lists.isna().sum()
(8646, 15)
SKU                       0
Source                    0
Date Code              8646
ASIN                      0
Cost Per Item           183
List Price Per Item       0
Sales Rank              418
Quantity                  0
Condition                 0
Notes                     0
Date Listed               0
Total Ship Cost         495
Ship Cost Per Item      495
Total Cost              495
dtype: int64
So we clearly have some NA values. Here's the approach we'll use for cleaning this data:

  * Date Code is missing in all 8,646 rows, so the column carries no information and we'll drop it.
  * Cost Per Item, Total Ship Cost, and Ship Cost Per Item are only missing a small fraction of values, so we'll fill those blanks with the column mean.

That leaves the Sales Rank. Amazon sales rank is a metric that describes how well a product is selling at a snapshot in time. For example, the #1 best-selling book has a sales rank of 1, and the 100th best-selling book has a sales rank of 100 at the moment the data is pulled. Let's look into the sales rank a bit more before we choose how to handle it.
lists['Sales Rank'].describe()
salesbox = px.box(lists,x='Sales Rank')
salesbox.show()
Considering the max rank is ~22,000,000 while the 75th percentile is only ~940,000, the median is a more appropriate fill value than the mean. That 22 million heavily skews the mean, as the box plot shows, while the median is robust to such extreme outliers.
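As a quick sanity check before filling, we can compare the two candidate fill values and look at the skew (a minimal sketch against the same `lists` frame; the exact numbers depend on the pre-fill data):

```python
# Compare candidate fill values; the extreme ranks drag the mean far above
# the median, and a large positive skew confirms the long right tail
print("Mean:  ", lists['Sales Rank'].mean())
print("Median:", lists['Sales Rank'].median())
print("Skew:  ", lists['Sales Rank'].skew())
```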
lists['Sales Rank'] = lists['Sales Rank'].fillna(lists['Sales Rank'].median())
lists.isna().sum()
SKU                       0
Source                    0
Date Code              8646
ASIN                      0
Cost Per Item           183
List Price Per Item       0
Sales Rank                0
Quantity                  0
Condition                 0
Notes                     0
Date Listed               0
Total Ship Cost         495
Ship Cost Per Item      495
Total Cost                0
Total List Price          0
dtype: int64
lists = lists.drop(columns='Date Code')
print(lists.columns.values)
['SKU' 'Source' 'ASIN' 'Cost Per Item' 'List Price Per Item' 'Sales Rank' 'Quantity' 'Condition' 'Notes' 'Date Listed' 'Total Ship Cost' 'Ship Cost Per Item' 'Total Cost' 'Total List Price']
lists['Cost Per Item'] = lists['Cost Per Item'].fillna(lists['Cost Per Item'].mean())
lists.isna().sum()
SKU                     0
Source                  0
ASIN                    0
Cost Per Item           0
List Price Per Item     0
Sales Rank              0
Quantity                0
Condition               0
Notes                   0
Date Listed             0
Total Ship Cost       495
Ship Cost Per Item    495
Total Cost              0
Total List Price        0
dtype: int64
lists['Total Ship Cost'] = lists['Total Ship Cost'].fillna(lists['Total Ship Cost'].mean())
lists['Ship Cost Per Item'] = lists['Ship Cost Per Item'].fillna(lists['Ship Cost Per Item'].mean())
lists.isna().sum()
SKU                    0
Source                 0
ASIN                   0
Cost Per Item          0
List Price Per Item    0
Sales Rank             0
Quantity               0
Condition              0
Notes                  0
Date Listed            0
Total Ship Cost        0
Ship Cost Per Item     0
Total Cost             0
Total List Price       0
dtype: int64
We'll start with some basic analysis and exploration of the listing data.
lists.describe()
 | Cost Per Item | List Price Per Item | Sales Rank | Quantity | Total Ship Cost | Ship Cost Per Item | Total Cost | Total List Price
---|---|---|---|---|---|---|---|---
count | 8646.000000 | 8646.000000 | 8.646000e+03 | 8646.000000 | 8646.000000 | 8646.000000 | 8646.000000 | 8646.000000 |
mean | 2.354808 | 30.506833 | 6.676401e+05 | 1.015036 | 0.435269 | 0.427242 | 2.251007 | 32.325003 |
std | 4.922704 | 24.575828 | 9.735394e+05 | 0.309862 | 0.267764 | 0.224111 | 5.520884 | 112.943751 |
min | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 16.550000 | 9.530275e+04 | 1.000000 | 0.300000 | 0.300000 | 0.000000 | 16.552500 |
50% | 1.900000 | 28.000000 | 3.828970e+05 | 1.000000 | 0.360000 | 0.360000 | 1.400000 | 28.500000 |
75% | 2.990000 | 38.000000 | 9.055845e+05 | 1.000000 | 0.540000 | 0.520000 | 2.930000 | 38.000000 |
max | 79.360000 | 506.380000 | 2.280078e+07 | 25.000000 | 9.000000 | 1.150000 | 149.280000 | 10200.000000 |
# Total Listings by Condition
salescondition = lists.groupby('Condition').count().reset_index()
salesconditionfig = px.bar(salescondition,x='Condition',y='SKU')
salesconditionfig.show()
So there are some issues with the condition labels: the same condition appears under multiple spellings. To fix this, I'm going to remove the 'Used' prefix and restore proper spacing in 'Like New' and 'Very Good'.
# Normalize the condition labels: drop the 'Used' prefix (with or without a
# trailing space), then restore the missing spaces in the run-together labels
lists['Condition'] = lists['Condition'].str.replace('Used ', '')
lists['Condition'] = lists['Condition'].str.replace('Used', '')
lists['Condition'] = lists['Condition'].str.replace('LikeNew', 'Like New')
lists['Condition'] = lists['Condition'].str.replace('VeryGood', 'Very Good')
print(lists['Condition'].unique())
salescondition = lists.groupby('Condition').count().reset_index()
salesconditionfig = px.bar(salescondition,x='Condition',y='SKU',title='Total Listings by Condition',template='seaborn',labels={"SKU":"Listings"})
salesconditionfig.update_xaxes(categoryorder = 'array',categoryarray=['Acceptable','Good','Very Good','Like New','New'])
salesconditionfig.show()
print('Total Listings in "Good" Condition:')
print(lists['Condition'][lists['Condition'] == 'Good'].count())
print('or')
print(((lists['Condition'][lists['Condition'] == 'Good'].count())/lists['SKU'].count()*100),'%')
['Good' 'Like New' 'Acceptable' 'Very Good' 'New']
Total Listings in "Good" Condition:
5185
or
59.969928290538974 %
#Listings per week
listings = lists.copy()  # work on a copy so the original listing frame stays untouched
listings['Date Listed'] = pd.to_datetime(listings['Date Listed'])
weeklylistings = listings.groupby(listings['Date Listed'].dt.to_period('W')).count()
weeklylistings = weeklylistings.resample('W').asfreq().fillna(0)
weeklylistingsfig = px.line(weeklylistings,x=weeklylistings.index.to_timestamp(),y='SKU', title='Weekly Listings', labels={"SKU":'Total Listings'},template='seaborn')
weeklylistingsfig.update_layout(xaxis_title=None)
weeklylistingsfig.show()
#Listing count and average list price per year
yearlylistings = listings.groupby(listings['Date Listed'].dt.to_period('Y')).count()
avglistpriceyearly = listings.groupby(listings['Date Listed'].dt.to_period('Y')).mean(numeric_only=True)
print(yearlylistings['SKU'])
print(avglistpriceyearly['List Price Per Item'])
Date Listed
2018     962
2019    3090
2020    2498
2021    2096
Freq: A-DEC, Name: SKU, dtype: int64
Date Listed
2018    32.988285
2019    34.964858
2020    29.662062
2021    23.802533
Freq: A-DEC, Name: List Price Per Item, dtype: float64
Looks like listings became more inconsistent week to week, even though there were more units per listing. Meanwhile, both the yearly listing count and the average list price have fallen every year since 2019.
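To make the year-over-year trend explicit, here's a quick sketch reusing the yearly frames built above:

```python
# Year-over-year percent change in listing volume and average list price
print(yearlylistings['SKU'].pct_change() * 100)
print(avglistpriceyearly['List Price Per Item'].pct_change() * 100)
```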
#Sales rank histogram
srhist = px.histogram(lists,x='Sales Rank')
srhist.show()
Let's filter out any book with a sales rank over 3 million for this histogram.
I was getting errors when trying to filter the sales rank. Checking the dtype, it can be seen that Sales Rank is stored as an object. The error is shown below:
ValueError: invalid literal for int() with base 10: '100K'
So we have some sales ranks recorded as strings like '100K' and '1M'. We can convert these values to numbers.
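The conversion step didn't survive in the original notebook; here's a minimal sketch, assuming the only suffixes that appear are 'K' and 'M':

```python
# Convert suffixed rank strings ('100K' -> 100000, '1.5M' -> 1500000) to floats;
# values that are already numeric pass through unchanged
def rank_to_number(rank):
    if isinstance(rank, str):
        rank = rank.strip().upper()
        if rank.endswith('K'):
            return float(rank[:-1]) * 1_000
        if rank.endswith('M'):
            return float(rank[:-1]) * 1_000_000
    return float(rank)

lists['Sales Rank'] = lists['Sales Rank'].apply(rank_to_number)
```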
lowsr = lists[lists['Sales Rank']<=3000000]
lowsrhist = px.histogram(lowsr,x='Sales Rank',nbins=30,template='seaborn',title='Sales Rank Histogram',labels={'count':'Sales Rank'})
lowsrhist.show()
Taking note of what we know so far:

  * Roughly 60% of listings are in 'Good' condition.
  * Listing volume peaked in 2019 and has declined every year since.
  * The average list price has also fallen year over year since 2019.
  * Sales ranks are heavily right-skewed, with most listings ranked below ~1 million.
It will be interesting to look at the sales data. My hypothesis is that sales have been steadily declining since 2019: the falling listing volume and average list price year over year should translate into declining sales.

First, we want to get familiar with the dataset and see what we're working with.
sales = pd.read_excel("salesdata.xlsx")
pd.set_option('display.max_columns',None)
print(sales.shape)
sales.head()
(8132, 28)
 | date/time | settlement id | type | order id | description | quantity | marketplace | account type | fulfillment | order city | order state | order postal | tax collection model | product sales | product cost | product sales tax | shipping credits | shipping credits tax | gift wrap credits | giftwrap credits tax | promotional rebates | promotional rebates tax | Marketplace Withheld Tax | selling fees | fba fees | other tran fees | other | total
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Jun 2, 2021 9:35:37 AM PDT | 14323467191 | Order | 114-6318818-1993044 | Algebra 2, Student Edition (MERRILL ALGEBRA 2) | 1 | amazon.com | Standard Orders | Amazon | TUCSON | AZ | 85747-9073 | MarketplaceFacilitator | 19.1 | -18.66 | 1.66 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.66 | -4.67 | -6.58 | 0.0 | 0 | -141.43 |
1 | Jun 2, 2021 7:32:44 PM PDT | 14323467191 | Order | 113-4128200-8189858 | Algebra 2, Student Edition (MERRILL ALGEBRA 2) | 1 | amazon.com | Standard Orders | Amazon | PENDLETON | IN | 46064-9216 | MarketplaceFacilitator | 19.1 | -18.66 | 1.34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.34 | -4.67 | -6.58 | 0.0 | 0 | -141.43 |
2 | Jun 4, 2021 2:51:33 AM PDT | 14323467191 | Order | 111-1473468-9293007 | Algebra 2, Student Edition (MERRILL ALGEBRA 2) | 1 | amazon.com | Standard Orders | Amazon | LOS ANGELES | CA | 90006-2420 | MarketplaceFacilitator | 19.1 | -18.66 | 1.81 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.81 | -4.67 | -6.58 | 0.0 | 0 | -141.43 |
3 | Jun 4, 2021 1:23:40 PM PDT | 14323467191 | Order | 112-4297224-2108216 | Algebra 2, Student Edition (MERRILL ALGEBRA 2) | 1 | amazon.com | Standard Orders | Amazon | SIERRA MADRE | CA | 91024-1439 | MarketplaceFacilitator | 19.1 | -18.66 | 1.96 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.96 | -4.67 | -6.58 | 0.0 | 0 | -141.43 |
4 | Jun 6, 2021 7:55:05 AM PDT | 14323467191 | Order | 113-1136356-2776221 | Algebra 2, Student Edition (MERRILL ALGEBRA 2) | 1 | amazon.com | Standard Orders | Amazon | ALHAMBRA | CA | 91801-1746 | MarketplaceFacilitator | 19.1 | -18.66 | 1.96 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.96 | -4.67 | -6.58 | 0.0 | 0 | -141.43 |
sales.isna().sum()
date/time                   0
settlement id               0
type                        0
order id                    0
description                 0
quantity                    0
marketplace                 0
account type                0
fulfillment                 0
order city                  0
order state                 0
order postal                2
tax collection model        0
product sales               0
product cost                0
product sales tax           0
shipping credits            0
shipping credits tax        0
gift wrap credits           0
giftwrap credits tax        0
promotional rebates         0
promotional rebates tax     0
Marketplace Withheld Tax    0
selling fees                0
fba fees                    0
other tran fees             0
other                       0
total                       0
dtype: int64
sales.describe()
 | settlement id | quantity | product sales | product cost | product sales tax | shipping credits | shipping credits tax | gift wrap credits | giftwrap credits tax | promotional rebates | promotional rebates tax | Marketplace Withheld Tax | selling fees | fba fees | other tran fees | other | total
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 8.132000e+03 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.000000 | 8132.0 | 8132.000000 |
mean | 1.298230e+10 | 1.000123 | 18.787409 | -2.526578 | 0.790920 | 0.820627 | 0.017381 | 0.001227 | 0.000049 | -0.194108 | 0.008328 | -0.816677 | -4.635160 | -4.757322 | -0.023253 | 0.0 | 7.344343 |
std | 1.118912e+09 | 0.011089 | 18.313503 | 6.127053 | 1.219104 | 2.474425 | 0.105221 | 0.078251 | 0.004436 | 1.002575 | 0.178240 | 1.242031 | 2.766585 | 2.663814 | 0.149940 | 0.0 | 15.174279 |
min | 1.059662e+10 | 1.000000 | 1.080000 | -139.770000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -9.640000 | 0.000000 | -38.360000 | -76.050000 | -53.460000 | -0.990000 | 0.0 | -141.430000 |
25% | 1.207199e+10 | 1.000000 | 9.060000 | -2.930000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.060000 | -4.862500 | -5.260000 | 0.000000 | 0.0 | 1.017500 |
50% | 1.280267e+10 | 1.000000 | 13.255000 | -2.090000 | 0.630000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.640000 | -3.810000 | -4.760000 | 0.000000 | 0.0 | 3.490000 |
75% | 1.374972e+10 | 1.000000 | 20.310000 | 0.000000 | 1.010000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -3.180000 | -3.280000 | 0.000000 | 0.0 | 8.792500 |
max | 1.533458e+10 | 2.000000 | 494.990000 | 0.000000 | 38.360000 | 49.980000 | 2.830000 | 4.990000 | 0.400000 | 0.000000 | 11.300000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 413.990000 |
The dataset looks pretty good: no NaN values in important columns (only two missing postal codes), no implausible min or max values, and since all the counts match, there are no blanks either.

We can also see that the mean sale price is about $19, with an average product cost of about $2.50. I want to quickly rename the 'date/time' column to just 'date'.
sales = sales.rename(columns={'date/time':'date'})
print(sales.columns.values)
['date' 'settlement id' 'type' 'order id' 'description' 'quantity' 'marketplace' 'account type' 'fulfillment' 'order city' 'order state' 'order postal' 'tax collection model' 'product sales' 'product cost' 'product sales tax' 'shipping credits' 'shipping credits tax' 'gift wrap credits' 'giftwrap credits tax' 'promotional rebates' 'promotional rebates tax' 'Marketplace Withheld Tax' 'selling fees' 'fba fees' 'other tran fees' 'other' 'total']
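Those averages imply some rough per-order economics; here's a minimal sketch pulling them straight from the frame (costs and fees are negative by Amazon's settlement sign convention, so 'total' is the net per order):

```python
# Rough per-order economics: revenue, cost, fees, and net
unit = sales[['product sales', 'product cost', 'selling fees',
              'fba fees', 'total']].mean()
print(unit)
print(f"Average net per order: ${unit['total']:.2f}")
```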
salesgroup = sales.copy()  # work on a copy so the original sales frame stays untouched
salesgroup['date'] = pd.to_datetime(salesgroup['date'])
weeklysales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).sum(numeric_only=True)
weeklysales = weeklysales.resample('W').asfreq().fillna(0)
weeklyfig = px.bar(weeklysales,x=weeklysales.index.to_timestamp(),y='product sales', title='Weekly Sales', labels={"product sales":'Total Sales'},template='seaborn')
weeklyfig.update_layout(xaxis_title=None)
weeklyfig.show()
First few things to note:
Let's see if we can look at average sales price over time as well as the amount of products sold over time.
# Average sales over time
avgsales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).mean(numeric_only=True)
avgsales = avgsales.resample('W').asfreq().fillna(0)
meanfig = px.bar(avgsales,x=avgsales.index.to_timestamp(),y='product sales', title='Average Sales Price', labels={"product sales":'Average Sales Price'},template='seaborn')
meanfig.update_layout(xaxis_title=None)
meanfig.show()
# Amount of sales over time
cntsales = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).count()
cntsales = cntsales.resample('W').asfreq().fillna(0)
countfig = px.bar(cntsales,x=cntsales.index.to_timestamp(),y='order id', title='Items Sold', labels={"order id":'Items Sold'},template='seaborn')
countfig.update_layout(xaxis_title=None)
countfig.show()
# Average buy cost over time
avgsalescost = salesgroup.groupby(salesgroup['date'].dt.to_period('W')).mean(numeric_only=True)
avgsalescost = avgsalescost.resample('W').asfreq().fillna(0)
costfig = px.bar(avgsalescost,x=avgsalescost.index.to_timestamp(),y='product cost', title='Product Cost', labels={"product cost":'Product Cost'},template='seaborn')
costfig.update_layout(xaxis_title=None)
costfig.update_yaxes(autorange="reversed")
costfig.show()
These graphs help shine a lot of light on the situation. First, the January and August trends are highlighted again in the items sold chart. Second, the main cause of the total sales trend appears to be the average sale price: the average product cost decreased significantly along with the average sale price, so perhaps a low-cost, high-volume business model was attempted. Still, the strongest correlation with the decreasing total sales should be with the sales count. We can calculate that correlation.
corr_sales_count = cntsales['order id'].corr(weeklysales['product sales'])
print("Correlation:", corr_sales_count)
Correlation: 0.9045558679104653
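For comparison, the same calculation against the weekly average sale price (a sketch reusing the frames built above; if volume really is the main driver, the count correlation should be the stronger of the two):

```python
# Correlation between weekly average sale price and weekly total sales
corr_sales_price = avgsales['product sales'].corr(weeklysales['product sales'])
print("Correlation:", corr_sales_price)
```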
We know sales are decreasing, but it would be nice to see how much they've decreased year over year.

It should be noted that the dataset only includes the first half of January 2022, so we should expect a large apparent decrease from 2021 to 2022. The same caveat applies to 2018: only the second half of 2018 is covered, so 2019 will show an outsized increase.
yearlysales = salesgroup.groupby(salesgroup['date'].dt.to_period('Y')).sum(numeric_only=True)
yearlysales = yearlysales.resample('Y').asfreq().fillna(0)
yoychange = (yearlysales.pct_change()*100)
print("Total sales and profit by year")
print(yearlysales['product sales'])
print(yearlysales['total'])
print("")
print("Year over year gross sales change:")
print(yoychange['product sales'])
print("")
print('Year over year profit change:')
print(yoychange['total'])
Total sales and profit by year
date
2018     8230.41
2019    65885.05
2020    44213.13
2021    32535.58
2022     1915.04
Freq: A-DEC, Name: product sales, dtype: float64
date
2018     4472.58
2019    26967.19
2020    16974.41
2021    10577.49
2022      732.53
Freq: A-DEC, Name: total, dtype: float64

Year over year gross sales change:
date
2018           NaN
2019    700.507508
2020    -32.893532
2021    -26.411950
2022    -94.114013
Freq: A-DEC, Name: product sales, dtype: float64

Year over year profit change:
date
2018           NaN
2019    502.944833
2020    -37.055325
2021    -37.685669
2022    -93.074633
Freq: A-DEC, Name: total, dtype: float64
We can clearly see that 2019 was the best-performing year. Every full year since then has decreased by an average of ~30% in gross sales and ~37% in profit.
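For reference, here's how those averages fall out of the `yoychange` frame, excluding the partial 2018 and 2022 years (a small sketch):

```python
# Mean year-over-year change across the full years (2020 and 2021 only)
full_years = yoychange.loc['2020':'2021']
print(full_years['product sales'].mean())  # ~ -29.7 (% gross sales)
print(full_years['total'].mean())          # ~ -37.4 (% profit)
```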